Evaluation
- DeepEval ⭐ 14k - a simple-to-use, open-source evaluation framework for LLM applications.
- Galileo AI: The Generative AI Evaluation Company
- Evaluating LLM Models: A Guide to Popular Frameworks
- alpaca_eval ⭐ 2.0k - An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
- Eureka: OSS Framework to evaluate LLMs - by Bugra Akyildiz
- The Needle In a Haystack Test. Evaluating the performance of RAG… | by Aparna Dhinakaran | Towards Data Science
- LLMTest_NeedleInAHaystack ⭐ 2.2k - Simple retrieval from LLMs at various context lengths to measure accuracy.
- The Needle In a Haystack Test: Evaluating the Performance of LLM RAG Systems - Arize AI
- Unlocking precision: The "Needle-in-a-Haystack" test for LLM evaluation
- The Needle in the Haystack Test and How Gemini Pro Solves It | Google Cloud Blog
- evaluation-guidebook ⭐ 2.1k - Practical insights and theoretical knowledge about LLM evaluation, gathered while managing the Open LLM Leaderboard and designing lighteval.
- Fiddler Auditor ⭐ 189 - a tool to evaluate the robustness of language models.
- ragas ⭐ 13k - Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines. Objective metrics, intelligent test generation, and data-driven insights for LLM apps.
- Evaluating RAG Applications with RAGAs | by Leonie Monigatti | Towards Data Science
- tvalmetrics ⭐ 325 - Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
- openai/evals ⭐ 18k - A framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- langchain-ai/openevals ⭐ 1.0k - Ready-made evaluators for your LLM apps.
- strands-agents/evals - A comprehensive evaluation framework for AI agents and LLM applications.
Key Aspects of LLM Evaluation
- Accuracy and Quality: Measures how well the model predicts or generates correct outputs for tasks like text generation, translation, or summarization.
- Bias and Fairness: Assesses the presence of biases related to gender, race, or other demographics to ensure ethical AI.
- Robustness: Tests the model’s resilience to noisy or adversarial inputs, ensuring consistent performance across diverse contexts.
- Generalization: Evaluates how well the model can adapt to new, unseen data outside its training set.
- Efficiency: Checks the computational cost, memory usage, and inference time, ensuring scalability.
- Interpretability: Examines whether the model’s decision-making process is understandable, allowing developers to trace and reason about its outputs.
- Alignment: Verifies whether the model’s outputs align with human values and expected behaviors, particularly important for applications like conversational AI.
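Several of these aspects, accuracy in particular, can be measured with simple deterministic scorers before reaching for heavier machinery. A minimal sketch of a normalized exact-match accuracy metric (the function names are illustrative, not from any specific framework):

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive string comparison."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) == norm(reference)

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    matches = [exact_match(p, r) for p, r in zip(predictions, references)]
    return sum(matches) / len(matches)

print(accuracy(["Paris", "berlin "], ["paris", "Berlin"]))  # 1.0
```

Exact match is only appropriate for short, closed-form answers; free-form generation usually needs fuzzier scorers such as LLM-as-a-judge, covered below.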
Benchmarks vs Evals vs Tests
- Benchmarks provide a consistent point of comparison. They are standardized datasets and tasks used to measure the general capabilities of models across the industry. For example, benchmarks like SQuAD or WMT test LLMs on tasks like question-answering or translation, giving a sense of overall performance. However, benchmarks are static and limited — they likely won’t capture the unique challenges or context your specific application faces.
- Evals focus on understanding how your LLM-powered components behave in your specific application environment. While benchmarks offer a general comparison, evals go deeper into the intricacies of the system’s performance on real-world tasks. For example, if your GenAI system is a chatbot, your evals might include how well it maintains context in long conversations, detects user emotions, or handles ambiguous queries. Evals allow you to probe deeper into the “why” behind system behaviors, offering insights into failure modes, edge cases, and emergent properties that are specific to the involvement of an LLM.
- Tests are all about validation. They ensure that the system or software behaves as expected, often using pass/fail criteria. Validation presupposes that you first understand the system's requirements in its production environment; in LLM-powered systems, it therefore makes sense to build tests on top of evals, turning that understanding into actionable checks.
In summary:
- Use benchmarks (with a grain of salt) to compare model capabilities directly
- Use evals to measure and understand the performance characteristics of your system
- Use tests to validate and action upon these learnings (e.g. “fail” if some set of metrics dip below acceptable thresholds)
LLM benchmarks, evals and tests. A mental model | by Thoughtworks | Medium
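The "tests on top of evals" idea can be sketched as a threshold gate: run your evals, then fail the build if any metric drops below an agreed floor. The metric names and values below are hypothetical (borrowed loosely from common RAG metrics), and the thresholds are placeholders you would tune per application:

```python
# Hypothetical eval results and per-metric floors -- replace with your own.
EVAL_RESULTS = {"faithfulness": 0.91, "answer_relevancy": 0.87, "context_precision": 0.78}
THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80, "context_precision": 0.75}

def check_thresholds(results: dict, thresholds: dict) -> dict:
    """Return the metrics that fall below their threshold (empty dict = pass)."""
    return {m: v for m, v in results.items() if v < thresholds[m]}

failures = check_thresholds(EVAL_RESULTS, THRESHOLDS)
assert not failures, f"Metrics below threshold: {failures}"
print("all eval metrics passed")
```

Dropped into a pytest suite or CI step, the assertion gives evals the pass/fail semantics of a conventional test.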
Evaluation Methods
Evaluation methods are the functions that score traces, observations, sessions, or dataset runs. You can use a variety of evaluation methods to add scores.
| Method | What | Use when |
|---|---|---|
| LLM-as-a-Judge | Use an LLM to evaluate outputs based on custom criteria | Subjective assessments at scale (tone, accuracy, helpfulness) |
| Scores via UI | Manually add scores to traces directly in the Langfuse UI | Quick quality spot checks, reviewing individual traces |
| Annotation Queues | Structured human review workflows with customizable queues | Building ground truth, systematic labeling, team collaboration |
| Scores via API/SDK | Programmatically add scores using the Langfuse API or SDK | Custom evaluation pipelines, deterministic checks, automated workflows |
Concepts - Langfuse
Evaluation of LLM Applications - Langfuse
Automated Evaluations of LLM Applications - Langfuse
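Of the methods above, LLM-as-a-judge is the least obvious to implement. A minimal sketch, independent of any particular framework: `call_llm` below is a placeholder for whatever client you use (it is an assumption, not a real library call), and the 1-5 helpfulness scale is illustrative.

```python
# Minimal LLM-as-a-judge sketch. `call_llm` is any str -> str callable backed
# by your judge model -- a placeholder, not a real SDK function.
JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer for \
helpfulness on a scale of 1 (useless) to 5 (excellent). Reply with the number only.

Question: {question}
Answer: {answer}"""

def judge_helpfulness(question: str, answer: str, call_llm) -> int:
    """Score an answer 1-5 using a judge model; validates the parsed score."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(raw.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {raw!r}")
    return score

# Stubbed judge for demonstration:
print(judge_helpfulness("What is 2 + 2?", "4", lambda prompt: "5"))  # prints 5
```

The resulting integer can then be attached to a trace via the scores API/SDK of your observability tool, closing the loop between judgment and monitoring.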