
Evaluation

Key Aspects of LLM Evaluation

  1. Accuracy and Quality: Measures how well the model predicts or generates correct outputs for tasks like text generation, translation, or summarization.
  2. Bias and Fairness: Assesses the presence of biases related to gender, race, or other demographics to ensure ethical AI.
  3. Robustness: Tests the model’s resilience to noisy or adversarial inputs, ensuring consistent performance across diverse contexts.
  4. Generalization: Evaluates how well the model can adapt to new, unseen data outside its training set.
  5. Efficiency: Checks the computational cost, memory usage, and inference time, ensuring scalability.
  6. Interpretability: Examines whether the model’s decision-making process is understandable, allowing developers to trace and reason about its outputs.
  7. Alignment: Verifies whether the model’s outputs align with human values and expected behaviors, particularly important for applications like conversational AI.
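Several of these aspects can be quantified with simple automated checks. As a minimal sketch (the metrics and data below are illustrative, not from the source), exact-match accuracy probes aspect 1 and wall-clock latency probes aspect 5:

```python
import time

def exact_match_accuracy(predictions, references):
    """Fraction of outputs that exactly match the reference (aspect 1: accuracy)."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

def timed(fn, *args):
    """Return (result, seconds elapsed) -- a crude efficiency probe (aspect 5)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Illustrative data; a real harness would pull these from your eval dataset.
preds = ["Paris", "Berlin", "Madrid"]
refs = ["Paris", "Berlin", "Rome"]
print(exact_match_accuracy(preds, refs))
```

Richer aspects such as bias, robustness, or alignment usually need curated test sets or model-based grading rather than string comparison.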

Benchmarks vs Evals vs Tests

  • Benchmarks provide a consistent point of comparison. They are standardized datasets and tasks used to measure the general capabilities of models across the industry. For example, benchmarks like SQuAD or WMT test LLMs on tasks like question-answering or translation, giving a sense of overall performance. However, benchmarks are static and limited — they likely won’t capture the unique challenges or context your specific application faces.
  • Evals focus on understanding how your LLM-powered components behave in your specific application environment. While benchmarks offer a general comparison, evals go deeper into the intricacies of the system’s performance on real-world tasks. For example, if your GenAI system is a chatbot, your evals might include how well it maintains context in long conversations, detects user emotions, or handles ambiguous queries. Evals allow you to probe deeper into the “why” behind system behaviors, offering insights into failure modes, edge cases, and emergent properties that are specific to the involvement of an LLM.
  • Tests are all about validation: they ensure that the system or software behaves as expected, typically using pass/fail criteria. Validation presupposes that you understand the requirements of the system in its production environment. In LLM-powered systems, it therefore makes sense to build tests on top of evals, turning what the evals taught you into enforceable checks.

In summary:

  • Use benchmarks (with a grain of salt) to compare model capabilities directly
  • Use evals to measure and understand the performance characteristics of your system
  • Use tests to validate and act upon these learnings (e.g. “fail” if some set of metrics dips below acceptable thresholds)
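The third bullet can be made concrete with a threshold gate. A minimal sketch, assuming an upstream eval run already produced a metrics dict (the metric names and thresholds here are illustrative):

```python
# Hypothetical metrics produced by an upstream eval run.
eval_metrics = {
    "context_retention": 0.91,
    "answer_accuracy": 0.84,
    "refusal_rate": 0.02,
}

# Acceptance thresholds derived from what the evals taught you.
THRESHOLDS = {
    "context_retention": 0.85,  # must stay at or above
    "answer_accuracy": 0.80,    # must stay at or above
    "refusal_rate": 0.05,       # must stay at or below
}

def check_metrics(metrics, thresholds):
    """Return the list of failing metric names (empty list means 'pass')."""
    failures = []
    for name, limit in thresholds.items():
        if name == "refusal_rate":
            if metrics[name] > limit:  # rate metric: higher is worse
                failures.append(name)
        elif metrics[name] < limit:    # quality metric: lower is worse
            failures.append(name)
    return failures

# In CI this would be an assertion that fails the build on regression.
assert check_metrics(eval_metrics, THRESHOLDS) == [], "eval regression detected"
```

This is the benchmarks-evals-tests layering in miniature: the eval produces the numbers, and the test decides whether they are acceptable.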

Source: “LLM benchmarks, evals and tests. A mental model” by Thoughtworks (Medium)

Evaluation Methods

Evaluation methods are the functions that score traces, observations, sessions, or dataset runs. Several methods are available for adding these scores:

| Method | What | Use when |
| --- | --- | --- |
| LLM-as-a-Judge | Use an LLM to evaluate outputs based on custom criteria | Subjective assessments at scale (tone, accuracy, helpfulness) |
| Scores via UI | Manually add scores to traces directly in the Langfuse UI | Quick quality spot checks, reviewing individual traces |
| Annotation Queues | Structured human review workflows with customizable queues | Building ground truth, systematic labeling, team collaboration |
| Scores via API/SDK | Programmatically add scores using the Langfuse API or SDK | Custom evaluation pipelines, deterministic checks, automated workflows |
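To illustrate the first row, here is a minimal LLM-as-a-judge sketch. `call_llm` is a hypothetical stand-in for your model client, and the rubric prompt and parsing are illustrative, not Langfuse's implementation:

```python
JUDGE_PROMPT = """You are grading a chatbot answer for helpfulness.
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (useless) to 5 (excellent)."""

def judge_helpfulness(question, answer, call_llm):
    """Score an answer with an LLM judge; call_llm(prompt) -> str is injected."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(raw.strip())      # naive parse; production code should validate
    return max(1, min(5, score))  # clamp to the rubric's 1-5 range

# A stubbed judge lets you test the plumbing without a real model call.
fake_llm = lambda prompt: "4"
print(judge_helpfulness("What is RAG?", "Retrieval-augmented generation...", fake_llm))
```

Injecting the model client keeps the scoring logic testable; the resulting score would then be attached to the trace via whichever of the methods above fits your workflow.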
Sources: “Concepts”, “Evaluation of LLM Applications”, and “Automated Evaluations of LLM Applications” (Langfuse documentation)