
Evaluation / Benchmarking

LLM Evaluation / Monitoring

Key Aspects of LLM Evaluation

  1. Accuracy and Quality: Measures how well the model predicts or generates correct outputs for tasks like text generation, translation, or summarization.
  2. Bias and Fairness: Assesses the presence of biases related to gender, race, or other demographics to ensure ethical AI.
  3. Robustness: Tests the model’s resilience to noisy or adversarial inputs, ensuring consistent performance across diverse contexts.
  4. Generalization: Evaluates how well the model can adapt to new, unseen data outside its training set.
  5. Efficiency: Checks the computational cost, memory usage, and inference time, ensuring scalability.
  6. Interpretability: Examines whether the model’s decision-making process is understandable, allowing developers to trace and reason about its outputs.
  7. Alignment: Verifies whether the model’s outputs align with human values and expected behaviors, particularly important for applications like conversational AI.

Metrics

  • Tokens Per Second (TPS)
  • Time to first token (TTFT)
  • GPU Usage

Time to first token (TTFT)

TTFT is the amount of time an LLM takes to generate the first token of its response after receiving an input or prompt. It is typically measured in seconds or milliseconds; a lower TTFT indicates a more responsive model.

Why Does TTFT Matter for Performance Benchmarking?

TTFT is a key metric for understanding a model’s responsiveness, especially when input complexity varies. It helps benchmark how efficiently the model handles different types of inputs.
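
As a rough illustration, the sketch below times the gap between sending a prompt and receiving the first streamed chunk. The `stream_tokens` callable is a hypothetical stand-in for whatever streaming client you actually use (an OpenAI-style SDK, vLLM, or a local server), not a specific library API.

```python
import time

def measure_ttft(stream_tokens, prompt):
    """Return the time-to-first-token (seconds) for one streaming request.

    `stream_tokens(prompt)` is assumed to return an iterator that yields
    tokens/chunks as soon as the model produces them.
    """
    start = time.perf_counter()
    for first_chunk in stream_tokens(prompt):
        # Stop as soon as the first chunk arrives; that gap is the TTFT.
        return time.perf_counter() - start, first_chunk
    return None, None  # the stream ended without producing any tokens
```

Repeating this measurement across prompts of different lengths and complexity gives the responsiveness profile described above.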

Tokens Per Second (TPS)

TPS refers to the number of tokens that an LLM can generate or process in one second. A higher TPS indicates faster model responses.

TPS is generally calculated using the formula:

TPS = (Input Tokens + Output Tokens) / Total Turnaround Time (TAT in seconds)

This value represents the average TPS, accounting for both the input and output tokens over the total time taken.

However, it’s also important to evaluate Output TPS, which specifically measures how many tokens the model generates per second, independent of the input tokens.

Output TPS can be calculated as:

Output TPS = Output Tokens / Time to Generate Output Tokens (TAT in seconds)

Output TPS is a more focused metric that excludes input token processing. It provides a clear measure of the model’s raw generation speed, offering insights into how efficiently the model produces output regardless of the complexity or size of the input.
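
As a worked sketch, the helper below applies both formulas to raw measurements from a single request. The names and example numbers are made up for illustration; `generation_time_s` is the portion of the turnaround time spent producing output tokens (roughly total TAT minus TTFT and prompt processing).

```python
def throughput_metrics(input_tokens, output_tokens, total_tat_s, generation_time_s):
    """Compute average TPS and output-only TPS for one request."""
    avg_tps = (input_tokens + output_tokens) / total_tat_s  # TPS over the whole turnaround
    output_tps = output_tokens / generation_time_s          # raw generation speed only
    return avg_tps, output_tps

# Example: 900 input + 300 output tokens, 6 s total turnaround, 4 s of generation
print(throughput_metrics(900, 300, 6.0, 4.0))  # -> (200.0, 75.0)
```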

GPU Usage

GPU usage benchmarking is critical for two main reasons: it helps you avoid performance bottlenecks in production and ensures you’re not overpaying for unused resources. Without proper GPU benchmarking, you might end up with a model that either crashes due to insufficient resources or wastes money on underutilized hardware.

When measuring GPU usage, we look at two key metrics:

Volatile GPU Utilization (0–100%):

This shows how hard your GPU is working during model inference. Think of it like your GPU’s ‘effort level’ — it sits at 0% when idle and ramps up as the model generates output. We need to know:

  • What percentage of GPU power the model typically uses
  • How utilization varies with different batch sizes and input lengths
  • Peak utilization during heavy loads
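
One way to capture these numbers, assuming the NVIDIA Management Library bindings are available (the `nvidia-ml-py` package, imported as `pynvml`), is to poll utilization in a loop while the benchmark workload runs in another process or thread:

```python
import time
import pynvml  # from the nvidia-ml-py package; an assumption, not the only option

def sample_gpu_utilization(duration_s=30.0, interval_s=0.5, gpu_index=0):
    """Poll volatile GPU utilization (0-100%) and report the average and peak."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        samples.append(util.gpu)  # percent of time the GPU was busy in the last interval
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return sum(samples) / len(samples), max(samples)
```

Running this once per batch size or input length gives the utilization profile described above; the same sampling can also be done from the CLI with `nvidia-smi --query-gpu=utilization.gpu --format=csv`.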

GPU Memory Usage:

This tells us how much VRAM your model needs. It’s crucial to measure:

  • Base memory required just to load the model (idle state)
  • Peak memory usage during generation
  • Memory patterns with different input sizes
  • How memory usage scales with batch processing
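
If the model runs under PyTorch, its allocator statistics give a quick way to separate base (load-time) memory from peak memory during generation. The `load_model` and `run_generation` callables below are placeholders for your own code, and these counters only see tensors allocated through PyTorch, so `nvidia-smi` will report somewhat higher totals (CUDA context, other libraries).

```python
import torch

def measure_vram(load_model, run_generation, device="cuda"):
    """Report base VRAM after model load and peak VRAM during generation, in MiB."""
    torch.cuda.reset_peak_memory_stats(device)

    model = load_model(device)                                  # idle state: weights only
    base_mib = torch.cuda.memory_allocated(device) / 2**20

    run_generation(model)                                       # KV cache, activations, etc.
    peak_mib = torch.cuda.max_memory_allocated(device) / 2**20
    return base_mib, peak_mib
```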

Understanding both metrics helps you right-size your infrastructure and avoid nasty surprises in production like out-of-memory errors or performance degradation.

Using these metrics (TTFT, TPS, and GPU usage) together helps us make a smart comparison between different GPUs and infrastructure options. This way, we can pick the setup that gives us the best performance for our specific use case without overspending.

Benchmarks

Decoding AI Benchmarks: The 7 Essential LLM Benchmarks You Need to Understand

Agentic Coding (SWE-bench)
  • Organizer / Source: SWE-bench team
  • Format / Dataset: 2,294 real GitHub issues & PRs from 12 Python repos
  • Task Type / Content: Bug fixes, feature implementation, multi-file codebase edits
  • Evaluation / Context: Pass/fail via test suites; requires codebase understanding, multi-file coordination, long context
  • Reference / Link: swebench.com

Agentic Terminal Coding
  • Organizer / Source: Terminal Bench team
  • Format / Dataset: Suite of tasks: English descriptions, Docker envs, test scripts (T-Bench-Core-v0)
  • Task Type / Content: Compiling code, server management, software install, system-level reasoning in a sandboxed terminal
  • Evaluation / Context: Pass/fail via test scripts; one attempt per task; real terminal, no external tool access
  • Reference / Link: tbench.ai

Graduate-level Reasoning (GPQA)
  • Organizer / Source: GPQA research team
  • Format / Dataset: 448 (main), 546 (extended), 198 (diamond) MCQs in biology, chemistry, physics
  • Task Type / Content: Graduate-level, Google-proof, expert-created MCQs
  • Evaluation / Context: Accuracy vs. PhD experts (65–74%) & non-experts (34%); hardest: "Diamond" subset
  • Reference / Link: arxiv.org/abs/2311.12022

Agentic Tool Use (TAU-bench)
  • Organizer / Source: TAU-bench research team
  • Format / Dataset: Simulated users, realistic databases, policy docs, API tools (retail/airline)
  • Task Type / Content: Customer support, workflow automation, multi-step reasoning, tool/API use, policy following
  • Evaluation / Context: Correct end-state in the database; must follow domain-specific rules; multi-turn, multi-tool interaction
  • Reference / Link: arxiv.org/pdf/2406.12045

MMMLU (Multilingual MMLU)
  • Organizer / Source: MMMLU research team
  • Format / Dataset: 57 subjects, 14 languages, elementary to advanced professional level
  • Task Type / Content: Multiple-choice, zero/few-shot, broad subject coverage (STEM, humanities, law, ethics, etc.)
  • Evaluation / Context: Accuracy across subjects/languages; identifies model blind spots
  • Reference / Link: mmmu-benchmark.github.io

Visual Reasoning (MMMU)
  • Organizer / Source: MMMU research team
  • Format / Dataset: 11,500 college-level problems, 6 disciplines, 30 subjects
  • Task Type / Content: Visual & textual reasoning: charts, diagrams, tables, maps, music sheets, chemical structures
  • Evaluation / Context: Perception, knowledge, and reasoning in a multimodal (text+image) context; expert-level subject knowledge
  • Reference / Link: mmmu-benchmark.github.io

High School Math (AIME 2025)
  • Organizer / Source: Mathematical Association of America (MAA)
  • Format / Dataset: 15 questions, 3 hours; 3-digit integer answers; no calculators
  • Task Type / Content: Algebra, geometry, combinatorics, number theory, probability; creative problem-solving
  • Evaluation / Context: One attempt per student per year (AIME I or II); high school competition level
  • Reference / Link: vals.ai/benchmarks/aime-2025-03-13