Benchmarking / Monitoring
GitHub - langfuse/langfuse: 🪢 Open source LLM engineering platform: LLM observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. (YC W23)
- event is the basic building block. An event tracks a discrete, point-in-time occurrence in a trace.
- span represents the duration of a unit of work in a trace.
- generation logs calls to AI models, incl. prompts, token usage, and costs.
- agent decides on the application flow and can, for example, use tools with the guidance of an LLM.
- tool represents a tool call, for example to a weather API.
- chain is a link between different application steps, like passing context from a retriever to an LLM call.
- retriever represents data retrieval steps, such as a call to a vector store or a database.
- evaluator represents functions that assess the relevance, correctness, or helpfulness of an LLM's outputs.
- embedding represents calls to an embedding model.
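The observation types above nest into a tree rooted in a trace: an agent span can contain tool, retriever, and generation observations, while events mark points in time. A minimal sketch of that hierarchy (an illustrative data model under assumed names, not the actual Langfuse SDK):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Observation:
    """One node in a trace tree; type is e.g. 'event', 'span', 'generation'."""
    name: str
    type: str
    start: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    end: Optional[datetime] = None       # events are point-in-time, so no end
    metadata: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def child(self, name: str, type: str, **metadata) -> "Observation":
        # Nest a new observation under this one and return it.
        obs = Observation(name=name, type=type, metadata=metadata)
        self.children.append(obs)
        return obs

# A trace rooted in a span: an agent uses a tool and a retriever,
# then calls a model (a generation), with one discrete event logged.
trace = Observation(name="answer-question", type="span")
agent = trace.child("weather-agent", "agent")
agent.child("weather-api", "tool", city="Berlin")
agent.child("vector-store-lookup", "retriever", top_k=3)
agent.child("model-call", "generation",
            model="gpt-4o", prompt_tokens=812, completion_tokens=64)
trace.child("cache-miss", "event")

def count(obs: Observation, t: str) -> int:
    # Count observations of a given type in the subtree.
    return (obs.type == t) + sum(count(c, t) for c in obs.children)

print(count(trace, "generation"))  # 1
print(count(trace, "tool"))        # 1
```

Aggregating over such a tree is how per-trace metrics like total token usage or tool-call counts are derived.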