---
title: "How to Evaluate Agent Systems: A Practical Framework"
description: "A practical guide to evaluating LLM agent systems using unit tests, simulated environments, LLM-as-judge, and human review. Covers Inspect, promptfoo, Braintrust, and LangSmith."
tldr: "Evaluating agent systems requires layered methods: deterministic unit tests catch regressions fast, simulated environments test multi-step behavior, LLM-as-judge scales qualitative review, and human-in-the-loop catches what automation misses. No single method is sufficient."
url: "https://aigentic.blog/evaluating-agent-systems"
publishedAt: "2026-05-17T13:00:15.410Z"
updatedAt: "2026-05-17T13:00:15.410Z"
category: "pillar"
tags: ["deep-dive","agent-evaluation","llm-testing","observability","agent-frameworks"]
---

# How to Evaluate Agent Systems: A Practical Framework

> Evaluating agent systems requires layered methods: deterministic unit tests catch regressions fast, simulated environments test multi-step behavior, LLM-as-judge scales qualitative review, and human-in-the-loop catches what automation misses. No single method is sufficient.

Agent evaluation is one of the least standardized problems in applied AI. Unlike a classification model with a fixed test set, an agent produces variable-length action sequences, calls external tools, and may reach the correct final answer via a path that is nonetheless brittle or unsafe. The evaluation stack has to cover at least four distinct concerns: functional correctness (did the agent do the right thing?), behavioral reliability (does it do the right thing consistently?), safety (does it avoid harmful actions?), and cost efficiency (does it reach the goal without unnecessary steps?). This guide walks through each major evaluation method, names the frameworks that implement them, and is explicit about where each one breaks down.

## Why Agent Evaluation Differs from Model Evaluation

Standard LLM evaluation benchmarks like [MMLU](https://arxiv.org/abs/2009.03300) or [HumanEval](https://arxiv.org/abs/2107.03374) test a single forward pass: one prompt in, one completion out, one score. Agent evaluation has to deal with a fundamentally different object: a trajectory. A trajectory is the sequence of observations, decisions, tool calls, and intermediate outputs that an agent produces before delivering a final result.

This matters for three practical reasons.

First, the same correct final answer can be reached via a good path or a bad one. An agent that browses five unnecessary web pages before finding the answer is functionally correct but operationally expensive. An agent that calls a write API before a read API may corrupt state even if it eventually produces the right answer.

Second, agents are non-deterministic. Temperature, tool response latency, and context window state all introduce variance. A single-run evaluation is not reliable; you need enough samples to separate signal from noise.
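
One way to make "enough samples" concrete is to report a confidence interval around the pass rate rather than a single number. The sketch below uses the Wilson score interval; the run counts are hypothetical and the function name is ours, not part of any framework.

```python
import math

def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if runs == 0:
        raise ValueError("need at least one run")
    p = successes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical results: the same 80% observed pass rate from 5 runs and from 50 runs.
low_5, high_5 = wilson_interval(4, 5)
low_50, high_50 = wilson_interval(40, 50)
print(f"5 runs:  {low_5:.2f} to {high_5:.2f}")
print(f"50 runs: {low_50:.2f} to {high_50:.2f}")
```

With these numbers, five runs leave an interval of roughly 0.38 to 0.96, which is too wide to distinguish a good agent from a mediocre one. Fifty runs narrow it to roughly 0.67 to 0.89.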

Third, agents have side effects. A model that produces a wrong answer is annoying. An agent that sends an email, submits a form, or modifies a database with a wrong answer has caused real harm. Evaluation frameworks that do not sandbox side effects are not safe to use for agent testing.

These three properties force a layered evaluation strategy. No single method handles all of them.

## Layer 1: Unit Tests for Deterministic Components

The cheapest and fastest evaluation layer is unit testing individual, deterministic components: tool call formatting, routing logic, prompt construction, and input/output parsing. These tests do not involve a live model; they mock the LLM and assert that the surrounding scaffolding behaves correctly.
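
As a rough illustration of mocking the LLM, the sketch below tests a hypothetical `route` function that asks a model which tool to use. The function, the prompt format, and the tool names are all assumptions made for this example; the point is that the model is replaced by a canned reply so the test exercises only the scaffolding.

```python
import json

def build_prompt(user_message: str, tools: list[str]) -> str:
    """Hypothetical prompt construction for a routing step."""
    return f"Tools: {', '.join(tools)}\nUser: {user_message}\nRespond with JSON."

def route(llm, user_message: str, tools: list[str]) -> str:
    """Ask the model which tool to use; fall back to 'none' on unusable output."""
    raw = llm(build_prompt(user_message, tools))
    try:
        choice = json.loads(raw).get("tool")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON at all, or JSON that is not an object.
        return "none"
    return choice if choice in tools else "none"

def fake_llm(reply: str):
    """Mocked model: ignores the prompt and returns a fixed reply."""
    return lambda prompt: reply

TOOLS = ["search", "calculator"]
assert route(fake_llm('{"tool": "search"}'), "find the docs", TOOLS) == "search"
assert route(fake_llm("I think you should search"), "find the docs", TOOLS) == "none"
assert route(fake_llm('{"tool": "delete_database"}'), "find the docs", TOOLS) == "none"
```

The second and third assertions matter most: they pin down what the scaffolding does when the model returns prose or names a tool that does not exist.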

[promptfoo](https://github.com/promptfoo/promptfoo) is well-suited to this layer. It uses a YAML-based configuration to define prompt variants, expected output patterns, and assertion rules. A promptfoo test can assert that a given prompt template produces output matching a regex, contains a required JSON key, or does not include a forbidden string. Because it can run against live model completions as well as mocked ones, it sits at the boundary between classical unit tests and statistical evals.

A minimal promptfoo configuration looks like this:

```yaml
prompts:
  - "Summarize the following in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "The Eiffel Tower was built in 1889."
    assert:
      - type: icontains
        value: "1889"
      - type: javascript
        value: "output.split(' ').length < 30"
```

This is fast to run, easy to diff in CI, and catches prompt regressions immediately. The weakness is that it tests one turn at a time. An agent that passes every single-turn test can still fail catastrophically when those turns are chained together.

For tool call validation specifically, unit tests should verify that the agent's output schema matches the tool's input schema before any live tool execution. This prevents the most common class of agent failure: malformed tool calls that produce runtime errors mid-trajectory.
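
A minimal sketch of that check, using only the standard library, might look like the following. The schema format (argument name mapped to a Python type) and the `search` tool are assumptions for illustration; a real project would more likely validate against the JSON Schema it already publishes for each tool.

```python
import json

# Hypothetical tool schema: argument name -> required Python type.
SEARCH_TOOL_SCHEMA = {"query": str, "max_results": int}

def validate_tool_call(raw: str, schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    problems = []
    for name, expected in schema.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif type(args[name]) is not expected:
            # Exact type match, so JSON true/false is not accepted as an int.
            problems.append(f"{name} should be {expected.__name__}")
    for name in args:
        if name not in schema:
            problems.append(f"unexpected argument: {name}")
    return problems

print(validate_tool_call('{"query": "eiffel tower", "max_results": 3}', SEARCH_TOOL_SCHEMA))
print(validate_tool_call('{"query": "eiffel tower", "max_results": "3"}', SEARCH_TOOL_SCHEMA))
```

Running this validator over recorded model outputs before any tool executes turns a mid-trajectory runtime error into an ordinary failing test.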

### What unit tests cannot do

Unit tests do not cover emergent multi-step behavior. They cannot detect that an agent correctly formats a search query but then fails to incorporate the search result into its next step. They also cannot evaluate anything that requires a real environment response, such as whether the agent correctly interprets a 404 from an API.

## Layer 2: Simulated Environments and Trajectory Evaluation

The next layer up introduces a simulated or sandboxed environment that the agent operates in across multiple steps. The evaluation records the full trajectory and scores it at both the step level and the final-state level.

[Inspect](https://github.com/UKGovernmentBEIS/inspect_ai), developed by the UK AI Security Institute (formerly the AI Safety Institute), is one of the most rigorous open-source frameworks for this kind of evaluation. Inspect defines tasks as Python objects with a dataset, a solver pipeline, and a scorer. It supports multi-turn agent loops natively, logs every tool call and observation, and produces reproducible evaluation records that can be compared across model versions.

A key concept in Inspect is the scorer, which operates on the final task state. Scorers can be deterministic (did the agent produce the correct file? did it land on the correct URL?) or model-graded (does the output satisfy the rubric?). Inspect separates these cleanly, which matters because mixing them in a single metric obscures whether failures come from the agent's reasoning or from an imprecise rubric.

[LangSmith](https://docs.smith.langchain.com/), developed by LangChain, approaches trajectory evaluation from an observability angle. Every agent run is traced, and each trace includes the full chain of LLM calls, tool invocations, and their inputs and outputs. LangSmith then lets teams attach evaluators to those traces, either pre-built (exact match, embedding similarity) or custom Python functions. The tight integration with [LangGraph](https://github.com/langchain-ai/langgraph) agents means traces are automatically structured around graph nodes, which makes it easier to pinpoint which step in a multi-step plan failed.

### Sandboxing side effects

Any simulated environment test must sandbox external side effects. Inspect does this via Docker-based task environments for code execution. LangSmith relies on teams to mock or stub external APIs themselves. promptfoo has no built-in sandboxing for agent use cases. The gap here is real: most frameworks leave the sandboxing problem to the user, which means teams often skip it and test against live systems, producing non-reproducible evaluations.

The practical minimum for a sandboxed agent eval:

- Mock all write APIs; return realistic but inert responses
- Use a separate database or namespace for reads that the test populates with known data
- Log every outbound network call and fail the test if an unexpected one occurs
- Reset environment state between test runs

This is more infrastructure than most teams want to build, but skipping it means your evaluations do not reflect what happens in production.
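
To give a sense of scale, the four requirements above can be sketched in a few dozen lines. Everything here is hypothetical (the class, the `send_email` and `http_get` tools, the host names); it is a shape to adapt, not the API of any framework named in this guide.

```python
from typing import Optional

class SandboxViolation(Exception):
    """Raised when the agent touches something the test did not allow."""

class SandboxedEnvironment:
    def __init__(self, seed_records: dict[str, str], allowed_hosts: set[str]):
        self._seed = dict(seed_records)
        self.allowed_hosts = allowed_hosts
        self.reset()

    def reset(self) -> None:
        """Restore the known test data and clear all logs between runs."""
        self.records = dict(self._seed)
        self.call_log: list[tuple[str, str]] = []
        self.captured_writes: list[dict] = []

    def read(self, key: str) -> Optional[str]:
        """Reads hit a test-owned store populated with known data."""
        self.call_log.append(("read", key))
        return self.records.get(key)

    def send_email(self, to: str, body: str) -> dict:
        """Mocked write API: records the intent, returns an inert but realistic response."""
        self.call_log.append(("send_email", to))
        self.captured_writes.append({"to": to, "body": body})
        return {"status": "queued", "id": f"mock-{len(self.captured_writes)}"}

    def http_get(self, host: str, path: str) -> str:
        """Every outbound call is logged; calls to unexpected hosts fail the test."""
        self.call_log.append(("http_get", host))
        if host not in self.allowed_hosts:
            raise SandboxViolation(f"unexpected outbound call to {host}")
        return f"stubbed response for {host}{path}"

env = SandboxedEnvironment({"order-17": "shipped"}, allowed_hosts={"api.example.test"})
receipt = env.send_email("user@example.test", "Your order has shipped.")
print(receipt["status"], env.call_log)
```

The agent under test receives `env.read`, `env.send_email`, and `env.http_get` in place of its real tools, and the test harness calls `env.reset()` before each run.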

## Layer 3: LLM-as-Judge for Qualitative Scoring

Many agent outputs cannot be evaluated deterministically. "Did the agent's response follow the user's tone?" or "Was the agent's plan reasonable given the constraints?" are questions that do not have a binary correct answer. LLM-as-judge addresses this by using a separate, usually stronger, language model to score outputs against a rubric.

The technique was formalized in the [MT-Bench paper](https://arxiv.org/abs/2306.05685) (Zheng et al., 2023), which reported that GPT-4 judgments agreed with human preferences more than 80% of the time on open-ended chat tasks, roughly the level of agreement between human raters. The method has since been widely adopted, but its limitations are worth stating precisely.

**Known failure modes of LLM-as-judge:**

| Failure mode | Description | Mitigation |
|---|---|---|
| Verbosity bias | Judges prefer longer, more detailed outputs regardless of accuracy | Add explicit length-normalization instructions to the rubric |
| Self-preference | A model used as judge tends to score outputs from the same model family higher | Use a judge from a different model family than the agent |
| Rubric drift | Vague rubric criteria produce inconsistent scores across runs | Use structured, binary sub-questions instead of holistic scores |
| Factual blindness | The judge cannot detect factual errors it was not prompted to check | Pair LLM-as-judge with retrieval-grounded fact checking |
| Calibration gap | Uncalibrated judge scores may not track human scores | Collect human labels on a calibration set and measure agreement |
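
The "binary sub-questions" mitigation for rubric drift can be sketched as follows. The rubric questions and the `stub_judge` function are placeholders we made up; in practice `judge` would wrap a call to a judge model, ideally from a different model family than the agent.

```python
from typing import Callable

# Hypothetical rubric: each item is a yes/no question the judge answers independently.
RUBRIC = [
    "Does the response answer the question that was asked?",
    "Is every factual claim supported by the tool outputs shown?",
    "Does the response stay within the requested length?",
]

def score_with_rubric(
    output: str,
    judge: Callable[[str, str], bool],
    rubric: list[str] = RUBRIC,
) -> dict:
    """Ask one binary question at a time; report per-question verdicts and the pass fraction."""
    verdicts = {question: judge(question, output) for question in rubric}
    return {"verdicts": verdicts, "score": sum(verdicts.values()) / len(rubric)}

def stub_judge(question: str, output: str) -> bool:
    """Stand-in for a real judge-model call; it only enforces a 20-word length limit."""
    if "length" in question:
        return len(output.split()) <= 20
    return True

result = score_with_rubric("The Eiffel Tower was completed in 1889.", stub_judge)
print(result["score"], result["verdicts"])
```

Keeping the per-question verdicts alongside the aggregate score makes it possible to see which criterion moved when a score changes between runs.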

[Braintrust](https://www.braintrustdata.com/) is a managed platform that implements LLM-as-judge at scale. It allows teams to define custom scoring functions (called "scorers") in Python, run them over logged traces, and track score distributions over time. Braintrust's built-in scorers include factuality, answer relevance, and summarization quality, each implemented as a structured LLM prompt with a numeric output. The platform also stores every prompt-response pair, so you can audit why a particular score was assigned.

[promptfoo](https://github.com/promptfoo/promptfoo) supports LLM-as-judge via its `llm-rubric` assertion type, which sends the agent's output and a rubric string to a configurable judge model and returns a pass/fail. This is lightweight and integrates directly into CI pipelines, but it does not provide the historical tracking or score distribution analytics that Braintrust does.

### Calibration is not optional

The most common mistake teams make with LLM-as-judge is treating judge scores as ground truth without calibration. Before deploying a rubric, collect human ratings on at least 100 to 200 examples that span your expected output distribution. Then measure agreement between the judge and the human raters: Cohen's kappa for categorical labels such as pass/fail, or Spearman correlation for ordinal or numeric scores. As a rule of thumb, if agreement is below 0.7, the rubric needs revision before it can be trusted for regression testing.

This calibration step is also what allows you to set a meaningful score threshold. Saying "the agent must score above 0.8 on the helpfulness rubric" is only meaningful if you know that a 0.8 judge score corresponds to something your human raters would actually approve.
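
Both agreement statistics are short enough to compute without extra dependencies. The sketch below is a plain-Python version under simple assumptions (paired lists of equal length, at least some variation in each list); the sample ratings are invented for illustration.

```python
from collections import Counter

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5  # undefined if either list is constant

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(xs), average_ranks(ys))

def cohens_kappa(a, b):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / n**2
    return (observed - expected) / (1 - expected)

# Invented calibration data: human 1-5 ratings vs. judge scores, then pass/fail labels.
human_scores = [1, 2, 2, 4, 5, 3]
judge_scores = [0.1, 0.4, 0.3, 0.7, 0.9, 0.8]
print(f"Spearman: {spearman(human_scores, judge_scores):.2f}")
print(f"Kappa:    {cohens_kappa([1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 1, 0]):.2f}")
```

On a real calibration set, these two numbers are what you would compare against the agreement threshold before trusting the rubric.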

## Layer 4: Human-in-the-Loop Review

Human review is the highest-fidelity evaluation method and the least scalable. It belongs in two specific places in the evaluation stack: release gate reviews for high-stakes deployments, and rubric development for LLM-as-judge calibration.

The practical structure for human review in agent evaluation differs from model evaluation. Reviewers should evaluate the full trajectory, not just the final output. A final answer that looks correct may have been reached via an unsafe intermediate step, and a reviewer who only sees the answer will miss it. This means your evaluation tooling must surface the full action sequence to reviewers, not just the terminal state.

LangSmith's annotation queues support this workflow. Reviewers can step through a trace, annotate individual tool calls, and attach a structured rating to any node in the trajectory. Braintrust has a similar human review interface. Neither is as purpose-built for this as a dedicated annotation tool like [Scale AI's Evaluation platform](https://scale.com/evaluation), but both are adequate for small-to-medium review volumes.

**When to require human review before deployment:**

- The agent has write access to external systems (email, databases, APIs with side effects)
- The domain requires professional judgment (legal, medical, financial)
- The agent will interact with users who cannot easily detect errors
- A new capability (tool, retrieval source, or system prompt change) has been added since the last human review cycle

For teams running continuous deployment, it is unrealistic to require human review of every release. A pragmatic approach is to run automated evaluation on every commit and reserve human review for commits that change the system prompt, add or remove tools, or trigger a statistically significant regression in automated scores.
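
That gating rule can be written down as a small function. The sketch below assumes a hypothetical repository layout (`prompts/system` and `tools/` paths) and uses a one-sided two-proportion z-test as one reasonable definition of "statistically significant regression"; other tests would work as well.

```python
import math

# Hypothetical path prefixes that always trigger human review.
SENSITIVE_PREFIXES = ("prompts/system", "tools/")

def regression_is_significant(base_pass: int, base_n: int,
                              new_pass: int, new_n: int,
                              z_crit: float = 1.645) -> bool:
    """One-sided two-proportion z-test: is the new pass rate lower than baseline?"""
    p_base, p_new = base_pass / base_n, new_pass / new_n
    pooled = (base_pass + new_pass) / (base_n + new_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / new_n))
    if se == 0:
        return False  # both runs all-pass or all-fail: no measurable change
    return (p_base - p_new) / se > z_crit

def needs_human_review(changed_paths: list[str],
                       baseline: tuple[int, int],
                       candidate: tuple[int, int]) -> bool:
    """Gate on sensitive file changes or a significant drop in automated scores."""
    touches_sensitive = any(path.startswith(SENSITIVE_PREFIXES) for path in changed_paths)
    return touches_sensitive or regression_is_significant(*baseline, *candidate)

# (passes, total) pairs from automated evals; the numbers are illustrative.
print(needs_human_review(["docs/readme.md"], (180, 200), (176, 200)))
print(needs_human_review(["tools/search.py"], (180, 200), (180, 200)))
print(needs_human_review(["docs/readme.md"], (180, 200), (150, 200)))
```

A drop from 180 to 176 passes out of 200 is within noise under this test, so a documentation-only commit with that result would not be escalated. A tool change or a drop to 150 passes would be.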

## Framework Comparison

The four frameworks named in this guide cover different parts of the evaluation stack. The table below summarizes their positioning across the dimensions that matter most for agent evaluation.

| Framework | Primary use case | Multi-turn support | Sandboxing | LLM-as-judge | Managed platform | Open source |
|---|---|---|---|---|---|---|
| [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) | Research-grade benchmarks | Native, via solver pipeline | Docker-based environments | Via model scorers | No | Yes |
| [promptfoo](https://github.com/promptfoo/promptfoo) | CI regression testing | Partial (sequential turns) | None built-in | Via `llm-rubric` assertion | Optional cloud | Yes |
| [Braintrust](https://www.braintrustdata.com/) | Managed eval and tracing | Via trace logging | None built-in | Built-in scorers | Yes | Partial |
| [LangSmith](https://docs.smith.langchain.com/) | Observability and tracing | Native for LangGraph agents | None built-in | Custom evaluators | Yes | No |

A few observations follow from this comparison. Inspect is the only one of the four with built-in environment sandboxing, which makes it the most appropriate for evaluating agents that have code execution or file system access. promptfoo is the most CI-native; its YAML configuration diffs cleanly in pull requests and its CLI runs without a server. Braintrust and LangSmith are closest to production observability tools and are most useful for teams who want to monitor agent behavior after deployment, not just before.

No single framework covers the full evaluation stack. The common pattern in teams with mature evaluation practices is Inspect or a custom harness for benchmark-style testing, promptfoo for CI regression checks, and LangSmith or Braintrust for production tracing and periodic human review.

## Metrics that Actually Matter

The choice of metric is as important as the choice of framework. Most agent evaluation failures are metric failures: teams measure what is easy to measure rather than what predicts real-world performance.

**Metrics worth tracking, and what each captures:**

| Metric | What it measures | Weakness |
|---|---|---|
| Task success rate | Fraction of tasks where agent reaches correct final state | Does not penalize unsafe or inefficient paths |
| Step efficiency | Ratio of minimum required steps to actual steps taken | Requires knowing the optimal path, which is often undefined |
| Tool call accuracy | Fraction of tool calls with correct parameters | Does not capture whether the right tool was called |
| Trajectory faithfulness | Whether intermediate steps support the final answer | Expensive to compute; usually requires LLM-as-judge |
| Error recovery rate | Fraction of tasks where agent correctly handles a tool failure | Requires injecting failures into the environment |
| Latency per task | Wall-clock time to completion | Easy to measure but often in tension with quality |
| Token cost per task | Total tokens consumed, including tool responses | Correlates with latency but more directly tied to operating cost |
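
Several of these metrics fall out of the same per-run record. The sketch below assumes a hypothetical `TaskRun` structure in which the minimum step count is known for each task, which, as the table notes, is often not the case in practice.

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    succeeded: bool
    steps_taken: int
    min_steps: int         # assumed known for this task
    tool_calls: int
    valid_tool_calls: int  # calls whose parameters passed validation
    tokens: int

def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate per-run records into the headline metrics."""
    n = len(runs)
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "task_success_rate": sum(r.succeeded for r in runs) / n,
        "step_efficiency": sum(r.min_steps / r.steps_taken for r in runs) / n,
        "tool_call_accuracy": sum(r.valid_tool_calls for r in runs) / total_calls,
        "tokens_per_task": sum(r.tokens for r in runs) / n,
    }

# Illustrative records, not real measurements.
runs = [
    TaskRun(True, 4, 4, 3, 3, 1200),
    TaskRun(True, 8, 4, 6, 5, 3100),
    TaskRun(False, 10, 5, 7, 4, 4000),
]
for name, value in summarize(runs).items():
    print(f"{name}: {value:.2f}")
```

Reporting the metrics together matters: the second run above counts as a success but took twice the minimum number of steps, which the success rate alone would hide.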

Task success rate is the most commonly reported metric and the most gameable. An agent can achieve high task success on a benchmark by memorizing patterns from training data without learning to generalize. When reporting task success rate, always specify whether the evaluation set is held out from any fine-tuning data and whether the tasks were seen during any few-shot prompting.

Error recovery rate is underused and highly diagnostic. An agent that can only succeed when all tools work correctly is brittle in production. Injecting realistic tool failures (timeouts, malformed responses, 429s) and measuring how often the agent recovers correctly gives a much sharper picture of production readiness than a clean-environment benchmark.
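
A minimal sketch of failure injection is shown below. The `lookup_order` tool and the retrying `toy_agent` are stand-ins invented for this example; in a real harness the wrapper would sit around the actual tools and the agent would be the system under test.

```python
class ToolTimeout(Exception):
    """Simulated transient tool failure."""

def fail_first(tool, failures: int):
    """Wrap `tool` so that its first `failures` calls raise ToolTimeout."""
    remaining = [failures]
    def wrapped(*args, **kwargs):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ToolTimeout("injected timeout")
        return tool(*args, **kwargs)
    return wrapped

def lookup_order(order_id: str) -> dict:
    """Hypothetical tool that always succeeds when not sabotaged."""
    return {"order_id": order_id, "status": "shipped"}

def toy_agent(tool, order_id: str, max_retries: int = 2) -> str:
    """Stand-in for agent recovery logic: retry a few times, then give up."""
    for _ in range(max_retries + 1):
        try:
            return tool(order_id)["status"]
        except ToolTimeout:
            continue
    return "gave_up"

# Every task sees at least one injected failure; some see more than the agent can absorb.
failure_counts = [1, 1, 2, 3, 1, 4]
results = [toy_agent(fail_first(lookup_order, k), "order-17") for k in failure_counts]
error_recovery_rate = results.count("shipped") / len(results)
print(results, f"{error_recovery_rate:.2f}")
```

Because every task here includes at least one failure, the fraction of successful tasks is the error recovery rate directly. With randomly injected failures you would compute it only over the tasks where a failure actually fired.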

## The Limits of Automated Evaluation

Automated evaluation, at every layer, has systematic blind spots. Naming them explicitly prevents overconfidence.

**Distributional shift.** Evaluation sets become stale. The tasks your team wrote six months ago may not reflect what real users are asking today. Without periodic refresh of evaluation sets from production logs, scores can look stable while real-world performance degrades.

**Reward hacking.** Agents fine-tuned against an automated evaluator can learn to satisfy the evaluator without satisfying the underlying goal. This is the agent-evaluation equivalent of Goodhart's Law. The mitigation is to keep some evaluation criteria entirely out of the training loop and use them only as held-out metrics.

**Compound error masking.** In a multi-step trajectory, two errors can cancel each other out, producing a correct final answer. A scoring function that only looks at the final answer will miss this. Trajectory-level scoring is the correct approach, but it is expensive and not uniformly implemented across frameworks.
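
A toy example makes the masking concrete. The two conversion tools and their reference implementations below are hypothetical; the trajectory contains two wrong steps whose errors cancel, so a final-answer check passes while a step-level check does not.

```python
# Reference behavior for each hypothetical tool.
REFERENCE = {
    "hours_to_minutes": lambda hours: hours * 60,
    "minutes_to_seconds": lambda minutes: minutes * 60,
}

# Task: convert 2 hours to seconds (expected 7200).
trajectory = [
    {"tool": "hours_to_minutes", "input": 2, "output": 60},       # wrong: should be 120
    {"tool": "minutes_to_seconds", "input": 60, "output": 7200},  # wrong: should be 3600
]

def final_answer_correct(steps: list[dict], expected: int) -> bool:
    """Final-state scoring: only the last output is inspected."""
    return steps[-1]["output"] == expected

def faulty_steps(steps: list[dict]) -> list[int]:
    """Step-level scoring: indices where the output disagrees with the reference."""
    return [
        index for index, step in enumerate(steps)
        if REFERENCE[step["tool"]](step["input"]) != step["output"]
    ]

print(final_answer_correct(trajectory, 7200))  # the final answer alone looks fine
print(faulty_steps(trajectory))                # both steps are individually wrong
```

Real trajectories rarely have an exact reference for every step, which is part of why trajectory-level scoring usually falls back on LLM-as-judge or human review.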

**Coverage gaps.** Most evaluation sets are under-representative of rare but high-consequence inputs: adversarial prompts, ambiguous user instructions, or edge cases in tool responses. Without deliberate adversarial case generation, coverage gaps go undetected until they become production incidents.

The most honest position is that automated evaluation tells you whether the agent has regressed on the scenarios you thought to test. It does not tell you whether the agent is safe or reliable across the full distribution of real-world inputs. Human review and production monitoring are not optional supplements to automated evaluation; they are what gives automated evaluation its validity.

## What to Watch in the Next Quarter

**1. Standardization of trajectory logging formats.** There is currently no common schema for agent trajectory logs. Inspect, LangSmith, and Braintrust each use their own format. A shared trace schema (analogous to the OpenTelemetry standard for distributed systems) would allow evaluation tools to be decoupled from agent frameworks. Watch for movement on this from the [OpenTelemetry GenAI working group](https://github.com/open-telemetry/oteps/blob/main/text/0266-genai-semantic-conventions.md) and from framework maintainers responding to interoperability pressure.

**2. Evaluation of agent-to-agent communication.** As multi-agent architectures become more common (orchestrator agents delegating to specialist subagents), evaluation frameworks need to score not just individual agent behavior but the correctness of handoffs between agents. Current frameworks are not built for this. Expect purpose-built tooling to emerge from teams running large-scale multi-agent deployments.

**3. Red-teaming as a first-class evaluation step.** The EU AI Act, along with voluntary guidance such as the NIST AI RMF, is pushing teams to document adversarial evaluation alongside standard performance evaluation. Inspect's task-based architecture is well-positioned to host red-team scenarios as standard tasks. Watch for red-team task libraries to emerge as shared resources, similar to how benchmark datasets are shared today.

**4. Cost-quality Pareto tracking.** As model providers add cheaper, faster options (smaller distilled models, cached responses, speculative decoding), teams are increasingly making routing decisions based on cost-quality tradeoffs at the task level. Expect evaluation platforms to add native Pareto-front visualizations that show the tradeoff curve across model configurations, not just point scores.

**5. Formal verification for narrow agent behaviors.** For agents operating in constrained, high-stakes domains (financial reporting, medical coding), probabilistic evaluation is insufficient to meet compliance requirements. Formal methods applied to the finite-state portions of agent behavior (routing logic, tool call schemas) are a research area to watch, with potential to produce provable guarantees on specific behavioral invariants.

## Further Reading

- [Inspect AI framework repository](https://github.com/UKGovernmentBEIS/inspect_ai): The UK AI Security Institute's open-source agent evaluation framework, with full documentation on task definitions, solvers, and scorers.
- [Judging LLM-as-a-Judge with MT-Bench (arXiv:2306.05685)](https://arxiv.org/abs/2306.05685): The original paper formalizing the LLM-as-judge methodology and quantifying its agreement with human raters.
- [promptfoo GitHub repository](https://github.com/promptfoo/promptfoo): Open-source CLI and library for prompt and agent regression testing, with examples for LLM-as-judge and multi-turn evaluation.
- [LangSmith documentation](https://docs.smith.langchain.com/): Official docs for LangChain's tracing and evaluation platform, including guides on custom evaluators and annotation queues.
- [NIST AI Risk Management Framework](https://www.nist.gov/system/files/documents/2023/01/26/AI%20RMF%201.0.pdf): NIST's voluntary framework for managing AI risk, relevant context for teams designing evaluation processes that need to meet regulatory or audit requirements.

## Frequently Asked Questions

### What is the best framework for evaluating LLM agent systems?

There is no single best framework. Inspect suits research-grade, reproducible benchmarks. promptfoo is strong for CI-integrated, config-driven regression testing. Braintrust fits teams wanting a managed platform with scoring and tracing. LangSmith integrates tightly with LangChain-based agents. Most production teams combine two or more approaches.

### How does LLM-as-judge work for agent evaluation, and what are its weaknesses?

LLM-as-judge uses a separate language model to score agent outputs against a rubric. It scales qualitative review without human labor, but it inherits the biases of the judge model, tends to prefer verbose outputs, and cannot reliably detect factual errors it was not prompted to check. Calibration against human labels is strongly recommended before relying on it.

### Can unit tests cover multi-step agent behavior?

Unit tests cover individual tool calls, prompt formatting, and deterministic routing well. They struggle with emergent, multi-step behavior because each test isolates a single step. Simulated environment tests or trajectory-level evaluation are needed to assess whether an agent reaches its goal across a full task sequence.

### What is trajectory evaluation in agent testing?

Trajectory evaluation scores the sequence of actions an agent takes to reach a final state, not just the final output. It can catch inefficient paths, unnecessary tool calls, or unsafe intermediate steps that a final-answer check would miss. Tools like Inspect and LangSmith support trajectory-level logging and scoring.

### When should human review be included in agent evaluation?

Human review is most valuable for tasks where correctness is subjective, stakes are high, or the agent operates in a domain where automated judges are unreliable (legal, medical, code with side effects). It is also the right calibration source for LLM-as-judge rubrics. Full human review does not scale to regression testing, so it is best reserved for release gates and rubric development.