Evaluating AI agents isn't a luxury: it's how you move from firefighting in production to improving with intent. How do you know whether your agent is actually getting better or you're just shuffling the noise around? Anthropic shares a practical, technical guide to designing evals that genuinely measure behavior without penalizing useful creativity.
What is an evaluation (eval) and why it matters
An eval is an automated test: you give the agent an input and apply grading logic to measure success. It sounds simple, but with agents that act over multiple turns, call tools, and change state, evaluation gets messy. What exactly does the eval measure: the response, the sequence of steps the agent took, or the final state in the database? The right answer is often all three.
Good evals make behavior changes visible before they affect users.
By defining key terms — task (test case), trial (attempt), grader (logic that scores), transcript (full record), outcome (final state), harness (infrastructure that runs the eval), and suite (collection of tasks) — you get the vocabulary to build a repeatable strategy.
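To make that vocabulary concrete, here is a minimal sketch (in Python) of how those terms might map onto data structures. The class names and fields are illustrative assumptions, not Anthropic's schema:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Task:
    """One test case: the input handed to the agent plus what success looks like."""
    task_id: str
    prompt: str
    expected_outcome: dict            # e.g. the final state we expect to find

@dataclass
class Trial:
    """One attempt at a task, carrying everything a grader needs."""
    task_id: str
    transcript: list[str]             # full record: messages, tool calls, results
    outcome: dict                     # final state once the agent finished
    score: float | None = None        # filled in by the grader(s)

@dataclass
class Suite:
    """A collection of tasks the harness runs together."""
    name: str
    tasks: list[Task] = field(default_factory=list)
```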
Types of graders and when to use them
Anthropic recommends combining three families of graders depending on what you want to measure: deterministic (code-based), model-based (LLM graders), and humans.
Code-based graders: fast, cheap, and reproducible. They’re ideal for objective checks: unit tests, static analysis, verifying tool calls and final state. Their weakness: they’re brittle to legitimate variation.
Model-based graders: use LLMs as judges with structured rubrics. They work well for evaluating language quality, empathy, or coverage in open-ended responses. They need human calibration and mechanisms to return "Unknown" when they can't judge.
Human graders: essential for calibration and resolving ambiguities. Use them strategically, not for everything.
For each task you can combine graders and use binary, weighted, or hybrid scoring.
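As a rough sketch of how that combination might look, the snippet below mixes a deterministic state check with a stubbed LLM rubric and weighted scoring; the weights, the judge stub, and the "Unknown" handling are all illustrative assumptions:

```python
from __future__ import annotations

def deterministic_grader(outcome: dict, expected: dict) -> float:
    """Code-based check: did the final state match expectations? Binary score."""
    return 1.0 if all(outcome.get(k) == v for k, v in expected.items()) else 0.0

def rubric_grader(transcript: str) -> float | None:
    """Model-based check: an LLM judge scores the transcript against a rubric.
    Returns None ("Unknown") when the judge cannot decide, so a human can review."""
    # Placeholder: a real implementation would call a judge model with a structured rubric.
    verdict = "good"
    return {"good": 1.0, "bad": 0.0, "unknown": None}[verdict]

def combined_score(outcome: dict, expected: dict, transcript: str,
                   weights: tuple[float, float] = (0.7, 0.3)) -> float | None:
    """Hybrid scoring: weighted mix of both graders; propagate Unknown instead of guessing."""
    det = deterministic_grader(outcome, expected)
    rub = rubric_grader(transcript)
    if rub is None:
        return None   # flag the trial for manual grading
    return weights[0] * det + weights[1] * rub
```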
Useful metrics to deal with non-determinism
Agents are stochastic. Two practical metrics:
pass@k: probability that at least one of k attempts is correct. Useful when one valid solution among many is enough.
pass^k: probability that the k attempts are all correct. Important when consistency matters.
Numerical example: if the per-trial success rate is 75% and you run 3 trials, pass^3 = 0.75^3 ≈ 0.42. That tells you that requiring 3 consistent successes is a much tougher threshold.
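Both metrics fall out of a few lines of Python if you assume independent trials with a fixed per-trial success rate p:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (consistency)."""
    return p ** k

p, k = 0.75, 3
print(f"pass@{k} = {pass_at_k(p, k):.2f}")    # ~0.98: one good answer out of three is easy
print(f"pass^{k} = {pass_hat_k(p, k):.2f}")   # ~0.42: three in a row is a much higher bar
```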
How to evaluate different types of agents (practical)
Coding agents: rely on deterministic tests. If the code passes the test suite without breaking other parts, it passes. Complement with static analysis (ruff, mypy, bandit) and LLM rubrics for code quality and style; a grader sketch appears at the end of this section.
Conversational agents: mix verifiable outcomes (state in ticketing), transcript metrics (n_turns) and LLM rubrics for tone and empathy. Simulate users with another LLM to cover adversarial conversations.
Research agents: combine groundedness checks (citations), coverage checks (key facts) and source-quality assessments. Continuously calibrate with human experts.
Computer-use agents: run them in sandboxes. Verify post-task artifacts: filesystem, DB, UI elements. Balance DOM inspection (fast but token-heavy) against screenshots (more token-efficient).
Stick to the essentials: in practice, unit tests plus an LLM rubric usually suffice; add more graders only when the case requires them.
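As a concrete illustration of the deterministic layer for a coding agent, the sketch below shells out to pytest and ruff; it assumes both tools are installed and that the candidate code lives in repo_dir:

```python
import subprocess

def grade_coding_agent(repo_dir: str) -> dict:
    """Deterministic grader: the change passes only if tests and static analysis both succeed."""
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True, text=True)
    lint = subprocess.run(["ruff", "check", "."], cwd=repo_dir, capture_output=True, text=True)
    return {
        "tests_passed": tests.returncode == 0,
        "lint_clean": lint.returncode == 0,
        "passed": tests.returncode == 0 and lint.returncode == 0,
        # Keep stdout/stderr with the transcript so failures stay debuggable later.
        "logs": tests.stdout + tests.stderr + lint.stdout + lint.stderr,
    }
```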
Practical roadmap: from zero to reliable evals
Start early: 20–50 real tasks are enough to begin. Don’t wait for hundreds.
Turn manual checks and real bugs into eval tasks. Prioritize by user impact.
Write unambiguous tasks and create reference solutions that pass all graders.
Balance the dataset: include positive and negative cases to avoid one-directional optimizations.
Build a stable eval harness: start each trial from a clean environment to avoid noise; a minimal harness sketch follows this list.
Design graders resilient to hacks and avoid tests that are too rigid about step sequences.
Read transcripts regularly: numbers aren’t valuable without context.
Maintain the suite: assign owners, welcome contributions, and treat evals like unit tests.
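Here is the minimal harness sketch referenced above: each trial runs in its own scratch directory so state cannot leak between attempts. run_agent and grade are placeholders for your own agent and graders, and the task shape is an assumption:

```python
import tempfile
from pathlib import Path

def run_suite(tasks: list[dict], run_agent, grade, trials_per_task: int = 3) -> list[dict]:
    """Run every task several times, each trial starting from a clean scratch directory."""
    results = []
    for task in tasks:
        for i in range(trials_per_task):
            # A fresh working directory per trial keeps state from leaking between attempts.
            with tempfile.TemporaryDirectory() as workdir:
                transcript, outcome = run_agent(task, Path(workdir))   # your agent goes here
                score = grade(task, transcript, outcome)               # your grader(s) go here
                results.append({"task": task["id"], "trial": i,
                                "score": score, "transcript": transcript})
    return results
```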
How evals fit with other practices
Automated evals are the first line in CI/CD; a minimal gate sketch appears after this list. Complement them with:
Production monitoring to catch drift.
A/B tests to validate significant changes with real traffic.
User feedback and manual transcript review for calibration.
Systematic human studies for subjective outputs.
The right combination depends on product stage and associated risk.
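As one possible way to make evals the first line in CI/CD, a simple pass-rate gate can block a merge when the suite regresses; the threshold and the shape of the results are assumptions, not a prescribed setup:

```python
import sys

def ci_gate(results: list[dict], threshold: float = 0.90) -> int:
    """Return a non-zero exit code when the suite's pass rate drops below the threshold."""
    passed = sum(1 for r in results if r.get("passed"))
    rate = passed / len(results) if results else 0.0
    print(f"eval pass rate: {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1

# Example wiring: run the suite, then let the exit code gate the merge.
# sys.exit(ci_gate(run_suite(...)))
```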
Risks and common pitfalls you should avoid
Eval setups that penalize useful creativity: grade outcomes, not always the exact path (sketched below).
Poorly specified tasks that cause 0% pass@100: rewrite the task or fix the grader.
Shared environments that introduce unwanted dependencies between trials.
LLM rubrics without calibration: sync them with humans and allow "Unknown" responses.
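For the first pitfall, a grader can assert the outcome rather than the exact tool-call sequence, as sketched below (assuming the final state is available as a dict after the trial):

```python
def grade_by_outcome(final_state: dict, expected: dict) -> bool:
    """Pass if the required end state was reached, regardless of the path taken."""
    return all(final_state.get(key) == value for key, value in expected.items())

# Brittle alternative to avoid: asserting transcript == ["search", "open_ticket", "reply"]
# fails agents that reach the same correct outcome through a different, equally valid sequence.
```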
Final reflection
Well-designed evals turn uncertainty into actionable metrics. They aren’t a cost: they’re leverage. If you treat them like tests and maintain them, they accelerate model adoption, reduce regressions, and let teams trust fast changes. Want to improve an agent? Define what "better" means in tests, automate those tests, and read the transcripts.