Benchmarking agents on open models and your tooling | Keryc
The question stops being just 'did the agent get it right?' and becomes 'how much work did it have to do to get it right?'. Hugging Face's new benchmark measures exactly that: not only the final answer, but the path, the tokens, the errors and the decisions an agent makes when using your library.
Why does this matter to you? A confusing API doesn't just frustrate human developers — it makes the agent spend more tokens, time and steps to reach the same answer. Measuring only the final result leaves you blind to those costs. Sounds important, right?
What they measured and why it matters
The authors ran a technical, reproducible experiment, using transformers as a case study for how agents work with a library. The idea is simple and powerful: when an agent codes for you, the API must be more than correct and fast; it must be discoverable, have clear examples, and be tested for agentic use.
Why? Because a cluttered or unclear API makes agents do extra work: more tokens, more time, more steps. If you only check correctness, you miss that hidden overhead.
Benchmark design: models, revisions and "tiers"
Each run varies along four axes: the model driving the agent, the transformers revision (tags or commits), the specific task, and the "tier" of help the agent receives. The tiers are:
bare: pip install transformers and nothing else.
clone: full repository checkout in the working directory.
skill: a packaged Skill with docs and task-specific examples.
Every combination (model × revision × task × tier) runs as a Hugging Face Job on identical hardware to keep comparisons fair and scalable.
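To get a feel for how quickly that grid grows, here is a minimal sketch that enumerates the combinations. RunConfig and build_sweep are hypothetical names invented for this illustration, not part of the harness, and the revision strings are placeholders.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """One cell of the sweep (hypothetical structure, not the harness's own)."""
    model: str
    revision: str
    task: str
    tier: str


TIERS = ("bare", "clone", "skill")  # tier names as listed in the article


def build_sweep(models, revisions, tasks, tiers=TIERS):
    """Return one RunConfig per (model, revision, task, tier) combination."""
    return [
        RunConfig(model, revision, task, tier)
        for model, revision, task, tier in product(models, revisions, tasks, tiers)
    ]


sweep = build_sweep(
    models=["Qwen3-4B", "Qwen3-14B"],
    revisions=["rev-a", "rev-b"],  # placeholders for tags or commits
    tasks=["classify-sentiment"],
)
print(len(sweep))  # 2 models * 2 revisions * 1 task * 3 tiers = 12
```

Each extra model or revision multiplies the number of runs, which helps explain why every combination runs as its own job.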
Metrics that really matter
A simple "match" is not enough. The harness checks several dimensions:
match %: whether the output contains the expected answer (substring, regex or exact depending on the task).
median time and median tokens: token counts are split into new, cached and generated.
runs with error %: includes silent failures (0 tokens, no tool calls).
marker adoption: tags that describe concrete agent behaviors.
Also, every run produces native agent traces stored on the Hub so you can inspect command-by-command what the agent did.
Having the trace is a luxury: you don't just see the result, you see the process. That lets you reproduce failures and understand bad decisions.
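To make the metric list above concrete, here is a minimal sketch of how such numbers could be computed from per-run records. The record fields (output, seconds, new_tokens, generated_tokens, tool_calls, error) and the sample values are made up for illustration; they are not the harness's real schema or real results.

```python
import re
from statistics import median


def is_match(output, expected, mode="substring"):
    """Check one run's final output against the expected answer."""
    if mode == "exact":
        return output.strip() == expected
    if mode == "regex":
        return re.search(expected, output) is not None
    if mode == "substring":
        return expected in output
    raise ValueError(f"unknown match mode: {mode}")


def is_silent_failure(run):
    """A run that generated 0 tokens and made no tool calls."""
    return run["generated_tokens"] == 0 and run["tool_calls"] == 0


def summarize(runs, expected, mode="substring"):
    """Aggregate hypothetical run records into headline metrics."""
    total = len(runs)
    matched = sum(is_match(r["output"], expected, mode) for r in runs)
    errored = sum(bool(r["error"]) or is_silent_failure(r) for r in runs)
    return {
        "match_pct": 100.0 * matched / total,
        "error_pct": 100.0 * errored / total,
        "median_seconds": median(r["seconds"] for r in runs),
        "median_new_tokens": median(r["new_tokens"] for r in runs),
    }


# Made-up records: two successful runs and one silent failure.
runs = [
    {"output": "POSITIVE (0.9999)", "seconds": 10.0, "new_tokens": 100,
     "generated_tokens": 50, "tool_calls": 1, "error": False},
    {"output": "The label is POSITIVE (0.9999).", "seconds": 30.0, "new_tokens": 300,
     "generated_tokens": 90, "tool_calls": 4, "error": False},
    {"output": "", "seconds": 1.0, "new_tokens": 0,
     "generated_tokens": 0, "tool_calls": 0, "error": False},
]
summary = summarize(runs, expected="POSITIVE")
print(summary)
```

Note that the silent failure counts as an error even though its error flag is unset, which is the case the 'runs with error %' metric is meant to catch.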
Key findings and concrete examples
The same task, very different paths
Two agents can reach the same 'POSITIVE (0.9999)' but with very different profiles. One writes and debugs a long Python script; another executes a single CLI call. Both succeed, but they differ drastically in tokens, latency and error probability.
Example comparison (sentiment):
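# long option: script piped to python
python - <<'PY'
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F
# One plausible completion of the preprocessing, inference and print steps,
# not the agent's exact script.
name = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
inputs = tokenizer('I absolutely loved the movie, it was fantastic!', return_tensors='pt')
with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)[0]
best = int(probs.argmax())
print(f'{model.config.id2label[best]} ({probs[best].item():.4f})')
PY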
# short option: CLI
transformers classify --model distilbert/distilbert-base-uncased-finetuned-sst-2-english --text 'I absolutely loved the movie, it was fantastic!'
CLI + Skill tradeoff: less time, more tokens in clone
When a CLI and examples are introduced, large models use the new interface and finish tasks faster (fewer turns). But in the clone tier, the checkout includes code and examples, and many agents read those files first: median new tokens rise (for example from ~4k to ~6.4k), even as time decreases. It's a tradeoff you should know about before merging.
Not all models benefit equally
Large models: tend to leverage the Skill and CLI, reducing time and turns.
Small/local models: can sometimes get worse. Small models may rely on memorized patterns (for example pipeline(...)) and get confused by a new surface. In extreme cases, a Skill can reduce match % or even break the solution.
Concrete reported cases:
Qwen3-4B: in clone, median new tokens rose from ~2.4k to ~23k because it read the new code in bulk, without improving accuracy.
Qwen3-14B: adding the Skill made match % on classify-sentiment drop from 100% to 0%. The model started emitting calls to a tool, transformers(command=...), that never existed: it had interpreted the Skill as an invocable API in the agent context.
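A failure like the Qwen3-14B one is easy to spot mechanically once you have the trace. The sketch below flags calls to tools the agent was never offered. The trace layout (a list of dicts with type, name and arguments keys) and the tool name bash are assumptions made for this example, not the native trace format.

```python
def undeclared_tool_calls(trace, declared_tools):
    """Return the names of tools the agent called but was never offered."""
    return sorted({
        step["name"]
        for step in trace
        if step.get("type") == "tool_call" and step["name"] not in declared_tools
    })


# Hypothetical trace: the agent invents a "transformers" tool.
trace = [
    {"type": "message", "text": "I will classify the sentence."},
    {"type": "tool_call", "name": "transformers", "arguments": {"command": "..."}},
    {"type": "tool_call", "name": "bash", "arguments": {"command": "ls"}},
]
print(undeclared_tool_calls(trace, declared_tools={"bash"}))  # ['transformers']
```

A non-empty result is a strong hint that the model read documentation as if it were a tool definition.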
Markers: tagging behaviors to understand the 'how'
Markers are patterns defined by the tool profile that are searched for in commands, generated code, files read, or in the final response. For transformers two stood out:
cli: the agent executed transformers in the shell.
pipeline: the agent used the high-level API pipeline(...).
With these markers you can measure whether a change to the library actually changed agent behavior, not just the final result.
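As a rough illustration of the marker idea, the sketch below searches commands and generated code for the two patterns. The regular expressions and the MARKERS table are assumptions made for this example; the real tool profile may define its markers differently.

```python
import re

# Hypothetical marker profile: marker name -> pattern to search for.
MARKERS = {
    # "transformers <subcommand>" at the start of a shell command or after ; & |
    "cli": re.compile(r"(?:^|[;&|])\s*transformers\s+\w+", re.MULTILINE),
    # a call to the high-level pipeline(...) API in generated code
    "pipeline": re.compile(r"\bpipeline\s*\("),
}


def detect_markers(texts):
    """Return the set of marker names found in any of the given texts."""
    found = set()
    for text in texts:
        for name, pattern in MARKERS.items():
            if pattern.search(text):
                found.add(name)
    return found


shell_command = "transformers classify --model m --text 'great'"
script = "from transformers import pipeline\nclf = pipeline('sentiment-analysis')"

print(detect_markers([shell_command]))  # {'cli'}
print(detect_markers([script]))         # {'pipeline'}
```

Comparing marker adoption before and after a library change shows whether agents actually switched interfaces, independently of whether the final answer matched.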
Practical recommendations for maintainers
Test for agents: add tests that reproduce agentic flows. If it's not tested, it doesn't exist for an agent.
Document with agents in mind: structure examples and docs so the agent can find them quickly (README, per-task examples, minimal snippets).
Evaluate across model sizes: a change can save work for large models and break small ones. Measure both cases.
Consider generating Skills automatically: Upskill is the idea of turning a strong-model solution into a Skill only if it helps weaker models.
Use traces: check the agent-traces viewer to understand why an agent failed or deviated.
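The 'only if it helps weaker models' idea can be sketched as a simple gate on match %. The decision rule here (no weak model regresses, at least one improves) is an assumption made for illustration, not Upskill's actual criterion, and keep_skill is a hypothetical helper.

```python
def keep_skill(match_without, match_with, weak_models, tolerance=0.0):
    """Decide whether to keep a Skill, based on match % per weak model.

    Returns (decision, deltas), where deltas maps each weak model to its
    change in match % after adding the Skill.
    """
    deltas = {m: match_with[m] - match_without[m] for m in weak_models}
    no_regression = all(delta >= -tolerance for delta in deltas.values())
    some_gain = any(delta > 0 for delta in deltas.values())
    return no_regression and some_gain, deltas


# The Qwen3-14B classify-sentiment case reported above: 100% -> 0%.
decision, deltas = keep_skill(
    match_without={"Qwen3-14B": 100.0},
    match_with={"Qwen3-14B": 0.0},
    weak_models=["Qwen3-14B"],
)
print(decision, deltas)  # False {'Qwen3-14B': -100.0}
```

A gate like this only looks at match %; in practice you would probably also want to compare tokens and time, since a Skill can keep accuracy flat while inflating cost.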
How to use the harness right now
The project is a CLI called agent-eval. Minimal flow:
Install and read the README and SECURITY.md.
Define deterministic tasks and the expected answers.
Launch the sweep (models × revisions) as HF Jobs; each run needs HF_TOKEN to serve models.
Publish the report as a Space and review traces on the Hub.
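Before launching anything, it can help to sanity-check your task definitions and environment. The sketch below is a standalone illustration: Task and preflight are hypothetical names, and this is not agent-eval's task format, so check the project README for the real one.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A deterministic task with a known answer (hypothetical structure)."""
    name: str
    prompt: str
    expected: str
    match_mode: str = "substring"


def preflight(tasks, env=None):
    """Return a list of problems to fix before launching a sweep."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    names = [task.name for task in tasks]
    if len(names) != len(set(names)):
        problems.append("task names are not unique")
    for task in tasks:
        if not task.expected:
            problems.append(f"task {task.name!r} has no expected answer")
    return problems


tasks = [
    Task(
        name="classify-sentiment",
        prompt="Classify: 'I absolutely loved the movie, it was fantastic!'",
        expected="POSITIVE",
    ),
]
print(preflight(tasks, env={}))  # ['HF_TOKEN is not set']
```

Passing env explicitly keeps the check testable without touching your real environment.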
Security warning: the harness runs agents that may execute code from the repo you point to. Only use this in trusted environments and review SECURITY.md before sharing results.
Final reflection
This work is not just a new metric; it's a wake-up call: if you want your library to work well with agents, design, document and test with agents in mind. Sometimes an improvement for big models is a trap for small ones. Measuring the process, not just the outcome, gives you actionable information so you can make informed decisions before merging important changes.