olmo-eval: tool to evaluate LLMs in development

Jun 12, 2026 · Keryc Díaz · 3 minutes

While you build an LLM, you evaluate it again and again: you change data, architecture, or hyperparameters and run through the same loop. How do you know if an improvement in a small experiment holds up in the final training? olmo-eval is built to automate and organize exactly that part of the workflow.

What olmo-eval is and why it matters

olmo-eval is an evaluation workbench for the model development cycle that extends the idea of OLMES. OLMES brought order to how benchmarks are reported; olmo-eval makes those measurements useful during active development. It’s not just about getting a final score: it’s about repeating, comparing, and understanding changes between checkpoints with statistical rigor.

Sound familiar? “Is that +2.4 percentage points real or just noise?” olmo-eval helps you answer that by showing standard errors and the minimum detectable effect, and by comparing answers question by question between checkpoints.
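To make the two quantities concrete, here is a minimal sketch that computes them for an accuracy metric using only the Python standard library. It is not olmo-eval's implementation: the function names are hypothetical, and the formulas assume a binomial accuracy, a normal approximation, and two independent runs on the same number of questions.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return sqrt(accuracy * (1.0 - accuracy) / n_questions)


def minimum_detectable_effect(
    accuracy: float, n_questions: int, alpha: float = 0.05, power: float = 0.80
) -> float:
    """Smallest accuracy gap between two independent runs that a two-sided
    test at level alpha detects with the given power (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference between two independent accuracies.
    se_of_difference = sqrt(2.0 * accuracy * (1.0 - accuracy) / n_questions)
    return (z_alpha + z_power) * se_of_difference


if __name__ == "__main__":
    # Illustrative numbers: 70% accuracy on a 500-question benchmark.
    print(f"standard error: {standard_error(0.70, 500):.4f}")
    print(f"minimum detectable effect: {minimum_detectable_effect(0.70, 500):.4f}")
```

Under these assumptions, a 500-question benchmark at 70% accuracy has a minimum detectable effect of roughly 8 percentage points, so a +2.4 point gain on it could not be separated from noise by the averages alone. Pairing answers question by question usually tightens this bound, because the independent-runs formula ignores the correlation between checkpoints.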

How does the evaluation routine change?

Modularity: it separates the benchmark definition from the runtime. That means you can change how you run a task without rewriting the task itself.
Execution flexibility: by default it uses the lightweight path (running the model directly). If a benchmark requires isolation (for example, executing code), it uses containers or sandboxes.
Reuse: tools, harnesses, and auxiliary models are interchangeable components. You can plug in a grader LLM without touching the original task.
Fine-grained analysis: beyond averages, it aligns the same questions between two checkpoints to detect small but consistent improvements or regressions.
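The question-by-question alignment described in the last point can be sketched in a few lines. This is an illustration of the idea, not olmo-eval's code: the function name, the dictionary layout (question id mapped to whether the answer was correct), and the sample data are all assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def paired_comparison(old: dict[str, bool], new: dict[str, bool]) -> dict[str, float]:
    """Compare two checkpoints on the questions they both answered."""
    # Only questions present in both runs can be paired.
    shared_ids = sorted(old.keys() & new.keys())
    # +1: the new checkpoint fixed the question, -1: it broke it, 0: no change.
    diffs = [int(new[q]) - int(old[q]) for q in shared_ids]
    return {
        "n": len(diffs),
        "wins": sum(d == 1 for d in diffs),
        "losses": sum(d == -1 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
        "mean_diff": mean(diffs),
        # Standard error of the mean paired difference (needs at least 2 questions).
        "paired_se": stdev(diffs) / sqrt(len(diffs)),
    }


if __name__ == "__main__":
    old_run = {"q1": True, "q2": False, "q3": False, "q4": True, "q5": False}
    new_run = {"q1": True, "q2": True, "q3": True, "q4": False, "q5": False, "q6": True}
    print(paired_comparison(old_run, new_run))
```

Counting wins and losses separately matters because an unchanged average can hide many questions flipping in both directions, while a small average gain built from consistent wins and almost no losses is more convincing than the mean suggests.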

Main components (architecture)

olmo-eval is structured into four blocks that you can use separately or combine:

Task / Suite / Harness: defines what is evaluated (Task), groups tasks (Suite), and controls how they run (Harness). This lets you run the same Task in baseline mode or with tools without touching the definition.
Sandbox and capability routing: supports agentic evaluations where the model uses tools. olmo-eval runs those tools and returns their outputs to the model to assess tool use under realistic conditions.
Normalized experiment schema: records each run, its configuration, and results in a structured format. Ideal for tracking long development efforts.
Results viewer for paired comparisons: aligns two checkpoints question by question to surface small changes that an average might hide.
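To give a feel for what a normalized experiment record might contain, here is a hypothetical sketch. The source does not show olmo-eval's actual schema, so every field name below is an assumption made for illustration; the point is only that each run stores its configuration and results together in one structured, serializable object.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """One evaluation run. All field names are hypothetical, not olmo-eval's schema."""

    run_id: str
    model: str
    task: str
    harness: str
    config: dict = field(default_factory=dict)        # e.g. sampling parameters
    metrics: dict = field(default_factory=dict)       # aggregate scores
    per_question: dict = field(default_factory=dict)  # question id -> correct?

    def to_json_line(self) -> str:
        """Serialize to one JSON line so runs can be appended to a log file."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json_line(cls, line: str) -> "RunRecord":
        """Rebuild a record from a line written by to_json_line."""
        return cls(**json.loads(line))


if __name__ == "__main__":
    record = RunRecord(
        run_id="run-001",
        model="my-instruct-checkpoint",
        task="internal_freshqa:zero",
        harness="default",
        config={"temperature": 0.0},
        metrics={"accuracy": 0.7},
        per_question={"freshqa_0": True, "freshqa_1": False},
    )
    print(record.to_json_line())
```

Keeping per-question outcomes next to the aggregate metrics is what makes later paired comparisons possible without re-running the model.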

Technical example: authoring a task and variants

Here’s an example in Python (exactly how olmo-eval expects it):

from olmo_eval.common.formatters import ChatFormatter
from olmo_eval.common.metrics import AccuracyMetric
from olmo_eval.common.scorers import ExactMatchScorer
from olmo_eval.common.types import Instance, SamplingParams
from olmo_eval.data import DataLoader, DataSource
from olmo_eval.evals.tasks.common import Task, register, register_variant

@register("internal_freshqa")
class InternalFreshQA(Task):
    data_source = DataSource(path="s3://evals/internal/freshqa.jsonl", split="test")
    formatter = ChatFormatter()
    sampling_params = SamplingParams(temperature=0.0)
    metrics = (AccuracyMetric(scorer=ExactMatchScorer),)

    @property
    def instances(self):
        loader = DataLoader()
        for idx, doc in enumerate(loader.load(self.config.get_data_source())):
            yield Instance(
                question=doc["question"],
                gold_answer=doc["answer"],
                metadata={"id": doc.get("id", f"freshqa_{idx}")},
            )

# variants without duplicating the task
register_variant("internal_freshqa", "3shot", num_fewshot=3, fewshot_seed=1234)
register_variant("internal_freshqa", "zero", num_fewshot=0)

Suites and runs from the CLI:

# Run a baseline
olmo-eval run -m my-instruct-checkpoint -t internal_freshqa:zero

# Same task, runtime with tools / search
olmo-eval run -m my-instruct-checkpoint -t internal_freshqa:zero --harness search_agent

How it compares with Harbor (and when to use each)

Focus: Harbor is designed to run and publish agentic benchmarks inside sealed containers. olmo-eval is meant for daily development work, where you iterate quickly.
Containers: Harbor runs everything inside reproducible containers. olmo-eval lets you choose: lightweight execution by default and sandbox only when the benchmark requires it.
Quick integration: adding a benchmark to Harbor usually involves verification steps for publishing; in olmo-eval you just define a Task or wrap an existing benchmark with ExternalEval if it already has its own runner.
Analysis: olmo-eval focuses on paired comparisons and inferential metrics to distinguish real improvement from noise.

Practical recommendations for technical teams

Integrate olmo-eval into your checkpoint pipeline. Run key suites automatically to catch deviations early.
Use variants to test prompt policies or few-shot setups without duplicating benchmark code.
Leverage the normalized schema to build dashboards and traceability between experiments.
Look at the standard error and the minimum detectable effect before declaring victories. A small change in the mean may not be significant.
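The last recommendation can be turned into a simple gate in a checkpoint pipeline. The sketch below is an assumption about how a team might do it, not a feature of olmo-eval: the function name and verdict labels are hypothetical, and it relies on a normal-approximation confidence interval around the measured change.

```python
from statistics import NormalDist


def verdict(delta: float, standard_error: float, confidence: float = 0.95) -> str:
    """Classify a metric change given its standard error.

    delta is new minus old (for example, a paired accuracy difference).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    low = delta - z * standard_error
    high = delta + z * standard_error
    if low > 0:
        return "improvement"   # whole interval above zero
    if high < 0:
        return "regression"    # whole interval below zero
    return "inconclusive"      # interval contains zero


if __name__ == "__main__":
    # The same +2.4 point change, measured with two different standard errors.
    print(verdict(0.024, 0.015))
    print(verdict(0.024, 0.010))
```

With a standard error of 1.5 points the +2.4 point change is inconclusive; with a standard error of 1.0 point the same change counts as an improvement. The size of the delta alone does not settle the question.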

Final reflection

olmo-eval isn’t just another benchmarking tool; it’s a toolbox so evaluation follows development, not the other way around. If your recurring question is “what changed between this checkpoint and the previous one?”, olmo-eval is designed to give you a reproducible, actionable answer, from lightweight runs to sandboxed execution when needed.

Original source

https://huggingface.co/blog/allenai/olmo-eval
