AssetOpsBench: a benchmark for AI agents in industrial operations
AssetOpsBench proposes something many benchmarks don't: evaluating AI agents in the real complexity of industrial operations, with noisy sensors, work orders and the need to coordinate multiple agents. Why does this matter to you? Because on the plant floor it's not enough to get a single question right; you need traceability, fault handling and prudent decisions under uncertainty.
What AssetOpsBench is and why it matters
AssetOpsBench is an evaluation framework designed for AI agents that manage the asset lifecycle (for example, chillers and air handling units). Its goal is to close the gap between academic tests and real operational demands: multi-agent setups, heterogeneous data, failure modes and working with incomplete context.
Unlike benchmarks that measure isolated tasks (coding, web browsing), AssetOpsBench measures how agents solve real workflows, how they explain their decisions and how they manage risk when information is partial or inconsistent.
What the benchmark contains: data and scenarios
The dataset and scenarios are robust and operation-focused:
2.3M sensor telemetry points.
140+ curated scenarios involving 4 different agents.
4.2K work orders covering a variety of cases.
53 structured failure modes, plus discovery of new patterns.
Experts also reviewed 150+ scenarios, and each case includes metadata: task type, output format, category and sub-agents. Covered tasks include:
Anomaly detection in sensor time series.
Reasoning about failure modes and diagnosis.
Forecasting and KPI analysis.
Summarizing and prioritizing work orders.
If you've ever worked with a chiller that reports strange noises and inconsistent readings, you'll understand why simulating those flows is crucial before trusting an agent in production.
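To make the scenario structure more concrete, here is a minimal, hypothetical sketch of how one scenario record with the metadata fields described above could be represented. The class, field names and example values are illustrative, not AssetOpsBench's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Hypothetical scenario record mirroring the metadata described above."""
    scenario_id: str
    task_type: str          # e.g. "anomaly_detection", "failure_mode_reasoning", "kpi_forecast"
    output_format: str      # e.g. "json", "free_text", "table"
    category: str           # e.g. "chiller", "air_handling_unit"
    sub_agents: list[str] = field(default_factory=list)  # agents expected to collaborate

# Illustrative example; all values are made up for this sketch.
example = Scenario(
    scenario_id="chiller-017",
    task_type="anomaly_detection",
    output_format="json",
    category="chiller",
    sub_agents=["sensor_agent", "work_order_agent"],
)
```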
The six qualitative evaluation dimensions
AssetOpsBench doesn't optimize a single score. It evaluates each agent run according to six criteria designed to reflect real operational constraints:
Task Completion: whether the task was actually completed.
Retrieval Accuracy: accuracy when retrieving supporting evidence.
Result Verification: whether results are checked before being reported.
Sequence Correctness: whether actions are taken in the right order.
Clarity and Justification: how clearly decisions are explained and justified.
Hallucination Rate: how often the agent asserts unsupported information.
Important: the focus on explanations and verification turns failure into useful information, not a simple 0/1. In industry, understanding why an agent failed is often worth more than knowing that it failed.
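To make that concrete, one way to record a run against this rubric is a simple per-run score sheet. This is a minimal sketch; the class and field names are my own, not the benchmark's API. The point is that a run produces a profile across six axes rather than a single pass/fail bit.

```python
from dataclasses import dataclass

@dataclass
class RunEvaluation:
    """Illustrative per-run score sheet for the six dimensions (0.0-1.0 each)."""
    task_completion: float
    retrieval_accuracy: float
    result_verification: float
    sequence_correctness: float
    clarity_justification: float
    hallucination_rate: float  # lower is better

    def profile(self) -> dict[str, float]:
        # Keep the full profile; averaging would hide exactly the signal
        # (e.g. high completion but high hallucination) that matters in operations.
        return self.__dict__.copy()

run = RunEvaluation(0.9, 0.8, 0.6, 0.7, 0.85, 0.15)
print(run.profile())
```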
TrajFM: analyzing failures at the trajectory level
A central contribution is TrajFM, a pipeline to analyze execution trajectories:
Extraction of trajectory-level failures guided by an LLM with diagnostic prompts.
Embedding-based clustering to group recurring patterns.
Analysis and visualization for developer feedback.
This combination of LLM reasoning and statistical clustering lets you discover emerging failures without relying only on a fixed taxonomy. That's key when new failure modes appear as different agent designs are deployed.
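As an illustration of the clustering half of such a pipeline, here is a minimal sketch that assumes failure descriptions have already been extracted by an LLM with diagnostic prompts. It uses sentence-transformers and scikit-learn as stand-ins of my own choosing, not necessarily TrajFM's actual stack, and the example descriptions are invented.

```python
# pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Failure descriptions extracted upstream by an LLM (illustrative examples).
failures = [
    "Agent reported the task complete but never verified the KPI forecast.",
    "Tool call returned an error; agent continued as if it had succeeded.",
    "Output did not follow the requested JSON schema.",
    "Agent ignored the operator's correction and repeated the same plan.",
]

# Embed each description, then group recurring patterns.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(failures)

kmeans = KMeans(n_clusters=2, n_init="auto", random_state=0).fit(embeddings)
for label, text in zip(kmeans.labels_, failures):
    print(label, text)
```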
Technical findings and community results
A community evaluation ran with 225 users, 300+ agents and open source models. Results summarized:
| Model family | Best Planning | Best Execution | Key limitation |
| --- | --- | --- | --- |
| GPT-4.1 | 68.2 | 72.4 | Hallucinates in complex workflows |
| Mistral-Large | 64.7 | 69.1 | Struggles with multi-tool sequences |
| LLaMA-4 Maverick | 66.0 | 70.8 | Fails to ask clarifying questions (fixable) |
| LLaMA-3-70B | 52.3 | 58.9 | Collapses in multi-agent coordination |
No model reached 85 points, the threshold defined as ready for deployment.
Failure distribution across 881 execution traces:
Ineffective Error Recovery: 31.2%
Overstated Completion: 23.8%
Formatting Issues: 21.4%
Unhandled Tool Errors: 10.3%
Ignored Feedback: 8.0%
Other: 5.3%
185 traces showed a new failure pattern and 164 had multiple new failures. Recurrent patterns:
Misalignment between telemetry, alerts and historical work orders.
Overconfident conclusions with incomplete or late evidence.
Inconsistent aggregation of heterogeneous data across agents.
Premature action selection without sufficient verification.
Failures in multi-agent coordination: ignored inputs or action-reasoning mismatch.
Concrete operational insights:
Tool usage is a differentiator: top agents showed 94% accuracy in tool use vs 61% for low performers (a sketch of this metric follows the list).
Multi-agent multiplies failures: single-agent task accuracy 68% vs multi-agent 47%.
RAG and access to manuals/failure docs improves performance, but requires structured reasoning to be effective.
Ambiguity (missing sensors, conflicting logs) reduces success rate by 34% if the agent doesn't ask clarifying questions.
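As promised above, here is a rough sketch of how a tool-usage accuracy figure could be computed from execution traces. The trace format is invented for illustration; it is not the benchmark's actual trace schema.

```python
# Hypothetical trace format: each step records whether a tool call was well-formed
# and whether its result was actually used correctly downstream.
traces = [
    [{"tool": "get_sensor_data", "correct": True}, {"tool": "detect_anomaly", "correct": True}],
    [{"tool": "get_sensor_data", "correct": True}, {"tool": "summarize_work_orders", "correct": False}],
]

def tool_use_accuracy(traces: list[list[dict]]) -> float:
    """Fraction of tool calls across all traces that were used correctly."""
    calls = [step for trace in traces for step in trace]
    return sum(step["correct"] for step in calls) / len(calls)

print(f"tool-use accuracy: {tool_use_accuracy(traces):.0%}")  # 75% for this toy data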
What this means for designing industrial agents
If you develop or evaluate plant agents, these points are practical:
Design explicit verification and escalation strategies; prefer deferring action under high uncertainty.
Implement active clarification (questions to the operator) when evidence is insufficient; a minimal sketch of this gate follows the list.
Model operational contexts and uncertainty; agents that quantify confidence produce more stable trajectories.
Log and analyze traces in aggregate; don't expose sensitive data, but provide actionable feedback.
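Here is a minimal sketch of the first two recommendations, assuming the agent exposes a confidence estimate for its proposed action. The thresholds, class and function names are illustrative, not prescriptive.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str
    confidence: float            # agent's own estimate in [0, 1]
    missing_evidence: list[str]  # e.g. sensors or documents it could not retrieve

def decide(action: ProposedAction, act_threshold: float = 0.8, clarify_threshold: float = 0.5) -> str:
    """Ask, escalate, defer, or act depending on confidence and missing evidence."""
    if action.missing_evidence:
        # Active clarification: ask the operator rather than guessing.
        return f"ASK OPERATOR: need {', '.join(action.missing_evidence)}"
    if action.confidence < clarify_threshold:
        return "ESCALATE: confidence too low to act safely"
    if action.confidence < act_threshold:
        return "DEFER: gather more evidence before acting"
    return f"ACT: {action.description}"

print(decide(ProposedAction("reset chiller alarm", 0.62, ["return-water temperature sensor"])))
```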
How to participate and privacy
AssetOpsBench-Live is open for competition. Participation flow:
Local validation with the simulated environment that includes representative data and a failure catalog.
Containerize your agent and submit it for remote execution on hidden scenarios.
You receive aggregated scores on the six dimensions and clustered failure summaries without exposing raw traces.
The evaluation is reproducible and preserves industrial confidentiality, delivering feedback designed to iterate on the agent without leaking sensitive data.
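If it helps to visualize the local-validation step, a dry run over simulated scenarios could look like the sketch below. The load_scenarios and run_agent functions are placeholders for whatever interface the environment actually exposes, so treat this as the shape of the loop, not the benchmark's real API.

```python
import json
from pathlib import Path

def run_agent(scenario: dict) -> dict:
    """Placeholder for your agent's entry point (replace with your own implementation)."""
    return {"answer": "not implemented", "steps": []}

def load_scenarios(path: str) -> list[dict]:
    """Placeholder loader: assumes one JSON file containing a list of scenario records."""
    return json.loads(Path(path).read_text())

def local_validation(path: str) -> None:
    for scenario in load_scenarios(path):
        result = run_agent(scenario)
        # Log the full trajectory locally so you can inspect sequencing and error handling
        # before submitting the containerized agent for remote evaluation.
        print(scenario.get("scenario_id", "?"), "->", result["answer"])

if __name__ == "__main__":
    local_validation("scenarios.json")
```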
Final thoughts
AssetOpsBench isn't just another benchmark: it's a tool to move AI agents from isolated answers toward reliable, explainable workflows that can learn from their failures. Do you want an agent to operate in a machine room or support technical decisions? You need metrics that measure verification, sequencing and error handling. This benchmark gives you that mirror.