IBM Research and UC Berkeley put a practical magnifying glass on why conversational and automation agents fail in real IT environments. Instead of stopping at a success percentage, they turn thousands of traces into actionable signals so you, as an engineer or product lead, know exactly what to tweak.
What they did: ITBench + MAST to diagnose failures
They partnered to analyze how LLM agents break on SRE, security, and FinOps tasks: incident triage, log and metric queries, and Kubernetes actions inside long tool-using loops.
They annotated 310 ITBench SRE traces produced by agents using three models: Gemini-3-Flash, Kimi-K2 and GPT-OSS-120B.
They used MAST (Multi-Agent System Failure Taxonomy) to transform raw traces into structured failure vectors.
Main evaluation metric: recall, because in these SRE runs an agent surfaces only a handful of candidate diagnoses, and what SREs care about most is whether the correct one comes back at all.
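As a minimal sketch of what per-trace recall measures here (the entity names are hypothetical, not from the dataset):

```python
def recall(predicted: set[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth faulty entities the agent actually surfaced."""
    return len(predicted & ground_truth) / len(ground_truth)

# The agent names 3 of the 4 truly faulty components; the extra guess
# ("dns") doesn't hurt recall, which is why the metric fits this setting.
print(recall({"db", "cache", "lb", "dns"}, {"db", "cache", "lb", "api"}))  # 0.75
```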
Key results (recall by model):
Gemini-3-Flash: 100 traces, 75.5% mean recall
Kimi-K2: 105 traces, 28.6% mean recall
GPT-OSS-120B: 105 traces, 12.4% mean recall
What MAST is and why it matters
MAST is not just another metric. It's a taxonomy that classifies failures into 14 patterns grouped in 3 main categories:
FC1: System Design Issues - 'the skeleton' (e.g. FM-1.3 Step Repetition, FM-1.4 Loss of Conversation History, FM-1.5 Unaware of Termination Conditions)
FC2: Inter-Agent Misalignment - 'the communication' (e.g. FM-2.2 Fail to Ask for Clarification, FM-2.6 Reasoning-Action Mismatch)
FC3: Task Verification - 'the quality control' (e.g. FM-3.3 Incorrect Verification)
The benefit: MAST doesn't just tell you an agent failed; it tells you what failed, where, and which intervention gives you the most leverage.
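To make that concrete, here is a minimal sketch of how an annotated trace becomes a structured failure vector. The mode list is abridged to the modes cited in this article, and the helper is our own illustration, not the paper's code:

```python
# Abridged MAST mode list; the full taxonomy has 14 failure modes.
MAST_MODES = [
    "FM-1.3",  # Step Repetition
    "FM-1.4",  # Loss of Conversation History
    "FM-1.5",  # Unaware of Termination Conditions
    "FM-2.2",  # Fail to Ask for Clarification
    "FM-2.6",  # Reasoning-Action Mismatch
    "FM-3.3",  # Incorrect Verification
]

def failure_vector(annotated_modes: set[str]) -> list[int]:
    """Encode the modes an annotator marked in one trace as a 0/1 vector."""
    return [1 if mode in annotated_modes else 0 for mode in MAST_MODES]

# A Kimi-K2-style trace: never terminates, actions diverge from reasoning.
print(failure_vector({"FM-1.5", "FM-2.6"}))  # [0, 0, 1, 0, 1, 0]
```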
Technical findings by model and their failure signatures
Gemini-3-Flash profile: "surgical" failures
Density: ~2.6 failure modes per failed trace.
Pattern: high internal coherence; failures come from isolated chokepoints, especially FM-3.3 Incorrect Verification.
Diagnosis: the model tends to "declare victory" without external evidence.
Engineering recommendation: externalize verification. Don't let the LLM grade itself; demand evidence from tools (for example, AlertManager clearance or observable changes in K8s state) before finishing (see the sketch after this list).
Practical note: prompt engineering gives limited gains. In the cited analysis, prompt tweaks yielded ~15.6% improvement, while adding auxiliary agents or state control (summarizer + state machine) reached ~53%.
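Here is a minimal sketch of such a verification gate, assuming a standard Alertmanager v2 endpoint; the URL and the helper names are placeholders, not the study's harness:

```python
import requests

ALERTMANAGER_URL = "http://alertmanager:9093"  # placeholder endpoint

def incident_resolved(alert_name: str) -> bool:
    """Tool-backed evidence: success only if the alert is no longer firing."""
    resp = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts", timeout=10)
    resp.raise_for_status()
    firing = {
        alert["labels"].get("alertname")
        for alert in resp.json()
        if alert.get("status", {}).get("state") == "active"
    }
    return alert_name not in firing

def finalize(agent_claims_success: bool, alert_name: str) -> str:
    # The model's self-assessment alone never closes the run.
    if agent_claims_success and incident_resolved(alert_name):
        return "success"
    return "needs_more_work"
```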
Kimi-K2 profile: the problem is terminating and executing correctly
Density: ~4.7 failure modes per failed trace.
Dominant patterns: FM-1.5 Unaware of Termination Conditions and FM-2.6 Reasoning-Action Mismatch (92% of failures include this mode).
Symptoms: the model over-reasons but fails at execution, either terminating early or entering meta-debugging loops.
Engineering recommendation: take termination decisions out of the model. Implement a deterministic state machine, loop detectors for repeated tool calls, and clear stop rules (a minimal sketch follows this list).
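A minimal sketch of that controller; the budgets and thresholds are assumptions you would tune per environment:

```python
from collections import Counter

MAX_STEPS = 30    # assumed hard step budget
MAX_REPEATS = 3   # identical tool calls tolerated before aborting

class RunController:
    """Deterministic termination: the model proposes, this controller disposes."""

    def __init__(self) -> None:
        self.steps = 0
        self.call_counts: Counter = Counter()

    def should_stop(self, tool: str, args: tuple) -> str | None:
        """Return a stop reason, or None to let the agent continue."""
        self.steps += 1
        self.call_counts[(tool, args)] += 1
        if self.steps > MAX_STEPS:
            return "budget_exhausted"
        if self.call_counts[(tool, args)] > MAX_REPEATS:
            return "loop_detected"  # guards FM-1.3 / FM-1.5 style failures
        return None
```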
GPT-OSS-120B profile: cascading collapse from context loss
Density: ~5.3 failure modes per failed trace.
Critical patterns: FM-1.4 Loss of Conversation History (24% of traces) and FM-2.6 Reasoning-Action Mismatch (94%).
Diagnosis: small early reasoning discrepancies poison the task history and cause total drift.
Engineering recommendation: aggressive context hygiene, early error detection, and mechanisms that rebuild or reinforce state before the agent proceeds (sketched below).
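A minimal sketch of that hygiene, assuming a simple checkpoint of tool-confirmed facts that gets re-injected every turn (the field names and incident are illustrative):

```python
# Keep verified state outside the chat history so drift cannot erase it.
checkpoint = {
    "incident": "HighLatency in checkout-svc",  # illustrative incident
    "verified_facts": [],  # only tool-confirmed observations go here
}

def record_fact(fact: str) -> None:
    """Called after a tool call confirms an observation."""
    checkpoint["verified_facts"].append(fact)

def build_prompt(task: str, recent_turns: list[str]) -> str:
    # Rebuild state up front instead of trusting a long, possibly poisoned history.
    facts = "\n".join(f"- {f}" for f in checkpoint["verified_facts"])
    return (
        f"Task: {task}\n"
        f"Confirmed so far (do not re-derive):\n{facts}\n"
        f"Recent turns:\n" + "\n".join(recent_turns[-5:])
    )
```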
General lessons and concrete actions
Don't let the LLM be the arbiter of its own success: externalize verification with tool-backed checks.
Control termination outside the model: add explicit stop conditions, loop detectors, or state machines.
Treat ambiguity as an explicit branch in policy: force 'clarify-or-read-only' when input is unclear, instead of letting the agent guess (see the sketch after this list).
Auxiliary agents help: a Summarizer Agent or a State Manager that updates context can fix FM-1.4 and substantially boost performance.
Distinguish benign failures from fatal ones: not every repetition or weak verification kills the task. Memory loss, premature termination, or failure to clarify tend to be fatal.
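As a sketch of that clarify-or-read-only branch; the tool names, confidence signal, and threshold are assumptions, not from the study:

```python
READ_ONLY_TOOLS = {"get_logs", "query_metrics", "kubectl_get"}  # assumed names

def route_action(tool: str, confidence: float) -> str:
    """Ambiguity becomes an explicit policy branch instead of a silent guess."""
    if confidence < 0.6:  # assumed threshold
        if tool in READ_ONLY_TOOLS:
            return "execute"          # safe: gather more evidence
        return "ask_clarification"    # mutating action under uncertainty: ask first
    return "execute"
```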
How this changes your agent evaluation
If you build agents for business workflows, you want more than a success number. You need a failure map that tells you where to apply engineering for the best return. MAST applied to ITBench does exactly that: it turns traces into prioritized fixes.
Quick recommendations for product and engineering teams
For frontier models (e.g. Gemini): implement external verification gates before marking success.
For mid-tier models (e.g. Kimi-K2): use a state machine and loop detectors; shorten redundant reasoning chains.
For large open models (e.g. GPT-OSS-120B): invest in context management and periodic checkpoints to prevent progressive forgetting.
MAST-Data (1600+ traces): available on Hugging Face and associated repos
The practical conclusion is simple: measuring is useful, diagnosing is indispensable. If you want agents that work in production, don't optimize only for success rate; prioritize tools that tell you why and how to fix failures.