IBM Research and UC Berkeley put a practical magnifying glass on why conversational and automation agents fail in real IT environments. Instead of stopping at a success percentage, they turn thousands of traces into actionable signals so you, as an engineer or product lead, know exactly what to tweak.
What they did: ITBench + MAST to diagnose failures
They partnered to analyze how LLM agents break on SRE, security, and FinOps tasks: incident triage, log and metric queries, and Kubernetes actions inside long tool-using loops.
They annotated 310 ITBench SRE traces produced by agents using three models: Gemini-3-Flash, Kimi-K2 and GPT-OSS-120B.
They used MAST (Multi-Agent System Failure Taxonomy) to transform raw traces into structured failure vectors.
Main evaluation metric: recall, because in these SRE runs an agent surfaces only a handful of candidate diagnoses, and what SREs care about most is whether the correct one comes back at all.
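As a minimal sketch of what per-trace recall measures here (the entity names are hypothetical, not from the dataset):

```python
def recall(predicted: set[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth faulty entities the agent actually surfaced."""
    return len(predicted & ground_truth) / len(ground_truth)

# The agent names 3 of the 4 truly faulty components; the extra guess
# ("dns") doesn't hurt recall, which is why the metric fits this setting.
print(recall({"db", "cache", "lb", "dns"}, {"db", "cache", "lb", "api"}))  # 0.75
```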
Key results (recall by model):
Gemini-3-Flash: 100 traces, 75.5% mean recall
Kimi-K2: 105 traces, 28.6% mean recall
GPT-OSS-120B: 105 traces, 12.4% mean recall
What MAST is and why it matters
MAST is not just another metric. It's a taxonomy that classifies failures into 14 patterns grouped in 3 main categories:
FC1: System Design Issues - 'the skeleton' (e.g. FM-1.3 Step Repetition, FM-1.4 Loss of Conversation History, FM-1.5 Unaware of Termination Conditions)
FC2: Inter-Agent Misalignment - 'the communication' (e.g. FM-2.2 Fail to Ask for Clarification, FM-2.6 Reasoning-Action Mismatch)
FC3: Task Verification - 'the quality control' (e.g. FM-3.3 Incorrect Verification)
The benefit: MAST doesn't just tell you an agent failed; it tells you what failed, where, and which intervention gives you the most leverage.
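To make that concrete, here is a minimal sketch of how an annotated trace becomes a structured failure vector. The mode list is abridged to the modes cited in this article, and the helper is our own illustration, not the paper's code:

```python
# Abridged MAST mode list; the full taxonomy has 14 failure modes.
MAST_MODES = [
    "FM-1.3",  # Step Repetition
    "FM-1.4",  # Loss of Conversation History
    "FM-1.5",  # Unaware of Termination Conditions
    "FM-2.2",  # Fail to Ask for Clarification
    "FM-2.6",  # Reasoning-Action Mismatch
    "FM-3.3",  # Incorrect Verification
]

def failure_vector(annotated_modes: set[str]) -> list[int]:
    """Encode the modes an annotator marked in one trace as a 0/1 vector."""
    return [1 if mode in annotated_modes else 0 for mode in MAST_MODES]

# A Kimi-K2-style trace: never terminates, actions diverge from reasoning.
print(failure_vector({"FM-1.5", "FM-2.6"}))  # [0, 0, 1, 0, 1, 0]
```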
Technical findings by model and their failure signatures
Gemini-3-Flash profile: "surgical" failures
Density: ~2.6 failure modes per failed trace.
Pattern: high internal coherence; failures come from isolated chokepoints, especially FM-3.3 Incorrect Verification.
Diagnosis: the model tends to "declare victory" without external evidence.
Engineering recommendation: externalize verification. Don't let the LLM grade itself; demand evidence from tools (for example, AlertManager clearance or observable changes in K8s state) before finishing (see the sketch after this list).
Practical note: prompt engineering gives limited gains. In the cited analysis, prompt tweaks yielded ~15.6% improvement, while adding auxiliary agents or state control (summarizer + state machine) reached ~53%.
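Here is a minimal sketch of such a verification gate, assuming a standard Alertmanager v2 endpoint; the URL and the helper names are placeholders, not the study's harness:

```python
import requests

ALERTMANAGER_URL = "http://alertmanager:9093"  # placeholder endpoint

def incident_resolved(alert_name: str) -> bool:
    """Tool-backed evidence: success only if the alert is no longer firing."""
    resp = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts", timeout=10)
    resp.raise_for_status()
    firing = {
        alert["labels"].get("alertname")
        for alert in resp.json()
        if alert.get("status", {}).get("state") == "active"
    }
    return alert_name not in firing

def finalize(agent_claims_success: bool, alert_name: str) -> str:
    # The model's self-assessment alone never closes the run.
    if agent_claims_success and incident_resolved(alert_name):
        return "success"
    return "needs_more_work"
```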
Kimi-K2 profile: the problem is terminating and executing correctly
Density: ~4.7 failure modes per failed trace.
Dominant patterns: FM-1.5 Unaware of Termination Conditions and FM-2.6 Reasoning-Action Mismatch (92% of failures include this mode).
Symptoms: the model over-reasons but fails at execution, either terminating early or entering meta-debugging loops.
Engineering recommendation: take termination decisions out of the model. Implement a deterministic state machine, loop detectors for repeated tool calls, and clear stop rules (a minimal sketch follows this list).
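A minimal sketch of that controller; the budgets and thresholds are assumptions you would tune per environment:

```python
from collections import Counter

MAX_STEPS = 30    # assumed hard step budget
MAX_REPEATS = 3   # identical tool calls tolerated before aborting

class RunController:
    """Deterministic termination: the model proposes, this controller disposes."""

    def __init__(self) -> None:
        self.steps = 0
        self.call_counts: Counter = Counter()

    def should_stop(self, tool: str, args: tuple) -> str | None:
        """Return a stop reason, or None to let the agent continue."""
        self.steps += 1
        self.call_counts[(tool, args)] += 1
        if self.steps > MAX_STEPS:
            return "budget_exhausted"
        if self.call_counts[(tool, args)] > MAX_REPEATS:
            return "loop_detected"  # guards FM-1.3 / FM-1.5 style failures
        return None
```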
GPT-OSS-120B profile: cascading collapse from context loss
Density: ~5.3 failure modes per failed trace.
Critical patterns: FM-1.4 Loss of Conversation History (24% of traces) and FM-2.6 Reasoning-Action Mismatch (94%).
Diagnosis: small early reasoning discrepancies poison the task history and cause total drift.
Engineering recommendation: aggressive context hygiene, early error detection, and mechanisms that rebuild or reinforce state before the agent proceeds (sketched below).
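A minimal sketch of that hygiene, assuming a simple checkpoint of tool-confirmed facts that gets re-injected every turn (the field names and incident are illustrative):

```python
# Keep verified state outside the chat history so drift cannot erase it.
checkpoint = {
    "incident": "HighLatency in checkout-svc",  # illustrative incident
    "verified_facts": [],  # only tool-confirmed observations go here
}

def record_fact(fact: str) -> None:
    """Called after a tool call confirms an observation."""
    checkpoint["verified_facts"].append(fact)

def build_prompt(task: str, recent_turns: list[str]) -> str:
    # Rebuild state up front instead of trusting a long, possibly poisoned history.
    facts = "\n".join(f"- {f}" for f in checkpoint["verified_facts"])
    return (
        f"Task: {task}\n"
        f"Confirmed so far (do not re-derive):\n{facts}\n"
        f"Recent turns:\n" + "\n".join(recent_turns[-5:])
    )
```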
General lessons and concrete actions
Don't let the LLM be the arbiter of its own success: externalize verification with tool-backed checks.
Control termination outside the model: add explicit stop conditions, loop detectors, or state machines.
Treat ambiguity as an explicit branch in policy: force 'clarify-or-read-only' when input is unclear, instead of letting the agent guess (see the sketch after this list).
Auxiliary agents help: a Summarizer Agent or a State Manager that updates context can fix FM-1.4 and substantially boost performance.
Distinguish benign failures from fatal ones: not every repetition or weak verification kills the task. Memory loss, premature termination, or failure to clarify tend to be fatal.
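As a sketch of that clarify-or-read-only branch; the tool names, confidence signal, and threshold are assumptions, not from the study:

```python
READ_ONLY_TOOLS = {"get_logs", "query_metrics", "kubectl_get"}  # assumed names

def route_action(tool: str, confidence: float) -> str:
    """Ambiguity becomes an explicit policy branch instead of a silent guess."""
    if confidence < 0.6:  # assumed threshold
        if tool in READ_ONLY_TOOLS:
            return "execute"          # safe: gather more evidence
        return "ask_clarification"    # mutating action under uncertainty: ask first
    return "execute"
```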
How this changes your agent evaluation
If you build agents for business workflows, you want more than a success number. You need a failure map that tells you where to apply engineering for the best return. MAST applied to ITBench does exactly that: it turns traces into prioritized fixes.
Quick recommendations for product and engineering teams
For frontier models (e.g. Gemini): implement external verification gates before marking success.
For mid-tier models (e.g. Kimi-K2): use a state machine and loop detectors; shorten redundant reasoning chains.
For large open models (e.g. GPT-OSS-120B): invest in context management and periodic checkpoints to prevent progressive forgetting.
MAST-Data (1600+ traces): available on Hugging Face and associated repos
The practical conclusion is simple: measuring is useful, diagnosing is indispensable. If you want agents that work in production, don't optimize only for success rate; prioritize tools that tell you why and how to fix failures.