IBM Research and Artificial Analysis present ITBench-AA, the first benchmark focused on agentic IT tasks for enterprise environments. It starts with Site Reliability Engineering (SRE) tasks on Kubernetes snapshots and shows that frontier models still struggle: none scores above 50%.
Does that mean AI is useless for ops? Not at all. It means the tasks are hard in a specific way: they call for minimal, evidence-based answers, not long lists of guesses. ITBench-AA is designed to test exactly that.
