Tristan Hume, from Anthropic's performance optimization team, explains how their take-home for hiring performance engineers stopped telling candidates apart once Claude models reached a new level of capability. Interesting, right? It's more than an anecdote: it's a practical lesson in designing technical evaluations that keep providing signal in the era of AI assistance.
Why the test existed and what it aimed to measure
They needed to evaluate many people without burning interviewers' time on live interviews. The goal was simple: a realistic, engaging exercise that measured optimization skills and engineering thinking at a fine-grained level.
The design principles were straightforward and effective: resemble real work, offer high signal (many opportunities to show skill), avoid narrow domain knowledge, allow fast develop-debug loops, and, importantly, remain compatible with AI assistance. Anthropic explicitly allowed candidates to use AI if they wanted; the point was to see what extra value the human added when AI was available.
The technical simulation: a machine that demands engineering
They built a Python simulator of a TPU-like accelerator with details relevant to optimization:
- a scratchpad memory that must be managed manually
- VLIW: multiple functional units execute in parallel each cycle
- SIMD: vector instructions that process many elements at once
- multicore: work has to be distributed across cores
The core task was a parallel tree traversal, deliberately not a deep-learning workload, to avoid favoring people already experienced in ML. The exercise progressed from multicore parallelism to SIMD and VLIW packing, and the original version included an intentional bug to push candidates into building their own debugging tools.
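To make the machine model concrete, here is a minimal toy sketch in the spirit of the description above. It is not Anthropic's actual simulator: the instruction format, slot names, and sizes are invented for illustration. It shows a core with a software-managed scratchpad that executes VLIW bundles, where each slot issues at most one instruction per cycle and vector instructions operate on several lanes at once.

```python
# Toy model of the concepts in the take-home (not Anthropic's simulator):
# a manually managed scratchpad, VLIW issue slots, and SIMD lanes.

LANES = 8                          # SIMD width: elements per vector instruction
SLOTS = ("alu", "load", "store")   # VLIW: one instruction per unit per cycle


class ToyCore:
    def __init__(self, scratchpad_words=1024):
        self.scratchpad = [0] * scratchpad_words  # software-managed, no cache
        self.regs = {}                            # name -> list of LANES values
        self.cycles = 0

    def step(self, bundle):
        """Execute one VLIW bundle: a dict mapping slot -> instruction tuple."""
        for slot, instr in bundle.items():
            assert slot in SLOTS, f"unknown slot: {slot}"
            op, *args = instr
            if op == "vload":                     # load LANES words from scratchpad
                dst, addr = args
                self.regs[dst] = self.scratchpad[addr:addr + LANES]
            elif op == "vadd":                    # SIMD add, lane by lane
                dst, a, b = args
                self.regs[dst] = [x + y for x, y in zip(self.regs[a], self.regs[b])]
            elif op == "vstore":                  # store LANES words back
                src, addr = args
                self.scratchpad[addr:addr + LANES] = self.regs[src]
        self.cycles += 1


core = ToyCore()
core.scratchpad[0:16] = list(range(16))
core.step({"load": ("vload", "v0", 0)})                                        # 1 slot used
core.step({"load": ("vload", "v1", 8), "alu": ("vadd", "v2", "v0", "v0")})     # 2 slots packed
core.step({"alu": ("vadd", "v3", "v0", "v1"), "store": ("vstore", "v2", 32)})  # 2 slots packed
print(core.cycles, core.regs["v3"])   # 3 [8, 10, 12, ..., 22]
```

Packing more useful work into each bundle, so that the ALU, load, and store slots are busy in the same cycle, is exactly the kind of optimization the exercise rewards.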
How the Claude models broke the test (and what they did)
The test worked great at first: a thousand candidates completed it, and many hires came from that pool. But when they tried internal Claude models, the problem changed fast.
Claude Opus 4, within a 4-hour window, outperformed almost all humans. Cutting the time to 2 hours and adding depth to the problem (version 2) separated candidates again, until Opus 4.5 caught up and matched the best human performance in that time bracket.
Opus 4.5 didn't just apply common micro-optimizations: given more time and guidance from an engineer's prompts, it discovered structural tricks (for example, transforming computation instead of data) and kept improving past the 2-hour mark. The result: under the take-home's time constraint, delegating work to Claude Code became the dominant strategy.
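As a rough, generic illustration of what "transforming the computation instead of the data" can look like (an assumed example, not the specific trick Opus 4.5 found): instead of materializing a transposed copy of a matrix so a later loop can read it contiguously, you can re-index the consumer and skip the copy, and its memory traffic, entirely.

```python
# One flavor of "transform the computation instead of the data".
# Generic illustration only, not the model's actual optimization.

N = 4
a = [[r * N + c for c in range(N)] for r in range(N)]  # row-major matrix

# Data transform: materialize the transpose, then sum its rows
# (each row of a_t is a column of a).
a_t = [[a[r][c] for r in range(N)] for c in range(N)]
col_sums_v1 = [sum(row) for row in a_t]

# Computation transform: change the indexing and skip the copy entirely.
col_sums_v2 = [sum(a[r][c] for r in range(N)) for c in range(N)]

assert col_sums_v1 == col_sums_v2
print(col_sums_v1)   # [24, 28, 32, 36]
```

The second version computes the same result with no extra buffer and no transpose pass.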
Attempts to recover signal: harder and rarer problems
First strategy: design a new hard optimization, like bank-conflict-free data transposition. Result: Claude found unexpected solutions and, with more "thinking" budget, solved that version too.
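For readers unfamiliar with the term, here is a rough sketch of what a bank conflict is, under a simple assumed model where word address A maps to bank A mod NUM_BANKS (the take-home's actual memory rules may differ): reading one column of a row-major tile whose stride is a multiple of the bank count hits a single bank over and over, while padding the stride spreads the accesses across banks.

```python
# Rough illustration of memory bank conflicts under an assumed mapping
# (address % NUM_BANKS), not the take-home's actual memory system.

NUM_BANKS = 8
TILE = 8

def banks_for_column(col, stride):
    """Banks touched when 8 lanes read rows 0..7 of one column of a tile."""
    return [(row * stride + col) % NUM_BANKS for row in range(TILE)]

print(banks_for_column(0, stride=TILE))      # [0, 0, 0, ...] -> 8-way conflict, serialized
print(banks_for_column(0, stride=TILE + 1))  # [0, 1, 2, ...] -> conflict-free
```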
Second strategy: design out-of-distribution problems. Inspired by Zachtronics-style games, they created puzzles with extremely limited instruction sets, where optimization demands creativity and building your own debugging tools. Here the AI failed more often, and humans with solid engineering judgment regained the advantage.
Key points from this phase:
- No visualization or tools provided by default: part of the evaluation is whether, and how well, you build your own debugging tools (a minimal sketch of such a tool follows this list).
- Multiple subproblems: this reduces variance and lowers the chance that the AI lands on a single winning solution.
- Force engineering tradeoffs and judgment calls, not just the application of recipes.
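As an example of the kind of tool a candidate might write for themselves, here is a tiny, hypothetical occupancy tracer for VLIW bundles like the toy format sketched earlier: it prints which slots are filled each cycle, so stalls and under-packed bundles stand out at a glance.

```python
# A small homemade debugging aid of the kind candidates end up writing.
# Illustrative only; the real take-home defines its own instruction format.

SLOTS = ("alu", "load", "store")

def print_occupancy(trace):
    """trace: list of bundles, one per cycle; each bundle maps slot -> instruction."""
    print("cycle  " + "  ".join(f"{s:<5}" for s in SLOTS))
    for cycle, bundle in enumerate(trace):
        row = "  ".join(("#" if slot in bundle else ".").ljust(5) for slot in SLOTS)
        print(f"{cycle:>5}  {row}")
    used = sum(len(b) for b in trace)
    print(f"slot utilization: {used}/{len(trace) * len(SLOTS)}")

print_occupancy([
    {"load": ("vload", "v0", 0)},
    {"load": ("vload", "v1", 8), "alu": ("vadd", "v2", "v0", "v0")},
    {"alu": ("vadd", "v3", "v0", "v1"), "store": ("vstore", "v2", 32)},
])
```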
Technical and practical lessons for designing AI-resistant evaluations
If you're creating interviews today, these principles help:
- Favor out-of-distribution problems or tasks that require re-contextualizing the solution, not just applying recipes.
- Measure process as well as product: ask candidates to explain choices, what tools they built, and why.
- Include engineering tasks that involve tool design, debugging, and correctness verification.
- Use reasonable time limits: short windows reveal who can prioritize, but longer windows are useful when you want to measure depth.
- Design multiple independent subproblems to reduce variance and avoid a single optimal solution the AI can memorize.
- Evaluate the human's ability to steer, critique, and simplify model-generated code. Human value today lies in judgment, verification, and system design.
- Test your own evaluation against the most capable models you have: adversarial iteration against AI shows where the signal fails.
It's not about banning AI. It's about designing tasks where getting help doesn't erase the human contribution.
The open challenge and benchmarks
Anthropic has published the original take-home as an open challenge: with unlimited time, humans still beat current models. Here are some reference numbers, in simulator cycles (lower is better):
- 2164 cycles: Claude Opus 4 after many hours in the test harness
- 1790 cycles: Claude Opus 4.5 in a casual session, approaching the best human in 2 hours
- 1579 cycles: Claude Opus 4.5 after 2 hours in the test harness
- 1548 cycles: Claude Sonnet 4.5 after many hours
- 1487 cycles: Claude Opus 4.5 after 11.5 hours in the harness
- 1363 cycles: Claude Opus 4.5 with an improved harness after many hours
You can download it from GitHub and, if you optimize it to below 1487 cycles, Anthropic invites you to send your code and resume to performance-recruiting@anthropic.com.
Final reflection
This take-home story is a small snapshot of how engineering changes when powerful models are in the mix. Does AI take away human work? Not exactly: it changes the kinds of questions we must ask to evaluate talent.
Now we look for creativity, judgment, and the ability to build and verify systems—not just the ability to apply a known pattern. Curious? If you're interested in designing evaluations or want a technical challenge that tests your ingenuity against current models, this case is both a practical guide and an invitation to compete.
Original source
https://www.anthropic.com/engineering/AI-resistant-technical-evaluations
