Google publishes cognitive framework to measure progress toward AGI
On March 17, 2026, Google DeepMind published a new framework for evaluating progress toward Artificial General Intelligence (AGI). The proposal connects cognitive science research with practical tests and invites the community to build real evaluations through a hackathon on Kaggle.
What the framework proposes and why it matters
The idea is simple and powerful: if you want to know how close AI systems are to general intelligence, you need to measure concrete cognitive abilities—just like we do with humans. Google DeepMind proposes a taxonomy of 10 cognitive abilities that, together, describe what we mean by general intelligence.
These 10 abilities are:
Perception: extracting and processing sensory information from the environment.
Generation: producing coherent and useful text, voice, or actions.
Attention: focusing on what matters and filtering out the irrelevant.
Learning: acquiring new knowledge from experience or instruction.
Memory: storing and retrieving information over time.
Reasoning: drawing valid, structured inferences.
Metacognition: monitoring and understanding your own thought processes.
Executive functions: planning, inhibiting impulses, and switching strategies.
Problem solving: finding effective solutions in concrete domains.
Social cognition: understanding and interacting appropriately in social contexts.
Why is this list useful? Because it turns the abstract idea of AGI into measurable, comparable pieces. So you can know not only whether a model “seems” intelligent, but in which aspects it’s strong or weak. Useful, right?
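To make that concrete, here is a minimal sketch of how a per-ability score card could be recorded and compared. The `AbilityProfile` structure and the numbers are illustrative assumptions, not part of DeepMind's framework; only the ten ability names come from the taxonomy above.

```python
from dataclasses import dataclass, field

# The 10 abilities from the taxonomy above.
ABILITIES = [
    "perception", "generation", "attention", "learning", "memory",
    "reasoning", "metacognition", "executive_functions",
    "problem_solving", "social_cognition",
]

@dataclass
class AbilityProfile:
    """Illustrative per-ability score card for one model (scores in 0.0-1.0)."""
    model_name: str
    scores: dict[str, float] = field(default_factory=dict)

    def weakest(self, n: int = 3) -> list[str]:
        """Return the n abilities with the lowest recorded scores."""
        return sorted(self.scores, key=self.scores.get)[:n]

# Hypothetical numbers, only to show the shape of the comparison.
profile = AbilityProfile("example-model", {a: 0.5 for a in ABILITIES})
profile.scores["metacognition"] = 0.2
print(profile.weakest(1))  # -> ['metacognition']
```

The point of a structure like this is exactly what the list enables: instead of one opaque "intelligence" number, you get a profile you can compare across models and over time.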
How they propose measuring it
The framework goes beyond theory and suggests a three-stage evaluation protocol:
Evaluate systems across a broad set of cognitive tasks, using held-out test sets to avoid data contamination.
Collect human benchmarks on those same tasks with demographically representative adult samples.
Map each system’s performance against the distribution of human performance for each ability.
The comparison with humans isn’t about needlessly anthropomorphizing models, but about placing them against a practical reference: do they perform better than average? Do they approach the 90th percentile? Do they systematically fail at metacognition?
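As a rough illustration of the third stage, here is a minimal sketch of placing a model's score within a human distribution. It assumes you already have the model's score and a sample of human scores on the same task; the function name and the sample values are hypothetical.

```python
import numpy as np

def human_percentile(model_score: float, human_scores: list[float]) -> float:
    """Percentile of the model's score within the human sample (0-100)."""
    human = np.asarray(human_scores, dtype=float)
    return 100.0 * float(np.mean(human <= model_score))

# Hypothetical human benchmark on one task, and one model's score on it.
human_sample = [0.42, 0.55, 0.61, 0.48, 0.70, 0.66, 0.58, 0.51]
print(human_percentile(0.63, human_sample))  # -> 75.0, above the human median
```

Repeating this per ability is what turns raw task scores into statements like "around the 90th percentile on reasoning, below the median on metacognition".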
This approach fixes two common problems: it avoids isolated metrics that don’t tell the whole story, and it builds an empirical baseline to compare models over time.
From framework to practice: the hackathon on Kaggle
Recognizing abilities is one thing; designing robust tests is a lot of work that benefits from many hands. That’s why Google is launching a hackathon on Kaggle called "Measuring progress toward AGI: Cognitive abilities".
Key points of the hackathon:
They’re seeking evaluations for five abilities with the largest gaps: learning, metacognition, attention, executive functions, and social cognition.
Kaggle’s Community Benchmarks platform will let people build and validate those evaluations against state-of-the-art models.
Prizes: $10,000 for the top two in each of the five tracks, and four grand prizes of $25,000 for the best global proposals. Total: $200,000.
Timeline: submissions open March 17–April 16; results on June 1.
If you’ve ever designed a test, evaluated users, or created datasets, consider this a direct invitation: you’d be contributing real use cases and helping the community get public, comparable tools.
What does this mean for researchers, companies, and users?
For researchers: it offers a roadmap to prioritize evaluations where models still underperform.
For companies: it provides richer metrics to assess risks and capabilities before deploying products.
For users and regulators: it makes it easier to understand what models do and where they might fail, especially in sensitive areas like social cognition or metacognition.
Think of everyday examples: a model that produces convincing text might still fail to self-correct (metacognition) or to plan multi-step actions (executive functions). Those failures slip by without tests designed to catch them.
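For instance, a simple metacognition probe can compare a model's self-reported confidence with its actual accuracy. The sketch below is illustrative only: `ask_model` is a hypothetical stand-in for whatever API you use, assumed to return an answer plus a confidence in [0, 1].

```python
def calibration_gap(items, ask_model):
    """Mean |confidence - correctness| over items; larger gaps suggest weaker self-monitoring.

    `items` is a list of (question, correct_answer) pairs.
    `ask_model(question)` is assumed to return (answer, confidence in [0, 1]).
    """
    gaps = []
    for question, correct in items:
        answer, confidence = ask_model(question)
        correctness = 1.0 if answer == correct else 0.0
        gaps.append(abs(confidence - correctness))
    return sum(gaps) / len(gaps)
```

A model can score well on raw accuracy and still show a large gap here, which is exactly the kind of failure the framework wants to surface.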
Open questions
How do we define human reference standards that are fair and global? How independent should tasks be so that models can't solve them by exploiting correlations in their training data rather than the target ability? Can these evaluations stay relevant as models learn continuously?
No single evaluation will close the discussion, but this proposal points a clear direction: measure to understand and improve. And it does so by inviting the community to build the necessary tools.