OpenAI presents GDPval, an evaluation designed to measure how well AI models perform real, economically valuable work tasks, not just academic tests. Why does this matter now? Because it helps shift the conversation from what models could do to what they already do in everyday work.
What GDPval measures
GDPval evaluates representative knowledge-work tasks across 44 occupations chosen from the industries that contribute most to the United States' Gross Domestic Product. The initial release contains 1,320 specialized tasks and an open gold subcollection of 220 tasks.
Each task is built from real deliverables — a legal brief, a presentation, or an engineering drawing — which makes the evaluation more like actual work than a classroom exam. (openai.com)
How they chose occupations and built the dataset
They selected 9 industries that each contribute more than 5% to U.S. GDP and, within each, the 5 occupations that contribute most to total wages and are predominantly knowledge work. For each occupation they worked with experienced professionals (about 14 years of experience on average) who wrote and reviewed tasks through multiple quality-control cycles.
The goal was to reflect real deliverables and ensure representativeness so you can see outputs that matter in actual jobs. (openai.com)
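To make the selection rule concrete, here is a minimal Python sketch of that two-step filter: keep industries above a GDP-share threshold, then rank knowledge-work occupations within each by a wage measure. The names `Occupation`, `wage_metric`, `knowledge_work`, and `select_occupations` are hypothetical, the inputs are made-up toy values, and this is not OpenAI's actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occupation:
    name: str
    industry: str
    wage_metric: float    # hypothetical wage measure used only for ranking
    knowledge_work: bool  # hypothetical flag: is the job mostly knowledge work?


def select_occupations(gdp_share, occupations, min_share=0.05, per_industry=5):
    """Map each qualifying industry to its top knowledge-work occupations."""
    selected = {}
    for industry, share in gdp_share.items():
        if share <= min_share:  # keep only industries above the GDP threshold
            continue
        candidates = [
            o for o in occupations
            if o.industry == industry and o.knowledge_work
        ]
        # Highest wage measure first, then keep the top few.
        candidates.sort(key=lambda o: o.wage_metric, reverse=True)
        selected[industry] = [o.name for o in candidates[:per_industry]]
    return selected


# Toy, made-up inputs purely for illustration (not GDPval data).
toy_gdp_share = {"Industry A": 0.12, "Industry B": 0.03}
toy_occupations = [
    Occupation("Role 1", "Industry A", 90.0, True),
    Occupation("Role 2", "Industry A", 120.0, True),
    Occupation("Role 3", "Industry A", 150.0, False),
    Occupation("Role 4", "Industry B", 200.0, True),
]
toy_selection = select_occupations(toy_gdp_share, toy_occupations, per_industry=1)
print(toy_selection)
```

In this toy run only Industry A clears the threshold, and Role 3 is skipped despite its higher wage measure because it isn't flagged as knowledge work.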
A concrete example
One task asks a manufacturing engineer to design a jig to simplify a cable-winding test on an assembly line, create a presentation, and upload a PDF with the results. It isn't just a text question: it includes reference files and expects a professional-type deliverable.
That helps you see how AI performs on work that currently pays real salaries. (openai.com)
How they grade the responses
Model outputs are compared blind against human deliverables; experts in the same field judge whether the AI output is 'better', 'equal', or 'worse' than the human version. OpenAI also trained an automated grader to predict human judgments, available as an experimental service at evals.openai.com, but they note it doesn't replace human experts for now. (openai.com)
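If you want to picture how those blind judgments turn into a headline number, here is a small Python sketch that tallies 'better', 'equal', and 'worse' labels into rates. The function name `summarize_judgments` and the returned keys are assumptions made for illustration; the source doesn't specify OpenAI's exact aggregation, and the sample labels are invented.

```python
from collections import Counter

VALID_LABELS = ("better", "equal", "worse")


def summarize_judgments(judgments):
    """Turn a list of expert labels into win, tie, and win-or-tie rates.

    Each label says how the AI deliverable compared with the human one.
    """
    counts = Counter(judgments)
    unknown = set(counts) - set(VALID_LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to summarize")
    return {
        "win_rate": counts["better"] / total,
        "tie_rate": counts["equal"] / total,
        "win_or_tie_rate": (counts["better"] + counts["equal"]) / total,
    }


# Invented labels for four tasks, not real GDPval results.
sample_summary = summarize_judgments(["better", "equal", "worse", "worse"])
print(sample_summary)
```

Reporting wins and ties together is one reasonable reading of "tied or outperformed"; keeping the two rates separate as well lets you see how much of that figure comes from ties.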
Early results worth knowing
Early results show frontier models are already approaching expert quality on many tasks. In blind comparisons among models like GPT-4o, o4-mini, OpenAI o3, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4, some models tied or outperformed human deliverables on a meaningful fraction of tasks.
Claude Opus 4.1 stood out in aesthetics and formatting, while GPT-5 was strong in precision and domain knowledge. They also report measurable gains across generations: performance rose markedly from GPT-4o to GPT-5.
In raw inference and API billing terms, models can be about 100x faster and 100x cheaper than a human expert — but that doesn't include the human oversight, iteration, and integration you'll still need in real workflows. (openai.com)
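A quick back-of-the-envelope sketch shows why that caveat matters. It assumes a simple model in which the total cost of the AI route is the raw model cost plus a human oversight cost; the function `effective_cost_ratio` and the dollar amounts are hypothetical and do not come from OpenAI's report.

```python
def effective_cost_ratio(human_cost, oversight_cost, raw_ratio=100.0):
    """Return how many times cheaper the AI route is once oversight is added.

    human_cost: cost of a human expert doing the task alone.
    oversight_cost: human review and integration cost on the AI route.
    raw_ratio: raw cost advantage of the model (about 100x in the article).
    """
    if human_cost <= 0 or raw_ratio <= 0 or oversight_cost < 0:
        raise ValueError("costs must be positive and oversight non-negative")
    model_cost = human_cost / raw_ratio
    return human_cost / (model_cost + oversight_cost)


# Made-up figures: a 400-dollar task, with and without 96 dollars of review.
no_oversight = effective_cost_ratio(400.0, 0.0)
with_oversight = effective_cost_ratio(400.0, 96.0)
print(no_oversight, with_oversight)
```

With these made-up numbers the advantage falls from 100x to 4x once review is included, which is why the raw figure is best read as an upper bound.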
What does this mean for work?
Will AI replace jobs? It's not that simple. GDPval shows where models are strong: repetitive, well-defined tasks. That opens the possibility for people to spend more time on creative work and judgment. But it also highlights the need for policies, training, and job redesign so benefits are shared.
OpenAI suggests the goal should be to "keep everyone riding the elevator up" through democratic access to tools and support for workers in transition. Does that sound idealistic? Maybe — but it's a useful policy direction to discuss as these tools spread. (openai.com)
Limitations and next steps
GDPval is a first step: for now it's a one-shot evaluation and doesn't capture interactive workflows where a task is refined over multiple iterations or requires dialogue with colleagues or clients. OpenAI plans to expand occupation coverage, add interactivity, and measure more ambiguous, contextual tasks.
They also released a gold subset and the public grading service so other researchers can reproduce and extend the work. (openai.com)
If you want to go to the original source, see OpenAI's announcement about GDPval and the accompanying technical paper, which includes the data and metrics. (openai.com)
Final reflection
GDPval isn't the final answer to how the workforce will change with AI, but it is a powerful tool to move the conversation toward concrete evidence. How would you imagine integrating an AI that handles routine tasks well into your work? Would you use it to save time on the repetitive and focus on the strategic?
This evaluation helps answer those questions with data, not just assumptions.