OpenAI presents IndQA, a new benchmark designed to measure how well AI models understand questions that actually matter in India: culture, history, food and everyday life, asked in native languages. Why does this matter to you? Because most people in the world don’t speak English as their main language, and current benchmarks don’t capture those local nuances.
What IndQA is and why it exists
IndQA is a set of 2,278 questions written in 12 Indian languages and organized into 10 cultural domains. The goal isn’t to check if a model translates a sentence well, but whether it reasons and understands cultural context: can it explain a local historical reference, tell apart regional food variants, or answer about religious practices with sensitivity?
India is a logical place to start: nearly a billion people don’t use English as their primary language, the country has 22 official languages, and several of them have tens of millions of speakers. ChatGPT also has a large user base there, so improvements have real-world impact.
How IndQA was built
- 261 native experts from India participated: journalists, linguists, historians, artists, curators and more. Each question was written by specialists in their area.
- The questions cover domains such as Architecture and Design, Arts and Culture, Everyday Life, Food, History, Law and Ethics, Literature and Linguistics, Media and Entertainment, Religion, and Sports.
- Languages included: Bengali, English, Hindi, Hinglish, Kannada, Marathi, Odia, Telugu, Gujarati, Malayalam, Punjabi and Tamil. Hinglish was added explicitly because code-switching is so common.
- Each item includes: the prompt in the native language, an English translation for auditing, scoring criteria and an expert-written ideal answer.
 
Important: the questions aren’t simple multiple choice. They’re tasks with evaluation criteria, like an essay rubric, to capture nuance and reasoning.
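To make that structure concrete, here is a minimal sketch of what one IndQA item could look like, based on the fields described above; the field names and types are illustrative assumptions, not IndQA’s published schema.

```python
from dataclasses import dataclass, field

@dataclass
class IndQAItem:
    """One benchmark item with the fields the article describes.

    All names here are illustrative; IndQA's actual schema may differ.
    """
    prompt_native: str      # question written in the native language
    prompt_english: str     # English translation used for auditing
    language: str           # e.g. "Tamil" or "Hinglish"
    domain: str             # e.g. "Food" or "History"
    rubric: list[dict] = field(default_factory=list)  # weighted scoring criteria
    ideal_answer: str = ""  # expert-written reference answer
```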
Evaluation methodology
Scoring uses detailed criteria written by the experts. Each criterion carries a weight, and an evaluator model checks whether the response meets it. The final score is the points earned divided by the total possible.
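As a rough illustration of that rubric-based scoring, the sketch below assumes each criterion is a dict with a description and a weight, and that `meets_criterion` is a hypothetical callable standing in for the evaluator model’s judgement.

```python
def score_response(response: str, rubric: list[dict], meets_criterion) -> float:
    """Weighted-rubric scoring in the spirit described above.

    `meets_criterion(response, description)` is a placeholder for the
    evaluator model; it returns True if the response satisfies the criterion.
    """
    earned = sum(
        c["weight"] for c in rubric if meets_criterion(response, c["description"])
    )
    total = sum(c["weight"] for c in rubric)
    return earned / total if total else 0.0

# Illustrative rubric for a made-up question about a regional dish
rubric = [
    {"description": "Names the dish's main regional variants", "weight": 2.0},
    {"description": "Explains how preparation differs between regions", "weight": 3.0},
    {"description": "Mentions the occasion the dish is associated with", "weight": 1.0},
]
```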
Key steps:
- Questions created by native experts with peer review.
- Adversarial filtering: questions were tested against OpenAI’s strongest models at the time (for example GPT-4o, OpenAI o3, GPT-4.5 and, partially, GPT-5). Only questions that most of those models didn’t answer satisfactorily were kept, which preserves headroom to measure future progress (see the sketch after this list).
- Rubrics and ideal answers accompany each question for transparency.
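As a rough sketch of that adversarial filtering step, the function below keeps only the questions that most reference models fail; `answer_fn`, `grade_fn` and the thresholds are illustrative placeholders, not OpenAI’s actual pipeline.

```python
def adversarial_filter(questions, reference_models, answer_fn, grade_fn,
                       max_models_passing=1, pass_threshold=0.5):
    """Keep only the questions that most strong reference models fail.

    `answer_fn(model, question)` generates an answer and
    `grade_fn(question, answer)` returns a rubric score in [0, 1];
    both are hypothetical stand-ins, as are the threshold values.
    """
    kept = []
    for question in questions:
        passing = sum(
            grade_fn(question, answer_fn(model, question)) >= pass_threshold
            for model in reference_models
        )
        if passing <= max_models_passing:  # strongest models mostly failed,
            kept.append(question)          # so the question leaves headroom
    return kept
```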
 
What IndQA shows about model performance
With IndQA, OpenAI reports significant improvements in its models on Indian languages over recent years, but also acknowledges there’s still a long way to go. There’s a key caveat: since the questions were kept precisely because the strongest models failed them, the selection is adversarial and can bias comparisons between models from different teams.
So IndQA shouldn’t be taken as a direct leaderboard between languages. Its main purpose is to measure improvement within a family of models or configurations over time, and to reveal where cultural and linguistic gaps persist.
Human examples behind the benchmark
The 261 authors include diverse profiles: an award-winning Telugu actor and screenwriter, Marathi journalists, a Kannada lexicographer, an international chess grandmaster, Tamil writers and poets, Punjabi composers, Gujarati curators, Malayalam poets and professors of history and architecture specialized in regional heritage.
That range ensures the questions touch real, local matters: from variants of a regional dish to interpretations of an inscription or the meaning of an architectural tradition. Think of it like asking someone not just what a recipe is, but how it changes from one village to the next — a detail that matters in everyday life.
And now what? Impact and future
IndQA opens a practical path for researchers and developers to create similar benchmarks in other countries and languages. Questions with deep cultural context help models do more than translate — they help models understand and respond with local relevance.
If you work in AI, language or culture, consider this an invitation: building evaluations with local experts may be the best way to spot real failures and set clear improvement goals. If you use models in multilingual markets, IndQA gives you a reference point for where to start measuring quality.
It’s good news that AI teams are starting to look beyond English. There’s still a long way to go, but benchmarks like IndQA turn fuzzy problems into concrete, measurable goals.
