Imagine you ask a model: "Is coffee good for you?" How can the model give a useful reply if it doesn't know whether you're pregnant, have high blood pressure, or just need a short, practical answer? That lack of context makes many language-model evaluations unfair or even useless.
What they propose and how it works
The Allen Institute for AI (Ai2) proposes a protocol called Contextualized Evaluations: instead of throwing vague questions at models with no background, they generate pairs of follow-up questions and answers that simulate the kind of information a real user might provide in a conversation. This lets both models and evaluators work from the same scenario and criteria. (allenai.org)
To generate that context, Ai2 uses large language models with simple prompts and then validates the options with humans. In their study, most generated questions were deemed important and the alternative answers realistic, complete, and diverse. That shows you can create plausible contexts automatically and at scale. (allenai.org)
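As an illustration of the idea, here is a minimal Python sketch of how context generation could be automated with an LLM; the prompt wording, the model name, and the `generate_context` helper are assumptions for illustration, not Ai2's actual prompts or code.

```python
# Minimal sketch: generating follow-up question/answer pairs for a vague query.
# The prompt wording and model name are illustrative, not Ai2's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_context(query: str, n_questions: int = 3) -> str:
    """Ask an LLM for clarifying questions plus plausible user answers."""
    prompt = (
        f'A user asked: "{query}"\n'
        f"Write {n_questions} follow-up questions that would clarify the user's "
        "situation, and for each one give 2-3 realistic answers a user might give."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model would do here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_context("Is coffee good for you?"))
```

The generated question/answer pairs can then be validated by humans, as in the study, before being attached to the original query as its simulated context.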
How they tested it
They designed three evaluation scenarios to see the effect of context:
- Standard evaluation with no context for anyone. This is common on many leaderboards.
- Evaluation with context only shown to the evaluator, not the model, to reveal the model's implicit assumptions.
- Adaptive evaluation where model and evaluator share the same context.
They ran paired comparisons between popular models on 1,881 queries and collected judgments from human and automated evaluators. (allenai.org)
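To make the three conditions concrete, here is a small Python sketch of how the inputs for the responding model and for the judge might be assembled under each one; the function names and prompt templates are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch of the three evaluation conditions as prompt assembly for a pairwise judge.
# Condition names: "no_context", "evaluator_only", "shared".

def build_model_input(query: str, context: str, condition: str) -> str:
    """The responding model sees the context only in the 'shared' condition."""
    if condition == "shared":
        return f"{query}\n\nUser context: {context}"
    return query  # 'no_context' and 'evaluator_only': the model answers blind

def build_judge_input(query: str, context: str, answer_a: str,
                      answer_b: str, condition: str) -> str:
    """The evaluator sees the context in 'evaluator_only' and 'shared'."""
    parts = [f"Query: {query}"]
    if condition in ("evaluator_only", "shared"):
        parts.append(f"User context: {context}")
    parts.append(f"Response A: {answer_a}\n\nResponse B: {answer_b}")
    parts.append("Which response better serves this user? Answer A or B.")
    return "\n\n".join(parts)
```

The key design point is the asymmetry: the "evaluator_only" condition keeps the model blind while letting the judge see the context, which is what exposes the model's implicit assumptions about its default user.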
What they found and why it matters
The results have clear practical consequences:
- Greater agreement between evaluators and changes in ranking. Adding context increased agreement among judges by 3–10 percentage points, and in some cases the winning model changed when both had context. That suggests current leaderboards might be mismeasuring a model's adaptability. (allenai.org)
- More substantive judgments. With context, evaluators focus on the substance of the answer (relevance, correctness, fit to needs) instead of style or form. Isn't that exactly what you want to know when you integrate an assistant into a real product? (allenai.org)
- Default-answer bias. By revealing context only to the evaluator, Ai2 shows that default responses tend to favor WEIRD contexts (Western, Educated, Industrialized, Rich, Democratic). In other words, without explicit instructions, models often align better with Western, higher-income users, which raises equity risks in real applications. (allenai.org)
And what can you do with this?
- If you're a researcher or model developer: add synthetic contexts to your benchmarks to measure adaptability and fairness, not just average accuracy with no information.
- If you build products: test how useful the model is when you give it concrete user data, and design flows that ask key clarifying questions before taking action (see the sketch after this list).
- If you're a user or product manager: question "default" results and demand evaluations that consider different user profiles, to avoid decisions that only work well for some people.
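For the product-flow point above, here is a minimal sketch of a clarify-before-answering loop; the required slots (`health_conditions`, `goal`, `preferred_detail_level`) are hypothetical placeholders you would replace with whatever context actually matters in your domain.

```python
# Sketch of a clarify-before-answering flow; slot names are illustrative.

REQUIRED_SLOTS = ["health_conditions", "goal", "preferred_detail_level"]

def next_step(user_query: str, known_context: dict) -> str:
    """Ask one clarifying question per missing slot before answering."""
    missing = [slot for slot in REQUIRED_SLOTS if slot not in known_context]
    if missing:
        return f"Ask a clarifying question about: {missing[0]}"
    return f"Answer '{user_query}' using context: {known_context}"

print(next_step("Is coffee good for you?", {}))
print(next_step("Is coffee good for you?", {
    "health_conditions": "none",
    "goal": "general wellness",
    "preferred_detail_level": "short",
}))
```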
Resources to dive deeper
You can read the original paper, check the code, and download the data used in the evaluation: paper on arXiv, code on GitHub, and dataset on Hugging Face. These materials make it easy to replicate the study and adapt the method to your own question sets. (allenai.org)
Asking better questions and giving context sounds obvious, but in practice it radically changes which models look "better." Wouldn't you rather judge an AI by how it responds when it knows who you are and what you need? That's the core idea of contextualized evaluations: make tests more human, useful, and fair.