Hugging Face integrates Every Eval Ever into AI model pages | Keryc
EEE (Every Eval Ever) was born in February 2026 as an inter-institutional effort to fix a very prosaic problem: model evaluation results are everywhere and in every format. The effect? Comparisons are hard, traceability is poor, and trust erodes when the same benchmark returns different numbers depending on who ran it. Sound familiar?
What EEE is and why it matters
EEE proposes a solution that’s both simple and technical: a single JSON schema for each evaluation result that records essential data. What does it store exactly? Who ran the evaluation, which model was evaluated, how the model was accessed, the generation configuration, what the metric means and, optionally, a JSONL file with per-sample outputs.
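To make the idea concrete, here is a minimal sketch of what one such record might look like, written as a Python dictionary and serialized to JSON. Only `evaluation_name`, `evaluation_timestamp`, `source_data.hf_repo` and `score_details.score` are field names that appear in this article; every other key, and all of the values, are hypothetical placeholders and not the actual EEE schema.

```python
import json

# Illustrative record only. Keys marked "hypothetical" are NOT taken from the
# EEE schema; they stand in for the kinds of information described above.
record = {
    "evaluation_name": "example-benchmark",            # placeholder value
    "evaluation_timestamp": "2026-01-01T00:00:00Z",    # placeholder value
    "source_data": {"hf_repo": "example-org/example-benchmark"},
    "score_details": {"score": 0.5},
    "evaluator": "example-lab",                         # hypothetical: who ran it
    "model": "example-org/example-model",               # hypothetical: which model
    "access_method": "api",                             # hypothetical: how it was accessed
    "generation_config": {"temperature": 0.0},          # hypothetical: generation settings
    "metric_description": "fraction of correct answers",  # hypothetical: what the metric means
    "samples_file": "samples.jsonl",                    # hypothetical: optional per-sample outputs
}

# One record per evaluation result, stored as a single JSON document.
serialized = json.dumps(record, indent=2, sort_keys=True)
print(serialized)
```

For the real field names and required keys, consult the schema documentation linked later in the article.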
That changes how reporting works. Instead of scores scattered across papers, harness logs, leaderboards and posts, everything ends up in the same structure. Since its launch, EEE has fed a datastore on Hugging Face with ~229,000 evaluation results covering over 22,000 models and 2,200 benchmarks, extracted from 31 different formats. Reproducing just those runs from scratch would cost hundreds of thousands of dollars, so preserving them in traceable form is also cost-efficient.
How the integration with Hugging Face Community Evals works
Hugging Face launched Community Evals to decentralize publishing scores on the Hub. The integration with EEE joins two goals: visibility and readability.
Benchmarks are registered as dataset repositories that include an eval.yaml. Those repositories feed leaderboards that aggregate all scores reported against that benchmark.
In each model repository, scores are stored in .eval_results/*.yaml and appear on the model card. Each entry carries a badge that indicates whether the score comes from the author, the community, or is verified.
When you submit a result to both Community Evals and EEE, your score appears on the model page and on the leaderboard, and it also includes a badge that links to the full EEE record, where the generation config, harness version, reproducibility notes and instance-level data are stored. Visible and readable results at the same time.
The converter: what it does and how it transforms your records
To avoid maintaining two formats manually, a converter was developed that takes your EEE records and generates the small YAML files that Hugging Face expects. The main mapping is straightforward:
source_data.hf_repo -> dataset.id
evaluation_name -> task_id
score_details.score -> value
evaluation_timestamp -> date
The URL of the object in the EEE datastore is placed in source.url
Example of an input that ends up on the model card:
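As an illustration of the mapping listed above, here is a minimal sketch in Python. The function name, the flat dotted-key output and the sample values are assumptions made for this example; it is not the converter's actual code, and the real YAML may nest these fields differently.

```python
from typing import Any


def to_community_eval_entry(record: dict[str, Any], record_url: str) -> dict[str, Any]:
    """Apply the article's field mapping to one EEE-style record.

    Hypothetical helper: the output uses the target names exactly as the
    article writes them, as flat keys, without assuming how they nest.
    """
    return {
        "dataset.id": record["source_data"]["hf_repo"],
        "task_id": record["evaluation_name"],
        "value": record["score_details"]["score"],
        "date": record["evaluation_timestamp"],
        "source.url": record_url,  # URL of the object in the EEE datastore
    }


# Placeholder input; none of these values come from a real evaluation.
example_record = {
    "source_data": {"hf_repo": "example-org/example-benchmark"},
    "evaluation_name": "example-task",
    "score_details": {"score": 0.5},
    "evaluation_timestamp": "2026-01-01T00:00:00Z",
}

entry = to_community_eval_entry(example_record, "https://example.invalid/eee/record.json")
for key, value in entry.items():
    print(f"{key}: {value}")
```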
The converter does more than rewrite fields. When you point it at a datastore collection, the flow is:
Downloads the collection and the referenced objects and validates hashes.
Detects which scores map to supported benchmarks (today: MMLU-Pro, GPQA, HLE and GSM8K).
Audits the model repo on the Hub by reading the .eval_results files on the main branch and in open PRs.
Classifies each potential entry into states: already_present, score_conflict, missing_hf_model or ready.
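The classification step can be pictured with a small sketch. The four state names come from the list above; the decision rules, the function signature and the tolerance used to compare scores are assumptions for illustration, since the article does not spell out how the converter decides.

```python
from typing import Optional


def classify_entry(
    model_exists: bool,
    existing_score: Optional[float],
    new_score: float,
    tolerance: float = 1e-9,
) -> str:
    """Return one of the four states for a candidate entry.

    Assumed rules (not confirmed by the article):
    - no model repo on the Hub        -> missing_hf_model
    - no score recorded yet           -> ready
    - same score already recorded     -> already_present
    - a different score is recorded   -> score_conflict
    """
    if not model_exists:
        return "missing_hf_model"
    if existing_score is None:
        return "ready"
    if abs(existing_score - new_score) <= tolerance:
        return "already_present"
    return "score_conflict"


print(classify_entry(True, None, 0.5))
print(classify_entry(True, 0.5, 0.5))
print(classify_entry(True, 0.4, 0.5))
print(classify_entry(False, None, 0.5))
```

Checking for a missing model first keeps the other three states meaningful: they only apply when there is a repo to audit.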
Nothing is pushed without your confirmation. The converter writes local YAML previews, generates a review file and a report. It only opens PRs when you type OPEN PRS and leave a commit message. If you run the same process again, it reuses the cache unless you pass --force.
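A confirmation gate of this kind can be sketched in a few lines. This is a hypothetical helper, not the converter's code: the exact-match comparison and the whitespace stripping are assumptions, and the only detail taken from the article is that PRs open after you type OPEN PRS and provide a commit message.

```python
def should_open_prs(typed_confirmation: str, commit_message: str) -> bool:
    """Return True only for an explicit confirmation plus a non-empty message.

    Hypothetical gate: anything other than the exact phrase is treated as
    "do nothing", so the safe outcome is the default.
    """
    confirmed = typed_confirmation.strip() == "OPEN PRS"
    has_message = bool(commit_message.strip())
    return confirmed and has_message


print(should_open_prs("OPEN PRS", "Add MMLU-Pro results from EEE"))
print(should_open_prs("yes", "Add MMLU-Pro results from EEE"))
print(should_open_prs("OPEN PRS", "   "))
```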
Trust status and verification
A key point for researchers and policy leads: if you submit results through your organization’s official Hugging Face account, those entries appear with a verified mark in EvalEval. That mark helps distinguish between numbers published by the source and scores added by third parties.
Also, each score on the model page carries the minimal metadata in YAML and a direct link to the EEE JSON. That means anyone who wants to reproduce or inspect the evaluation has access to the full structured record, not just the number.
How to use the converter today
The GitHub repository contains the code, examples and the contribution guide. To process a collection you can use the CLI like this:
uv run tools/hf-community-evals/community_evals_converter.py MMLU-Pro --datastore evaleval/EEE_datastore@main
Review the previews and the report; when everything looks OK, type OPEN PRS to create the pull requests. The full documentation of the schema, CLI and converters is at evalevalai.com/every_eval_ever/hf-community-evals.
Practical impact and technical recommendations
If you work with evaluations: upload your full records to EEE and use the converter to expose them on Hugging Face. Why? Because you get traceability, visibility on the Hub and a public backup that makes audits and comparisons easier.
If you’re a model author: review the PRs in your repo; you can accept, close or hide results. Keep .eval_results up to date and document your harness and generation configuration to avoid score discrepancies.
If you’re a policymaker or analyst: there’s now a verifiable route to trace a number back to its original execution. That reduces noise and improves interpretation of safety and performance metrics.
EEE and Community Evals don’t reinvent evaluation; they organize it. With an integration designed for collaboration, traceability and auditability move from being a best practice to something reproducible and automatable.