QIMMA قمّة arrives with a simple but powerful proposal: validate the quality of benchmarks before you evaluate models. Sounds obvious, right? It is, but in practice many leaderboards run models on datasets without checking whether the questions make sense in Arabic, whether the gold answers are correct, or whether a translation changed the original intent.
What QIMMA is and why it matters
QIMMA is a leaderboard designed for Arabic that prioritizes quality. It doesn’t just gather benchmarks as-is. First it validates each sample; then it evaluates models. The result is a unified suite of 109 subsets, more than 52,000 samples, and coverage across seven domains: cultural, STEM, legal, medical, safety, poetry, and code.
Why does this change the game? Because Arabic is spoken by over 400 million people in many dialects and cultural contexts. If you use translated or unchecked data, scores can reflect benchmark artifacts, not the model’s true ability.
Methodology: validate before evaluating
The core of QIMMA is its staged validation pipeline. Every sample is reviewed by two LLMs with strong Arabic capability: Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B. Each judge scores the sample against a rubric of ten binary criteria (0 or 1 each), so a sample's score from each judge ranges from 0 to 10.
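To make the rubric arithmetic concrete, here is a minimal Python sketch of how one judge's ten binary checks turn into a score out of 10. The names `JudgeVerdict` and `score_sample` are hypothetical, and the check values are made up for illustration. How the two judges' scores are combined into a keep-or-discard decision is not covered above, so the sketch simply reports them side by side.

```python
from dataclasses import dataclass

NUM_CRITERIA = 10  # the rubric has ten binary criteria, per the description above


@dataclass(frozen=True)
class JudgeVerdict:
    """One judge's rubric result for one sample (hypothetical structure)."""
    judge: str
    checks: tuple  # ten values, each 0 (criterion failed) or 1 (criterion met)

    def __post_init__(self):
        if len(self.checks) != NUM_CRITERIA:
            raise ValueError(f"expected {NUM_CRITERIA} checks, got {len(self.checks)}")
        if any(c not in (0, 1) for c in self.checks):
            raise ValueError("every check must be 0 or 1")

    @property
    def score(self) -> int:
        # Binary criteria simply add up, so the score is between 0 and 10.
        return sum(self.checks)


def score_sample(verdicts):
    """Return each judge's total score for a sample, keyed by judge name."""
    return {v.judge: v.score for v in verdicts}


# Illustrative values only: not real QIMMA judgments.
verdicts = [
    JudgeVerdict("Qwen3-235B-A22B-Instruct", (1, 1, 1, 1, 1, 1, 1, 1, 0, 1)),
    JudgeVerdict("DeepSeek-V3-671B", (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)),
]
print(score_sample(verdicts))
```

Keeping each criterion as a separate 0 or 1, rather than storing only the total, makes it possible to see which specific check a sample failed.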
