QIMMA (قمّة) arrives with a simple but powerful proposal: validate the quality of benchmarks before you evaluate models. Sounds obvious, right? It is — but in practice many leaderboards run models on datasets without checking whether the questions make sense in Arabic, whether the golden answers are correct, or whether a translation changed the original intent.
What QIMMA is and why it matters
QIMMA is a leaderboard designed for Arabic that prioritizes quality. It doesn’t just gather benchmarks as-is. First it validates each sample; then it evaluates models. The result is a unified suite of 109 subsets, more than 52,000 samples, and coverage across seven domains: cultural, STEM, legal, medical, safety, poetry, and code.
Why does this change the game? Because Arabic is spoken by over 400 million people in many dialects and cultural contexts. If you use translated or unchecked data, scores can reflect benchmark artifacts, not the model’s true ability.
Methodology: validate before evaluating
The core of QIMMA is its staged validation pipeline. Every sample goes through two LLMs with strong Arabic capability: Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B. Both score each sample using a 10-point rubric made of binary criteria (0 or 1) that add up to 10.
- Elimination threshold: if either model gives less than 7/10, the sample is flagged. If both models flag it, the sample is discarded outright. If only one flags it, the sample goes to human review.
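The two-judge triage described above reduces to a small routing rule. Here is a minimal sketch of that rule; the function and parameter names are illustrative, not QIMMA's actual code:

```python
def triage(score_a: int, score_b: int, threshold: int = 7) -> str:
    """Route a sample given two judge scores on the 0-10 rubric.

    A judge flags a sample when its score falls below the threshold.
    Two flags discard the sample, one flag sends it to human review,
    zero flags keep it.
    """
    flags = sum(score < threshold for score in (score_a, score_b))
    if flags == 2:
        return "discard"
    if flags == 1:
        return "human_review"
    return "keep"
```

Note that a score of exactly 7 does not flag the sample, matching the "less than 7/10" rule in the text.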
Human review and dialect sensitivity
Flagged cases are reviewed by native annotators with dialectal and cultural knowledge. Here they judge:
- cultural context and regional variation
- dialectal nuances
- subjective interpretations
- subtle issues automatic evaluation misses
For sensitive content they seek multiple perspectives, because in practice “correctness” can change across Arab regions.
What problems they found in the benchmarks
The findings were consistent: many benchmarks, even respected ones, show systematic errors. Common problems include literal translations that change intent, incorrect golden answers, encoding errors, and poor annotation consistency.
- QIMMA consolidated 109 subsets from 14 sources and found variable discard rates. For example, ArabicMMLU discarded 436 of 14,163 samples (3.1%).
- For code tasks they didn’t discard prompts; instead they refined them. The Arabic adaptations of HumanEval+ and MBPP+ were modified in 88% and 81% of cases respectively.
The code prompt modifications included:
- normalization to a natural standard Arabic
- clarification of instructions and constraints
- correction of structural errors in examples
- preservation of the original problem intent
How QIMMA evaluates: metrics and tools
QIMMA uses LightEval, EvalPlus and FannOrFlop to maintain reproducibility and consistency. The main metric mappings are:
- MCQ: Normalized Log-Likelihood Accuracy
- Multi-select MCQ: cumulative probability over the correct options
- Generative QA: F1 and BERTScore (AraBERT v02)
- Code: Pass@1
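To make the MCQ metric concrete: normalized log-likelihood accuracy scores each answer option by its log-likelihood divided by its length, then counts the sample as correct when the top-scoring option is the gold one. A sketch under an invented per-sample schema (the field names here are assumptions for illustration, not QIMMA's format):

```python
def nll_accuracy(samples: list[dict]) -> float:
    """Accuracy under length-normalized log-likelihood option selection.

    Each sample is a dict with:
      "logliks": per-option log-likelihoods from the model,
      "lengths": per-option token counts used for normalization,
      "gold":    index of the correct option.
    """
    correct = 0
    for s in samples:
        # Normalize each option's log-likelihood by its token length.
        normed = [ll / max(n, 1) for ll, n in zip(s["logliks"], s["lengths"])]
        # Predict the option with the highest normalized score.
        pred = max(range(len(normed)), key=normed.__getitem__)
        correct += pred == s["gold"]
    return correct / len(samples)
```

Length normalization matters because longer options accumulate more negative log-likelihood mass and would otherwise be systematically penalized.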
Additionally, QIMMA standardizes prompts by format and keeps the original system prompts for specific benchmarks like MizanQA and ArabCulture.
Important: QIMMA publishes per-sample outputs, not just aggregates. That makes auditing, reproducibility and failure analysis much easier.
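Per-sample outputs make this kind of audit straightforward. A sketch of one such analysis, assuming a hypothetical JSONL schema with a domain label and a correctness flag on each record:

```python
import json
from collections import defaultdict

def per_domain_accuracy(jsonl_lines: list[str]) -> dict[str, float]:
    """Aggregate per-sample records into per-domain accuracy.

    Each line is a JSON object like {"domain": "STEM", "correct": true}
    (schema invented for this example). Returns domain -> accuracy.
    """
    totals = defaultdict(lambda: [0, 0])  # domain -> [correct, total]
    for line in jsonl_lines:
        rec = json.loads(line)
        totals[rec["domain"]][0] += rec["correct"]
        totals[rec["domain"]][1] += 1
    return {d: c / n for d, (c, n) in totals.items()}
```

With aggregates only, a drop in a model's overall score is opaque; with per-sample records, the same three lines of analysis localize it to a domain or even a single subset.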
Results: who leads and what it means
They evaluated 46 open-source models, from ~1B to 400B parameters. Key observations:
- Jais-2-70B-Chat tops the overall ranking (65.81) and dominates Cultural, STEM, Legal and Safety. It shows that Arabic-focused training brings clear gains.
- Qwen2.5-72B-Instruct was very close (65.75), showing large multilingual models remain competitive in Arabic.
- Llama-3.3-70B-Instruct dominates in Medical, and Qwen3.5-27B stands out in Coding.
- Arabic-specialized models often beat similarly sized multilingual models in several domains, but code favors multilingual models.
In practice this suggests two things: Arabic-focused training pays off for cultural understanding and answer correctness, while strong code generation may require instruction data in multiple languages or more varied examples.
Technical and research recommendations
- Validate data before you evaluate. This is the central lesson: without quality control, metrics can deceive.
- Publish per-sample outputs and evaluation scripts for reproducibility.
- Treat code evaluation as a special subtype: fixing prompts instead of deleting samples preserves comparability with international benchmarks.
- Incorporate dialectal reviews and multiple cultural perspectives for sensitive tasks.
If you work with models in Arabic, QIMMA gives you a reproducible framework and a fairer comparison baseline. You can use its approach to audit your own datasets or to design new benchmarks while avoiding common traps.
Final reflection
QIMMA isn’t just another leaderboard; it’s a methodological wake-up call. Validate first, evaluate after. Publish per-sample outputs. Respect the diversity of Arabic. With measures like these, model comparisons stop being noise and become useful information for researchers, developers and users.
