Arabic is an extremely diverse language: Modern Standard Arabic coexists with regional dialects that differ in vocabulary, syntax and cultural connotations. What happens when a model trained mostly on MSA faces a casual Emirati conversation? To answer that, Alyah الياه was created: a benchmark focused on the Emirati dialect that measures not only lexical correctness, but also cultural, pragmatic and figurative understanding.
What is Alyah
Alyah (which means North Star ⭐️ in Emirati) is an evaluation suite designed to test LLMs’ competence in the Emirati dialect. It’s not a bank of formal sentences: these are expressions, greetings, proverbs, short poetry and anecdotes collected from native speakers.
The final dataset contains 1,173 samples, all manually curated by Emirati speakers to ensure cultural and linguistic authenticity. Each example is a multiple-choice question with four alternatives (exactly one correct). The distractors were synthetically generated by LLMs and then human-reviewed to ensure plausibility.
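To make the format concrete, here is a minimal sketch of what one Alyah-style item could look like as a Python record, including the answer-position randomization mentioned below. The field names (question, options, answer_index, category) and the example text are assumptions for illustration, not the benchmark's actual schema.

```python
import random

# Hypothetical Alyah-style item: field names and content are illustrative
# assumptions, not the dataset's actual schema.
sample = {
    "category": "Greetings & Daily Expressions",
    "question": "What is the appropriate reply to this Emirati greeting?",
    "options": [           # exactly four candidates, one correct
        "Correct, culturally appropriate reply",
        "Plausible but wrong distractor 1",
        "Plausible but wrong distractor 2",
        "Plausible but wrong distractor 3",
    ],
    "answer_index": 0,     # index of the correct option before shuffling
}

def shuffle_options(item: dict, seed: int | None = None) -> dict:
    """Randomize the position of the correct answer to avoid positional bias."""
    rng = random.Random(seed)
    order = list(range(len(item["options"])))
    rng.shuffle(order)
    return {
        **item,
        "options": [item["options"][i] for i in order],
        "answer_index": order.index(item["answer_index"]),
    }

print(shuffle_options(sample, seed=42))
```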
How the benchmark was built
- Manual collection by native speakers to capture expressions that are poorly documented in written text.
- Format: multiple-choice question with 4 candidates; the correct answer’s position is randomized to avoid positional bias.
- Evaluation based on semantic correctness and pragmatic appropriateness for Emirati usage, not on literal match to a reference.
Distribution by category:
| Category | Number of Samples | Difficulty |
|---|---|---|
| Greetings & Daily Expressions | 61 | Easy |
| Religious & Social Sensitivity | 78 | Medium |
| Imagery & Figurative Meaning | 121 | Medium |
| Etiquette & Values | 173 | Medium |
| Poetry & Creative Expression | 32 | Difficult |
| Historical & Heritage Knowledge | 89 | Difficult |
| Language & Dialect | 619 | Difficult |
This composition lets you evaluate everything from superficial conversational fluency to deep cultural understanding and dialectal phenomena that are hard to learn from formal text alone.
Which models were evaluated and key results
The authors evaluated dozens of contemporary models: native Arabic families like Jais and ALLaM, multilingual models with good Arabic support like Qwen and LLaMA, and regional adaptations such as Fanar and AceGPT.
Note on counts: the report mentions 54 models in one section (23 base + 31 instruct) and 53 models in another (22 base + 31 instruct). That seems like an inconsistency in the original report.
Top models (base, according to the table):
- google/gemma-3-27b-pt: 74.68
- tiiuae/Falcon-H1-34B-Base: 73.66
- FreedomIntelligence/AceGPT-v2-32B: 67.35
Top models (instruction-tuned, according to the table):
- falcon-h1-arabic-7b-instruct: 82.18
- humain-ai/ALLaM-7B-Instruct-preview: 77.24
- google/gemma-3-27b-it: 74.68
- falcon-h1-arabic-3b-instruct: 74.51
- Qwen/Qwen2.5-72B-Instruct: 74.6
Metrics: the main measure was accuracy on multiple-choice questions. The authors also present per-category analyses and radar charts by model family to compare strengths.
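As a rough illustration of this setup, the snippet below computes plain multiple-choice accuracy plus a per-category breakdown. It assumes the illustrative item schema sketched earlier and hypothetical model predictions (a chosen option index per item); it is not the official Alyah harness.

```python
from collections import defaultdict

def mcq_accuracy(items: list[dict], predictions: list[int]) -> dict:
    """Overall and per-category accuracy for multiple-choice items.

    `items` follow the illustrative schema above (category, answer_index);
    `predictions` are the option indices chosen by the model.
    """
    correct = 0
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for item, pred in zip(items, predictions):
        hit = int(pred == item["answer_index"])
        correct += hit
        per_cat[item["category"]][0] += hit
        per_cat[item["category"]][1] += 1
    return {
        "overall": correct / len(items),
        "per_category": {c: ok / n for c, (ok, n) in per_cat.items()},
    }

# Example: two items, one answered correctly.
items = [
    {"category": "Etiquette & Values", "answer_index": 2},
    {"category": "Poetry & Creative Expression", "answer_index": 0},
]
print(mcq_accuracy(items, predictions=[2, 3]))
# {'overall': 0.5, 'per_category': {'Etiquette & Values': 1.0, 'Poetry & Creative Expression': 0.0}}
```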
Trends and technical findings
- Instruction tuning improves performance. Models fine-tuned for instructions tend to outperform their base counterparts, especially on questions about conversational norms and culturally appropriate replies (for example, the Etiquette & Values category).
- The hardest categories were Language & Dialect and Greetings & Daily Expressions. Why? Because the Emirati dialect is used mainly orally and appears little in written corpora; models see little signal during their pretraining.
- Strong multilingual models show degradation on Alyah’s toughest questions, suggesting that general multilingual training doesn’t replace specific exposure to the dialect.
- Uneven performance: a model can excel in figurative language but fail in poetry or heritage knowledge. That indicates dialectal competence is multidimensional and not well captured by a single metric.
- The best results appear in large, Arabic instruction-tuned models (for example, variants of Jais and ALLaM), highlighting the value of adapting and aligning models with regional data.
Practical recommendations for developers and researchers
- Collect spoken data and transcripts: the dialect lives in oral use. If you want an LLM to understand greetings and nuances, you need transcribed audio and natural dialogue.
- Fine-tuning and instruction-tuning with dialectal supervision greatly improve performance in pragmatic categories. Even small models benefit notably.
- Use semantic and pragmatic evaluation, not just n-gram overlap. In dialects, multiple formulations can be valid; the metric should reflect that (see the sketch after this list).
- Consider RAG (retrieval-augmented generation) pipelines with local knowledge bases for questions about heritage and cultural history.
- Keep humans in the loop to generate and review distractors, labels and usage examples; Alyah shows that cultural authenticity requires human curation.
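For the open-ended side of evaluation (outside the multiple-choice format), one common alternative to n-gram overlap is embedding similarity against several valid references. The sketch below uses sentence-transformers with a multilingual encoder as one possible choice; the model name and the 0.7 threshold are assumptions for illustration, not something prescribed by Alyah.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# A multilingual encoder chosen for illustration; any Arabic-capable
# embedding model could be substituted here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantically_acceptable(prediction: str, references: list[str],
                            threshold: float = 0.7) -> bool:
    """Accept a prediction if it is close in meaning to any valid reference,
    instead of requiring exact or n-gram overlap with a single gold answer."""
    pred_emb = model.encode(prediction, convert_to_tensor=True)
    ref_embs = model.encode(references, convert_to_tensor=True)
    scores = util.cos_sim(pred_emb, ref_embs)  # shape: (1, len(references))
    return bool(scores.max() >= threshold)
```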
Limitations and next steps
- Dialectal coverage: Alyah focuses on the Emirates; the Arabic-speaking world has other dialects that pose equally hard challenges. This benchmark is a step, not the complete solution.
- Intrinsic ambiguity: some idiomatic expressions admit multiple interpretations, which complicates annotation and automatic evaluation.
- Size and representation: 1,173 examples is solid for diagnosis, but expanding the samples and sources (more speakers, more contexts) will strengthen reliability and diversity.
Final reflection
Alyah puts the spotlight where many previous evaluations didn’t: the oral dimension, culture and pragmatics of the Emirati dialect. If you work with models for users in the region, this benchmark is not just a scoreboard; it’s a guide on where to invest in data, tuning and human validation.
The key takeaway? Understanding a dialect is both linguistic and cultural. LLMs can get closer, but they need data and evaluation designed for that territory. Alyah is a useful compass for that journey.
