AiPhreaks

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

By Jakub Antkiewicz

April 22, 2026

New Leaderboard Audits Arabic AI Benchmarks, Reshuffles Model Rankings

A new initiative named QIMMA is addressing a critical flaw in Arabic artificial intelligence by systematically validating evaluation benchmarks before ranking large language models. The project, detailed in a paper from researchers at the Technology Innovation Institute (tiiuae) and other institutions, argues that many existing Arabic benchmarks contain systematic errors, from translation artifacts to cultural biases, which corrupt evaluation scores. By applying a rigorous quality-first pipeline, QIMMA offers a more reliable assessment of how well models truly comprehend and generate Arabic.

A Rigorous, Quality-First Methodology

The core of QIMMA is its multi-stage validation process. Before any evaluation, each of the 52,000+ samples from 14 source benchmarks is scored by two different state-of-the-art LLMs against a 10-point quality rubric. Samples flagged by the models are then passed to native Arabic-speaking human reviewers for a final decision. This process uncovered significant issues, leading to the modification of over 80% of problem statements in code benchmarks and the outright discarding of hundreds of samples in question-answering datasets like ArabicMMLU.

  • Comprehensive Suite: Consolidates 109 subsets spanning 7 domains, including cultural, legal, medical, and coding tasks.
  • Native Content: 99% of the evaluation suite is composed of native Arabic content, minimizing issues from English-to-Arabic translation.
  • Code Evaluation: The first Arabic leaderboard to integrate code generation tasks, using refined Arabic versions of HumanEval+ and MBPP+.
  • Full Transparency: Provides public, per-sample inference outputs, allowing for complete reproducibility and auditing of results.
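The validate-then-evaluate pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual implementation: the threshold value, the `Sample` structure, and the judge and reviewer callbacks are all hypothetical stand-ins for the paper's two LLM scorers and native-speaker review stage.

```python
from dataclasses import dataclass

QUALITY_THRESHOLD = 7  # hypothetical cutoff on the 10-point rubric


@dataclass
class Sample:
    text: str
    scores: tuple  # rubric scores from the two independent LLM judges


def needs_human_review(sample: Sample) -> bool:
    """Flag a sample if either judge scores it below the threshold."""
    return min(sample.scores) < QUALITY_THRESHOLD


def triage(samples, human_verdict):
    """Route samples: clean ones pass through; flagged ones go to a
    human reviewer who keeps, modifies, or discards them."""
    kept, modified, discarded = [], [], []
    for s in samples:
        if not needs_human_review(s):
            kept.append(s)
        else:
            verdict = human_verdict(s)  # one of: 'keep', 'modify', 'discard'
            {"keep": kept, "modify": modified, "discard": discarded}[verdict].append(s)
    return kept, modified, discarded
```

Only after this triage would the surviving samples be used for model evaluation, which is what keeps dataset flaws from leaking into leaderboard scores.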

Initial Rankings Reveal a Competitive Landscape

The cleaned-up evaluation reveals a nuanced performance landscape. The top spot is currently held by Qwen/Qwen3.5-397B, followed closely by Applied-Innovation-Center/Karnak and inceptionai/Jais-2-70B-Chat. The results demonstrate that model size is not the sole determinant of performance, with several mid-size models outperforming larger competitors. Notably, Arabic-specialized models like Jais-2 show superior performance on cultural and linguistic benchmarks, while multilingual models from Qwen maintain a strong lead in the newly introduced coding evaluations, highlighting a key area for improvement for regional models.

The QIMMA leaderboard's 'validate-then-evaluate' methodology serves as a critical blueprint for the broader AI industry. It proves that without rigorous, upfront auditing of benchmark data, LLM leaderboards risk rewarding models for exploiting dataset flaws rather than for demonstrating genuine capability, a systemic issue that extends far beyond the Arabic language ecosystem.