
Granite 4.1 LLMs: How They’re Built

By Jakub Antkiewicz

2026-04-30T10:07:05Z

IBM Releases Open-Source Granite 4.1 LLMs with 15T Token Training Regimen

IBM has released its Granite 4.1 family of dense, decoder-only large language models, making the 3B, 8B, and 30B parameter models available under the permissive Apache 2.0 license. The release is significant not just for its open nature, but for the detailed methodology IBM disclosed, which includes a five-phase pre-training pipeline using approximately 15 trillion tokens. Notably, the 8B instruct model reportedly achieves performance on par with or exceeding the company's previous 32B Mixture-of-Experts (MoE) model, signaling a strong focus on training efficiency and data quality over raw parameter scale.

A Five-Phase Approach to Pre-Training and Long Context

The foundation of Granite 4.1 is a meticulously structured five-phase pre-training process that progressively refines the data mixture. The initial phases focus on broad language understanding using web-scale data, while later phases pivot to higher-quality, domain-specific sources, including curated math, code, and synthetic instruction data. The final phase extends the context window up to 512K tokens for the 8B and 30B models. The models all share a common architectural foundation, differing only in scale.
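A phased curriculum like this is typically driven by a mixture schedule that maps cumulative token counts to data-source weights. The sketch below is purely illustrative: the phase names, token budgets, and mixture weights are assumptions for demonstration, not IBM's published recipe; only the five-phase structure and the ~15T total come from the article.

```python
# Hypothetical five-phase data-mixture schedule. All names, budgets, and
# weights are illustrative assumptions; only the phase count and the
# ~15T-token total follow the article.
PHASES = [
    {"name": "web_general",    "tokens_bn": 6000, "mix": {"web": 0.85, "code": 0.10, "math": 0.05}},
    {"name": "quality_web",    "tokens_bn": 4000, "mix": {"web": 0.70, "code": 0.20, "math": 0.10}},
    {"name": "domain_focus",   "tokens_bn": 3000, "mix": {"web": 0.40, "code": 0.35, "math": 0.25}},
    {"name": "synthetic_inst", "tokens_bn": 1500, "mix": {"curated": 0.60, "code": 0.25, "math": 0.15}},
    {"name": "long_context",   "tokens_bn": 500,  "mix": {"long_docs": 0.80, "curated": 0.20}},
]

# Sanity check: budgets sum to 15,000 billion tokens, i.e. ~15T.
assert sum(p["tokens_bn"] for p in PHASES) == 15000


def phase_for_token(tokens_seen_bn: float) -> str:
    """Return the name of the phase a given cumulative token count falls in."""
    cumulative = 0.0
    for phase in PHASES:
        cumulative += phase["tokens_bn"]
        if tokens_seen_bn < cumulative:
            return phase["name"]
    return PHASES[-1]["name"]
```

The key property of such a schedule is that later phases shrink in token count while rising in per-token quality, so the model's final gradient updates come from the most carefully curated data.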

  • Architecture: Decoder-only dense transformer
  • Attention: Grouped Query Attention (GQA)
  • Activations: SwiGLU
  • Embeddings: Rotary Position Embeddings (RoPE) and shared input/output embeddings
  • Normalization: RMSNorm
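The components above can be sketched in a few lines of numpy each. These are minimal reference implementations of the generic techniques (RoPE, GQA score computation, SwiGLU, RMSNorm), not Granite's actual code; head counts and shapes are illustrative.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotary Position Embeddings: rotate dimension pairs of x (seq, head_dim)
    by position-dependent angles. Position 0 is left unchanged."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-np.arange(half) / half)            # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def gqa_scores(q: np.ndarray, k: np.ndarray, n_q_heads: int, n_kv_heads: int) -> np.ndarray:
    """Grouped Query Attention scores: each KV head serves a group of query
    heads, shrinking the KV cache. q: (n_q_heads, seq, d), k: (n_kv_heads, seq, d)."""
    group = n_q_heads // n_kv_heads
    k_rep = np.repeat(k, group, axis=0)  # share each KV head across its query group
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

def swiglu(x: np.ndarray, W: np.ndarray, V: np.ndarray) -> np.ndarray:
    """SwiGLU feed-forward gate: SiLU(xW) elementwise-multiplied by xV."""
    a = x @ W
    return (a / (1.0 + np.exp(-a))) * (x @ V)

def rmsnorm(x: np.ndarray, g: float = 1.0, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: rescale by root-mean-square instead of mean/variance (no centering)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps) * g
```

GQA's appeal at 512K-token contexts is direct: the KV cache scales with the number of KV heads, so serving memory drops by the ratio of query heads to KV heads relative to standard multi-head attention.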

Post-Training Refinement Through LLM-as-Judge and Sequential RL

Following pre-training, the models undergo a rigorous supervised fine-tuning (SFT) stage on ~4.1 million samples curated by an LLM-as-Judge framework designed to enforce high standards for correctness and instruction following. IBM then applies a four-stage reinforcement learning (RL) pipeline using on-policy GRPO with DAPO loss. This sequential process systematically enhances capabilities in distinct domains—starting with a multi-domain mixture before targeting chat (RLHF), self-identification, and finally recovering mathematical reasoning, a capability often diminished during chat-focused tuning. This structured approach aims to maximize performance across multiple domains while minimizing the catastrophic forgetting often seen with monolithic RL training.
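GRPO's defining trick is that it needs no learned value network: for each prompt, a group of responses is sampled and each reward is normalized against the group's own mean and standard deviation. A minimal sketch of that advantage computation, assuming scalar rewards per response (the stage names mirror the article's description; the implementation details are generic GRPO, not IBM's):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages as in GRPO: rewards for responses sampled
    from the SAME prompt are normalized by the group mean and std, replacing
    the per-token value estimates a PPO critic would provide."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# The four sequential stages described in the article; each stage runs RL on
# its own reward/domain before handing the policy to the next stage.
RL_STAGES = ["multi_domain", "chat_rlhf", "self_identification", "math_recovery"]
```

Because advantages sum to roughly zero within each group, above-average responses are reinforced exactly as much as below-average ones are suppressed, which keeps updates stable across the four stages without a separate critic to retrain per domain.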

IBM's detailed disclosure of its five-phase pre-training and sequential reinforcement learning pipeline for Granite 4.1 reflects a broader shift in open-source model development, where data curation and methodical, multi-phase skill enhancement are becoming stronger competitive differentiators than raw parameter count.