Granite 4.1: architecture, training and benchmarks
Granite 4.1 is IBM's new family of dense LLMs (3B, 8B and 30B) trained on ~15T tokens with a five-stage pretraining pipeline and context extension up to 512K tokens. The interesting part: a dense 8B model matches or outperforms a 32B MoE on many benchmarks, and everything is released under Apache 2.0.
What is Granite 4.1 and why it matters
What is this advancement good for? Granite 4.1 shows that training quality and data strategy can compensate for raw parameter count. Instead of simply scaling the model up, the team prioritized progressively curated data mixes, rigorous supervised fine-tuning and a staged RL pipeline.
This matters if you are an engineer looking for efficient models for production, an entrepreneur who wants to deploy tool-enabled assistants, or a researcher studying alternatives to expensive MoE models.
Design and architecture (technical summary)
Granite 4.1 uses a dense decoder-only transformer with these key design decisions:
Grouped Query Attention (GQA)
Rotary Position Embeddings (RoPE)
SwiGLU activation
RMSNorm normalization
Shared input/output embeddings
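Two of these components are worth seeing in code. Here is a minimal PyTorch sketch of RMSNorm and a SwiGLU MLP in their standard reference form (this is not Granite's actual implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # Normalizes by the root-mean-square of the features; cheaper than
    # LayerNorm because it skips the mean subtraction and has no bias.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    # Gated MLP: SiLU(gate(x)) * up(x), projected back down. Three weight
    # matrices instead of the two in a classic GELU MLP.
    def __init__(self, d_model: int, d_mlp: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_mlp, bias=False)
        self.up = nn.Linear(d_model, d_mlp, bias=False)
        self.down = nn.Linear(d_mlp, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))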
Main dimensions by variant:
3B: 40 layers, embedding 2560, MLP 8192
8B: 40 layers, embedding 4096, MLP 12800
30B: 64 layers, embedding 4096, MLP 32768
All variants share the same training pipeline and data strategy; only internal dimensions change.
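A quick sanity check: plugging those dimensions into a back-of-the-envelope parameter count roughly reproduces the 3B/8B/30B names. The factor of 3 SwiGLU matrices per MLP and the ~2.25·d² attention approximation (Q and O full-size, K/V shrunk by GQA) are my assumptions, and shared embeddings are ignored:

def approx_params(n_layers: int, d_model: int, d_mlp: int) -> float:
    attn = 2.25 * d_model ** 2   # Q + O projections plus a reduced GQA K/V block (assumed ratio)
    mlp = 3 * d_model * d_mlp    # gate, up and down matrices of the SwiGLU MLP
    return n_layers * (attn + mlp)

for name, dims in {"3B": (40, 2560, 8192),
                   "8B": (40, 4096, 12800),
                   "30B": (64, 4096, 32768)}.items():
    print(f"{name}: ~{approx_params(*dims) / 1e9:.1f}B non-embedding params")

This prints roughly 3.1B, 7.8B and 28.2B, consistent with the variant names once the shared input/output embeddings are added back.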
Five-stage pretraining pipeline
They trained from scratch on ~15T tokens using five phases:
Phase 1 - Foundational: broad data mix (CommonCrawl dominant), power learning-rate scheduler with warmup.
Phase 3 - Initial mid-training: a higher share of high-quality data, long reasoning chains and synthetic instruction data.
Balanced data: CommonCrawl-HQ, Math, Code, Long Chain-of-Thought, Language & Code Instructions, etc.
Phase 4 - Final mid-training: linear LR decay toward zero and focus on highest-quality data.
Data: CommonCrawl-HQ ~40%, Code ~20%, Math ~20%, instructions and CoT present.
Phase 5 - Long-context extension (LCE): gradual extension of context from 4K up to 512K.
Stages: 32K, 128K and 512K. For 512K (8B and 30B) the mix was ~80% books + 20% code repositories.
After each LCE stage, a model merge is applied to preserve short-context performance, as sketched below.
Goal: handle very long sequences without sacrificing quality on short context windows.
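The article does not say how the merge is performed; a common technique is linear weight interpolation between the checkpoints before and after extension. A minimal sketch under that assumption (alpha and the function name are illustrative):

import torch

def merge_checkpoints(pre_lce: dict, post_lce: dict, alpha: float = 0.5) -> dict:
    # Linear interpolation of matching weight tensors. alpha=1.0 keeps only the
    # long-context model; smaller values pull back toward the short-context one.
    return {name: torch.lerp(pre_lce[name], post_lce[name], alpha) for name in post_lce}

# Hypothetical usage:
# merged = merge_checkpoints(base.state_dict(), extended.state_dict(), alpha=0.5)
# model.load_state_dict(merged)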
Supervised fine-tuning and LLM-as-Judge
SFT was done on ~4.1M curated samples. To ensure quality they applied an automated judge (LLM-as-Judge) that evaluates only the assistant's responses across multiple dimensions: instruction following, correctness, completeness, conciseness, naturalness and calibration.
They also implemented deterministic rules for normalization, schema validation, leak detection and global deduplication. The pipeline labels samples as accept/borderline/reject, and uses hard-reject for severe defects (hallucinations, false premises, incorrect calculations).
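A minimal sketch of what such a gate could look like. Only the judged dimensions, the accept/borderline/reject labels and the hard-reject defects come from the article; the scoring scale, thresholds and function names are illustrative assumptions:

DIMENSIONS = ["instruction_following", "correctness", "completeness",
              "conciseness", "naturalness", "calibration"]
HARD_REJECT = {"hallucination", "false_premise", "incorrect_calculation"}

def label_sample(judge_scores: dict[str, float], defects: set[str]) -> str:
    # Severe defects bypass scoring entirely (hard-reject).
    if defects & HARD_REJECT:
        return "reject"
    # Otherwise average the per-dimension judge scores (assumed 0-1 scale).
    mean = sum(judge_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    if mean >= 0.8:
        return "accept"
    return "borderline" if mean >= 0.6 else "reject"

# Example: a fluent but miscalculated answer is rejected outright.
print(label_sample({d: 0.9 for d in DIMENSIONS}, {"incorrect_calculation"}))  # reject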
The same SFT configuration applies to all three model sizes.
Staged reinforcement learning
Instead of a single RL pass, Granite 4.1 applies multiple focused stages:
Algorithm: on-policy GRPO with DAPO loss (dynamic sampling disabled during training to reduce cost)
Stack: SkyRL
Samples per prompt: 16
Train batch size: 1024
Context length in RL: 8192
The RL stages, in order:
Multi-domain RL: avoids forgetting by training on a diverse mix (math, science, logic, instruction following, Text2SQL, chat, etc.). Effective LR: 5e-7, KL coefficient beta = 0.05.
RLHF / multicultural chat: improves helpfulness and chat quality using a multilingual reward model; raises Alpaca-Eval by ~18.9 points on average.
Identity & knowledge calibration: a short run (~40 steps) to improve self-description and calibration.
Math RL: recovers the drop on math benchmarks caused by the earlier stages and ends above the previous scores.
Learning rates and KL coefficients are adjusted conservatively between stages to avoid policy drift. A sketch of the group-relative advantage computation all stages share follows below.
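The reward source and the DAPO token-level loss on top of the advantages are omitted; the group normalization itself, using the configuration above (16 samples per prompt, so a batch of 1024 spans 64 prompt groups), looks like this:

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size) scalar rewards, one per sampled response.
    # GRPO needs no value network: each reward is normalized against the mean
    # and std of its own group of samples for the same prompt.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.rand(64, 16)           # placeholder rewards from a verifier or reward model
advantages = grpo_advantages(rewards)  # fed into the (DAPO) policy-gradient loss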
Performance and key benchmarks
Granite 4.1 scales predictably with size, and the dense 8B holds its own against the previous-generation 32B-A9B MoE.
Some highlighted results (summary):
RULER long-context scores (8B base): 83.6 at 32K, 79.1 at 64K, 73.0 at 128K.
Practical takeaway: the dense 8B competes with much larger models on many reasoning, code and alignment tasks.
Quantization, deployment and quick example
They released fp8 quantized variants optimized for vLLM, cutting the disk and GPU memory footprint by ~50%. Quantization is applied with LLM Compressor only to the weights and activations of linear operators, keeping all other layers in their original precision.
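The article does not include the recipe, but to the best of my knowledge of the LLM Compressor API, FP8 quantization of linear operators only looks roughly like this (model name and output path are placeholders):

from transformers import AutoModelForCausalLM
from llmcompressor import oneshot  # older versions expose this as llmcompressor.transformers.oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-4.1-30b", torch_dtype="auto")

# FP8 for weights and activations of Linear layers only; lm_head (and any
# non-linear layers) stay in their original precision, as described above.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("granite-4.1-30b-fp8")  # checkpoint ready to serve with vLLM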
Minimal example to load the instruct 30B model (adapted):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_path = "ibm-granite/granite-4.1-30b"  # adjust to the exact checkpoint name on Hugging Face

# Load the tokenizer and place the model weights on the GPU
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

# Tool definition (tool-calling): a JSON schema for a function the model may call
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a specified city.",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
    }
}]

# Render the conversation with the chat template, passing the tool schemas
chat = [{"role": "user", "content": "What's the weather like in London right now?"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, tools=tools, add_generation_prompt=True)

# Tokenize and generate; the model can answer with a structured tool call
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output = model.generate(**input_tokens, max_new_tokens=100)
print(tokenizer.batch_decode(output)[0])
This flow shows how Granite 4.1 integrates tool-calling practically for production assistants.
Infrastructure and license
Training ran on an NVIDIA GB200 NVL72 cluster (CoreWeave), with intra-rack NVLink domains and NDR 400 Gb/s InfiniBand between racks. Training at this scale demands high bandwidth and tight synchronization to push 15T+ tokens through the cluster.
Granite 4.1 is released under Apache 2.0, which eases adoption in research and companies.
When to use Granite 4.1?
If you need an open-source model effective for production with latency and cost constraints, the dense 8B is a powerful option.
If you work with very long documents, the context extension up to 512K is a real differentiator.
If you care about answer quality, the SFT + LLM-as-Judge pipeline and RL stages show a serious commitment to safety and calibration.
Granite 4.1 is not magic: it's data engineering, well-designed training stages and pragmatic architectural choices. Want a model that works in practice and is easy to deploy? This is a clear example that the right mix of data and stages can beat pure parameter scaling.