JetBrains launches Mellum2: MoE 12B for code and text

Jun 1, 2026 · Keryc Díaz · 3 minutes

Today JetBrains presents Mellum2, a 12-billion-parameter Mixture-of-Experts (MoE) model trained from scratch on text and code. It’s designed for high-frequency, low-latency tasks: routing, RAG, summarization, agent subtasks and IDE autocomplete. The advantage? It activates only 2.5B parameters per token, which keeps per-token compute, and therefore serving cost, low in production.

What is Mellum2

Mellum2 is an open model under the Apache 2.0 license, intended as a specialized component in multi-model systems. It aims to be not the largest model on the market but an efficient option for latency-sensitive workloads. It was trained from scratch on natural language and code, with both quality and inference performance as goals.

Mixture-of-Experts architecture and why it matters

Mellum2 uses a Mixture-of-Experts architecture. What does that really mean? Instead of running the whole model for each token, the MoE mechanism selects a subset of "experts" per token via a gating network. The result: high total model capacity (12B parameters) and less work per token (2.5B active), which lowers latency and serving cost.
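To make the gating idea concrete, here is a minimal sketch of top-k routing for a single token. It is a generic illustration, not Mellum2's actual router: the number of experts, `top_k=2`, the toy scaling "experts" and the function names are all assumptions chosen for readability.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights.

    gate_logits: one score per expert, as a gating network would produce.
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

def moe_layer(token, gate_logits, experts, top_k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](token) for i, w in route_token(gate_logits, top_k))

# Toy experts: each one just scales its input (hypothetical stand-ins for FFN blocks).
experts = [lambda x, s=s: x * s for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 1.5]  # made-up gate scores for one token

routing = route_token(gate_logits, top_k=2)
output = moe_layer(10.0, gate_logits, experts, top_k=2)
print(routing)
print(output)
```

Only two of the four toy experts are evaluated for this token, which is the source of the per-token savings described above.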

This brings a concrete benefit: more capacity to capture complex patterns in code and language without paying the cost of running all weights on every token. But there are operational challenges too: balancing load across experts, runtime support for MoE kernels, and memory, since every expert's parameters must stay loaded even though only a few run per token. JetBrains documents these choices in its technical report.

Important: the theoretical efficiency holds in practice only if your infrastructure supports MoE efficiently. Without those optimizations, the gain can shrink.
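A quick back-of-the-envelope calculation shows where the theoretical gain comes from and what it does not cover. The parameter counts are the ones quoted in this article; the 16-bit weight format and the "about 2 FLOPs per active parameter per token" rule of thumb are assumptions, and the estimate ignores attention over long contexts and routing overhead.

```python
TOTAL_PARAMS = 12e9     # total parameters, as stated in the article
ACTIVE_PARAMS = 2.5e9   # parameters activated per token, as stated in the article
BYTES_PER_PARAM = 2     # assumption: 16-bit weights (bf16/fp16)

# Share of the model that does work on any given token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Rule of thumb (assumption): a forward pass costs ~2 FLOPs per active parameter.
flops_per_token_moe = 2 * ACTIVE_PARAMS
flops_per_token_dense = 2 * TOTAL_PARAMS  # hypothetical dense model of the same size

# Every expert must stay resident, so weight memory tracks TOTAL parameters.
weight_memory_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9

print(f"active fraction: {active_fraction:.1%}")
print(f"FLOPs/token, MoE vs same-size dense: {flops_per_token_moe:.1e} vs {flops_per_token_dense:.1e}")
print(f"weight memory at 16-bit: {weight_memory_gb:.0f} GB")
```

Compute per token scales with the active parameters, while memory scales with the total, which is why the runtime and hardware matter so much for the realized gain.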

Performance and benchmarks

According to the technical report, Mellum2 is competitive with similarly sized models on code generation, reasoning, science and math benchmarks, and delivers more than 2x the inference speed of its peers. That makes it appealing when you need high throughput: RAG pipelines that send many chunks, intermediate agent validations and real-time autocompletion.

If you want details on architecture, training setup, metrics and evaluation methodology, check the technical report on arXiv: https://arxiv.org/pdf/2605.31268

Practical use cases

Routing and orchestration: prompt classification, tool selection and control flow between models.
RAG and retrieval: context compression, postprocessing results and quick summaries before calling a larger model.
Sub-agents and planning: intermediate steps like validation, transformation and context prep without invoking expensive models.
High-frequency code features: autocomplete, suggestions and fast refactorings inside an IDE.
Private deployment: since it’s open and efficient, you can host it on your own infra with sensitive data or proprietary code.

Do you work inside an IDE or build RAG pipelines? Mellum2 is designed precisely to cut latency and cost at those hot spots.
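The routing use case can be sketched as a small dispatcher: a fast model labels each prompt, and the label decides which handler (or larger model) answers. This is a minimal sketch under stated assumptions: `stub_classify` is a hypothetical stand-in for a call to a small model, and the labels and handlers are invented for illustration.

```python
from typing import Callable, Dict

def make_router(classify: Callable[[str], str],
                handlers: Dict[str, Callable[[str], str]],
                fallback: str) -> Callable[[str], str]:
    """Build a dispatcher: the classifier labels the prompt, a handler answers it."""
    def dispatch(prompt: str) -> str:
        label = classify(prompt)
        # Unknown labels fall back to a default handler instead of failing.
        handler = handlers.get(label, handlers[fallback])
        return handler(prompt)
    return dispatch

# Hypothetical stand-in for a cheap call to a small, fast model.
def stub_classify(prompt: str) -> str:
    return "code" if "def " in prompt or "class " in prompt else "chat"

# Hypothetical handlers; in a real stack these would call actual models.
handlers = {
    "code": lambda p: "small-model: complete code",
    "chat": lambda p: "large-model: answer question",
}

router = make_router(stub_classify, handlers, fallback="chat")
print(router("def add(a, b):"))
print(router("Explain what a gating network does"))
```

Because the classifier runs on every request, its latency and cost dominate the routing step, which is the spot where a small, fast model is meant to fit.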

Technical considerations when deploying Mellum2

Infrastructure: to exploit MoE gains you need runtimes with support for gating, load balancing and efficient device communication.
Latency vs cost: activating only 2.5B parameters reduces FLOPs per token, but routing adds overhead. Test and measure in your stack.
Integration in multi-model stacks: use it as a lightweight router or as an intermediate stage. It doesn’t necessarily replace large models for deep reasoning.
Monitoring: watch tail latencies, expert utilization and queue latency to avoid hotspots.
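Two of the monitoring signals above, tail latency and expert utilization, can be computed from plain logs. The sketch below assumes you can export per-request latencies and the expert index chosen for each routed token; the sample values are made up and the function names are hypothetical.

```python
import math
from collections import Counter

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def expert_imbalance(expert_ids, num_experts):
    """Busiest expert's load divided by the ideal even share (1.0 = balanced)."""
    counts = Counter(expert_ids)
    ideal = len(expert_ids) / num_experts
    return max(counts.values()) / ideal

# Made-up sample data: request latencies in ms and routed expert indices.
latencies_ms = [12, 14, 13, 15, 11, 13, 90, 12, 14, 13]
expert_ids = [0, 0, 0, 0, 1, 2, 3, 0]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
imbalance = expert_imbalance(expert_ids, num_experts=4)
print(p50, p99, imbalance)
```

A median that looks healthy next to a much larger p99, or an imbalance well above 1.0, points to the hotspots the list above warns about.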

How to try it

Download the model and artifacts from the collection on Hugging Face: https://huggingface.co/collections/JetBrains/mellum-2

The technical report contains the full training recipe, benchmarks and methodology: https://arxiv.org/pdf/2605.31268

Final reflection

Mellum2 doesn’t try to be a Swiss Army knife for everything. Its strength is being a specialized model: fast, compact at inference and open for private deployments. If your system makes many quick calls (routing, preprocessing, validations or autocompletes), trying Mellum2 may reduce latency and cost without sacrificing quality.

Original source

https://huggingface.co/blog/JetBrains/mellum2-launch
