Today JetBrains presents Mellum2, a 12-billion-parameter Mixture-of-Experts (MoE) model trained from scratch on text and code. It’s designed for high-frequency, low-latency tasks: routing, RAG, summaries, agent subtasks and IDE autocomplete features. What's the advantage? It activates only 2.5B of those parameters per token, which keeps it cost-effective in production.
What is Mellum2
Mellum2 is an open model under the Apache 2.0 license, intended as a focal component in multi-model systems. It doesn't aim to be the largest model on the market; it aims to be an efficient, specialized option for latency-sensitive workloads. It was trained from scratch on natural language and code data to optimize both quality and inference performance.
Mixture-of-Experts architecture and why it matters
Mellum2 uses a Mixture-of-Experts architecture. What does that really mean? Instead of running the whole model for each token, the MoE mechanism selects a subset of "experts" per token via a gating network. The result: high total model capacity (12B parameters) and less work per token (2.5B active), which lowers latency and serving cost.
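To make the gating idea concrete, here is a minimal sketch of top-k routing for a single token, using only the Python standard library. It illustrates the general MoE technique, not Mellum2's actual router: the number of experts, the value of `top_k`, the toy experts and the function names are all assumptions chosen for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights.

    gate_logits: one score per expert, as a gating network would produce.
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(x, gate_logits, experts, top_k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in route_token(gate_logits, top_k))

# Toy experts (hypothetical stand-ins for feed-forward blocks): each scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

routes = route_token([0.1, 2.0, -1.0, 1.5], top_k=2)
print([i for i, _ in routes])  # [1, 3]
```

Only two of the four toy experts are evaluated for this token, which is the source of the per-token savings described above.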
This brings concrete benefits: more capacity to capture complex patterns in code and language without paying the cost of running all weights every time. But there are operational challenges too: balancing load between experts, runtime support for MoE kernels, and memory, since the weights of all experts typically have to stay loaded even though only a few are used per token. JetBrains documents these choices in its technical report.
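A quick back-of-the-envelope calculation shows the trade-off between compute and memory. The parameter counts come from the article; the 2 bytes per parameter (16-bit weights) and the rule of thumb of roughly 2 FLOPs per active parameter per token are assumptions, and the function names are illustrative.

```python
TOTAL_PARAMS = 12e9    # total parameters, from the article
ACTIVE_PARAMS = 2.5e9  # parameters active per token, from the article

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for weights alone, in GB (10^9 bytes); ignores KV cache and activations."""
    return num_params * bytes_per_param / 1e9

def active_fraction(active, total):
    """Share of the model's parameters used for a single token."""
    return active / total

def approx_forward_flops(num_params):
    """Rule-of-thumb estimate (an assumption): about 2 FLOPs per parameter per token."""
    return 2 * num_params

# Assumption: 16-bit weights, so 2 bytes per parameter.
print(weight_memory_gb(TOTAL_PARAMS, 2))                       # 24.0
print(round(active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS), 3))  # 0.208
print(approx_forward_flops(ACTIVE_PARAMS) / approx_forward_flops(TOTAL_PARAMS))
```

Under these assumptions, per-token compute scales with the 2.5B active parameters, while weight memory still scales with all 12B.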
Important: the theoretical efficiency holds in practice only if your infrastructure supports MoE efficiently. Without those optimizations, the gain can shrink.
Performance and benchmarks
According to the technical report, Mellum2 is competitive with similarly sized models on code generation, reasoning, science and math benchmarks, and runs inference more than 2x faster than those peers. That makes it appealing when you need high throughput: RAG pipelines that send many chunks, intermediate agent validations and real-time autocompletion.
If you want details on architecture, training setup, metrics and evaluation methodology, check the technical report on arXiv: https://arxiv.org/pdf/2605.31268
Practical use cases
- Routing and orchestration: prompt classification, tool selection and control flow between models.
- RAG and retrieval: context compression, postprocessing results and quick summaries before calling a larger model.
- Sub-agents and planning: intermediate steps like validation, transformation and context prep without invoking expensive models.
- High-frequency code features: autocomplete, suggestions and fast refactorings inside an IDE.
- Private deployment: since it’s open and efficient, you can host it on your own infra with sensitive data or proprietary code.
Do you work inside an IDE or build RAG pipelines? Mellum2 is designed precisely to cut latency and cost at those hot spots.
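The routing and RAG use cases above can be sketched as a two-stage pipeline in which a small, fast model classifies the prompt and compresses retrieved chunks, and only hard prompts reach a larger model. This is a sketch under stated assumptions: `make_pipeline`, the prompt wording, the SIMPLE/COMPLEX labels and the stub models are all hypothetical, and in a real system `small` and `large` would wrap whatever inference clients you use.

```python
from typing import Callable

Model = Callable[[str], str]  # any function that maps a prompt to a completion

def make_pipeline(small: Model, large: Model, max_context_chars: int = 2000):
    """Build a pipeline where the small model routes and compresses, and the large one handles hard prompts."""
    def run(question: str, chunks: list[str]) -> str:
        # Step 1: routing. Ask the small model for a one-word label.
        label = small(f"Classify as SIMPLE or COMPLEX:\n{question}").strip().upper()
        # Step 2: context compression. Summarize each retrieved chunk cheaply.
        summaries = [small(f"Summarize for the question '{question}':\n{c}") for c in chunks]
        context = "\n".join(summaries)[:max_context_chars]
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        # Step 3: dispatch. Only COMPLEX prompts reach the expensive model.
        return large(prompt) if label == "COMPLEX" else small(prompt)
    return run

# Stub models so the sketch runs without any inference backend.
def fake_small(prompt: str) -> str:
    if prompt.startswith("Classify"):
        return "COMPLEX" if "prove" in prompt.lower() else "SIMPLE"
    if prompt.startswith("Summarize"):
        return prompt.splitlines()[-1][:40]
    return "small-answer"

def fake_large(prompt: str) -> str:
    return "large-answer"

pipeline = make_pipeline(fake_small, fake_large)
print(pipeline("What is a list?", ["Lists are ordered collections."]))
print(pipeline("Prove that merge sort is O(n log n).", ["Merge sort splits the input in half."]))
```

The first call stays entirely on the small model; the second is escalated to the large one after the small model has already prepared the context.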
Technical considerations when deploying Mellum2
- Infrastructure: to exploit MoE gains you need runtimes with support for gating, load balancing and efficient device communication.
- Latency vs cost: activating only 2.5B parameters reduces FLOPs per token, but routing adds overhead. Test and measure in your stack.
- Integration in multi-model stacks: use it as a lightweight router or as an intermediate stage. It doesn’t necessarily replace large models for deep reasoning.
- Monitoring: watch tail latencies, expert utilization and queue latency to avoid hotspots.
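The monitoring point above can be sketched with two small helpers: a nearest-rank percentile for tail latency and an imbalance ratio for expert utilization. The sample data is made up for illustration, and how you obtain per-token expert assignments depends on what your serving runtime exposes, which is an assumption here.

```python
import math
from collections import Counter

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def expert_imbalance(assignments, num_experts):
    """Busiest expert's load divided by the ideal uniform load (1.0 means perfectly balanced)."""
    counts = Counter(assignments)
    ideal = len(assignments) / num_experts
    return max(counts.values()) / ideal

# Illustrative request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 11, 13, 12, 14, 11, 95, 12, 13, 12]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))  # 12 95

# Illustrative expert index chosen for each of 8 tokens, across 4 experts.
print(expert_imbalance([0, 1, 1, 2, 1, 3, 1, 0], num_experts=4))  # 2.0
```

A median of 12 ms next to a p99 of 95 ms is the kind of gap that averages hide, and an imbalance of 2.0 means one expert receives twice its fair share of tokens.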
How to try it
Download the model and artifacts from the collection on Hugging Face: https://huggingface.co/collections/JetBrains/mellum-2
The technical report contains the full training recipe, benchmarks and methodology: https://arxiv.org/pdf/2605.31268
Final reflection
Mellum2 doesn’t try to be a Swiss Army knife. Its strength is being a specialized model: fast, compact at inference and open for private deployments. If your system makes many quick calls (routing, preprocessing, validations or autocompletes), trying Mellum2 can reduce latency and costs without sacrificing quality.
