Mellum 2 Technical Report Summary
Summary (Overview)
- Efficient MoE Architecture: Mellum 2 is an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token (64 experts, 8 active). It is designed for efficient inference, matching the latency of a Qwen2.5-7B dense model while offering a larger parameter envelope.
- Specialized for Software Engineering: As a successor to the dense Mellum-4B, Mellum 2 is a general-purpose language model specialized for software engineering tasks, including code generation/editing, debugging, reasoning, tool use, and conversational assistance.
- Innovative Training Pipeline: The model was pre-trained on ~10.6 trillion tokens using a three-phase curriculum that progressively shifts from web data to curated code/math. It employs the Muon optimizer with FP8 hybrid precision, a Warmup-Hold-Decay schedule, and features a Multi-Token Prediction head for improved performance and speculative decoding.
- Extended Context & Post-Training: The base model's context was extended to 128K using layer-selective YaRN. Two post-trained variants were released: an Instruct model for direct answers and a Thinking model that emits explicit reasoning traces, both refined via reinforcement learning with verifiable rewards (RLVR).
- Competitive Performance: Evaluations show Mellum 2 is competitive with open-weight baselines in the 4B–14B parameter range on code, math, reasoning, and tool-use benchmarks, while operating at the compute cost of a 2.5B dense model.
Introduction and Theoretical Foundation
Large language models (LLMs) have evolved from simple code autocomplete to comprehensive coding assistants capable of code generation, editing, debugging, multi-step reasoning, tool use, and agentic workflows. The open-weights landscape features a trade-off: dense 4–14B models are cheap to serve but plateau on harder tasks, while very large MoE models offer frontier quality at high deployment costs.
Mellum 2 aims to strike a balance, extending the line of small MoE coding models. It is designed as a practical, deployable coding assistant for JetBrains IDEs, succeeding the completion-focused Mellum-4B. The core motivation is to achieve a wide knowledge scope (via a large total parameter count) with low serving cost (via sparse activation), targeting the inference budget of a Qwen2.5-7B model.
Methodology
Model Architecture
Mellum 2 is a decoder-only Transformer based on the Qwen3-MoE recipe, with key efficiency-oriented modifications:
- Backbone: 28 layers, hidden dimension 2,304, with pre-RMSNorm () and SiLU-gated MLPs.
- Attention: Grouped-Query Attention (GQA) with 32 query heads and 4 KV heads (head dimension 128), QK-Norm, and RoPE with base .
- Sliding Window Attention (SWA): Applied to 3 out of every 4 layers with a window size of 1,024 tokens; the remaining layer uses full attention.
- Mixture-of-Experts: 64 experts per layer with 8 active per token (top-8 routing), expert intermediate size 896, no shared expert. Uses dropless routing.
- Multi-Token Prediction (MTP): A single additional transformer layer trained with loss weight . It serves as an auxiliary pre-training objective and a built-in draft model for speculative decoding, and is removed during evaluation.
- Vocabulary: 98,304 tokens, untied embeddings. Native context length is 8,192 (extended to 131,072).
Table 2 — Architecture configuration of Mellum 2.
| Scale | |
|---|---|
| Total parameters | ~12B |
| Active parameters | ~2.5B |
| Vocabulary size | 98,304 |
| Context length | 8,192 / 131,072 ★ |
| Backbone | |
| Layers | 28 |
| Hidden dimension | 2,304 |
| Activation | SiLU (gated) |
| Normalization | RMSNorm () |
| Position encoding | RoPE () |
| Attention | |
| Query heads | 32 |
| KV heads (GQA) | 4 |
| Head dimension | 128 |
| QK-Norm | Yes (RMSNorm) |
| Sliding window | 1,024 (3:1 SWA) |
| Mixture-of-Experts & MTP | |
| Experts (total) | 64 |
| Experts (active) | 8 (top-8) |
| Expert MLP size | 896 |
| Shared expert | None |
| MTP layers | 1 () |
Pre-Training Data and Curriculum
The pre-training corpus of ~10.6T tokens comprises web/general knowledge, source code, and mathematical content. Training follows a three-phase "web early, curated late" curriculum:
Table 3 — Three-phase pre-training curriculum.
| Phase | Tokens (T) | % Total | LR State | Web % | Code % | Math % |
|---|---|---|---|---|---|---|
| 1: Foundation | 6.18 | 58.0 | Warmup → Hold | 70 | 23 | 6 |
| 2: Quality Uplift | 2.79 | 26.2 | Hold | 44 | 42 | 14 |
| 3: Capability Sharpening | 1.69 | 15.9 | Decay | 23 | 59 | 18 |
| Total | 10.65 | 100.0 |
- High-quality data is repeated up to 4x across phases.
- A Fill-in-the-Middle (FIM) objective is used, with its application rate varying by phase (50% in Phase 1, 10% in Phase 2, 50% on code-only in Phase 3).
Training Setup
- Optimizer: Distributed Muon (Moonlight configuration with extra scale factor 0.2), with Adam .
- Schedule: Warmup-Hold-Decay (WHD) with a peak LR of , warming up over 2,000 steps, holding, then decaying linearly to zero over Phase 3.
- Precision: BF16 base with FP8 hybrid mixed-precision training (tensorwise recipe).
- Batch Size: Global batch size ramps from 2,048 to 4,096 sequences (8,192 tokens each).
- MoE Training: Uses a global auxiliary load-balancing loss (coefficient ) and a router z-loss ().
Long Context Extension
The pre-trained base model's context was extended from 8,192 to 131,072 tokens using a layer-selective YaRN method. YaRN frequency re-mapping is applied only to the global (full-attention) layers, leaving the SWA layers unchanged. This approach outperformed uniform scaling methods.
Post-Training
Two variants were created from the same long-context base checkpoint via supervised fine-tuning (SFT):
- Instruct: Answers directly. Loss computed on all assistant turns.
- Thinking: Emits an explicit reasoning trace before the final answer. Loss computed only on the final assistant turn (with its reasoning). Multi-turn conversations are unfolded for training.
Both variants were further refined via Reinforcement Learning with Verifiable Rewards (RLVR), using a variant of the GRPO algorithm. Key adaptations include:
- Asynchronous rollouts with a decoupled verification stack.
- Per-token IcePop truncation to handle train-vs-inference disparity in MoE router log-probabilities.
- Reward shaping with a soft overlong penalty and a concision penalty to suppress unnecessary inline reasoning in the Instruct variant.
The loss function incorporates these modifications:
Empirical Validation / Results
Architectural Ablations
- MTP Impact: Ablation on a 14B MoE model showed MTP provided significant benchmark improvements with only 7% extra training time.
Table 1 — Benchmark comparison between baseline and MTP models (14B MoE, 105B tokens).
Benchmark Metric Baseline + MTP HumanEval pass@1 20.73 31.10 +10.37 MMLU Accuracy 37.49 41.06 +3.57 MMLU-Pro Exact match 19.07 22.32 +3.25 GSM8K Exact match 30.63 33.59 +2.96 - Inference Efficiency: The final architecture matches the single-request (sync) latency of Qwen2.5-7B (192 vs. 193 tokens/s) and provides 21% higher sustained throughput under concurrent loads.
Pre-Training Evaluation
The base Mellum 2 model (2.5B active) was evaluated against 4B-7B dense models.
Table 5 — Pre-training evaluation results (selected highlights).
| Benchmark | Mellum 2 (2.5B/12B) | OLMo-3-7B | Qwen2.5-7B | Qwen3-4B |
|---|---|---|---|---|
| Code Generation | ||||
| HumanEval | 41.5 | 45.1 | 55.5 | 57.3 |
| MBPP | 62.4 | 50.6 | 63.6 | 67.0 |
| Knowledge & Reasoning | ||||
| MMLU | 70.9 | 62.1 | 71.8 | 71.1 |
| MMLU-Pro | 59.3 | 34.5 | 48.6 | 51.5 |
| BBH | 74.9 | 63.6 | 69.0 | 71.3 |
| Math & Science | ||||
| GSM8K | 81.7 | 73.5 | 81.9 | 82.0 |
Key findings: Mellum 2 is competitive with or outperforms 7B dense models on reasoning-heavy benchmarks (MMLU-Pro, BBH) and code (MBPP), despite fewer active parameters.
Post-Training Evaluation
Post-trained Instruct and Thinking variants were evaluated against open-weight models in the 4B–14B range.
Table 9 — Post-training evaluation, instruct (no-thinking) variants (selected).
| Benchmark | Mellum 2 (SFT) | Mellum 2 (RL) | Qwen3.5-9B | Ministral-3-14B |
|---|---|---|---|---|
| Coding | ||||
| LiveCodeBench v6 | 30.9 | 37.2 | 63.7 | 42.4 |
| EvalPlus | 76.2 | 78.4 | 71.8 | 74.1 |
| Tool Use | ||||
| BFCL v3 | 43.1 | 66.3 | 70.5 | 52.7 |
| Math | ||||
| AIME | 29.9 | 41.7 | 58.3 | 33.3 |
| Knowledge | ||||
| MMLU-Redux | 77.4 | 78.1 | 91.1 | 85.9 |
Table 10 — Post-training evaluation, thinking /reasoning variants (selected).
| Benchmark | Mellum 2 (SFT) | Mellum 2 (RL) | Qwen3.5-9B | OLMo-3-7B |
|---|---|---|---|---|
| Coding | ||||
| LiveCodeBench v6 | 75.1 | 69.9 | 68.3 | 59.8 |
| Math | ||||
| AIME | 20.0 | 58.4 | 73.4 | 61.7 |
- Coding: Mellum 2 leads on function-level synthesis (EvalPlus). The Thinking variant shows exceptional performance on LiveCodeBench (75.1), indicating algorithmic reasoning is unlocked with an explicit reasoning budget.
- Tool Use & Math: RL provides significant jumps. Mellum 2 is competitive on tool use (BFCL) and math (GSM-Plus ~87-90%).
- Knowledge: A deliberate weakness. Mellum 2 lags behind models like Qwen3.5-9B on broad knowledge benchmarks (MMLU-Redux, GPQA), reflecting its code-focused training mix.
- Safety: The SFT variant is very safe (HarmBench 8.4%), but RL regresses this somewhat (23.1%), a known alignment tax.
Theoretical and Practical Implications
- Efficiency-Aware Design: Mellum 2 demonstrates that a small MoE architecture, designed with inference cost as a primary constraint, can achieve a favorable quality-cost trade-off, making powerful coding assistants more deployable.
- Curriculum and Data Strategy: The three-phase "web early, curated late" curriculum and strategic data repetition (particularly for MoE training) are validated as effective methods for building capability in specialized domains.
- Specialization vs. Generality: The model's profile—strong in code and reasoning, weaker in broad knowledge—highlights the trade-offs involved in specializing a general-purpose model. It provides a blueprint for creating efficient, domain-specialized assistants.
- RL for Specialized Models: The successful application of RLVR on verifiable code and math tasks shows that reinforcement learning can be effectively used to refine models for deterministic domains without the need for a learned reward model.
- Open Recipe: The release of the model, weights, and detailed technical report under Apache 2.0 provides a valuable open recipe and design point for the community, especially for those interested in efficient, specialized MoE models.
Conclusion
Mellum 2 is an open-weight, 12B-parameter MoE language model specialized for software engineering, activating only 2.5B parameters per token for efficient inference. Through an efficiency-aware architectural design, a large-scale three-phase pre-training curriculum, long-context extension, and a two-stage post-training pipeline, it achieves competitive performance with 4B–14B dense models on coding, reasoning, and tool-use tasks.
The main takeaways are:
- A carefully designed small MoE model can match the inference speed of a smaller dense model while offering greater knowledge capacity.
- A progressive training curriculum focusing on high-quality code and math data is effective for building coding proficiency.
- Explicit reasoning (Thinking variant) unlocks significant performance gains on complex tasks like competitive programming.
- The model's strengths are in code synthesis and developer interaction, with a trade-off in broad world knowledge.
Future directions include pushing further into software engineering RL (repository-level tasks), broadening RL infrastructure, and revisiting the long-context training mix. The released open recipe provides a foundation for future inference-aware MoE coding models.
Related papers
- On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters
Parameter-efficient fine-tuning scales one shared foundation model into millions of persistent personal model instances, shown with trillion-parameter LoRA RL.
- GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration
Training image restoration models on 100,000 real-world image pairs generated by a multimodal foundation model consistently improves their generalization to diverse real-world degradations.
- Function2Scene: 3D Indoor Scene Layout from Functional Specifications
Function2Scene introduces a novel framework that generates 3D indoor layouts from functional specifications using an iterative check-and-repair pipeline with LLMs, significantly outperforming prior methods in functional design.