Mellum 2 Technical Report Summary

Summary (Overview)

  • Efficient MoE Architecture: Mellum 2 is an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token (64 experts, 8 active). It is designed for efficient inference, matching the latency of a Qwen2.5-7B dense model while offering a larger parameter envelope.
  • Specialized for Software Engineering: As a successor to the dense Mellum-4B, Mellum 2 is a general-purpose language model specialized for software engineering tasks, including code generation/editing, debugging, reasoning, tool use, and conversational assistance.
  • Innovative Training Pipeline: The model was pre-trained on ~10.6 trillion tokens using a three-phase curriculum that progressively shifts from web data to curated code/math. It employs the Muon optimizer with FP8 hybrid precision, a Warmup-Hold-Decay schedule, and features a Multi-Token Prediction head for improved performance and speculative decoding.
  • Extended Context & Post-Training: The base model's context was extended to 128K using layer-selective YaRN. Two post-trained variants were released: an Instruct model for direct answers and a Thinking model that emits explicit reasoning traces, both refined via reinforcement learning with verifiable rewards (RLVR).
  • Competitive Performance: Evaluations show Mellum 2 is competitive with open-weight baselines in the 4B–14B parameter range on code, math, reasoning, and tool-use benchmarks, while operating at the compute cost of a 2.5B dense model.

Introduction and Theoretical Foundation

Large language models (LLMs) have evolved from simple code autocomplete to comprehensive coding assistants capable of code generation, editing, debugging, multi-step reasoning, tool use, and agentic workflows. The open-weights landscape features a trade-off: dense 4–14B models are cheap to serve but plateau on harder tasks, while very large MoE models offer frontier quality at high deployment costs.

Mellum 2 aims to strike a balance, extending the line of small MoE coding models. It is designed as a practical, deployable coding assistant for JetBrains IDEs, succeeding the completion-focused Mellum-4B. The core motivation is to achieve a wide knowledge scope (via a large total parameter count) with low serving cost (via sparse activation), targeting the inference budget of a Qwen2.5-7B model.

Methodology

Model Architecture

Mellum 2 is a decoder-only Transformer based on the Qwen3-MoE recipe, with key efficiency-oriented modifications:

  • Backbone: 28 layers, hidden dimension 2,304, with pre-RMSNorm ($\epsilon = 10^{-6}$) and SiLU-gated MLPs.
  • Attention: Grouped-Query Attention (GQA) with 32 query heads and 4 KV heads (head dimension 128), QK-Norm, and RoPE with base $\theta = 500,000$.
  • Sliding Window Attention (SWA): Applied to 3 out of every 4 layers with a window size of 1,024 tokens; the remaining layer uses full attention.
  • Mixture-of-Experts: 64 experts per layer with 8 active per token (top-8 routing), expert intermediate size 896, no shared expert. Uses dropless routing.
  • Multi-Token Prediction (MTP): A single additional transformer layer trained with loss weight $\alpha = 0.1$. It serves as an auxiliary pre-training objective and a built-in draft model for speculative decoding, and is removed during evaluation.
  • Vocabulary: 98,304 tokens, untied embeddings. Native context length is 8,192 (extended to 131,072).
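
The top-8 routing in the Mixture-of-Experts bullet can be illustrated with a minimal sketch. It assumes a softmax over the 64 router logits followed by renormalization over the selected experts, which is a common choice but is not confirmed by this summary; `route_top_k` is a hypothetical helper name, and the random logits stand in for a real router's output.

```python
import math
import random

NUM_EXPERTS = 64  # experts per layer (from the summary)
TOP_K = 8         # active experts per token (from the summary)

def route_top_k(logits, k=TOP_K):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Assumption: softmax over all experts, then renormalize over the top-k.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[e] for e in top)
    return [(e, probs[e] / norm) for e in top]

random.seed(0)
token_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_top_k(token_logits)
for expert, weight in selected:
    print(f"expert {expert:2d}  weight {weight:.3f}")
```

With dropless routing, every token is processed by all eight of its selected experts; no token is discarded when an expert is oversubscribed.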

Table 2 — Architecture configuration of Mellum 2.

| Parameter | Value |
| --- | --- |
| Scale | |
| Total parameters | ~12B |
| Active parameters | ~2.5B |
| Vocabulary size | 98,304 |
| Context length | 8,192 / 131,072 (native / extended) |
| Backbone | |
| Layers | 28 |
| Hidden dimension | 2,304 |
| Activation | SiLU (gated) |
| Normalization | RMSNorm ($\epsilon = 10^{-6}$) |
| Position encoding | RoPE ($\theta = 500,000$) |
| Attention | |
| Query heads | 32 |
| KV heads (GQA) | 4 |
| Head dimension | 128 |
| QK-Norm | Yes (RMSNorm) |
| Sliding window | 1,024 (3:1 SWA) |
| Mixture-of-Experts & MTP | |
| Experts (total) | 64 |
| Experts (active) | 8 (top-8) |
| Expert MLP size | 896 |
| Shared expert | None |
| MTP layers | 1 ($\alpha = 0.1$) |
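
As a sanity check on the ~12B total and ~2.5B active figures, the parameter count can be estimated from the configuration alone. This is a back-of-the-envelope sketch, not the report's own accounting: it assumes bias-free projections, three weight matrices per gated expert MLP, and it ignores normalization weights and the MTP layer. Counting embeddings toward the active total is also an assumption.

```python
# Configuration values taken from the architecture description.
VOCAB = 98_304
HIDDEN = 2_304
LAYERS = 28
Q_HEADS, KV_HEADS, HEAD_DIM = 32, 4, 128
EXPERTS, ACTIVE_EXPERTS, EXPERT_FFN = 64, 8, 896

def count_params(experts_used):
    """Approximate parameter count when `experts_used` experts are counted per layer."""
    # Attention: Q and O projections plus the smaller K and V projections (GQA).
    attn = 2 * HIDDEN * Q_HEADS * HEAD_DIM + 2 * HIDDEN * KV_HEADS * HEAD_DIM
    # One gated MLP expert has three matrices: gate, up, and down.
    expert = 3 * HIDDEN * EXPERT_FFN
    router = HIDDEN * EXPERTS
    per_layer = attn + router + experts_used * expert
    # Untied embeddings: separate input and output matrices.
    embeddings = 2 * VOCAB * HIDDEN
    return LAYERS * per_layer + embeddings

total = count_params(EXPERTS)
active = count_params(ACTIVE_EXPERTS)
print(f"total  ~ {total / 1e9:.2f}B")
print(f"active ~ {active / 1e9:.2f}B")
```

Under these assumptions the estimate comes out at roughly 12.1B total and 2.4B active parameters, consistent with the rounded figures in the table.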

Pre-Training Data and Curriculum

The pre-training corpus of ~10.6T tokens comprises web/general knowledge, source code, and mathematical content. Training follows a three-phase "web early, curated late" curriculum:

Table 3 — Three-phase pre-training curriculum.

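| Phase | Tokens (T) | % Total | LR State | Web % | Code % | Math % |
| --- | --- | --- | --- | --- | --- | --- |
| 1: Foundation | 6.18 | 58.0 | Warmup → Hold | 70 | 23 | 6 |
| 2: Quality Uplift | 2.79 | 26.2 | Hold | 44 | 42 | 14 |
| 3: Capability Sharpening | 1.69 | 15.9 | Decay | 23 | 59 | 18 |
| Total | 10.65 | 100.0 | | | | |
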
  • High-quality data is repeated up to 4x across phases.
  • A Fill-in-the-Middle (FIM) objective is used, with its application rate varying by phase (50% in Phase 1, 10% in Phase 2, 50% on code-only in Phase 3).
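
To see what the curriculum implies for the overall data mix, the per-phase percentages in Table 3 can be converted into token budgets per domain. The sketch below only does this arithmetic; the dictionary layout and function name are illustrative, and the results count tokens seen during training (including repeated data), not unique tokens.

```python
# (tokens in trillions, web %, code %, math %) per phase, from Table 3.
PHASES = {
    "Foundation": (6.18, 70, 23, 6),
    "Quality Uplift": (2.79, 44, 42, 14),
    "Capability Sharpening": (1.69, 23, 59, 18),
}

def domain_totals(phases):
    """Sum the tokens each domain contributes across all phases (in trillions)."""
    totals = {"web": 0.0, "code": 0.0, "math": 0.0}
    for tokens, web, code, math_pct in phases.values():
        totals["web"] += tokens * web / 100
        totals["code"] += tokens * code / 100
        totals["math"] += tokens * math_pct / 100
    return totals

totals = domain_totals(PHASES)
all_tokens = sum(phase[0] for phase in PHASES.values())
for domain, tokens in totals.items():
    print(f"{domain}: {tokens:.2f}T ({100 * tokens / all_tokens:.0f}% of all tokens)")
```

This gives roughly 5.9T web, 3.6T code, and 1.1T math tokens overall, so web data still makes up more than half of the tokens seen even though Phase 3 is dominated by code.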

Training Setup

  • Optimizer: Distributed Muon (Moonlight configuration with extra scale factor 0.2), with Adam $\epsilon = 10^{-8}$.
  • Schedule: Warmup-Hold-Decay (WHD) with a peak LR of $3 \times 10^{-4}$, warming up over 2,000 steps, holding, then decaying linearly to zero over Phase 3.
  • Precision: BF16 base with FP8 hybrid mixed-precision training (tensorwise recipe).
  • Batch Size: Global batch size ramps from 2,048 to 4,096 sequences (8,192 tokens each).
  • MoE Training: Uses a global auxiliary load-balancing loss (coefficient $10^{-3}$) and a router z-loss ($10^{-3}$).
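
The Warmup-Hold-Decay schedule can be written as a small piecewise function. In this sketch the peak learning rate and warmup length come from the summary, while `DECAY_START` and `TOTAL_STEPS` are hypothetical placeholders (the summary gives phase sizes in tokens, not optimizer steps) and the linear warmup shape is an assumption.

```python
PEAK_LR = 3e-4         # from the summary
WARMUP_STEPS = 2_000   # from the summary
# Hypothetical step counts, chosen only for illustration.
DECAY_START = 300_000  # first step of Phase 3
TOTAL_STEPS = 360_000  # last step of training

def whd_lr(step):
    """Learning rate at a given optimizer step under Warmup-Hold-Decay."""
    if step < WARMUP_STEPS:
        # Linear warmup to the peak (assumed shape).
        return PEAK_LR * ((step + 1) / WARMUP_STEPS)
    if step < DECAY_START:
        # Hold at the peak through Phases 1 and 2.
        return PEAK_LR
    # Linear decay to zero over Phase 3.
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * (remaining / (TOTAL_STEPS - DECAY_START))

for step in (0, 1_999, 150_000, 330_000, 360_000):
    print(step, f"{whd_lr(step):.2e}")
```

Holding the peak rate through the first two phases and decaying only in Phase 3 lines the annealing stage up with the most curated data mix.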

Long Context Extension

The pre-trained base model's context was extended from 8,192 to 131,072 tokens using a layer-selective YaRN method. YaRN frequency re-mapping is applied only to the global (full-attention) layers, leaving the SWA layers unchanged. This approach outperformed uniform scaling methods.
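
A minimal sketch of the layer-selective idea follows. It uses the standard YaRN "NTK-by-parts" frequency remapping with ramp bounds of 1 and 32, which are the defaults from the YaRN paper and an assumption here. The position of the full-attention layer within each group of four is also assumed, and YaRN's attention temperature scaling is omitted for brevity.

```python
import math

HEAD_DIM = 128                  # from the summary
ROPE_BASE = 500_000.0           # from the summary
ORIG_CTX, NEW_CTX = 8_192, 131_072
SCALE = NEW_CTX / ORIG_CTX      # 16x extension
NUM_LAYERS = 28
ALPHA, BETA = 1.0, 32.0         # assumed ramp bounds (YaRN paper defaults)

def rope_inv_freq():
    """Unmodified RoPE inverse frequencies, one per pair of head dimensions."""
    return [ROPE_BASE ** (-2 * i / HEAD_DIM) for i in range(HEAD_DIM // 2)]

def yarn_inv_freq():
    """YaRN remapping: keep high frequencies, interpolate low frequencies."""
    out = []
    for f in rope_inv_freq():
        # Number of full rotations this dimension completes in the original context.
        rotations = ORIG_CTX * f / (2 * math.pi)
        ramp = min(max((rotations - ALPHA) / (BETA - ALPHA), 0.0), 1.0)
        # ramp = 1 keeps the frequency; ramp = 0 divides it by SCALE.
        out.append((1 - ramp) * f / SCALE + ramp * f)
    return out

def is_global_layer(layer):
    # Assumption: the full-attention layer is the last of each group of four.
    return layer % 4 == 3

def inv_freq_for_layer(layer):
    """Apply YaRN only to full-attention layers; SWA layers keep plain RoPE."""
    return yarn_inv_freq() if is_global_layer(layer) else rope_inv_freq()

global_layers = [l for l in range(NUM_LAYERS) if is_global_layer(l)]
print("layers with YaRN:", global_layers)
```

One plausible reading of the design is that sliding-window layers never attend beyond 1,024 tokens, so their positional frequencies stay within the range seen during pre-training and do not need rescaling.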

Post-Training

Two variants were created from the same long-context base checkpoint via supervised fine-tuning (SFT):

  1. Instruct: Answers directly. Loss computed on all assistant turns.
  2. Thinking: Emits an explicit reasoning trace before the final answer. Loss computed only on the final assistant turn (with its reasoning). Multi-turn conversations are unfolded for training.

Both variants were further refined via Reinforcement Learning with Verifiable Rewards (RLVR), using a variant of the GRPO algorithm. Key adaptations include:

  • Asynchronous rollouts with a decoupled verification stack.
  • Per-token IcePop truncation to handle train-vs-inference disparity in MoE router log-probabilities.
  • Reward shaping with a soft overlong penalty and a concision penalty to suppress unnecessary inline reasoning in the Instruct variant.

The loss function incorporates these modifications:

$$A_i = R_i - \frac{1}{G-1} \sum_{j \neq i} R_j,$$

$$r_{i,t} = \frac{\pi_{\text{train}}(y_{i,t} \mid y_{i,<t}; \theta)}{\pi_{\text{train}}(y_{i,t} \mid y_{i,<t}; \theta_{\text{old}})}, \quad \rho_{i,t} = \frac{\pi_{\text{train}}(y_{i,t} \mid y_{i,<t}; \theta_{\text{old}})}{\pi_{\text{infer}}(y_{i,t} \mid y_{i,<t}; \theta_{\text{old}})},$$

$$M(\rho) = \begin{cases} \rho & \text{if } \alpha \leq \rho \leq \beta, \\ 0 & \text{otherwise}, \end{cases}$$

$$\mathcal{L}_{\text{GRPO}} = -\frac{1}{N_{\text{tok}}} \sum_{i,t} M(\rho_{i,t}) \min\left( r_{i,t} A_i, \text{clip}(r_{i,t}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}) A_i \right).$$
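
The loss above can be traced term by term in a small pure-Python sketch. It takes per-token log-probabilities as plain lists and returns a scalar, so it illustrates the arithmetic only and is not a trainable implementation. The default clip and mask bounds are hypothetical placeholders, since the summary does not give values for $\epsilon_{\text{low}}$, $\epsilon_{\text{high}}$, $\alpha$, or $\beta$.

```python
import math

def leave_one_out_advantages(rewards):
    """A_i = R_i minus the mean reward of the other G-1 samples in the group."""
    g, total = len(rewards), sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]

def icepop_mask(rho, mask_lo, mask_hi):
    """M(rho): keep rho if it lies in [mask_lo, mask_hi], otherwise zero the token."""
    return rho if mask_lo <= rho <= mask_hi else 0.0

def grpo_loss(rewards, lp_new, lp_old, lp_infer,
              eps_low=0.2, eps_high=0.28, mask_lo=0.5, mask_hi=2.0):
    """Scalar loss for one group; lp_* are per-sample lists of per-token log-probs.

    The default bounds are placeholders, not values from the report.
    """
    advantages = leave_one_out_advantages(rewards)
    total, n_tok = 0.0, 0
    for adv, new, old, infer in zip(advantages, lp_new, lp_old, lp_infer):
        for l_new, l_old, l_inf in zip(new, old, infer):
            r = math.exp(l_new - l_old)    # policy ratio r_{i,t}
            rho = math.exp(l_old - l_inf)  # train/inference mismatch rho_{i,t}
            clipped = min(max(r, 1 - eps_low), 1 + eps_high)
            total += icepop_mask(rho, mask_lo, mask_hi) * min(r * adv, clipped * adv)
            n_tok += 1
    return -total / n_tok

# Four samples of two tokens each; identical log-probs give r = rho = 1.
rewards = [1.0, 0.0, 0.0, 1.0]
log_probs = [[-1.0, -2.0]] * 4
print(leave_one_out_advantages(rewards))
print(grpo_loss(rewards, log_probs, log_probs, log_probs))
```

The mask drops tokens whose training-side and inference-side probabilities disagree too strongly, so a router mismatch between the two stacks cannot dominate the update.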

Empirical Validation / Results

Architectural Ablations

  • MTP Impact: An ablation on a 14B MoE model trained on 105B tokens showed that MTP improved all four reported benchmarks, by up to 10.37 points on HumanEval, for only 7% extra training time.

    Table 1 — Benchmark comparison between baseline and MTP models (14B MoE, 105B tokens).

    | Benchmark | Metric | Baseline | + MTP | $\Delta$ |
    | --- | --- | --- | --- | --- |
    | HumanEval | pass@1 | 20.73 | 31.10 | +10.37 |
    | MMLU | Accuracy | 37.49 | 41.06 | +3.57 |
    | MMLU-Pro | Exact match | 19.07 | 22.32 | +3.25 |
    | GSM8K | Exact match | 30.63 | 33.59 | +2.96 |

  • Inference Efficiency: The final architecture matches the single-request (sync) latency of Qwen2.5-7B (192 vs. 193 tokens/s) and provides 21% higher sustained throughput under concurrent loads.

Pre-Training Evaluation

The base Mellum 2 model (2.5B active parameters) was evaluated against dense models in the 4B–7B range.

Table 5 — Pre-training evaluation results (selected highlights).

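| Benchmark | Mellum 2 (2.5B/12B) | OLMo-3-7B | Qwen2.5-7B | Qwen3-4B |
| --- | --- | --- | --- | --- |
| Code Generation | | | | |
| HumanEval | 41.5 | 45.1 | 55.5 | 57.3 |
| MBPP | 62.4 | 50.6 | 63.6 | 67.0 |
| Knowledge & Reasoning | | | | |
| MMLU | 70.9 | 62.1 | 71.8 | 71.1 |
| MMLU-Pro | 59.3 | 34.5 | 48.6 | 51.5 |
| BBH | 74.9 | 63.6 | 69.0 | 71.3 |
| Math & Science | | | | |
| GSM8K | 81.7 | 73.5 | 81.9 | 82.0 |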

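Key findings: Despite having fewer active parameters, Mellum 2 leads all listed baselines on the reasoning-heavy MMLU-Pro (59.3) and BBH (74.9) benchmarks and is within roughly one point of Qwen2.5-7B on MBPP, MMLU, and GSM8K. It trails every listed baseline on HumanEval (41.5).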

Post-Training Evaluation

Post-trained Instruct and Thinking variants were evaluated against open-weight models in the 4B–14B range.

Table 9 — Post-training evaluation, instruct (no-thinking) variants (selected).

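| Benchmark | Mellum 2 (SFT) | Mellum 2 (RL) | Qwen3.5-9B | Ministral-3-14B |
| --- | --- | --- | --- | --- |
| Coding | | | | |
| LiveCodeBench v6 | 30.9 | 37.2 | 63.7 | 42.4 |
| EvalPlus | 76.2 | 78.4 | 71.8 | 74.1 |
| Tool Use | | | | |
| BFCL v3 | 43.1 | 66.3 | 70.5 | 52.7 |
| Math | | | | |
| AIME | 29.9 | 41.7 | 58.3 | 33.3 |
| Knowledge | | | | |
| MMLU-Redux | 77.4 | 78.1 | 91.1 | 85.9 |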

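Table 10 — Post-training evaluation, thinking/reasoning variants (selected).

| Benchmark | Mellum 2 (SFT) | Mellum 2 (RL) | Qwen3.5-9B | OLMo-3-7B |
| --- | --- | --- | --- | --- |
| Coding | | | | |
| LiveCodeBench v6 | 75.1 | 69.9 | 68.3 | 59.8 |
| Math | | | | |
| AIME | 20.0 | 58.4 | 73.4 | 61.7 |
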
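  • Coding: Mellum 2 leads on function-level synthesis (EvalPlus, 78.4 after RL). The Thinking variant reaches 75.1 on LiveCodeBench v6 (SFT column of Table 10), compared with 30.9 to 37.2 for the Instruct variant, indicating that algorithmic reasoning is unlocked with an explicit reasoning budget.
  • Tool Use & Math: RL provides large gains: for the Instruct variant, BFCL v3 rises from 43.1 to 66.3 and AIME from 29.9 to 41.7; for the Thinking variant, AIME rises from 20.0 to 58.4. Mellum 2 is competitive on tool use (BFCL) and math (GSM-Plus ~87–90%).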
  • Knowledge: A deliberate weakness. Mellum 2 lags behind models like Qwen3.5-9B on broad knowledge benchmarks (MMLU-Redux, GPQA), reflecting its code-focused training mix.
  • Safety: The SFT variant scores 8.4% on HarmBench, but RL regresses this to 23.1%, a known alignment tax.

Theoretical and Practical Implications

  • Efficiency-Aware Design: Mellum 2 demonstrates that a small MoE architecture, designed with inference cost as a primary constraint, can achieve a favorable quality-cost trade-off, making powerful coding assistants more deployable.
  • Curriculum and Data Strategy: The three-phase "web early, curated late" curriculum and strategic data repetition (particularly for MoE training) are validated as effective methods for building capability in specialized domains.
  • Specialization vs. Generality: The model's profile—strong in code and reasoning, weaker in broad knowledge—highlights the trade-offs involved in specializing a general-purpose model. It provides a blueprint for creating efficient, domain-specialized assistants.
  • RL for Specialized Models: The successful application of RLVR on verifiable code and math tasks shows that reinforcement learning can be effectively used to refine models for deterministic domains without the need for a learned reward model.
  • Open Recipe: The release of the model, weights, and detailed technical report under Apache 2.0 provides a valuable open recipe and design point for the community, especially for those interested in efficient, specialized MoE models.

Conclusion

Mellum 2 is an open-weight, 12B-parameter MoE language model specialized for software engineering, activating only 2.5B parameters per token for efficient inference. Through an efficiency-aware architectural design, a large-scale three-phase pre-training curriculum, long-context extension, and a two-stage post-training pipeline, it achieves performance competitive with open-weight models in the 4B–14B range on coding, reasoning, and tool-use tasks.

The main takeaways are:

  1. A carefully designed small MoE model can match the inference speed of a smaller dense model while offering greater knowledge capacity.
  2. A progressive training curriculum focusing on high-quality code and math data is effective for building coding proficiency.
  3. Explicit reasoning (Thinking variant) unlocks significant performance gains on complex tasks like competitive programming.
  4. The model's strengths are in code synthesis and developer interaction, with a trade-off in broad world knowledge.

Future directions include pushing further into software engineering RL (repository-level tasks), broadening RL infrastructure, and revisiting the long-context training mix. The released open recipe provides a foundation for future inference-aware MoE coding models.
