Summary (Overview)

  • VibeThinker-3B is a 3-billion-parameter dense model that achieves frontier-level verifiable reasoning, scoring 94.3 on AIME26 (97.1 with Claim-Level Reliability Assessment), 80.2 Pass@1 on LiveCodeBench v6, and 96.1% acceptance rate on unseen LeetCode contests.
  • The model matches or exceeds much larger systems such as DeepSeek V3.2 (671B), GLM-5 (744B), and Gemini 3 Pro, demonstrating that extreme reasoning capability does not strictly require massive parameter counts.
  • It introduces an optimized Spectrum-to-Signal post-training pipeline with curriculum‑based SFT, multi‑domain reinforcement learning (MGPO), a Long2Short efficiency stage, offline self‑distillation, and instruction alignment.
  • The paper proposes the Parametric Compression‑Coverage Hypothesis: verifiable reasoning is a parameter‑dense, compressible core, while open‑domain knowledge requires broad parameter coverage.
  • VibeThinker-3B shows strong out‑of‑distribution generalization on recent LeetCode contests, confirming robust algorithmic problem‑solving beyond static benchmarks.

Introduction and Theoretical Foundation

The paper addresses the common assumption that frontier reasoning ability is concentrated in models with tens or hundreds of billions of parameters. While scaling laws have driven progress, small language models (SLMs, ≤3B parameters) offer advantages in deployment cost, inference efficiency, and academic accessibility but are often considered inherently bottlenecked for complex reasoning. The authors’ previous work on VibeThinker‑1.5B demonstrated that even extremely small models can produce stable logical chains, but its upper bound remained unexplored.

VibeThinker‑3B extends this exploration by asking whether a strictly 3B model can achieve performance comparable to top‑tier LLMs on verifiable reasoning tasks. The work is grounded in the Spectrum‑to‑Signal Principle: the SFT stage constructs a diverse solution space (the “Spectrum”), and reinforcement learning amplifies high‑value reasoning signals (the “Signal”).

The Parametric Compression‑Coverage Hypothesis posits a structural divergence in how capabilities are encoded in parameter space:

  • Parameter‑dense capabilities (e.g., verifiable reasoning) involve search, constraint satisfaction, error correction, and multi‑step composition within a structured solution space. These can be highly compressed into a compact, reusable reasoning core.
  • Parameter‑expansive capabilities (e.g., open‑domain knowledge) require broad coverage over facts, concepts, and long‑tail scenarios – a coverage problem rather than compression.

This leads to a Reasoning‑Knowledge Decoupling Paradigm: large models serve as natural vehicles for expansive knowledge, while compact models can encapsulate high‑density reasoning depth when provided with constrained spaces and reliable training signals.

Methodology

Overall Pipeline (Figure 3)

The post‑training pipeline, built upon Qwen2.5‑Coder‑3B, proceeds sequentially through:

  1. Supervised Fine‑Tuning (SFT) – Two‑stage curriculum learning with diversity‑exploring distillation.
  2. Multi‑Domain Reinforcement Learning (RL) – Math, Code, and STEM RL using MGPO; includes Long2Short for efficiency.
  3. Offline Self‑Distillation – Extracts high‑quality trajectories from RL checkpoints and distills them back.
  4. Instruct RL – Reinforces strict instruction following with constraint checking and rubric‑based rewards.

SFT Data Construction

  • Data Synthesis and Query Expansion: Seed queries with reliable supervision signals (explicit answers for math, unit tests for code) are expanded across dimensions like concept composition and constraints. Pseudo‑labels are generated via majority voting from strong teacher models.
  • Multi‑path Reasoning Distillation: For each query, multiple candidate reasoning traces are sampled from teacher models, preserving intermediate steps to construct a broad solution spectrum.
  • Multi‑level Quality Control: N‑gram filtering (remove anomalies and benchmark contamination), LLM‑based query quality filtering, and trace correctness filtering (answer verification, sandbox execution, majority voting).
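
The N-gram step in the quality-control stage can be illustrated with a small sketch. The whitespace tokenizer, the window size `n=8`, and the function names below are illustrative assumptions; the source does not state the n-gram length or tokenization that was actually used.

```python
def ngrams(text, n):
    """Set of lowercase, whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(queries, benchmark_items, n=8):
    """Drop every query that shares at least one n-gram with a benchmark item.

    ASSUMPTION: n=8 and whitespace tokenization are placeholders, not the
    paper's settings.
    """
    benchmark_ngrams = set()
    for item in benchmark_items:
        benchmark_ngrams |= ngrams(item, n)
    return [q for q in queries if ngrams(q, n).isdisjoint(benchmark_ngrams)]
```

A query shorter than `n` tokens yields no n-grams and is therefore always kept; a real pipeline would likely handle short queries with a separate rule.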

SFT Training Process

  • Stage 1 (Broad Coverage): Full quality‑filtered dataset, sequence packing, batch size 128, learning rate $5\times10^{-5}$ with cosine decay to $8\times10^{-8}$, 5 epochs with 5% linear warmup.
  • Stage 2 (Hard Reasoning): Focus on long‑horizon samples. Discard traces shorter than 5K tokens and filter out easy problems (error rate <0.75, using VibeThinker‑1.5B as the reference model). Train for 2 additional epochs.
  • Diversity‑Exploring Distillation: During training, domain‑specific checkpoints are periodically selected based on Pass@K diversity, then merged at the parameter level to preserve output diversity.
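
The Stage 1 learning-rate schedule (5% linear warmup to the peak rate, then cosine decay to the floor) can be sketched as follows. Only the peak, the floor, and the warmup fraction come from the text; the per-step formulation and the name `lr_at` are assumptions for illustration.

```python
import math


def lr_at(step, total_steps, peak=5e-5, floor=8e-8, warmup_frac=0.05):
    """Learning rate at a 0-indexed optimizer step.

    ASSUMPTION: warmup rises linearly to `peak`, then a half-cosine decays
    from `peak` to `floor` over the remaining steps.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=1000`, the rate reaches the peak at the end of step 49 and approaches the floor at the final step.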

Reinforcement Learning (MGPO)

MaxEnt‑Guided Policy Optimization (MGPO) is the core RL algorithm. For each prompt $q$, $G$ responses are sampled and evaluated with verifiable rewards. The group accuracy $\hat{p}(q)$ is computed as:

$$\hat{p}(q) = \frac{1}{G} \sum_{i=1}^{G} \mathbb{I}(r_i = 1)$$

Prompts are weighted by how close $\hat{p}(q)$ is to the maximum‑entropy point $p_0 = 0.5$:

$$w(q) = \exp\left(-\gamma\, D_{\text{ME}}(\hat{p}(q) \,\|\, p_0)\right), \quad \gamma > 0$$
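
A minimal sketch of the prompt weight is given below. The summary does not define $D_{\text{ME}}$, so the code assumes it is the KL divergence between Bernoulli distributions with parameters $\hat{p}(q)$ and $p_0$; the value `gamma=1.0` is likewise a placeholder.

```python
import math


def group_accuracy(rewards):
    """Fraction of sampled responses with reward exactly 1."""
    return sum(1 for r in rewards if r == 1) / len(rewards)


def max_entropy_deviation(p, p0=0.5):
    """ASSUMPTION: D_ME is the Bernoulli KL divergence KL(p || p0)."""
    def term(a, b):
        # By convention 0 * log(0 / b) = 0.
        return 0.0 if a <= 0.0 else a * math.log(a / b)
    return term(p, p0) + term(1.0 - p, 1.0 - p0)


def prompt_weight(rewards, gamma=1.0, p0=0.5):
    """w(q) = exp(-gamma * D_ME(p_hat || p0)); gamma=1.0 is a placeholder."""
    return math.exp(-gamma * max_entropy_deviation(group_accuracy(rewards), p0))
```

Under this assumption a prompt solved in exactly half of its samples receives weight 1, and the weight shrinks symmetrically as the group accuracy moves toward 0 or 1.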

The weight $w(q)$ is applied to the group‑relative advantage inside a GRPO‑style clipped objective:

$$J_{\text{MGPO}}(\theta) = \mathbb{E}_{q, \{y_i\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\left( \rho_{i,t}(\theta)\, w(q)\, A_i,\; \text{clip}\left(\rho_{i,t}(\theta), 1-\varepsilon, 1+\varepsilon\right) w(q)\, A_i \right) \right]$$
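
For a single prompt, the weighted clipped surrogate can be evaluated as in the sketch below. It assumes the usual GRPO advantage (reward minus group mean, divided by the group standard deviation) and a clip range of 0.2; neither is specified in this summary. The prompt weight is passed in as a plain number.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """ASSUMPTION: GRPO-style normalization, (r - mean) / std over the group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [0.0 if sigma == 0 else (r - mu) / sigma for r in rewards]


def mgpo_surrogate(ratios, rewards, weight, clip_eps=0.2):
    """Surrogate value for one prompt.

    ratios[i][t] is the importance ratio rho_{i,t} for token t of response i.
    weight is the prompt weight w(q). clip_eps=0.2 is a placeholder.
    """
    total = 0.0
    for rho_i, adv in zip(ratios, group_advantages(rewards)):
        scaled = weight * adv
        per_token = [
            min(rho * scaled, max(1 - clip_eps, min(rho, 1 + clip_eps)) * scaled)
            for rho in rho_i
        ]
        total += sum(per_token) / len(per_token)  # length-normalized per response
    return total / len(ratios)
```

Taking the minimum makes the objective pessimistic: a large ratio on a positively rewarded response is capped at `1 + clip_eps`, while the same ratio on a negatively rewarded response is penalized in full.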

Multi‑domain RL: Training proceeds sequentially – Math RL (answer verification), then Code RL (sandbox execution), then STEM RL (answer matching + option verification). A single 64K long‑context window is used directly (multi‑stage context expansion was found harmful for the stronger initialization). All RL stages are on‑policy to mitigate training‑inference mismatch.

Long2Short Math RL: After standard accuracy‑oriented MGPO, a token‑efficiency stage redistributes rewards among correct trajectories based on brevity. For the correct set $C = \{i \mid r_i = 1\}$, define the brevity score $s_i = 1/L_i$ and apply a centered shift:

$$r'_i = r_i + \lambda \cdot \frac{s_i - \bar{s}}{\max_{j \in C} |s_j - \bar{s}|}, \quad i \in C$$

where $\bar{s}$ is the mean brevity score and $\lambda = 0.2$. The shift is zero‑sum within $C$, so the group‑level reward baseline remains unchanged.
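
The reward redistribution can be checked with a short sketch. It assumes that the length of each response is measured in tokens and that the mean brevity score is taken over the correct set only, which is what makes the shift zero-sum; the function name and the handling of degenerate groups are illustrative choices.

```python
def long2short_rewards(rewards, lengths, lam=0.2):
    """Shift rewards of correct responses by their centered, scaled brevity.

    ASSUMPTION: lengths are token counts and the mean is over correct
    responses only. Incorrect responses are left untouched.
    """
    shaped = [float(r) for r in rewards]
    correct = [i for i, r in enumerate(rewards) if r == 1]
    if len(correct) < 2:
        return shaped  # nothing to redistribute
    brevity = {i: 1.0 / lengths[i] for i in correct}
    s_bar = sum(brevity.values()) / len(correct)
    scale = max(abs(s - s_bar) for s in brevity.values())
    if scale == 0.0:
        return shaped  # all correct responses have the same length
    for i in correct:
        shaped[i] += lam * (brevity[i] - s_bar) / scale
    return shaped
```

For two correct responses of 100 and 400 tokens, the shorter one is moved up by `lam` and the longer one down by `lam`, so their total reward is unchanged.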

Offline Self‑Distillation

Trajectories from Math, Code, and STEM RL checkpoints are filtered by domain‑specific verifiers. A learning potential score estimates distillation value for the student model:

$$S_{LP}(q, y) = -\frac{1}{|y|} \sum_{t=1}^{|y|} \log \pi_{\theta_{\text{stu}}}(y_t \mid q, y_{<t})$$

Traces are selected from the middle‑to‑high score range within length buckets, excluding extremes to avoid noise.
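
A hedged sketch of this selection step is shown below. The score is the student's mean negative log-likelihood per token, as in the formula above; the bucket width and the quantile band (`low_q`, `high_q`) are hypothetical values, since the summary only says "middle-to-high" and gives no thresholds.

```python
def learning_potential(token_logprobs):
    """Mean negative log-probability the student assigns to the trace tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def select_traces(traces, bucket_size=4096, low_q=0.5, high_q=0.9):
    """Keep traces whose score falls in [low_q, high_q) within each length bucket.

    traces: list of dicts with a "logprobs" list (student log-probs per token).
    ASSUMPTION: bucket_size, low_q, and high_q are illustrative placeholders.
    """
    buckets = {}
    for trace in traces:
        key = len(trace["logprobs"]) // bucket_size
        buckets.setdefault(key, []).append(trace)
    selected = []
    for bucket in buckets.values():
        ranked = sorted(bucket, key=lambda t: learning_potential(t["logprobs"]))
        lo, hi = int(low_q * len(ranked)), int(high_q * len(ranked))
        selected.extend(ranked[lo:hi])
    return selected
```

Dropping the lowest-scoring traces removes material the student already models well, and dropping the very highest removes traces that are more likely to be noisy or out of reach.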

Instruct RL

The final stage mixes format‑sensitive, long‑context, and general alignment data. Rule‑based validators check explicit constraints, while rubric‑based reward models evaluate open‑ended responses for helpfulness, coherence, and instruction adherence.

Test‑Time Scaling: Claim‑Level Reliability Assessment (CLR)

For answer‑verifiable tasks, CLR generates $K=32$ candidate trajectories, extracts $M=5$ decision‑relevant claims per trajectory, and uses the model as a self‑verifier to produce binary verdicts $v_{k,m} \in \{0,1\}$. A nonlinear trajectory‑level reliability score is computed:

$$r_k = \left( \frac{1}{M} \sum_{m=1}^{M} v_{k,m} \right)^{M}$$

Candidate answers are clustered by equivalence, and the cluster $G$ with the largest reliability‑weighted score is selected:

$$\text{Score}(G) = \sum_{\{k \mid y_k \in G\}} r_k$$
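
The aggregation can be sketched in a few lines. The example assumes answers have already been normalized so that string equality stands in for the equivalence clustering, and that the claim verdicts are supplied by an upstream verifier call that is not shown.

```python
from collections import defaultdict


def trajectory_reliability(verdicts):
    """r_k = (fraction of claims judged valid) ** M, with M = len(verdicts)."""
    m = len(verdicts)
    return (sum(verdicts) / m) ** m


def clr_select(candidates):
    """Pick the answer whose trajectories have the largest summed reliability.

    candidates: list of (answer, verdicts) pairs.
    ASSUMPTION: equal strings mean equivalent answers.
    """
    scores = defaultdict(float)
    for answer, verdicts in candidates:
        scores[answer] += trajectory_reliability(verdicts)
    return max(scores, key=scores.get)


# Three trajectories agree on "17" but each has one rejected claim (0.8 ** 5 each);
# a single fully verified trajectory for "42" outweighs them.
candidates = [("17", [1, 1, 1, 1, 0])] * 3 + [("42", [1, 1, 1, 1, 1])]
print(clr_select(candidates))  # prints 42
```

The exponent makes the score strongly nonlinear: one rejected claim out of five cuts a trajectory's weight to roughly a third, so the result can differ from plain majority voting.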

Empirical Validation / Results

Evaluation Setup

  • Benchmarks: AIME25, AIME26, HMMT25, BruMO25, IMO‑AnswerBench (math); LiveCodeBench v6, OJBench (coding); GPQA‑Diamond (knowledge); IFEval, IFBench (instruction following); recent LeetCode contests (OOD coding).
  • Protocol: vLLM inference, temperature 1.0, top‑p 0.95. Math: Pass@1 over 64 generations (16 for IMO‑AnswerBench), using math verify + LLM‑as‑judge. Code: Pass@1 over 8 generations with sandbox execution. CLR: 8 independent runs, averaged.
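
Under the common reading of "Pass@1 over n generations", the score is the per-problem fraction of correct samples, averaged over problems. The sketch below follows that reading, which is an assumption about the protocol rather than something the summary spells out.

```python
def pass_at_1(results):
    """Mean per-problem accuracy, as a percentage.

    results: dict mapping a problem id to a list of booleans, one per
    sampled generation (True = judged correct).
    """
    per_problem = [sum(flags) / len(flags) for flags in results.values()]
    return 100.0 * sum(per_problem) / len(per_problem)
```

Averaging over many samples per problem reduces the variance that a single draw at temperature 1.0 would otherwise introduce.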

Core Benchmark Performance

Table 1 compares VibeThinker‑3B with small and large reasoning models. Key rows:

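| Model | Params | AIME25 | AIME26 | HMMT25 | BruMO25 | IMO‑Ans | LCBv6 | OJBench | GPQA‑D | IFEval | IFBench |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Small Reasoning Models | | | | | | | | | | | |
| SmolLM3 | 3B | 36.7 | 41.0 | 26.0 | 49.2 | 28.7 | 29.1 | 5.2 | 41.7 | 71.2 | 27.6 |
| Qwen3.5‑4B | 4B | 79.8 | 84.0 | 73.8 | 83.5 | 48.7 | 62.0 | 23.5 | 76.2 | 89.8 | 59.2 |
| Ministral‑3‑Reasoning‑2512 | 14B | 82.9 | 85.0 | 67.1 | 86.7 | 63.4 | 66.0 | 15.1 | 71.2 | 73.9 | 32.3 |
| Large Reasoning Models | | | | | | | | | | | |
| GPT‑OSS‑20B (high) | 20B | 91.7 | 90.2 | 76.7 | 86.7 | 61.9 | 61.0 | - | 71.5 | 92.8 | 65.0 |
| DeepSeek V3.2 | 671B | 93.1 | 94.2 | 90.2 | 96.7 | 78.3 | 80.8 | 48.4 | 82.4 | 92.6 | 60.7 |
| Kimi K2.5 | 1T | 96.1 | 93.3 | 95.4 | 98.3 | 81.8 | 85.0 | 54.7 | 87.6 | 93.9 | 70.0 |
| GLM‑5 | 744B | 96.7 | 95.8 | 97.9 | - | 82.5 | 85.5 | 55.0 | 86.0 | 92.6 | 76.5 |
| VibeThinker‑3B | 3B | 91.4 | 94.3 | 89.3 | 93.8 | 76.4 | 80.2 | 38.6 | 70.2 | 93.4 | 74.5 |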

VibeThinker‑3B leads the small models on every math and coding benchmark and also edges out far larger models on AIME26 (94.3 versus 90.2 for GPT‑OSS‑20B, 94.2 for DeepSeek V3.2, and 93.3 for Kimi K2.5). Its instruction‑following scores (IFEval 93.4, IFBench 74.5) indicate that the reasoning enhancement does not compromise controllability.

Comparison with Top‑Tier Models (Table 2)

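| Model | Params | AIME25 | AIME26 | HMMT25 | BruMO25 | IMO‑Ans | LCBv6 | GPQA‑D | IFEval | IFBench |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek V3.2 | 671B | 93.1 | 94.2 | 90.2 | 96.7 | 78.3 | 80.8 | 82.4 | 92.6 | 60.7 |
| Kimi K2.5 | 1T | 96.1 | 93.3 | 95.4 | 98.3 | 81.8 | 85.0 | 87.6 | 93.9 | 70.0 |
| GLM‑5 | 744B | 96.7 | 95.8 | 97.9 | - | 82.5 | 85.5 | 86.0 | 92.6 | 76.5 |
| Gemini 3 Pro | N/A | 96.0 | 91.7 | 97.5 | 98.3 | 83.1 | 87.4 | 91.9 | - | 70.4 |
| Claude Opus 4.5 | N/A | 92.8 | 95.1 | 92.9 | - | 78.5 | 84.8 | 87.0 | - | 58.0 |
| VibeThinker‑3B | 3B | 91.4 | 94.3 | 89.3 | 93.8 | 76.4 | 80.2 | 70.2 | 93.4 | 74.5 |
| + CLR | 3B | 96.7 | 97.1 | 95.4 | 99.2 | 80.6 | - | 72.9 | - | - |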

With CLR, VibeThinker‑3B enters the top cluster on all competition‑style mathematics benchmarks, matching or exceeding many flagship models. On GPQA‑Diamond, a clear gap remains, consistent with the knowledge vs. reasoning decoupling hypothesis.

OOD Generalization: Recent LeetCode Contests (Table 3)

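| Model | BW181 | BW182 | BW183 | W499 | W500 | W502 | W503 | W504 | Overall |
|---|---|---|---|---|---|---|---|---|---|
| GPT‑5.3‑Codex | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 100.0% |
| Gemini 3 Flash | 16/16 | 16/16 | 12/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 96.9% |
| VibeThinker‑3B | 16/16 | 16/16 | 12/16 | 16/16 | 16/16 | 16/16 | 15/16 | 16/16 | 96.1% |
| GPT‑5.2 | 15/16 | 16/16 | 15/16 | 16/16 | 15/16 | 14/16 | 16/16 | 15/16 | 95.3% |
| Claude Opus 4.6 | 15/16 | 16/16 | 12/16 | 16/16 | 12/16 | 16/16 | 15/16 | 9/16 | 86.7% |
| GLM‑5 | 15/16 | 14/16 | 12/16 | 14/16 | 8/16 | 16/16 | 12/16 | 7/16 | 76.6% |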

VibeThinker‑3B achieves 123/128 (96.1%) first‑attempt Python submissions, outperforming GPT‑5.2, Kimi K2.5, Qwen3.5‑397B, Claude Opus 4.6, and GLM‑5 on these fresh, execution‑verified problems.
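
The overall figure follows directly from the per-contest counts reported for VibeThinker-3B in Table 3. The short check below recomputes it; the dictionary layout is only an illustrative way to hold those counts.

```python
# Per-contest accepted submissions for VibeThinker-3B, out of 16 each (Table 3).
solved = {
    "BW181": 16, "BW182": 16, "BW183": 12, "W499": 16,
    "W500": 16, "W502": 16, "W503": 15, "W504": 16,
}
attempted = 16 * len(solved)
total = sum(solved.values())
rate = 100.0 * total / attempted
print(f"{total}/{attempted} = {rate:.1f}%")  # prints 123/128 = 96.1%
```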

Theoretical and Practical Implications

  • The Parametric Compression‑Coverage Hypothesis provides a theoretical lens: verifiable reasoning can be compressed into a compact core (parameter‑dense), while open‑domain knowledge requires broad parameter coverage (parameter‑expansive). This explains why VibeThinker‑3B excels on math/code but trails on GPQA‑Diamond.
  • The Reasoning‑Knowledge Decoupling Paradigm suggests that large‑scale generalists and compact specialists are complementary rather than substitutive. Large models remain essential for broad knowledge; compact models can achieve frontier performance on structured, verifiable tasks.
  • Practical implications: Small models are not merely deployment‑efficient substitutes but a valid research trajectory for achieving high‑density reasoning. Post‑training methodology (diverse exploration, multi‑domain RL, self‑distillation, test‑time scaling) is decisive in unlocking this potential.
  • The CLR test‑time scaling method offers a token‑efficient alternative to full‑trace verification, achieving substantial gains on answer‑verifiable tasks without parameter updates.

Conclusion

VibeThinker‑3B demonstrates that a strictly 3B‑parameter model, when carefully post‑trained with the Spectrum‑to‑Signal paradigm, can match or exceed top‑tier reasoning systems on highly demanding verifiable tasks, including competition‑level mathematics and coding. It achieves 94.3 on AIME26 (97.1 with CLR), 80.2 Pass@1 on LiveCodeBench v6, and a 96.1% acceptance rate on recent LeetCode contests.
