DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Summary (Overview)
- Pure RL for Reasoning: The paper demonstrates that advanced reasoning capabilities in Large Language Models (LLMs) can be incentivized through pure reinforcement learning (RL), bypassing the need for human-labeled reasoning trajectories (Supervised Fine-Tuning). The model, DeepSeek-R1-Zero, is trained from a base checkpoint using only outcome-based rewards for correctness.
- Emergence of Sophisticated Behaviors: The RL process facilitates the autonomous emergence of advanced reasoning patterns such as self-reflection, verification, and dynamic strategy adaptation. The model naturally learns to generate longer, more detailed Chain-of-Thought (CoT) reasoning, with its average response length increasing significantly during training.
- State-of-the-Art Performance: The final model, DeepSeek-R1, achieves superior performance on verifiable reasoning tasks. For example, it reaches a 79.8% Pass@1 accuracy on the AIME 2024 math competition, significantly surpassing the average human competitor score and outperforming models trained with conventional supervised learning.
- Multi-Stage Training Pipeline: To address issues of readability and language mixing in the pure RL model, the authors develop DeepSeek-R1 through a multi-stage pipeline integrating rejection sampling, RL with model-based rewards (for helpfulness/safety), and supervised fine-tuning on mixed reasoning and non-reasoning data.
- Capability Distillation: The strong reasoning capabilities developed in the large-scale model can be systematically transferred to smaller, publicly released models via distillation, enhancing their performance beyond their original instruction-tuned counterparts.
Introduction and Theoretical Foundation
General reasoning remains a formidable challenge in AI. While LLMs and techniques like Chain-of-Thought (CoT) prompting have shown success, they are heavily dependent on human-annotated demonstrations, which limits scalability, introduces cognitive biases, and caps performance at human exemplar levels. This paper posits that LLMs possess latent reasoning potential that can be unlocked through self-evolution in an RL framework, minimizing reliance on human labeling.
The core hypothesis is that human-defined reasoning patterns may limit model exploration. Instead of teaching the model how to reason via supervised data, the proposed method provides the right incentives (rewards for correct final answers) and allows the model to autonomously discover effective, and potentially novel, problem-solving strategies. This approach is built upon DeepSeek-V3-Base and employs the Group Relative Policy Optimization (GRPO) RL algorithm for efficiency.
Methodology
1. Reinforcement Learning Framework: Group Relative Policy Optimization (GRPO)
The authors adopt GRPO to simplify training and reduce resource consumption compared to Proximal Policy Optimization (PPO). For each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{\text{old}}}$ and optimizes the policy model $\pi_\theta$ by maximizing the objective:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)\right] \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\ \text{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right)$$

where the KL divergence is estimated as:

$$\mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) = \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1$$

The advantage $A_i$ is computed by normalizing the group of rewards $\{r_1, r_2, \ldots, r_G\}$:

$$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})}$$

Here, $\pi_{\text{ref}}$ is a frozen reference policy, and $\varepsilon$ and $\beta$ are hyperparameters.
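The group-relative update above can be sketched in a few lines. The following is a minimal, illustrative NumPy version (not the authors' implementation), operating on the summed log-probability of each sampled output:

```python
import numpy as np

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.001):
    """Illustrative GRPO loss for one question with G sampled outputs.

    logp_new, logp_old, logp_ref: shape-(G,) summed log-probs of each
    output under the current, old, and reference policies.
    rewards: shape-(G,) scalar outcome rewards for the group.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Importance ratio pi_theta / pi_theta_old, with PPO-style clipping.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # Unbiased KL estimator to the reference policy: r - log r - 1.
    log_r = logp_ref - logp_new
    kl = np.exp(log_r) - log_r - 1.0
    # The objective is maximized, so return its negation as a loss.
    return -(surrogate - beta * kl).mean()
```

When the current policy coincides with the old and reference policies, the ratio is 1, the KL term vanishes, and the loss reduces to the negated mean advantage (zero by construction of the group normalization).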
2. DeepSeek-R1-Zero (Pure RL)
- Training: Starts directly from DeepSeek-V3-Base with no SFT phase. Uses a simple template requiring the reasoning process within `<think>...</think>` tags and the final answer within `<answer>...</answer>` tags.
- Reward Design: Uses rule-based rewards.
- Accuracy Reward: Based on the correctness of the final answer (e.g., exact match for math, test case pass rate for code).
- Format Reward: Incentivizes proper use of the specified reasoning and answer tags.
- Combined reward: the sum of the accuracy and format rewards.
- Key Hyperparameters: Learning rate = 3e-6, KL coefficient = 0.001, group size G = 16, maximum response length increased from 32,768 to 65,536 tokens mid-training.
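The rule-based reward design above might be implemented along these lines. This is a sketch assuming exact-match grading of a math answer and the `<think>`/`<answer>` template; the equal weighting of the two terms is an assumption:

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>...</think><answer>...</answer> template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, response, re.DOTALL) else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """1.0 on exact match of the extracted final answer (math-style grading)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def reward(response: str, gold: str) -> float:
    # Combined rule-based reward: accuracy plus format adherence.
    return accuracy_reward(response, gold) + format_reward(response)
```

For code tasks, `accuracy_reward` would instead be derived from the test-case pass rate rather than string matching.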
3. DeepSeek-R1 (Multi-Stage Pipeline)
To improve readability, language consistency, and general capability, DeepSeek-R1 is trained via a multi-stage pipeline (see Figure 2 in paper).
- Model-Based Rewards: Used for general (non-reasoning) data.
- Helpful Reward Model: Trained on 66K preference pairs generated and judged by DeepSeek-V3. Evaluates only the final summary.
- Safety Reward Model: Trained on 106K prompts with point-wise "safe"/"unsafe" labels. Evaluates the entire response.
- Language Consistency Reward: Added to mitigate language mixing.
- Final Combined Reward: For a batch of mixed data, rule-based rewards (for reasoning prompts) and the model-based helpfulness and safety rewards (for general prompts) are combined with weighting hyperparameters (see paper for the exact formulation).
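A sketch of how such a mixed-batch reward might be routed is shown below. The reward-model calls, the safety penalty, and the `alpha` weight on language consistency are illustrative assumptions, not the paper's exact formulation:

```python
def combined_reward(sample, helpful_rm, safety_rm, rule_reward, lang_consistency,
                    alpha=0.1):
    """Route a sample to rule-based or model-based rewards.

    helpful_rm / safety_rm stand in for the trained reward models;
    alpha (weight on language consistency) is an assumed hyperparameter.
    """
    if sample["verifiable"]:
        # Reasoning data: rule-based correctness plus a language-consistency bonus.
        return rule_reward(sample) + alpha * lang_consistency(sample["response"])
    # General data: model-based helpfulness on the final summary,
    # penalized if the whole response is judged unsafe (assumed penalty).
    r = helpful_rm(sample["prompt"], sample["summary"])
    if safety_rm(sample["response"]) == "unsafe":
        r -= 1.0
    return r
```

The key design choice mirrored here is that verifiable prompts never touch the learned reward models, which limits opportunities for reward hacking on reasoning data.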
Empirical Validation / Results
Key Training Dynamics of DeepSeek-R1-Zero
- Performance Improvement: On the AIME 2024 benchmark, Pass@1 accuracy increased from 15.6% to 77.9% during RL training. With self-consistency (Cons@16), accuracy reached 86.7%, surpassing the average human competitor score.
- Emergent Long CoT: The average response length increased dramatically during training (see Figure 1b), indicating the model autonomously learned to "think longer" for complex problems.
- Emergent Reasoning Behaviors: The model developed advanced strategies like self-reflection (e.g., using phrases like "Wait, let's reevaluate") and exploring alternative solutions within a single response.
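The Cons@16 metric above refers to self-consistency decoding: sample 16 independent answers and take the majority vote over the extracted final answers, which a minimal sketch might implement as:

```python
from collections import Counter

def consensus_at_k(answers):
    """Self-consistency: majority vote over k sampled final answers (Cons@16: k=16).

    answers: list of extracted final answers; None marks unparseable samples.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Majority voting helps precisely because independent samples tend to agree on correct answers more often than on any single incorrect one.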
Benchmark Performance of DeepSeek-R1 Series
The table below shows the progression of performance through the multi-stage pipeline (R1-Zero → Dev1 → Dev2 → Dev3 → Final R1):
Table 3 | Experimental results at each stage of DeepSeek-R1. (Abbreviated; see paper for full table)
| Benchmark (Metric) | R1-Zero | R1-Dev1 | R1-Dev2 | R1-Dev3 | R1 |
|---|---|---|---|---|---|
| English | | | | | |
| MMLU-Pro (EM) | 68.9 | 74.1 | 83.8 | 83.1 | 84.0 |
| IF-Eval (Prompt Strict) | 46.6 | 71.7 | 72.0 | 78.1 | 83.3 |
| AlpacaEval2.0 (LC-winrate) | 24.7 | 50.1 | 55.8 | 62.1 | 87.6 |
| ArenaHard (GPT-4-1106) | 53.6 | 77.0 | 73.2 | 75.6 | 92.3 |
| Code | | | | | |
| LiveCodeBench (Pass@1-COT) | 50.0 | 57.5 | 63.5 | 64.6 | 65.9 |
| Codeforces (Percentile) | 80.4 | 84.5 | 90.5 | 92.1 | 96.3 |
| Math | | | | | |
| AIME 2024 (Pass@1) | 77.9 | 59.0 | 74.0 | 78.1 | 79.8 |
| CNMO 2024 (Pass@1) | 88.1 | 58.0 | 73.9 | 77.3 | 78.8 |
Key Observations:
- R1-Zero → Dev1: Instruction-following (IF-Eval, ArenaHard) improves significantly, but reasoning performance (AIME, CNMO) drops due to limited cold-start SFT data.
- Dev1 → Dev2: Reasoning-oriented RL restores and enhances reasoning capabilities (Codeforces, MATH-500).
- Dev2 → Dev3: Incorporating non-reasoning SFT data boosts general capabilities (AlpacaEval2.0, Aider-Polyglot).
- Dev3 → Final R1: The final RL stage on mixed data yields the largest gains in general instruction-following and user preference (AlpacaEval2.0 +25%, ArenaHard +17%), with marginal further improvements in reasoning.
Theoretical and Practical Implications
- Unlocking Latent Potential: The work demonstrates that pre-trained base models inherently possess strong reasoning potential, which can be unlocked not by large-scale human annotation, but by providing hard problems, a reliable verifier, and sufficient RL compute.
- Paradigm Shift in Training: It presents a viable alternative to the dominant SFT+RLHF paradigm, showing that pure outcome-based RL can drive the emergence of sophisticated cognitive strategies without process supervision.
- Towards Superhuman Reasoning: Machines equipped with such RL techniques are poised to surpass human capabilities in domains where a reliable verifier exists, as they can iterate and optimize beyond human thought patterns.
- Practical Resource: The release of the DeepSeek-R1 models and their distilled smaller versions provides the community with a powerful resource for research and application, particularly in STEM fields.
Conclusion
The DeepSeek-R1 project successfully demonstrates that reinforcement learning is a powerful engine for incentivizing and evolving reasoning capabilities in LLMs. The key ingredients are a strong base model, a scalable RL algorithm (GRPO), and precise reward signals for verifiable tasks. The model autonomously develops advanced reasoning patterns like reflection and verification, achieving state-of-the-art results on challenging benchmarks.
Limitations and Future Work include:
- Suboptimal structured output and tool use.
- Token efficiency: Instances of overthinking on simple problems.
- Language mixing in non-English/Chinese queries.
- Prompt sensitivity: Few-shot prompting degrades performance; zero-shot is recommended.
- Reward Hacking: A fundamental challenge for pure RL in domains without reliable rule-based rewards.
- Future directions involve integrating tool-augmented reasoning and developing more robust reward models for complex, less verifiable tasks.