# DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

> DeepSeek-R1 demonstrates that pure reinforcement learning with outcome-based rewards can unlock advanced reasoning capabilities in LLMs, surpassing supervised methods and achieving state-of-the-art performance on complex tasks like the AIME math competition.

- **Source:** [arXiv](https://arxiv.org/abs/2501.12948)
- **Published:** 2026-03-07
- **Permalink:** https://picx.dev/p/7rYbgf
- **Whiteboard:** https://picx.dev/p/7rYbgf/image

## Summary

*   **Pure RL for Reasoning:** The paper demonstrates that advanced reasoning capabilities in Large Language Models (LLMs) can be incentivized through **pure reinforcement learning (RL)**, bypassing the need for human-labeled reasoning trajectories (Supervised Fine-Tuning). The model, DeepSeek-R1-Zero, is trained from a base checkpoint using only outcome-based rewards for correctness.
*   **Emergence of Sophisticated Behaviors:** The RL process facilitates the **autonomous emergence** of advanced reasoning patterns such as self-reflection, verification, and dynamic strategy adaptation. The model naturally learns to generate longer, more detailed Chain-of-Thought (CoT) reasoning, with its average response length increasing significantly during training.
*   **State-of-the-Art Performance:** The final model, **DeepSeek-R1**, achieves superior performance on verifiable reasoning tasks. For example, it reaches a 79.8% Pass@1 accuracy on the AIME 2024 math competition, significantly surpassing the average human competitor score and outperforming models trained with conventional supervised learning.
*   **Multi-Stage Training Pipeline:** To address issues of readability and language mixing in the pure RL model, the authors develop **DeepSeek-R1** through a multi-stage pipeline integrating rejection sampling, RL with model-based rewards (for helpfulness/safety), and supervised fine-tuning on mixed reasoning and non-reasoning data.
*   **Capability Distillation:** The strong reasoning capabilities developed in the large-scale model can be systematically transferred to smaller, publicly released models via distillation, enhancing their performance beyond their original instruction-tuned counterparts.

## Introduction and Theoretical Foundation
General reasoning remains a formidable challenge in AI. While LLMs and techniques like Chain-of-Thought (CoT) prompting have shown success, they are heavily dependent on **human-annotated demonstrations**, which limits scalability, introduces cognitive biases, and caps performance at human exemplar levels. This paper posits that LLMs possess latent reasoning potential that can be unlocked through **self-evolution in an RL framework**, minimizing reliance on human labeling.

The core hypothesis is that **human-defined reasoning patterns may limit model exploration**. Instead of teaching the model *how* to reason via supervised data, the proposed method provides the right **incentives** (rewards for correct final answers) and allows the model to autonomously discover effective, and potentially novel, problem-solving strategies. This approach is built upon **DeepSeek-V3-Base** and employs the **Group Relative Policy Optimization (GRPO)** RL algorithm for efficiency.

## Methodology

### 1. Reinforcement Learning Framework: Group Relative Policy Optimization (GRPO)
The authors adopt GRPO to simplify training and reduce resource consumption compared to Proximal Policy Optimization (PPO). For each question $q$, GRPO samples a group of outputs $\{ o_1, o_2, \cdots, o_G \}$ from the old policy $\pi_{\theta_{old}}$ and optimizes the policy model $\pi_\theta$ by maximizing the objective:

$$
J_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^G \left( \min\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip}\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1-\varepsilon, 1+\varepsilon \right) A_i \right) - \beta D_{KL} \left( \pi_\theta || \pi_{ref} \right) \right) \right]
$$

where the KL divergence term is estimated for each sampled output $o_i$ as:
$$
D_{KL}\left( \pi_\theta || \pi_{ref} \right) = \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1
$$

The advantage $A_i$ is computed using a group of rewards $\{ r_1, r_2, ..., r_G \}$:
$$
A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \cdots, r_G\})}{\text{std}(\{r_1, r_2, \cdots, r_G\})}
$$

Here, $\pi_{ref}$ is a reference policy, and $\varepsilon$ and $\beta$ are hyperparameters.
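The three formulas above can be tied together in a few lines of code. The following is a minimal sketch at the sequence level, not the authors' implementation: the function names, the use of the population standard deviation, the small `eps` guard against a zero standard deviation, and the default `clip_eps=0.2` are all assumptions made for illustration. Only the KL coefficient default (0.001) comes from the hyperparameters reported in this summary.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / std(r), computed within one sampled group.

    Assumption: population std plus a small eps so that a group with
    identical rewards yields zero advantages instead of dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_estimate(logp_theta, logp_ref):
    """pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, which is always >= 0."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


def grpo_objective(logp_theta, logp_old, logp_ref, rewards,
                   clip_eps=0.2, beta=0.001):
    """Average clipped surrogate minus the KL penalty over one group.

    Each list holds one log-probability log pi(o_i | q) per sampled output.
    clip_eps is an illustrative value; the summary does not report it.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for lt, lo, lr, adv in zip(logp_theta, logp_old, logp_ref, advantages):
        ratio = math.exp(lt - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * adv, clipped * adv)
        total += surrogate - beta * kl_estimate(lt, lr)
    return total / len(rewards)
```

Because advantages are normalized within the group, no separate value model is needed to provide a baseline, which is where the resource savings relative to PPO come from.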

### 2. DeepSeek-R1-Zero (Pure RL)
*   **Training:** Starts directly from **DeepSeek-V3-Base** with no SFT phase. Uses a simple template requiring reasoning within `<think>...</think>` tags and a final answer within `<answer>...</answer>` tags.
*   **Reward Design:** Uses **rule-based rewards**.
    *   **Accuracy Reward:** Based on the correctness of the final answer (e.g., exact match for math, test case pass rate for code).
    *   **Format Reward:** Incentivizes proper use of the specified reasoning and answer tags.
    *   Combined reward: $Reward_{rule} = Reward_{acc} + Reward_{format}$
*   **Key Hyperparameters:** Learning rate = 3e-6, KL coefficient = 0.001, group size G = 16, maximum response length increased from 32,768 to 65,536 tokens mid-training.
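A rule-based reward of this kind can be sketched with a regular expression and a string comparison. This is an illustration under stated assumptions rather than the paper's grader: it assumes the reasoning tags are `<think>...</think>`, that each component is worth 1.0, and that correctness is a whitespace-trimmed exact match. The names `format_reward`, `accuracy_reward`, and `rule_reward` are hypothetical.

```python
import re

# The whole response must be a reasoning block followed by an answer block.
TEMPLATE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response):
    """1.0 if the response follows the template, else 0.0 (assumed scale)."""
    return 1.0 if TEMPLATE.fullmatch(response) else 0.0


def accuracy_reward(response, reference):
    """1.0 if the first <answer> block exactly matches the reference."""
    match = ANSWER.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


def rule_reward(response, reference):
    """Reward_rule = Reward_acc + Reward_format, with equal weights."""
    return accuracy_reward(response, reference) + format_reward(response)


sample = "<think>2 + 2 = 4</think>\n<answer>4</answer>"
print(rule_reward(sample, "4"))
```

A real grader for math would normalize equivalent expressions, and a code grader would run test cases instead of comparing strings; both are outside the scope of this sketch.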

### 3. DeepSeek-R1 (Multi-Stage Pipeline)
To improve readability, language consistency, and general capability, DeepSeek-R1 is trained via a multi-stage pipeline (see Figure 2 in paper).

*   **Model-Based Rewards:** Used for general (non-reasoning) data.
    *   **Helpful Reward Model ($RM_{helpful}$):** Trained on 66K preference pairs generated and judged by DeepSeek-V3. Evaluates only the final summary. $Reward_{helpful} = RM_{helpful}(Response_A, Response_B)$
    *   **Safety Reward Model ($RM_{safety}$):** Trained on 106K prompts with point-wise "safe"/"unsafe" labels. Evaluates the entire response. $Reward_{safety} = RM_{safety}(Response)$
*   **Language Consistency Reward:** Added to mitigate language mixing. $Reward_{language} = \frac{Num(Words_{target})}{Num(Words)}$
*   **Final Combined Reward:** For a batch of mixed data, the reward is:
    $$
    Reward = Reward_{reasoning} + Reward_{general} + Reward_{language}
    $$
    where $Reward_{reasoning} = Reward_{rule}$ and $Reward_{general} = Reward_{reward\_model} + Reward_{format}$.
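The language consistency ratio and the combined reward are simple enough to write out directly. The sketch below is illustrative only: the word-level language detector is a toy stand-in (it treats ASCII-only tokens as the target language), whitespace tokenization is assumed, and terms that do not apply to a given sample are assumed to be passed as zero. None of these choices are specified in this summary.

```python
def language_reward(words, is_target_language):
    """Num(Words_target) / Num(Words); returns 0.0 for an empty response."""
    if not words:
        return 0.0
    hits = sum(1 for word in words if is_target_language(word))
    return hits / len(words)


def is_ascii_word(word):
    """Toy detector (assumption): ASCII-only tokens count as English."""
    return word.isascii()


def combined_reward(rule=0.0, reward_model=0.0, fmt=0.0, language=0.0):
    """Reward = Reward_reasoning + Reward_general + Reward_language.

    Reward_reasoning = Reward_rule
    Reward_general   = Reward_reward_model + Reward_format
    Assumption: components that do not apply to a sample are left at 0.0.
    """
    reasoning = rule
    general = reward_model + fmt
    return reasoning + general + language


words = "the answer 是 42".split()
lang = language_reward(words, is_ascii_word)  # 3 of 4 tokens are ASCII
print(lang, combined_reward(rule=2.0, language=lang))
```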

## Empirical Validation / Results

### Key Training Dynamics of DeepSeek-R1-Zero
*   **Performance Improvement:** On the AIME 2024 benchmark, Pass@1 accuracy increased from **15.6% to 77.9%** during RL training. With self-consistency (Cons@16), accuracy reached **86.7%**, surpassing the average human competitor score.
*   **Emergent Long CoT:** The average response length increased dramatically during training (see Figure 1b), indicating the model autonomously learned to "think longer" for complex problems.
*   **Emergent Reasoning Behaviors:** The model developed advanced strategies like **self-reflection** (e.g., using phrases like "Wait, let's reevaluate") and exploring alternative solutions within a single response.
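The gap between Pass@1 and Cons@16 comes from how the sampled answers are aggregated. As a rough sketch for a single problem, Pass@1 can be read as the fraction of sampled responses that are correct, while Cons@k scores only the majority-vote answer among the k samples. The function names and the example answers below are hypothetical, and the tie-breaking rule is an assumption.

```python
from collections import Counter


def pass_at_1(sampled_answers, reference):
    """Fraction of sampled answers that match the reference."""
    correct = sum(answer == reference for answer in sampled_answers)
    return correct / len(sampled_answers)


def cons_at_k(sampled_answers, reference):
    """1.0 if the most frequent answer matches the reference, else 0.0.

    Assumption: ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if majority == reference else 0.0


# 16 hypothetical samples for one problem whose reference answer is "204".
samples = ["204"] * 9 + ["210"] * 4 + ["96"] * 3
print(pass_at_1(samples, "204"))  # 9 / 16 = 0.5625
print(cons_at_k(samples, "204"))  # majority answer is "204", so 1.0
```

In this example the model is right only 56% of the time per sample, yet the vote is correct, which is why consensus accuracy can sit well above Pass@1 on the same benchmark.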

### Benchmark Performance of DeepSeek-R1 Series
The table below shows the progression of performance through the multi-stage pipeline (R1-Zero → Dev1 → Dev2 → Dev3 → Final R1):

**Table 3 | Experimental results at each stage of DeepSeek-R1.** (Abbreviated; see paper for full table)
| Benchmark (Metric)               | R1-Zero | R1-Dev1 | R1-Dev2 | R1-Dev3 | R1      |
|----------------------------------|---------|---------|---------|---------|---------|
| **English**                      |         |         |         |         |         |
| MMLU-Pro (EM)                    | 68.9    | 74.1    | 83.8    | 83.1    | **84.0**|
| IF-Eval (Prompt Strict)          | 46.6    | 71.7    | 72.0    | 78.1    | **83.3**|
| AlpacaEval2.0 (LC-winrate)       | 24.7    | 50.1    | 55.8    | 62.1    | **87.6**|
| ArenaHard (GPT-4-1106)           | 53.6    | 77.0    | 73.2    | 75.6    | **92.3**|
| **Code**                         |         |         |         |         |         |
| LiveCodeBench (Pass@1-COT)       | 50.0    | 57.5    | 63.5    | 64.6    | **65.9**|
| Codeforces (Percentile)          | 80.4    | 84.5    | 90.5    | 92.1    | **96.3**|
| **Math**                         |         |         |         |         |         |
| AIME 2024 (Pass@1)               | 77.9    | 59.0    | 74.0    | 78.1    | **79.8**|
| CNMO 2024 (Pass@1)               | **88.1**| 58.0    | 73.9    | 77.3    | 78.8    |

**Key Observations:**
1.  **R1-Zero → Dev1:** Instruction-following (IF-Eval, ArenaHard) improves significantly, but reasoning performance (AIME, CNMO) drops due to limited cold-start SFT data.
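2.  **Dev1 → Dev2:** Reasoning-oriented RL restores and extends reasoning capabilities (e.g., AIME 2024 rises from 59.0 to 74.0 and Codeforces from 84.5 to 90.5; MATH-500 is reported in the paper's full table).
3.  **Dev2 → Dev3:** Incorporating non-reasoning SFT data boosts general capabilities (e.g., AlpacaEval2.0 rises from 55.8 to 62.1; Aider-Polyglot is reported in the paper's full table).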
4.  **Dev3 → Final R1:** The final RL stage on mixed data yields the largest gains in **general instruction-following and user preference** (AlpacaEval2.0 +25.5 points, ArenaHard +16.7 points), with marginal further improvements in reasoning (AIME 2024: 78.1 → 79.8).
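The stage-to-stage changes discussed above can be recomputed from the abbreviated table. The short script below copies three rows from the table and prints the difference, in percentage points, between consecutive stages; the helper name `stage_deltas` is just for illustration.

```python
STAGES = ["R1-Zero", "R1-Dev1", "R1-Dev2", "R1-Dev3", "R1"]

# Scores copied from the abbreviated table, one value per stage.
TABLE = {
    "AlpacaEval2.0": [24.7, 50.1, 55.8, 62.1, 87.6],
    "ArenaHard": [53.6, 77.0, 73.2, 75.6, 92.3],
    "AIME 2024": [77.9, 59.0, 74.0, 78.1, 79.8],
}


def stage_deltas(scores):
    """Difference between each stage and the previous one, in points."""
    return [round(b - a, 1) for a, b in zip(scores, scores[1:])]


for name, scores in TABLE.items():
    steps = zip(STAGES, STAGES[1:], stage_deltas(scores))
    summary = ", ".join(f"{src} -> {dst}: {d:+.1f}" for src, dst, d in steps)
    print(f"{name}: {summary}")
```

The last transition accounts for +25.5 points on AlpacaEval2.0 and +16.7 on ArenaHard, while the first transition costs 18.9 points on AIME 2024, matching the observations in the list.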

## Theoretical and Practical Implications
*   **Unlocking Latent Potential:** The work demonstrates that **pre-trained base models inherently possess strong reasoning potential**, which can be unlocked not by large-scale human annotation, but by providing hard problems, a reliable verifier, and sufficient RL compute.
*   **Paradigm Shift in Training:** It presents a viable alternative to the dominant SFT+RLHF paradigm, showing that **pure outcome-based RL can drive the emergence of sophisticated cognitive strategies** without process supervision.
*   **Towards Superhuman Reasoning:** Machines equipped with such RL techniques are poised to surpass human capabilities in domains where a reliable verifier exists, as they can iterate and optimize beyond human thought patterns.
*   **Practical Resource:** The release of the DeepSeek-R1 models and their distilled smaller versions provides the community with a powerful resource for research and application, particularly in STEM fields.

## Conclusion
The DeepSeek-R1 project successfully demonstrates that **reinforcement learning is a powerful engine for incentivizing and evolving reasoning capabilities in LLMs**. The key ingredients are a strong base model, a scalable RL algorithm (GRPO), and precise reward signals for verifiable tasks. The model autonomously develops advanced reasoning patterns like reflection and verification, achieving state-of-the-art results on challenging benchmarks.

**Limitations and Future Work** include:
*   **Suboptimal structured output and tool use.**
*   **Token efficiency:** Instances of overthinking on simple problems.
*   **Language mixing** when queries are in languages other than English or Chinese.
*   **Prompt sensitivity:** Few-shot prompting degrades performance; zero-shot is recommended.
*   **Reward Hacking:** A fundamental challenge for pure RL in domains without reliable rule-based rewards.
*   Future directions involve integrating **tool-augmented reasoning** and developing more **robust reward models** for complex, less verifiable tasks.

---

_Markdown view of https://picx.dev/p/7rYbgf, served by PicX — AI-generated visual whiteboard summaries of research papers._
