DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Summary (Overview)

Core Idea: DenoiseRL is a reinforcement learning (RL) framework that improves reasoning in large language models (LLMs) by training them to "recover" from incorrect reasoning steps ("noisy prefixes") generated by a weaker model, instead of relying on stronger teacher models or curated datasets.
Key Mechanism: It injects erroneous partial solutions from a weak model into the policy's training rollouts, forcing the model to learn how to correct mistakes and reach the correct answer from a corrupted intermediate state.
Main Contributions: DenoiseRL raises the average score of standard on-policy RL methods (GRPO and DAPO) across mathematical and general reasoning benchmarks (MATH500, AMC23, AIME2024, AIME2025, BBEH) at two model scales (4B and 8B parameters).
Key Findings: The method induces stronger self-correction behavior, but the noise intensity (prefix ratio and number of denoise rollouts) must be tuned carefully, since too much noise causes overthinking. Applying gradient updates to the off-policy prefix tokens destabilizes training.
Implication: Provides a scalable, resource-efficient pathway for post-training LLMs by turning model failures into a valuable learning signal.

Introduction and Theoretical Foundation

Reinforcement learning has become a dominant method for enhancing the reasoning capabilities of large language models (LLMs). However, state-of-the-art approaches often depend on supervision from even stronger teacher models or require extensive human effort to curate difficult training datasets. This creates a structural limitation: how can we create strong models without relying on pre-existing stronger models?

Prior work follows two main directions:

Weak-to-Strong Generalization: Using weaker models to supervise stronger ones, but performance is capped by the teacher's quality.
Difficulty-Driven Data Synthesis: Creating harder problems, but this requires complex, manual data engineering.

DenoiseRL unifies these ideas by repurposing the weak model not as a teacher, but as a generator of "structured perturbations." It frames reasoning RL as a denoising problem: errors from a weak model are treated as corruptions to a reasoning trajectory, and the policy model is trained to reconstruct a correct solution from these corrupted starting points. This approach increases training difficulty automatically, diversifies the training states (exposing the model to a wider range of failure modes), and directly targets the underdeveloped capability of recovery from mistakes.

Methodology

3.1 Denoising Reasoning

The core idea is to prepend an incorrect partial solution (a "noisy prefix") from a weak model to the policy's generation. The policy is then trained to continue reasoning from this corrupted state to reach the correct answer.

Formal Setup:

A pool of incorrect solutions $\mathcal{W}(q)$ is created offline for each training question $q \in \mathcal{D}$ by sampling a weak model $\pi_w$ $M$ times and keeping only the solutions whose final answer is wrong.
If $\pi_w$ never produces a wrong answer in $M$ trials, $\mathcal{W}(q)$ is empty, and the training proceeds with standard rollouts for that question.
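
The offline pool construction above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_noisy_pool`, `sample_weak`, and `extract_answer` are hypothetical names, and the toy weak model simply cycles through canned strings.

```python
from itertools import cycle
from typing import Callable, List


def build_noisy_pool(
    question: str,
    gold_answer: str,
    sample_weak: Callable[[str], str],      # hypothetical: draws one weak-model solution
    extract_answer: Callable[[str], str],   # hypothetical: parses the final answer
    num_trials: int,                        # M in the text
) -> List[str]:
    """Sample the weak model M times and keep only solutions with a wrong answer."""
    pool = []
    for _ in range(num_trials):
        solution = sample_weak(question)
        if extract_answer(solution) != gold_answer:
            pool.append(solution)
    return pool  # may be empty: the question then uses standard rollouts only


# Toy stand-ins for the weak model and the answer parser (assumptions).
def last_token(solution: str) -> str:
    return solution.split()[-1]


canned = cycle(["2+2 = 5", "2+2 = 4", "2+2 = 22"])
pool = build_noisy_pool("What is 2+2?", "4", lambda q: next(canned), last_token, num_trials=6)
always_right = build_noisy_pool("What is 2+2?", "4", lambda q: "2+2 = 4", last_token, num_trials=6)
```

Here `pool` holds only the wrong solutions, while `always_right` is empty, which corresponds to the fallback case where the question is trained with standard rollouts.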

3.2 Reinforcement Learning for Recovering from Noisy Prefixes

Each training step samples two types of rollouts for a question $q$:

Main Rollouts ($N$ per problem): Standard on-policy generation, $y \sim \pi_{\theta}(\cdot | q)$.
Denoise Rollouts ($K$ per problem): Start from a noisy prefix $w \sim \mathcal{W}(q)$. A prefix of length $p$ is retained using a fixed ratio $\rho$: $p = \max\left(1, \lfloor \rho |w| \rfloor\right)$. The policy then continues from this prefix: $y_{>p} \sim \pi_{\theta}(\cdot | q, w_{1:p})$.
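
The prefix-length rule is simple enough to check by hand. The sketch below assumes $|w|$ is measured in tokens; the function names are illustrative and not taken from the paper.

```python
import math
from typing import List


def prefix_length(num_tokens: int, rho: float) -> int:
    """p = max(1, floor(rho * |w|)): at least one weak-model token is always kept."""
    return max(1, math.floor(rho * num_tokens))


def noisy_prefix(weak_tokens: List[int], rho: float) -> List[int]:
    """Return w_{1:p}, the retained part of the weak-model solution."""
    return weak_tokens[: prefix_length(len(weak_tokens), rho)]


p_long = prefix_length(1500, 0.2)   # a 1500-token weak solution
p_short = prefix_length(3, 0.2)     # floor(0.6) = 0, so the max(1, ...) floor applies
```

With $\rho = 0.2$, a 1500-token weak solution contributes a 300-token prefix, and a very short solution still contributes one token.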

Output Budget and Folding: To ensure a fair comparison, both rollout types share a maximum response length $R$. The complete, "folded" response for a denoise rollout is:

$$\tilde{y} = [\underbrace{w_{1:p}}_{\text{prefix}}, \underbrace{y_{p+1:p+L}}_{\text{continuation}}], \quad p + L \leq R$$

where $L = \min(T_{y_{>p}}, R - p)$ and $T_{y_{>p}}$ is the length of the sampled continuation before truncation. The verifier assigns a terminal reward $r(\tilde{y}; q) \in \{0, 1\}$ based on the final answer. Crucially, gradient updates are applied only to the on-policy continuation tokens $y_{p+1:p+L}$.
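
A minimal sketch of folding and loss masking, assuming token-ID lists and a hypothetical helper `fold_response`. A real pipeline would more likely cap generation at $R - p$ tokens during sampling rather than truncate afterwards; truncation is used here only to keep the example short.

```python
from typing import List, Tuple


def fold_response(
    prefix: List[int],        # w_{1:p}, off-policy tokens from the weak model
    continuation: List[int],  # tokens sampled from the policy after the prefix
    max_len: int,             # shared response budget R
) -> Tuple[List[int], List[int]]:
    """Return (folded tokens, loss mask) with len(tokens) <= max_len."""
    budget = max_len - len(prefix)
    if budget < 0:
        raise ValueError("prefix alone exceeds the response budget")
    kept = continuation[:budget]                 # L = min(T, R - p) tokens survive
    tokens = prefix + kept
    # 0 = off-policy prefix (no gradient), 1 = on-policy continuation (trained)
    loss_mask = [0] * len(prefix) + [1] * len(kept)
    return tokens, loss_mask


tokens, loss_mask = fold_response(prefix=[1, 2, 3], continuation=[4, 5, 6, 7, 8], max_len=6)
```

The mask is what keeps the weak model's tokens out of the gradient while still letting them condition the continuation.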

Token-level GRPO Objective: The advantage is computed per problem group $\mathcal{G}(q)$ containing all $N+K$ rollouts:

$$A_i = \frac{r_i - \mu_q}{\sigma_q + \epsilon}, \quad \mu_q = \frac{1}{N+K}\sum_{j \in \mathcal{G}(q)} r_j, \quad \sigma^2_q = \frac{1}{N+K}\sum_{j \in \mathcal{G}(q)} (r_j - \mu_q)^2$$
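
The group normalization can be illustrated with plain Python. The reward values below are made up for illustration, and `eps=1e-6` is an assumed value since the source does not state $\epsilon$.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards over one problem group of N + K rollouts."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n   # population variance, as in the formula
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative group: 12 main rollouts (6 correct) followed by 4 denoise rollouts (1 correct).
main_rewards = [1.0] * 6 + [0.0] * 6
denoise_rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_advantages(main_rewards + denoise_rewards)

# A group where every rollout gets the same reward carries no learning signal.
flat = group_advantages([1.0] * 16)
```

Because main and denoise rollouts share one group, a successful recovery is scored against the same baseline as an ordinary correct solution.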

The per-token importance ratio is:

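$$r_{i,t}(\theta) = \frac{\pi_{\theta}(y_{i,t} | c_{i,t}, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} | c_{i,t}, y_{i,<t})}$$

where $c_{i,t}$ is the context ($q$ for main rollouts, $(q, w_{1:p_i})$ for denoise rollouts). Note that $r_{i,t}(\theta)$ denotes the importance ratio and is distinct from the scalar reward $r_i$.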

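The PPO clipped surrogate loss for a trajectory is:

$$\mathcal{L}^{\text{PPO}}_i(\theta) = \frac{1}{|\mathcal{T}_i|} \sum_{t \in \mathcal{T}_i} \min\left( r_{i,t}(\theta) \hat{A}_{i,t}, \text{clip}\left(r_{i,t}(\theta), 1-\varepsilon_{\text{low}}, 1+\varepsilon_{\text{high}}\right) \hat{A}_{i,t} \right)$$

Here $\hat{A}_{i,t}$ is the token-level advantage, which in GRPO-style training is the rollout advantage $A_i$ shared by every token of rollout $i$, and $\mathcal{T}_i$ is the set of token positions that receive gradient: all response tokens for a main rollout, and only the continuation tokens $y_{p+1:p+L}$ for a denoise rollout.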
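
A dependency-free sketch of the clipped surrogate for one rollout, written over per-token log-probabilities. It assumes the advantage is shared across tokens and that a 0/1 mask selects the trainable positions; the clipping values `0.2` are placeholders, since the source does not report $\varepsilon_{\text{low}}$ or $\varepsilon_{\text{high}}$.

```python
import math
from typing import List


def clipped_surrogate(
    logp_new: List[float],   # log pi_theta(y_t | context) per token
    logp_old: List[float],   # log pi_theta_old(y_t | context) per token
    advantage: float,        # rollout-level advantage, shared by all tokens (assumption)
    mask: List[int],         # 1 = trainable token, 0 = masked (e.g. off-policy prefix)
    eps_low: float = 0.2,    # placeholder value, not from the source
    eps_high: float = 0.2,   # placeholder value, not from the source
) -> float:
    """Mean clipped surrogate over unmasked tokens (a quantity to maximize)."""
    total, count = 0.0, 0
    for lp_new, lp_old, keep in zip(logp_new, logp_old, mask):
        if not keep:
            continue
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        total += min(ratio * advantage, clipped * advantage)
        count += 1
    return total / count if count else 0.0


on_policy = clipped_surrogate([0.0, 0.0], [0.0, 0.0], advantage=0.5, mask=[1, 1])
clipped_up = clipped_surrogate([math.log(2.0)], [0.0], advantage=1.0, mask=[1])
prefix_masked = clipped_surrogate([5.0, 0.0], [0.0, 0.0], advantage=1.0, mask=[0, 1])
```

In the last call the first token has a very large ratio, but because it is masked it contributes nothing, which is the behavior the method relies on for prefix tokens.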

Joint Objective: The final objective is a weighted mixture:

$$\mathcal{J}(\theta) = \frac{N}{N+K} \mathcal{J}^{\text{main}}(\theta) + \frac{K}{N+K} \mathcal{J}^{\text{denoise}}(\theta)$$

where

$$\mathcal{J}^{\text{main}}(\theta) = \mathbb{E}_{q \sim \mathcal{D}, y \sim \pi^{\text{main}}_{\theta_{\text{old}}}(\cdot|q)} \left[ \mathcal{L}^{\text{PPO}}(\theta; q, y) \right]$$

$$\mathcal{J}^{\text{denoise}}(\theta) = \mathbb{E}_{q \sim \mathcal{D}, w \sim \mathcal{W}(q), y \sim \pi^{\text{denoise}}_{\theta_{\text{old}}}(\cdot|q,w)} \left[ \mathcal{L}^{\text{PPO}}(\theta; q, w_{1:p}, y) \right]$$

The Monte-Carlo estimator optimized each step is:

$$\hat{\mathcal{J}}(\theta) = \frac{1}{B(N+K)} \sum_{b=1}^{B} \left[ \sum_{i \in \mathcal{M}(q_b)} \mathcal{L}^{\text{PPO}}_i(\theta) + \sum_{i \in \mathcal{S}(q_b)} \mathcal{L}^{\text{PPO}}_i(\theta) \right]$$

where $B$ is the batch size, $\mathcal{M}(q_b)$ are the $N$ main rollouts, and $\mathcal{S}(q_b)$ are the $K$ denoise rollouts for question $q_b$.
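
To see why this estimator matches the weighted mixture, the common factor can be split per rollout type. The regrouping below is a worked check rather than a formula from the source, and it assumes every question in the batch has a non-empty pool so that exactly $K$ denoise rollouts are drawn.

```latex
\hat{\mathcal{J}}(\theta)
= \frac{N}{N+K} \cdot
  \underbrace{\frac{1}{BN} \sum_{b=1}^{B} \sum_{i \in \mathcal{M}(q_b)} \mathcal{L}^{\text{PPO}}_i(\theta)}_{\text{estimates } \mathcal{J}^{\text{main}}(\theta)}
+ \frac{K}{N+K} \cdot
  \underbrace{\frac{1}{BK} \sum_{b=1}^{B} \sum_{i \in \mathcal{S}(q_b)} \mathcal{L}^{\text{PPO}}_i(\theta)}_{\text{estimates } \mathcal{J}^{\text{denoise}}(\theta)}
```

Since $\frac{N}{N+K} \cdot \frac{1}{BN} = \frac{K}{N+K} \cdot \frac{1}{BK} = \frac{1}{B(N+K)}$, every rollout carries equal weight. With $N = 12$ and $K = 4$, the main and denoise terms are weighted $0.75$ and $0.25$.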

Empirical Validation / Results

4.1 Settings

Weak Model: Qwen2.5-1.5B-Instruct is used to collect incorrect trajectories from MATH-7.5K.
Policy Models: Qwen3-4B-Base and Qwen3-8B-Base are trained with $N=12$ main rollouts, $K=4$ denoise rollouts, prefix ratio $\rho=0.2$, and response length $R=4096$.
Evaluation: Benchmarks include MATH500, AMC23, AIME2024, AIME2025, and BBEH.
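
For reference, the reported settings can be collected in a small configuration object. The class and field names are hypothetical; only the values come from the settings listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenoiseRLConfig:
    """Hyperparameters reported in the settings; field names are illustrative."""

    weak_model: str = "Qwen2.5-1.5B-Instruct"
    num_main_rollouts: int = 12        # N
    num_denoise_rollouts: int = 4      # K
    prefix_ratio: float = 0.2          # rho
    max_response_length: int = 4096    # R

    @property
    def rollouts_per_problem(self) -> int:
        return self.num_main_rollouts + self.num_denoise_rollouts

    @property
    def denoise_weight(self) -> float:
        """Share of each problem group made up of denoise rollouts, K / (N + K)."""
        return self.num_denoise_rollouts / self.rollouts_per_problem


cfg = DenoiseRLConfig()
```

The total of 16 rollouts per problem matches the budget of the GRPO baseline, so the comparison holds the rollout count fixed.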

4.2 Main Results

DenoiseRL consistently improves the average performance over strong RL baselines (GRPO and DAPO) across both model scales.

Table 1: Main results on mathematical and reasoning benchmarks.

Method	MATH500	AMC23	AIME24	AIME25	BBEH	Avg.
Qwen3-4B-Base
Base	70.0	43.1	8.3	7.7	4.1	26.6
GRPO	83.6	63.1	22.1	18.1	11.1	39.6
DAPO	83.8	62.5	20.6	21.5	10.4	39.8
DenoiseRL-GRPO	85.8	61.4	24.8	23.3	14.8	42.0
DenoiseRL-DAPO	84.6	63.6	21.9	21.7	15.7	41.5
Qwen3-8B-Base
Base	70.4	49.2	11.9	10.8	4.1	29.3
GRPO	87.8	69.7	24.0	22.9	10.6	43.0
DAPO	87.0	69.7	23.8	21.7	11.7	42.8
DenoiseRL-GRPO	87.2	70.3	24.6	23.1	11.5	43.3
DenoiseRL-DAPO	88.2	71.4	27.0	24.8	12.6	44.8

4.3 Intensity of Noise

Prefix Ratio ($\rho$): A larger $\rho$ (longer noisy prefix) induces overthinking: longer self-correction loops, increased response length, and more uncertainty (see Figures 2 and 3). The mild setting $\rho=0.2$ works best.
Number of Denoise Rollouts ($K$): There is a trade-off. $K=1$ provides too sparse a signal and $K=8$ over-emphasizes recovery at the cost of primary problem-solving, while $K=4$ yields the strongest overall improvement (see Figure 4).

4.4 Off-policy Prefix

Critical Finding: Applying PPO updates to the off-policy prefix tokens ($w_{1:p}$) causes severe training instability and collapse (see Figure 5). This is attributed to a large mismatch between the log-probabilities that the current policy and the behavior policy assign to those tokens. DenoiseRL therefore masks these tokens from gradient updates.

4.5 Fairness of Output Budget

Enforcing the length-fair constraint $p + L \leq R$ is necessary for strong performance. Allowing denoise rollouts a longer total budget (up to $p + R$ tokens) leads to verbose, less reliable reasoning and lowers the average score from 42.0 to 40.2 (Table 2), although AIME2025 improves slightly.

Table 2: Effect of length-fair output budget on Qwen3-4B-Base.

Folding Mode	MATH500	AMC23	AIME24	AIME25	BBEH	Avg.
Length-fair	85.8	61.4	24.8	23.3	14.8	42.0
No length cap	84.2	60.6	18.8	24.2	13.5	40.2

4.6 Training Time Efficiency

DenoiseRL adds a modest per-step time overhead (about 49.7 s vs. 43.8 s for GRPO, roughly 13% more) because the model generates longer continuations as it rethinks and repairs reasoning (see Figure 6). The cost remains in the same regime while delivering higher accuracy.

Table 3: Average training time per step on Qwen3-4B-Base.

Method	Rollouts / problem	Time (s / step)
GRPO baseline	16 on-policy	43.8
DenoiseRL-GRPO	12 + 4	49.7

4.7 Case Study

Qualitative analysis (see Table 4) shows that DenoiseRL induces genuine recovery behavior. The model's continuation does not blindly follow the erroneous prefix. Instead, it re-evaluates the problem, preserves useful partial reasoning, and corrects the specific failure modes to reach the correct answer. Supplementary cases (Tables 5 and 6 in the Appendix) further illustrate this ability to switch strategies and use more efficient solution methods.

Theoretical and Practical Implications

Theoretical: DenoiseRL offers a new perspective on scalable post-training, showing that model mistakes can be a powerful source of learning signal. It elevates self-correction from an emergent behavior to a direct training target.
Practical: The method reduces dependency on expensive external resources (stronger teachers, curated datasets). It provides a more scalable and resource-efficient pathway for improving reasoning capabilities in LLMs.
RL Design: The findings highlight the importance of carefully handling off-policy data in LLM RL and the need to balance recovery training with the primary objective of problem-solving.

Conclusion

DenoiseRL is a recovery-oriented RL framework that improves reasoning by training models to recover from incorrect reasoning trajectories generated by weak models. It converts weak-model failures into structured perturbations, increasing reasoning difficulty and diversity in a scalable way.

Key Takeaways:

DenoiseRL improves average performance over both GRPO and DAPO at both model scales, though not on every individual benchmark.
It strengthens the model's self-correction and recovery capability.
Careful design is required: mask off-policy prefixes, use a mild noise intensity, and enforce a fair output budget.
The method changes reasoning dynamics: strong corruption can induce overthinking, with longer self-correction loops and longer responses.

Limitations & Future Work:

Effectiveness depends on the quality and diversity of errors from the weak model.
Stronger recovery supervision can amplify overthinking, increasing inference cost.
Future work should balance recovery gains with decoding efficiency and explore the limits of bootstrap-from-noise paradigms.