Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

Summary (Overview)

  • Key Contribution: Proposed GOLF, a novel RL framework that leverages Group-level Natural Language Feedback (aggregating external critiques and intra-group attempts) to produce actionable refinements that guide targeted exploration and improve training efficiency.
  • Core Mechanism: Aggregates complementary NL feedback sources, adaptively injects high-quality refinements as off-policy scaffolds in low-reward regimes, and jointly optimizes generation and refinement within a unified RL loop.
  • Performance Gains: Achieves superior performance across both verifiable (math, instruction following, code) and non-verifiable (chat, creative writing) benchmarks, with ~2.2x improvement in sample efficiency compared to scalar-reward-only RL methods.
  • Enhanced Exploration: Maintains higher policy entropy and improves Pass@k scores, indicating broader solution coverage and more diverse exploration.
  • Capability Development: Joint training improves both direct problem-solving and self-refinement capabilities, enabling better utilization of NL feedback at inference time.

Introduction and Theoretical Foundation

Background & Motivation: Reinforcement Learning (RL) for Large Language Models (LLMs), such as RLHF and RLVR, typically relies solely on scalar rewards (success/failure signals). This leads to inefficient exploration, as the policy lacks explicit guidance on how to improve and must rely on costly trial-and-error. In many real-world scenarios, LLMs receive richer Natural Language (NL) feedback (e.g., error diagnoses, revision suggestions), but current RL algorithms (e.g., GRPO) do not fully exploit this information.

Core Insight: NL feedback, when aggregated from multiple complementary sources, can be translated into actionable refinements that provide explicit guidance, densifying learning signals and alleviating exploration bottlenecks in sparse-reward regions.

Theoretical Basis: The framework builds upon Group Relative Policy Optimization (GRPO). The problem is exacerbated when group-normalized advantages collapse (e.g., all-zero reward groups), yielding vanishing gradients. GOLF addresses this by injecting high-quality refinements derived from NL feedback to restore informative advantages and provide targeted guidance.
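To see why all-zero groups stall learning, here is a minimal sketch (not the paper's code) of GRPO-style group normalization; the mean/std form is the standard GRPO choice and an assumption here.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A mixed-outcome group yields informative, nonzero advantages.
mixed = grpo_advantages([1, 0, 0, 1])

# An all-zero (or all-one) reward group collapses: every advantage is 0,
# so the policy gradient contributed by the whole group vanishes.
collapsed = grpo_advantages([0, 0, 0, 0])
```

Injecting a single successful refinement into such a group restores reward variance, and with it a usable gradient.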

Methodology

GOLF consists of three tightly coupled components:

1. Group-level Feedback Aggregated Refinement

For each prompt $x$, sample a group of $N$ responses:

$$G_{\text{gen}}(x) = \{y^{(i)}\}_{i=1}^{N}, \qquad y^{(i)} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x).$$

Receive a scalar reward and a critique for each response: $(r^{(i)}, c^{(i)}) = R(x, y^{(i)})$.

  • External Feedback: the critique $c^{(i)}$ associated with a specific response.
  • Intra-group Feedback: alternative responses within $G_{\text{gen}}(x)$, which contain complementary partial ideas.

Collect the failure set:

$$F(x) = \{\, (y^{(i)}, c^{(i)}) \mid r^{(i)} = 0 \,\}.$$

Construct an aggregated refinement prompt by concatenating the prompt with the failure set:

$$p_{\text{agg}}(x) = \text{CONCAT}(x, F(x)).$$

Conditioned on $p_{\text{agg}}(x)$, sample a refinement group:

$$G_{\text{ref}}(x) = \{\tilde{y}^{(j)}\}_{j=1}^{N}, \qquad \tilde{y}^{(j)} \sim \pi_{\theta_{\text{old}}}(\cdot \mid p_{\text{agg}}(x)),$$

and score each refinement: $\tilde{r}^{(j)} = R(x, \tilde{y}^{(j)})$.
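The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names, the prompt template used for `p_agg`, and the `policy_sample`/`reward_fn` interfaces are all assumptions.

```python
def group_refinement_step(x, policy_sample, reward_fn, N=8):
    """Sketch of group-level feedback-aggregated refinement (illustrative).

    policy_sample(prompt) -> response string
    reward_fn(x, y)       -> (r, c): binary reward and NL critique
    """
    # 1) Sample the generation group G_gen(x) and score it.
    G_gen = [policy_sample(x) for _ in range(N)]
    scored = [(y, *reward_fn(x, y)) for y in G_gen]

    # 2) Collect the failure set F(x): failed responses with their critiques.
    F = [(y, c) for (y, r, c) in scored if r == 0]

    # 3) Build the aggregated refinement prompt p_agg(x) = CONCAT(x, F(x)).
    feedback = "\n\n".join(f"Attempt:\n{y}\nCritique:\n{c}" for (y, c) in F)
    p_agg = (f"{x}\n\nPrevious failed attempts and their critiques:\n"
             f"{feedback}\n\nWrite a corrected response.")

    # 4) Sample the refinement group conditioned on p_agg(x) and score it
    #    against the ORIGINAL prompt x.
    G_ref = [policy_sample(p_agg) for _ in range(N)]
    ref_scored = [(y, reward_fn(x, y)[0]) for y in G_ref]
    return scored, p_agg, ref_scored
```

Note that refinements are scored by the same reward function against the original prompt $x$, so successful refinements are directly comparable to on-policy rollouts.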

2. Adaptive Guidance via Mixed Policy Optimization

Adaptive Injection: Compute the group's average reward:

$$s(x) = \frac{1}{N} \sum_{y \in G_{\text{gen}}(x)} r(x, y).$$

Trigger injection when $s(x)$ falls below a threshold $\tau$ (default: $1/N$). Form the set of successful refinements:

$$S_{\text{ref}}(x) = \{\, \tilde{y} \in G_{\text{ref}}(x) \mid \tilde{r}(x, \tilde{y}) = 1 \,\}.$$

If $S_{\text{ref}}(x) \neq \emptyset$, randomly select $\tilde{y}^\star \in S_{\text{ref}}(x)$ and inject it by replacing one failed response in $G_{\text{gen}}(x)$.
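A minimal sketch of this injection rule, assuming binary rewards (function and variable names are illustrative, not from the paper):

```python
import random

def adaptive_inject(G_gen, gen_rewards, G_ref, ref_rewards, tau=None, rng=random):
    """Sketch of adaptive injection: when the group's mean reward s(x) falls
    below tau (default 1/N), replace one failed on-policy response with a
    randomly chosen successful refinement, restoring nonzero advantages."""
    N = len(G_gen)
    tau = 1.0 / N if tau is None else tau
    s = sum(gen_rewards) / N                     # s(x): group mean reward
    if s >= tau:
        return G_gen, gen_rewards, None          # group is informative enough

    S_ref = [y for y, r in zip(G_ref, ref_rewards) if r == 1]
    if not S_ref:
        return G_gen, gen_rewards, None          # no successful refinement

    y_star = rng.choice(S_ref)                   # scaffold to inject
    # With binary rewards, s < 1/N implies every rollout failed.
    fail_idx = [i for i, r in enumerate(gen_rewards) if r == 0]
    i = rng.choice(fail_idx)                     # replace one failed rollout
    G_aug, R_aug = list(G_gen), list(gen_rewards)
    G_aug[i], R_aug[i] = y_star, 1
    return G_aug, R_aug, i
```

After injection the augmented group mixes outcomes, so group-normalized advantages are no longer all zero.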

Mixed Policy Optimization: Let $G_{\text{aug}}(x) = G_{\text{on}}(x) \cup G_{\text{off}}(x)$, where $G_{\text{on}}$ contains the on-policy rollouts and $G_{\text{off}}$ the injected refinement trajectories. Optimize the mixed objective:

$$J_{\text{Mixed}}(\theta) = \frac{1}{Z} \left[ \sum_{i=1}^{N_{\text{on}}} \sum_{t=1}^{|\tau_i|} \text{CLIP}\left(r^{\text{on}}_{i,t}(\theta), \hat{A}_i, \epsilon\right) + \sum_{j=1}^{N_{\text{off}}} \sum_{t=1}^{|\tau_j|} \text{CLIP}\left(f(r^{\text{off}}_{j,t}(\theta)), \hat{A}_j, \epsilon\right) \right],$$

where:

  • $Z = \sum_{i=1}^{N_{\text{on}}} |\tau_i| + \sum_{j=1}^{N_{\text{off}}} |\tau_j|$ normalizes by the total number of tokens.
  • On-policy ratio: $r^{\text{on}}_{i,t}(\theta) = \dfrac{\pi_\theta(\tau_{i,t} \mid x, \tau_{i,<t})}{\pi_{\theta_{\text{old}}}(\tau_{i,t} \mid x, \tau_{i,<t})}$.
  • Off-policy ratio: $r^{\text{off}}_{j,t}(\theta) = \dfrac{\pi_\theta(\tau_{j,t} \mid x, \tau_{j,<t})}{\pi_{\theta_{\text{old}}}(\tau_{j,t} \mid p_{\text{agg}}(x), \tau_{j,<t})}$; the denominator conditions on $p_{\text{agg}}(x)$ because the injected trajectory was sampled from the refinement prompt, while the numerator conditions on the original prompt $x$.
  • Advantages are computed by normalizing rewards within $G_{\text{aug}}(x)$: $\hat{A}_i = R(\tau_i) - \text{mean}(G_{\text{aug}}(x))$.
  • For off-policy ratios, a reshaping function $f(u) = u/(u+\lambda)$ with $\lambda = 0.1$ is applied and the clip operation is omitted, to emphasize low-probability but effective actions.
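To make the reshaping concrete, here is a numpy sketch of the per-token surrogate. It is an assumption-laden illustration, not the authors' implementation: the PPO-style min/clip form for the on-policy term and all names are assumptions, and, following the note above, the off-policy term uses the reshaped ratio without clipping.

```python
import numpy as np

def reshape_ratio(u, lam=0.1):
    """f(u) = u / (u + lambda): maps the importance ratio into [0, 1),
    with the steepest slope near u = 0, so tokens that were low-probability
    under the old policy but effective receive relatively larger updates."""
    return u / (u + lam)

def mixed_token_loss(logp_new, logp_old, adv, off_mask, eps=0.2, lam=0.1):
    """Illustrative per-token surrogate for the mixed objective: clipped
    ratios for on-policy tokens, reshaped (unclipped) ratios for injected
    off-policy tokens, averaged over all tokens (the 1/Z factor)."""
    u = np.exp(logp_new - logp_old)              # importance ratio
    clipped = np.clip(u, 1.0 - eps, 1.0 + eps)
    on = np.minimum(u * adv, clipped * adv)      # PPO-style clipped surrogate
    off = reshape_ratio(u, lam) * adv            # reshaped, no clip
    per_token = np.where(off_mask, off, on)
    return -per_token.mean()                     # negate: maximize objective
```

Since $f'(u) = \lambda/(u+\lambda)^2$ is largest as $u \to 0$, gradient signal concentrates on low-probability tokens, which is the stated motivation for the reshaping.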

3. Joint Optimization for Self-Refinement

Collect two rollout groups per prompt: a generation group $G_{\text{gen}}(x)$ and a refinement group $G_{\text{ref}}(x)$. Concatenate them into a joint batch $B(x) = G_{\text{gen}}(x) \cup G_{\text{ref}}(x)$. Advantages are computed separately within each group, and the policy $\pi_\theta$ is updated with GRPO in a single RL process. This creates a virtuous cycle: improved self-refinement produces higher-quality scaffolds, which further improve exploration.
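The joint-batch construction can be sketched as follows. The mean/std normalization is GRPO's standard form and an assumption here (the mixed objective above uses a mean-baseline variant); the key point illustrated is that each group is normalized against its own baseline before concatenation.

```python
import numpy as np

def joint_batch_advantages(gen_rewards, ref_rewards, eps=1e-8):
    """Sketch: compute advantages within each group separately, then
    concatenate so one GRPO update trains generation and refinement."""
    def group_norm(r):
        r = np.asarray(r, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    return np.concatenate([group_norm(gen_rewards), group_norm(ref_rewards)])
```

Normalizing per group keeps the two reward distributions comparable: refinements, which succeed more often, do not drown out the harder generation rollouts.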

Empirical Validation / Results

Non-verifiable Tasks

Setup: Trained on Llama-3.1-8B-Instruct and Qwen-3-8B using 7,500 prompts from WildChat-IF. Benchmarks: AlpacaEval-v2, WildBench, Arena-Hard-v1/v2, CreativeWriting-v3. Judge: GPT-4o.

Baselines: Direct-Likert, Pairwise-GRPO, Rubric-as-Reward, Critique-GRPO.

Key Results Table:

| Model | AlpacaEval-v2 LC Win (%) | WildBench Score (%) | Arena-Hard-v1 Win (%) | Arena-Hard-v2 Win (%) | CreativeWriting-v3 LLM Judge (%) | Average (%) |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 31.93 | -8.25 | 30.80 | 5.57 | 53.96 | 24.30 |
| + Direct-Likert | 38.88 | 13.48 | 51.55 | 11.73 | 64.10 | 35.79 |
| + Pairwise-GRPO | 45.47 | 25.54 | 49.20 | 13.30 | 62.95 | 39.94 |
| + Rubric-as-Reward | 42.24 | 26.51 | 52.10 | 15.57 | 68.12 | 40.11 |
| + Critique-GRPO | 47.45 | 25.09 | 50.15 | 13.73 | 65.76 | 40.92 |
| + GOLF | 53.42 | 34.42 | 52.40 | 25.03 | 66.21 | 50.19 |
| Qwen-3-8B | 55.16 | 48.05 | 70.70 | 33.90 | 63.27 | 53.95 |
| + Direct-Likert | 64.84 | 58.01 | 82.75 | 41.70 | 69.56 | 62.99 |
| + Pairwise-GRPO | 66.34 | 67.77 | 81.20 | 50.10 | 68.08 | 66.97 |
| + Rubric-as-Reward | 65.34 | 67.09 | 81.90 | 50.08 | 69.21 | 67.08 |
| + Critique-GRPO | 68.20 | 64.84 | 81.95 | 49.63 | 67.30 | 66.96 |
| + GOLF | 71.80 | 68.16 | 80.90 | 52.00 | 70.78 | 69.26 |

GOLF achieves the best average performance on both models, surpassing the strongest baseline by +9.27 points (Llama) and +2.18 points (Qwen).

Sample Efficiency: GOLF shows roughly a 2.2x improvement in sample efficiency. For example, on AlpacaEval-v2 it matches the strongest baseline's final LC win rate in just 80 training steps (a 2.25x speedup). It also converges to a higher performance ceiling (relative gains of +12.7% on AlpacaEval-v2, +85.2% on WildBench, and +70.7% on Arena-Hard-v2).

Verifiable Tasks

Setup: Models: Qwen-3-4B and Qwen-3-8B. Training data: OpenR1-Math (4k problems), filtered IFTrain (3,798 samples), LCBv6 subset of LiveCodeBench. Benchmarks: AIME24/25, AMC23 (math); IFBench, IFEval (instruction following); LiveCodeBench (code).

Baselines: Refinement-FT, Critique-FT, GRPO, Critique-GRPO, SDPO (for code).

Key Results Table:

| Model | AIME24 (%) | AIME25 (%) | AMC23 (%) | IFBench (%) | IFEval (%) |
|---|---|---|---|---|---|
| Qwen-3-4B | 22.53 | 18.55 | 59.41 | 23.67 | 81.52 |
| + Refinement-FT | 31.67 | 21.25 | 64.06 | 30.44 | 83.73 |
| + Critique-FT | 34.58 | 24.58 | 65.94 | 31.67 | 82.63 |
| + GRPO | 42.72 | 35.42 | 76.85 | 33.33 | 84.45 |
| + Critique-GRPO | 45.72 | 35.89 | 76.14 | 35.67 | 85.21 |
| + GOLF | 49.18 | 38.10 | 77.15 | 37.67 | 86.51 |
| Qwen-3-8B | 27.97 | 19.60 | 61.32 | 27.00 | 83.55 |
| + Refinement-FT | 42.08 | 27.50 | 67.81 | 34.33 | 84.29 |
| + Critique-FT | 46.75 | 28.75 | 70.31 | 33.60 | 84.45 |
| + GRPO | 55.05 | 38.02 | 78.61 | 35.65 | 84.76 |
| + Critique-GRPO | 55.49 | 37.86 | 77.58 | 36.33 | 85.58 |
| + GOLF | 58.49 | 41.65 | 80.74 | 38.33 | 87.80 |

GOLF consistently delivers the strongest results across all verifiable benchmarks, outperforming GRPO and Critique-GRPO.

Pass@k Analysis: Figure 4 shows that GOLF improves both Pass@1 and Pass@k on math reasoning benchmarks, indicating improved single-sample quality and broader solution coverage/diversity.

Code Generation: On LCBv6 with Qwen-3-8B, GOLF achieves an Avg@4 of 47.71, outperforming GRPO by +3.63 points and showing 1.5x sample efficiency. It also slightly outperforms SDPO (47.71 vs. 47.51).

Theoretical and Practical Implications

Significance of Findings:

  1. Complementary Feedback Sources: The aggregation of external critiques (targeted error identification) and intra-group attempts (diverse failure patterns & partial ideas) yields richer refinement contexts and higher-quality refinements than using either source alone.
  2. Adaptive Guidance Mechanism: Injecting high-quality refinements as off-policy scaffolds in low-reward regimes effectively mitigates the exploration bottleneck caused by collapsed group-normalized advantages (e.g., all-zero groups), restoring usable policy gradients.
  3. Joint Optimization Cycle: Training generation and refinement jointly within a unified RL loop creates a virtuous cycle—improved self-refinement produces better scaffolds, which in turn improve exploration, leading to continuous improvement in both capabilities.
  4. Exploration Diversity: GOLF maintains higher policy entropy (Figure 8) and improves Pass@k, demonstrating that it promotes diverse exploration and prevents premature mode collapse.
  5. Practical Efficiency: The framework achieves substantial improvements in sample efficiency (~2.2x) and final performance across diverse task types, making RL training for LLMs more efficient and effective.

Broader Impact: The method provides a scalable path to leverage rich NL feedback (common in real-world interactions) to densify learning signals and guide exploration, reducing reliance on costly trial-and-error. It could lower computational costs and improve reliability in interactive settings.

Potential Risks: Stronger refinement and exploration capabilities may amplify risks related to generating persuasive or strategically optimized content. Bias in LLM-based judges/critiques could be reinforced. Responsible deployment and careful evaluation are necessary.

Conclusion

Main Takeaways: GOLF effectively improves RL exploration for LLMs by aggregating group-level NL feedback (external critiques + intra-group attempts) into actionable refinements, adaptively injecting them as off-policy scaffolds in sparse-reward regions, and jointly optimizing generation and refinement. This leads to:

  • Superior performance on both verifiable and non-verifiable tasks.
  • Significant improvements in sample efficiency (~2.2x).
  • Enhanced exploration diversity (higher entropy, better Pass@k).
  • Improved self-refinement capability.

Future Directions: The method demonstrates that NL guidance is a practical and scalable path to more efficient and diverse exploration in language model RL. Combining GOLF's approach (aggregating diverse failures) with methods like SDPO (leveraging past successes) presents a promising direction for future work.