Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Summary (Overview)
- Key Contribution: Proposed GOLF, a novel RL framework that leverages Group-level Natural Language Feedback (aggregating external critiques and intra-group attempts) to produce actionable refinements that guide targeted exploration and improve training efficiency.
- Core Mechanism: Aggregates complementary NL feedback sources, adaptively injects high-quality refinements as off-policy scaffolds in low-reward regimes, and jointly optimizes generation and refinement within a unified RL loop.
- Performance Gains: Achieves superior performance across both verifiable (math, instruction following, code) and non-verifiable (chat, creative writing) benchmarks, with ~2.2x improvement in sample efficiency compared to scalar-reward-only RL methods.
- Enhanced Exploration: Maintains higher policy entropy and improves Pass@k scores, indicating broader solution coverage and more diverse exploration.
- Capability Development: Joint training improves both direct problem-solving and self-refinement capabilities, enabling better utilization of NL feedback at inference time.
Introduction and Theoretical Foundation
Background & Motivation: Reinforcement Learning (RL) for Large Language Models (LLMs), such as RLHF and RLVR, typically relies solely on scalar rewards (success/failure signals). This leads to inefficient exploration, as the policy lacks explicit guidance on how to improve and must rely on costly trial-and-error. In many real-world scenarios, LLMs receive richer Natural Language (NL) feedback (e.g., error diagnoses, revision suggestions), but current RL algorithms (e.g., GRPO) do not fully exploit this information.
Core Insight: NL feedback, when aggregated from multiple complementary sources, can be translated into actionable refinements that provide explicit guidance, densifying learning signals and alleviating exploration bottlenecks in sparse-reward regions.
Theoretical Basis: The framework builds upon Group Relative Policy Optimization (GRPO). The problem is exacerbated when group-normalized advantages collapse (e.g., all-zero reward groups), yielding vanishing gradients. GOLF addresses this by injecting high-quality refinements derived from NL feedback to restore informative advantages and provide targeted guidance.
Methodology
GOLF consists of three tightly coupled components:
1. Group-level Feedback Aggregated Refinement
For each prompt $x$, sample a group of $G$ responses: $\{y^{(i)}\}_{i=1}^{G} \sim \pi_\theta(\cdot \mid x)$.
For each response, receive a scalar reward and a critique: $(r^{(i)}, c^{(i)})$.
- External Feedback: Critique $c^{(i)}$ associated with a specific response.
- Intra-group Feedback: Alternative responses within the group, containing complementary partial ideas.
Collect the failure set of failed responses paired with their critiques: $\mathcal{F} = \{(y^{(i)}, c^{(i)}) \mid r^{(i)} = 0\}$.
Construct an aggregated refinement prompt by concatenating the prompt and the failure set: $x_{\text{ref}} = [x; \mathcal{F}]$.
Conditioned on $x_{\text{ref}}$, sample a refinement group $\{\tilde{y}^{(j)}\}_{j=1}^{M} \sim \pi_\theta(\cdot \mid x_{\text{ref}})$ and score each refinement: $\tilde{r}^{(j)}$.
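The aggregation step above can be sketched as follows. This is a minimal illustration with toy stand-ins: `sample_responses` and `score` are hypothetical placeholders for the policy and the reward/critique source, not the paper's implementation.

```python
# Sketch of Step 1: build the aggregated refinement prompt x_ref = [x; F].
# `sample_responses` and `score` are hypothetical stand-ins for the policy
# pi_theta and the reward model / external critic.
from typing import List, Tuple

def sample_responses(prompt: str, g: int) -> List[str]:
    # Stand-in for sampling G responses from pi_theta(.|x).
    return [f"attempt-{i} to: {prompt}" for i in range(g)]

def score(prompt: str, response: str) -> Tuple[float, str]:
    # Stand-in for a scalar reward plus an NL critique per response.
    ok = "attempt-0" in response
    return (1.0 if ok else 0.0), ("looks correct" if ok else "step 2 is wrong")

def build_refinement_prompt(prompt: str, g: int = 4) -> str:
    responses = sample_responses(prompt, g)
    scored = [(y,) + score(prompt, y) for y in responses]
    # Failure set F: failed responses paired with their critiques.
    failures = [(y, c) for (y, r, c) in scored if r == 0.0]
    # Aggregated refinement prompt concatenates x with F.
    blocks = "\n".join(f"Attempt: {y}\nCritique: {c}" for y, c in failures)
    return (f"{prompt}\n\nPrevious failed attempts and critiques:\n{blocks}"
            "\n\nProduce a refined answer.")
```

Aggregating all failures into one prompt is what lets external critiques and intra-group attempts complement each other in a single refinement context.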
2. Adaptive Guidance via Mixed Policy Optimization
Adaptive Injection: Compute the group's average reward: $\bar{r} = \frac{1}{G} \sum_{i=1}^{G} r^{(i)}$.
Trigger injection when $\bar{r}$ falls below a threshold $\delta$. Form the set of successful refinements: $\mathcal{S} = \{\tilde{y}^{(j)} \mid \tilde{r}^{(j)} > \bar{r}\}$.
If $\mathcal{S} \ne \emptyset$, randomly select $\tilde{y} \in \mathcal{S}$ and inject it by replacing one failed response in the group.
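The injection trigger can be sketched as below. The threshold default and the tie-breaking rule for which failed response to replace are illustrative assumptions, not values from the paper.

```python
# Sketch of adaptive injection: only scaffold the group when its average
# reward is low (the policy is stuck) and a genuinely better refinement exists.
import random

def adaptive_inject(rewards, refinement_rewards, threshold=0.25):
    """Return (index of failed response to replace, index of refinement to
    inject), or None if no injection is triggered. `threshold` is a
    hypothetical default, not the paper's setting."""
    mean_r = sum(rewards) / len(rewards)
    if mean_r >= threshold:
        return None  # group already yields informative advantages
    # S: refinements whose reward beats the group average.
    successes = [j for j, r in enumerate(refinement_rewards) if r > mean_r]
    if not successes:
        return None
    # Replace the worst on-policy response (one illustrative choice).
    fail_idx = min(range(len(rewards)), key=lambda i: rewards[i])
    return fail_idx, random.choice(successes)
```

Gating on low average reward keeps training on-policy whenever ordinary group-normalized advantages already carry signal, so the scaffolds are spent only where exploration stalls.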
Mixed Policy Optimization: Let $\mathcal{G} = \mathcal{G}_{\text{on}} \cup \mathcal{G}_{\text{off}}$, where $\mathcal{G}_{\text{on}}$ contains the on-policy rollouts and $\mathcal{G}_{\text{off}}$ the injected refinement trajectories. Optimize using a mixed objective:

$$\mathcal{J}(\theta) = \frac{1}{Z} \sum_{y \in \mathcal{G}} \sum_{t} \begin{cases} \min\big(\rho_t A, \operatorname{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, A\big) & y \in \mathcal{G}_{\text{on}} \\ f(\rho_t)\, A & y \in \mathcal{G}_{\text{off}} \end{cases}$$

where:
- $Z$ normalizes by total tokens.
- $\rho_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}$ is the token-level importance ratio.
- Injected refinements were sampled conditioned on $x_{\text{ref}}$ rather than $x$, so their ratios are off-policy with respect to the original prompt.
- Advantages are computed by normalizing rewards within $\mathcal{G}$: $A^{(i)} = \frac{r^{(i)} - \operatorname{mean}(\{r\})}{\operatorname{std}(\{r\})}$.
- For off-policy ratios, a reshaping function $f(\rho_t)$ with a shaping hyperparameter is applied, and the clip operation is omitted to emphasize low-probability effective actions.
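The advantage normalization and ratio reshaping can be sketched as follows. The exact reshaping function is not reproduced here, so the sketch uses $f(\rho) = \rho / (\rho + \gamma)$, a LUFFY-style shaping common in off-policy-scaffold methods, as an illustrative assumption.

```python
# Sketch of mixed-group advantages and off-policy ratio reshaping.
# f(rho) = rho / (rho + gamma) is an assumed form, not taken from the paper.
import math

def group_advantages(rewards):
    """Normalize rewards within the mixed group G: A_i = (r_i - mean) / std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

def reshape_off_policy(rho, gamma=0.1):
    """Monotone reshaping of an off-policy importance ratio. Unlike clipping,
    it preserves gradient signal on low-probability but effective tokens."""
    return rho / (rho + gamma)
```

Note how a single injected success in an otherwise all-zero group restores a positive advantage, which is exactly what rescues the vanishing-gradient case.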
3. Joint Optimization for Self-Refinement
Collect two rollout groups: a generation group $\mathcal{G}_{\text{gen}}$ (responses to $x$) and a refinement group $\mathcal{G}_{\text{ref}}$ (responses to $x_{\text{ref}}$). Concatenate them into a joint batch $\mathcal{B} = \mathcal{G}_{\text{gen}} \cup \mathcal{G}_{\text{ref}}$. Advantages within each group are computed separately, and the policy is updated using GRPO within a single RL process. This creates a virtuous cycle: improved self-refinement produces higher-quality scaffolds, which further improve exploration.
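The joint-batch construction can be sketched as below; per-group advantage normalization followed by concatenation is the whole trick, since the single GRPO update then sees both tasks. The `(sample, reward)` tuple representation is an assumption for illustration.

```python
# Sketch of joint optimization: normalize advantages per group, then merge
# generation and refinement rollouts into one batch for a single GRPO update.
def joint_batch(gen_group, ref_group):
    """Each group is a list of (sample, reward) tuples; returns a combined
    list of (sample, advantage) for one policy update."""
    batch = []
    for group in (gen_group, ref_group):
        rewards = [r for _, r in group]
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
        for sample, r in group:
            # Advantages are normalized within each group separately.
            batch.append((sample, (r - mean) / (std + 1e-8)))
    return batch
```

Normalizing per group matters: a uniformly successful refinement group contributes zero advantage rather than drowning out the generation group's signal.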
Empirical Validation / Results
Non-verifiable Tasks
Setup: Trained on Llama-3.1-8B-Instruct and Qwen-3-8B using 7,500 prompts from WildChat-IF. Benchmarks: AlpacaEval-v2, WildBench, Arena-Hard-v1/v2, CreativeWriting-v3. Judge: GPT-4o.
Baselines: Direct-Likert, Pairwise-GRPO, Rubric-as-Reward, Critique-GRPO.
Key Results Table:
| Model | AlpacaEval-v2 (LC Win Rate %) | WildBench (Score %) | Arena-Hard-v1 (Win Rate %) | Arena-Hard-v2 (Win Rate %) | CreativeWriting-v3 (LLM Judge %) | Average Score (%) |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 31.93 | -8.25 | 30.80 | 5.57 | 53.96 | 24.30 |
| + Direct-Likert | 38.88 | 13.48 | 51.55 | 11.73 | 64.10 | 35.79 |
| + Pairwise-GRPO | 45.47 | 25.54 | 49.20 | 13.30 | 62.95 | 39.94 |
| + Rubric-as-Reward | 42.24 | 26.51 | 52.10 | 15.57 | 68.12 | 40.11 |
| + Critique-GRPO | 47.45 | 25.09 | 50.15 | 13.73 | 65.76 | 40.92 |
| + GOLF | 53.42 | 34.42 | 52.40 | 25.03 | 66.21 | 50.19 |
| Qwen-3-8B | 55.16 | 48.05 | 70.70 | 33.90 | 63.27 | 53.95 |
| + Direct-Likert | 64.84 | 58.01 | 82.75 | 41.70 | 69.56 | 62.99 |
| + Pairwise-GRPO | 66.34 | 67.77 | 81.20 | 50.10 | 68.08 | 66.97 |
| + Rubric-as-Reward | 65.34 | 67.09 | 81.90 | 50.08 | 69.21 | 67.08 |
| + Critique-GRPO | 68.20 | 64.84 | 81.95 | 49.63 | 67.30 | 66.96 |
| + GOLF | 71.80 | 68.16 | 80.90 | 52.00 | 70.78 | 69.26 |
GOLF achieves the best average performance on both models, surpassing the strongest baseline by +9.27 points (Llama) and +2.18 points (Qwen).
Sample Efficiency: GOLF shows ~2.2x improvement in sample efficiency. For example, on AlpacaEval-v2, it matches the baseline's final LC win rate in just 80 steps (2.25x efficiency). It also converges to a higher performance ceiling (+12.7% on AlpacaEval-v2, +85.2% on WildBench, +70.7% on Arena-Hard-v2).
Verifiable Tasks
Setup: Models: Qwen-3-4B and Qwen-3-8B. Training data: OpenR1-Math (4k problems), filtered IFTrain (3,798 samples), LCBv6 subset of LiveCodeBench. Benchmarks: AIME24/25, AMC23 (math); IFBench, IFEval (instruction following); LiveCodeBench (code).
Baselines: Refinement-FT, Critique-FT, GRPO, Critique-GRPO, SDPO (for code).
Key Results Table:
| Model | AIME24 (%) | AIME25 (%) | AMC23 (%) | IFBench (%) | IFEval (%) |
|---|---|---|---|---|---|
| Qwen-3-4B | 22.53 | 18.55 | 59.41 | 23.67 | 81.52 |
| + Refinement-FT | 31.67 | 21.25 | 64.06 | 30.44 | 83.73 |
| + Critique-FT | 34.58 | 24.58 | 65.94 | 31.67 | 82.63 |
| + GRPO | 42.72 | 35.42 | 76.85 | 33.33 | 84.45 |
| + Critique-GRPO | 45.72 | 35.89 | 76.14 | 35.67 | 85.21 |
| + GOLF | 49.18 | 38.10 | 77.15 | 37.67 | 86.51 |
| Qwen-3-8B | 27.97 | 19.60 | 61.32 | 27.00 | 83.55 |
| + Refinement-FT | 42.08 | 27.50 | 67.81 | 34.33 | 84.29 |
| + Critique-FT | 46.75 | 28.75 | 70.31 | 33.60 | 84.45 |
| + GRPO | 55.05 | 38.02 | 78.61 | 35.65 | 84.76 |
| + Critique-GRPO | 55.49 | 37.86 | 77.58 | 36.33 | 85.58 |
| + GOLF | 58.49 | 41.65 | 80.74 | 38.33 | 87.80 |
GOLF consistently delivers the strongest results across all verifiable benchmarks, outperforming GRPO and Critique-GRPO.
Pass@k Analysis: Figure 4 shows that GOLF improves both Pass@1 and Pass@k on math reasoning benchmarks, indicating improved single-sample quality and broader solution coverage/diversity.
Code Generation: On LCBv6 with Qwen-3-8B, GOLF achieves an Avg@4 of 47.71, outperforming GRPO by +3.63 points and showing 1.5x sample efficiency. It also slightly outperforms SDPO (47.71 vs. 47.51).
Theoretical and Practical Implications
Significance of Findings:
- Complementary Feedback Sources: The aggregation of external critiques (targeted error identification) and intra-group attempts (diverse failure patterns & partial ideas) yields richer refinement contexts and higher-quality refinements than using either source alone.
- Adaptive Guidance Mechanism: Injecting high-quality refinements as off-policy scaffolds in low-reward regimes effectively mitigates the exploration bottleneck caused by collapsed group-normalized advantages (e.g., all-zero groups), restoring usable policy gradients.
- Joint Optimization Cycle: Training generation and refinement jointly within a unified RL loop creates a virtuous cycle—improved self-refinement produces better scaffolds, which in turn improve exploration, leading to continuous improvement in both capabilities.
- Exploration Diversity: GOLF maintains higher policy entropy (Figure 8) and improves Pass@k, demonstrating that it promotes diverse exploration and prevents premature mode collapse.
- Practical Efficiency: The framework achieves substantial improvements in sample efficiency (~2.2x) and final performance across diverse task types, making RL training for LLMs more efficient and effective.
Broader Impact: The method provides a scalable path to leverage rich NL feedback (common in real-world interactions) to densify learning signals and guide exploration, reducing reliance on costly trial-and-error. It could lower computational costs and improve reliability in interactive settings.
Potential Risks: Stronger refinement and exploration capabilities may amplify risks related to generating persuasive or strategically optimized content. Bias in LLM-based judges/critiques could be reinforced. Responsible deployment and careful evaluation are necessary.
Conclusion
Main Takeaways: GOLF effectively improves RL exploration for LLMs by aggregating group-level NL feedback (external critiques + intra-group attempts) into actionable refinements, adaptively injecting them as off-policy scaffolds in sparse-reward regions, and jointly optimizing generation and refinement. This leads to:
- Superior performance on both verifiable and non-verifiable tasks.
- Significant improvements in sample efficiency (~2.2x).
- Enhanced exploration diversity (higher entropy, better Pass@k).
- Improved self-refinement capability.
Future Directions: The method demonstrates that NL guidance is a practical and scalable path to more efficient and diverse exploration in language model RL. Combining GOLF's approach (aggregating diverse failures) with methods like SDPO (leveraging past successes) presents a promising direction for future work.