Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

Summary (Overview)

  • Key Contribution: Proposed GOLF, a novel RL framework that leverages Group-level Natural Language Feedback (aggregating external critiques and intra-group attempts) to produce actionable refinements that guide targeted exploration and improve training efficiency.
  • Core Mechanism: Aggregates complementary NL feedback sources, adaptively injects high-quality refinements as off-policy scaffolds in low-reward regimes, and jointly optimizes generation and refinement within a unified RL loop.
  • Performance Gains: Achieves superior performance across both verifiable (math, instruction following, code) and non-verifiable (chat, creative writing) benchmarks, with ~2.2x improvement in sample efficiency compared to scalar-reward-only RL methods.
  • Enhanced Exploration: Maintains higher policy entropy and improves Pass@k scores, indicating broader solution coverage and more diverse exploration.
  • Capability Development: Joint training improves both direct problem-solving and self-refinement capabilities, enabling better utilization of NL feedback at inference time.

Introduction and Theoretical Foundation

Background & Motivation: Reinforcement Learning (RL) for Large Language Models (LLMs), such as RLHF and RLVR, typically relies solely on scalar rewards (success/failure signals). This leads to inefficient exploration, as the policy lacks explicit guidance on how to improve and must rely on costly trial-and-error. In many real-world scenarios, LLMs receive richer Natural Language (NL) feedback (e.g., error diagnoses, revision suggestions), but current RL algorithms (e.g., GRPO) do not fully exploit this information.

Core Insight: NL feedback, when aggregated from multiple complementary sources, can be translated into actionable refinements that provide explicit guidance, densifying learning signals and alleviating exploration bottlenecks in sparse-reward regions.

Theoretical Basis: The framework builds upon Group Relative Policy Optimization (GRPO). The problem is exacerbated when group-normalized advantages collapse (e.g., all-zero reward groups), yielding vanishing gradients. GOLF addresses this by injecting high-quality refinements derived from NL feedback to restore informative advantages and provide targeted guidance.
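To see why all-zero groups stall learning, here is a minimal sketch (not the paper's code) of GRPO-style group normalization; the mean/std form is the standard GRPO choice and an assumption here.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A mixed-outcome group yields informative, nonzero advantages.
mixed = grpo_advantages([1, 0, 0, 1])

# An all-zero (or all-one) reward group collapses: every advantage is 0,
# so the policy gradient contributed by the whole group vanishes.
collapsed = grpo_advantages([0, 0, 0, 0])
```

Injecting a single successful refinement into such a group restores reward variance, and with it a usable gradient.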

Methodology

GOLF consists of three tightly coupled components:

1. Group-level Feedback Aggregated Refinement

For each prompt $x$, sample a group of $N$ responses:

$$G_{\text{gen}}(x) = \{y^{(i)}\}_{i=1}^{N}, \qquad y^{(i)} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x).$$

Receive a scalar reward and a critique for each response: $(r^{(i)}, c^{(i)}) = R(x, y^{(i)})$.

  • External Feedback: the critique $c^{(i)}$ associated with a specific response.
  • Intra-group Feedback: alternative responses within $G_{\text{gen}}(x)$, which contain complementary partial ideas.

Collect the failure set:

$$F(x) = \{\, (y^{(i)}, c^{(i)}) \mid r^{(i)} = 0 \,\}.$$

Construct an aggregated refinement prompt by concatenating the prompt with the failure set:

$$p_{\text{agg}}(x) = \text{CONCAT}(x, F(x)).$$

Conditioned on $p_{\text{agg}}(x)$, sample a refinement group:

$$G_{\text{ref}}(x) = \{\tilde{y}^{(j)}\}_{j=1}^{N}, \qquad \tilde{y}^{(j)} \sim \pi_{\theta_{\text{old}}}(\cdot \mid p_{\text{agg}}(x)),$$

and score each refinement: $\tilde{r}^{(j)} = R(x, \tilde{y}^{(j)})$.
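The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names, the prompt template used for `p_agg`, and the `policy_sample`/`reward_fn` interfaces are all assumptions.

```python
def group_refinement_step(x, policy_sample, reward_fn, N=8):
    """Sketch of group-level feedback-aggregated refinement (illustrative).

    policy_sample(prompt) -> response string
    reward_fn(x, y)       -> (r, c): binary reward and NL critique
    """
    # 1) Sample the generation group G_gen(x) and score it.
    G_gen = [policy_sample(x) for _ in range(N)]
    scored = [(y, *reward_fn(x, y)) for y in G_gen]

    # 2) Collect the failure set F(x): failed responses with their critiques.
    F = [(y, c) for (y, r, c) in scored if r == 0]

    # 3) Build the aggregated refinement prompt p_agg(x) = CONCAT(x, F(x)).
    feedback = "\n\n".join(f"Attempt:\n{y}\nCritique:\n{c}" for (y, c) in F)
    p_agg = (f"{x}\n\nPrevious failed attempts and their critiques:\n"
             f"{feedback}\n\nWrite a corrected response.")

    # 4) Sample the refinement group conditioned on p_agg(x) and score it
    #    against the ORIGINAL prompt x.
    G_ref = [policy_sample(p_agg) for _ in range(N)]
    ref_scored = [(y, reward_fn(x, y)[0]) for y in G_ref]
    return scored, p_agg, ref_scored
```

Note that refinements are scored by the same reward function against the original prompt $x$, so successful refinements are directly comparable to on-policy rollouts.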

2. Adaptive Guidance via Mixed Policy Optimization

Adaptive Injection: Compute the group's average reward:

$$s(x) = \frac{1}{N} \sum_{y \in G_{\text{gen}}(x)} r(x, y).$$

Trigger injection when $s(x)$ falls below a threshold $\tau$ (default: $1/N$). Form the set of successful refinements:

$$S_{\text{ref}}(x) = \{\, \tilde{y} \in G_{\text{ref}}(x) \mid \tilde{r}(x, \tilde{y}) = 1 \,\}.$$

If $S_{\text{ref}}(x) \neq \emptyset$, randomly select $\tilde{y}^\star \in S_{\text{ref}}(x)$ and inject it by replacing one failed response in $G_{\text{gen}}(x)$.
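A minimal sketch of this injection rule, assuming binary rewards (function and variable names are illustrative, not from the paper):

```python
import random

def adaptive_inject(G_gen, gen_rewards, G_ref, ref_rewards, tau=None, rng=random):
    """Sketch of adaptive injection: when the group's mean reward s(x) falls
    below tau (default 1/N), replace one failed on-policy response with a
    randomly chosen successful refinement, restoring nonzero advantages."""
    N = len(G_gen)
    tau = 1.0 / N if tau is None else tau
    s = sum(gen_rewards) / N                     # s(x): group mean reward
    if s >= tau:
        return G_gen, gen_rewards, None          # group is informative enough

    S_ref = [y for y, r in zip(G_ref, ref_rewards) if r == 1]
    if not S_ref:
        return G_gen, gen_rewards, None          # no successful refinement

    y_star = rng.choice(S_ref)                   # scaffold to inject
    # With binary rewards, s < 1/N implies every rollout failed.
    fail_idx = [i for i, r in enumerate(gen_rewards) if r == 0]
    i = rng.choice(fail_idx)                     # replace one failed rollout
    G_aug, R_aug = list(G_gen), list(gen_rewards)
    G_aug[i], R_aug[i] = y_star, 1
    return G_aug, R_aug, i
```

After injection the augmented group mixes outcomes, so group-normalized advantages are no longer all zero.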

Mixed Policy Optimization: Let $G_{\text{aug}}(x) = G_{\text{on}}(x) \cup G_{\text{off}}(x)$, where $G_{\text{on}}$ contains the on-policy rollouts and $G_{\text{off}}$ the injected refinement trajectories. Optimize the mixed objective:

$$J_{\text{Mixed}}(\theta) = \frac{1}{Z} \left[ \sum_{i=1}^{N_{\text{on}}} \sum_{t=1}^{|\tau_i|} \text{CLIP}\left(r^{\text{on}}_{i,t}(\theta), \hat{A}_i, \epsilon\right) + \sum_{j=1}^{N_{\text{off}}} \sum_{t=1}^{|\tau_j|} \text{CLIP}\left(f(r^{\text{off}}_{j,t}(\theta)), \hat{A}_j, \epsilon\right) \right],$$

where:

  • $Z = \sum_{i=1}^{N_{\text{on}}} |\tau_i| + \sum_{j=1}^{N_{\text{off}}} |\tau_j|$ normalizes by the total number of tokens.
  • On-policy ratio: $r^{\text{on}}_{i,t}(\theta) = \dfrac{\pi_\theta(\tau_{i,t} \mid x, \tau_{i,<t})}{\pi_{\theta_{\text{old}}}(\tau_{i,t} \mid x, \tau_{i,<t})}$.
  • Off-policy ratio: $r^{\text{off}}_{j,t}(\theta) = \dfrac{\pi_\theta(\tau_{j,t} \mid x, \tau_{j,<t})}{\pi_{\theta_{\text{old}}}(\tau_{j,t} \mid p_{\text{agg}}(x), \tau_{j,<t})}$; the denominator conditions on $p_{\text{agg}}(x)$ because the injected trajectory was sampled from the refinement prompt, while the numerator conditions on the original prompt $x$.
  • Advantages are computed by normalizing rewards within $G_{\text{aug}}(x)$: $\hat{A}_i = R(\tau_i) - \text{mean}(G_{\text{aug}}(x))$.
  • For off-policy ratios, a reshaping function $f(u) = u/(u+\lambda)$ with $\lambda = 0.1$ is applied and the clip operation is omitted, to emphasize low-probability but effective actions.
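To make the reshaping concrete, here is a numpy sketch of the per-token surrogate. It is an assumption-laden illustration, not the authors' implementation: the PPO-style min/clip form for the on-policy term and all names are assumptions, and, following the note above, the off-policy term uses the reshaped ratio without clipping.

```python
import numpy as np

def reshape_ratio(u, lam=0.1):
    """f(u) = u / (u + lambda): maps the importance ratio into [0, 1),
    with the steepest slope near u = 0, so tokens that were low-probability
    under the old policy but effective receive relatively larger updates."""
    return u / (u + lam)

def mixed_token_loss(logp_new, logp_old, adv, off_mask, eps=0.2, lam=0.1):
    """Illustrative per-token surrogate for the mixed objective: clipped
    ratios for on-policy tokens, reshaped (unclipped) ratios for injected
    off-policy tokens, averaged over all tokens (the 1/Z factor)."""
    u = np.exp(logp_new - logp_old)              # importance ratio
    clipped = np.clip(u, 1.0 - eps, 1.0 + eps)
    on = np.minimum(u * adv, clipped * adv)      # PPO-style clipped surrogate
    off = reshape_ratio(u, lam) * adv            # reshaped, no clip
    per_token = np.where(off_mask, off, on)
    return -per_token.mean()                     # negate: maximize objective
```

Since $f'(u) = \lambda/(u+\lambda)^2$ is largest as $u \to 0$, gradient signal concentrates on low-probability tokens, which is the stated motivation for the reshaping.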

3. Joint Optimization for Self-Refinement

Collect two rollout groups per prompt: a generation group $G_{\text{gen}}(x)$ and a refinement group $G_{\text{ref}}(x)$. Concatenate them into a joint batch $B(x) = G_{\text{gen}}(x) \cup G_{\text{ref}}(x)$. Advantages are computed separately within each group, and the policy $\pi_\theta$ is updated with GRPO in a single RL process. This creates a virtuous cycle: improved self-refinement produces higher-quality scaffolds, which further improve exploration.
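The joint-batch construction can be sketched as follows. The mean/std normalization is GRPO's standard form and an assumption here (the mixed objective above uses a mean-baseline variant); the key point illustrated is that each group is normalized against its own baseline before concatenation.

```python
import numpy as np

def joint_batch_advantages(gen_rewards, ref_rewards, eps=1e-8):
    """Sketch: compute advantages within each group separately, then
    concatenate so one GRPO update trains generation and refinement."""
    def group_norm(r):
        r = np.asarray(r, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    return np.concatenate([group_norm(gen_rewards), group_norm(ref_rewards)])
```

Normalizing per group keeps the two reward distributions comparable: refinements, which succeed more often, do not drown out the harder generation rollouts.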

Empirical Validation / Results

Non-verifiable Tasks

Setup: Trained on Llama-3.1-8B-Instruct and Qwen-3-8B using 7,500 prompts from WildChat-IF. Benchmarks: AlpacaEval-v2, WildBench, Arena-Hard-v1/v2, CreativeWriting-v3. Judge: GPT-4o.

Baselines: Direct-Likert, Pairwise-GRPO, Rubric-as-Reward, Critique-GRPO.

Key Results Table:

| Model | AlpacaEval-v2 LC Win (%) | WildBench Score (%) | Arena-Hard-v1 Win (%) | Arena-Hard-v2 Win (%) | CreativeWriting-v3 LLM Judge (%) | Average (%) |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 31.93 | -8.25 | 30.80 | 5.57 | 53.96 | 24.30 |
| + Direct-Likert | 38.88 | 13.48 | 51.55 | 11.73 | 64.10 | 35.79 |
| + Pairwise-GRPO | 45.47 | 25.54 | 49.20 | 13.30 | 62.95 | 39.94 |
| + Rubric-as-Reward | 42.24 | 26.51 | 52.10 | 15.57 | 68.12 | 40.11 |
| + Critique-GRPO | 47.45 | 25.09 | 50.15 | 13.73 | 65.76 | 40.92 |
| + GOLF | 53.42 | 34.42 | 52.40 | 25.03 | 66.21 | 50.19 |
| Qwen-3-8B | 55.16 | 48.05 | 70.70 | 33.90 | 63.27 | 53.95 |
| + Direct-Likert | 64.84 | 58.01 | 82.75 | 41.70 | 69.56 | 62.99 |
| + Pairwise-GRPO | 66.34 | 67.77 | 81.20 | 50.10 | 68.08 | 66.97 |
| + Rubric-as-Reward | 65.34 | 67.09 | 81.90 | 50.08 | 69.21 | 67.08 |
| + Critique-GRPO | 68.20 | 64.84 | 81.95 | 49.63 | 67.30 | 66.96 |
| + GOLF | 71.80 | 68.16 | 80.90 | 52.00 | 70.78 | 69.26 |

GOLF achieves the best average performance on both models, surpassing the strongest baseline by +9.27 points (Llama) and +2.18 points (Qwen).

Sample Efficiency: GOLF shows roughly a 2.2x improvement in sample efficiency. For example, on AlpacaEval-v2 it matches the strongest baseline's final LC win rate in just 80 training steps (a 2.25x speedup). It also converges to a higher performance ceiling (relative gains of +12.7% on AlpacaEval-v2, +85.2% on WildBench, and +70.7% on Arena-Hard-v2).

Verifiable Tasks

Setup: Models: Qwen-3-4B and Qwen-3-8B. Training data: OpenR1-Math (4k problems), filtered IFTrain (3,798 samples), LCBv6 subset of LiveCodeBench. Benchmarks: AIME24/25, AMC23 (math); IFBench, IFEval (instruction following); LiveCodeBench (code).

Baselines: Refinement-FT, Critique-FT, GRPO, Critique-GRPO, SDPO (for code).

Key Results Table:

| Model | AIME24 (%) | AIME25 (%) | AMC23 (%) | IFBench (%) | IFEval (%) |
|---|---|---|---|---|---|
| Qwen-3-4B | 22.53 | 18.55 | 59.41 | 23.67 | 81.52 |
| + Refinement-FT | 31.67 | 21.25 | 64.06 | 30.44 | 83.73 |
| + Critique-FT | 34.58 | 24.58 | 65.94 | 31.67 | 82.63 |
| + GRPO | 42.72 | 35.42 | 76.85 | 33.33 | 84.45 |
| + Critique-GRPO | 45.72 | 35.89 | 76.14 | 35.67 | 85.21 |
| + GOLF | 49.18 | 38.10 | 77.15 | 37.67 | 86.51 |
| Qwen-3-8B | 27.97 | 19.60 | 61.32 | 27.00 | 83.55 |
| + Refinement-FT | 42.08 | 27.50 | 67.81 | 34.33 | 84.29 |
| + Critique-FT | 46.75 | 28.75 | 70.31 | 33.60 | 84.45 |
| + GRPO | 55.05 | 38.02 | 78.61 | 35.65 | 84.76 |
| + Critique-GRPO | 55.49 | 37.86 | 77.58 | 36.33 | 85.58 |
| + GOLF | 58.49 | 41.65 | 80.74 | 38.33 | 87.80 |

GOLF consistently delivers the strongest results across all verifiable benchmarks, outperforming GRPO and Critique-GRPO.

Pass@k Analysis: Figure 4 shows that GOLF improves both Pass@1 and Pass@k on math reasoning benchmarks, indicating improved single-sample quality and broader solution coverage/diversity.

Code Generation: On LCBv6 with Qwen-3-8B, GOLF achieves an Avg@4 of 47.71, outperforming GRPO by +3.63 points and showing 1.5x sample efficiency. It also slightly outperforms SDPO (47.71 vs. 47.51).

Theoretical and Practical Implications

Significance of Findings:

  1. Complementary Feedback Sources: The aggregation of external critiques (targeted error identification) and intra-group attempts (diverse failure patterns & partial ideas) yields richer refinement contexts and higher-quality refinements than using either source alone.
  2. Adaptive Guidance Mechanism: Injecting high-quality refinements as off-policy scaffolds in low-reward regimes effectively mitigates the exploration bottleneck caused by collapsed group-normalized advantages (e.g., all-zero groups), restoring usable policy gradients.
  3. Joint Optimization Cycle: Training generation and refinement jointly within a unified RL loop creates a virtuous cycle—improved self-refinement produces better scaffolds, which in turn improve exploration, leading to continuous improvement in both capabilities.
  4. Exploration Diversity: GOLF maintains higher policy entropy (Figure 8) and improves Pass@k, demonstrating that it promotes diverse exploration and prevents premature mode collapse.
  5. Practical Efficiency: The framework achieves substantial improvements in sample efficiency (~2.2x) and final performance across diverse task types, making RL training for LLMs more efficient and effective.

Broader Impact: The method provides a scalable path to leverage rich NL feedback (common in real-world interactions) to densify learning signals and guide exploration, reducing reliance on costly trial-and-error. It could lower computational costs and improve reliability in interactive settings.

Potential Risks: Stronger refinement and exploration capabilities may amplify risks related to generating persuasive or strategically optimized content. Bias in LLM-based judges/critiques could be reinforced. Responsible deployment and careful evaluation are necessary.

Conclusion

Main Takeaways: GOLF effectively improves RL exploration for LLMs by aggregating group-level NL feedback (external critiques + intra-group attempts) into actionable refinements, adaptively injecting them as off-policy scaffolds in sparse-reward regions, and jointly optimizing generation and refinement. This leads to:

  • Superior performance on both verifiable and non-verifiable tasks.
  • Significant improvements in sample efficiency (~2.2x).
  • Enhanced exploration diversity (higher entropy, better Pass@k).
  • Improved self-refinement capability.

Future Directions: The method demonstrates that NL guidance is a practical and scalable path to more efficient and diverse exploration in language model RL. Combining GOLF's approach (aggregating diverse failures) with methods like SDPO (leveraging past successes) presents a promising direction for future work.