# Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

> GOLF improves RL sample efficiency by 2.2x using aggregated natural language feedback to guide exploration in sparse-reward tasks.

- **Source:** [arXiv](https://arxiv.org/abs/2603.04597)
- **Published:** 2026-03-13
- **Permalink:** https://picx.dev/p/D4Do4z
- **Whiteboard:** https://picx.dev/p/D4Do4z/image

## Summary

- **Key Contribution**: Proposes **GOLF**, an RL framework that leverages **Group-level Natural Language Feedback** (external critiques aggregated with intra-group attempts) to produce actionable refinements that guide targeted exploration and improve training efficiency.
- **Core Mechanism**: Aggregates complementary NL feedback sources, adaptively injects high-quality refinements as off-policy scaffolds in low-reward regimes, and jointly optimizes generation and refinement within a unified RL loop.
- **Performance Gains**: Achieves superior performance across both verifiable (math, instruction following, code) and non-verifiable (chat, creative writing) benchmarks, with **~2.2x improvement in sample efficiency** compared to scalar-reward-only RL methods.
- **Enhanced Exploration**: Maintains higher policy entropy and improves `Pass@k` scores, indicating broader solution coverage and more diverse exploration.
- **Capability Development**: Joint training improves both direct problem-solving and self-refinement capabilities, enabling better utilization of NL feedback at inference time.

## Introduction and Theoretical Foundation
**Background & Motivation**: Reinforcement Learning (RL) for Large Language Models (LLMs), such as RLHF and RLVR, typically relies solely on scalar rewards (success/failure signals). This leads to **inefficient exploration**, as the policy lacks explicit guidance on how to improve and must rely on costly trial-and-error. In many real-world scenarios, LLMs receive richer **Natural Language (NL) feedback** (e.g., error diagnoses, revision suggestions), but current RL algorithms (e.g., GRPO) do not fully exploit this information.

**Core Insight**: NL feedback, when aggregated from **multiple complementary sources**, can be translated into actionable refinements that provide explicit guidance, densifying learning signals and alleviating exploration bottlenecks in sparse-reward regions.

**Theoretical Basis**: The framework builds upon **Group Relative Policy Optimization (GRPO)**, which computes advantages by normalizing rewards within a group of sampled responses. The exploration problem is most severe when these group-normalized advantages collapse (e.g., in groups where every response receives zero reward), yielding vanishing gradients. GOLF addresses this by injecting high-quality refinements derived from NL feedback to restore informative advantages and provide targeted guidance.
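The collapse can be seen numerically. The sketch below assumes the commonly used GRPO normalization (mean-centering followed by division by the group standard deviation); the function name and the `eps` stabilizer are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: (r - mean) / (std + eps) within one group.

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where every rollout failed: all advantages are zero,
# so the policy-gradient signal for this prompt vanishes.
all_failed = group_normalized_advantages([0, 0, 0, 0])

# A group with a single success: the success gets a positive advantage
# and the failures get negative ones.
one_success = group_normalized_advantages([0, 0, 0, 1])

print(all_failed)
print(one_success)
```

A single successful trajectory in the group is enough to make the advantages informative again, which is the effect an injected refinement is meant to have.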

## Methodology
GOLF consists of three tightly coupled components:

### 1. Group-level Feedback Aggregated Refinement
For each prompt $x$, sample a group of $N$ responses:
$$G_{\text{gen}}(x) = \{y^{(i)}\}_{i=1}^{N}, \quad y^{(i)} \sim \pi_{\theta_{\text{old}}}(\cdot|x).$$

Each response receives a scalar reward and a critique from the evaluator $R$: $(r^{(i)}, c^{(i)}) = R(x, y^{(i)})$.
- **External Feedback**: Critique $c^{(i)}$ associated with a specific response.
- **Intra-group Feedback**: Alternative responses within $G_{\text{gen}}(x)$, containing complementary partial ideas.

Collect the **failure set**:
$$F(x) = \{ (y^{(i)}, c^{(i)}) | r^{(i)} = 0 \}.$$

Construct an **aggregated refinement prompt** by concatenating the prompt and the failure set:
$$p_{\text{agg}}(x) = \text{CONCAT}(x, F(x)).$$

Conditioned on $p_{\text{agg}}(x)$, sample a refinement group:
$$G_{\text{ref}}(x) = \{\tilde{y}^{(j)}\}_{j=1}^{N}, \quad \tilde{y}^{(j)} \sim \pi_{\theta_{\text{old}}}(\cdot|p_{\text{agg}}(x)),$$
and score each: $\tilde{r}^{(j)} = R(x, \tilde{y}^{(j)})$.
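A minimal sketch of how the failure set and the aggregated refinement prompt might be assembled. The `Rollout` container, the function names, and the prompt template wording are all assumptions made for illustration; the summary only specifies that the prompt concatenates $x$ with $F(x)$.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    response: str
    reward: float
    critique: str


def failure_set(rollouts):
    """F(x): (response, critique) pairs whose reward is 0."""
    return [(r.response, r.critique) for r in rollouts if r.reward == 0]


def aggregated_prompt(x, failures):
    """p_agg(x) = CONCAT(x, F(x)). The template text is hypothetical."""
    parts = [f"Problem:\n{x}"]
    for k, (y, c) in enumerate(failures, start=1):
        parts.append(f"Failed attempt {k}:\n{y}\nCritique {k}:\n{c}")
    parts.append("Write an improved solution that avoids the issues above.")
    return "\n\n".join(parts)


rollouts = [
    Rollout("x = 3", 0, "Sign error when moving 3 across the equals sign."),
    Rollout("x = -3", 1, ""),
    Rollout("x = 0", 0, "The constant term was dropped."),
]
failures = failure_set(rollouts)
prompt = aggregated_prompt("Solve x + 3 = 0.", failures)
print(prompt)
```

Only failed attempts and their critiques enter the prompt, so the refinement step sees several distinct failure patterns at once rather than a single critique.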

### 2. Adaptive Guidance via Mixed Policy Optimization
**Adaptive Injection**: Compute the group's average reward:
$$s(x) = \frac{1}{N} \sum_{y \in G_{\text{gen}}(x)} r(x, y).$$

Trigger injection when $s(x)$ falls below a threshold $\tau$ (default: $1/N$). Form the set of successful refinements:
$$S_{\text{ref}}(x) = \{ \tilde{y} \in G_{\text{ref}}(x) | \tilde{r}(x, \tilde{y}) = 1 \}.$$

If $S_{\text{ref}}(x) \neq \emptyset$, randomly select $\tilde{y}^\star \in S_{\text{ref}}(x)$ and inject it by replacing one failed response in $G_{\text{gen}}(x)$.
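The injection rule can be sketched as follows, assuming binary rewards. The function name and return convention are hypothetical, and choosing the replaced failure uniformly at random is an assumption: the summary says only that one failed response is replaced.

```python
import random


def inject_refinement(gen_rewards, ref_rewards, tau=None, rng=random):
    """Return (failed_index, refinement_index), or None if nothing is injected.

    tau defaults to 1/N, so with binary rewards injection fires only
    when every response in the generation group failed.
    """
    n = len(gen_rewards)
    tau = 1.0 / n if tau is None else tau
    s = sum(gen_rewards) / n  # group average reward s(x)
    if s >= tau:
        return None
    successes = [j for j, r in enumerate(ref_rewards) if r == 1]
    failed = [i for i, r in enumerate(gen_rewards) if r == 0]
    if not successes or not failed:
        return None
    # Assumption: both the replaced slot and the scaffold are picked at random.
    return rng.choice(failed), rng.choice(successes)


picked = inject_refinement([0, 0, 0, 0], [0, 1, 0, 0], rng=random.Random(0))
skipped = inject_refinement([0, 1, 0, 0], [1, 1, 1, 1])
no_scaffold = inject_refinement([0, 0, 0, 0], [0, 0, 0, 0])
print(picked, skipped, no_scaffold)
```

With the default threshold, a group that already contains one success is left untouched, so off-policy scaffolds are reserved for prompts where on-policy sampling gives no learning signal.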

**Mixed Policy Optimization**: Let $G_{\text{aug}}(x) = G_{\text{on}}(x) \cup G_{\text{off}}(x)$, where $G_{\text{on}}$ are on-policy rollouts and $G_{\text{off}}$ are injected refinement trajectories. Optimize using a mixed objective:

$$J_{\text{Mixed}}(\theta) = \frac{1}{Z} \left[ \sum_{i=1}^{N_{\text{on}}} \sum_{t=1}^{|\tau_i|} \text{CLIP}\left(r^{\text{on}}_{i,t}(\theta), \hat{A}_i, \epsilon\right) + \sum_{j=1}^{N_{\text{off}}} \sum_{t=1}^{|\tau_j|} \text{CLIP}\left(f(r^{\text{off}}_{j,t}(\theta)), \hat{A}_j, \epsilon\right) \right],$$

where:
- $Z = \sum_{i=1}^{N_{\text{on}}} |\tau_i| + \sum_{j=1}^{N_{\text{off}}} |\tau_j|$ normalizes by total tokens.
- $r^{\text{on}}_{i,t}(\theta) = \frac{\pi_\theta(\tau_{i,t}|x, \tau_{i,<t})}{\pi_{\theta_{\text{old}}}(\tau_{i,t}|x, \tau_{i,<t})}$.
- $r^{\text{off}}_{j,t}(\theta) = \frac{\pi_\theta(\tau_{j,t}|x, \tau_{j,<t})}{\pi_{\theta_{\text{old}}}(\tau_{j,t}|p_{\text{agg}}(x), \tau_{j,<t})}$.
- Advantages are computed relative to the augmented group: each reward is centered on the mean reward of $G_{\text{aug}}(x)$, i.e. $\hat{A}_i = R(\tau_i) - \text{mean}(\{R(\tau) : \tau \in G_{\text{aug}}(x)\})$.
- For off-policy ratios, a reshaping function $f(u) = u/(u+\lambda)$ with $\lambda=0.1$ is applied. The clip operation is omitted for these terms (despite the generic $\text{CLIP}$ notation in the objective above), so that low-probability but effective actions are emphasized.
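The pieces above can be combined in a small numeric sketch. It assumes per-token importance ratios are already computed, represents each trajectory as a hypothetical `(reward, [ratios])` pair, and uses `eps=0.2` as an assumed clip range (a common PPO default, not a value stated in this summary).

```python
def clip_term(ratio, adv, eps=0.2):
    """PPO-style clipped surrogate for one on-policy token."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)


def reshape(u, lam=0.1):
    """Reshaping function f(u) = u / (u + lambda) for off-policy ratios."""
    return u / (u + lam)


def mixed_objective(on_trajs, off_trajs, eps=0.2, lam=0.1):
    """Token-normalized mixed objective over one augmented group.

    Each trajectory is (reward, [per-token ratios]).
    """
    rewards = [r for r, _ in on_trajs + off_trajs]
    baseline = sum(rewards) / len(rewards)  # mean reward of G_aug
    total, z = 0.0, 0
    for reward, ratios in on_trajs:
        adv = reward - baseline
        total += sum(clip_term(u, adv, eps) for u in ratios)
        z += len(ratios)
    for reward, ratios in off_trajs:
        adv = reward - baseline
        # Off-policy tokens: reshaped ratio, no clipping.
        total += sum(reshape(u, lam) * adv for u in ratios)
        z += len(ratios)
    return total / z


on = [(0.0, [1.0, 1.0]), (0.0, [1.0])]  # two failed on-policy rollouts
off = [(1.0, [0.1, 0.1])]               # one injected successful refinement
value = mixed_objective(on, off)
print(value)
```

Because $f'(u) = \lambda/(u+\lambda)^2$ grows as $u$ shrinks, tokens that the current policy assigns low probability receive a larger gradient weight (roughly 8.3 at $u = 0.01$ versus 0.08 at $u = 1$ for $\lambda = 0.1$), which is consistent with the stated goal of emphasizing low-probability effective actions.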

### 3. Joint Optimization for Self-Refinement
Collect two rollout groups: a generation group $G_{\text{gen}}(x)$ and a refinement group $G_{\text{ref}}(x)$. Concatenate into a joint batch $B(x) = G_{\text{gen}}(x) \cup G_{\text{ref}}(x)$. Advantages within each group are computed separately, and the policy $\pi_\theta$ is updated using GRPO within a single RL process. This creates a virtuous cycle: improved self-refinement produces higher-quality scaffolds, which further improve exploration.
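A minimal sketch of the joint batch, showing why advantages are computed per group rather than over the pooled batch. Mean-centering is used as the advantage here for simplicity, and the function names and tuple layout are assumptions.

```python
def centered(rewards):
    """Mean-centered advantages within one group."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]


def joint_batch(gen, ref):
    """Build B(x) from two groups of (response, reward) pairs.

    Returns rows of (role, response, advantage), with advantages
    computed separately inside each group.
    """
    batch = []
    for role, group in (("gen", gen), ("ref", ref)):
        advs = centered([r for _, r in group])
        batch += [(role, y, a) for (y, _), a in zip(group, advs)]
    return batch


gen = [("a", 0), ("b", 0), ("c", 1), ("d", 0)]
ref = [("e", 1), ("f", 1), ("g", 0), ("h", 1)]
batch = joint_batch(gen, ref)
for row in batch:
    print(row)
```

Refinement rollouts usually score higher because they see the critiques. Pooling both groups under one baseline would therefore penalize generation rollouts simply for lacking that context, whereas separate baselines compare each rollout only against others produced from the same prompt.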

## Empirical Validation / Results

### Non-verifiable Tasks
**Setup**: Trained on Llama-3.1-8B-Instruct and Qwen-3-8B using 7,500 prompts from WildChat-IF. Benchmarks: AlpacaEval-v2, WildBench, Arena-Hard-v1/v2, CreativeWriting-v3. Judge: GPT-4o.

**Baselines**: Direct-Likert, Pairwise-GRPO, Rubric-as-Reward, Critique-GRPO.

**Key Results Table**:

| Model | AlpacaEval-v2 (LC Win Rate %) | WildBench (Score %) | Arena-Hard-v1 (Win Rate %) | Arena-Hard-v2 (Win Rate %) | CreativeWriting-v3 (LLM Judge %) | **Average Score (%)** |
|---|---|---|---|---|---|---|
| **Llama-3.1-8B-Instruct** | 31.93 | -8.25 | 30.80 | 5.57 | 53.96 | 24.30 |
| + Direct-Likert | 38.88 | 13.48 | 51.55 | 11.73 | 64.10 | 35.79 |
| + Pairwise-GRPO | 45.47 | 25.54 | 49.20 | 13.30 | 62.95 | 39.94 |
| + Rubric-as-Reward | 42.24 | 26.51 | 52.10 | 15.57 | 68.12 | 40.11 |
| + Critique-GRPO | 47.45 | 25.09 | 50.15 | 13.73 | 65.76 | 40.92 |
| **+ GOLF** | **53.42** | **34.42** | **52.40** | **25.03** | 66.21 | **50.19** |
| **Qwen-3-8B** | 55.16 | 48.05 | 70.70 | 33.90 | 63.27 | 53.95 |
| + Direct-Likert | 64.84 | 58.01 | 82.75 | 41.70 | 69.56 | 62.99 |
| + Pairwise-GRPO | 66.34 | 67.77 | 81.20 | 50.10 | 68.08 | 66.97 |
| + Rubric-as-Reward | 65.34 | 67.09 | 81.90 | 50.08 | 69.21 | 67.08 |
| + Critique-GRPO | 68.20 | 64.84 | 81.95 | 49.63 | 67.30 | 66.96 |
| **+ GOLF** | **71.80** | **68.16** | 80.90 | **52.00** | **70.78** | **69.26** |

*GOLF achieves the best average performance on both models, surpassing the strongest baseline by +9.27 points (Llama) and +2.18 points (Qwen).*
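The quoted margins can be reproduced from the Average Score column of the table. The snippet below is only an arithmetic check; the dictionary layout and function name are illustrative.

```python
# Average Score (%) column, copied from the table above.
averages = {
    "Llama-3.1-8B-Instruct": {
        "Direct-Likert": 35.79,
        "Pairwise-GRPO": 39.94,
        "Rubric-as-Reward": 40.11,
        "Critique-GRPO": 40.92,
        "GOLF": 50.19,
    },
    "Qwen-3-8B": {
        "Direct-Likert": 62.99,
        "Pairwise-GRPO": 66.97,
        "Rubric-as-Reward": 67.08,
        "Critique-GRPO": 66.96,
        "GOLF": 69.26,
    },
}


def margin_over_best_baseline(scores, method="GOLF"):
    """Gap between `method` and the strongest other entry, in points."""
    best = max(v for name, v in scores.items() if name != method)
    return round(scores[method] - best, 2)


margins = {m: margin_over_best_baseline(s) for m, s in averages.items()}
print(margins)
```

The strongest baseline differs by model: Critique-GRPO for Llama-3.1-8B-Instruct and Rubric-as-Reward for Qwen-3-8B.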

**Sample Efficiency**: GOLF shows a ~2.2x improvement in sample efficiency. For example, on AlpacaEval-v2 it matches the baseline's final LC win rate in just 80 training steps (a 2.25x efficiency gain). It also converges to a higher performance ceiling (+12.7% on AlpacaEval-v2, +85.2% on WildBench, +70.7% on Arena-Hard-v2).

### Verifiable Tasks
**Setup**: Models: Qwen-3-4B and Qwen-3-8B. Training data: OpenR1-Math (4k problems), filtered IFTrain (3,798 samples), LCBv6 subset of LiveCodeBench. Benchmarks: AIME24/25, AMC23 (math); IFBench, IFEval (instruction following); LiveCodeBench (code).

**Baselines**: Refinement-FT, Critique-FT, GRPO, Critique-GRPO, SDPO (for code).

**Key Results Table**:

| Model | AIME24 (%) | AIME25 (%) | AMC23 (%) | IFBench (%) | IFEval (%) |
|---|---|---|---|---|---|
| **Qwen-3-4B** | 22.53 | 18.55 | 59.41 | 23.67 | 81.52 |
| + Refinement-FT | 31.67 | 21.25 | 64.06 | 30.44 | 83.73 |
| + Critique-FT | 34.58 | 24.58 | 65.94 | 31.67 | 82.63 |
| + GRPO | 42.72 | 35.42 | 76.85 | 33.33 | 84.45 |
| + Critique-GRPO | 45.72 | 35.89 | 76.14 | 35.67 | 85.21 |
| **+ GOLF** | **49.18** | **38.10** | **77.15** | **37.67** | **86.51** |
| **Qwen-3-8B** | 27.97 | 19.60 | 61.32 | 27.00 | 83.55 |
| + Refinement-FT | 42.08 | 27.50 | 67.81 | 34.33 | 84.29 |
| + Critique-FT | 46.75 | 28.75 | 70.31 | 33.60 | 84.45 |
| + GRPO | 55.05 | 38.02 | 78.61 | 35.65 | 84.76 |
| + Critique-GRPO | 55.49 | 37.86 | 77.58 | 36.33 | 85.58 |
| **+ GOLF** | **58.49** | **41.65** | **80.74** | **38.33** | **87.80** |

*GOLF consistently delivers the strongest results across all verifiable benchmarks, outperforming GRPO and Critique-GRPO.*

**Pass@k Analysis**: Figure 4 shows that GOLF improves both `Pass@1` and `Pass@k` on math reasoning benchmarks, indicating improved single-sample quality and broader solution coverage/diversity.
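For reference, `Pass@k` is often computed with the unbiased combinatorial estimator shown below. This summary does not state which estimator the paper uses, so treat the sketch as one common convention rather than the authors' exact procedure.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which are correct.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With `k = 1` the estimate reduces to the fraction of correct samples, while larger `k` rewards coverage: a policy that solves a problem only occasionally still scores well, which is why `Pass@k` is read as a signal of exploration breadth.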

**Code Generation**: On LCBv6 with Qwen-3-8B, GOLF achieves an `Avg@4` of **47.71**, outperforming GRPO by +3.63 points and showing **1.5x sample efficiency**. It also slightly outperforms SDPO (47.71 vs. 47.51).

## Theoretical and Practical Implications
**Significance of Findings**:
1.  **Complementary Feedback Sources**: The aggregation of **external critiques** (targeted error identification) and **intra-group attempts** (diverse failure patterns & partial ideas) yields richer refinement contexts and higher-quality refinements than using either source alone.
2.  **Adaptive Guidance Mechanism**: Injecting high-quality refinements as **off-policy scaffolds** in low-reward regimes effectively mitigates the exploration bottleneck caused by collapsed group-normalized advantages (e.g., all-zero groups), restoring usable policy gradients.
3.  **Joint Optimization Cycle**: Training generation and refinement jointly within a unified RL loop creates a **virtuous cycle**: improved self-refinement produces better scaffolds, which in turn improve exploration, leading to continuous improvement in both capabilities.
4.  **Exploration Diversity**: GOLF maintains higher policy entropy (Figure 8) and improves `Pass@k`, demonstrating that it promotes **diverse exploration** and prevents premature mode collapse.
5.  **Practical Efficiency**: The framework achieves substantial improvements in **sample efficiency (~2.2x)** and final performance across diverse task types, making RL training for LLMs more efficient and effective.

**Broader Impact**: The method provides a scalable path to leverage rich NL feedback (common in real-world interactions) to densify learning signals and guide exploration, reducing reliance on costly trial-and-error. It could lower computational costs and improve reliability in interactive settings.

**Potential Risks**: Stronger refinement and exploration capabilities may amplify risks related to generating persuasive or strategically optimized content. Bias in LLM-based judges/critiques could be reinforced. Responsible deployment and careful evaluation are necessary.

## Conclusion
**Main Takeaways**: GOLF effectively improves RL exploration for LLMs by aggregating group-level NL feedback (external critiques + intra-group attempts) into actionable refinements, adaptively injecting them as off-policy scaffolds in sparse-reward regions, and jointly optimizing generation and refinement. This leads to:
- Superior performance on both verifiable and non-verifiable tasks.
- Significant improvements in sample efficiency (~2.2x).
- Enhanced exploration diversity (higher entropy, better `Pass@k`).
- Improved self-refinement capability.

**Future Directions**: The method demonstrates that NL guidance is a practical and scalable path to more efficient and diverse exploration in language model RL. Combining GOLF's approach (aggregating diverse failures) with methods like SDPO (leveraging past successes) presents a promising direction for future work.

---

_Markdown view of https://picx.dev/p/D4Do4z, served by PicX — AI-generated visual whiteboard summaries of research papers._
