Online Experiential Learning for Language Models

Summary (Overview)

  • Key Contribution: Introduces Online Experiential Learning (OEL), a reward-free framework that enables Large Language Models (LLMs) to continuously improve from their own deployment experience, forming an online learning loop.
  • Core Mechanism: Operates as a loop over two stages: (1) extraction of transferable experiential knowledge from user-side interaction trajectories, and (2) consolidation of this knowledge into model parameters via on-policy context distillation.
  • Main Findings: OEL achieves consistent performance improvements across successive iterations on text-based games, enhances token efficiency (shorter responses), and preserves out-of-distribution performance better than off-policy alternatives.
  • Critical Insights: Extracted experiential knowledge is significantly more effective than raw trajectories, and on-policy consistency between the knowledge source and the policy model is essential for effective learning.
  • Practical Benefit: The framework requires no human annotations, reward models, or server-side access to user environments, making it scalable for real-world deployment.

Introduction and Theoretical Foundation

The prevailing paradigm for improving LLMs relies on offline training with human annotations or simulated environments (e.g., SFT, RL). This creates a bottleneck: once deployed, the model's rich stream of real-world interaction experience is discarded. The authors advocate for a shift to online learning, where models continuously improve from test-time experience accumulated during deployment.

Key Challenges: The server side cannot access user-side environments, and real-world interactions typically provide only unstructured textual feedback (e.g., outcome descriptions), not scalar rewards. Standard RL algorithms cannot consume such signals directly.

Theoretical Insight: OEL addresses these challenges by converting textual environment feedback into experiential knowledge that can be extracted, accumulated, and internalized. The process is entirely reward-free.

Methodology

OEL operates in an iterative loop between the User Side (deployment) and the Server Side (training).

1. Extract Experiential Knowledge from User Trajectories

On the user side, the model $\pi_\theta$ interacts with an environment $E$, collecting a set of $n$ multi-turn trajectories $\mathcal{T} = \{\tau_1, \tau_2, \dots, \tau_n\}$. Each trajectory $\tau_i = (f_1^i, a_1^i, f_2^i, a_2^i, \dots)$ is an alternating sequence of textual environment feedback ($f$) and model actions ($a$).

A language model $\pi_{\text{extract}}$ (by default, $\pi_{\text{extract}} = \pi_\theta$) extracts transferable knowledge from these trajectories in an accumulative fashion.

Formally, let $e_i$ denote the accumulated experiential knowledge after processing trajectory $\tau_i$, with $e_0 = \emptyset$. The recursive extraction and accumulation process is defined for $i = 1, \dots, n$ as:

$$e'_i \sim \pi_{\text{extract}}(\cdot \mid \tau_i, e_{i-1}), \qquad e_i = [e_{i-1}; e'_i]$$

where $[e_{i-1}; e'_i]$ denotes concatenation. This process is repeated with different random seeds to produce a set of accumulated experiential knowledge $\mathcal{C} = \{e_1, e_2, \dots, e_K\}$.
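As a concrete sketch, the accumulation loop can be written as follows; `extract_knowledge` is a hypothetical wrapper around a call to $\pi_{\text{extract}}$, and plain string concatenation stands in for $[e_{i-1}; e'_i]$:

```python
def accumulate_experience(trajectories, extract_knowledge):
    """Recursive extraction/accumulation over a list of trajectories.

    `extract_knowledge(trajectory, prior)` is a hypothetical wrapper around
    the extractor model pi_extract: it returns a text snippet e'_i given
    trajectory tau_i and the knowledge accumulated so far, e_{i-1}.
    """
    e = ""  # e_0: empty accumulated knowledge
    for tau in trajectories:
        e_prime = extract_knowledge(tau, e)  # e'_i ~ pi_extract(. | tau_i, e_{i-1})
        e = (e + "\n" + e_prime).strip()     # e_i = [e_{i-1}; e'_i] (concatenation)
    return e
```

Running this loop $K$ times with different random seeds (e.g. different trajectory orderings or sampling temperatures) would yield the knowledge set $\mathcal{C} = \{e_1, \dots, e_K\}$.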

2. Consolidate Experiential Knowledge into Model Weights

The server side consolidates the knowledge $\mathcal{C}$ into the model parameters via on-policy context distillation.

Training Data Construction: From a separate set of user-collected trajectories $\mathcal{T}'$, all partial rollout prefixes $x_i^j = (f_1^i, a_1^i, \dots, f_{j-1}^i, a_{j-1}^i, f_j^i)$ are extracted, forming a dataset $\mathcal{D} = \{x_i^j\}$.
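Under the indexing above, every prefix ends at a feedback turn. A minimal sketch of this prefix enumeration, treating a trajectory as a flat alternating list `[f1, a1, f2, a2, ...]` (the representation is an assumption for illustration):

```python
def rollout_prefixes(trajectory):
    """Enumerate all partial rollout prefixes x^j of one trajectory.

    `trajectory` is an alternating flat list [f1, a1, f2, a2, ...];
    a prefix x^j = (f1, a1, ..., f_{j-1}, a_{j-1}, f_j) ends at the j-th
    feedback, i.e. at the even indices 0, 2, 4, ... of the flat list.
    """
    num_feedback = (len(trajectory) + 1) // 2
    return [tuple(trajectory[: 2 * j + 1]) for j in range(num_feedback)]

# rollout_prefixes(["f1", "a1", "f2"]) -> [("f1",), ("f1", "a1", "f2")]
```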

Training Objective: The student model $\pi_\theta$ generates a response $y$ conditioned only on the prefix $x$. It is optimized to match the output of a knowledge-conditioned teacher $\pi_{\text{teacher}}$ (typically the frozen initial $\pi_\theta$) via token-level reverse KL divergence:

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, e \sim \mathcal{C},\, y \sim \pi_\theta(\cdot \mid x)} \left[ \frac{1}{|y|} \sum_{t=1}^{|y|} D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x, y_{<t}) \,\big\|\, \pi_{\text{teacher}}(\cdot \mid e, x, y_{<t}) \big) \right]$$

This process uses single-turn rollouts, requires no access to $E$, and provides a dense token-level signal derived from textual feedback alone.
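A minimal NumPy sketch of the per-token reverse KL term, assuming the student and teacher next-token logits have already been computed on the student's own rollout $y$ (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def reverse_kl_loss(student_logits, teacher_logits):
    """Token-level reverse KL: mean_t KL( pi_theta(.|x, y<t) || pi_teacher(.|e, x, y<t) ).

    Both inputs have shape (T, V): per-position next-token logits from the
    student (conditioned on the prefix x only) and the knowledge-conditioned
    teacher (conditioned on e and x), evaluated along the student's rollout y.
    """
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)           # stabilize
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    log_p = log_softmax(student_logits)                  # log pi_theta
    log_q = log_softmax(teacher_logits)                  # log pi_teacher
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)  # KL(p || q) per t
    return kl_per_token.mean()                           # (1/|y|) sum over t
```

Because the expectation is taken under the student's own distribution, the gradient is mode-seeking: the student is pulled toward behavior the teacher assigns high probability to, wherever the student already places mass.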

3. Online Learning Process

The two stages are iterated:

  1. The consolidated (improved) model $\pi_\theta$ is redeployed to collect new trajectories $\mathcal{T}$ and $\mathcal{T}'$.
  2. These higher-quality trajectories yield richer experiential knowledge $\mathcal{C}$.
  3. The new $\mathcal{C}$ drives the next consolidation round, forming a virtuous cycle of continuous improvement. Pseudocode is provided in Algorithm 1.
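The outer loop can be sketched as follows, with `collect_trajectories`, `extract`, and `consolidate` as hypothetical stand-ins for deployment, Stage 1, and Stage 2:

```python
def oel_round(policy, collect_trajectories, extract, consolidate):
    """One OEL iteration. Hypothetical helper contracts:
    collect_trajectories(policy) -> (T, T_prime)   # user-side rollouts
    extract(policy, T)           -> C              # stage 1: knowledge set
    consolidate(policy, C, T_prime) -> new policy  # stage 2: distillation
    """
    T, T_prime = collect_trajectories(policy)   # user side: deployment
    C = extract(policy, T)                      # stage 1: extraction
    return consolidate(policy, C, T_prime)      # stage 2: consolidation (server side)

def run_oel(policy, collect_trajectories, extract, consolidate, n_rounds):
    for _ in range(n_rounds):
        policy = oel_round(policy, collect_trajectories, extract, consolidate)
    return policy
```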

Empirical Validation / Results

Experiments were conducted on two text-based game environments from TextArena: Frozen Lake (navigation) and Sokoban (spatial reasoning).

Key Results

  • OEL Enables Progressive Online Learning: Iterating extraction and consolidation stages leads to consistent improvement in task pass rate across rounds (Figure 4).
  • OEL Improves Token Efficiency: Average per-turn response length decreases across iterations (to ~70% of initial length by round 3), indicating more efficient reasoning (Figure 5).
  • OEL Mitigates Catastrophic Forgetting: The on-policy context distillation used in OEL preserves out-of-distribution (OOD) performance on IF-Eval better than off-policy alternatives, while achieving higher in-distribution performance (Figure 6).

Effect of Model Size

Performance scaling across OEL rounds for Qwen3-1.7B, 4B, and 8B on Frozen Lake (Figure 7):

  • OEL yields substantial improvements for all model sizes.
  • Larger models achieve higher pass rates, and gains from Round 1 to Round 2 are consistent across scales.

Analysis Tables

Table 1: Learning from Experiential Knowledge over Raw Experience (Sokoban, Qwen3-4B-Instruct-2507)

Experience Type     In-Context Pass Rate (%)    Consolidate Pass Rate (%)
w/o Experience      7.5                         —
Raw Trajectory      10.9                        7.8
Knowledge           18.2                        21.4

Extracted experiential knowledge is substantially more effective than raw trajectories.

Table 2: On-Policy Consistency Between Experiential Knowledge and Policy Model (Frozen Lake, Qwen3-1.7B)

Experience Source     In-Context Pass Rate (%)    Consolidate Pass Rate (%)
w/o Experience        7.3                         —
Qwen3-4B              18.0                        22.7
Qwen3-1.7B (Self)     23.8                        31.1

On-policy experiential knowledge (from the model's own trajectories) yields higher performance than off-policy knowledge from a larger model.

Theoretical and Practical Implications

Theoretical Implications:

  • Demonstrates that textual feedback alone can serve as a sufficient learning signal for online improvement, bypassing the need for reward models.
  • Highlights the importance of on-policy consistency and knowledge extraction over using raw experience, providing insights for experience-based learning frameworks.
  • Shows that reverse KL divergence in on-policy context distillation helps preserve OOD performance.

Practical Implications:

  • Provides a scalable and practical framework for continuous model improvement post-deployment, as it requires only trajectory collection on the user side and training on the server side.
  • Enables reward-free online learning, eliminating the costly need for human annotations or simulated environments for every new scenario.
  • Improves both accuracy and inference efficiency, reducing computational costs during deployment.

Conclusion

Online Experiential Learning (OEL) represents a promising paradigm shift from static, offline training to continuous, online improvement from real-world deployment experience. By iteratively extracting and consolidating experiential knowledge via on-policy context distillation, OEL enables language models to become more accurate and efficient over time without sacrificing general capabilities. The framework's reward-free nature and lack of requirement for server-side environment access make it highly applicable for scalable real-world deployment. Future work may explore OEL in more complex, open-ended environments and investigate the long-term dynamics of the online learning loop.