# Online Experiential Learning for Language Models

> Online Experiential Learning enables LLMs to continuously improve from deployment experience without rewards by extracting and consolidating experiential knowledge.

- **Source:** [arXiv](https://arxiv.org/abs/2603.16856)
- **Published:** 2026-03-19
- **Permalink:** https://picx.dev/p/vJjlgH
- **Whiteboard:** https://picx.dev/p/vJjlgH/image

## Summary

* **Key Contribution**: Introduces **Online Experiential Learning (OEL)**, a reward-free framework that enables Large Language Models (LLMs) to continuously improve from their own deployment experience, forming an online learning loop.
* **Core Mechanism**: Operates in two iterable stages: (1) **Extraction** of transferable experiential knowledge from user-side interaction trajectories, and (2) **Consolidation** of this knowledge into model parameters via **on-policy context distillation**.
* **Main Findings**: OEL achieves consistent performance improvements across successive iterations on text-based games, enhances token efficiency (shorter responses), and preserves out-of-distribution performance better than off-policy alternatives.
* **Critical Insights**: Extracted experiential knowledge is significantly more effective than raw trajectories, and **on-policy consistency** between the knowledge source and the policy model is essential for effective learning.
* **Practical Benefit**: The framework requires **no human annotations, reward models, or server-side access to user environments**, making it scalable for real-world deployment.

## Introduction and Theoretical Foundation
The prevailing paradigm for improving LLMs relies on **offline training** with human annotations or simulated environments (e.g., SFT, RL). This creates a bottleneck: once deployed, the model's rich stream of real-world interaction experience is discarded. The authors advocate for a shift to **online learning**, where models continuously improve from test-time experience accumulated during deployment.

**Key Challenges**: The server side cannot access user-side environments, and real-world interactions typically provide only unstructured **textual feedback** (e.g., outcome descriptions), not scalar rewards. Standard RL algorithms cannot consume such signals directly.

**Theoretical Insight**: OEL addresses these challenges by converting textual environment feedback into **experiential knowledge** that can be extracted, accumulated, and internalized. The process is entirely **reward-free**.

## Methodology
OEL operates in an iterative loop between the **User Side** (deployment) and the **Server Side** (training).

### 1. Extract Experiential Knowledge from User Trajectories
On the user side, the model $\pi_{\theta}$ interacts with an environment $E$, collecting a set of $n$ multi-turn trajectories $\mathcal{T} = \{ \tau_1, \tau_2, ..., \tau_n \}$. Each trajectory $\tau_i = (f_1^i, a_1^i, f_2^i, a_2^i, ...)$ is an alternating sequence of textual environment feedback ($f$) and model actions ($a$), beginning with the initial feedback $f_1^i$.

A language model $\pi_{\text{extract}}$ (default: $\pi_{\text{extract}} = \pi_{\theta}$) extracts transferable knowledge from these trajectories in an **accumulative fashion**.

Formally, let $e_i$ denote the accumulated experiential knowledge after processing trajectory $\tau_i$, with $e_0 = \emptyset$. The recursive extraction and accumulation process is defined for $i = 1,...,n$ as:
$$
e'_i \sim \pi_{\text{extract}}(\cdot | \tau_i, e_{i-1})
$$
$$
e_i = [e_{i-1}; e'_i]
$$
where $[e_{i-1}; e'_i]$ denotes concatenation. The whole pass is repeated with different random seeds, and the accumulated knowledge from each run is collected into a set $\mathcal{C} = \{e_1, e_2, ..., e_K\}$ (here the subscript indexes the run, not the trajectory).
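
The recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `accumulate_knowledge`, `build_knowledge_set`, and the `toy_extract` stand-in for $\pi_{\text{extract}}$ are hypothetical names, and trajectories and knowledge items are assumed to be plain strings.

```python
from typing import Callable, List

# Hypothetical signature for pi_extract: (trajectory, accumulated knowledge) -> new item.
Extractor = Callable[[str, List[str]], str]


def accumulate_knowledge(trajectories: List[str], extract: Extractor) -> List[str]:
    """Apply e'_i ~ pi_extract(. | tau_i, e_{i-1}) and e_i = [e_{i-1}; e'_i]."""
    knowledge: List[str] = []                 # e_0 is empty
    for tau in trajectories:
        new_item = extract(tau, knowledge)    # e'_i, conditioned on tau_i and e_{i-1}
        knowledge = knowledge + [new_item]    # concatenation [e_{i-1}; e'_i]
    return knowledge


def build_knowledge_set(
    trajectories: List[str],
    make_extractor: Callable[[int], Extractor],
    seeds: List[int],
) -> List[List[str]]:
    """Repeat the accumulation once per seed to obtain the set C."""
    return [accumulate_knowledge(trajectories, make_extractor(s)) for s in seeds]


def toy_extract(tau: str, knowledge: List[str]) -> str:
    # Stand-in for an LLM call (assumption): it only labels the lesson.
    return f"lesson {len(knowledge) + 1} from {tau}"


accumulated = accumulate_knowledge(["tau_1", "tau_2"], toy_extract)
knowledge_set = build_knowledge_set(["tau_1", "tau_2"], lambda seed: toy_extract, [0, 1, 2])
```

Each call to the extractor sees everything accumulated so far, which is what lets later items refine or avoid repeating earlier ones.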

### 2. Consolidate Experiential Knowledge into Model Weights
The server side consolidates the knowledge $\mathcal{C}$ into the model parameters via **on-policy context distillation**.

**Training Data Construction**: From a separate set of user-collected trajectories $\mathcal{T}'$, all **partial rollout prefixes** $x_i^j = (f_1^i, a_1^i, ..., f_{j-1}^i, a_{j-1}^i, f_j^i)$ are extracted, forming a dataset $\mathcal{D} = \{x_i^j\}$.
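
As a rough sketch of this construction, assume each trajectory is stored as a flat list `[f_1, a_1, f_2, a_2, ...]` of strings. The function names below are hypothetical, and whether a trailing terminal feedback should also produce a prefix is left as an assumption (here it does).

```python
from typing import List


def rollout_prefixes(trajectory: List[str]) -> List[List[str]]:
    """Return every prefix x^j = (f_1, a_1, ..., a_{j-1}, f_j) of one trajectory.

    Feedback sits at even indices of the flat list, so each prefix ends
    at an even index and the model's next action would follow it.
    """
    return [trajectory[: end + 1] for end in range(0, len(trajectory), 2)]


def build_dataset(trajectories: List[List[str]]) -> List[List[str]]:
    """Pool the prefixes of all trajectories in T' into the dataset D."""
    return [prefix for tau in trajectories for prefix in rollout_prefixes(tau)]


example = ["f1", "a1", "f2", "a2"]
prefixes = rollout_prefixes(example)
dataset = build_dataset([example, ["g1", "b1", "g2"]])
```

For `example`, the prefixes are `["f1"]` and `["f1", "a1", "f2"]`: one training context per decision point in the original rollout.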

**Training Objective**: The student model $\pi_{\theta}$ generates a response $y$ conditioned only on prefix $x$. It is optimized to match the output of a **knowledge-conditioned teacher** $\pi_{\text{teacher}}$ (typically the frozen initial $\pi_{\theta}$) via token-level reverse KL divergence:

$$
\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D}, e \sim \mathcal{C}, y \sim \pi_{\theta}(\cdot|x)} \left[ \frac{1}{|y|} \sum_{t=1}^{|y|} D_{\text{KL}}\big( \pi_{\theta}(\cdot | x, y_{<t}) \; \big\| \; \pi_{\text{teacher}}(\cdot | e, x, y_{<t}) \big) \right]
$$

This process uses single-turn rollouts, requires no access to $E$, and provides dense token-level signal from textual feedback alone.
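
To make the objective concrete, the sketch below computes the inner term of $\mathcal{L}(\theta)$ for one sampled response: the mean over token positions of $D_{\text{KL}}(\pi_{\theta} \| \pi_{\text{teacher}})$. It assumes the per-token distributions have already been obtained (the student's from $(x, y_{<t})$, the teacher's from $(e, x, y_{<t})$) and that the teacher assigns nonzero probability wherever the student does. The function names are illustrative, and a real training loop would use an autodiff framework rather than plain Python lists.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def reverse_kl_loss(
    student_dists: Sequence[Sequence[float]],
    teacher_dists: Sequence[Sequence[float]],
) -> float:
    """Average D_KL(student_t || teacher_t) over the |y| token positions."""
    if len(student_dists) != len(teacher_dists) or not student_dists:
        raise ValueError("need one student and one teacher distribution per token")
    total = sum(kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists))
    return total / len(student_dists)


# Two-token response over a two-word vocabulary (made-up numbers).
student = [[0.5, 0.5], [0.5, 0.5]]
teacher = [[0.5, 0.5], [0.25, 0.75]]
loss = reverse_kl_loss(student, teacher)
```

The student distribution is the first argument of the KL term, so the expectation is taken under the student's own predictions; this is what makes the divergence "reverse" relative to the usual teacher-first distillation loss.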

### 3. Online Learning Process
The two stages are iterated:
1. The consolidated (improved) model $\pi_{\theta}$ is redeployed to collect new trajectories $\mathcal{T}$ and $\mathcal{T}'$.
2. These higher-quality trajectories yield richer experiential knowledge $\mathcal{C}$.
3. This new $\mathcal{C}$ drives the next consolidation round.

This forms a **virtuous cycle** of continuous improvement. Pseudocode is provided in Algorithm 1 of the paper.
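
The loop can also be outlined in Python. This is a schematic under stated assumptions, not a transcription of the paper's algorithm: `run_oel` and its three callables (`collect`, `extract`, `consolidate`) are hypothetical placeholders for user-side deployment, knowledge extraction, and on-policy context distillation.

```python
from typing import Callable, List, Tuple, TypeVar

M = TypeVar("M")  # stands in for a model checkpoint


def run_oel(
    model: M,
    collect: Callable[[M], Tuple[List[str], List[str]]],
    extract: Callable[[M, List[str]], List[str]],
    consolidate: Callable[[M, List[str], List[str]], M],
    rounds: int,
) -> List[M]:
    """Iterate extraction and consolidation, returning every checkpoint."""
    checkpoints = [model]
    for _ in range(rounds):
        traj_extract, traj_distill = collect(model)           # user side: T and T'
        knowledge = extract(model, traj_extract)              # knowledge set C
        model = consolidate(model, knowledge, traj_distill)   # server-side training
        checkpoints.append(model)                             # redeployed next round
    return checkpoints


# Toy stand-ins (assumptions): the "model" is just a round counter.
def toy_collect(m: int) -> Tuple[List[str], List[str]]:
    return [f"T-{m}"], [f"T'-{m}"]


def toy_extract(m: int, trajs: List[str]) -> List[str]:
    return [f"lesson from {t}" for t in trajs]


def toy_consolidate(m: int, knowledge: List[str], trajs: List[str]) -> int:
    return m + 1


checkpoints = run_oel(0, toy_collect, toy_extract, toy_consolidate, rounds=3)
```

Note that the environment only appears inside `collect`; extraction and consolidation work from stored trajectories, which matches the claim that the server side never needs access to $E$.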

## Empirical Validation / Results
Experiments were conducted on two text-based game environments from **TextArena**: **Frozen Lake** (navigation) and **Sokoban** (spatial reasoning).

### Key Results
* **OEL Enables Progressive Online Learning**: Iterating extraction and consolidation stages leads to consistent improvement in task **pass rate** across rounds (Figure 4).
* **OEL Improves Token Efficiency**: Average per-turn response length decreases across iterations (to ~70% of initial length by round 3), indicating more efficient reasoning (Figure 5).
* **OEL Mitigates Catastrophic Forgetting**: The **on-policy** context distillation used in OEL preserves out-of-distribution (OOD) performance on **IF-Eval** better than **off-policy** alternatives, while achieving higher in-distribution performance (Figure 6).

### Effect of Model Size
Performance scaling across OEL rounds for Qwen3-1.7B, 4B, and 8B on Frozen Lake (Figure 7):
* OEL yields substantial improvements for all model sizes.
* Larger models achieve higher pass rates, and gains from Round 1 to Round 2 are consistent across scales.

### Analysis Tables
**Table 1: Learning from Experiential Knowledge over Raw Experience (Sokoban, Qwen3-4B-Instruct-2507)**

| Experience Type | In-Context Pass Rate (%) | Consolidate Pass Rate (%) |
| --------------- | ------------------------ | ------------------------- |
| w/o Experience  | 7.5                      | -                         |
| Raw Trajectory  | 10.9                     | 7.8                       |
| Knowledge       | 18.2                     | 21.4                      |

*Extracted experiential knowledge is substantially more effective than raw trajectories.*

**Table 2: On-Policy Consistency Between Experiential Knowledge and Policy Model (Frozen Lake, Qwen3-1.7B)**

| Experience Source | In-Context Pass Rate (%) | Consolidate Pass Rate (%) |
| ----------------- | ------------------------ | ------------------------- |
| w/o Experience    | 7.3                      | -                         |
| Qwen3-4B          | 18.0                     | 22.7                      |
| Qwen3-1.7B (Self) | 23.8                     | 31.1                      |

*On-policy experiential knowledge (from the model's own trajectories) yields higher performance than off-policy knowledge from a larger model.*
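
For a quick sense of scale, the in-context gains over the no-experience baseline can be read off Tables 1 and 2 directly. The snippet below only restates the tabulated numbers as percentage-point differences; the helper name `gain_over_baseline` is illustrative.

```python
def gain_over_baseline(pass_rate: float, baseline: float) -> float:
    """Percentage-point improvement over the no-experience baseline."""
    return round(pass_rate - baseline, 1)


# In-context pass rates (%) copied from Table 1 (Sokoban) and Table 2 (Frozen Lake).
table1 = {"baseline": 7.5, "Raw Trajectory": 10.9, "Knowledge": 18.2}
table2 = {"baseline": 7.3, "Qwen3-4B": 18.0, "Qwen3-1.7B (Self)": 23.8}

gains1 = {k: gain_over_baseline(v, table1["baseline"]) for k, v in table1.items() if k != "baseline"}
gains2 = {k: gain_over_baseline(v, table2["baseline"]) for k, v in table2.items() if k != "baseline"}
```

In context, extracted knowledge adds 10.7 points on Sokoban versus 3.4 for raw trajectories, and self-generated knowledge adds 16.5 points on Frozen Lake versus 10.7 for knowledge from Qwen3-4B.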

## Theoretical and Practical Implications
**Theoretical Implications**:
* Demonstrates that **textual feedback alone** can serve as a sufficient learning signal for online improvement, bypassing the need for reward models.
* Highlights the importance of **on-policy consistency** and **knowledge extraction** over using raw experience, providing insights for experience-based learning frameworks.
* Shows that on-policy context distillation with a **reverse KL divergence** objective preserves OOD performance better than off-policy alternatives.

**Practical Implications**:
* Provides a **scalable and practical** framework for continuous model improvement post-deployment, as it requires only trajectory collection on the user side and training on the server side.
* Enables **reward-free online learning**, eliminating the costly need for human annotations or simulated environments for every new scenario.
* Improves both **accuracy** and **inference efficiency**, reducing computational costs during deployment.

## Conclusion
Online Experiential Learning (OEL) represents a promising paradigm shift from static, offline training to **continuous, online improvement** from real-world deployment experience. By iteratively extracting and consolidating experiential knowledge via on-policy context distillation, OEL enables language models to become more accurate and efficient over time without sacrificing general capabilities. Because the framework is reward-free and needs no server-side access to user environments, it is well suited to scalable real-world deployment. Future work may explore OEL in more complex, open-ended environments and investigate the long-term dynamics of the online learning loop.

---

_Markdown view of https://picx.dev/p/vJjlgH, served by PicX — AI-generated visual whiteboard summaries of research papers._
