ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Summary (Overview)

Problem: Standard policy gradient Reinforcement Learning (RL) fails when applied to Proactive Recommender Systems (PRS): the policy degenerates into producing long, identical paths instead of discovering high-quality guidance sequences.
Key Deficiencies: The paper identifies two core issues: (1) Length Shortcut: Path-level rewards decompose into step-level rewards with a positive mean, which biases the gradient toward longer paths. (2) High Gradient Variance: Weighting every step's gradient by the entire path reward ignores the reward decomposition structure and leads to noisy updates.
Proposed Solution (ProRL): Introduces a rectified policy gradient framework with two novel mechanisms: Stepwise Reward Centering (SRC) to neutralize the length-dependent bias and Position-Specific Advantage Estimation (PSAE) to compute low-variance, step-adapted baselines.
Main Result: ProRL significantly outperforms state-of-the-art PRS methods across three real-world datasets (MovieLens-1M, Steam, Amazon-Book) on key metrics for both path feasibility (CTR) and guidance effectiveness (IoI, IoR).
Insight: The RL stage acts as a "probabilistic rectifier," unlocking the high-quality path generation potential already latent in a supervised pre-trained model by shifting probability mass towards high-reward sequences.

Introduction and Theoretical Foundation

Proactive Recommender Systems (PRS) aim to gradually shift user preferences toward a platform-specified target item by generating a sequence (path) of intermediate recommendations. This addresses the tension between platforms wanting to promote new items and users being anchored in familiar preferences. The core challenge is jointly optimizing two objectives: Path Feasibility (high acceptance probability for each intermediate item) and Guidance Effectiveness (significantly increasing the user's interest in the target item).

Prior work includes heuristic methods (prone to local optima), LLM-based methods (prohibitively expensive), and supervised methods (limited to imitating historical data). The paper formalizes PRS as a reward maximization problem, where path quality $R_{path}$ is a weighted sum of quantitative metrics:

IoI (Increment of Interest): $\text{IoI} := \log P(i_T | S_u \oplus L_u) - \log P(i_T | S_u)$
IoR (Increment of Rank): $\text{IoR} := \text{Rank}(i_T | S_u) - \text{Rank}(i_T | S_u \oplus L_u)$
CTR (Click-Through Rate): $\text{CTR} := \frac{1}{|L_u|} \sum_{k=1}^{|L_u|} P(i_k | S_u \oplus L_{u}^{<k})$
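
The three metrics can be sketched in a few lines. The sketch assumes an evaluator that returns a next-item probability for every candidate item, reads $S_u$ as the user's interaction history, $L_u$ as the generated path, and $\oplus$ as sequence concatenation, and ranks items by descending probability with rank 1 as the best. The names `path_metrics`, `rank_of`, and `toy_evaluator` are hypothetical and not taken from the paper.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical evaluator interface (an assumption, not the paper's API):
# maps an item sequence to a probability for each candidate next item.
Evaluator = Callable[[Sequence[str]], Dict[str, float]]


def rank_of(item: str, probs: Dict[str, float]) -> int:
    """1-based rank of `item` under descending probability (1 = most likely)."""
    return 1 + sum(1 for p in probs.values() if p > probs[item])


def path_metrics(evaluator: Evaluator, history: Sequence[str],
                 path: Sequence[str], target: str) -> Dict[str, float]:
    """Compute IoI, IoR and CTR for one non-empty path."""
    history, path = list(history), list(path)
    before = evaluator(history)          # P(. | S_u)
    after = evaluator(history + path)    # P(. | S_u concatenated with L_u)
    ioi = math.log(after[target]) - math.log(before[target])
    ior = rank_of(target, before) - rank_of(target, after)
    # CTR: mean probability of each path item given the history plus the
    # path items that precede it.
    ctr = sum(evaluator(history + path[:k])[path[k]]
              for k in range(len(path))) / len(path)
    return {"IoI": ioi, "IoR": ior, "CTR": ctr}


def toy_evaluator(seq: Sequence[str]) -> Dict[str, float]:
    """Toy stand-in: next-item probabilities depend only on the last item."""
    table = {
        "a": {"a": 0.1, "b": 0.6, "c": 0.2, "t": 0.1},
        "b": {"a": 0.1, "b": 0.1, "c": 0.5, "t": 0.3},
        "c": {"a": 0.1, "b": 0.1, "c": 0.2, "t": 0.6},
    }
    return table[seq[-1]]


metrics = path_metrics(toy_evaluator, ["a"], ["b", "c"], "t")
print(metrics)
```

In this toy example the target moves from rank 3 to rank 1, so IoR is 2, IoI is $\log 6$, and CTR is the mean of 0.6 and 0.5.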

While RL is a natural framework for this exploration problem, preliminary experiments show that standard policy gradient optimization fails, causing the policy to degenerate into generating maximum-length, low-diversity paths. The theoretical foundation for this failure is established by decomposing any path reward $R$ into step-level increments:

$$R(i_1, \dots, i_L) = \sum_{t=1}^{L} r_t, \quad \text{where} \quad r_t := R(i_1, \dots, i_t) - R(i_1, \dots, i_{t-1})$$
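
The decomposition is a telescoping sum, which a short sketch makes concrete. It assumes the reward of the empty path is zero, so that the increments sum exactly to $R$; `step_rewards` and `toy_reward` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence


def step_rewards(path_reward: Callable[[Sequence[str]], float],
                 path: Sequence[str]) -> List[float]:
    """Return r_t = R(i_1..i_t) - R(i_1..i_{t-1}) for t = 1..L."""
    path = list(path)
    # Reward of every prefix, from the empty path up to the full path.
    prefix_values = [path_reward(path[:t]) for t in range(len(path) + 1)]
    return [prefix_values[t] - prefix_values[t - 1]
            for t in range(1, len(path) + 1)]


def toy_reward(prefix: Sequence[str]) -> float:
    """Toy path reward (an assumption): 1 per distinct item, 0 if empty."""
    return float(len(set(prefix)))


path = ["a", "b", "b", "c"]
increments = step_rewards(toy_reward, path)
print(increments)                          # [1.0, 1.0, 0.0, 1.0]
print(sum(increments), toy_reward(path))   # 3.0 3.0
```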

The paper proves that if the expected step reward $\mathbb{E}_\pi[r_t]$ is positive, the expected path reward grows with path length, creating a "length shortcut" bias in gradient estimation.
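
A short calculation shows where the length dependence comes from. As a simplifying assumption for illustration (not necessarily the paper's exact statement), suppose every step has the same expected reward $\bar{r} > 0$, independent of the path length $L$:

```latex
\mathbb{E}_\pi[R \mid L]
  = \sum_{t=1}^{L} \mathbb{E}_\pi[r_t]
  = L \, \bar{r},
\qquad \text{so} \qquad
\mathbb{E}_\pi[R \mid L+1] - \mathbb{E}_\pi[R \mid L] = \bar{r} > 0 .
```

Under this assumption, each extra step adds $\bar{r}$ to the expected return whichever item is chosen, so the policy can raise its expected reward simply by lengthening its paths.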

Methodology

ProRL rectifies the standard policy gradient estimator to address the identified deficiencies. The overall framework is illustrated in Figure 3 of the paper.

1. Stepwise Reward Centering (SRC): This mechanism eliminates the length shortcut by ensuring that extending a path yields zero expected gain. It subtracts the global expected step reward $\bar{r}$ from each step reward:

$$\tilde{r}_t = r_t - \bar{r}, \quad \text{where} \quad \bar{r} = \mathbb{E}_\pi[r_*]$$

This centers the step rewards so that $\mathbb{E}_\pi[\tilde{r}_t] = 0$, breaking the spurious correlation between expected return and path length. For multi-objective rewards with $K$ components, SRC is extended to per-component normalization:

$$\tilde{r}_t = \sum_{i=1}^{K} w_i \cdot \frac{r_t^{(i)} - \mu^{(i)}}{\sigma^{(i)}}$$

where $w_i$ is the weight of component $i$, and $\mu^{(i)}$ and $\sigma^{(i)}$ are the mean and standard deviation of component $i$'s step rewards, estimated from a warm-up epoch.
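
Both forms of SRC can be sketched as follows. This is a minimal illustration, assuming that the mean and standard deviation are estimated by pooling step rewards from warm-up rollouts, that the standard deviation is the population one, and that a small floor `eps` guards against a zero standard deviation. The function names, `eps`, and the toy numbers are assumptions rather than details from the paper.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Tuple


def center(step_rewards: List[float], r_bar: float) -> List[float]:
    """Single-reward SRC: subtract the global mean step reward."""
    return [r - r_bar for r in step_rewards]


def fit_stats(warmup: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Per-component (mean, std) from pooled warm-up step rewards."""
    return {name: (fmean(vals), pstdev(vals)) for name, vals in warmup.items()}


def normalize(step: Dict[str, float], stats: Dict[str, Tuple[float, float]],
              weights: Dict[str, float], eps: float = 1e-8) -> float:
    """Multi-reward SRC: weighted sum of z-scored component step rewards."""
    total = 0.0
    for name, w in weights.items():
        mu, sigma = stats[name]
        # eps is an added safeguard (assumption) for constant components.
        total += w * (step[name] - mu) / max(sigma, eps)
    return total


# Toy warm-up data and weights (made up for illustration).
warmup = {"ctr": [0.2, 0.4, 0.6], "ioi": [0.0, 1.0, 2.0]}
weights = {"ctr": 1.0, "ioi": 0.5}
stats = fit_stats(warmup)

steps = [{"ctr": c, "ioi": i} for c, i in zip(warmup["ctr"], warmup["ioi"])]
normalized = [normalize(s, stats, weights) for s in steps]
print(normalized)
print(fmean(normalized))  # close to 0 on the data the statistics came from

r_bar = fmean(warmup["ioi"])
print(center(warmup["ioi"], r_bar))  # [-1.0, 0.0, 1.0]
```

Because the statistics here are fitted on the same toy data they are applied to, the normalized rewards average to zero, which is the property that removes the incentive to extend a path.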

2. Position-Specific Advantage Estimation (PSAE): This mechanism reduces gradient variance by leveraging the reward decomposition structure. Instead of weighting each step's gradient by the full path reward $R$ (as in standard REINFORCE) or even the reward-to-go $G_t$, it uses a position-specific advantage. First, the reward-to-go for rollout $j$ of input $i$ is defined as:

$$G_t^{(i,j)} = \sum_{\ell=t}^{L^{(i,j)}} r_\ell^{(i,j)}$$
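
Reward-to-go is a reverse cumulative sum over a rollout's step rewards. A minimal sketch for a single rollout, with the hypothetical helper name `reward_to_go`:

```python
from typing import List


def reward_to_go(step_rewards: List[float]) -> List[float]:
    """G_t = r_t + r_{t+1} + ... + r_L for each step t (index 0 holds G_1)."""
    out = [0.0] * len(step_rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum of the later steps.
    for t in range(len(step_rewards) - 1, -1, -1):
        running += step_rewards[t]
        out[t] = running
    return out


rtg = reward_to_go([0.5, -0.25, 1.0])
print(rtg)  # [1.25, 0.75, 1.0]
```

The first entry equals the full path reward, and each later entry drops the rewards already collected before that step.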

Then, a baseline is computed as the average reward-to-go at step $t$ across all rollouts from the same input that reach step $t$:

$$\bar{G}_{i,t} = \frac{\sum_{j: L^{(i,j)} \ge t} G_t^{(i,j)}}{\sum_{j=1}^m \mathbb{I}[L^{(i,j)} \ge t]}$$
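
Because rollouts differ in length, the average at step $t$ runs only over the rollouts that are still active at that step. A sketch for the $m$ rollouts of a single input, taking their reward-to-go lists as given (the helper name `position_baselines` and the toy values are assumptions):

```python
from typing import List


def position_baselines(rtg_per_rollout: List[List[float]]) -> List[float]:
    """Mean reward-to-go at each step over rollouts that reach that step.

    rtg_per_rollout[j][t - 1] holds G_t for rollout j of one input.
    """
    max_len = max((len(g) for g in rtg_per_rollout), default=0)
    baselines = []
    for t in range(max_len):
        # Only rollouts with length > t contribute at this position.
        reaching = [g[t] for g in rtg_per_rollout if len(g) > t]
        baselines.append(sum(reaching) / len(reaching))
    return baselines


# Three rollouts of lengths 2, 1 and 3 (toy reward-to-go values).
rollouts = [[3.0, 1.0], [1.0], [2.0, 2.0, 1.0]]
print(position_baselines(rollouts))  # [2.0, 1.5, 1.0]
```

At step 3 only one rollout remains, so the baseline coincides with that rollout's own reward-to-go.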

The position-specific advantage is: $\hat{A}_t^{(i,j)} = G_t^{(i,j)} - \bar{G}_{i,t}$. The final rectified policy gradient estimator is:

$$\hat{g}_{\text{rect}} = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ \sum_{t=1}^{L^{(i,j)}} \nabla_\theta \log \pi_\theta^{(i,j,t)} \cdot \hat{A}_t^{(i,j)} \right]$$

This estimator is unbiased and has lower variance because it excludes past rewards (via $G_t$) and uses a tight, step-adapted baseline.
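
The estimator can be sketched without an autodiff library by treating each step's score vector $\nabla_\theta \log \pi_\theta$ as a given list of floats. The sketch assumes the score vectors and the advantages $\hat{A}_t^{(i,j)}$ have already been computed and that every input has the same number of rollouts $m$; `rectified_gradient` and the toy numbers are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
# A rollout is a list of (score_vector, advantage) pairs, one per step, where
# score_vector stands in for grad_theta log pi_theta at that step.
Rollout = List[Tuple[Vector, float]]


def rectified_gradient(batch: List[List[Rollout]], dim: int) -> Vector:
    """Average over n inputs and m rollouts of sum_t score_t * advantage_t."""
    n = len(batch)
    m = len(batch[0])  # assumes the same number of rollouts per input
    grad = [0.0] * dim
    for rollouts in batch:
        for rollout in rollouts:
            for score, advantage in rollout:
                for d in range(dim):
                    grad[d] += score[d] * advantage
    return [g / (n * m) for g in grad]


# One input (n = 1) with two rollouts (m = 2) of lengths 2 and 1.
batch = [[
    [([1.0, 0.0], 0.5), ([0.0, 1.0], 0.0)],
    [([-1.0, 1.0], -0.5)],
]]
g_hat = rectified_gradient(batch, dim=2)
print(g_hat)  # [0.5, -0.25]
```

In the toy batch, the second step of the first rollout carries zero advantage because it is the only rollout reaching that position, so it does not move the gradient.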

Empirical Validation / Results

Experiments were conducted on three datasets: MovieLens-1M, Steam, and Amazon-Book. The policy was initialized via supervised pre-training on historical paths and then fine-tuned with RL.

Overall Performance: ProRL consistently outperforms all baselines, including sequential recommenders (GRU4Rec, BERT4Rec), supervised PRS (IRN), heuristic methods (IPG, ITM-PRec), and LLM-based methods (LLM-IPP, T-PRA).

Table 1: Overall Performance (SASRec as Evaluator)

Dataset	Model	CTR	Coherence	IoI	IoR
MovieLens-1M	ProRL (Ours)	0.8543*	0.8422*	2.8504*	728.18*
Steam	ProRL (Ours)	0.5625*	0.8707*	1.1188*	340.18*
Amazon-Book	ProRL (Ours)	0.8568*	0.6775*	2.9812*	1383.41*

Note: * indicates statistically significant improvement (p < 0.05). ProRL achieves the best guidance effectiveness (IoI, IoR) and path feasibility (CTR), while also excelling at the unrewarded Coherence metric.

Generalization (Cross-Evaluator Analysis): To test for overfitting, policies were evaluated using unseen user simulators (GRU4Rec, LightSANs, BERT4Rec). ProRL maintained superior performance, demonstrating it learns generalizable guidance strategies rather than exploiting a specific reward model.

Table 2: Cross-Evaluator Analysis (GRU4Rec as Evaluator)

Dataset	Model	CTR	Coherence	IoI	IoR
MovieLens-1M	ProRL (Ours)	0.8460*	0.8422*	2.4560*	649.26*
Steam	ProRL (Ours)	0.6328*	0.8707*	0.2013*	83.70*
Amazon-Book	ProRL (Ours)	0.8832*	0.6775*	1.7650*	1001.27*

Ablation Studies:

Rectification Modules: Removing SRC leads to path collapse (over-optimization of CTR at the expense of IoI/IoR). Removing PSAE reduces performance across all metrics.
Multi-Reward Design: All three reward components (CTR, IoI, IoR) are necessary for optimal performance; removing any one of them degrades the metrics, indicating they are mutually reinforcing.
Gradient Estimators: ProRL's estimator achieves the best balance of performance and stability. Compared to the alternatives in Table 5 (REINFORCE, GRPO, A2C, and plain reward-to-go, with RF and RTG abbreviating the first and last), ProRL converges to moderate, stable path lengths (3.8 at E10, versus 10.0 throughout for GRPO and 1.5 for REINFORCE) and has the lowest advantage variance (0.05× to 0.06×).

Table 5: Analysis of Gradient Estimators on MovieLens-1M

Method	CTR	IoI	IoR	Avg. Path Length (E1/E5/E10)	Adv. Variance (E1/E2/E3)
RF	0.581	1.626	329.8	5.2 / 2.9 / 1.5	1.00× / 1.18× / 0.94×
GRPO	0.633	1.483	284.9	10.0 / 10.0 / 10.0	0.22× / 0.21× / 0.19×
A2C	0.857	1.695	527.5	1.8 / 4.7 / 5.3	0.09× / 0.12× / 0.17×
RTG	0.694	2.383	675.7	1.5 / 3.4 / 4.1	0.12× / 0.11× / 0.10×
ProRL	0.854	2.850	728.2	1.6 / 3.1 / 3.8	0.06× / 0.05× / 0.05×

Training Stage Analysis: The supervised pre-trained model establishes high path feasibility (CTR) but has limited guidance effectiveness. The RL stage dramatically improves effectiveness (IoI, IoR) while maintaining feasibility. A Rollout@K analysis reveals that the pre-trained model, when sampled extensively, contains high-quality paths in its low-probability tail. ProRL's RL stage acts as a "probabilistic rectifier," identifying and up-weighting these high-reward paths.

Table 7: Latent Capacity of Pretrained Model (Rollout@K)

Dataset	Metric	@1	@5	@10
MovieLens-1M	Max-IoI	1.1347	2.7779	3.3585
	Max-IoR	294.53	717.69	851.03
Steam	Max-IoI	0.2395	1.8728	2.4803
	Max-IoR	57.89	818.11	1074.35
Amazon-Book	Max-IoI	0.1523	2.2524	3.0780
	Max-IoR	52.47	1132.01	1509.70
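
One plausible reading of Rollout@K, stated here as an assumption, is: draw K paths from the pre-trained policy for a user, score each, and keep the best score. The sketch below follows that reading, with a hypothetical sampler and scorer standing in for the pre-trained model and the IoI or IoR evaluator.

```python
import random
from typing import Callable, Dict, List, Sequence


def max_at_k(sample_path: Callable[[random.Random], List[int]],
             score: Callable[[Sequence[int]], float],
             ks: Sequence[int], rng: random.Random) -> Dict[int, float]:
    """Best score among the first k of max(ks) sampled paths, for each k."""
    scores = [score(sample_path(rng)) for _ in range(max(ks))]
    return {k: max(scores[:k]) for k in ks}


def toy_sampler(rng: random.Random) -> List[int]:
    """Hypothetical stand-in for the pre-trained policy: 3 random item ids."""
    return [rng.randrange(100) for _ in range(3)]


def toy_score(path: Sequence[int]) -> float:
    """Hypothetical stand-in for a path metric such as IoI."""
    return sum(path) / 100.0


result = max_at_k(toy_sampler, toy_score, ks=(1, 5, 10), rng=random.Random(0))
print(result)
```

Because the same samples are reused for every K, the values are non-decreasing in K, consistent with the pattern in Table 7.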

Theoretical and Practical Implications

Theoretical Implications:

The paper provides a formal analysis of why standard policy gradients fail in PRS, identifying the length shortcut as a fundamental issue arising from the positive-mean structure of decomposed step rewards.
It establishes Stepwise Reward Centering as a principled solution to decouple expected reward from path length, a concept potentially applicable to other RL tasks with similar reward structures.
It demonstrates how Position-Specific Advantage Estimation can effectively reduce variance without needing a learned critic model, by exploiting the inherent temporal decomposition of the task reward.

Practical Implications:

Effective RL for PRS: ProRL provides a practical and effective framework for applying RL to proactive recommendation, overcoming the failure modes of standard methods.
Improved Recommendation Strategy: The learned policies generate paths that are both highly feasible (maintaining user engagement) and effective (shifting preference), as validated by superior performance on all metrics.
Efficiency and Generalizability: The method is based on a lightweight transformer, avoiding the high cost of LLM-based methods. Its strong performance under cross-evaluator analysis shows it learns robust, generalizable principles of user guidance.
Insight into RL Fine-tuning: The "probabilistic rectifier" perspective offers a nuanced view of how RL fine-tuning works with pre-trained models, emphasizing the selection and reinforcement of high-quality behaviors already within the model's capacity.

Conclusion

ProRL successfully addresses the critical deficiencies of standard policy gradient estimation in Proactive Recommender Systems. By introducing Stepwise Reward Centering to eliminate the length shortcut and Position-Specific Advantage Estimation to reduce gradient variance, the framework produces rectified gradients that directly optimize path quality. Extensive experiments confirm that ProRL significantly outperforms state-of-the-art baselines across multiple datasets and evaluation settings. The work demonstrates that with properly designed gradient estimators, RL can be a highly effective tool for discovering high-quality guidance paths that balance immediate user acceptance with long-term preference shift.