RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Summary (Overview)

  • Unified Generator-Discriminator Framework: Proposes a novel architecture that decouples motion planning into a diffusion-based generator for diverse trajectory exploration and an RL-optimized discriminator for reranking based on long-term driving quality, addressing the instability of directly applying sparse RL rewards to high-dimensional trajectory spaces.
  • Novel RL Optimization Techniques: Introduces Temporally Consistent Group Relative Policy Optimization (TC-GRPO) to alleviate credit assignment problems via temporal coherence, and On-policy Generator Optimization (OGO) to progressively shift the generator's distribution toward high-reward trajectory manifolds using structured longitudinal optimization signals.
  • High-Throughput Simulation: Develops BEV-Warp, an efficient, feature-level closed-loop simulation environment that leverages spatial warping of Bird's-Eye View (BEV) features, bypassing costly image rendering and enabling scalable RL training.
  • Significant Performance Gains: Demonstrates a 56% reduction in collision rate compared to strong diffusion-based planners in closed-loop simulation and shows improved perceived safety and smoothness in real-world deployment.
  • Effective Scaling: The framework shows superior scaling efficiency with joint optimization of the generator and discriminator, outperforming discriminator-only or two-stage training strategies.

Introduction and Theoretical Foundation

Achieving robust, safe, and human-like motion planning is a core challenge for high-level autonomous driving. While diffusion-based planners have emerged as a promising approach for modeling multimodal continuous trajectories, they suffer from key limitations when applied to real-world driving:

  1. Stochastic Instabilities: Real-world datasets contain noise and uneven distributions, leading to occasional low-quality or unstable trajectories.
  2. Lack of Negative Feedback: Pure imitation learning (IL) provides no corrective feedback to suppress unrealistic or unsafe behaviors.
  3. Causal Confusion & Open-loop Mismatch: IL learns correlations instead of causal factors and is trained open-loop, mismatching the closed-loop nature of real driving.
  4. RL Optimization Challenges: Directly applying Reinforcement Learning (RL) is difficult due to the mismatch between low-dimensional scalar rewards and high-dimensional, temporally structured trajectory action spaces, leading to unstable optimization and severe credit assignment problems.

To address these issues, RAD-2 proposes a unified generator-discriminator framework. The core idea is to decouple exploration from evaluation. A diffusion generator $G_\theta(\tau \mid o)$ produces a diverse set of candidate trajectories. An RL-trained discriminator $D_\phi(\tau \mid o, C)$ evaluates and reranks these candidates based on their expected long-term outcomes. This avoids directly applying sparse rewards to the full trajectory space.

The two components define a joint policy distribution:

$$\Pi_{\theta,\phi}(\tau \mid o) = \mathbb{E}_{C \sim G_{\theta}(\cdot \mid o)}\left[ D_{\phi}(\tau \mid o, C) \right]$$

which aligns with the probabilistic inference framework for optimal control. This architecture also supports inference-time scaling: increasing the number of candidate samples $M$ allows better trajectory selection without retraining.
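As a toy illustration of the generate-then-rerank policy and its inference-time scaling knob $M$ (a minimal sketch, not the paper's implementation; `generator` and `discriminator` are hypothetical stand-ins for $G_\theta$ and $D_\phi$):

```python
import random

def generator(obs, M):
    """Hypothetical stand-in for G_theta: sample M candidate trajectories
    of 8 waypoints each, "conditioned" on the observation via the seed."""
    random.seed(obs)
    return [[(k * 1.0, random.uniform(-1.0, 1.0)) for k in range(8)]
            for _ in range(M)]

def discriminator(traj):
    """Hypothetical stand-in for D_phi: score in (0, 1], here simply
    preferring trajectories with small mean lateral deviation."""
    lateral = sum(abs(y) for _, y in traj) / len(traj)
    return 1.0 / (1.0 + lateral)

def plan(obs, M=8):
    """Joint policy: generate M candidates, rerank, execute the argmax."""
    candidates = generator(obs, M)
    scores = [discriminator(tau) for tau in candidates]
    best = max(range(M), key=lambda m: scores[m])
    return candidates[best], scores[best]

# Inference-time scaling: with the same seed, M=64 extends the M=8 candidate
# set, so the best reranked score can only stay equal or improve.
_, s8 = plan(obs=42, M=8)
_, s64 = plan(obs=42, M=64)
```

Because evaluation sits in the discriminator rather than the generator, a larger candidate set improves selection quality with no retraining of either component.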

Methodology

3.1 Generator-Discriminator Framework

The framework decomposes planning into two jointly optimized components.

3.1.1 Diffusion-based Generator

The generator models a multimodal distribution over future trajectories conditioned on the observation $o_t$.

  • Scene Encoding: Observations are encoded into BEV features. Static map elements $X_{map}$, dynamic agents $X_{agent}$, and navigation inputs $X_{nav}$ are extracted and encoded into token embeddings: $T_m = E_m(X_{map})$, $T_a = E_a(X_{agent})$, $T_n = E_n(X_{nav})$. These are fused with BEV features $T_b$ via a learnable module $F(\cdot)$ to obtain a unified scene embedding $E_{scene}$.
  • Trajectory Generation: For $M$ modes, an initial noise trajectory $\tau^{(0,m)} \sim \mathcal{N}(0, I)$ is denoised for $K$ steps using a conditional denoising network $G$: $\tau^{(k,m)} = G(\tau^{(k-1,m)}, E_{scene}, k)$, $k = 1, \dots, K$. The final candidate set is $\hat{T} = \{\hat{\tau}^m_{t:t+H}\}_{m=1}^M$, where $\hat{\tau}^m_{t:t+H} = \tau^{(K,m)} \sim G_\theta(\tau \mid o_t)$.
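A minimal sketch of the $K$-step conditional denoising loop, with a toy `denoise_step` standing in for the network $G$ and a scene embedding that doubles as the clean target (the names and the noise schedule are illustrative assumptions, not the paper's network):

```python
import random

def denoise_step(tau, scene_embedding, k, K):
    """Toy stand-in for the conditional denoising network G: pull each
    waypoint a fraction alpha_k toward a scene-conditioned target, with
    corrections growing as k runs from 1 to K."""
    alpha = 1.0 / (K - k + 2)
    return [(1 - alpha) * x + alpha * t for x, t in zip(tau, scene_embedding)]

def generate_candidates(scene_embedding, M=6, K=10, seed=0):
    """Sample M noise trajectories tau^(0,m) ~ N(0, I) and denoise each
    for K steps, yielding the candidate set {tau^(K,m)}."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(M):
        tau = [rng.gauss(0.0, 1.0) for _ in scene_embedding]
        for k in range(1, K + 1):
            tau = denoise_step(tau, scene_embedding, k, K)
        candidates.append(tau)
    return candidates

# Toy 1-D "scene embedding" that doubles as the clean target signal.
target = [0.0, 1.0, 2.0, 3.0]
cands = generate_candidates(target, M=6, K=10)
```

Each candidate starts from an independent noise draw, so the final set stays diverse while concentrating near the scene-conditioned target.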

3.1.2 RL-based Discriminator

The discriminator evaluates candidate trajectories and provides a preference distribution.

  • Trajectory Encoding: Each trajectory point is embedded, and the sequence is processed by a Transformer encoder. A [CLS] token output serves as the trajectory-level query $Q_\tau$.
  • Scene Conditioning & Interaction: The discriminator constructs its own scene representation from $X_{map}$ and $X_{agent}$ using encoders $E^*_m$ and $E^*_a$. The trajectory query $Q_\tau$ aggregates multi-source scene context via cross-attention $\Psi(Q, KV)$: $O_m = \Psi(Q_\tau, T^*_m)$, $O_b = \Psi(Q_\tau, T_b)$, $O_a = \Psi(Q_\tau, T^*_a)$, $T_{a \cap m} = \Psi(T^*_a, T^*_m)$, $O_{a \cap m} = \Psi(Q_\tau, T_{a \cap m})$.
  • Trajectory Scoring: These embeddings are aggregated into $E_{fusion}$, and a scalar score is produced via a sigmoid: $s(\hat{\tau}_{t:t+H}) = \sigma(E_{fusion}) \in [0, 1]$.
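A toy sketch of the scoring path, with a dot product standing in for the cross-attention fusion into $E_{fusion}$ and a softmax as one simple way to turn per-candidate sigmoid scores into a preference distribution (both simplifications are our assumptions, not the paper's architecture):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(traj_embedding, scene_context):
    """Toy stand-in for aggregating O_m, O_b, O_a, O_{a∩m} into E_fusion:
    here just a dot product between trajectory and scene features."""
    return sum(q * c for q, c in zip(traj_embedding, scene_context))

def score(traj_embedding, scene_context):
    """s(tau) = sigmoid(E_fusion), a scalar in (0, 1)."""
    return sigmoid(fuse(traj_embedding, scene_context))

def preference_distribution(candidates, scene_context):
    """One simple way to turn per-candidate scores into the discriminator's
    preference distribution over the candidate set: a softmax."""
    z = [math.exp(score(e, scene_context)) for e in candidates]
    total = sum(z)
    return [v / total for v in z]

scene = [0.5, -0.2, 1.0]
cands = [[1.0, 0.0, 2.0], [0.1, 0.1, 0.1], [-1.0, 0.5, -2.0]]
prefs = preference_distribution(cands, scene)
```

The distribution over candidates, rather than a single hard choice, is what lets the discriminator be trained with policy-gradient methods in Section 3.3.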

3.2 Closed-Loop Simulation Environment and Controller

To enable efficient RL training, the BEV-Warp environment is introduced.

  • BEV-Warp: Instead of rendering images, the simulation directly manipulates BEV features over time. Given a reference BEV feature $B^{ref}_t$ and pose $P_t$, the planner selects a trajectory and the vehicle's new pose $P_{t+1}$ is computed. A warp matrix $M_{t+1} = (P_{t+1})^{-1} P^{ref}_{t+1} \in \mathbb{R}^{3 \times 3}$ is derived, and the synthesized feature for the next step is $B_{t+1} = \mathcal{W}(B^{ref}_{t+1}, M_{t+1})$, where $\mathcal{W}(\cdot)$ is bilinear interpolation. This provides high-fidelity, feature-level observations efficiently.
  • Controller: An iLQR-based controller tracks the planned trajectory $\hat{\tau}^*$ by minimizing a quadratic cost over horizon $H$: $u^*_{t:t+H} = \arg\min_u \sum_{k=t}^{t+H} \lVert x_k - \hat{x}^{ref}_k \rVert^2_Q + \lVert u_k \rVert^2_R$, where $\hat{x}^{ref}_k$ is the reference state from $\hat{\tau}^*$.
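Treating poses as SE(2) homogeneous matrices, the warp matrix $M_{t+1} = (P_{t+1})^{-1} P^{ref}_{t+1}$ can be sketched in pure Python (an illustrative reconstruction; the paper's pose convention and $\mathcal{W}$ may differ in detail):

```python
import math

def pose(x, y, yaw):
    """SE(2) pose as a 3x3 homogeneous matrix."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inverse_se2(P):
    """Closed-form inverse: [R t; 0 1]^-1 = [R^T  -R^T t; 0 1]."""
    c, s, x, y = P[0][0], P[1][0], P[0][2], P[1][2]
    return [[c, s, -(c * x + s * y)],
            [-s, c, -(-s * x + c * y)],
            [0.0, 0.0, 1.0]]

def warp_matrix(P_next, P_next_ref):
    """M_{t+1} = (P_{t+1})^{-1} P^{ref}_{t+1}: maps reference-frame BEV
    coordinates into the new ego frame for bilinear resampling."""
    return matmul(inverse_se2(P_next), P_next_ref)

# Sanity check: if the executed pose equals the reference pose, the warp
# is the identity and B_{t+1} is just B^{ref}_{t+1} resampled in place.
P = pose(3.0, 1.0, 0.3)
M = warp_matrix(P, P)
```

Because only a 3x3 matrix and one bilinear resampling pass are needed per step, this is far cheaper than re-rendering sensor images inside the RL loop.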

3.3 Joint Policy Optimization

The generator and discriminator are jointly optimized via a three-stage iterative process to minimize $D_{KL}(\Pi_{\theta,\phi}(\tau \mid o) \,\lVert\, \Pi^*(\tau \mid o))$.

3.3.1 Temporally Consistent Rollout

To maintain behavioral coherence and improve credit assignment, a trajectory reuse mechanism is employed. Once a trajectory $\hat{\tau}^*_t$ is selected, its corresponding control sequence $u^*_{t:t+H}$ is reused over a fixed execution horizon $H_{reuse} < H$, ensuring the vehicle state evolves consistently along the committed trajectory.
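The reuse mechanism can be sketched as a rollout loop that replans only every $H_{reuse}$ steps (`plan_fn` is a hypothetical planner returning a length-$H$ control sequence):

```python
def rollout(env_steps, H_reuse, plan_fn):
    """Replan only every H_reuse steps; between decision points, keep
    executing the committed control sequence u*_{t:t+H}."""
    executed, decision_points = [], []
    committed, offset = None, 0
    for t in range(env_steps):
        if committed is None or offset >= H_reuse:
            committed = plan_fn(t)   # new trajectory sampled: t enters K_i
            offset = 0
            decision_points.append(t)
        executed.append(committed[offset])
        offset += 1
    return executed, decision_points

# Hypothetical planner: a length-12 control sequence (H = 12) tagged with
# the timestep it was planned at.
plan_fn = lambda t: [(t, k) for k in range(12)]
executed, K = rollout(env_steps=20, H_reuse=8, plan_fn=plan_fn)
```

Only the sparse decision points (the set $K_i$) receive policy-gradient updates in TC-GRPO, which is what makes the reward attributable to a specific trajectory choice.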

3.3.2 Discriminator Optimization via RL (TC-GRPO)

The discriminator is optimized using a multi-objective reward and the proposed TC-GRPO.

  • Reward Modeling:
    • Safety-Criticality Reward ($r_{coll}$): Based on Time-to-Collision (TTC). $T_t$ is the earliest moment of intersection between the ego's projected occupancy $B_{ego}(k; t)$ and the environment's ground-truth occupancy $V_{env}(t+k)$. The sequence-level reward is the worst-case temporal margin: $r_{coll} = \min_{1 \leq t \leq L} (T_t - T_{max})$
    • Navigational Efficiency Reward ($r_{eff}$): Based on Ego Progress (EP) $\rho$ at the end of a rollout. It penalizes deviation from a target efficiency interval $[\rho_{low}, \rho_{high}]$: $r_{eff} = \min(\rho - \rho_{low}, 0) + \min(\rho_{high} - \rho, 0) + 1$
  • TC-GRPO Objective: For a group of $G$ rollouts $\{O_i\}_{i=1}^G$ from the same initial state, the standardized advantage for rollout $i$ with reward $r_i$ is $A_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}$. Let $K_i$ be the set of timesteps in rollout $O_i$ where a new trajectory is sampled. The clipped objective at these decision points is $\mathcal{L}_{i, t \in K_i} = \min\left( \rho_{i,t} A_i,\ \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) A_i \right)$, where $\rho_{i,t} = \frac{D_\phi(\hat{\tau}^*_{i,t} \mid o_{i,t})}{D_{\phi_{old}}(\hat{\tau}^*_{i,t} \mid o_{i,t})}$ is the importance sampling ratio. An adaptive entropy regularization term $\beta H_{i,t}$ is added, where $\beta = \exp(\lambda) \cdot \mathbf{1}[\bar{H} < \bar{H}_{target}]$. The final RL objective is $J_{RL}(\phi) = \mathbb{E}\left[ \frac{1}{\sum_{i=1}^G |K_i|} \sum_{i=1}^G \sum_{t=1}^{|K_i|} \left( \mathcal{L}_{i,t} + \beta H_{i,t} \right) \right]$
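A minimal sketch of the group-standardized advantage and the clipped surrogate at a single decision point (illustrative only; the entropy term and the discriminator network itself are omitted):

```python
import math

def group_advantages(rewards):
    """Standardized advantage A_i over a group of G rollouts that share
    the same initial state (mean 0, unit variance within the group)."""
    G = len(rewards)
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate at a decision point t in K_i, where
    ratio = D_phi(tau* | o) / D_phi_old(tau* | o)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

rewards = [1.0, -0.5, 0.2, 0.7]    # group of G = 4 rollouts
A = group_advantages(rewards)
# A large policy ratio is clipped to 1 + eps when the advantage is positive.
L_good = clipped_objective(ratio=1.5, advantage=A[0])
```

Standardizing within a group of rollouts from the same state removes the need for a learned value baseline, and the clip keeps the discriminator's preference distribution from drifting too far per update.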

3.3.3 On-policy Generator Optimization (OGO)

OGO converts closed-loop feedback into structured longitudinal signals to shift the generator's distribution.

  • Reward-Guided Longitudinal Optimization: From a raw trajectory $\tau^{raw}_t$, the longitudinal component (acceleration profile) is optimized based on reward signals:
    • Safety-driven Deceleration: If $T_t < \gamma_{safe}$, reduce travel distance over horizon $H$ by a ratio $\rho \in (0,1)$.
    • Efficiency-driven Acceleration: If progress is insufficient and $T_t$ is safe, increase travel distance by a ratio $\rho' > 1$.
  • This yields an optimized trajectory $\tau^{opt}_t$ that preserves the spatial path but improves temporal progression. These segments form an on-policy dataset $D_{opt} = \{\tau^{opt}_t\}$.
  • Distribution Shifting: The generator is fine-tuned using a mean squared error loss on $D_{opt}$: $\mathcal{L}_{op}(\theta) = \mathbb{E}_{\tau^{opt} \sim D_{opt}} \left[ \sum_{k=0}^{H} \lVert \hat{\tau}_{t+k} - \tau^{opt}_{t+k} \rVert^2_2 \right]$
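The reward-guided longitudinal edit can be sketched as arc-length reparameterization: each waypoint moves along the unchanged spatial path to $\rho$ times its original progress (a minimal sketch under stated assumptions; clamping at the path end for $\rho' > 1$ is our simplification, not necessarily the paper's):

```python
import bisect

def arc_lengths(path):
    """Cumulative arc length s_k along a 2-D waypoint path."""
    s = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        s.append(s[-1] + ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5)
    return s

def point_at(path, s, dist):
    """Linearly interpolate the point at arc length `dist` (clamped)."""
    dist = max(0.0, min(dist, s[-1]))
    i = min(max(1, bisect.bisect_left(s, dist)), len(path) - 1)
    seg = s[i] - s[i - 1]
    w = 0.0 if seg == 0.0 else (dist - s[i - 1]) / seg
    (x0, y0), (x1, y1) = path[i - 1], path[i]
    return (x0 + w * (x1 - x0), y0 + w * (y1 - y0))

def rescale_longitudinal(traj, ratio):
    """Keep the spatial path; move waypoint k to `ratio` times its original
    arc-length progress (deceleration for ratio < 1, acceleration for
    ratio > 1, clamped at the path end)."""
    s = arc_lengths(traj)
    return [point_at(traj, s, ratio * d) for d in s]

raw = [(0.0, 0.0), (2.0, 0.0), (4.0, 2.0), (4.0, 5.0)]
slow = rescale_longitudinal(raw, 0.5)   # safety-driven deceleration
fast = rescale_longitudinal(raw, 1.5)   # efficiency-driven acceleration
```

Because only the timing along the path changes, the resulting $\tau^{opt}_t$ stays on the generator's spatial manifold while supplying a corrected longitudinal target for the MSE loss.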

Empirical Validation / Results

4.3 Experimental Results

RAD-2 is evaluated on closed-loop and open-loop benchmarks, demonstrating significant improvements.

Closed-loop performance in the BEV-Warp environment:

Table 1: Closed-loop performance comparison in the BEV-Warp simulation environment. CR, AF-CR, Safety@1, and Safety@2 are measured in the safety-oriented scenario; EP-Mean and EP@1.0 in the efficiency-oriented scenario.

| Method | CR ↓ | AF-CR ↓ | Safety@1 ↑ | Safety@2 ↑ | EP-Mean ↑ | EP@1.0 ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| TransFuser [3] | 0.563 | 0.275 | 0.400 | 0.346 | 0.897 | 0.244 |
| VAD [16] | 0.594 | 0.299 | 0.371 | 0.312 | 0.904 | 0.252 |
| GenAD [60] | 0.592 | 0.305 | 0.363 | 0.309 | 0.930 | 0.467 |
| ResAD [63] (Baseline) | 0.533 | 0.264 | 0.418 | 0.281 | 0.970 | 0.516 |
| RAD-2 (Ours) | 0.234 | 0.092 | 0.730 | 0.596 | 0.988 | 0.736 |

Key Result: RAD-2 reduces the collision rate (CR) by 56% (from 0.533 to 0.234) and significantly improves safety margins and navigation efficiency compared to the strong diffusion-based baseline ResAD.

Closed-loop evaluation in the photorealistic 3DGS environment:

Table 2: Evaluation on the photorealistic 3DGS benchmark.

| Method | CR ↓ | AF-CR ↓ | Safety@1 ↑ | Safety@2 ↑ |
| --- | --- | --- | --- | --- |
| ResAD [63] | 0.509 | 0.288 | 0.469 | 0.399 |
| Senna-2 [45] | 0.269 | 0.077 | 0.667 | 0.565 |
| RAD [7] | 0.281 | 0.113 | 0.613 | 0.543 |
| RAD-2 (Ours) | 0.250 | 0.078 | 0.723 | 0.644 |

Key Result: RAD-2 achieves the highest safety scores (Safety@1/2) among compared methods, demonstrating effectiveness in photorealistic simulation.

Open-loop trajectory evaluation:

Table 3: Open-loop trajectory evaluation.

| Method | FDE (m) ↓ | ADE (m) ↓ | CR (%) ↓ | DCR (%) ↓ | SCR (%) ↓ |
| --- | --- | --- | --- | --- | --- |
| ResAD [63] | 0.634 | 0.234 | 0.378 | 0.367 | 0.011 |
| Senna-2 [45] | 0.597 | 0.225 | 0.288 | 0.283 | 0.005 |
| RAD-2 (Ours) | 0.553 | 0.208 | 0.142 | 0.138 | 0.004 |

Key Result: RAD-2 achieves state-of-the-art trajectory accuracy (lowest FDE/ADE) and the lowest collision rates in open-loop prediction, indicating improved trajectory quality.

4.4 - 4.6 Ablation Studies and Analysis

  • Scaling Behavior (Fig. 7): Joint optimization of generator and discriminator achieves superior scaling efficiency and final performance compared to discriminator-only or two-stage training.
  • Ablation on Training Pipeline (Table 4): The full pipeline (IL pre-training + OGO + Discriminator RL) is crucial for optimal balance between safety and efficiency.
  • RL Design Choices: Ablations confirm the importance of:
    • Temporal Consistency: An execution horizon $H_{reuse} = 8$ provides the best balance (Table 5).
    • Reward-Variance Clip Filtering: Improves efficiency and training stability (Table 6, Fig. 8).
    • Discriminator Initialization: Initializing from the pre-trained planning head is better than random (Table 7).
    • TC-GRPO Group Size: A group size of 4 works best (Table 8).
    • Entropy Regularization: Prevents score collapse and improves stability (Table 9, Fig. 9).
  • Scenario Composition: Training on a mixed set of safety and efficiency scenarios yields the most balanced policy (Fig. 10).
  • Inference-time Scaling: Increasing the candidate count $M$ at inference consistently improves navigation efficiency (EP@1.0) without retraining (Table 10).
  • Qualitative Results: RAD-2 demonstrates proactive safety maneuvers (deceleration to avoid collision) and efficient tactical decisions (agile lane-changing) in complex interactions (Fig. 11, Fig. 12).

Theoretical and Practical Implications

Theoretical Implications:

  1. Decoupling for Stable RL: The work provides a principled framework for applying RL to high-dimensional, continuous action spaces by reformulating the planning task into a tractable preference learning problem within a lower-dimensional scoring space.
  2. Temporal Coherence as Prior: TC-GRPO formally introduces temporal consistency as a physical prior to structure the RL search space, effectively denoising advantage signals and mitigating the credit assignment problem inherent in long-horizon, weakly correlated reward-action settings.
  3. Structured Distribution Shifting: OGO demonstrates how closed-loop feedback can be transformed into dimension-specific, structured supervision to safely and gradually shift a generative model's distribution.