RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Summary (Overview)

  • Unified Generator-Discriminator Framework: Proposes a novel architecture that decouples motion planning into a diffusion-based generator for diverse trajectory exploration and an RL-optimized discriminator for reranking based on long-term driving quality, addressing the instability of directly applying sparse RL rewards to high-dimensional trajectory spaces.
  • Novel RL Optimization Techniques: Introduces Temporally Consistent Group Relative Policy Optimization (TC-GRPO) to alleviate credit assignment problems via temporal coherence, and On-policy Generator Optimization (OGO) to progressively shift the generator's distribution toward high-reward trajectory manifolds using structured longitudinal optimization signals.
  • High-Throughput Simulation: Develops BEV-Warp, an efficient, feature-level closed-loop simulation environment that leverages spatial warping of Bird's-Eye View (BEV) features, bypassing costly image rendering and enabling scalable RL training.
  • Significant Performance Gains: Demonstrates a 56% reduction in collision rate compared to strong diffusion-based planners in closed-loop simulation and shows improved perceived safety and smoothness in real-world deployment.
  • Effective Scaling: The framework shows superior scaling efficiency with joint optimization of the generator and discriminator, outperforming discriminator-only or two-stage training strategies.

Introduction and Theoretical Foundation

Achieving robust, safe, and human-like motion planning is a core challenge for high-level autonomous driving. While diffusion-based planners have emerged as a promising approach for modeling multimodal continuous trajectories, they suffer from key limitations when applied to real-world driving:

  1. Stochastic Instabilities: Real-world datasets contain noise and uneven distributions, leading to occasional low-quality or unstable trajectories.
  2. Lack of Negative Feedback: Pure imitation learning (IL) provides no corrective feedback to suppress unrealistic or unsafe behaviors.
  3. Causal Confusion & Open-loop Mismatch: IL learns correlations instead of causal factors and is trained open-loop, mismatching the closed-loop nature of real driving.
  4. RL Optimization Challenges: Directly applying Reinforcement Learning (RL) is difficult due to the mismatch between low-dimensional scalar rewards and high-dimensional, temporally structured trajectory action spaces, leading to unstable optimization and severe credit assignment problems.

To address these issues, RAD-2 proposes a unified generator-discriminator framework. The core idea is to decouple exploration from evaluation. A diffusion generator $G_\theta(\tau \mid o)$ produces a diverse set of candidate trajectories. An RL-trained discriminator $D_\phi(\tau \mid o, C)$ evaluates and reranks these candidates based on their expected long-term outcomes. This avoids directly applying sparse rewards to the full trajectory space.

The two components define a joint policy distribution:

$$\Pi_{\theta,\phi}(\tau \mid o) = \mathbb{E}_{C \sim G_{\theta}(\cdot \mid o)}\left[ D_{\phi}(\tau \mid o, C) \right]$$

which aligns with the probabilistic inference framework for optimal control. This architecture also supports inference-time scaling: increasing the number of candidate samples $M$ allows better trajectory selection without retraining.
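As a toy illustration of the generate-then-rerank policy and its inference-time scaling knob $M$ (a minimal sketch, not the paper's implementation; `generator` and `discriminator` are hypothetical stand-ins for $G_\theta$ and $D_\phi$):

```python
import random

def generator(obs, M):
    """Hypothetical stand-in for G_theta: sample M candidate trajectories
    of 8 waypoints each, "conditioned" on the observation via the seed."""
    random.seed(obs)
    return [[(k * 1.0, random.uniform(-1.0, 1.0)) for k in range(8)]
            for _ in range(M)]

def discriminator(traj):
    """Hypothetical stand-in for D_phi: score in (0, 1], here simply
    preferring trajectories with small mean lateral deviation."""
    lateral = sum(abs(y) for _, y in traj) / len(traj)
    return 1.0 / (1.0 + lateral)

def plan(obs, M=8):
    """Joint policy: generate M candidates, rerank, execute the argmax."""
    candidates = generator(obs, M)
    scores = [discriminator(tau) for tau in candidates]
    best = max(range(M), key=lambda m: scores[m])
    return candidates[best], scores[best]

# Inference-time scaling: with the same seed, M=64 extends the M=8 candidate
# set, so the best reranked score can only stay equal or improve.
_, s8 = plan(obs=42, M=8)
_, s64 = plan(obs=42, M=64)
```

Because evaluation sits in the discriminator rather than the generator, a larger candidate set improves selection quality with no retraining of either component.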

Methodology

3.1 Generator-Discriminator Framework

The framework decomposes planning into two jointly optimized components.

3.1.1 Diffusion-based Generator

The generator models a multimodal distribution over future trajectories conditioned on the observation $o_t$.

  • Scene Encoding: Observations are encoded into BEV features. Static map elements $X_{map}$, dynamic agents $X_{agent}$, and navigation inputs $X_{nav}$ are extracted and encoded into token embeddings: $T_m = E_m(X_{map})$, $T_a = E_a(X_{agent})$, $T_n = E_n(X_{nav})$. These are fused with BEV features $T_b$ via a learnable module $F(\cdot)$ to obtain a unified scene embedding $E_{scene}$.
  • Trajectory Generation: For $M$ modes, an initial noise trajectory $\tau^{(0,m)} \sim \mathcal{N}(0, I)$ is denoised for $K$ steps using a conditional denoising network $G$: $\tau^{(k,m)} = G(\tau^{(k-1,m)}, E_{scene}, k)$, $k = 1, \dots, K$. The final candidate set is $\hat{T} = \{\hat{\tau}^m_{t:t+H}\}_{m=1}^M$, where $\hat{\tau}^m_{t:t+H} = \tau^{(K,m)} \sim G_\theta(\tau \mid o_t)$.
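A minimal sketch of the $K$-step conditional denoising loop, with a toy `denoise_step` standing in for the network $G$ and a scene embedding that doubles as the clean target (the names and the noise schedule are illustrative assumptions, not the paper's network):

```python
import random

def denoise_step(tau, scene_embedding, k, K):
    """Toy stand-in for the conditional denoising network G: pull each
    waypoint a fraction alpha_k toward a scene-conditioned target, with
    corrections growing as k runs from 1 to K."""
    alpha = 1.0 / (K - k + 2)
    return [(1 - alpha) * x + alpha * t for x, t in zip(tau, scene_embedding)]

def generate_candidates(scene_embedding, M=6, K=10, seed=0):
    """Sample M noise trajectories tau^(0,m) ~ N(0, I) and denoise each
    for K steps, yielding the candidate set {tau^(K,m)}."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(M):
        tau = [rng.gauss(0.0, 1.0) for _ in scene_embedding]
        for k in range(1, K + 1):
            tau = denoise_step(tau, scene_embedding, k, K)
        candidates.append(tau)
    return candidates

# Toy 1-D "scene embedding" that doubles as the clean target signal.
target = [0.0, 1.0, 2.0, 3.0]
cands = generate_candidates(target, M=6, K=10)
```

Each candidate starts from an independent noise draw, so the final set stays diverse while concentrating near the scene-conditioned target.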

3.1.2 RL-based Discriminator

The discriminator evaluates candidate trajectories and provides a preference distribution.

  • Trajectory Encoding: Each trajectory point is embedded, and the sequence is processed by a Transformer encoder. A [CLS] token output serves as the trajectory-level query $Q_\tau$.
  • Scene Conditioning & Interaction: The discriminator constructs its own scene representation from $X_{map}$ and $X_{agent}$ using encoders $E^*_m$ and $E^*_a$. The trajectory query $Q_\tau$ aggregates multi-source scene context via cross-attention $\Psi(Q, KV)$: $O_m = \Psi(Q_\tau, T^*_m)$, $O_b = \Psi(Q_\tau, T_b)$, $O_a = \Psi(Q_\tau, T^*_a)$, $T_{a \cap m} = \Psi(T^*_a, T^*_m)$, $O_{a \cap m} = \Psi(Q_\tau, T_{a \cap m})$.
  • Trajectory Scoring: These embeddings are aggregated into $E_{fusion}$, and a scalar score is produced via a sigmoid: $s(\hat{\tau}_{t:t+H}) = \sigma(E_{fusion}) \in [0, 1]$.
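A toy sketch of the scoring path, with a dot product standing in for the cross-attention fusion into $E_{fusion}$ and a softmax as one simple way to turn per-candidate sigmoid scores into a preference distribution (both simplifications are our assumptions, not the paper's architecture):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(traj_embedding, scene_context):
    """Toy stand-in for aggregating O_m, O_b, O_a, O_{a∩m} into E_fusion:
    here just a dot product between trajectory and scene features."""
    return sum(q * c for q, c in zip(traj_embedding, scene_context))

def score(traj_embedding, scene_context):
    """s(tau) = sigmoid(E_fusion), a scalar in (0, 1)."""
    return sigmoid(fuse(traj_embedding, scene_context))

def preference_distribution(candidates, scene_context):
    """One simple way to turn per-candidate scores into the discriminator's
    preference distribution over the candidate set: a softmax."""
    z = [math.exp(score(e, scene_context)) for e in candidates]
    total = sum(z)
    return [v / total for v in z]

scene = [0.5, -0.2, 1.0]
cands = [[1.0, 0.0, 2.0], [0.1, 0.1, 0.1], [-1.0, 0.5, -2.0]]
prefs = preference_distribution(cands, scene)
```

The distribution over candidates, rather than a single hard choice, is what lets the discriminator be trained with policy-gradient methods in Section 3.3.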

3.2 Closed-Loop Simulation Environment and Controller

To enable efficient RL training, the BEV-Warp environment is introduced.

  • BEV-Warp: Instead of rendering images, the simulation directly manipulates BEV features over time. Given a reference BEV feature $B^{ref}_t$ and pose $P_t$, the planner selects a trajectory and the vehicle's new pose $P_{t+1}$ is computed. A warp matrix $M_{t+1} = (P_{t+1})^{-1} P^{ref}_{t+1} \in \mathbb{R}^{3 \times 3}$ is derived, and the synthesized feature for the next step is $B_{t+1} = \mathcal{W}(B^{ref}_{t+1}, M_{t+1})$, where $\mathcal{W}(\cdot)$ is bilinear interpolation. This provides high-fidelity, feature-level observations efficiently.
  • Controller: An iLQR-based controller tracks the planned trajectory $\hat{\tau}^*$ by minimizing a quadratic cost over horizon $H$: $u^*_{t:t+H} = \arg\min_u \sum_{k=t}^{t+H} \lVert x_k - \hat{x}^{ref}_k \rVert^2_Q + \lVert u_k \rVert^2_R$, where $\hat{x}^{ref}_k$ is the reference state from $\hat{\tau}^*$.
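Treating poses as SE(2) homogeneous matrices, the warp matrix $M_{t+1} = (P_{t+1})^{-1} P^{ref}_{t+1}$ can be sketched in pure Python (an illustrative reconstruction; the paper's pose convention and $\mathcal{W}$ may differ in detail):

```python
import math

def pose(x, y, yaw):
    """SE(2) pose as a 3x3 homogeneous matrix."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inverse_se2(P):
    """Closed-form inverse: [R t; 0 1]^-1 = [R^T  -R^T t; 0 1]."""
    c, s, x, y = P[0][0], P[1][0], P[0][2], P[1][2]
    return [[c, s, -(c * x + s * y)],
            [-s, c, -(-s * x + c * y)],
            [0.0, 0.0, 1.0]]

def warp_matrix(P_next, P_next_ref):
    """M_{t+1} = (P_{t+1})^{-1} P^{ref}_{t+1}: maps reference-frame BEV
    coordinates into the new ego frame for bilinear resampling."""
    return matmul(inverse_se2(P_next), P_next_ref)

# Sanity check: if the executed pose equals the reference pose, the warp
# is the identity and B_{t+1} is just B^{ref}_{t+1} resampled in place.
P = pose(3.0, 1.0, 0.3)
M = warp_matrix(P, P)
```

Because only a 3x3 matrix and one bilinear resampling pass are needed per step, this is far cheaper than re-rendering sensor images inside the RL loop.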

3.3 Joint Policy Optimization

The generator and discriminator are jointly optimized via a three-stage iterative process to minimize $D_{KL}(\Pi_{\theta,\phi}(\tau \mid o) \,\lVert\, \Pi^*(\tau \mid o))$.

3.3.1 Temporally Consistent Rollout

To maintain behavioral coherence and improve credit assignment, a trajectory reuse mechanism is employed. Once a trajectory $\hat{\tau}^*_t$ is selected, its corresponding control sequence $u^*_{t:t+H}$ is reused over a fixed execution horizon $H_{reuse} < H$, ensuring the vehicle state evolves consistently along the committed trajectory.
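The reuse mechanism can be sketched as a rollout loop that replans only every $H_{reuse}$ steps (`plan_fn` is a hypothetical planner returning a length-$H$ control sequence):

```python
def rollout(env_steps, H_reuse, plan_fn):
    """Replan only every H_reuse steps; between decision points, keep
    executing the committed control sequence u*_{t:t+H}."""
    executed, decision_points = [], []
    committed, offset = None, 0
    for t in range(env_steps):
        if committed is None or offset >= H_reuse:
            committed = plan_fn(t)   # new trajectory sampled: t enters K_i
            offset = 0
            decision_points.append(t)
        executed.append(committed[offset])
        offset += 1
    return executed, decision_points

# Hypothetical planner: a length-12 control sequence (H = 12) tagged with
# the timestep it was planned at.
plan_fn = lambda t: [(t, k) for k in range(12)]
executed, K = rollout(env_steps=20, H_reuse=8, plan_fn=plan_fn)
```

Only the sparse decision points (the set $K_i$) receive policy-gradient updates in TC-GRPO, which is what makes the reward attributable to a specific trajectory choice.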

3.3.2 Discriminator Optimization via RL (TC-GRPO)

The discriminator is optimized using a multi-objective reward and the proposed TC-GRPO.

  • Reward Modeling:
    • Safety-Criticality Reward ($r_{coll}$): Based on Time-to-Collision (TTC). $T_t$ is the earliest moment of intersection between the ego's projected occupancy $B_{ego}(k; t)$ and the environment's ground-truth occupancy $V_{env}(t+k)$. The sequence-level reward is the worst-case temporal margin: $r_{coll} = \min_{1 \leq t \leq L} (T_t - T_{max})$
    • Navigational Efficiency Reward ($r_{eff}$): Based on Ego Progress (EP) $\rho$ at the end of a rollout. It penalizes deviation from a target efficiency interval $[\rho_{low}, \rho_{high}]$: $r_{eff} = \min(\rho - \rho_{low}, 0) + \min(\rho_{high} - \rho, 0) + 1$
  • TC-GRPO Objective: For a group of $G$ rollouts $\{O_i\}_{i=1}^G$ from the same initial state, the standardized advantage for rollout $i$ with reward $r_i$ is $A_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}$. Let $K_i$ be the set of timesteps in rollout $O_i$ where a new trajectory is sampled. The clipped objective at these decision points is $\mathcal{L}_{i, t \in K_i} = \min\left( \rho_{i,t} A_i,\ \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) A_i \right)$, where $\rho_{i,t} = \frac{D_\phi(\hat{\tau}^*_{i,t} \mid o_{i,t})}{D_{\phi_{old}}(\hat{\tau}^*_{i,t} \mid o_{i,t})}$ is the importance sampling ratio. An adaptive entropy regularization term $\beta H_{i,t}$ is added, where $\beta = \exp(\lambda) \cdot \mathbf{1}[\bar{H} < \bar{H}_{target}]$. The final RL objective is $J_{RL}(\phi) = \mathbb{E}\left[ \frac{1}{\sum_{i=1}^G |K_i|} \sum_{i=1}^G \sum_{t=1}^{|K_i|} \left( \mathcal{L}_{i,t} + \beta H_{i,t} \right) \right]$
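A minimal sketch of the group-standardized advantage and the clipped surrogate at a single decision point (illustrative only; the entropy term and the discriminator network itself are omitted):

```python
import math

def group_advantages(rewards):
    """Standardized advantage A_i over a group of G rollouts that share
    the same initial state (mean 0, unit variance within the group)."""
    G = len(rewards)
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate at a decision point t in K_i, where
    ratio = D_phi(tau* | o) / D_phi_old(tau* | o)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

rewards = [1.0, -0.5, 0.2, 0.7]    # group of G = 4 rollouts
A = group_advantages(rewards)
# A large policy ratio is clipped to 1 + eps when the advantage is positive.
L_good = clipped_objective(ratio=1.5, advantage=A[0])
```

Standardizing within a group of rollouts from the same state removes the need for a learned value baseline, and the clip keeps the discriminator's preference distribution from drifting too far per update.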

3.3.3 On-policy Generator Optimization (OGO)

OGO converts closed-loop feedback into structured longitudinal signals to shift the generator's distribution.

  • Reward-Guided Longitudinal Optimization: From a raw trajectory $\tau^{raw}_t$, the longitudinal component (acceleration profile) is optimized based on reward signals:
    • Safety-driven Deceleration: If $T_t < \gamma_{safe}$, reduce travel distance over horizon $H$ by a ratio $\rho \in (0,1)$.
    • Efficiency-driven Acceleration: If progress is insufficient and $T_t$ is safe, increase travel distance by a ratio $\rho' > 1$.
  • This yields an optimized trajectory $\tau^{opt}_t$ that preserves the spatial path but improves temporal progression. These segments form an on-policy dataset $D_{opt} = \{\tau^{opt}_t\}$.
  • Distribution Shifting: The generator is fine-tuned using a mean squared error loss on $D_{opt}$: $\mathcal{L}_{op}(\theta) = \mathbb{E}_{\tau^{opt} \sim D_{opt}} \left[ \sum_{k=0}^{H} \lVert \hat{\tau}_{t+k} - \tau^{opt}_{t+k} \rVert^2_2 \right]$
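The reward-guided longitudinal edit can be sketched as arc-length reparameterization: each waypoint moves along the unchanged spatial path to $\rho$ times its original progress (a minimal sketch under stated assumptions; clamping at the path end for $\rho' > 1$ is our simplification, not necessarily the paper's):

```python
import bisect

def arc_lengths(path):
    """Cumulative arc length s_k along a 2-D waypoint path."""
    s = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        s.append(s[-1] + ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5)
    return s

def point_at(path, s, dist):
    """Linearly interpolate the point at arc length `dist` (clamped)."""
    dist = max(0.0, min(dist, s[-1]))
    i = min(max(1, bisect.bisect_left(s, dist)), len(path) - 1)
    seg = s[i] - s[i - 1]
    w = 0.0 if seg == 0.0 else (dist - s[i - 1]) / seg
    (x0, y0), (x1, y1) = path[i - 1], path[i]
    return (x0 + w * (x1 - x0), y0 + w * (y1 - y0))

def rescale_longitudinal(traj, ratio):
    """Keep the spatial path; move waypoint k to `ratio` times its original
    arc-length progress (deceleration for ratio < 1, acceleration for
    ratio > 1, clamped at the path end)."""
    s = arc_lengths(traj)
    return [point_at(traj, s, ratio * d) for d in s]

raw = [(0.0, 0.0), (2.0, 0.0), (4.0, 2.0), (4.0, 5.0)]
slow = rescale_longitudinal(raw, 0.5)   # safety-driven deceleration
fast = rescale_longitudinal(raw, 1.5)   # efficiency-driven acceleration
```

Because only the timing along the path changes, the resulting $\tau^{opt}_t$ stays on the generator's spatial manifold while supplying a corrected longitudinal target for the MSE loss.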

Empirical Validation / Results

4.3 Experimental Results

RAD-2 is evaluated on closed-loop and open-loop benchmarks, demonstrating significant improvements.

Closed-loop performance in the BEV-Warp environment:

Table 1: Closed-loop performance comparison in the BEV-Warp simulation environment. CR, AF-CR, Safety@1, and Safety@2 are measured in the safety-oriented scenario; EP-Mean and EP@1.0 in the efficiency-oriented scenario.

| Method | CR ↓ | AF-CR ↓ | Safety@1 ↑ | Safety@2 ↑ | EP-Mean ↑ | EP@1.0 ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| TransFuser [3] | 0.563 | 0.275 | 0.400 | 0.346 | 0.897 | 0.244 |
| VAD [16] | 0.594 | 0.299 | 0.371 | 0.312 | 0.904 | 0.252 |
| GenAD [60] | 0.592 | 0.305 | 0.363 | 0.309 | 0.930 | 0.467 |
| ResAD [63] (Baseline) | 0.533 | 0.264 | 0.418 | 0.281 | 0.970 | 0.516 |
| RAD-2 (Ours) | 0.234 | 0.092 | 0.730 | 0.596 | 0.988 | 0.736 |

Key Result: RAD-2 reduces the collision rate (CR) by 56% (from 0.533 to 0.234) and significantly improves safety margins and navigation efficiency compared to the strong diffusion-based baseline ResAD.

Closed-loop evaluation in the photorealistic 3DGS environment:

Table 2: Evaluation on the photorealistic 3DGS benchmark.

| Method | CR ↓ | AF-CR ↓ | Safety@1 ↑ | Safety@2 ↑ |
| --- | --- | --- | --- | --- |
| ResAD [63] | 0.509 | 0.288 | 0.469 | 0.399 |
| Senna-2 [45] | 0.269 | 0.077 | 0.667 | 0.565 |
| RAD [7] | 0.281 | 0.113 | 0.613 | 0.543 |
| RAD-2 (Ours) | 0.250 | 0.078 | 0.723 | 0.644 |

Key Result: RAD-2 achieves the highest safety scores (Safety@1/2) among compared methods, demonstrating effectiveness in photorealistic simulation.

Open-loop trajectory evaluation:

Table 3: Open-loop trajectory evaluation.

| Method | FDE (m) ↓ | ADE (m) ↓ | CR (%) ↓ | DCR (%) ↓ | SCR (%) ↓ |
| --- | --- | --- | --- | --- | --- |
| ResAD [63] | 0.634 | 0.234 | 0.378 | 0.367 | 0.011 |
| Senna-2 [45] | 0.597 | 0.225 | 0.288 | 0.283 | 0.005 |
| RAD-2 (Ours) | 0.553 | 0.208 | 0.142 | 0.138 | 0.004 |

Key Result: RAD-2 achieves state-of-the-art trajectory accuracy (lowest FDE/ADE) and the lowest collision rates in open-loop prediction, indicating improved trajectory quality.

4.4 - 4.6 Ablation Studies and Analysis

  • Scaling Behavior (Fig. 7): Joint optimization of generator and discriminator achieves superior scaling efficiency and final performance compared to discriminator-only or two-stage training.
  • Ablation on Training Pipeline (Table 4): The full pipeline (IL pre-training + OGO + Discriminator RL) is crucial for optimal balance between safety and efficiency.
  • RL Design Choices: Ablations confirm the importance of:
    • Temporal Consistency: An execution horizon $H_{reuse} = 8$ provides the best balance (Table 5).
    • Reward-Variance Clip Filtering: Improves efficiency and training stability (Table 6, Fig. 8).
    • Discriminator Initialization: Initializing from the pre-trained planning head is better than random (Table 7).
    • TC-GRPO Group Size: A group size of 4 works best (Table 8).
    • Entropy Regularization: Prevents score collapse and improves stability (Table 9, Fig. 9).
  • Scenario Composition: Training on a mixed set of safety and efficiency scenarios yields the most balanced policy (Fig. 10).
  • Inference-time Scaling: Increasing the candidate count $M$ at inference consistently improves navigation efficiency (EP@1.0) without retraining (Table 10).
  • Qualitative Results: RAD-2 demonstrates proactive safety maneuvers (deceleration to avoid collision) and efficient tactical decisions (agile lane-changing) in complex interactions (Fig. 11, Fig. 12).

Theoretical and Practical Implications

Theoretical Implications:

  1. Decoupling for Stable RL: The work provides a principled framework for applying RL to high-dimensional, continuous action spaces by reformulating the planning task into a tractable preference learning problem within a lower-dimensional scoring space.
  2. Temporal Coherence as Prior: TC-GRPO formally introduces temporal consistency as a physical prior to structure the RL search space, effectively denoising advantage signals and mitigating the credit assignment problem inherent in long-horizon, weakly correlated reward-action settings.
  3. Structured Distribution Shifting: OGO demonstrates how closed-loop feedback can be transformed into dimension-specific, structured supervision to safely and gradually shift a generative model's distribution.