RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
Summary (Overview)
- Unified Generator-Discriminator Framework: Proposes a novel architecture that decouples motion planning into a diffusion-based generator for diverse trajectory exploration and an RL-optimized discriminator for reranking based on long-term driving quality, addressing the instability of directly applying sparse RL rewards to high-dimensional trajectory spaces.
- Novel RL Optimization Techniques: Introduces Temporally Consistent Group Relative Policy Optimization (TC-GRPO) to alleviate credit assignment problems via temporal coherence, and On-policy Generator Optimization (OGO) to progressively shift the generator's distribution toward high-reward trajectory manifolds using structured longitudinal optimization signals.
- High-Throughput Simulation: Develops BEV-Warp, an efficient, feature-level closed-loop simulation environment that leverages spatial warping of Bird's-Eye View (BEV) features, bypassing costly image rendering and enabling scalable RL training.
- Significant Performance Gains: Demonstrates a 56% reduction in collision rate compared to strong diffusion-based planners in closed-loop simulation and shows improved perceived safety and smoothness in real-world deployment.
- Effective Scaling: The framework shows superior scaling efficiency with joint optimization of the generator and discriminator, outperforming discriminator-only or two-stage training strategies.
Introduction and Theoretical Foundation
Achieving robust, safe, and human-like motion planning is a core challenge for high-level autonomous driving. While diffusion-based planners have emerged as a promising approach for modeling multimodal continuous trajectories, they suffer from key limitations when applied to real-world driving:
- Stochastic Instabilities: Real-world datasets contain noise and uneven distributions, leading to occasional low-quality or unstable trajectories.
- Lack of Negative Feedback: Pure imitation learning (IL) provides no corrective feedback to suppress unrealistic or unsafe behaviors.
- Causal Confusion & Open-loop Mismatch: IL learns correlations instead of causal factors and is trained open-loop, mismatching the closed-loop nature of real driving.
- RL Optimization Challenges: Directly applying Reinforcement Learning (RL) is difficult due to the mismatch between low-dimensional scalar rewards and high-dimensional, temporally structured trajectory action spaces, leading to unstable optimization and severe credit assignment problems.
To address these issues, RAD-2 proposes a unified generator-discriminator framework. The core idea is to decouple exploration from evaluation. A diffusion generator produces a diverse set of candidate trajectories. An RL-trained discriminator evaluates and reranks these candidates based on their expected long-term outcomes. This avoids directly applying sparse rewards to the full trajectory space.
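The decoupled sample-then-rerank loop can be sketched as follows. This is a minimal illustration only; `generate`, `score`, and `n_candidates` are placeholder names, not RAD-2's actual API:

```python
import numpy as np

def plan(generate, score, obs, n_candidates=16):
    """Toy decoupled planner: the generator proposes diverse candidate
    trajectories and the discriminator reranks them; the top-scoring
    candidate is selected for execution."""
    candidates = [generate(obs) for _ in range(n_candidates)]
    scores = np.array([score(obs, tau) for tau in candidates])
    return candidates[int(np.argmax(scores))]
```

Because selection happens in the discriminator's scalar scoring space, sparse rewards never have to be propagated directly through the high-dimensional trajectory space.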
The two components define a joint policy distribution in which the discriminator's scores reweight the generator's samples; this aligns with the probabilistic inference framework for optimal control. This architecture also supports inference-time scaling, where increasing the number of candidate samples allows for better trajectory selection without retraining.
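In a standard control-as-inference form (illustrative notation, not necessarily the paper's exact formulation), such a joint policy can be written as:

$$\pi(\tau \mid o) \;\propto\; p_G(\tau \mid o)\,\exp\!\big(\beta\, s_D(\tau, o)\big)$$

where $p_G$ is the generator's trajectory distribution, $s_D$ is the discriminator's score, and $\beta$ is an assumed temperature controlling how sharply high-scoring trajectories are preferred.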
Methodology
3.1 Generator-Discriminator Framework
The framework decomposes planning into two jointly optimized components.
3.1.1 Diffusion-based Generator
The generator models a multimodal distribution over future trajectories conditioned on the current observation.
- Scene Encoding: Observations are encoded into BEV features. Static map elements, dynamic agents, and navigation inputs are extracted and encoded into token embeddings, which are fused with the BEV features via a learnable module to obtain a unified scene embedding.
- Trajectory Generation: For each mode, an initial noise trajectory is denoised over a fixed number of steps by a conditional denoising network; the final candidate set consists of the denoised trajectories from all modes.
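The denoising procedure can be sketched as below. This is a toy DDIM-style loop with a made-up noise schedule and a placeholder denoising network `eps_net`; it illustrates the structure, not RAD-2's actual architecture or schedule:

```python
import numpy as np

def denoise_trajectories(eps_net, cond, n_modes=8, horizon=6, n_steps=10, seed=0):
    """Toy denoising loop: each mode starts from Gaussian noise and is
    iteratively refined by a conditional denoising network."""
    rng = np.random.default_rng(seed)
    # (n_modes, horizon, 2): x/y waypoints per candidate trajectory
    tau = rng.standard_normal((n_modes, horizon, 2))
    alphas = np.linspace(1.0, 0.0, n_steps)        # toy noise schedule
    for k in range(n_steps):
        eps_hat = eps_net(tau, cond, k)            # predicted noise
        tau = tau - alphas[k] / n_steps * eps_hat  # toy update rule
    return tau
```

Running the loop once per mode in parallel yields the candidate set that the discriminator later reranks.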
3.1.2 RL-based Discriminator
The discriminator evaluates candidate trajectories and provides a preference distribution.
- Trajectory Encoding: Each trajectory point is embedded, and the sequence is processed by a Transformer encoder. The [CLS] token output serves as the trajectory-level query.
- Scene Conditioning & Interaction: The discriminator constructs its own scene representation from the map and agent inputs using dedicated encoders. The trajectory query then aggregates multi-source scene context via cross-attention.
- Trajectory Scoring: The attended embeddings are aggregated, and a scalar score in (0, 1) is produced via a sigmoid output head.
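The query-attends-to-scene-then-score step can be sketched as a single-head cross-attention in numpy. All shapes and the output head `w_out` are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def score_trajectory(traj_query, scene_tokens, w_out):
    """Toy single-head cross-attention: the trajectory-level query attends
    over scene tokens, and the attended context is mapped to a (0,1) score."""
    d = traj_query.shape[-1]
    attn = softmax(traj_query @ scene_tokens.T / np.sqrt(d))  # (1, n_tokens)
    context = attn @ scene_tokens                              # (1, d)
    logit = (context @ w_out).item()                           # scalar logit
    return 1.0 / (1.0 + np.exp(-logit))                        # sigmoid score
```

Scoring each candidate independently keeps the discriminator's output space one-dimensional, which is what makes RL on it tractable.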
3.2 Closed-Loop Simulation Environment and Controller
To enable efficient RL training, the BEV-Warp environment is introduced.
- BEV-Warp: Instead of rendering images, the simulation directly manipulates BEV features over time. Given a reference BEV feature and ego pose, the planner selects a trajectory, the vehicle's new pose is computed, and a warp matrix between the two poses is derived. The BEV feature for the next step is then synthesized by applying this warp with bilinear interpolation. This provides high-fidelity, feature-level observations efficiently.
- Controller: An iLQR-based controller tracks the planned trajectory by minimizing a quadratic cost over a finite horizon, penalizing deviation of the vehicle state from the reference states along the planned trajectory.
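The core of BEV-Warp, resampling a feature map under an SE(2) ego motion instead of re-rendering sensor images, can be sketched as follows. The pixel-space pose parameterization here is a simplification of whatever calibration the real system uses:

```python
import numpy as np

def warp_bev(feat, dx, dy, yaw):
    """Toy BEV-Warp: resample a (C, H, W) BEV feature map under an SE(2)
    ego motion (dx, dy, yaw in pixel units/radians) with bilinear
    interpolation; out-of-bounds samples are zero-filled."""
    C, H, W = feat.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    cx, cy = (W - 1) / 2.0, (H - 1) / 2.0
    # inverse transform: where each new-frame cell samples in the old frame
    c, s = np.cos(yaw), np.sin(yaw)
    xo = c * (xs - cx) - s * (ys - cy) + cx + dx
    yo = s * (xs - cx) + c * (ys - cy) + cy + dy
    x0, y0 = np.floor(xo).astype(int), np.floor(yo).astype(int)
    wx, wy = xo - x0, yo - y0
    out = np.zeros_like(feat)
    # accumulate the four bilinear corners, masking invalid samples
    for (ix, iy, w) in [(x0, y0, (1 - wx) * (1 - wy)), (x0 + 1, y0, wx * (1 - wy)),
                        (x0, y0 + 1, (1 - wx) * wy), (x0 + 1, y0 + 1, wx * wy)]:
        valid = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)
        ixc, iyc = np.clip(ix, 0, W - 1), np.clip(iy, 0, H - 1)
        out += feat[:, iyc, ixc] * (w * valid)
    return out
```

Because the warp is a cheap gather over an existing feature map, rollouts avoid the rendering cost that normally dominates closed-loop simulation.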
3.3 Joint Policy Optimization
The generator and discriminator are jointly optimized via a three-stage iterative process: temporally consistent rollout, discriminator optimization with TC-GRPO, and on-policy generator optimization (OGO).
3.3.1 Temporally Consistent Rollout
To maintain behavioral coherence and improve credit assignment, a trajectory reuse mechanism is employed. Once a trajectory is selected, its corresponding control sequence is reused over a fixed execution horizon, ensuring the vehicle state evolves consistently along the committed trajectory.
3.3.2 Discriminator Optimization via RL (TC-GRPO)
The discriminator is optimized using a multi-objective reward and the proposed TC-GRPO.
- Reward Modeling:
- Safety-Criticality Reward: Based on Time-to-Collision (TTC), computed from the earliest moment of intersection between the ego's projected occupancy and the environment's ground-truth occupancy. The sequence-level reward is the worst-case (minimum) temporal margin over the rollout.
- Navigational Efficiency Reward: Based on Ego Progress (EP) at the end of a rollout; it penalizes deviation from a target efficiency interval.
- TC-GRPO Objective: For a group of rollouts from the same initial state, each rollout's advantage is its reward standardized by the group's mean and standard deviation. A clipped PPO-style surrogate objective is applied only at the decision timesteps where a new trajectory is sampled, using the importance sampling ratio between the current and behavior policies. An adaptive entropy regularization term is added to prevent score collapse, and the final RL objective combines the clipped surrogate with this entropy term.
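The group-standardized advantage and decision-point clipping can be sketched as below. This is a simplified surrogate (no entropy term, toy shapes) intended only to show the structure of the objective:

```python
import numpy as np

def tc_grpo_loss(rewards, ratios, decision_mask, eps=0.2):
    """Toy TC-GRPO-style surrogate: group-standardized advantages with a
    clipped PPO objective applied only at decision timesteps.
    Shapes: rewards (G,), ratios (G, T), decision_mask (G, T) boolean."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv[:, None]                                          # broadcast over time
    unclipped = ratios * adv
    clipped = np.clip(ratios, 1 - eps, 1 + eps) * adv
    surrogate = np.minimum(unclipped, clipped)
    # average only over decision points, where a new trajectory was sampled
    return -(surrogate * decision_mask).sum() / max(decision_mask.sum(), 1)
```

Restricting the surrogate to decision timesteps is what ties the objective to the trajectory-reuse rollout: intermediate control steps inherit credit from the decision that committed them.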
3.3.3 On-policy Generator Optimization (OGO)
OGO converts closed-loop feedback into structured longitudinal signals to shift the generator's distribution.
- Reward-Guided Longitudinal Optimization: From a raw trajectory, the longitudinal component (its acceleration profile) is optimized based on reward signals:
- Safety-driven Deceleration: If the safety reward indicates collision risk, the travel distance over the execution horizon is reduced by a deceleration ratio.
- Efficiency-driven Acceleration: If progress is insufficient and the maneuver remains safe, the travel distance is increased by an acceleration ratio.
- This yields an optimized trajectory that preserves the spatial path while improving temporal progression. These segments form an on-policy dataset.
- Distribution Shifting: The generator is fine-tuned with a mean squared error loss on this on-policy dataset.
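The path-preserving longitudinal edit can be sketched as arc-length resampling. The uniform progress targets and single `ratio` parameter are simplifying assumptions; the paper's edit operates on the acceleration profile directly:

```python
import numpy as np

def adjust_longitudinal(traj, ratio):
    """Toy OGO-style longitudinal edit: rescale progress along the
    trajectory's own spatial path by `ratio` (<1 decelerates, >1
    accelerates), resampling waypoints so the path shape is preserved."""
    deltas = np.diff(traj, axis=0)
    seg = np.linalg.norm(deltas, axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])      # cumulative arc length
    total = arc[-1]
    # new per-step progress targets along the same path
    targets = np.clip(np.linspace(0.0, total, len(traj)) * ratio, 0.0, total)
    out = np.empty_like(traj)
    for i, s in enumerate(targets):
        out[i] = [np.interp(s, arc, traj[:, d]) for d in range(traj.shape[1])]
    return out
```

Because only the timing along the path changes, the supervision stays on the generator's own manifold, which is what makes the distribution shift gradual and safe.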
Empirical Validation / Results
4.3 Experimental Results
RAD-2 is evaluated on closed-loop and open-loop benchmarks, demonstrating significant improvements.
Closed-loop performance in the BEV-Warp environment: Table 1: Closed-loop performance comparison in the BEV-Warp simulation environment. CR, AF-CR, and Safety@1/2 are safety-oriented metrics; EP-Mean and EP@1.0 are efficiency-oriented.
| Method | CR ↓ | AF-CR ↓ | Safety@1 ↑ | Safety@2 ↑ | EP-Mean ↑ | EP@1.0 ↑ |
|---|---|---|---|---|---|---|
| TransFuser [3] | 0.563 | 0.275 | 0.400 | 0.346 | 0.897 | 0.244 |
| VAD [16] | 0.594 | 0.299 | 0.371 | 0.312 | 0.904 | 0.252 |
| GenAD [60] | 0.592 | 0.305 | 0.363 | 0.309 | 0.930 | 0.467 |
| ResAD [63] (Baseline) | 0.533 | 0.264 | 0.418 | 0.281 | 0.970 | 0.516 |
| RAD-2 (Ours) | 0.234 | 0.092 | 0.730 | 0.596 | 0.988 | 0.736 |
Key Result: RAD-2 reduces the collision rate (CR) by 56% (from 0.533 to 0.234) and significantly improves safety margins and navigation efficiency compared to the strong diffusion-based baseline ResAD.
Closed-loop evaluation in photorealistic 3DGS environment: Table 2: Evaluation on photorealistic 3DGS benchmark.
| Method | CR ↓ | AF-CR ↓ | Safety@1 ↑ | Safety@2 ↑ |
|---|---|---|---|---|
| ResAD [63] | 0.509 | 0.288 | 0.469 | 0.399 |
| Senna-2 [45] | 0.269 | 0.077 | 0.667 | 0.565 |
| RAD [7] | 0.281 | 0.113 | 0.613 | 0.543 |
| RAD-2 (Ours) | 0.250 | 0.078 | 0.723 | 0.644 |
Key Result: RAD-2 achieves the highest safety scores (Safety@1/2) among compared methods, demonstrating effectiveness in photorealistic simulation.
Open-loop trajectory evaluation: Table 3: Open-loop trajectory evaluation.
| Method | FDE (m) ↓ | ADE (m) ↓ | CR (%) ↓ | DCR (%) ↓ | SCR (%) ↓ |
|---|---|---|---|---|---|
| ResAD [63] | 0.634 | 0.234 | 0.378 | 0.367 | 0.011 |
| Senna-2 [45] | 0.597 | 0.225 | 0.288 | 0.283 | 0.005 |
| RAD-2 (Ours) | 0.553 | 0.208 | 0.142 | 0.138 | 0.004 |
Key Result: RAD-2 achieves state-of-the-art trajectory accuracy (lowest FDE/ADE) and the lowest collision rates in open-loop prediction, indicating improved trajectory quality.
4.4 - 4.6 Ablation Studies and Analysis
- Scaling Behavior (Fig. 7): Joint optimization of generator and discriminator achieves superior scaling efficiency and final performance compared to discriminator-only or two-stage training.
- Ablation on Training Pipeline (Table 4): The full pipeline (IL pre-training + OGO + Discriminator RL) is crucial for optimal balance between safety and efficiency.
- RL Design Choices: Ablations confirm the importance of:
- Temporal Consistency: A fixed execution horizon for trajectory reuse provides the best balance (Table 5).
- Reward-Variance Clip Filtering: Improves efficiency and training stability (Table 6, Fig. 8).
- Discriminator Initialization: Initializing from the pre-trained planning head is better than random (Table 7).
- TC-GRPO Group Size: A group size of 4 works best (Table 8).
- Entropy Regularization: Prevents score collapse and improves stability (Table 9, Fig. 9).
- Scenario Composition: Training on a mixed set of safety and efficiency scenarios yields the most balanced policy (Fig. 10).
- Inference-time Scaling: Increasing the candidate count at inference consistently improves navigation efficiency (EP@1.0) without retraining (Table 10).
- Qualitative Results: RAD-2 demonstrates proactive safety maneuvers (deceleration to avoid collision) and efficient tactical decisions (agile lane-changing) in complex interactions (Fig. 11, Fig. 12).
Theoretical and Practical Implications
Theoretical Implications:
- Decoupling for Stable RL: The work provides a principled framework for applying RL to high-dimensional, continuous action spaces by reformulating the planning task into a tractable preference learning problem within a lower-dimensional scoring space.
- Temporal Coherence as Prior: TC-GRPO formally introduces temporal consistency as a physical prior to structure the RL search space, effectively denoising advantage signals and mitigating the credit assignment problem inherent in long-horizon, weakly correlated reward-action settings.
- Structured Distribution Shifting: OGO demonstrates how closed-loop feedback can be transformed into dimension-specific, structured supervision to safely and gradually shift a generative model's distribution.