CARLA-Air: Fly Drones Inside a CARLA World - A Unified Infrastructure for Air-Ground Embodied Intelligence

Summary (Overview)

  • Unified Single-Process Architecture: Integrates CARLA (ground) and AirSim (aerial) simulation backends within a single Unreal Engine process, providing a shared physics tick, shared rendering pipeline, and strict spatial-temporal consistency across all sensors and agents.
  • Full API and Code Compatibility: Preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification reuse of existing research codebases for air-ground tasks.
  • Comprehensive Multi-Modal Sensing: Synchronously captures up to 18 sensor modalities (RGB, depth, semantic segmentation, LiDAR, radar, IMU, GNSS, barometry, etc.) across all aerial and ground platforms at each simulation tick within photorealistic urban environments.
  • Support for Air-Ground Embodied Intelligence: Provides out-of-the-box support for key research directions: air-ground cooperation, embodied navigation/vision-language action, multi-modal perception/dataset construction, and reinforcement learning policy training.
  • Sustainable Evolution Path: Inherits and extends the aerial capabilities of AirSim (whose upstream development is archived) within a modern, actively maintained infrastructure.

Introduction and Theoretical Foundation

The convergence of the low-altitude economy, embodied intelligence, and air-ground cooperative systems creates a pressing need for simulation infrastructure capable of jointly modeling aerial and ground agents within a single, physically coherent environment. Existing open-source platforms are domain-segregated: urban driving simulators (e.g., CARLA) provide rich traffic but no aerial dynamics, while multirotor simulators (e.g., AirSim) offer physics-accurate flight but lack realistic ground scenes and traffic.

Bridge-based co-simulation (connecting heterogeneous backends via ROS or middleware) is a common workaround but introduces synchronization overhead, communication latency, duplicated rendering, and cannot guarantee the strict spatial-temporal consistency required by modern perception and learning pipelines. CARLA-Air is designed to fill this gap by providing a unified, single-process simulation foundation for air-ground embodied intelligence research, combining the strengths of both CARLA and AirSim.

Methodology

The core technical challenge is resolving a fundamental Unreal Engine constraint: only one active GameMode per world. CARLA and AirSim each provide independent, incompatible GameMode classes.

  • Architectural Solution: The system introduces CARLAAirGameMode, which inherits from CARLA's game mode base (acquiring all ground simulation subsystems) and composes AirSim's aerial flight actor as a regular world entity spawned after initialization. This resolves the conflict with minimal upstream modifications (only 2 files in CARLA).
  • System Architecture: Two plugin modules load sequentially within one UE4 process. Two independent RPC servers run concurrently, allowing native Python clients for both simulators to connect without modification. All world actors share a single rendering pipeline.
  • Coordinate System Mapping: CARLA uses a left-handed UE4 frame (X forward, Y right, Z up, centimeters). AirSim uses a right-handed North-East-Down (NED) frame (X north, Y east, Z down, meters). A fixed transform reconciles them for co-registration. Let p ∈ ℝ³ denote a point in the UE4 world frame and o the shared world origin. The equivalent NED position is

        p_NED = (1/100) · ( p_x − o_x,  p_y − o_y,  −(p_z − o_z) )ᵀ

    For orientation, let q = (w, q_x, q_y, q_z) be a unit quaternion in the UE4 frame. The equivalent NED quaternion is q_NED = (w, q_x, q_y, −q_z).
  • Extensible Asset Pipeline: Researchers can import custom robot platforms, UAV models, and environment assets into the shared world, where they participate in the same physics and rendering as built-in actors.
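The coordinate mapping above can be expressed in a few lines of Python. This is a minimal sketch of the stated equations, not CARLA-Air's internal implementation, and the function names are ours:

```python
def ue4_to_ned_position(p, o):
    """Map a UE4 world-frame point (left-handed, Z up, centimeters)
    to the NED frame (right-handed, Z down, meters).

    p, o: (x, y, z) tuples; o is the shared world origin in UE4 coordinates.
    """
    return ((p[0] - o[0]) / 100.0,
            (p[1] - o[1]) / 100.0,
            -(p[2] - o[2]) / 100.0)


def ue4_to_ned_quaternion(q):
    """Map a UE4-frame unit quaternion (w, qx, qy, qz) to the NED frame."""
    w, qx, qy, qz = q
    return (w, qx, qy, -qz)
```

For instance, a drone hovering 3 m above the shared origin, (0, 0, 300) in UE4 centimeters, maps to (0, 0, −3) m in NED, as expected for a Z-down frame.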

Empirical Validation / Results

Performance Evaluation

Benchmarks were conducted on a workstation (NVIDIA RTX A4000, AMD Ryzen 7 5800X, 32 GB RAM).

1. Frame Rate and Resource Scaling: The "moderate joint" workload (3 vehicles, 2 pedestrians, 1 drone, 8 sensors) sustains 19.8 ± 1.1 FPS. Integration overhead is 8.6 FPS (30.3%) relative to the standalone ground-only baseline of 28.4 FPS, and stems primarily from the aerial physics engine (CPU utilization 54% vs. 38%). Sensor rendering, not actor population, is the dominant cost driver.

Table 3: Frame rate and resource consumption across representative joint workloads

| Profile | Configuration | FPS | VRAM (MiB) | CPU (%) |
|---|---|---|---|---|
| Standalone baselines | | | | |
| Ground sim only | 3 vehicles + 2 pedestrians; 8 sensors @ 1280×720 | 28.4 ± 1.2 | 3,821 ± 10 | 31 ± 3 |
| Aerial sim only | 1 drone; 8 sensors @ 1280×720 | 44.7 ± ?.1 | 2,941 ± 8 | 29 ± 3 |
| Joint workloads | | | | |
| Idle | Town10HD; no actors; no sensors | 60.0 ± 0.4 | 3,702 ± 8 | 12 ± 2 |
| Ground only | 3 vehicles + 2 pedestrians; 8 sensors @ 1280×720 | 26.3 ± 1.4 | 3,831 ± 11 | 38 ± 4 |
| Moderate joint | 3 vehicles + 2 pedestrians + 1 drone; 8 sensors @ 1280×720 | 19.8 ± 1.1 | 3,870 ± 13 | 54 ± 5 |
| Stability endurance | Moderate joint; 357 spawn/destroy cycles; 3 hr continuous | 19.7 ± 1.3 | 3,878 ± 17 | 55 ± 5 |

2. Memory Stability: A 3-hour endurance run with 357 actor spawn/destroy cycles showed no significant memory accumulation (linear regression slope 0.49 MiB/cycle, R² = 0.11). Zero API errors and zero simulation crashes validate robustness for RL training.
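A check of this kind amounts to an ordinary least-squares fit over per-cycle memory samples. The sketch below uses plain Python and synthetic data (the paper's actual measurement harness is not shown; the numbers are invented):

```python
def fit_slope_r2(xs, ys):
    """Least-squares slope and coefficient of determination of ys vs. xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - slope * x - intercept) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot


# Hypothetical resident-memory samples (MiB) after each spawn/destroy cycle:
cycles = list(range(357))
rss = [3870 + (c * 37) % 11 for c in cycles]  # jitter, no systematic growth
slope, r2 = fit_slope_r2(cycles, rss)
# A genuine leak would show as a large positive slope with high R^2.
```

A near-zero slope together with a low R², as reported above, indicates that the apparent trend explains almost none of the variance.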

3. Communication Latency: The single-process architecture eliminates inter-process serialization overhead. Key API call latencies are well below per-tick budgets (e.g., actor transform query: 280 μs; image capture: 3,200 μs).
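The reported latencies are medians with interquartile ranges; given raw round-trip samples, both can be computed with the standard library. The sample values below are invented for illustration:

```python
import statistics


def median_iqr(samples):
    """Median and interquartile range (Q3 - Q1) of a latency sample set."""
    q1, q2, q3 = statistics.quantiles(samples, n=4)
    return q2, q3 - q1


# Hypothetical round-trip times (microseconds) for an actor transform query:
samples = [265, 270, 275, 280, 282, 285, 295, 300, 310]
med, iqr = median_iqr(samples)
```

Median/IQR summaries are preferred over mean/stddev here because RPC timings are typically right-skewed by occasional slow calls.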

Table 5: Round-trip API call latency

| API Call | Median (μs) | IQR (μs) |
|---|---|---|
| Ground sim: Actor transform query | 280 | 35 |
| Aerial sim: Multirotor state query | 410 | 55 |
| Aerial sim: Image capture (1 RGB stream) | 3,200 | 380 |
| Bridge IPC [17]: Cross-process state sync (ref.) | 3,000 | 2,000 |

Representative Applications

Five workflows validated core capabilities:

W1: Air-Ground Cooperative Precision Landing: A drone lands on a moving ground vehicle. Using tick-synchronous control and cross-frame coordination, it achieved a final horizontal error < 0.5 m over a ≈ 20 s descent at 19.3 FPS.
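The control structure of such a workflow — command, tick, observe, repeat — can be illustrated with a toy 2-D pursuit loop in plain Python. This is not the W1 controller: the gain, speeds, and tick rate are invented, and the real task also coordinates descent and cross-frame conversions:

```python
def simulate_pursuit(ticks=400, dt=0.05, gain=1.5, platform_speed=2.0):
    """Toy tick-synchronous pursuit of a platform moving at constant velocity.

    Each tick the drone commands the platform's velocity plus a proportional
    correction on the position error; then both advance by one fixed tick.
    Returns the final horizontal error in meters.
    """
    drone = [0.0, 0.0]
    platform = [10.0, 5.0]
    for _ in range(ticks):  # 400 ticks x 0.05 s = a 20 s episode
        ex = platform[0] - drone[0]
        ey = platform[1] - drone[1]
        drone[0] += (platform_speed + gain * ex) * dt
        drone[1] += gain * ey * dt
        platform[0] += platform_speed * dt
    return ((platform[0] - drone[0]) ** 2
            + (platform[1] - drone[1]) ** 2) ** 0.5
```

With these numbers the error contracts by a factor of (1 − gain·dt) per tick, so the final horizontal offset ends up far below a 0.5 m threshold.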

W2: Embodied Navigation & VLN/VLA Data Generation: The platform enables generation of vision-language navigation/action datasets with paired bird's-eye and street-level visual observations under identical conditions.

W3: Synchronized Multi-Modal Dataset Collection: Collected 1,000 fully synchronized 12-stream records (8 ground + 4 aerial sensors) at ≈ 17 Hz with ≤ 1-tick alignment error, eliminating manual synchronization bottlenecks.

W4: Air-Ground Cross-View Perception: Produced 500 co-registered aerial-depth/ground-segmentation pairs at ≈ 18 Hz with zero tick alignment error (ε_k = t_k^gnd − t_k^air = 0). Verified rendering consistency across all 14 weather presets.
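Pairing quality of this kind can be audited by differencing the tick indices stamped on the two streams. A minimal checker (the field layout is hypothetical) looks like this:

```python
def alignment_errors(ground_ticks, aerial_ticks):
    """Per-record tick alignment error e_k = t_k^gnd - t_k^air."""
    return [g - a for g, a in zip(ground_ticks, aerial_ticks)]


def worst_misalignment(ground_ticks, aerial_ticks):
    """Largest absolute tick offset across all paired records."""
    return max(abs(e) for e in alignment_errors(ground_ticks, aerial_ticks))
```

A perfectly co-registered collection yields all-zero errors (the W4 case), while a pipeline like W3 would accept offsets of at most one tick.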

W5: Reinforcement Learning Training Environment: The stability results (zero crashes over 357 reset cycles) and synchronous stepping validate CARLA-Air as a suitable RL environment for learning cooperative policies.
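An RL training loop over a synchronously stepped simulator follows a standard reset/step contract. The skeleton below stubs out the simulator with plain Python so only the loop shape is visible; the class, its observation fields, and the reward are illustrative placeholders, not a CARLA-Air API:

```python
class ToyAirGroundEnv:
    """Minimal reset/step environment mimicking a synchronously ticked world."""

    def __init__(self, horizon=50):
        self.horizon = horizon  # episode length in simulation ticks
        self.t = 0

    def reset(self):
        """Start a fresh episode (a real env would respawn actors here)."""
        self.t = 0
        return {"ground_obs": 0.0, "aerial_obs": 0.0}

    def step(self, action):
        """Advance exactly one shared tick and return (obs, reward, done)."""
        self.t += 1
        obs = {"ground_obs": float(self.t), "aerial_obs": float(self.t)}
        reward = -abs(action)  # placeholder shaping term
        done = self.t >= self.horizon
        return obs, reward, done


def run_episode(env, policy=lambda obs: 0.0):
    """Roll out one episode with a trivial policy; returns total reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total
```

The key property the endurance results support is that reset (actor respawn) and step (shared tick) can be called hundreds of times without leaks or crashes.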

Theoretical and Practical Implications

  • Technical Contribution: Provides a principled, minimal-modification solution to unify high-fidelity ground and aerial simulation, overcoming the fundamental UE4 GameMode conflict. This yields properties unavailable in bridge-based approaches: shared physics tick, shared rendering, and full API preservation.
  • Research Enablement: Lowers the barrier for research in air-ground embodied intelligence by providing a practical, unified foundation. It enables previously inaccessible directions like physically consistent paired aerial-ground datasets, coordination over joint multi-modal observations, and cross-view embodied navigation.
  • Sustainability: Ensures the continued evolution of the widely adopted but archived AirSim flight stack within a modern, maintained infrastructure.
  • Broader Impact: Supports emerging applications in low-altitude robotics, urban air mobility, drone logistics, and large-scale embodied AI research by providing a scalable, photorealistic simulation testbed.

Conclusion

CARLA-Air resolves the historical fragmentation of simulation platforms by integrating CARLA and AirSim within a single Unreal Engine process. Its composition-based architecture provides a shared world state, strict spatial-temporal consistency, and full compatibility with existing codebases.

The platform is validated through performance benchmarks and five representative workflows, demonstrating its capability as a unified infrastructure for air-ground embodied intelligence research. By inheriting AirSim's capabilities, it also provides a sustainable path forward for aerial simulation research. CARLA-Air is released as open-source with prebuilt binaries and source code to support immediate adoption.