# RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments - Summary
## Summary (Overview)
- Open-Vocabulary Semantic SLAM from Monocular Video: RADIO-ViPE is an online system that ingests raw, uncalibrated monocular RGB video and produces a 3D map where arbitrary natural language queries can be grounded to localized regions and objects.
- Tightly Coupled Multi-Modal Fusion: It introduces a novel dense bundle adjustment framework that jointly optimizes geometric constraints (camera poses, depth) with high-level semantic consistency from vision-language foundation model embeddings (RADIO/RADSeg).
- Dynamic Environment Robustness: A core contribution is a Temporally Consistent Adaptive Robust Kernel that classifies scene elements into static, movable, or actively moving categories, applying appropriate loss functions to suppress the influence of dynamics on map consistency.
- Calibration-Free and Ready-to-Deploy: The system requires no prior camera intrinsics, depth sensors, pose initialization, or category-specific supervision, bridging a critical gap for real-world robotic deployment and in-the-wild video analysis.
## Introduction and Theoretical Foundation
Knowledge mapping—grounding free-form language concepts onto a 3D geometric representation—is fundamental for general-purpose robots. While foundation models enable open-vocabulary scene understanding, most advanced 3D semantic methods operate offline, rely on calibrated data (posed RGB-D), and assume static scenes. Real-world deployment for robotics and unconstrained video streams faces challenges: data is inherently uncalibrated and environments are dynamic (moving agents, displaced furniture).
RADIO-ViPE addresses this integration gap. It builds upon ViPE (for uncalibrated monocular SLAM) and agglomerative foundation models like RADIO, which unify capabilities from multiple teacher models via distillation. The core idea is to tightly couple multi-modal embeddings (vision and language) with geometric scene information within an online optimization framework, so that map consistency benefits from multiple information sources and dynamic disturbances are handled explicitly.
## Methodology
The pipeline operates online at ~8-10 FPS on uncalibrated monocular RGB video. Key methodological components are:
### A. System Overview
- Camera Initialization: Intrinsics are bootstrapped from sampled frames using GeoCalib.
- Keyframe Selection: Based on dense optical flow motion relative to the last keyframe.
- Multi-Modal Feature Extraction: Dense embeddings are extracted per keyframe using RADSeg, upsampled, and compressed via PCA to $D$ dimensions (the ablations use $D = 256$) for efficiency.
- Depth Estimation: Metric depth is estimated per keyframe using monocular foundation models and converted to inverse depth.
- Semantic Flow Initialization: Optical flow priors are augmented with semantic correspondences from RADIO features, fused via per-pixel confidence blending, $f(u) = \lambda(u)\, f_{\text{photo}}(u) + \big(1 - \lambda(u)\big)\, f_{\text{sem}}(u)$, where $\lambda(u)$ balances photometric and semantic confidence (see the blending sketch after this list).
- Bundle Adjustment: A factor graph optimization jointly refines camera intrinsics $K$, poses $T_i$, and 3D structure (per-pixel inverse depths $d_i$) by minimizing a combined vision-language-geometric energy. Graph connectivity is augmented with embedding-based co-visibility using keyframe descriptors from mean-pooled RADSeg features.
- Open-Vocabulary Grounding: Achieved by decoding compressed 3D point features and projecting them into the SigLIP latent space for matching with text query embeddings (a grounding sketch also follows this list).
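As a concrete illustration of the semantic flow initialization step, here is a minimal sketch of per-pixel confidence blending between the two flow priors. All tensor names and the normalized form of $\lambda(u)$ are illustrative assumptions, not the paper's code:

```python
import torch

def blend_flow_priors(flow_photo: torch.Tensor,  # (2, H, W) optical-flow prior
                      flow_sem: torch.Tensor,    # (2, H, W) semantic-correspondence flow
                      c_photo: torch.Tensor,     # (H, W) photometric confidence in [0, 1]
                      c_sem: torch.Tensor,       # (H, W) semantic confidence in [0, 1]
                      eps: float = 1e-6) -> torch.Tensor:
    """Per-pixel confidence blending of two flow priors (illustrative form)."""
    # lambda(u): relative trust in the photometric prior, normalized per pixel.
    lam = c_photo / (c_photo + c_sem + eps)  # (H, W)
    return lam.unsqueeze(0) * flow_photo + (1.0 - lam).unsqueeze(0) * flow_sem
```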
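And a sketch of the grounding step, assuming compressed point features are decoded with the stored PCA basis and passed through a projection head into the SigLIP latent space; `pca_basis`, `pca_mean`, and `to_siglip` are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def ground_query(points_feat_pca: torch.Tensor,  # (N, D) PCA-compressed per-point features
                 pca_basis: torch.Tensor,        # (D, C) PCA basis used for compression
                 pca_mean: torch.Tensor,         # (C,)  PCA mean
                 to_siglip: torch.nn.Module,     # projection head into SigLIP space (assumed)
                 text_emb: torch.Tensor          # (C_t,) SigLIP embedding of the text query
                 ) -> torch.Tensor:
    """Return per-point similarity to a text query (illustrative pipeline)."""
    # 1. Decode compressed features back to the full embedding dimension.
    feats = points_feat_pca @ pca_basis + pca_mean               # (N, C)
    # 2. Project into the latent space shared with the text encoder.
    feats = to_siglip(feats)                                     # (N, C_t)
    # 3. Cosine similarity against the query; threshold or argmax downstream.
    return F.cosine_similarity(feats, text_emb.unsqueeze(0), dim=-1)  # (N,)
```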
### B. Joint Bundle Adjustment
The optimization minimizes a weighted energy function $E_{\text{total}}$ (defined in full in Sec. D). Two key terms are:
- Dense Photometric Flow Term ($E_{\text{photo}}$): Enforces geometric consistency via dense optical-flow constraints in the style of DROID-SLAM. A pixel $u$ in frame $i$ is projected into frame $j$:
  $$u_{ij} = \Pi\!\left(T_{ij}\, \Pi^{-1}\!\big(u,\, d_i(u)\big)\right)$$
  where $\Pi$ denotes pinhole projection with intrinsics $K$ and $T_{ij}$ the relative pose. The residual between this induced correspondence and the network-predicted flow is minimized.
- RADIO Embedding Similarity Term ($E_{\text{embed}}$): A novel term enforcing cross-view semantic alignment under geometric constraints. For a projected pixel correspondence $u \mapsto u_{ij}$, the target embedding $F_j(u_{ij})$ is obtained via bilinear interpolation. The cosine similarity is
  $$s_{ij}(u) = \frac{\langle F_i(u),\, F_j(u_{ij}) \rangle}{\lVert F_i(u) \rVert\, \lVert F_j(u_{ij}) \rVert}.$$
  The embedding residual is cast in a photometric form,
  $$r_{\text{embed}}(u) = 1 - s_{ij}(u),$$
  with $s_{ij}(u) \in [-1, 1]$, so $r_{\text{embed}}(u) \in [0, 2]$. The full term is $E_{\text{embed}} = \sum_{(i,j)} \sum_u r_{\text{embed}}(u)^2$ (both residuals are sketched in code after this list).
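Here is a minimal sketch of the two residuals, assuming a pinhole camera with intrinsics `K`, a 4x4 relative pose `T_ij`, and features already sampled at the projected locations; all function names are illustrative placeholders, not the paper's code:

```python
import torch
import torch.nn.functional as F

def reproject(uv, inv_depth, K, K_inv, T_ij):
    """Project pixels uv (N, 2) from frame i into frame j (pinhole model)."""
    ones = torch.ones_like(uv[:, :1])
    rays = (K_inv @ torch.cat([uv, ones], dim=-1).T).T   # back-projected rays, z = 1
    pts_i = rays / inv_depth.unsqueeze(-1)               # 3D points in frame i
    pts_j = torch.cat([pts_i, ones], dim=-1) @ T_ij.T    # rigid transform to frame j
    uv_j = (K @ pts_j[:, :3].T).T
    return uv_j[:, :2] / uv_j[:, 2:3]                    # perspective divide

def photometric_flow_residual(uv, inv_depth, K, K_inv, T_ij, flow_pred):
    """Residual between BA-induced correspondences and predicted ones."""
    uv_ij = reproject(uv, inv_depth, K, K_inv, T_ij)
    return (uv + flow_pred) - uv_ij                      # (N, 2)

def embedding_residual(feat_i, feat_j_at_uv_ij):
    """Cross-view semantic residual r = 1 - cos(F_i(u), F_j(u_ij))."""
    return 1.0 - F.cosine_similarity(feat_i, feat_j_at_uv_ij, dim=-1)
```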
### C. Temporally Consistent Adaptive Robust Kernel
To handle dynamics, a temporal stability field $S_i(u)$ is computed per pixel by aggregating cross-view embedding similarity over all connected keyframes $j \in \mathcal{N}(i)$:
$$S_i(u) = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} s_{ij}(u)$$
$S_i(u) \approx 1$ indicates static structure; low $S_i(u)$ flags dynamic or displaced elements.
This field maps to the shape parameter $\alpha$ of the general Barron loss $\rho_\alpha$, creating a three-regime adaptive kernel:
$$\alpha_i(u) = \begin{cases} 2, & S_i(u) \geq \theta_s \\ 1 + \dfrac{S_i(u) - \theta_m}{\theta_s - \theta_m}, & \theta_m \leq S_i(u) < \theta_s \\ \alpha_{\text{dyn}} + \dfrac{S_i(u)}{\theta_m}\,(1 - \alpha_{\text{dyn}}), & S_i(u) < \theta_m \end{cases}$$
with $\alpha_{\text{dyn}} \leq 0$, $\theta_s = 0.75$, $\theta_m = 0.35$. This applies the $\ell_2$ loss ($\alpha = 2$) to static surfaces, Huber ($\alpha = 1$) to movable objects, and a Cauchy-like loss ($\alpha \rightarrow 0$) to actively moving agents. The photometric loss is then reweighted:
$$E_{\text{photo}}^{\text{ark}} = \sum_u w_{\text{ark}}\big(E_{\text{photo}}(u),\, \alpha_i(u)\big) \cdot E_{\text{photo}}(u), \qquad w_{\text{ark}}(r, \alpha) = \frac{1}{\max(r, \varepsilon)} \frac{\partial \rho_\alpha(r)}{\partial r}.$$
A code sketch of this kernel follows.
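A minimal sketch of the three-regime $\alpha$ mapping and an IRLS-style weight using the closed form of the general Barron loss; the thresholds follow the paper, while the scale `c` and the default $\alpha_{\text{dyn}} = -2$ are illustrative assumptions:

```python
import torch

def alpha_from_stability(S, alpha_dyn=-2.0, theta_s=0.75, theta_m=0.35):
    """Map temporal stability S in [0, 1] to a Barron shape alpha (three regimes)."""
    static = torch.full_like(S, 2.0)                         # l2 on static structure
    movable = 1.0 + (S - theta_m) / (theta_s - theta_m)      # ramp from 1 (Huber) to 2
    dynamic = alpha_dyn + (S / theta_m) * (1.0 - alpha_dyn)  # below 1, toward alpha_dyn
    return torch.where(S >= theta_s, static,
           torch.where(S >= theta_m, movable, dynamic))

def barron_irls_weight(r, alpha, c=1.0, eps=1e-6):
    """IRLS weight w(r) = rho'(r)/r for the general Barron loss.

    For alpha = 2 the exponent vanishes and the weight is the constant 1/c^2 (l2);
    for alpha = 0 it reduces to the Cauchy weight (1/c^2) / (1 + (r/c)^2 / 2).
    """
    b = torch.clamp((alpha - 2.0).abs(), min=eps)            # |alpha - 2|
    x2 = (r / c) ** 2
    return (1.0 / c**2) * (x2 / b + 1.0) ** (alpha / 2.0 - 1.0)

# Usage: reweight the per-pixel photometric energy, as in E_photo^ark.
# w = barron_irls_weight(r, alpha_from_stability(S)); E_ark = (w.detach() * r**2).sum()
```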
### D. Complete Objective

The final optimization objective is
$$E_{\text{total}} = \gamma_{\text{photo}}\, E_{\text{photo}}^{\text{ark}} + \gamma_{\text{embed}}\, E_{\text{embed}} + E_{\text{reg}},$$
with a depth regularization term $E_{\text{reg}}(d_i) = \alpha_{\text{disp}} \sum_u \lVert d_i(u) - d_i^{\text{prior}}(u) \rVert^2$ to incorporate prior depth knowledge.
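For completeness, a sketch of how the weighted objective might be assembled per iteration; the $\gamma$ weights and $\alpha_{\text{disp}}$ below are placeholder values, not the paper's settings:

```python
import torch

def total_energy(E_photo_ark: torch.Tensor,  # scalar, robust-reweighted photometric term
                 E_embed: torch.Tensor,      # scalar, embedding similarity term
                 d: torch.Tensor,            # (H, W) current inverse depth
                 d_prior: torch.Tensor,      # (H, W) monocular inverse-depth prior
                 gamma_photo: float = 1.0,   # placeholder weight
                 gamma_embed: float = 0.1,   # placeholder weight
                 alpha_disp: float = 0.01    # placeholder weight
                 ) -> torch.Tensor:
    """E_total = gamma_photo * E_photo^ark + gamma_embed * E_embed + E_reg (sketch)."""
    E_reg = alpha_disp * ((d - d_prior) ** 2).sum()  # anchor depth to the monocular prior
    return gamma_photo * E_photo_ark + gamma_embed * E_embed + E_reg
```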
## Empirical Validation / Results

### A. SLAM Performance on Dynamic TUM-RGBD

RADIO-ViPE achieves state-of-the-art Average Trajectory Error (ATE) on dynamic sequences, demonstrating the robustness gained from the embedding term and adaptive kernel.

**TABLE II: SLAM PERFORMANCE COMPARISON ON TUM-RGBD IN CM (ATE ↓)**

| Method | fr3/w/xyz | fr3/w/rpy | fr3/w/hs | fr3/w/static | fr3/s/xyz | fr3/s/rpy | fr3/s/hs | fr3/s/static | **Average** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Dyna-SLAM [32] | 1.64 | 3.54 | 2.96 | 0.68 | 1.27 | – | 1.86 | – | 2.00 |
| DLD-SLAM [33] | 1.85 | 4.24 | 2.19 | 0.56 | – | – | – | – | 2.21 |
| V3D-SLAM [34] | 1.53 | 7.81 | 2.29 | 0.65 | 0.87 | 1.69 | 1.47 | 0.58 | 2.10 |
| ViPE (SAM) [5] | 2.40 | 3.46 | 2.52 | 0.54 | 1.43 | 3.80 | 2.57 | 0.60 | 2.17 |
| **RADIO-ViPE** | **1.90** | **3.50** | **3.10** | **0.55** | **1.15** | **2.72** | **1.60** | **0.53** | **1.90** |
| **RADIO-ViPE ark** | **1.55** | **3.39** | **1.96** | **0.50** | **0.98** | **2.65** | **1.44** | **0.56** | **1.63** |

> **Key Result:** The full pipeline with adaptive robust kernel (**RADIO-ViPE ark**) achieves the **best average ATE (1.63 cm)**, outperforming all compared dynamic SLAM methods.

### B. 3D Open-Vocabulary Semantic Segmentation on Replica

Evaluated on the Replica dataset, RADIO-ViPE ranks in the top three among offline open-vocabulary methods, despite operating online and without calibration, ground-truth depth, or poses.

**TABLE III: QUANTITATIVE COMPARISON ON REPLICA**

| Methods | mIoU ↑ | f-mIoU ↑ | Acc ↑ | mIoU ↑ | f-mIoU ↑ | Acc ↑ | Online | Calib Free | Depth Free | Pose Free |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | **Without Background** | | | **With Background** | | | | | | |
| ConceptGraphs [17] | 11.63 | 16.61 | 19.80 | 11.72 | 21.35 | 28.28 | ✗ | ✗ | ✗ | ✗ |
| RayFronts [23] | 39.37 | 62.03 | 68.80 | 27.73 | 43.37 | 54.45 | ✗ | ✗ | ✗ | ✓ |
| **RADIO-ViPE GT** | 29.51 | **52.24** | 59.80 | 28.19 | **54.44** | **65.21** | ✗ | ✗ | ✗ | ✗ |
| **RADIO-ViPE** | **24.25** | **50.63** | **59.25** | **19.00** | **37.13** | **48.38** | **✓** | **✓** | **✓** | **✓** |

> **Key Result:** The performance gap between **RADIO-ViPE** (no supervision) and **RADIO-ViPE GT** (using ground-truth depth & pose) is small (~1-2% in f-mIoU without background), confirming the method retains most of its accuracy without geometric supervision. RADIO-ViPE is the only method that ticks all four capability columns (**Online, Calib-Free, Depth-Free, Pose-Free**).

### C. Ablation Studies

- **PCA Dimensionality:** Using $D = 256$ for feature compression closely matches full-dimensional baseline performance ($\Delta \text{mIoU} < 1\%$), offering an optimal efficiency-accuracy trade-off.
- **Open-Vocabulary Grounding:** The system successfully grounds diverse text queries (e.g., "whiteboard", "plant", "chair") to the correct 3D regions in the reconstructed map.

## Theoretical and Practical Implications

**Theoretical:** RADIO-ViPE demonstrates a principled framework for **tightly coupling** high-level semantic representations from foundation models with low-level geometric optimization in SLAM. The temporal adaptive kernel provides a novel method for jointly reasoning about geometric and semantic consistency to classify and handle scene dynamics.
**Practical:** The system bridges a critical deployment gap for autonomous robotics and AR/VR by providing:

1. **Real-time, calibration-free operation** from ubiquitous monocular video.
2. **Open-vocabulary interaction** with the 3D map using natural language.
3. **Inherent robustness** in dynamic, human-centric environments.

This enables applications ranging from language-guided robot navigation to semantic analysis of in-the-wild egocentric video streams.

## Conclusion

RADIO-ViPE presents a robust, online semantic SLAM system that unifies open-vocabulary grounding, dynamic scene handling, and calibration-free operation from monocular RGB video. By tightly coupling RADIO-based multi-modal embeddings with geometric bundle adjustment and a novel adaptive robust kernel, it achieves state-of-the-art SLAM accuracy in dynamic settings and competitive open-vocabulary segmentation performance. The work demonstrates a strong performance-to-efficiency trade-off suitable for real-world deployment. Future directions may include improving segmentation of structural background classes and extending the framework to incorporate additional modalities such as audio or inertial data.