# RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

> RADIO-ViPE introduces a tightly coupled online bundle adjustment framework that fuses geometric SLAM with open-vocabulary semantic embeddings from foundation models, enabling robust 3D semantic mapping from uncalibrated monocular video in dynamic environments.

- **Source:** [arXiv](https://arxiv.org/abs/2604.26067)
- **Published:** 2026-05-01
- **Permalink:** https://picx.dev/p/805fLk
- **Whiteboard:** https://picx.dev/p/805fLk/image

## Summary

## Summary (Overview)
*   **Open-Vocabulary Semantic SLAM from Monocular Video:** RADIO-ViPE is an online system that ingests raw, uncalibrated monocular RGB video and produces a 3D map where arbitrary natural language queries can be grounded to localized regions and objects.
*   **Tightly Coupled Multi-Modal Fusion:** It introduces a novel dense bundle adjustment framework that jointly optimizes geometric constraints (camera poses, depth) with high-level semantic consistency from vision-language foundation model embeddings (RADIO/RADSeg).
*   **Dynamic Environment Robustness:** A core contribution is a **Temporally Consistent Adaptive Robust Kernel** that classifies scene elements into static, movable, or actively moving categories, applying appropriate loss functions to suppress the influence of dynamics on map consistency.
*   **Calibration-Free and Ready-to-Deploy:** The system requires no prior camera intrinsics, depth sensors, pose initialization, or category-specific supervision, bridging a critical gap for real-world robotic deployment and in-the-wild video analysis.

## Introduction and Theoretical Foundation
Knowledge mapping—grounding free-form language concepts onto a 3D geometric representation—is fundamental for general-purpose robots. While foundation models enable open-vocabulary scene understanding, most advanced 3D semantic methods operate **offline**, rely on **calibrated data** (posed RGB-D), and assume **static scenes**. Real-world deployment for robotics and unconstrained video streams faces challenges: data is inherently **uncalibrated** and environments are **dynamic** (moving agents, displaced furniture).

RADIO-ViPE addresses this integration gap. It builds upon ViPE (for uncalibrated monocular SLAM) and **agglomerative foundation models** like RADIO, which unify capabilities from multiple teacher models via distillation. The core idea is to **tightly couple** multi-modal embeddings (vision & language) with geometric scene information within an online optimization framework, improving map consistency from multiple sources and handling dynamic disturbances.

## Methodology
The pipeline operates online at ~8-10 FPS on uncalibrated monocular RGB video. Key methodological components are:

### A. System Overview
1.  **Camera Initialization:** Intrinsics are bootstrapped from sampled frames using GeoCalib.
2.  **Keyframe Selection:** Based on dense optical flow motion relative to the last keyframe.
3.  **Multi-Modal Feature Extraction:** Dense embeddings are extracted per keyframe using **RADSeg**, upsampled, and compressed via PCA to $D=256$ dimensions for efficiency.
4.  **Depth Estimation:** Metric depth is estimated per keyframe using monocular foundation models and converted to inverse depth.
5.  **Semantic Flow Initialization:** Optical flow priors are augmented with semantic correspondence from RADIO features, fused via per-pixel confidence blending:
    $$ \Omega_{\text{prior}}(u) := \beta \Omega_{\text{prior}}(u) + (1 - \beta) \Omega_{\text{sem}}(u) $$
    where $\beta$ balances photometric and semantic confidence.
6.  **Bundle Adjustment:** A factor graph optimization jointly refines camera intrinsics $K_q$, poses $T_i \in SE(3)$, and 3D structure by minimizing a combined vision-language-geometric energy. Graph connectivity is augmented with **embedding-based co-visibility** using keyframe descriptors from mean-pooled RADSeg features.
7.  **Open-Vocabulary Grounding:** Achieved by decoding compressed 3D point features and projecting them into the SigLIP latent space for matching with text query embeddings.
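
Steps 3 and 7 can be illustrated with a minimal sketch, assuming plain PCA via SVD and cosine-similarity matching. The toy sizes (32-dimensional features compressed to 8) stand in for RADSeg features and $D=256$. The names `fit_pca`, `compress`, `decode`, `ground_query`, and the optional `to_text_space` adaptor are hypothetical, not the authors' API, and the random vectors are placeholders for real image and text embeddings.

```python
import numpy as np

def fit_pca(features, n_components):
    """Fit PCA on (N, C) features; return the mean and an (n_components, C) basis."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def compress(features, mean, basis):
    """Project (N, C) features onto the PCA basis, giving (N, n_components) codes."""
    return (features - mean) @ basis.T

def decode(codes, mean, basis):
    """Map compressed codes back to the original feature space."""
    return codes @ basis + mean

def ground_query(point_codes, mean, basis, text_embedding, to_text_space=None):
    """Cosine similarity between each decoded point feature and a text embedding."""
    feats = decode(point_codes, mean, basis)
    if to_text_space is not None:
        # Hypothetical linear adaptor into the text latent space (assumption).
        feats = feats @ to_text_space
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    query = text_embedding / np.linalg.norm(text_embedding)
    return feats @ query

rng = np.random.default_rng(0)
# Toy features of rank <= 8, so an 8-component PCA is lossless up to rounding.
features = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 32)) + 0.5
mean, basis = fit_pca(features, n_components=8)
codes = compress(features, mean, basis)
recon = decode(codes, mean, basis)

# Use one point's own feature as a stand-in "text" query.
scores = ground_query(codes, mean, basis, text_embedding=features[42])
best = int(np.argmax(scores))
```

In this toy setup the query is recovered at its own index; with real features, thresholding `scores` would give the 3D region associated with a query.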

### B. Joint Bundle Adjustment
The optimization minimizes a weighted energy function $E_{\text{total}}$. Two key terms are:

1.  **Dense Photometric Flow Term ($E_{\text{photo}}$):** Enforces geometric consistency via dense optical flow constraints from DROID-SLAM. A pixel $u$ in frame $i$ is projected into frame $j$:
    $$ \mu_{ij} = \Pi_j \left( T_j T_i^{-1} \circ \Pi_i^{-1}(u, d_i(u)) \right) $$
    The residual between prior flow $\Omega_{ij}^{\text{prior}} = \mu_{ij} - u$ and the network-predicted flow $\Omega_{ij}(u)$ is minimized.

2.  **RADIO Embedding Similarity Term ($E_{\text{embed}}$):** A novel term enforcing cross-view semantic alignment under geometric constraints. For a projected pixel correspondence $v = P_{i,j}(u)$, the target embedding $\hat{Z}_j(P_{i,j}(u))$ is obtained via bilinear interpolation. The cosine similarity is:
    $$ \text{cs}_{ij}(u) = \frac{Z_i(u)^\top \hat{Z}_j(P_{i,j}(u))}{\|Z_i(u)\| \cdot \|\hat{Z}_j(P_{i,j}(u))\|} $$
    The embedding residual is cast in a photometric form:
    $$ r_{\text{embed}}(u) = \lambda_{\text{embed}} \sqrt{2(1 - \text{cs}_{ij}(u))} $$
    with $\lambda_{\text{embed}}=2$. The full term is $E_{\text{embed}} = \sum_u w(u) \cdot r_{\text{embed}}^2(u)$.
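
The two residuals above can be sketched as follows, assuming a pinhole camera with one shared intrinsic matrix `K` and 4x4 world-to-camera poses (which matches the $T_j T_i^{-1}$ ordering). The function names `reproject` and `embed_residual` and the numeric values are illustrative assumptions. The sketch also shows why the embedding residual has a "photometric" form: for unit-normalized vectors, $\sqrt{2(1-\text{cs})}$ equals their Euclidean distance.

```python
import numpy as np

def reproject(u, inv_depth, K, T_i, T_j):
    """Map pixel u of frame i into frame j (T_* are 4x4 world-to-camera poses)."""
    depth = 1.0 / inv_depth
    ray = np.linalg.inv(K) @ np.array([u[0], u[1], 1.0])
    p_i = np.append(depth * ray, 1.0)        # homogeneous 3D point in camera i
    p_j = T_j @ np.linalg.inv(T_i) @ p_i     # the same point in camera j
    uv = K @ p_j[:3]
    return uv[:2] / uv[2]

def embed_residual(z_i, z_j, lam=2.0):
    """Return (residual, cosine similarity) for two embedding vectors."""
    cs = z_i @ z_j / (np.linalg.norm(z_i) * np.linalg.norm(z_j))
    cs = np.clip(cs, -1.0, 1.0)              # guard against rounding above 1
    return lam * np.sqrt(2.0 * (1.0 - cs)), cs

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_i = np.eye(4)
T_j = np.eye(4)
T_j[0, 3] = 0.1                               # camera j sees points shifted by +0.1 in x

u = np.array([320.0, 240.0])
mu = reproject(u, inv_depth=0.5, K=K, T_i=T_i, T_j=T_j)
induced_flow = mu - u                         # flow implied by pose and depth

z_a = np.array([1.0, 2.0, 3.0])
z_b = np.array([3.0, -1.0, 0.5])
r_ab, cs_ab = embed_residual(z_a, z_b)
unit_dist = np.linalg.norm(z_a / np.linalg.norm(z_a) - z_b / np.linalg.norm(z_b))
```

Here `induced_flow` is 25 pixels along x, and `r_ab` matches `2 * unit_dist`, i.e. $\lambda_{\text{embed}}$ times the distance between the normalized embeddings.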

### C. Temporally Consistent Adaptive Robust Kernel
To handle dynamics, a temporal stability field $S_i(u)$ is computed per pixel over all connected keyframes $N(i)$:
$$ \bar{\text{cs}}_i(u) = \frac{1}{|N(i)|} \sum_{j \in N(i)} \text{cs}_{ij}(u) $$
$$ \sigma_i^2(u) = \frac{1}{|N(i)|} \sum_{j \in N(i)} (\text{cs}_{ij}(u) - \bar{\text{cs}}_i(u))^2 $$
$$ S_i(u) = \bar{\text{cs}}_i(u) \cdot (1 - \sigma_i^2(u)) \in [0, 1] $$
$S_i(u) \approx 1$ indicates static structure; $S_i(u) \approx 0$ flags dynamic or displaced elements.
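
A minimal sketch of the stability field, assuming the per-neighbor cosine similarities are already stacked into one array. The function name `stability_field` and the final clipping step are assumptions: the clip only keeps the result in $[0, 1]$ when the mean similarity is negative.

```python
import numpy as np

def stability_field(cs_stack):
    """cs_stack: (|N(i)|, H, W) cosine similarities of keyframe i to each neighbor."""
    mean_cs = cs_stack.mean(axis=0)
    var_cs = cs_stack.var(axis=0)            # population variance, divides by |N(i)|
    stability = mean_cs * (1.0 - var_cs)
    # Assumption: clip so that negative mean similarity maps to 0.
    return np.clip(stability, 0.0, 1.0)

# Two pixels observed from three neighboring keyframes (toy values):
# pixel 0 is consistently similar, pixel 1 changes appearance.
cs_stack = np.array([[[0.95, 0.9]],
                     [[0.96, 0.1]],
                     [[0.94, 0.1]]])
S = stability_field(cs_stack)
```

With these toy values the first pixel lands well above $0.75$ and the second below $0.35$, so the two pixels would fall into the static and dynamic regimes respectively.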

This field maps to the shape parameter $\alpha$ of the general Barron loss $\rho_\alpha$, creating a three-regime adaptive kernel:
$$ \alpha_i(u) = \begin{cases}
2, & S_i(u) \geq \theta_s \\
1 + \frac{S_i(u) - \theta_m}{\theta_s - \theta_m}, & \theta_m \leq S_i(u) < \theta_s \\
\alpha_{\text{dyn}} + \frac{S_i(u)}{\theta_m}(1 - \alpha_{\text{dyn}}), & S_i(u) < \theta_m
\end{cases} $$
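with $\alpha_{\text{dyn}} \leq 0$, $\theta_s=0.75$, $\theta_m=0.35$. This applies an $\ell_2$ loss ($\alpha=2$) to static surfaces, a Huber-like loss (pseudo-Huber/Charbonnier, $\alpha=1$) to movable objects, and a Cauchy-like or heavier-tailed loss ($\alpha \rightarrow \alpha_{\text{dyn}} \leq 0$) to actively moving agents. The three branches meet continuously at $\theta_m$ and $\theta_s$.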
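
The piecewise mapping can be written in a few lines. This is a sketch under stated assumptions: `adaptive_alpha` is a hypothetical name, and `alpha_dyn=0.0` is just one admissible choice, since the text only gives $\alpha_{\text{dyn}} \leq 0$.

```python
import numpy as np

def adaptive_alpha(stability, theta_s=0.75, theta_m=0.35, alpha_dyn=0.0):
    """Map a stability value (or array) to the Barron shape parameter alpha."""
    s = np.asarray(stability, dtype=float)
    # Movable regime: linear ramp from alpha=1 at theta_m to alpha=2 at theta_s.
    movable = 1.0 + (s - theta_m) / (theta_s - theta_m)
    # Dynamic regime: linear ramp from alpha_dyn at S=0 to alpha=1 at theta_m.
    dynamic = alpha_dyn + (s / theta_m) * (1.0 - alpha_dyn)
    return np.where(s >= theta_s, 2.0, np.where(s >= theta_m, movable, dynamic))

alphas = adaptive_alpha([0.0, 0.35, 0.55, 0.75, 1.0])
```

The sample values cover each regime and both thresholds, which makes the continuity of the mapping easy to check numerically.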

The photometric loss is then reweighted:
$$ E_{\text{photo}}^{\text{ark}} = \sum_u w_{\text{ark}}(E_{\text{photo}}(u), \alpha_i) \cdot E_{\text{photo}}(u) $$
$$ w_{\text{ark}}(r, \alpha) = \frac{1}{\max(r, \varepsilon)} \frac{\partial \rho_\alpha(r)}{\partial r} $$
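
For reference, the weight can be evaluated in closed form from the general Barron loss, whose derivative is $\frac{r}{c^2}\left(\frac{(r/c)^2}{|\alpha-2|}+1\right)^{\alpha/2-1}$ for finite $\alpha \neq 2$ and $r/c^2$ for $\alpha=2$. The sketch below assumes a scalar $\alpha$ per call and a scale parameter `c=1.0`; the scale and the names `barron_grad` and `ark_weight` are assumptions not stated in the summary.

```python
import numpy as np

def barron_grad(r, alpha, c=1.0):
    """Derivative of the general Barron loss w.r.t. r (finite scalar alpha)."""
    r = np.asarray(r, dtype=float)
    if alpha == 2.0:
        return r / c**2                       # plain L2: gradient is linear in r
    b = abs(alpha - 2.0)
    return (r / c**2) * ((r / c) ** 2 / b + 1.0) ** (alpha / 2.0 - 1.0)

def ark_weight(r, alpha, c=1.0, eps=1e-6):
    """Reweighting factor: (1 / max(r, eps)) * d rho_alpha / d r."""
    r = np.asarray(r, dtype=float)
    return barron_grad(r, alpha, c) / np.maximum(r, eps)

residuals = np.array([0.5, 1.0, 5.0, 10.0])
w_static = ark_weight(residuals, alpha=2.0)   # constant weight
w_movable = ark_weight(residuals, alpha=1.0)  # decays roughly like 1/r
w_dynamic = ark_weight(residuals, alpha=0.0)  # decays roughly like 1/r^2
```

Large residuals keep full weight under $\alpha=2$ but are progressively suppressed as $\alpha$ decreases, which is the mechanism that limits the influence of moving content on the optimization.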

### D. Complete Objective
The final optimization objective is:
$$ E_{\text{total}} = \gamma_{\text{photo}} E_{\text{photo}}^{\text{ark}} + \gamma_{\text{embed}} E_{\text{embed}} + E_{\text{reg}} $$
with a depth regularization term $E_{\text{reg}}(d_i) = \alpha_{\text{disp}} \sum_u \| d_i(u) - d_i^{\text{prior}}(u) \|^2$ to incorporate prior depth knowledge.

## Empirical Validation / Results

### A. SLAM Performance on Dynamic TUM-RGBD
RADIO-ViPE achieves the lowest average Absolute Trajectory Error (ATE) among the compared methods on dynamic sequences, indicating the robustness gained from the embedding term and adaptive kernel.

**TABLE II: SLAM PERFORMANCE COMPARISON ON TUM-RGBD IN CM (ATE ↓)**
| Method | fr3/w/xyz | fr3/w/rpy | fr3/w/hs | fr3/w/static | fr3/s/xyz | fr3/s/rpy | fr3/s/hs | fr3/s/static | **Average** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Dyna-SLAM [32] | 1.64 | 3.54 | 2.96 | 0.68 | 1.27 | – | 1.86 | – | 2.00 |
| DLD-SLAM [33] | 1.85 | 4.24 | 2.19 | 0.56 | – | – | – | – | 2.21 |
| V3D-SLAM [34] | 1.53 | 7.81 | 2.29 | 0.65 | 0.87 | 1.69 | 1.47 | 0.58 | 2.10 |
| ViPE (SAM) [5] | 2.4 | 3.46 | 2.52 | 0.54 | 1.43 | 3.80 | 2.57 | 0.6 | 2.17 |
| **RADIO-ViPE** | **1.90** | **3.50** | **3.10** | **0.55** | **1.15** | **2.72** | **1.60** | **0.53** | **1.90** |
| **RADIO-ViPE ark** | **1.55** | **3.39** | **1.96** | **0.50** | **0.98** | **2.65** | **1.44** | **0.56** | **1.63** |

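> **Key Result:** The full pipeline with adaptive robust kernel (**RADIO-ViPE ark**) achieves the **best average ATE (1.63 cm)** among the compared dynamic SLAM methods, although V3D-SLAM remains ahead on three individual sequences (fr3/w/xyz, fr3/s/xyz, fr3/s/rpy). Averages for methods with missing entries (–) cover fewer sequences.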
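
As a reminder of what the metric measures, ATE is commonly reported as the RMSE of position errors after aligning the estimated trajectory to ground truth. The sketch below assumes a similarity (scale, rotation, translation) alignment in the style of Umeyama; the summary does not state which alignment the authors used, and `align_similarity` and `ate_rmse` are hypothetical helper names.

```python
import numpy as np

def align_similarity(est, gt):
    """Least-squares similarity transform (s, R, t) mapping est onto gt."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e_c, g_c = est - mu_e, gt - mu_g
    cov = g_c.T @ e_c / len(est)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # avoid returning a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / e_c.var(axis=0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """RMSE of position error after similarity alignment; est, gt are (N, 3)."""
    s, R, t = align_similarity(est, gt)
    aligned = s * (est @ R.T) + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))

rng = np.random.default_rng(1)
gt = rng.normal(size=(50, 3))                 # toy ground-truth positions

# Build an estimate that differs from gt only by a similarity transform.
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
est = ((gt - np.array([1.0, -2.0, 0.5])) @ R_true) / 0.5

ate_clean = ate_rmse(est, gt)                 # close to zero
ate_noisy = ate_rmse(est + 0.01 * rng.normal(size=est.shape), gt)
```

Because the alignment absorbs global scale, rotation, and translation, only the remaining shape error of the trajectory contributes to the reported number.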

### B. 3D Open-Vocabulary Semantic Segmentation on Replica
Evaluated on the Replica dataset, RADIO-ViPE ranks in the top three when compared against offline open-vocabulary methods, despite operating online and without calibration, ground-truth depth, or poses.

**TABLE III: QUANTITATIVE COMPARISON ON REPLICA**
| Methods | mIoU ↑ (w/o BG) | f-mIoU ↑ (w/o BG) | Acc ↑ (w/o BG) | mIoU ↑ (w/ BG) | f-mIoU ↑ (w/ BG) | Acc ↑ (w/ BG) | Online | Calib Free | Depth Free | Pose Free |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ConceptGraphs [17] | 11.63 | 16.61 | 19.80 | 11.72 | 21.35 | 28.28 | ✗ | ✗ | ✗ | ✗ |
| RayFronts [23] | 39.37 | 62.03 | 68.80 | 27.73 | 43.37 | 54.45 | ✗ | ✗ | ✗ | ✓ |
| **RADIO-ViPE GT** | 29.51 | **52.24** | 59.80 | 28.19 | **54.44** | **65.21** | ✗ | ✗ | ✗ | ✗ |
| **RADIO-ViPE** | **24.25** | **50.63** | **59.25** | **19.00** | **37.13** | **48.38** | **✓** | **✓** | **✓** | **✓** |

> **Key Result:** The gap between **RADIO-ViPE** (no ground-truth geometry) and **RADIO-ViPE GT** (using ground-truth depth & pose) is small in f-mIoU without background (50.63 vs. 52.24, about 1.6 points), indicating the method retains most of that accuracy without geometric supervision; the gap is wider with background classes included (37.13 vs. 54.44 f-mIoU). RADIO-ViPE is the only method that ticks all four capability columns (**Online, Calib-free, Depth-free, Pose-free**).

### C. Ablation Studies
*   **PCA Dimensionality:** Using $D=256$ for feature compression closely matches full-dimensional baseline performance ($\Delta \text{mIoU} < 1\%$), offering a favorable efficiency-accuracy trade-off.
*   **Open-Vocabulary Grounding:** The system successfully grounds diverse text queries (e.g., "whiteboard", "plant", "chair") to correct 3D regions in the reconstructed map.

## Theoretical and Practical Implications
**Theoretical:** RADIO-ViPE demonstrates a principled framework for **tightly coupling** high-level semantic representations from foundation models with low-level geometric optimization in SLAM. The temporal adaptive kernel provides a novel method for jointly reasoning about geometric and semantic consistency to classify and handle scene dynamics.

**Practical:** The system bridges a critical deployment gap for autonomous robotics and AR/VR by providing:
1.  **Online, calibration-free operation** (~8-10 FPS) from ubiquitous monocular video.
2.  **Open-vocabulary interaction** with the 3D map using natural language.
3.  **Inherent robustness** in dynamic, human-centric environments.
This enables applications ranging from language-guided robot navigation to semantic analysis of in-the-wild egocentric video streams.

## Conclusion
RADIO-ViPE presents a robust, online semantic SLAM system that unifies open-vocabulary grounding, dynamic scene handling, and calibration-free operation from monocular RGB video. By tightly coupling RADIO-based multi-modal embeddings with geometric bundle adjustment and a novel adaptive robust kernel, it achieves state-of-the-art SLAM accuracy in dynamic settings and competitive open-vocabulary segmentation performance. The work demonstrates a strong performance-to-efficiency trade-off suitable for real-world deployment. Future directions may include improving segmentation of structural background classes and extending the framework to incorporate additional modalities like audio or inertial data.

---

_Markdown view of https://picx.dev/p/805fLk, served by PicX — AI-generated visual whiteboard summaries of research papers._