Summary (Overview)

  • Novel architecture: GAM repurposes a pretrained Geometric Foundation Model (GFM) as a shared substrate for perception, temporal prediction, and action decoding, rather than using it only as a static feature extractor.
  • Joint prediction: A causal future predictor inserted at an intermediate GFM layer forecasts both future geometric tokens and action tokens in a single autoregressive sequence, allowing the same backbone to decode future geometry and executable actions.
  • State-of-the-art results: On the LIBERO and LIBERO-Plus benchmarks, GAM achieves a 97.6% success rate on the original set and 85.5% on the challenging perturbed set, with a 9.7 percentage-point improvement over the best baseline in camera-perturbation scenarios.
  • Efficiency: GAM runs at 6.9 ms inference latency (≈145 Hz) with only 1.4B parameters, up to 55× faster than diffusion-based world-action models, while remaining competitive on the original LIBERO set and more robust under perturbation.
  • Real-world transfer: In physical robot experiments, GAM substantially outperforms baselines under both in-distribution and out-of-distribution (camera perturbation) settings, demonstrating practical viability.

Introduction and Theoretical Foundation

Generalist robot policies must follow natural-language instructions and reason about how objects, cameras, and robot actions interact in 3D physical space. Recent Vision‑Language‑Action Models (VLAs) and World‑Action Models (WAMs) inherit strong semantic or temporal priors from large-scale pretrained models, but they operate primarily on 2D image frames or 2D-derived latent spaces. This forces the action decoder to implicitly infer depth, scale, and occlusion from monocular cues, leading to limited generalization across changes in camera viewpoint, lighting, and scene layout.

To overcome this, several works incorporate Geometric Foundation Models (GFMs)—transformers that map RGB images to dense 3D geometry (depth, point maps, camera parameters). Prior attempts either distill selected GFM features into a VLA backbone or attach a lightweight action head on top of the GFM’s final features. These approaches treat the GFM as a static feature extractor and do not repurpose its multi‑layer geometric structure as the policy’s own temporal and action‑generating substrate.

GAM directly repurposes a GFM as a manipulation policy by splitting it at an intermediate layer. The shallow layers serve as an observation encoder, while the remaining deep layers act as a decoder. A causal transformer inserted at the split layer predicts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the deep GFM layers to simultaneously decode future geometry and robot actions. This design equips the GFM with language‑conditioned temporal world modeling with minimal architectural modification.


Methodology

Problem Formulation

The policy $\pi_\theta$ maps a context window of $H$ recent timesteps (multi-view RGB observations $\{o_t\}$, proprioceptive states $\{s_t\}$, previous actions $\{a_{t-1}\}$, and a fixed language instruction $\ell$) to an action chunk $\hat{a}_t \in \mathbb{R}^{C \times d_a}$ of length $C$ (open-loop execution).
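As a rough illustration of open-loop chunk execution, the sketch below queries a policy once and plays back the returned chunk before querying again. The source only states that chunks are executed open-loop; that all $C$ actions are executed before re-planning is an assumption, and `run_chunked` and `dummy_policy` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence

Action = List[float]


def run_chunked(
    policy: Callable[[int], Sequence[Action]],
    total_steps: int,
    chunk_len: int,
) -> List[Action]:
    """Query the policy once per chunk and execute the chunk without re-observing."""
    executed: List[Action] = []
    t = 0
    while t < total_steps:
        chunk = policy(t)  # one forward pass yields chunk_len actions
        assert len(chunk) == chunk_len
        # Truncate the last chunk if the episode ends mid-chunk.
        for action in chunk[: total_steps - t]:
            executed.append(action)
        t += chunk_len
    return executed


calls: List[int] = []


def dummy_policy(t: int) -> List[Action]:
    """Stand-in policy: records when it is queried and returns 8 one-dim actions."""
    calls.append(t)
    return [[float(t + k)] for k in range(8)]


trajectory = run_chunked(dummy_policy, total_steps=20, chunk_len=8)
print(calls)            # timesteps at which the policy was queried
print(len(trajectory))  # number of executed actions
```

With $C = 8$ and a 20-step episode, the policy is queried at steps 0, 8, and 16 only, which is where the per-action cost saving of chunking comes from.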

Architecture Overview

GAM operates in three stages inside a single pretrained GFM:

  1. Observation Encoding (§4.1): The shallow GFM layers (up to split layer $L_s$) encode each multi-view observation into latent geometric tokens:

    $$Z_{t'}^{(L_s)} = E_{\leq L_s}(o_{t'}) \in \mathbb{R}^{V(1+P) \times d},$$

    where $V$ is the number of views, $P$ the number of patches per view, and $d$ the hidden dimension. $L_s$ is chosen so that $L_s < m_1$ (the earliest layer used by the DPT head) to allow future-state decoding into depth.

  2. Causal Future Predictor (§4.2): A causal transformer $g_\phi$ is inserted at layer $L_s$. For each timestep $t'$, the encoder output is concatenated with embedded proprioception $p_{t'} = \psi_s(s_{t'})$ and previous action $q_{t'} = \psi_a(a_{t'-1})$ to form a token block $U_{t'}$. The full predictor input is

    $$X = [L_\ell; U_{t-H+1}; \dots; U_t],$$

    where $L_\ell$ are language tokens from a frozen T5 encoder. Block-causal self-attention prevents future leakage. The predictor outputs:

    • predicted future geometric tokens $\tilde{Z}_{t'+1}^{(L_s)}$,
    • a predicted next action token $\tilde{a}_{t'} \in \mathbb{R}^d$.
  3. Feature Propagation and Action Decoding (§4.3): The action token is replicated per-view and concatenated with predicted geometric tokens. These are passed through the remaining GFM deep layers $D_{>L_s}$ with extended causal masking. The propagated features are decoded by:

    • Action head $h_{\text{act}}$: aggregates action tokens over the context window to regress the executable action chunk $\hat{a}_{t'}$.
    • Depth head $h_{\text{depth}}$: the original GFM DPT head decodes predicted future depth maps $\tilde{D}_{t'+1}$.
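The block-causal attention pattern used by the predictor can be sketched as a boolean mask. The source states only that block-causal self-attention prevents future leakage; the specifics below are assumptions: language tokens form a prefix visible to every timestep block, tokens within the same block attend to each other freely, and language tokens attend only to other language tokens. `block_causal_mask` is a hypothetical helper, not the authors' code.

```python
from typing import List


def block_causal_mask(n_lang: int, n_blocks: int, block_size: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j."""
    n = n_lang + n_blocks * block_size

    def block_of(i: int) -> int:
        # The language prefix gets index -1 so it precedes every timestep block.
        return -1 if i < n_lang else (i - n_lang) // block_size

    # A query sees a key if the key's block is not later than its own block.
    return [[block_of(j) <= block_of(i) for j in range(n)] for i in range(n)]


# Toy sizes: 2 language tokens, 3 timestep blocks (H = 3), 2 tokens per block.
mask = block_causal_mask(n_lang=2, n_blocks=3, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

For these toy sizes the printed rows grow in steps of one block (`xx......`, `xxxx....`, `xxxxxx..`, `xxxxxxxx`, each repeated twice), so no token in block $U_{t'}$ can see tokens from $U_{t'+1}$ or later.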

Training Objective

The model is trained end‑to‑end with a multi‑task loss:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{act}} \mathcal{L}_{\text{act}} + \lambda_{\text{feat}} \mathcal{L}_{\text{feat}} + \lambda_{\text{depth}} \mathcal{L}_{\text{depth}},$$

  • $\mathcal{L}_{\text{act}}$: $\ell_1$ regression between decoded $\hat{a}_{t'}$ and expert action $a_{t'}$.
  • $\mathcal{L}_{\text{feat}}$: aligns predicted future tokens $\tilde{Z}_{t'+1}^{(L_s)}$ with actual next-frame tokens $Z_{t'+1}^{(L_s)}$ extracted from the frozen GFM encoder.
  • $\mathcal{L}_{\text{depth}}$: supervises decoded future depth $\tilde{D}_{t'+1}$ against ground-truth future depth $D_{t'+1}$ using scale-invariant and gradient-matching penalties.

Hyperparameters: $L_s = 12$, $H = 4$ for pretraining, $H = 1$ for post-training, $C = 8$, $\lambda_{\text{act}} = 3$, $\lambda_{\text{feat}} = 1$, $\lambda_{\text{depth}} = 3$. The backbone is DA3-Giant fine-tuned on Track4World, with a 12-layer causal predictor of width $d_g = 1024$.
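The weighted objective can be sketched in a few lines. The summary does not specify the distance used for the feature term or the exact form of the depth penalties, so this sketch assumes a mean squared error for the feature term and a scale-invariant log loss (variance of the log-depth residual) for the depth term, and it omits the gradient-matching penalty. All function names are hypothetical; only the weights 3, 1, and 3 come from the text.

```python
import math
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error, used here for the action term."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error; an ASSUMED choice for the feature-alignment term."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def scale_invariant_log_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Variance of log-depth residuals; an ASSUMED form of the depth term.

    It is zero whenever pred equals target up to one global scale factor.
    Depths must be strictly positive.
    """
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    return sum(x * x for x in d) / n - (sum(d) / n) ** 2


def total_loss(l_act: float, l_feat: float, l_depth: float,
               w_act: float = 3.0, w_feat: float = 1.0, w_depth: float = 3.0) -> float:
    """Weighted sum with the loss weights reported in the text."""
    return w_act * l_act + w_feat * l_feat + w_depth * l_depth


# Toy flattened values, purely illustrative.
l_act = l1_loss([0.1, -0.2], [0.0, 0.0])
l_feat = mse_loss([1.0, 2.0], [1.0, 2.5])
l_depth = scale_invariant_log_loss([2.0, 4.0, 8.0], [1.0, 2.0, 4.0])
print(total_loss(l_act, l_feat, l_depth))
```

In the toy call the predicted depth is exactly twice the target everywhere, so the depth term vanishes up to floating-point error, which is the property a scale-invariant penalty is meant to have.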


Empirical Validation / Results

Simulation Benchmarks

Experiments are conducted on LIBERO (in‑distribution) and LIBERO‑Plus (out‑of‑distribution perturbations along camera, lighting, background, layout, etc.). GAM is compared against VLAs (π₀.₅, OpenVLA‑OFT, etc.), WAMs (Cosmos‑Policy, Fast‑WAM), and geometry‑aware VLAs (π₀.₅ + Spatial Forcing, π₀.₅ + ROCKET).

Table 1 (main results, truncated for clarity):

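| Method | Size | LIBERO Orig. (%) | LIBERO-Plus Overall (%) | Camera Pert. (%) |
|---|---|---|---|---|
| π₀.₅ | 3.3B | 96.9 | 84.6 | 72.0 |
| OpenVLA-OFT | 7B | 97.1 | 69.6 | 56.4 |
| Cosmos-Policy | 2B | 98.5 | 82.4 | 73.4 |
| π₀.₅ + Spatial Forcing | 3.3B | 94.0 | 25.7 | 0.1 |
| GAM (Ours) | 1.4B | 97.6 | 85.5 | 83.1 |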

GAM achieves the highest overall LIBERO‑Plus success rate (85.5%) and outperforms all baselines in camera‑perturbation settings by at least 9.7 percentage points, demonstrating the benefit of full GFM integration.
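The reported margins can be recomputed from the rows listed in Table 1. The sketch below covers only the baselines shown in this truncated table, so margins against omitted baselines are not checked; the dictionary layout and the helper name are illustrative.

```python
# Success rates (%) copied from the truncated Table 1:
# (LIBERO original, LIBERO-Plus overall, camera perturbation)
table1 = {
    "pi0.5": (96.9, 84.6, 72.0),
    "OpenVLA-OFT": (97.1, 69.6, 56.4),
    "Cosmos-Policy": (98.5, 82.4, 73.4),
    "pi0.5 + Spatial Forcing": (94.0, 25.7, 0.1),
    "GAM": (97.6, 85.5, 83.1),
}
COLUMNS = ("original", "plus_overall", "camera")


def margin_over_best_baseline(column: int, ours: str = "GAM") -> float:
    """Percentage-point gap between `ours` and the strongest other listed method."""
    best = max(row[column] for name, row in table1.items() if name != ours)
    return round(table1[ours][column] - best, 1)


margins = {col: margin_over_best_baseline(i) for i, col in enumerate(COLUMNS)}
print(margins)
```

This prints `{'original': -0.9, 'plus_overall': 0.9, 'camera': 9.7}`: the camera-perturbation margin matches the 9.7 points quoted above, while on the original LIBERO set GAM trails Cosmos-Policy by 0.9 points.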

Real‑World Results

Four manipulation tasks (~200 demos each) are evaluated in both nominal and perturbed (camera‑moved) settings. GAM consistently outperforms π₀.₅ and Spatial Forcing (Figure 4). For example, in the “Pick and Place” task, GAM achieves 80% success under perturbation vs. 10% for the best baseline.

Ablation Studies

Component ablation (Table 2):

  • Pretraining is critical: omitting it drops LIBERO‑Plus success from 89.7% to 73.4%.
  • Removing $\mathcal{L}_{\text{depth}}$ or $\mathcal{L}_{\text{feat}}$ has minimal effect when pretrained, indicating geometric dynamics are already encoded.
  • Horizon $H = 1$ is most robust; longer contexts can hurt.

Split layer selection (Table 3):

  • Best performance at $L_s = 12$ (99.6% Orig., 70.1% Plus).
  • Splitting too early ($L_s = 0$) or too late ($L_s \geq 27$) leads to collapse, confirming that forecasted tokens need sufficient deep-layer interaction.

Inference cost (Table 4):

| Method | Size | Latency (ms) |
|---|---|---|
| OpenVLA-OFT | 7B | 77.8 |
| π₀.₅ | 3.3B | 29.2 |
| Cosmos-Policy | 2B | 382.4 |
| GAM | 1.4B | 6.9 |

GAM is roughly 55× faster than Cosmos-Policy (6.9 ms vs. 382.4 ms) and has fewer parameters than every baseline listed.
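The latency figures translate into control rates and speedups with simple arithmetic. The sketch below uses only the Table 4 values and treats 1000 / latency as an upper bound on the control rate, since it ignores sensing, communication, and actuation overhead; `control_rate_hz` is an illustrative helper.

```python
# Inference latency in milliseconds, copied from Table 4.
latency_ms = {
    "OpenVLA-OFT": 77.8,
    "pi0.5": 29.2,
    "Cosmos-Policy": 382.4,
    "GAM": 6.9,
}


def control_rate_hz(ms: float) -> float:
    """Upper bound on the control rate if inference were the only cost."""
    return 1000.0 / ms


gam_ms = latency_ms["GAM"]
for name, ms in latency_ms.items():
    print(f"{name:14s} {control_rate_hz(ms):6.1f} Hz  {ms / gam_ms:5.1f}x GAM latency")
```

GAM's 6.9 ms corresponds to about 145 Hz, and the ratio 382.4 / 6.9 is just over 55, consistent with the speedup quoted in the text.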

Robustness to viewpoint variation (Figure 5): GAM maintains a high success rate across five difficulty levels of camera perturbation, consistently outperforming all competitors, especially under the strongest disturbances.


Theoretical and Practical Implications

  • Theoretical contribution: GAM demonstrates that a pretrained geometric foundation model can be repurposed as a full policy backbone—not merely a feature extractor—by inserting a lightweight causal transformer. Jointly predicting future geometry and actions in the same latent space aligns the policy with explicit 3D reasoning about object interactions, camera perspective, and robot dynamics.
  • Practical benefits:
    • Speed and size: A single feedforward pass avoids multi-step diffusion sampling, enabling real-time deployment (≈145 Hz) with only 1.4B parameters, which suits resource-constrained robots.
    • Robustness: Explicit geometric priors yield strong generalization to unseen camera viewpoints and environmental perturbations, a critical requirement for real‑world manipulation.
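    • Simplified training: The geometric future-prediction losses add dense supervision on top of the action loss. The ablations suggest their benefit is small once the model is pretrained, and that pretraining itself remains important (LIBERO-Plus success drops from 89.7% to 73.4% without it).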
  • Implications for future research: GAM suggests that future robot policies should be built natively on geometric backbones rather than attaching geometric modules as an afterthought. The success of joint action‑geometry prediction motivates further exploration of 3D‑aware world models for contact‑rich and long‑horizon tasks.

Conclusion and Limitations

GAM unifies geometry and action prediction with temporal world modeling inside a single shared GFM. By inserting a causal transformer between the GFM’s shallow and deep layers, it autoregressively decodes actions and future geometries, resolving the spatial ambiguities of 2D‑based foundation models. Extensive experiments show superior accuracy, faster inference, and strong out‑of‑distribution robustness.

Limitations: GAM’s language reasoning and commonsense capabilities are bounded by the frozen text encoder (T5). Integrating a large language model or an external reasoning module is a natural next step to enhance instruction following and high‑level planning. Additionally, the current framework assumes fixed camera setups; handling dynamic or unseen camera configurations at inference time remains an open challenge.
