Summary (Overview)

  • Native World Model Stack: Kairos introduces a unified framework for Physical AI that jointly addresses learning, maintaining, and running world models, moving beyond fragmented post-training approaches.
  • Cross-Embodiment Data Curriculum (CEDC): A progressive pre-training paradigm that organizes data into three layers—open-world videos (physical priors), human behavior (task semantics), and robot interactions (embodied control)—to build actionable world knowledge from scratch.
  • Hybrid Linear Temporal Memory: A Mixture-of-Transformers (MoT) architecture combining Sliding-Window Attention (SWA), Dilated SWA (DSWA), and Gated Linear Attention (GLA) for efficient long-horizon generation. Formal theoretical bounds prove this factorization strictly limits error accumulation.
  • Deployment-Aware System Co-Design: Co-designed for low-latency, memory-efficient inference on both server and consumer-grade hardware, enabling real-time closed-loop operation.
  • State-of-the-Art Performance: Achieves top scores on embodied benchmarks (LIBERO-plus, RoboTwin 2.0, WorldModelBench, DreamGen) while being 2–85× faster than comparable models.

Introduction and Theoretical Foundation

World models are evolving from passive video generators into foundational infrastructure for Physical AI. The paper identifies four tightly coupled challenges:

  1. Fragmented learning: Open-world videos lack action grounding; human data lacks robot embodiment; robot data is scarce and narrow.
  2. Long-horizon state maintenance: Local continuation heuristics (e.g., sliding window) cannot guarantee global state consistency over extended durations—a structural limitation proven mathematically.
  3. Embodiment gap: Models may predict plausible futures but fail to learn how actions change those futures in a controllable way.
  4. Deployment constraints: Real-time observation–action–feedback loops require low latency and memory efficiency, often neglected in offline evaluations.

Kairos tackles these jointly via a native pre-training paradigm, a unified architecture with hybrid temporal memory, and deployment-aware system co-design.

Methodology

Native Architecture with Unified Understanding, Generation, and Prediction

Kairos uses a Mixture-of-Transformers (MoT) backbone comprising:

  • World Understanding: A Vision-Language Model (VLM, e.g., Qwen series) extracts high-level semantic representations from heterogeneous inputs.

  • World Generation: A Diffusion Transformer (DiT) conditioned on text/image generates future video frames. It operates in a latent space via a high-compression video VAE.

  • World Prediction: An Action DiT jointly models future video tokens and robot action tokens using a unified attention masking scheme. The model is trained with flow matching:

    $$L_{\text{FM}}(\theta) = \mathbb{E}_{z_0, \epsilon, \sigma, c} \left[ \| \mathcal{V}_\theta(z_\sigma, \sigma, c) - (\epsilon - z_0) \|_2^2 \right]$$

    where $z_\sigma = (1-\sigma) z_0 + \sigma \epsilon$ is a linear interpolation between clean latent video $z_0$ and noise $\epsilon$.
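The flow-matching objective can be illustrated with a minimal NumPy sketch. It is not the paper's training code: `velocity_fn` is a hypothetical stand-in for the network $\mathcal{V}_\theta$, latents are flattened to vectors, and the noise level is drawn uniformly from [0, 1], which is an assumption since the summary does not state the sampling distribution.

```python
import numpy as np


def flow_matching_loss(velocity_fn, z0, cond, rng):
    """Monte Carlo estimate of the flow-matching loss for one batch.

    velocity_fn(z_sigma, sigma, cond) is a hypothetical stand-in for V_theta.
    z0 has shape [batch, dim].
    """
    batch = z0.shape[0]
    eps = rng.standard_normal(z0.shape)             # noise sample epsilon
    sigma = rng.uniform(0.0, 1.0, size=(batch, 1))  # assumed uniform noise level
    z_sigma = (1.0 - sigma) * z0 + sigma * eps      # linear interpolation
    target = eps - z0                               # regression target
    pred = velocity_fn(z_sigma, sigma, cond)
    # Squared L2 norm per sample, averaged over the batch.
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))


rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 16))                   # 8 toy latents of dimension 16
zero_model = lambda z, sigma, cond: np.zeros_like(z)
loss = flow_matching_loss(zero_model, z0, cond=None, rng=rng)
```

A model that always predicts zero pays the full norm of the target, while a model that recovers $\epsilon - z_0$ from $z_\sigma$ drives the loss to zero.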

Hybrid Linear Temporal Attention

The DiT backbone uses three complementary attention mechanisms to achieve linear complexity in the temporal dimension:

  • Short-term: Sliding Window Attention (SWA) – restricts attention to a local temporal window (size proportional to spatial tokens per frame) for modeling local motion.
  • Mid-term: Dilated Sliding Window Attention (DSWA) – uses dilation to extend the receptive field without quadratic cost, e.g., $d \in \{6, 12\}$.
  • Long-term: Gated Linear Attention (GLA) – implements a gated delta update (GatedDeltaNet) for global causal memory with linear complexity.
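The two windowed mechanisms can be pictured as attention masks. The sketch below works at frame granularity and assumes causal attention, neither of which the summary confirms (it describes the window as proportional to the spatial tokens per frame); `sliding_window_mask` is an illustrative helper, not a function from the Kairos code.

```python
import numpy as np


def sliding_window_mask(num_frames, window, dilation=1):
    """Boolean [T, T] mask: frame t attends to t, t-d, ..., t-(window-1)*d."""
    t = np.arange(num_frames)
    offset = t[:, None] - t[None, :]        # how far back the key frame lies
    causal = offset >= 0                    # assumption: no access to future frames
    on_grid = offset % dilation == 0        # keep every d-th frame only
    in_reach = offset < window * dilation   # at most `window` frames per query
    return causal & on_grid & in_reach


swa = sliding_window_mask(32, window=4)                # short-term: 4 adjacent frames
dswa = sliding_window_mask(32, window=4, dilation=6)   # mid-term: same budget, 6x reach
```

Each query row holds at most `window` active entries, so the number of attended pairs grows linearly with the number of frames rather than quadratically.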

The GLA update rule is:

$$
\begin{aligned}
q_t &= W_Q x_t, \quad k_t = W_K x_t, \quad v_t = W_V x_t, \quad \beta_t = \sigma(W_\beta x_t) \\
v_t^{\text{old}} &= S_{t-1} k_t, \quad v_t^{\text{new}} = \beta_t v_t + (1-\beta_t) v_t^{\text{old}} \\
S_t &= \alpha_t S_{t-1} + \beta_t (v_t - v_t^{\text{old}}) k_t^\top, \quad \alpha_t = \sigma(W_\alpha x_t) \\
o_t &= S_t q_t
\end{aligned}
$$

Here $S_t \in \mathbb{R}^{d_v \times d_k}$ is a learnable associative memory, and $\sigma(\cdot)$ denotes the sigmoid gate (not the noise level $\sigma$ of the flow-matching loss). Since $v_t^{\text{new}} - v_t^{\text{old}} = \beta_t (v_t - v_t^{\text{old}})$, the state update equivalently writes $(v_t^{\text{new}} - v_t^{\text{old}}) k_t^\top$ into the decayed memory. The gated delta update performs a single SGD step on an online regression loss, adaptively forgetting outdated associations.
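A single recurrence step can be sketched in NumPy as follows. The sketch assumes one head and scalar gates (so `w_beta` and `w_alpha` are vectors), which is a simplification chosen here; the summary does not say whether the gates are scalar or per-channel, and `gated_delta_step` is a hypothetical name.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_delta_step(S_prev, x, params):
    """One step of the gated delta update; returns (S_t, o_t)."""
    W_q, W_k, W_v, w_beta, w_alpha = params
    q, k, v = W_q @ x, W_k @ x, W_v @ x
    beta = sigmoid(w_beta @ x)     # write strength (assumed scalar)
    alpha = sigmoid(w_alpha @ x)   # retention gate (assumed scalar)
    v_old = S_prev @ k             # value currently stored under key k
    S = alpha * S_prev + beta * np.outer(v - v_old, k)
    return S, S @ q


rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 4
params = (
    rng.standard_normal((d_k, d_model)),  # W_Q
    rng.standard_normal((d_k, d_model)),  # W_K
    rng.standard_normal((d_v, d_model)),  # W_V
    rng.standard_normal(d_model),         # w_beta
    rng.standard_normal(d_model),         # w_alpha
)
S = np.zeros((d_v, d_k))
for x in rng.standard_normal((16, d_model)):   # 16 toy tokens
    S, o = gated_delta_step(S, x, params)      # state size never grows
```

The state keeps the fixed shape `[d_v, d_k]` however long the sequence is, which is where the linear cost in time comes from. With a unit-norm key and both gates saturated at 1, the update overwrites the stored value exactly, so reading the memory with that key returns the new value.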

Theoretical Guarantees

Theorem 1 (Necessity of persistent state): The excess risk from using only a recent window $W_t^{(w)}$ vs. full history $H_t$ satisfies

$$R_w^* - R_{\text{full}}^* = \mathbb{E}[\text{Var}(m_t \mid W_t^{(w)})]$$

which is strictly positive if the optimal full-history predictor $m_t$ is not measurable with respect to the window. This proves local heuristics are structurally insufficient for long-horizon consistency.
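The identity in Theorem 1 follows from a standard orthogonal decomposition. The derivation below is a sketch under assumptions the summary does not spell out: squared loss, a scalar target $y_t$ (a symbol introduced here), $m_t = \mathbb{E}[y_t \mid H_t]$, and a window that is a function of the full history.

```latex
% Assumptions: squared loss, scalar target y_t, m_t = E[y_t | H_t],
% and W_t^{(w)} is a function of H_t.
\begin{aligned}
m_t^{(w)} &:= \mathbb{E}[y_t \mid W_t^{(w)}] = \mathbb{E}[m_t \mid W_t^{(w)}]
  && \text{(tower property)} \\
R_w^* &= \mathbb{E}\big[(y_t - m_t^{(w)})^2\big] \\
  &= \mathbb{E}\big[(y_t - m_t)^2\big] + \mathbb{E}\big[(m_t - m_t^{(w)})^2\big]
     + 2\,\mathbb{E}\big[(y_t - m_t)(m_t - m_t^{(w)})\big] \\
  &= R_{\text{full}}^* + \mathbb{E}\big[\text{Var}(m_t \mid W_t^{(w)})\big]
\end{aligned}
```

The cross term vanishes because $m_t - m_t^{(w)}$ is determined by $H_t$ while $\mathbb{E}[y_t - m_t \mid H_t] = 0$. The remaining gap is zero only when $m_t$ can be computed from the window alone.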

Theorem 2 (Sufficiency of hybrid memory): Under a contractive gated delta update (factor $\rho < 1$) and bounded approximation errors, the long-horizon excess risk satisfies

$$R_t(\hat{\mu}_t) - R_t^* \leq \left( L\varepsilon + \frac{L_G \bar{\xi}}{1-\rho} \right)^2$$

as $t \to \infty$, where $\varepsilon$ is the local approximation error and $\bar{\xi}$ is the maximum one-step perturbation. The geometric damping ensures error does not accumulate.
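The $1/(1-\rho)$ term is what a geometric series produces. As a hedged illustration, suppose the memory error obeys the worst-case recursion $e_t = \rho\, e_{t-1} + \bar{\xi}$; this recursion is an assumption made here to show the mechanism and is not quoted from the paper's proof. The function names are illustrative.

```python
def memory_error_trajectory(rho, xi_bar, e0, steps):
    """Iterate the assumed worst case e_t = rho * e_{t-1} + xi_bar."""
    errors = [e0]
    for _ in range(steps):
        errors.append(rho * errors[-1] + xi_bar)
    return errors


def excess_risk_bound(L, eps, L_G, xi_bar, rho):
    """Right-hand side of the Theorem 2 bound."""
    return (L * eps + L_G * xi_bar / (1.0 - rho)) ** 2


# Toy constants chosen only for illustration.
errors = memory_error_trajectory(rho=0.9, xi_bar=0.01, e0=1.0, steps=500)
bound = excess_risk_bound(L=1.0, eps=0.05, L_G=2.0, xi_bar=0.01, rho=0.9)
```

Whatever the initial error, the trajectory settles at $\bar{\xi}/(1-\rho)$ instead of growing with $t$. Without contraction ($\rho \geq 1$) the same recursion would grow without bound.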

Native Pre-training Paradigm (CEDC)

The training pipeline has three stages, each dominated by a data layer:

| Stage | Data Layer | Resolution | Max Frames | Objective |
|---|---|---|---|---|
| I – Physical Pretraining | Open-world videos | 256P → 720P | 1 → 241 | Inject physical priors (gravity, object permanence, etc.) |
| II – Embodied Pretraining | Human demonstrations + robot data | 720P | 81–241 | Learn behavioral semantics, task taxonomies |
| III – Joint World-Action | Robot trajectories (low-level actions) | 720P | 81 | Align visual forecasting with action prediction |
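The three-stage schedule can be written down as a small configuration object. This is only a sketch of how such a curriculum might be encoded; the `Stage` class and its field names are hypothetical, and the values are copied from the stage table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data_layer: str
    resolution: tuple   # (start, end) vertical resolution, e.g. 256P -> 720P
    frames: tuple       # (min, max) frame count given for the stage
    objective: str


CEDC_STAGES = (
    Stage("I: Physical Pretraining", "open-world videos",
          (256, 720), (1, 241), "inject physical priors"),
    Stage("II: Embodied Pretraining", "human demonstrations + robot data",
          (720, 720), (81, 241), "learn behavioral semantics and task taxonomies"),
    Stage("III: Joint World-Action", "robot trajectories (low-level actions)",
          (720, 720), (81, 81), "align visual forecasting with action prediction"),
)

for stage in CEDC_STAGES:
    print(f"{stage.name}: {stage.data_layer}, up to {stage.frames[1]} frames")
```

Keeping the stages in an ordered, immutable tuple mirrors the progressive nature of the curriculum: each stage starts from the checkpoint of the previous one.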

A shape-aware timestep shift adapts the flow-matching noise schedule to varying video lengths and resolutions:

$$\tilde{\sigma}_i = \frac{s \sigma_i^{(0)}}{1 + (s-1) \sigma_i^{(0)}}, \quad s = \exp(f(L)) \sqrt{F}$$

where $L$ is the number of spatial tokens per frame and $F$ the number of frames.
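The shift is easy to evaluate numerically. In the sketch below the function $f$ is passed in as a parameter because the summary does not define it; the constant-zero `f` used in the demo is a placeholder, not the paper's choice, and `shifted_sigmas` is an illustrative name.

```python
import math


def shifted_sigmas(base_sigmas, tokens_per_frame, num_frames, f):
    """Map each sigma to s*sigma / (1 + (s-1)*sigma), s = exp(f(L)) * sqrt(F)."""
    s = math.exp(f(tokens_per_frame)) * math.sqrt(num_frames)
    return [s * sig / (1.0 + (s - 1.0) * sig) for sig in base_sigmas]


base = [0.0, 0.25, 0.5, 0.75, 1.0]
placeholder_f = lambda L: 0.0   # hypothetical; gives s = sqrt(F)
short_clip = shifted_sigmas(base, tokens_per_frame=100, num_frames=1, f=placeholder_f)
long_clip = shifted_sigmas(base, tokens_per_frame=100, num_frames=4, f=placeholder_f)
```

The endpoints 0 and 1 are fixed points of the map. For $s > 1$ every interior noise level moves upward, so longer or larger videos spend more of the schedule at high noise, and for $s = 1$ the schedule is unchanged.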

Inference Optimization

  • Timestep Distillation: Combines Distribution Matching Distillation (DMD) and Consistency Distillation to reduce diffusion steps from dozens to 4, preserving quality.
  • Hardware-aware optimization: Mixed-parallel inference (sequence + tensor parallelism), FP8 quantization, tiled GatedDeltaNet streaming, and weight-only INT4 text encoder quantization enable real-time performance on consumer GPUs.
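To make "4 steps" concrete, the sketch below integrates the flow-matching ODE $dz/d\sigma = \mathcal{V}_\theta$ from $\sigma = 1$ to $\sigma = 0$ with a plain Euler scheme. The sampler choice and the uniform step grid are assumptions, since the summary does not describe the sampler used after distillation, and the oracle velocity is a toy stand-in for a distilled model.

```python
import numpy as np


def euler_sample(velocity_fn, noise, num_steps=4):
    """Integrate dz/dsigma = v from sigma=1 (noise) down to sigma=0 (clean)."""
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)
    z = noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        z = z + (sigma_next - sigma) * velocity_fn(z, sigma)
    return z


# Toy setup: all data equals `target`, so the exact velocity at (z, sigma)
# is (z - target) / sigma, which equals epsilon - z_0 along the path.
target = np.array([1.0, -2.0, 0.5])
oracle = lambda z, sigma: (z - target) / sigma
noise = np.random.default_rng(0).standard_normal(3)
sample = euler_sample(oracle, noise, num_steps=4)
```

With a velocity field that is constant along each trajectory, as in this toy case, a handful of Euler steps already lands on the clean sample. Distillation aims to train a student whose few-step trajectories behave similarly.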

Empirical Validation / Results

Embodied World Model Benchmarks

WorldModelBench (Robot subset) – Table 6:

| Model | Params | Instruction Following | Physics Adherence | Total Score |
|---|---|---|---|---|
| Kairos | 4B | 2.36 | 4.96 | 9.30 |
| Cosmos3-Nano* | 16B | 2.36 | 4.96 | 9.26 |
| Lingbot* | 28B | 2.14 | 4.92 | 9.04 |

Kairos achieves perfect scores in Newtonian mechanics, fluid dynamics, and gravity.

DreamGen Bench – Table 7:

| Model | Params | AVG_PA | AVG_IF | AVG_Score |
|---|---|---|---|---|
| Kairos | 4B | 0.538 | 0.698 | 0.618 |
| Wan2.2* | 14B | 0.519 | 0.703 | 0.611 |
| Cosmos-Predict2.5* | 14B | 0.495 | 0.478 | 0.487 |

PAI-Bench Robot – Table 8 (small-scale models <10B):

| Model | Params | Domain Score | Overall Score |
|---|---|---|---|
| Kairos | 4B | 88.59 | 82.57 |
| Wan2.2* | 5B | 80.17 | 78.63 |
| GigaWorld-0 | 2B | 85.83 | 80.87 |

World Action Model Benchmarks

RoboTwin 2.0 – Table 11:

| Model | Type | Clean | Randomized | Average |
|---|---|---|---|---|
| Kairos | WAM | 96.9 | 95.2 | 96.1 |
| MotuBrain | WAM | 95.8 | 96.1 | 96.0 |
| G0.5 | VLA | 93.7 | 92.8 | 93.2 |

LIBERO-Plus – Table 12:

| Model | Type | Average |
|---|---|---|
| Kairos | WAM | 89.0 |
| Kairos-joint | WAM | 90.8 |
| Being-H0.7 | WAM | 84.8 |

Ablation studies show that embodied human-centric pretraining yields a +6.0 gain, and joint video-action training adds +23.2.

General World Model Benchmarks

VideoPhy – Table 17:

| Model | Average Score |
|---|---|
| Kairos (4B) | 45.55 |
| Cosmos-Predict2.5 (14B) | 45.16 |
| Wan2.2 (5B) | 38.85 |

Inference Efficiency

Table 5 (720p, 5s, TI2V):

| Model | Memory (GB) | Complexity (PFlops) | 1 GPU (s) | 4 GPU (s) |
|---|---|---|---|---|
| Kairos-4B | 23.5 | 2.3 | 43 | 9 |
| Wan2.2-5B | 23.4 | 16.6 | 201 | 85 |
| Cosmos-Predict2.5-14B | 70.2 | 156.5 | 2526 | 687 |

Kairos's inference cost scales linearly with video length, while the competitors' cost grows much faster than linearly (full attention is quadratic in sequence length).

Theoretical and Practical Implications

  • Theoretical: The paper proves that purely local attention is fundamentally insufficient for long-horizon world modeling (Theorem 1). The hybrid design (SWA+DSWA+GLA) is shown to be approximately sufficient, with error bounds that depend only on local approximation quality and the contractive property of GLA (Theorem 2). This provides a rigorous justification for separating local and global memory.

  • Practical: Kairos demonstrates that a well-designed native pre-training curriculum can replace decoupled fine-tuning pipelines, enabling efficient knowledge transfer from internet videos to robot control. The deployment-aware co-design makes high-fidelity world simulation possible on consumer hardware, opening the door to real-time self-evolution loops and democratized robotics research.

  • Impact on Physical AI: The unified understanding-generation-prediction architecture provides a substrate for future self-evolving agents that can autonomously collect data, refine policies, and adapt to new embodiments without manual re-engineering.

Conclusion

Kairos introduces a native world model stack that jointly addresses how to learn, maintain, and run the world for Physical AI. Key contributions include:

  1. A Cross-Embodiment Data Curriculum that progressively aligns open-world video, human behavior, and robot data.
  2. A Hybrid Linear Temporal Memory (SWA, DSWA, GLA) with theoretical error bounds guaranteeing long-horizon consistency.
  3. A Deployment-Aware System Co-Design enabling real-time inference on consumer-grade hardware.

Extensive evaluations show state-of-the-art results on embodied and general world model benchmarks while being 2–85× faster than comparable models.

Future directions: Autonomous self-evolution via recursive imagination and scaling to a generalist embodied substrate supporting diverse hardware (humanoids, dexterous hands) with zero-shot generalization.

The paper and its code are available at GitHub, Hugging Face, and ModelScope.
