# Kairos: A Native World Model Stack for Physical AI

> Kairos achieves state-of-the-art embodied world modeling with a native pre-training curriculum and hybrid linear temporal memory, running 2–85× faster than comparable models.

- **Source:** [arXiv](https://arxiv.org/abs/2606.16533)
- **Published:** 2026-06-19
- **Permalink:** https://picx.dev/p/Fw36ML
- **Whiteboard:** https://picx.dev/p/Fw36ML/image

## Summary (Overview)

- **Native World Model Stack**: Kairos introduces a unified framework for Physical AI that jointly addresses learning, maintaining, and running world models, moving beyond fragmented post-training approaches.
- **Cross-Embodiment Data Curriculum (CEDC)**: A progressive pre-training paradigm that organizes data into three layers—open-world videos (physical priors), human behavior (task semantics), and robot interactions (embodied control)—to build actionable world knowledge from scratch.
- **Hybrid Linear Temporal Memory**: A Mixture-of-Transformers (MoT) architecture combining Sliding-Window Attention (SWA), Dilated SWA (DSWA), and Gated Linear Attention (GLA) for efficient long-horizon generation. Formal theoretical bounds prove this factorization strictly limits error accumulation.
- **Deployment-Aware System Co-Design**: Co-designed for low-latency, memory-efficient inference on both server and consumer-grade hardware, enabling real-time closed-loop operation.
- **State-of-the-Art Performance**: Achieves top scores on embodied benchmarks (LIBERO-Plus, RoboTwin 2.0, WorldModelBench, DreamGen) while being 2–85× faster than comparable models.

## Introduction and Theoretical Foundation

World models are evolving from passive video generators into foundational infrastructure for Physical AI. The paper identifies four tightly coupled challenges:

1. **Fragmented learning**: Open-world videos lack action grounding; human data lacks robot embodiment; robot data is scarce and narrow.
2. **Long-horizon state maintenance**: Local continuation heuristics (e.g., sliding window) cannot guarantee global state consistency over extended durations—a structural limitation proven mathematically.
3. **Embodiment gap**: Models may predict plausible futures but fail to learn how actions change those futures in a controllable way.
4. **Deployment constraints**: Real-time observation–action–feedback loops require low latency and memory efficiency, often neglected in offline evaluations.

Kairos tackles these jointly via a **native pre-training paradigm**, a **unified architecture with hybrid temporal memory**, and **deployment-aware system co-design**.

## Methodology

### Native Architecture with Unified Understanding, Generation, and Prediction

Kairos uses a Mixture-of-Transformers (MoT) backbone comprising:

- **World Understanding**: A Vision-Language Model (VLM, e.g., Qwen series) extracts high-level semantic representations from heterogeneous inputs.
- **World Generation**: A Diffusion Transformer (DiT) conditioned on text/image generates future video frames. It operates in a latent space via a high-compression video VAE.
- **World Prediction**: An Action DiT jointly models future video tokens and robot action tokens using a unified attention masking scheme. The model is trained with **flow matching**:

  $$
  L_{\text{FM}}(\theta) = \mathbb{E}_{z_0, \epsilon, \sigma, c} \left[ \| \mathcal{V}_\theta(z_\sigma, \sigma, c) - (\epsilon - z_0) \|_2^2 \right]
  $$

  where $z_\sigma = (1-\sigma) z_0 + \sigma \epsilon$ is a linear interpolation between clean latent video $z_0$ and noise $\epsilon$.
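
The flow-matching objective above can be checked on a toy example. The following is a minimal sketch in plain Python, not the authors' training code: the helper names (`interpolate`, `velocity_target`, `fm_loss`) and the 3-element "latent" are assumptions made for illustration, and the network $\mathcal{V}_\theta$ is replaced by hand-written predictions.

```python
import random

def interpolate(z0, eps, sigma):
    """z_sigma = (1 - sigma) * z0 + sigma * eps, elementwise."""
    return [(1.0 - sigma) * a + sigma * b for a, b in zip(z0, eps)]

def velocity_target(z0, eps):
    """Regression target of the flow-matching loss: eps - z0."""
    return [b - a for a, b in zip(z0, eps)]

def fm_loss(pred, z0, eps):
    """Squared L2 distance between a predicted velocity and the target."""
    target = velocity_target(z0, eps)
    return sum((p - t) ** 2 for p, t in zip(pred, target))

rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]                      # toy "clean latent" (hypothetical values)
eps = [rng.gauss(0.0, 1.0) for _ in z0]    # Gaussian noise sample
sigma = rng.random()                       # noise level in [0, 1)

z_sigma = interpolate(z0, eps, sigma)      # the input the network would see
ideal = velocity_target(z0, eps)           # what a perfect predictor would output

print(fm_loss(ideal, z0, eps))             # zero loss for the ideal prediction
print(fm_loss([0.0] * 3, z0, eps))         # positive loss for a wrong prediction
```

The target $\epsilon - z_0$ is the derivative of $z_\sigma$ with respect to $\sigma$, which is why the regression target does not depend on the sampled noise level.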

### Hybrid Linear Temporal Attention

The DiT backbone uses three complementary attention mechanisms to achieve linear complexity in the temporal dimension:

- **Short-term: Sliding Window Attention (SWA)** – restricts attention to a local temporal window (size proportional to spatial tokens per frame) for modeling local motion.
- **Mid-term: Dilated Sliding Window Attention (DSWA)** – uses dilation to extend the receptive field without quadratic cost, e.g., $d \in \{6, 12\}$.
- **Long-term: Gated Linear Attention (GLA)** – implements a gated delta update (GatedDeltaNet) for global causal memory with linear complexity.
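
To make the cost argument concrete, the sketch below builds frame-level attention masks for the two windowed mechanisms and counts attended pairs. It is an illustration under stated assumptions: the masks are causal and indexed by frame, whereas the model operates on tokens with a window proportional to the spatial tokens per frame. The function names and the window size of 4 are hypothetical; only the dilation value 6 comes from the text.

```python
def swa_mask(num_frames, window):
    """Causal sliding window: frame t attends to frames t-window+1 .. t."""
    return [[0 <= t - s < window for s in range(num_frames)]
            for t in range(num_frames)]

def dswa_mask(num_frames, window, dilation):
    """Causal dilated window: frame t attends to t, t-d, t-2d, ... (window taps)."""
    return [[0 <= t - s < window * dilation and (t - s) % dilation == 0
             for s in range(num_frames)]
            for t in range(num_frames)]

def attended_pairs(mask):
    """Number of (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)

T = 100
swa = swa_mask(T, window=4)
dswa = dswa_mask(T, window=4, dilation=6)
full_causal_pairs = T * (T + 1) // 2       # cost of unrestricted causal attention

print(attended_pairs(swa), attended_pairs(dswa), full_causal_pairs)
# Frames visible from frame 20 under the dilated window:
print([s for s, allowed in enumerate(dswa[20]) if allowed])
```

Both windowed masks allow at most `window` keys per query, so the pair count grows linearly in the number of frames, while the dilated variant reaches 18 frames into the past with the same budget.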

The GLA update rule is:

$$
\begin{aligned}
q_t &= W_Q x_t, \quad k_t = W_K x_t, \quad v_t = W_V x_t, \quad \beta_t = \sigma(W_\beta x_t) \\
v_t^{\text{old}} &= S_{t-1} k_t, \quad v_t^{\text{new}} = \beta_t v_t + (1-\beta_t) v_t^{\text{old}} \\
S_t &= \alpha_t S_{t-1} + \beta_t (v_t - v_t^{\text{old}}) k_t^\top , \quad \alpha_t = \sigma(W_\alpha x_t) \\
o_t &= S_t q_t
\end{aligned}
$$

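Here $S_t \in \mathbb{R}^{d_v \times d_k}$ is an associative memory state that is rewritten at every step; the learned parameters are the projections $W_Q, W_K, W_V, W_\alpha, W_\beta$. The write term equals $(v_t^{\text{new}} - v_t^{\text{old}}) k_t^\top$, since $v_t^{\text{new}} - v_t^{\text{old}} = \beta_t (v_t - v_t^{\text{old}})$. It is the negative gradient of the online regression loss $\tfrac{1}{2}\| S k_t - v_t \|_2^2$ at $S_{t-1}$, scaled by $\beta_t$, so the gated delta update performs a single SGD step on that loss while the decay gate $\alpha_t$ adaptively forgets outdated associations.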
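
One step of the update rule can be written out directly. This is a minimal sketch in plain Python under simplifying assumptions: the gates $\alpha_t$ and $\beta_t$ are passed in as scalars instead of being computed from $W_\alpha$ and $W_\beta$, the projections are skipped, and the names `matvec` and `gated_delta_step` are hypothetical.

```python
def matvec(S, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]

def gated_delta_step(S, q, k, v, alpha, beta):
    """Apply S_t = alpha * S_{t-1} + beta * (v - S_{t-1} k) k^T and read o_t = S_t q."""
    v_old = matvec(S, k)                   # value currently stored under key k
    S_new = [[alpha * S[i][j] + beta * (v[i] - v_old[i]) * k[j]
              for j in range(len(k))]
             for i in range(len(v))]
    return S_new, matvec(S_new, q)

# Empty 2x2 memory, one unit-norm key, no decay (alpha = 1) and a full write (beta = 1).
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]

S, out_first = gated_delta_step(S, q=key, k=key, v=[2.0, 3.0], alpha=1.0, beta=1.0)
S, out_second = gated_delta_step(S, q=key, k=key, v=[5.0, 7.0], alpha=1.0, beta=1.0)

print(out_first)    # the first value is stored and read back
print(out_second)   # writing under the same key replaces the old value
```

The second write illustrates the delta rule: because the stored value is subtracted before the new one is added, a repeated key overwrites its association instead of summing with it.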

### Theoretical Guarantees

**Theorem 1** (Necessity of persistent state): The excess risk from using only a recent window $W_t^{(w)}$ vs. full history $H_t$ satisfies  

$$
R_w^* - R_{\text{full}}^* = \mathbb{E}[\text{Var}(m_t | W_t^{(w)})]
$$

which is strictly positive if the optimal full-history predictor $m_t$ is not measurable with respect to the window. This proves local heuristics are **structurally insufficient** for long-horizon consistency.
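
A toy instance makes the identity tangible. The sketch below is an illustration constructed for this summary, not an experiment from the paper: the history consists of two independent fair bits, the target depends on the older bit plus zero-mean noise, and the window contains only the newer bit. Exact arithmetic with `fractions` avoids rounding.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)
# Outcomes (x1, x2, n): two fair bits and a zero-mean noise term, all independent.
outcomes = [(Fraction(x1), Fraction(x2), n)
            for x1, x2, n in product((0, 1), (0, 1), (-half, half))]
prob = Fraction(1, len(outcomes))

def target(x1, x2, n):
    return x1 + n                 # depends on the old bit x1, outside the window

def full_history(x1, x2, n):
    return x1                     # m_t = E[y | x1, x2]

def window_only(x1, x2, n):
    return half                   # E[y | x2] = 1/2, because x2 is uninformative

def risk(predictor):
    """Expected squared error of a predictor over all outcomes."""
    return sum(prob * (target(*o) - predictor(*o)) ** 2 for o in outcomes)

def expected_conditional_variance():
    """E[Var(m_t | window)], where the window reveals only x2."""
    total = Fraction(0)
    for w in (0, 1):
        group = [o for o in outcomes if o[1] == w]
        mean = sum(full_history(*o) for o in group) / len(group)
        var = sum((full_history(*o) - mean) ** 2 for o in group) / len(group)
        total += prob * len(group) * var
    return total

gap = risk(window_only) - risk(full_history)
print(gap, expected_conditional_variance())   # both equal 1/4
```

The windowed predictor is optimal given what it sees, yet it still pays the variance of the information it cannot access, which is the structural gap the theorem describes.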

**Theorem 2** (Sufficiency of hybrid memory): Under a contractive gated delta update (factor $\rho < 1$) and bounded approximation errors, the long-horizon excess risk satisfies  

$$
R_t(\hat{\mu}_t) - R_t^* \leq \left( L\varepsilon + \frac{L_G \bar{\xi}}{1-\rho} \right)^2
$$

as $t \to \infty$, where $\varepsilon$ is the local approximation error and $\bar{\xi}$ is the maximum one-step perturbation. The geometric damping ensures error does not accumulate.
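
The $\frac{1}{1-\rho}$ factor is the limit of a geometric series. As a hedged illustration, the sketch below assumes a scalar error recursion $e_t = \rho\, e_{t-1} + \bar{\xi}$, which is a simplification chosen for this summary and not the paper's proof; `accumulated_error` and `risk_bound` are hypothetical names, and the numeric constants are arbitrary.

```python
def accumulated_error(rho, xi, steps):
    """Iterate e_t = rho * e_{t-1} + xi starting from e_0 = 0."""
    e = 0.0
    for _ in range(steps):
        e = rho * e + xi
    return e

def risk_bound(L, eps, L_G, xi_bar, rho):
    """Right-hand side of Theorem 2: (L * eps + L_G * xi_bar / (1 - rho)) ** 2."""
    return (L * eps + L_G * xi_bar / (1.0 - rho)) ** 2

xi = 0.01
for steps in (10, 100, 10_000):
    contractive = accumulated_error(0.9, xi, steps)      # rho < 1: saturates
    non_contractive = accumulated_error(1.0, xi, steps)  # rho = 1: grows with t
    print(steps, round(contractive, 4), round(non_contractive, 4))

print(risk_bound(L=1.0, eps=0.1, L_G=1.0, xi_bar=xi, rho=0.9))
```

With $\rho = 0.9$ the error saturates near $\bar{\xi} / (1 - \rho) = 0.1$ regardless of horizon, whereas with $\rho = 1$ it grows linearly in the number of steps.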

### Native Pre-training Paradigm (CEDC)

The training pipeline has three stages, each dominated by a data layer:

| Stage | Data Layer | Resolution | Max Frames | Objective |
|-------|------------|------------|------------|-----------|
| I – Physical Pretraining | Open-world videos | 256P → 720P | 1 → 241 | Inject physical priors (gravity, object permanence, etc.) |
| II – Embodied Pretraining | Human demonstrations + robot data | 720P | 81–241 | Learn behavioral semantics, task taxonomies |
| III – Joint World-Action | Robot trajectories (low-level actions) | 720P | 81 | Align visual forecasting with action prediction |

A **shape-aware timestep scheduler shift** is used to adapt the flow-matching noise schedule to varying video lengths and resolutions:

$$
\tilde{\sigma}_i = \frac{s \sigma_i^{(0)}}{1 + (s-1) \sigma_i^{(0)}}, \quad s = \exp(f(L)) \sqrt{F}
$$

where $L$ is the number of spatial tokens per frame and $F$ the number of frames.
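
The shift formula is easy to probe numerically. In the sketch below the shift factor $s$ is passed in directly, because this summary does not specify $f(L)$; the function name `shift_sigma` and the uniform base schedule are assumptions for illustration.

```python
def shift_sigma(sigma, s):
    """Shifted noise level: s * sigma / (1 + (s - 1) * sigma)."""
    return s * sigma / (1.0 + (s - 1.0) * sigma)

base = [i / 4 for i in range(5)]               # 0.0, 0.25, 0.5, 0.75, 1.0

for s in (1.0, 3.0):
    shifted = [round(shift_sigma(x, s), 4) for x in base]
    print(s, shifted)
```

The endpoints 0 and 1 are fixed for any $s$, $s = 1$ leaves the schedule unchanged, and $s > 1$ moves intermediate steps toward higher noise levels, which is the intended effect for longer or higher-resolution videos.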

### Inference Optimization

- **Timestep Distillation**: Combines Distribution Matching Distillation (DMD) and Consistency Distillation to reduce diffusion steps from dozens to 4, preserving quality.
- **Hardware-aware optimization**: Mixed-parallel inference (sequence + tensor parallelism), FP8 quantization, tiled GatedDeltaNet streaming, and weight-only INT4 text encoder quantization enable real-time performance on consumer GPUs.
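
For intuition about what a 4-step sampler does, the following sketch integrates the flow-matching ODE with four Euler steps. It does not implement DMD or consistency distillation: the distilled network is replaced by a hypothetical `oracle_velocity` that returns the exact velocity of one straight noise-to-data path, and the latent values are made up.

```python
def euler_sample(velocity_fn, noise, num_steps=4):
    """Integrate dz/dsigma = v from sigma = 1 (noise) down to sigma = 0 (data)."""
    z = list(noise)
    sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]  # 1.0, 0.75, ..., 0.0
    for s_cur, s_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(z, s_cur)
        z = [zi + (s_next - s_cur) * vi for zi, vi in zip(z, v)]
    return z

# Stand-in for a trained network: the exact velocity eps - z0 of one straight path.
z0 = [0.5, -1.0, 2.0]
eps = [1.5, 0.25, -0.75]

def oracle_velocity(z, sigma):
    return [e - c for e, c in zip(eps, z0)]

sample = euler_sample(oracle_velocity, eps, num_steps=4)
print(sample)   # recovers the clean latent z0 for this straight path
```

When the velocity field is constant along a path, any number of Euler steps lands on the data point; distillation aims to make the learned field behave well enough that a handful of steps suffices.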

## Empirical Validation / Results

### Embodied World Model Benchmarks

**WorldModelBench (Robot subset)** – Table 6:

| Model | Params | Instruction Following | Physics Adherence | Total Score |
|-------|--------|----------------------|-------------------|-------------|
| Kairos | 4B | **2.36** | **4.96** | **9.30** |
| Cosmos3-Nano* | 16B | **2.36** | **4.96** | 9.26 |
| Lingbot* | 28B | 2.14 | 4.92 | 9.04 |

Kairos achieves perfect scores in Newtonian mechanics, fluid dynamics, and gravity.

**DreamGen Bench** – Table 7:

| Model | Params | AVG_PA | AVG_IF | AVG_Score |
|-------|--------|--------|--------|-----------|
| Kairos | 4B | **0.538** | 0.698 | **0.618** |
| Wan2.2* | 14B | 0.519 | **0.703** | 0.611 |
| Cosmos-Predict2.5* | 14B | 0.495 | 0.478 | 0.487 |

**PAI-Bench Robot** – Table 8 (small-scale models <10B):

| Model | Params | Domain Score | Overall Score |
|-------|--------|--------------|---------------|
| Kairos | 4B | **88.59** | **82.57** |
| Wan2.2* | 5B | 80.17 | 78.63 |
| GigaWorld-0 | 2B | 85.83 | 80.87 |

### World Action Model Benchmarks

**RoboTwin 2.0** – Table 11:

| Model | Type | Clean | Randomized | Average |
|-------|------|-------|------------|---------|
| Kairos | WAM | **96.9** | **95.2** | **96.1** |
| MotuBrain | WAM | 95.8 | 96.1 | 96.0 |
| G0.5 | VLA | 93.7 | 92.8 | 93.2 |

**LIBERO-Plus** – Table 12:

| Model | Type | Average |
|-------|------|---------|
| Kairos | WAM | 89.0 |
| Kairos-joint | WAM | **90.8** |
| Being-H0.7 | WAM | 84.8 |

Ablation studies show that embodied human-centric pretraining yields a +6.0 gain, and joint video-action training adds +23.2.

### General World Model Benchmarks

**VideoPhy** – Table 17:

| Model | Average Score |
|-------|---------------|
| Kairos (4B) | **45.55** |
| Cosmos-Predict2.5 (14B) | 45.16 |
| Wan2.2 (5B) | 38.85 |

### Inference Efficiency

**Table 5** (720p, 5s, TI2V):

| Model | Memory (GB) | Complexity (PFlops) | 1 GPU (s) | 4 GPU (s) |
|-------|-------------|---------------------|-----------|------------|
| Kairos-4B | 23.5 | **2.3** | **43** | **9** |
| Wan2.2-5B | 23.4 | 16.6 | 201 | 85 |
| Cosmos-Predict2.5-14B | 70.2 | 156.5 | 2526 | 687 |

Kairos's inference cost scales linearly with video length, while the cost of the compared models grows super-linearly.
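
The speedups implied by Table 5 can be recomputed from its entries. The sketch below only restates the table values; the dictionary layout and the `ratio_vs_kairos` helper are hypothetical conveniences.

```python
TABLE5 = {
    "Kairos-4B": {"pflops": 2.3, "gpu1_s": 43, "gpu4_s": 9},
    "Wan2.2-5B": {"pflops": 16.6, "gpu1_s": 201, "gpu4_s": 85},
    "Cosmos-Predict2.5-14B": {"pflops": 156.5, "gpu1_s": 2526, "gpu4_s": 687},
}

def ratio_vs_kairos(model, column):
    """How many times larger a model's entry is than the Kairos-4B entry."""
    return TABLE5[model][column] / TABLE5["Kairos-4B"][column]

for model in ("Wan2.2-5B", "Cosmos-Predict2.5-14B"):
    ratios = {col: round(ratio_vs_kairos(model, col), 1)
              for col in ("pflops", "gpu1_s", "gpu4_s")}
    print(model, ratios)
```

On this configuration the latency ratios range from about 4.7× (Wan2.2-5B, 1 GPU) to about 76.3× (Cosmos-Predict2.5-14B, 4 GPUs), which falls inside the 2–85× range quoted in the summary.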

## Theoretical and Practical Implications

- **Theoretical**: The paper proves that purely local attention is fundamentally insufficient for long-horizon world modeling (Theorem 1). The hybrid design (SWA+DSWA+GLA) is shown to be approximately sufficient, with error bounds that depend only on local approximation quality and the contractive property of GLA (Theorem 2). This provides a rigorous justification for separating local and global memory.

- **Practical**: Kairos demonstrates that a well-designed native pre-training curriculum can replace decoupled fine-tuning pipelines, enabling efficient knowledge transfer from internet videos to robot control. The deployment-aware co-design makes high-fidelity world simulation possible on consumer hardware, opening the door to real-time self-evolution loops and democratized robotics research.

- **Impact on Physical AI**: The unified understanding-generation-prediction architecture provides a substrate for future self-evolving agents that can autonomously collect data, refine policies, and adapt to new embodiments without manual re-engineering.

## Conclusion

Kairos introduces a native world model stack that jointly addresses how to **learn**, **maintain**, and **run** the world for Physical AI. Key contributions include:

1. A **Cross-Embodiment Data Curriculum** that progressively aligns open-world video, human behavior, and robot data.
2. A **Hybrid Linear Temporal Memory** (SWA, DSWA, GLA) with theoretical error bounds guaranteeing long-horizon consistency.
3. A **Deployment-Aware System Co-Design** enabling real-time inference on consumer-grade hardware.

Extensive evaluations show state-of-the-art results on embodied and general world model benchmarks while being 2–85× faster than comparable models.

**Future directions**: Autonomous self-evolution via recursive imagination and scaling to a generalist embodied substrate supporting diverse hardware (humanoids, dexterous hands) with zero-shot generalization.

> ⚠️ The paper and its code are available at [GitHub](https://github.com/kairos-agi/kairos-sensenova), [Hugging Face](https://huggingface.co/kairos-agi), and [ModelScope](https://modelscope.cn/collections/kairos-team/kairos30).

---

_Markdown view of https://picx.dev/p/Fw36ML, served by PicX — AI-generated visual whiteboard summaries of research papers._