Visual Summary | Kairos: A Native World Model Stack for Physical AI

Summary (Overview)

Native World Model Stack: Kairos introduces a unified framework for Physical AI that jointly addresses learning, maintaining, and running world models, moving beyond fragmented post-training approaches.
Cross-Embodiment Data Curriculum (CEDC): A progressive pre-training paradigm that organizes data into three layers—open-world videos (physical priors), human behavior (task semantics), and robot interactions (embodied control)—to build actionable world knowledge from scratch.
Hybrid Linear Temporal Memory: The DiT backbone inside the Mixture-of-Transformers (MoT) architecture combines Sliding-Window Attention (SWA), Dilated SWA (DSWA), and Gated Linear Attention (GLA) for efficient long-horizon generation. Formal bounds show that this factorization keeps error from accumulating over long horizons.
Deployment-Aware System Co-Design: Co-designed for low-latency, memory-efficient inference on both server and consumer-grade hardware, enabling real-time closed-loop operation.
State-of-the-Art Performance: Achieves top scores on embodied benchmarks (LIBERO-plus, RoboTwin 2.0, WorldModelBench, DreamGen) while being 2–85× faster than comparable models.

Introduction and Theoretical Foundation

World models are evolving from passive video generators into foundational infrastructure for Physical AI. The paper identifies four tightly coupled challenges:

Fragmented learning: Open-world videos lack action grounding; human data lacks robot embodiment; robot data is scarce and narrow.
Long-horizon state maintenance: Local continuation heuristics (e.g., sliding window) cannot guarantee global state consistency over extended durations—a structural limitation proven mathematically.
Embodiment gap: Models may predict plausible futures but fail to learn how actions change those futures in a controllable way.
Deployment constraints: Real-time observation–action–feedback loops require low latency and memory efficiency, often neglected in offline evaluations.

Kairos tackles these jointly via a native pre-training paradigm, a unified architecture with hybrid temporal memory, and deployment-aware system co-design.

Methodology

Native Architecture with Unified Understanding, Generation, and Prediction

Kairos uses a Mixture-of-Transformers (MoT) backbone comprising:

World Understanding: A Vision-Language Model (VLM, e.g., Qwen series) extracts high-level semantic representations from heterogeneous inputs.
World Generation: A Diffusion Transformer (DiT) conditioned on text/image generates future video frames. It operates in a latent space via a high-compression video VAE.
World Prediction: An Action DiT jointly models future video tokens and robot action tokens using a unified attention masking scheme. The model is trained with flow matching:
$L_{\text{FM}}(\theta) = \mathbb{E}_{z_0, \epsilon, \sigma, c} \left[ \| \mathcal{V}_\theta(z_\sigma, \sigma, c) - (\epsilon - z_0) \|_2^2 \right]$
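where $z_\sigma = (1-\sigma) z_0 + \sigma \epsilon$ linearly interpolates between the clean video latent $z_0$ and noise $\epsilon$ at noise level $\sigma$, $c$ is the conditioning input, and $\mathcal{V}_\theta$ is the predicted velocity. The regression target $\epsilon - z_0$ is the derivative of $z_\sigma$ with respect to $\sigma$.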
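The flow-matching objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's training code: the name `flow_matching_loss`, the uniform sampling of $\sigma$, and the callable `velocity_fn` standing in for $\mathcal{V}_\theta$ are all assumptions made for the example.

```python
import numpy as np

def flow_matching_loss(velocity_fn, z0, cond, rng):
    """Monte Carlo estimate of L_FM for one batch of clean latents z0.

    velocity_fn(z_sigma, sigma, cond) is a hypothetical stand-in for V_theta.
    """
    batch = z0.shape[0]
    eps = rng.standard_normal(z0.shape)            # noise sample
    sigma = rng.uniform(0.0, 1.0, size=batch)      # assumed: sigma ~ U(0, 1)
    s = sigma.reshape(batch, *([1] * (z0.ndim - 1)))  # broadcast over latent dims
    z_sigma = (1.0 - s) * z0 + s * eps             # linear interpolation
    target = eps - z0                              # velocity d z_sigma / d sigma
    pred = velocity_fn(z_sigma, sigma, cond)
    per_sample = np.sum((pred - target) ** 2, axis=tuple(range(1, z0.ndim)))
    return float(np.mean(per_sample))
```

A model that always predicts zero velocity gets a strictly positive loss, while a model that recovers $\epsilon - z_0$ exactly drives the loss to zero.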

Hybrid Linear Temporal Attention

The DiT backbone uses three complementary attention mechanisms to achieve linear complexity in the temporal dimension:

Short-term: Sliding Window Attention (SWA) – restricts attention to a local temporal window (size proportional to spatial tokens per frame) for modeling local motion.
Mid-term: Dilated Sliding Window Attention (DSWA) – uses dilation to extend the receptive field without quadratic cost, e.g., with dilation rates $d \in \{6, 12\}$.
Long-term: Gated Linear Attention (GLA) – implements a gated delta update (GatedDeltaNet) for global causal memory with linear complexity.
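To make the short-term and mid-term patterns concrete, the sketch below builds frame-level attention masks for SWA and DSWA. It rests on assumptions that the summary does not state: the window is taken to be bidirectional, masks are defined over frame indices rather than individual tokens, and the function names are invented for illustration.

```python
import numpy as np

def swa_mask(num_frames, window):
    """True where query frame i may attend to key frame j (|i - j| <= window)."""
    idx = np.arange(num_frames)
    offset = idx[:, None] - idx[None, :]
    return np.abs(offset) <= window

def dswa_mask(num_frames, window, dilation):
    """Same number of keys as SWA, but spaced `dilation` frames apart."""
    idx = np.arange(num_frames)
    offset = idx[:, None] - idx[None, :]
    in_range = np.abs(offset) <= window * dilation
    on_grid = offset % dilation == 0   # NumPy modulo is non-negative for d > 0
    return in_range & on_grid
```

Each query attends to at most `2 * window + 1` frames in both variants, so the total cost grows linearly with the number of frames, while dilation widens the span covered by those keys.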

The GLA update rule is:

$\begin{aligned} q_t &= W_Q x_t, \quad k_t = W_K x_t, \quad v_t = W_V x_t, \quad \beta_t = \sigma(W_\beta x_t) \\ v_t^{\text{old}} &= S_{t-1} k_t, \quad v_t^{\text{new}} = \beta_t v_t + (1-\beta_t) v_t^{\text{old}} \\ S_t &= \alpha_t S_{t-1} + \beta_t (v_t - v_t^{\text{old}}) k_t^\top , \quad \alpha_t = \sigma(W_\alpha x_t) \\ o_t &= S_t q_t \end{aligned}$

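Here $S_t \in \mathbb{R}^{d_v \times d_k}$ is an associative memory state that is rewritten at every step. With $\alpha_t = 1$, the gated delta update is exactly one SGD step of size $\beta_t$ on the online regression loss $\frac{1}{2}\|S k_t - v_t\|_2^2$; the gate $\alpha_t$ additionally decays outdated associations.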
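The update rule above translates almost line by line into NumPy. The sketch below is an unbatched, single-head illustration; treating $\alpha_t$ and $\beta_t$ as scalar gates (so `w_alpha` and `w_beta` are vectors) is an assumption, as are the function names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of the rule above: returns (S_t, o_t)."""
    v_old = S @ k                                   # value currently stored under key k
    S_new = alpha * S + beta * np.outer(v - v_old, k)
    return S_new, S_new @ q

def run_memory(xs, W_q, W_k, W_v, w_alpha, w_beta):
    """Scan a sequence of inputs; the state size is fixed, so cost is linear in length."""
    S = np.zeros((W_v.shape[0], W_k.shape[0]))      # d_v x d_k
    outputs = []
    for x in xs:
        q, k, v = W_q @ x, W_k @ x, W_v @ x
        alpha = sigmoid(w_alpha @ x)                # assumed scalar forget gate
        beta = sigmoid(w_beta @ x)                  # assumed scalar write strength
        S, o = gated_delta_step(S, q, k, v, alpha, beta)
        outputs.append(o)
    return S, np.stack(outputs)
```

With $\alpha_t = \beta_t = 1$ and a unit-norm key, writing a new value under an existing key fully overwrites the old one, which is the "adaptive forgetting" behaviour in its simplest form.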

Theoretical Guarantees

Theorem 1 (Necessity of persistent state): The excess risk from using only a recent window $W_t^{(w)}$ vs. full history $H_t$ satisfies

$R_w^* - R_{\text{full}}^* = \mathbb{E}[\text{Var}(m_t | W_t^{(w)})]$

which is strictly positive if the optimal full-history predictor $m_t$ is not measurable with respect to the window. This proves local heuristics are structurally insufficient for long-horizon consistency.
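A toy case makes the identity in Theorem 1 tangible. The example below is invented for illustration and is not taken from the paper: the history consists of two independent fair bits in {-1, +1}, the window sees only the recent bit, and the optimal full-history predictor is assumed to be $m = a + 0.5\,b$, where $a$ is the old bit and $b$ the recent one.

```python
def excess_risk_of_window():
    """Exact E[Var(m | window)] for the two-bit toy history described above."""
    bits = (-1.0, 1.0)
    excess = 0.0
    for recent in bits:                      # the window observes only `recent`
        # Values of m = old + 0.5 * recent over the unseen old bit
        group = [old + 0.5 * recent for old in bits]
        mean = sum(group) / len(group)       # best window-only prediction
        var = sum((x - mean) ** 2 for x in group) / len(group)
        excess += 0.5 * var                  # each value of `recent` has probability 1/2
    return excess
```

Because the old bit is invisible to the window and carries unit variance, no window-only predictor can close the gap, however well it models the recent bit.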

Theorem 2 (Sufficiency of hybrid memory): Under a contractive gated delta update (factor $\rho < 1$) and bounded approximation errors, the long-horizon excess risk satisfies

$R_t(\hat{\mu}_t) - R_t^* \leq \left( L\varepsilon + \frac{L_G \bar{\xi}}{1-\rho} \right)^2$

as $t \to \infty$, where $\varepsilon$ is the local approximation error and $\bar{\xi}$ is the maximum one-step perturbation. The geometric damping $\sum_{k \geq 0} \rho^k = 1/(1-\rho)$ keeps the accumulated memory error bounded instead of letting it grow with $t$.
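The role of the contraction factor can be checked numerically. The sketch below assumes the memory error obeys the worst-case recursion $e_t = \rho\, e_{t-1} + \bar{\xi}$, which is one common reading of "contractive update with bounded one-step perturbation" and not necessarily the paper's exact proof device; the function names are hypothetical.

```python
def excess_risk_bound(L, eps, L_G, xi_bar, rho):
    """Right-hand side of the Theorem 2 bound."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("the bound requires a contraction factor 0 <= rho < 1")
    return (L * eps + L_G * xi_bar / (1.0 - rho)) ** 2

def accumulated_memory_error(xi_bar, rho, steps):
    """Iterate e_t = rho * e_{t-1} + xi_bar from e_0 = 0 (assumed worst case)."""
    e = 0.0
    for _ in range(steps):
        e = rho * e + xi_bar
    return e
```

For example, with `rho = 0.9` and `xi_bar = 0.01` the accumulated error approaches 0.1 and stays there no matter how many steps are taken, whereas with `rho = 1` it would grow linearly with the horizon.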

Native Pre-training Paradigm (CEDC)

The training pipeline has three stages, each dominated by a data layer:

Stage	Data Layer	Resolution	Max Frames	Objective
I – Physical Pretraining	Open-world videos	256P → 720P	1 → 241	Inject physical priors (gravity, object permanence, etc.)
II – Embodied Pretraining	Human demonstrations + robot data	720P	81–241	Learn behavioral semantics, task taxonomies
III – Joint World-Action	Robot trajectories (low-level actions)	720P	81	Align visual forecasting with action prediction

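A shape-aware timestep shift adapts the flow-matching noise schedule to varying video lengths and resolutions:

$\tilde{\sigma}_i = \frac{s \sigma_i^{(0)}}{1 + (s-1) \sigma_i^{(0)}}, \quad s = \exp(f(L)) \sqrt{F}$

where $\sigma_i^{(0)}$ is the $i$-th noise level of the base schedule, $L$ is the number of spatial tokens per frame, and $F$ is the number of frames. For $s > 1$ the shift moves every intermediate noise level upward while leaving the endpoints 0 and 1 fixed, so longer videos (larger $F$) are pushed toward higher noise levels.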
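The shift formula is simple enough to evaluate directly. In the sketch below, `f` is passed in as a callable because the summary does not specify it; the function names are likewise placeholders.

```python
import math

def shift_sigma(sigma, s):
    """Apply the shift: s * sigma / (1 + (s - 1) * sigma)."""
    return s * sigma / (1.0 + (s - 1.0) * sigma)

def shape_aware_shift(tokens_per_frame, num_frames, f):
    """s = exp(f(L)) * sqrt(F); `f` is unspecified in the summary, so it is a parameter."""
    return math.exp(f(tokens_per_frame)) * math.sqrt(num_frames)
```

With $s = 3$, a base noise level of 0.5 maps to 0.75, while 0 and 1 map to themselves; with $s = 1$ the schedule is unchanged.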

Inference Optimization

Timestep Distillation: Combines Distribution Matching Distillation (DMD) and Consistency Distillation to reduce diffusion steps from dozens to 4, preserving quality.
Hardware-aware optimization: Mixed-parallel inference (sequence + tensor parallelism), FP8 quantization, tiled GatedDeltaNet streaming, and weight-only INT4 text encoder quantization enable real-time performance on consumer GPUs.
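As a point of reference for the weight-only INT4 step, the sketch below shows a generic symmetric per-row quantizer. It is a textbook scheme chosen for illustration and is not claimed to be the scheme Kairos uses; values are stored in an `int8` container and the packing of two 4-bit values per byte is omitted.

```python
import numpy as np

def quantize_int4_per_row(W):
    """Symmetric per-row quantization of a 2-D weight matrix to the INT4 range [-8, 7]."""
    max_abs = np.max(np.abs(W), axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)   # avoid division by zero on all-zero rows
    q = np.clip(np.rint(W / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights; activations stay in higher precision (weight-only)."""
    return q.astype(scale.dtype) * scale
```

The reconstruction error of each weight is at most half of its row's scale, which is why rows with large outliers lose the most precision under this kind of scheme.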

Empirical Validation / Results

Embodied World Model Benchmarks

WorldModelBench (Robot subset) – Table 6:

Model	Params	Instruction Following	Physics Adherence	Total Score
Kairos	4B	2.36	4.96	9.30
Cosmos3-Nano*	16B	2.36	4.96	9.26
Lingbot*	28B	2.14	4.92	9.04

Kairos achieves perfect scores in Newtonian mechanics, fluid dynamics, and gravity.

DreamGen Bench – Table 7:

Model	Params	AVG_PA	AVG_IF	AVG_Score
Kairos	4B	0.538	0.698	0.618
Wan2.2*	14B	0.519	0.703	0.611
Cosmos-Predict2.5*	14B	0.495	0.478	0.487

PAI-Bench Robot – Table 8 (small-scale models <10B):

Model	Params	Domain Score	Overall Score
Kairos	4B	88.59	82.57
Wan2.2*	5B	80.17	78.63
GigaWorld-0	2B	85.83	80.87

World Action Model Benchmarks

RoboTwin 2.0 – Table 11:

Model	Type	Clean	Randomized	Average
Kairos	WAM	96.9	95.2	96.1
MotuBrain	WAM	95.8	96.1	96.0
G0.5	VLA	93.7	92.8	93.2

LIBERO-Plus – Table 12:

Model	Type	Average
Kairos	WAM	89.0
Kairos-joint	WAM	90.8
Being-H0.7	WAM	84.8

Ablation studies show that embodied human-centric pretraining yields a +6.0 gain, and joint video-action training adds +23.2.

General World Model Benchmarks

VideoPhy – Table 17:

Model	Average Score
Kairos (4B)	45.55
Cosmos-Predict2.5 (14B)	45.16
Wan2.2 (5B)	38.85

Inference Efficiency

Table 5 (720p, 5s, TI2V):

Model	Memory (GB)	Complexity (PFlops)	1 GPU (s)	4 GPU (s)
Kairos-4B	23.5	2.3	43	9
Wan2.2-5B	23.4	16.6	201	85
Cosmos-Predict2.5-14B	70.2	156.5	2526	687

Kairos scales linearly with video length, whereas the cost of the competing models grows superlinearly (quadratically for full self-attention).
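The speedups implied by Table 5 follow from simple ratios of the reported latencies. The snippet below copies the latency columns from the table; the dictionary and function names are chosen for the example.

```python
# Latencies in seconds from Table 5 (720p, 5 s, TI2V): (1 GPU, 4 GPUs).
LATENCY_S = {
    "Kairos-4B": (43, 9),
    "Wan2.2-5B": (201, 85),
    "Cosmos-Predict2.5-14B": (2526, 687),
}

def speedups(reference="Kairos-4B", table=LATENCY_S):
    """Return {model: (1-GPU speedup, 4-GPU speedup)} of `reference` over each other model."""
    ref = table[reference]
    return {
        name: tuple(t / r for t, r in zip(times, ref))
        for name, times in table.items()
        if name != reference
    }
```

On these numbers Kairos is roughly 4.7× (1 GPU) and 9.4× (4 GPUs) faster than Wan2.2-5B, and roughly 59× and 76× faster than Cosmos-Predict2.5-14B. The wider 2–85× range quoted in the overview presumably covers settings beyond this one table.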

Theoretical and Practical Implications

Theoretical: The paper proves that purely local attention is fundamentally insufficient for long-horizon world modeling (Theorem 1). The hybrid design (SWA+DSWA+GLA) is shown to be approximately sufficient, with error bounds that depend only on local approximation quality and the contractive property of GLA (Theorem 2). This provides a rigorous justification for separating local and global memory.
Practical: Kairos demonstrates that a well-designed native pre-training curriculum can replace decoupled fine-tuning pipelines, enabling efficient knowledge transfer from internet videos to robot control. The deployment-aware co-design makes high-fidelity world simulation possible on consumer hardware, opening the door to real-time self-evolution loops and democratized robotics research.
Impact on Physical AI: The unified understanding-generation-prediction architecture provides a substrate for future self-evolving agents that can autonomously collect data, refine policies, and adapt to new embodiments without manual re-engineering.

Conclusion

Kairos introduces a native world model stack that jointly addresses how to learn, maintain, and run the world for Physical AI. Key contributions include:

A Cross-Embodiment Data Curriculum that progressively aligns open-world video, human behavior, and robot data.
A Hybrid Linear Temporal Memory (SWA, DSWA, GLA) with theoretical error bounds guaranteeing long-horizon consistency.
A Deployment-Aware System Co-Design enabling real-time inference on consumer-grade hardware.

Extensive evaluations show state-of-the-art results on embodied and general world model benchmarks while running 2–85× faster than comparable models.

Future directions: Autonomous self-evolution via recursive imagination and scaling to a generalist embodied substrate supporting diverse hardware (humanoids, dexterous hands) with zero-shot generalization.

The paper and its code are available at GitHub, Hugging Face, and ModelScope.