Summary (Overview)
- Native World Model Stack: Kairos introduces a unified framework for Physical AI that jointly addresses learning, maintaining, and running world models, moving beyond fragmented post-training approaches.
- Cross-Embodiment Data Curriculum (CEDC): A progressive pre-training paradigm that organizes data into three layers—open-world videos (physical priors), human behavior (task semantics), and robot interactions (embodied control)—to build actionable world knowledge from scratch.
- Hybrid Linear Temporal Memory: A Mixture-of-Transformers (MoT) architecture combining Sliding-Window Attention (SWA), Dilated SWA (DSWA), and Gated Linear Attention (GLA) for efficient long-horizon generation. Formal theoretical bounds prove this factorization strictly limits error accumulation.
- Deployment-Aware System Co-Design: Co-designed for low-latency, memory-efficient inference on both server and consumer-grade hardware, enabling real-time closed-loop operation.
- State-of-the-Art Performance: Achieves top scores on embodied benchmarks (LIBERO-plus, RoboTwin 2.0, WorldModelBench, DreamGen) while being 2–85× faster than comparable models.
Introduction and Theoretical Foundation
World models are evolving from passive video generators into foundational infrastructure for Physical AI. The paper identifies four tightly coupled challenges:
- Fragmented learning: Open-world videos lack action grounding; human data lacks robot embodiment; robot data is scarce and narrow.
- Long-horizon state maintenance: Local continuation heuristics (e.g., sliding window) cannot guarantee global state consistency over extended durations—a structural limitation proven mathematically.
- Embodiment gap: Models may predict plausible futures but fail to learn how actions change those futures in a controllable way.
- Deployment constraints: Real-time observation–action–feedback loops require low latency and memory efficiency, often neglected in offline evaluations.
Kairos tackles these jointly via a native pre-training paradigm, a unified architecture with hybrid temporal memory, and deployment-aware system co-design.
Methodology
Native Architecture with Unified Understanding, Generation, and Prediction
Kairos uses a Mixture-of-Transformers (MoT) backbone comprising:
- World Understanding: A Vision-Language Model (VLM, e.g., Qwen series) extracts high-level semantic representations from heterogeneous inputs.
- World Generation: A Diffusion Transformer (DiT) conditioned on text/image generates future video frames. It operates in a latent space via a high-compression video VAE.
- World Prediction: An Action DiT jointly models future video tokens and robot action tokens using a unified attention masking scheme.
The model is trained with flow matching, in which the network input is a linear interpolation between the clean latent video and noise.
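For reference, a flow-matching objective of this kind is commonly written in the rectified-flow form below. The symbols ($x_0$ for the clean latent video, $\epsilon$ for Gaussian noise, $t$ for the timestep, $c$ for the conditioning, $v_\theta$ for the network) and the sign convention are assumptions chosen for illustration; the paper's own notation may differ.

```latex
\begin{align}
x_t &= (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),\; t \in [0, 1] \\
\mathcal{L}_{\mathrm{FM}} &= \mathbb{E}_{x_0,\,\epsilon,\,t}\,
  \bigl\| v_\theta(x_t, t, c) - (\epsilon - x_0) \bigr\|_2^2
\end{align}
```

Under this convention $t = 0$ is clean data and $t = 1$ is pure noise, so the regression target $\epsilon - x_0$ is the constant velocity along the straight line between them.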
Hybrid Linear Temporal Attention
The DiT backbone uses three complementary attention mechanisms to achieve linear complexity in the temporal dimension:
- Short-term: Sliding Window Attention (SWA) – restricts attention to a local temporal window (size proportional to spatial tokens per frame) for modeling local motion.
- Mid-term: Dilated Sliding Window Attention (DSWA) – uses dilation to extend the receptive field without quadratic cost.
- Long-term: Gated Linear Attention (GLA) – implements a gated delta update (GatedDeltaNet) for global causal memory with linear complexity.
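To make the difference between the two windowed mechanisms concrete, here is a minimal sketch of frame-level attention masks. It assumes causal masking, a window of 4 frames, and a dilation of 3; these values and the frame-level granularity are illustrative assumptions, not the paper's settings (the actual model attends over spatial tokens within each frame).

```python
def swa_mask(num_frames, window):
    """Causal sliding-window mask: frame i attends to frames i-window+1 .. i."""
    return [[0 <= i - j < window for j in range(num_frames)]
            for i in range(num_frames)]


def dswa_mask(num_frames, window, dilation):
    """Causal dilated mask: frame i attends to i, i-d, i-2d, ... (`window` taps)."""
    return [[0 <= i - j < window * dilation and (i - j) % dilation == 0
             for j in range(num_frames)]
            for i in range(num_frames)]


def attended(mask, i):
    """Indices of the frames that query frame i is allowed to attend to."""
    return [j for j, keep in enumerate(mask[i]) if keep]


swa = swa_mask(num_frames=16, window=4)
dswa = dswa_mask(num_frames=16, window=4, dilation=3)
print(attended(swa, 15))   # [12, 13, 14, 15]
print(attended(dswa, 15))  # [6, 9, 12, 15]
```

Both masks keep at most four keys per query frame, so the cost stays linear in the number of frames, while the dilated variant reaches further back in time for the same budget.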
The GLA update rule is a gated delta update applied to a learnable associative memory, held as a matrix-valued recurrent state. The gated delta update performs a single SGD step on an online regression loss, adaptively forgetting outdated associations.
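The following sketch illustrates a gated delta update of the kind described above. One common way to write it is $S_t = \alpha_t S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$, where the symbol names ($S$ for the memory, $k$ and $v$ for key and value, $\alpha$ for the decay gate, $\beta$ for the step size) are assumptions for illustration and may not match the paper's notation. The code uses plain Python lists so that it runs without dependencies.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta update of the memory S (d_v rows, d_k columns).

    Equivalent to decaying S by alpha, then taking one SGD step of size beta
    on the online regression loss 0.5 * ||S k - v||^2.
    """
    d_v, d_k = len(S), len(k)
    # Gate: decay the old memory so that stale associations fade.
    decayed = [[alpha * S[i][j] for j in range(d_k)] for i in range(d_v)]
    # What the decayed memory currently predicts for this key.
    pred = [sum(decayed[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Delta rule: correct the memory by the prediction error along k.
    return [[decayed[i][j] - beta * (pred[i] - v[i]) * k[j] for j in range(d_k)]
            for i in range(d_v)]


def read(S, q):
    """Query the memory: returns S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


S = [[0.0, 0.0], [0.0, 0.0]]
S = gated_delta_step(S, k=[1.0, 0.0], v=[3.0, 4.0], alpha=1.0, beta=1.0)
print(read(S, [1.0, 0.0]))  # [3.0, 4.0]  (association stored exactly)
S = gated_delta_step(S, k=[0.0, 1.0], v=[5.0, 6.0], alpha=0.5, beta=1.0)
print(read(S, [1.0, 0.0]))  # [1.5, 2.0]  (older association decayed by the gate)
```

The memory has a fixed size regardless of sequence length, which is what gives the long-term branch its linear cost.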
Theoretical Guarantees
Theorem 1 (Necessity of persistent state): The excess risk from using only a recent window instead of the full history is non-negative, and it is strictly positive whenever the optimal full-history predictor is not measurable with respect to the window. This proves that local heuristics are structurally insufficient for long-horizon consistency.
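A toy example can make the measurability condition tangible. The sketch below is not the paper's construction: it assumes squared loss and a two-bit history in which the target depends only on the bit that falls outside the window.

```python
from itertools import product


def risk(predict, samples):
    """Mean squared error of `predict` over equally likely (history, target) pairs."""
    return sum((predict(h) - y) ** 2 for h, y in samples) / len(samples)


# History = (distant_bit, recent_bit); the target depends only on the distant bit.
samples = [((far, near), float(far)) for far, near in product([0, 1], repeat=2)]

# The full-history predictor can read the distant bit directly.
full_history_risk = risk(lambda h: float(h[0]), samples)
# A window predictor sees only `near`, which is independent of the target,
# so its best squared-loss guess is the conditional mean 0.5.
window_risk = risk(lambda h: 0.5, samples)

excess_risk = window_risk - full_history_risk
print(excess_risk)  # 0.25
```

No function of the recent bit alone can close this gap, which is the sense in which the limitation is structural rather than a matter of model capacity.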
Theorem 2 (Sufficiency of hybrid memory): Under a contractive gated delta update (contraction factor strictly below 1) and bounded approximation errors, the long-horizon excess risk stays bounded as the horizon grows, with a bound determined by the local approximation error and the maximum one-step perturbation. The geometric damping ensures that error does not accumulate.
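The geometric damping argument can be illustrated with a generic contraction recursion. This is a sketch of the mechanism only, not the theorem's exact statement: it assumes the error obeys $e_t \le \rho\, e_{t-1} + \delta$ with hypothetical names `rho` (contraction factor) and `delta` (one-step perturbation), which gives the limit $\delta / (1 - \rho)$.

```python
def error_trajectory(rho, delta, e0, steps):
    """Iterate the worst case e_t = rho * e_{t-1} + delta for `steps` steps."""
    errors = [e0]
    for _ in range(steps):
        errors.append(rho * errors[-1] + delta)
    return errors


# Contractive update (rho < 1): the error settles at delta / (1 - rho).
damped = error_trajectory(rho=0.9, delta=0.01, e0=1.0, steps=500)
limit = 0.01 / (1.0 - 0.9)
print(damped[-1], limit)

# No contraction (rho = 1): the same perturbation accumulates linearly.
undamped = error_trajectory(rho=1.0, delta=0.01, e0=0.0, steps=500)
print(undamped[-1])
```

With `rho` below 1 the initial error is forgotten geometrically and the steady-state error is independent of the horizon; with `rho` equal to 1 it grows with every step.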
Native Pre-training Paradigm (CEDC)
The training pipeline has three stages, each dominated by a data layer:
| Stage | Data Layer | Resolution | Max Frames | Objective |
|---|---|---|---|---|
| I – Physical Pretraining | Open-world videos | 256P → 720P | 1 → 241 | Inject physical priors (gravity, object permanence, etc.) |
| II – Embodied Pretraining | Human demonstrations + robot data | 720P | 81–241 | Learn behavioral semantics, task taxonomies |
| III – Joint World-Action | Robot trajectories (low-level actions) | 720P | 81 | Align visual forecasting with action prediction |
A shape-aware timestep-shift scheduler adapts the flow-matching noise schedule to varying video lengths and resolutions. The shift is a function of the number of spatial tokens per frame and the number of frames.
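As an illustration of how such a shift can work, the sketch below uses the resolution-dependent timestep shift popularized by Stable Diffusion 3, with a shift factor that grows with the total token count. The square-root rule, the `base_tokens` reference value, and the function names are all assumptions; the paper's actual formula may differ.

```python
import math


def shape_aware_shift(tokens_per_frame, num_frames, base_tokens=1024):
    """Hypothetical shift factor that grows with the total number of latent tokens."""
    return math.sqrt(tokens_per_frame * num_frames / base_tokens)


def shift_timestep(t, shift):
    """Map t in [0, 1] to a shifted timestep (t = 1 is pure noise).

    shift > 1 moves timesteps toward the high-noise end; 0 and 1 stay fixed.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)


small = shape_aware_shift(tokens_per_frame=256, num_frames=4)    # 1.0
large = shape_aware_shift(tokens_per_frame=1024, num_frames=16)  # 4.0
print(shift_timestep(0.5, small))  # 0.5
print(shift_timestep(0.5, large))  # 0.8
```

The intuition is that larger videos carry more redundant signal, so more noise is needed to destroy it, and the schedule should therefore spend more of its steps at high noise levels.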
Inference Optimization
- Timestep Distillation: Combines Distribution Matching Distillation (DMD) and Consistency Distillation to reduce diffusion steps from dozens to 4, preserving quality.
- Hardware-aware optimization: Mixed-parallel inference (sequence + tensor parallelism), FP8 quantization, tiled GatedDeltaNet streaming, and weight-only INT4 text encoder quantization enable real-time performance on consumer GPUs.
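To show what a 4-step sampling budget means in practice, here is a minimal Euler sampler for a flow model. It does not implement DMD or consistency distillation; it only sketches the sampling loop that a distilled model would run. The toy velocity field is a stand-in for a trained network, and it assumes the convention that t = 1 is noise and t = 0 is data.

```python
def euler_sample(velocity, x_noise, num_steps=4):
    """Integrate dx/dt = velocity(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x, dt = list(x_noise), 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


# Toy velocity field: the exact straight-line velocity (noise - data) for one sample.
data, noise = [1.0, -2.0, 0.5], [0.25, 0.75, -1.5]


def toy_velocity(x, t):
    return [n - d for n, d in zip(noise, data)]


sample = euler_sample(toy_velocity, noise, num_steps=4)
print(sample)  # [1.0, -2.0, 0.5]
```

When trajectories are perfectly straight, as in this toy case, a handful of Euler steps recovers the data exactly; distillation aims to make a real model behave closely enough to this that few steps suffice.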
Empirical Validation / Results
Embodied World Model Benchmarks
WorldModelBench (Robot subset) – Table 6:
| Model | Params | Instruction Following | Physics Adherence | Total Score |
|---|---|---|---|---|
| Kairos | 4B | 2.36 | 4.96 | 9.30 |
| Cosmos3-Nano* | 16B | 2.36 | 4.96 | 9.26 |
| Lingbot* | 28B | 2.14 | 4.92 | 9.04 |
Kairos achieves perfect scores in Newtonian mechanics, fluid dynamics, and gravity.
DreamGen Bench – Table 7:
| Model | Params | AVG_PA | AVG_IF | AVG_Score |
|---|---|---|---|---|
| Kairos | 4B | 0.538 | 0.698 | 0.618 |
| Wan2.2* | 14B | 0.519 | 0.703 | 0.611 |
| Cosmos-Predict2.5* | 14B | 0.495 | 0.478 | 0.487 |
PAI-Bench Robot – Table 8 (small-scale models <10B):
| Model | Params | Domain Score | Overall Score |
|---|---|---|---|
| Kairos | 4B | 88.59 | 82.57 |
| Wan2.2* | 5B | 80.17 | 78.63 |
| GigaWorld-0 | 2B | 85.83 | 80.87 |
World Action Model Benchmarks
RoboTwin 2.0 – Table 11:
| Model | Type | Clean | Randomized | Average |
|---|---|---|---|---|
| Kairos | WAM | 96.9 | 95.2 | 96.1 |
| MotuBrain | WAM | 95.8 | 96.1 | 96.0 |
| G0.5 | VLA | 93.7 | 92.8 | 93.2 |
LIBERO-Plus – Table 12:
| Model | Type | Average |
|---|---|---|
| Kairos | WAM | 89.0 |
| Kairos-joint | WAM | 90.8 |
| Being-H0.7 | WAM | 84.8 |
Ablation studies show that embodied human-centric pretraining yields a +6.0 gain, and joint video-action training adds +23.2.
General World Model Benchmarks
VideoPhy – Table 17:
| Model | Average Score |
|---|---|
| Kairos (4B) | 45.55 |
| Cosmos-Predict2.5 (14B) | 45.16 |
| Wan2.2 (5B) | 38.85 |
Inference Efficiency
Table 5 (720p, 5s, TI2V):
| Model | Memory (GB) | Complexity (PFlops) | 1 GPU (s) | 4 GPU (s) |
|---|---|---|---|---|
| Kairos-4B | 23.5 | 2.3 | 43 | 9 |
| Wan2.2-5B | 23.4 | 16.6 | 201 | 85 |
| Cosmos-Predict2.5-14B | 70.2 | 156.5 | 2526 | 687 |
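The speedup ratios implied by the table above can be computed directly from its latency columns. The short sketch below uses only the reported numbers; the dictionary and function names are illustrative.

```python
# Wall-clock seconds from the table above: (1 GPU, 4 GPUs).
latency_s = {
    "Kairos-4B": (43, 9),
    "Wan2.2-5B": (201, 85),
    "Cosmos-Predict2.5-14B": (2526, 687),
}


def speedups(baseline="Kairos-4B"):
    """Ratio of each model's latency to the baseline's, per GPU configuration."""
    base = latency_s[baseline]
    return {
        name: tuple(round(t / b, 1) for t, b in zip(times, base))
        for name, times in latency_s.items()
        if name != baseline
    }


print(speedups())
# {'Wan2.2-5B': (4.7, 9.4), 'Cosmos-Predict2.5-14B': (58.7, 76.3)}
```

In this setting Kairos is roughly 5 to 9 times faster than Wan2.2-5B and roughly 59 to 76 times faster than Cosmos-Predict2.5-14B.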
Kairos scales linearly with video length, while competitors show superlinear growth, as expected from the quadratic cost of full attention.
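A back-of-the-envelope count of query-key pairs shows why windowed attention scales so differently from full attention. The sketch below counts pairs only and ignores everything else in the network; the frame counts, tokens per frame, and window size are hypothetical.

```python
def full_attention_pairs(num_frames, tokens_per_frame):
    """Every token attends to every token: quadratic in sequence length."""
    n = num_frames * tokens_per_frame
    return n * n


def windowed_attention_pairs(num_frames, tokens_per_frame, window_frames):
    """Each query token attends to at most `window_frames` frames of keys."""
    keys_per_query = min(window_frames, num_frames) * tokens_per_frame
    return num_frames * tokens_per_frame * keys_per_query


for frames in (21, 42, 84):
    print(frames,
          full_attention_pairs(frames, 100),
          windowed_attention_pairs(frames, 100, window_frames=8))
```

Doubling the number of frames quadruples the full-attention count but only doubles the windowed count, so the gap widens as videos get longer.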
Theoretical and Practical Implications
- Theoretical: The paper proves that purely local attention is fundamentally insufficient for long-horizon world modeling (Theorem 1). The hybrid design (SWA+DSWA+GLA) is shown to be approximately sufficient, with error bounds that depend only on local approximation quality and the contractive property of GLA (Theorem 2). This provides a rigorous justification for separating local and global memory.
- Practical: Kairos demonstrates that a well-designed native pre-training curriculum can replace decoupled fine-tuning pipelines, enabling efficient knowledge transfer from internet videos to robot control. The deployment-aware co-design makes high-fidelity world simulation possible on consumer hardware, opening the door to real-time self-evolution loops and democratized robotics research.
- Impact on Physical AI: The unified understanding-generation-prediction architecture provides a substrate for future self-evolving agents that can autonomously collect data, refine policies, and adapt to new embodiments without manual re-engineering.
Conclusion
Kairos introduces a native world model stack that jointly addresses how to learn, maintain, and run the world for Physical AI. Key contributions include:
- A Cross-Embodiment Data Curriculum that progressively aligns open-world video, human behavior, and robot data.
- A Hybrid Linear Temporal Memory (SWA, DSWA, GLA) with theoretical error bounds guaranteeing long-horizon consistency.
- A Deployment-Aware System Co-Design enabling real-time inference on consumer-grade hardware.
Extensive evaluations show state-of-the-art results on embodied and general world model benchmarks while being 2–85× faster than comparable models.
Future directions: Autonomous self-evolution via recursive imagination and scaling to a generalist embodied substrate supporting diverse hardware (humanoids, dexterous hands) with zero-shot generalization.
The paper and its code are available on GitHub, Hugging Face, and ModelScope.