Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Summary (Overview)

  • Unified Framework: Presents Matrix-Game 3.0, a co-designed framework integrating an industrial-scale data engine, a memory-augmented Diffusion Transformer (DiT) base model, and a multi-segment distillation pipeline to achieve long-horizon consistency, high-resolution generation, and real-time inference simultaneously.
  • Real-Time Performance: Achieves up to 40 FPS real-time video generation at 720p resolution with a 5B parameter model, enabled by INT8 quantization, VAE pruning, GPU-based memory retrieval, and other system-level optimizations.
  • Long-Horizon Memory: Introduces a camera-aware memory mechanism with unified self-attention, enabling the model to retrieve and utilize relevant past observations to maintain spatiotemporal consistency over minute-long sequences, demonstrated through scene revisitation tasks.
  • Error-Aware Training: Enhances the base model's robustness by modeling prediction residuals and re-injecting imperfect generated frames during training, allowing the model to learn self-correction and better align with autoregressive inference.
  • Scalability: Shows that scaling the model up to a 28B parameter Mixture-of-Experts (MoE) architecture further improves generation quality, dynamic behavior, and generalization across diverse AAA-game and synthetic environments.

Introduction and Theoretical Foundation

Building interactive world models that can simulate environment dynamics and predict future observations under user actions has broad applications in robotics, entertainment, and extended reality (XR). While diffusion-based video models show great potential as world simulators, a critical prerequisite for practical deployment is real-time generation with long-horizon spatiotemporal consistency—the ability to generate content continuously at interactive speeds while preserving semantic and geometric coherence over extended sequences.

Current models face a trade-off: powerful short-video diffusion models lack long-term consistency and real-time capability, while exploratory world models often sacrifice one for the other. Prior works like Matrix-Game 2.0 achieved real-time streaming via causal autoregressive diffusion but lacked memory for minute-long consistency. Conversely, models like Lingbot-World improved consistency by scaling context length but struggled with real-time deployment.

Matrix-Game 3.0 addresses this gap by proposing a coordinated solution across three tightly coupled factors:

  1. Data: An industrial-scale engine to produce large-scale, precisely annotated video data.
  2. Modeling: A bidirectional backbone with camera-aware memory and error-aware training to achieve long-horizon consistency and robustness.
  3. Deployment: A multi-segment distillation strategy combined with acceleration techniques to enable real-time inference.

The goal is to provide a practical, open pathway toward industrial-scale deployable world models.

Methodology

The Matrix-Game 3.0 framework consists of four key components.

3.1 Error-Aware Interactive Base Model

Built upon a unified bidirectional Diffusion Transformer (DiT) architecture for both teacher and student to avoid instability from architectural heterogeneity. The model is designed to be robust to imperfect contexts (self-generated history) to reduce exposure bias.

  • Action Control: Discrete keyboard actions are injected via a Cross-Attention module; continuous mouse-control signals are injected via Self-Attention.
  • Error Buffer Training: Inspired by SVI, an error buffer $E$ is maintained. Residuals between model predictions $\hat{x}_i$ and ground truth $x_i$ are collected: $\delta = \hat{x}_i - x_i$. These residuals are then uniformly sampled ($\delta \sim \text{Uniform}(E)$) and used to perturb the history latent frames during training: $\tilde{x}_i = x_i + \gamma\delta$, where $\gamma$ controls the perturbation magnitude.
  • Training Objective: The flow-matching objective is applied only to the current frames to be predicted: $\mathcal{L} = \mathbb{E}_{x,t,\epsilon,\delta} \left[ \left\| \epsilon - \left( x_{k+1:N} - v_{\theta}\left( x_{k+1:N}^t, t \mid \tilde{x}_{1:k}, c \right) \right) \right\|_2^2 \right]$, where $c$ denotes the action condition.
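As a concrete illustration of the error-buffer mechanism described above, the following is a minimal sketch (class name, buffer capacity, and list-based latents are illustrative assumptions, not the paper's implementation):

```python
import random

class ErrorBuffer:
    """Sketch of the error buffer E: residuals delta = x_hat - x are stored
    during training and later sampled uniformly to perturb history latents,
    exposing the model to imperfect (self-generated-like) contexts."""

    def __init__(self, capacity=1024):
        self.buffer = []          # stores residual vectors delta
        self.capacity = capacity

    def push(self, pred, target):
        # delta = x_hat - x, kept element-wise for later reuse
        delta = [p - t for p, t in zip(pred, target)]
        self.buffer.append(delta)
        if len(self.buffer) > self.capacity:
            self.buffer.pop(0)    # drop oldest residual when full

    def perturb(self, history, gamma=0.1):
        # x_tilde = x + gamma * delta, with delta ~ Uniform(E)
        if not self.buffer:
            return list(history)
        delta = random.choice(self.buffer)
        return [x + gamma * d for x, d in zip(history, delta)]
```

In training, `push` would be called with each segment's prediction, and `perturb` applied to the clean history latents before they condition the next denoising step.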

3.2 Long-Horizon Memory Mechanism

Enhances the base model with a camera-aware memory system for long-range consistency.

  • Unified Self-Attention: Instead of a separate memory branch, retrieved memory latents $m_{1:r}$, past frame latents $x_{1:k}$, and noisy current latents $x_{k+1:N}^t$ are concatenated and processed jointly by the same DiT in a unified attention space.
  • Camera-Aware Retrieval: Memory frames are retrieved based on camera pose and field-of-view overlap to ensure view-relevance. The retrieval score for a query view $i$ and candidate $j$ is based on frustum overlap: $s_{\text{exact}}(i, j) = \frac{\text{Vol}(F(E_i) \cap F(E_j))}{\text{Vol}(F(E_i))}$, where $F(E)$ denotes the viewing frustum of camera pose $E$.
  • Geometry Conditioning: Relative camera geometry between the current target and selected memory is encoded using Plücker-style cues.
  • Memory Error Injection: To bridge the train-inference gap, the same error buffer mechanism is applied to perturb both history and memory latents: $\tilde{x}_{1:k} = x_{1:k} + \gamma_h \delta, \quad \tilde{m}_{1:r} = m_{1:r} + \gamma_m \delta$.
  • Perturbed Rotary Positional Encoding (RoPE): To mitigate periodic aliasing in long sequences, a head-wise perturbed RoPE base is introduced: $\hat{\theta}_h = \theta_{\text{base}} \left( 1 + \sigma_\theta \epsilon_h \right)$, where $\epsilon_h$ is a head-dependent perturbation coefficient and $\sigma_\theta$ controls the magnitude.
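The head-wise RoPE perturbation can be sketched as follows. This is a minimal illustration under assumed choices (the standard $\theta_{\text{base}} = 10000$, a Gaussian $\epsilon_h$, and the usual RoPE inverse-frequency schedule); the paper does not specify these details:

```python
import random

def perturbed_rope_bases(theta_base=10000.0, num_heads=8, sigma=0.02, seed=0):
    """Give each attention head h its own base theta_hat_h =
    theta_base * (1 + sigma * eps_h), so the heads no longer share one
    rotation period -- the source of periodic aliasing at long range."""
    rng = random.Random(seed)
    return [theta_base * (1.0 + sigma * rng.gauss(0.0, 1.0))
            for _ in range(num_heads)]

def rope_freqs(theta, dim):
    """Standard RoPE inverse frequencies for one head of dimension `dim`."""
    return [theta ** (-2.0 * i / dim) for i in range(dim // 2)]
```

Each head then builds its rotary embedding from its own `rope_freqs(theta_hat_h, head_dim)` instead of a shared table.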

3.3 Training-Inference Aligned Few-step Distillation

To achieve real-time few-step generation, a Distribution Matching Distillation (DMD) strategy is employed with a multi-segment rollout to align training and inference.

  • Multi-Segment Inference: The bidirectional student performs autoregressive rollouts over multiple segments. The past frames for segment $i$ are taken from the tail of segment $i-1$, and memory is retrieved from an online-updated pool.
  • DMD Objective: Minimizes the reverse KL divergence between the student's generated distribution $p_{\text{gen}}$ and the target data distribution $p_{\text{data}}$: $\nabla_\theta \mathcal{L}_{\text{DMD}} \approx -\mathbb{E}_t \left[ \int \left( s_{\text{data}}(x_t^{\text{current}}, t, x^{\text{past}}, c, M) - s_{\text{gen},\xi}(x_t^{\text{current}}, t, x^{\text{past}}, c, M) \right) \nabla_\theta x_t \, d\epsilon \right]$, where $M$ denotes memory and $s_{\text{data}}$, $s_{\text{gen},\xi}$ are the scores of the data and generated distributions.
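The multi-segment rollout structure can be sketched in a few lines. The generator callable and the tail length are illustrative placeholders for the few-step student and its conditioning window:

```python
def multi_segment_rollout(generate_segment, first_context, num_segments, tail=4):
    """Autoregressive inference over segments: segment i conditions on the
    tail frames of segment i-1, while a memory pool is updated online with
    every finished segment. `generate_segment(context, memory)` stands in
    for the distilled few-step student."""
    memory_pool = []
    context = list(first_context)
    video = []
    for _ in range(num_segments):
        segment = generate_segment(context, memory_pool)
        video.extend(segment)
        memory_pool.extend(segment)   # online memory update
        context = segment[-tail:]     # past frames = tail of previous segment
    return video
```

Running this same loop during distillation training is what aligns the student's training distribution with its inference-time conditioning.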

3.4 Real-Time Inference Acceleration

Several techniques are combined to achieve 40 FPS at 720p with a 5B model.

  • DiT INT8 Quantization: Applied to attention projection layers.
  • VAE Pruning: A lightweight VAE decoder (MG-LightVAE) is trained with 50% and 75% pruning ratios, achieving significant speedups.
  • GPU-based Memory Retrieval: Uses a sampling-based approximation for frustum overlap to avoid expensive 3D intersection computations on the CPU: $s_{\text{approx}}(i, j) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}^{(j)}_n$, where $\mathbb{1}^{(j)}_n$ indicates whether the $n$-th point sampled from the query frustum falls inside candidate frustum $j$.
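The sampling-based score amounts to Monte Carlo estimation of the exact volume ratio. A minimal CPU sketch, assuming point-in-frustum predicates and a bounding region for rejection sampling (a real implementation would batch the point tests on GPU):

```python
import random

def approx_frustum_overlap(in_frustum_i, in_frustum_j, bounds, n=1024, seed=0):
    """Estimate s_approx(i, j): sample n points inside query frustum i (by
    rejection from an axis-aligned bounding box) and return the fraction
    that also lies in candidate frustum j. Predicates and bounds are
    illustrative stand-ins for real camera frustum geometry."""
    rng = random.Random(seed)
    lo, hi = bounds
    hits_i, hits_both, trials = 0, 0, 0
    while hits_i < n and trials < 100 * n:   # guard against tiny frustums
        trials += 1
        p = (rng.uniform(lo, hi), rng.uniform(lo, hi), rng.uniform(lo, hi))
        if in_frustum_i(p):
            hits_i += 1
            if in_frustum_j(p):
                hits_both += 1
    return hits_both / max(hits_i, 1)
```

Because each point test is independent, the inner loop maps directly onto a single batched GPU kernel, which is what removes the CPU retrieval bottleneck reported in the ablation.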

3.5 Large Model Scaling

A 28B MoE model is trained with progressive scaling of resolution and clip length. The training is decoupled: high-noise models are trained with action-accurate data for control, while low-noise models are trained with internet video for visual quality. Separate high-noise models are trained for first-person and third-person views to specialize in their respective dynamics.

Empirical Validation / Results

Base Model Performance

The interactive base model exhibits basic controllability with stable backgrounds and camera-consistent zoom relationships (Figure 8).

Memory-Augmented Scene Revisitation

The model successfully recovers previously observed scene structures and fine-grained details when the camera revisits earlier viewpoints, demonstrating effective long-range memory utilization (Figure 9).

Large Model (28B) Generation

The scaled-up model shows strong temporal consistency and vivid motion dynamics across diverse AAA-game and synthetic environments (Figure 10).

Distilled Model Performance

The distilled model effectively inherits the memory capability of the base model, faithfully reproducing previously seen content and generating rich new scenes without noticeable drift (Figure 11).

Real-Time Inference Ablation

The combined acceleration techniques are critical for achieving high FPS.

Table 1: Ablation on major acceleration components with 75% VAE pruning.

| Configuration | FPS ↑ | FPS Drop |
|---|---|---|
| Full | ~40 | — |
| - INT8 quantization | 27.38 | 12.62 |
| - MG-LightVAE | 25.79 | 14.21 |
| - GPU retrieval | 6.60 | 33.40 |

GPU retrieval is the most critical component. H-series GPUs consistently deliver higher throughput than A-series GPUs.

VAE Pruning Efficiency

MG-LightVAE provides a favorable trade-off between reconstruction quality and speed.

Table 2: Reconstruction quality and efficiency comparison.

| Model | PSNR ↑ | SSIM ↑ | Full(s) ↓ | Dec.(s) ↓ |
|---|---|---|---|---|
| Wan2.2 VAE | 33.79 | 0.99 | 0.99 | 0.76 |
| MG-LightVAE (50% pruned) | 31.84 | 0.99 | 0.52 | 0.30 |
| MG-LightVAE (75% pruned) | 31.14 | 0.99 | 0.35 | 0.13 |

The 50% pruned version maintains strong quality with a ~2x speedup in decoder time.

Theoretical and Practical Implications

  • Practical World Modeling: Matrix-Game 3.0 demonstrates that simultaneously achieving long-horizon memory, high-resolution fidelity, and real-time interaction in a unified framework is feasible, providing a concrete blueprint for industrial-scale deployable world models.
  • Co-design Philosophy: The work underscores the importance of co-designing data, modeling, and deployment stacks, as advances in one area (e.g., memory mechanisms) can be negated without corresponding support in others (e.g., inference acceleration).
  • Error-Aware Training as a General Principle: The method of modeling prediction residuals and re-injecting imperfect frames during training presents a generalizable strategy to mitigate exposure bias and error accumulation in autoregressive generative models.
  • Unified Attention for Memory: The design of placing memory, history, and current frames in a unified self-attention space, as opposed to a separate cross-attention branch, proves more stable and efficient for streaming generation, offering a new architectural paradigm for memory-augmented models.
  • Open-Source Contribution: As an open technical report, it provides detailed recipes and ablation studies that help isolate key design choices, advancing reproducibility and research in the field of interactive world models.

Conclusion

Matrix-Game 3.0 presents a comprehensive solution for interactive world modeling, achieving up to 40 FPS real-time generation at 720p with a 5B model while maintaining stable memory consistency over minute-long sequences. Key innovations include an industrial-scale data engine, an error-aware and memory-augmented base model, a training-inference aligned distillation pipeline, and system-level acceleration techniques. Scaling to a 28B model further improves quality and generalization. Future work will focus on scaling model and data for higher quality, developing more efficient architectures for higher resolutions and longer sequences, and exploring advanced memory mechanisms for complex, long-term dependency modeling.