# Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

> Wan-Streamer achieves sub-second full-duplex audio-visual interaction with 200 ms model-side latency using a single causal Transformer.

- **Source:** [arXiv](https://arxiv.org/abs/2606.25041)
- **Published:** 2026-06-26
- **Permalink:** https://picx.dev/p/OK1QBP
- **Whiteboard:** https://picx.dev/p/OK1QBP/image

## Summary (Overview)

- **Wan-Streamer** is a native-streaming, end-to-end foundation model for real-time, full-duplex text, audio, and video interaction, using a single Transformer with block-causal attention.
- Unlike cascaded interactive systems, it jointly learns perception, reasoning, generation, response timing, turn management, and cross-modal synchronization without relying on external VAD, ASR, language, TTS, audio-driven animation, or video-generation modules.
- The entire stack is designed for causality: strictly causal audio and video VAEs, causal encoders/decoders, block-causal multimodal attention, and full-history autoregressive streaming, enabling streaming units as short as 160 ms at 25 FPS.
- A deployable **thinker-performer** inference pipeline preserves the unified model state via KV-cache exchange while overlapping understanding and generation, achieving approximately **200 ms model-side latency** and approximately **550 ms total interaction latency** (including 350 ms bidirectional network).
- The model is trained in three stages (independent-task pretraining, end-to-end interaction training, distillation for low-latency streaming) using a broad mixture of understanding, generation, and duplex interaction data, reaching sub-second duplex audio-visual communication without module-boundary waiting times.

## Introduction and Theoretical Foundation

Human interaction is fundamentally streaming and full-duplex: we continuously watch, listen, speak, gesture, and interrupt, with perception and expression overlapping. Building artificial systems with the same pattern is increasingly important for embodied assistants, real-time digital humans, live broadcasting, and interactive world models. These applications require a model that continuously consumes audio-visual observations, maintains persistent world and dialogue state, decides when and how to respond, and expresses that response through synchronized language, speech, and video with very low latency.

Recent progress in multimodal language models and video generation has advanced several pieces of this goal, but systems are usually assembled as **asymmetric or cascaded** pipelines (e.g., separate VAD, ASR, LLM, TTS, avatar rendering). Such pipelines introduce waiting time at module boundaries, accumulate recognition and synchronization errors, and make response timing, turn management, and long-horizon consistency difficult to learn as one behavior.

The core difficulty is that real-time audio-visual interaction is intrinsically full-duplex: when the user is speaking, the agent should produce visible listening behavior; when the agent is responding, it should still perceive user feedback for interruption and adaptation. Different modalities have different token rates, representations, objectives, and latency constraints, yet they must be causally aligned within a single ongoing process. Wan-Streamer addresses this by design: every component operates causally, every new observation is usable immediately, and every generated unit is emitted and committed back into the interaction history.

## Methodology

### Model Architecture
Wan-Streamer represents language, audio, and video as an interleaved causal sequence processed by a single Transformer. The interaction is modeled as a continuous causal stream:

$$
p_\theta(\mathbf{y}_{1:K} \mid \mathbf{u}_{1:K}) = \prod_{k=1}^K p_\theta\left( y_k^t, y_k^a, y_k^v \mid \mathbf{u}_{\le k}^t, \mathbf{u}_{\le k}^a, \mathbf{u}_{\le k}^v, \mathbf{y}_{<k}^t, \mathbf{y}_{<k}^a, \mathbf{y}_{<k}^v \right) \tag{1}
$$

where $\mathbf{u}_k = (u_k^t, u_k^a, u_k^v)$ are user observations and $\mathbf{y}_k = (y_k^t, y_k^a, y_k^v)$ are agent responses at streaming unit $k$, with superscripts $t$, $a$, and $v$ denoting text, audio, and video. The language response consists of discrete tokens trained with a cross-entropy loss; the audio and video responses are continuous latents generated jointly with **conditional flow matching**.
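
The conditioning pattern in Eq. (1) can be illustrated with a small sketch. The names `StreamingUnit` and `context_for_step` are hypothetical and do not come from the paper, and the string payloads are placeholders for real tokens and latents. The only point is that step $k$ sees user observations up to and including $k$, but agent responses strictly before $k$.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StreamingUnit:
    """One streaming unit with text, audio and video payloads (placeholders)."""
    text: str
    audio: str
    video: str


def context_for_step(
    k: int,
    user_units: List[StreamingUnit],
    agent_units: List[StreamingUnit],
) -> Tuple[List[StreamingUnit], List[StreamingUnit]]:
    """Return (u_{<=k}, y_{<k}) for a 1-indexed step k, as in Eq. (1)."""
    if k < 1 or k > len(user_units):
        raise ValueError("k must index an observed user unit")
    if len(agent_units) < k - 1:
        raise ValueError("agent responses before step k must be committed")
    return user_units[:k], agent_units[: k - 1]


# Three observed user units, two agent responses already committed.
user = [StreamingUnit(f"ut{i}", f"ua{i}", f"uv{i}") for i in range(1, 4)]
agent = [StreamingUnit(f"yt{i}", f"ya{i}", f"yv{i}") for i in range(1, 3)]
seen_user, seen_agent = context_for_step(3, user, agent)
```

For step 3, `seen_user` holds all three user units while `seen_agent` holds only the first two agent units, which mirrors the $\le k$ and $< k$ subscripts in the factorization.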

For a clean target latent $z_0^m$ of modality $m \in \{a, v\}$ and noise $\epsilon^m \sim \mathcal{N}(0, I)$, with flow time $\tau \in [0, 1]$, the noisy latent and its velocity are:

$$
z_\tau^m = (1-\tau) z_0^m + \tau \epsilon^m, \quad \frac{\partial z_\tau^m}{\partial \tau} = \epsilon^m - z_0^m \tag{2}
$$

The velocity field is learned by regressing the network output onto this velocity:

$$
\mathcal{L}_{FM}^m = \mathbb{E}_{\epsilon^m} \left\| f_\theta(z_\tau^a, z_\tau^v, c_k, \tau) - \frac{\partial z_\tau^m}{\partial \tau} \right\|_2^2 \tag{3}
$$

where $c_k$ is the clean streaming context (user observations and agent responses already committed to history). The same clean context conditions both audio and video velocity predictions, coupling speech, motion, and appearance.
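
A minimal sketch of Eqs. (2) and (3) for a single modality, in plain Python, may make the training target concrete. It is a simplification under stated assumptions: the context $c_k$ and the joint audio-video input are omitted, latents are short lists of floats, and `predict` is a hypothetical stand-in for $f_\theta$ rather than the paper's network.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def interpolate(z0: Vector, eps: Vector, tau: float) -> Tuple[Vector, Vector]:
    """Eq. (2): noisy latent z_tau and its target velocity eps - z0."""
    z_tau = [(1.0 - tau) * z + tau * e for z, e in zip(z0, eps)]
    velocity = [e - z for z, e in zip(z0, eps)]
    return z_tau, velocity


def fm_loss(predict: Callable[[Vector, float], Vector],
            z0: Vector, eps: Vector, tau: float) -> float:
    """Single-sample squared-error term inside the expectation of Eq. (3)."""
    z_tau, target = interpolate(z0, eps, tau)
    pred = predict(z_tau, tau)
    return sum((p - t) ** 2 for p, t in zip(pred, target))


rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]                     # toy clean latent
eps = [rng.gauss(0.0, 1.0) for _ in z0]   # Gaussian noise sample
tau = 0.25

# An oracle that knows z0 recovers the velocity from z_tau, because
# z_tau - z0 = tau * (eps - z0); its loss is zero up to rounding.
oracle = lambda z_tau, t: [(zt - z) / t for zt, z in zip(z_tau, z0)]
# A model that always predicts zero velocity pays the full target norm.
zero_model = lambda z_tau, t: [0.0 for _ in z_tau]

loss_oracle = fm_loss(oracle, z0, eps, tau)
loss_zero = fm_loss(zero_model, z0, eps, tau)
```

At $\tau = 0$ the interpolant equals the clean latent and at $\tau = 1$ it equals the noise, so a solver that integrates the learned velocity from $\tau = 1$ down to $\tau = 0$ moves from noise toward a clean latent.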

### Fully Causal Stack
- **Causal VAEs**: strictly causal audio and video variational autoencoders for streaming latent coding.
- **Causal encoders** and **causal decoders** for audio and video.
- **Block-causal attention** in the Transformer for incremental streaming.
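
As an illustration of block-causal attention, the sketch below builds a boolean mask in which each token attends to every token in its own block and in all earlier blocks. How the paper groups tokens into blocks is not specified here, so `block_sizes` and the function name `block_causal_mask` are assumptions made for the example.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when query token i may attend to key token j.

    Tokens attend to every token in their own block and in earlier blocks.
    """
    # Map each token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


mask = block_causal_mask([2, 3, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With blocks of 2, 3, and 1 tokens, the first two rows print `xx....`, the next three `xxxxx.`, and the last `xxxxxx`: attention is bidirectional inside a block and causal across blocks, so a completed block can be cached and never needs to be recomputed when later blocks arrive.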

### Training Stages
1. **Independent-task pretraining**: Initialize from a language model, then train the multimodal interface on understanding tasks (image, audio, video, ASR, dialogue) and generation tasks (image, audio, video, joint audio-visual).
2. **End-to-end interaction training**: Train on duplex interaction data where user inputs and agent outputs are interleaved in the same causal stream, learning response timing, active listening, interruption handling, and long-context consistency.
3. **Distillation for low-latency streaming**: Distill a stronger teacher (with CFG and more solver steps) into an efficient student. Use rolling distillation with self-forcing and distribution matching to mitigate long-horizon degradation.

### Inference: Thinker-Performer Pipeline
The model is trained as a single network but deployed across two GPUs, a thinker and a performer, so that understanding and generation can overlap:
- **Thinker**: hosts causal encoders, short token-causal Transformer for language prediction/state update, KV-cache construction, and causal decoders.
- **Performer**: hosts only the latent generation path (flow-matching solver).

At each streaming step:
1. Thinker encodes current user observations, updates KV cache, and decodes previous response latents for emission.
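2. The thinker sends the new KV slice to the performer.
3. The performer appends the KV slice to its cache and runs the flow-matching solver to produce the next clean latents.
4. The performer returns the clean latents to the thinker, which decodes them at the next step.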

This schedule overlaps perception and state update, decoding, communication, and latent denoising across the two GPUs, which enables real-time streaming in 160 ms units.
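
The data flow of the four steps can be traced with a toy simulation. This is a sequential sketch, not the deployed system: a real deployment would run the two roles concurrently on separate GPUs, and `run_pipeline` plus the string stand-ins for KV slices and latents are hypothetical.

```python
from typing import List, Optional, Tuple


def run_pipeline(observations: List[str]) -> Tuple[List[Optional[str]], List[str]]:
    """Sequentially simulate the thinker/performer data flow.

    Returns what the thinker emits at each step and the performer's KV cache.
    """
    thinker_kv: List[str] = []
    performer_kv: List[str] = []
    pending_latent: Optional[str] = None  # latents produced one step earlier
    emitted: List[Optional[str]] = []

    for k, obs in enumerate(observations):
        # Step 1 (thinker): encode the observation, extend the KV cache,
        # and decode the latents the performer produced at the previous step.
        kv_slice = f"kv({obs})"
        thinker_kv.append(kv_slice)
        emitted.append(f"decode({pending_latent})" if pending_latent else None)

        # Step 2: ship only the new KV slice to the performer.
        performer_kv.append(kv_slice)

        # Step 3 (performer): stand-in for the flow-matching solver.
        # Step 4: the result is picked up by the thinker at step k + 1.
        pending_latent = f"latent[{k}]"

    # Both roles end up holding the same history, slice by slice.
    assert thinker_kv == performer_kv
    return emitted, performer_kv


emitted, performer_kv = run_pipeline(["u0", "u1", "u2"])
```

Nothing is emitted at the first step, and each later step emits the latents generated one step before it. That one-step offset is what lets the performer's solver run while the thinker is already processing the next observation.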

## Empirical Validation / Results

### Latency and Runtime Comparison
Wan-Streamer achieves ~200 ms model-side signal-to-signal latency and ~550 ms total interaction latency (including 350 ms bidirectional network) for a remote user.
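
The reported figures fit together with simple arithmetic, shown below as a quick check. The constants are the approximate values quoted in this summary; the variable names are illustrative only.

```python
MODEL_SIDE_MS = 200   # approximate model-side latency quoted above
NETWORK_MS = 350      # bidirectional network latency quoted above
UNIT_MS = 160         # duration of one streaming unit
FPS = 25              # video frame rate

total_ms = MODEL_SIDE_MS + NETWORK_MS     # end-to-end interaction latency
frames_per_unit = UNIT_MS * FPS // 1000   # video frames in one streaming unit
units_per_second = 1000 / UNIT_MS         # streaming steps per second

print(total_ms, frames_per_unit, units_per_second)
```

This gives 550 ms total, 4 video frames per 160 ms unit, and 6.25 streaming steps per second, so the model-side budget of about 200 ms is slightly longer than a single streaming unit.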

**Table 1: Response-latency comparison for real-time speech and omni-modal interaction systems**

| System | Interaction | User-visible response | Other reported metric | Comparison boundary |
|--------|-------------|----------------------|-----------------------|---------------------|
| Doubao Realtime Voice | speech-to-speech | ~1 s overall | ~700 ms bare-model latency | Speech-only product |
| Seeduplex | speech-to-speech | N/R | −250 ms endpoint latency | Speech-only; relative improvement |
| GPT-4o / Realtime API | speech-to-speech, audio/vision input | protocol-dependent | 232/320 ms official audio; ~500 ms API TTFB | Mix of model/API/network |
| Hume EVI 3 | speech-to-speech | 0.9–1.4 s web-app | <300 ms model response | No visual output |
| Gemini Live API | speech-to-speech | 1.2–3.6 s API | N/R model-side | Vendor benchmark |
| Moshi | speech-to-speech | N/R | 160 ms theoretical, 200 ms practical | Native duplex speech; no visual |
| Wan-Streamer (ours) | text/audio/video in/out | **~550 ms total** (including 350 ms network) | **~200 ms model-side**; 25 FPS video | One end-to-end model; synchronized visual |

**Table 2: Runtime comparison with visual agents, streaming avatars, and audio-visual generators**

| System | Visual interaction scope | Reported runtime | Main difference |
|--------|-------------------------|------------------|----------------|
| Body of Her | end-to-end humanoid agent | next frame ≤42 ms (24 FPS) | No deployed signal-to-signal latency |
| X-Streamer | open-ended video chat portrait | 25 FPS on 2×A100 | Absolute latency not disclosed |
| Avatar Forcing (Ki et al.) | interactive head-avatar | ~500 ms reaction latency | No dialogue generation |
| Hallo-Live | text-driven joint A/V avatar | 20.38 FPS, 0.94 s latency | Text-driven; no continuous perception |
| Wan-Streamer (ours) | text/audio/video interaction | **25 FPS; ~550 ms total; ~200 ms model-side** | Single causal Transformer |

Wan-Streamer’s **550 ms total latency** covers the full audio-visual response path (perception, reasoning, speech generation, synchronized visual output), unlike speech-only or component-level systems.

### Naturalness
- **Idle state**: maintains identity, gaze, posture, breathing, subtle facial motion over streaming history (no frozen portrait).
- **Listening state**: produces responsive non-verbal feedback (gaze shifts, nods, micro-expressions) coupled with user speech/visual cues.
- **Speech-video synchronization**: lip motion, facial dynamics, and prosody are synchronized natively via joint causal prediction before decoding.

### Interruption and Proactive Speaking
Full-duplex behavior is learned from interleaved interaction data, not hand-crafted turn-taking rules. The model continues consuming user audio-video while generating its own response, enabling natural interruption handling. It can also initiate relevant comments based on salient visual events, moving from passive Q&A to proactive continuous exchange.

## Theoretical and Practical Implications

- **Theoretical**: Demonstrates that full-duplex audio-visual interaction can be modeled as a single causal stream across modalities, without requiring separate modules for perception, reasoning, generation, and synchronization. The block-causal attention and fully causal VAEs provide a principled framework for streaming multimodal sequence modeling.
- **Practical**: Achieves sub-second interaction latency with synchronized audio-visual output, directly applicable to embodied assistants, real-time digital humans, live broadcasting, and interactive entertainment. The thinker-performer pipeline enables overlapping of compute-intensive tasks while preserving model state, making real-time deployment feasible on two GPUs.
- **Contrast to cascaded systems**: Eliminates accumulated errors and waiting times at module boundaries. Response timing, turn management, identity preservation, and cross-modal consistency are learned jointly rather than engineered as post-hoc rules.

## Conclusion

Wan-Streamer is a native-streaming, end-to-end foundation model for real-time full-duplex text, audio, and video interaction. By representing user inputs and agent outputs across all modalities as one causal stream processed by a single Transformer, it overcomes the limitations of cascaded systems. Fully causal VAEs, encoders/decoders, and block-causal attention enable sub-second interaction latency (~200 ms model-side, ~550 ms total) at 25 FPS with 160 ms streaming units. The current v0.1 results are a proof of concept at 192p resolution; scaling to higher resolutions is straightforward. The work suggests that real-time multimodal agents should be designed from the ground up as native full-duplex systems, where listening, seeing, speaking, and visible response are learned jointly rather than assembled as post-hoc modules.

---

_Markdown view of https://picx.dev/p/OK1QBP, served by PicX — AI-generated visual whiteboard summaries of research papers._
