Visual Summary | Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Summary (Overview)

Wan-Streamer is a native-streaming, end-to-end foundation model for real-time, full-duplex text, audio, and video interaction, using a single Transformer with block-causal attention.
Unlike cascaded interactive systems, it jointly learns perception, reasoning, generation, response timing, turn management, and cross-modal synchronization without relying on external VAD, ASR, language, TTS, audio-driven animation, or video-generation modules.
The entire stack is designed for causality: strictly causal audio and video VAEs, causal encoders/decoders, block-causal multimodal attention, and full-history autoregressive streaming, enabling streaming units as short as 160 ms at 25 FPS.
A deployable thinker-performer inference pipeline preserves the unified model state via KV-cache exchange while overlapping understanding and generation, achieving approximately 200 ms model-side latency and approximately 550 ms total interaction latency (including about 350 ms of bidirectional network delay).
The model is trained in three stages (independent-task pretraining, end-to-end interaction training, distillation for low-latency streaming) using a broad mixture of understanding, generation, and duplex interaction data, reaching sub-second duplex audio-visual communication without module-boundary waiting times.

Introduction and Theoretical Foundation

Human interaction is fundamentally streaming and full-duplex: we continuously watch, listen, speak, gesture, and interrupt, with perception and expression overlapping. Building artificial systems with the same pattern is increasingly important for embodied assistants, real-time digital humans, live broadcasting, and interactive world models. These applications require a model that continuously consumes audio-visual observations, maintains persistent world and dialogue state, decides when and how to respond, and expresses that response through synchronized language, speech, and video with very low latency.

Recent progress in multimodal language models and video generation has advanced several pieces of this goal, but systems are usually assembled as asymmetric or cascaded pipelines (e.g., separate VAD, ASR, LLM, TTS, avatar rendering). Such pipelines introduce waiting time at module boundaries, accumulate recognition and synchronization errors, and make response timing, turn management, and long-horizon consistency difficult to learn as one behavior.

The core difficulty is that real-time audio-visual interaction is intrinsically full-duplex: when the user is speaking, the agent should produce visible listening behavior; when the agent is responding, it should still perceive user feedback for interruption and adaptation. Different modalities have different token rates, representations, objectives, and latency constraints, yet they must be causally aligned within a single ongoing process. Wan-Streamer addresses this by design: every component operates causally, every new observation is usable immediately, and every generated unit is emitted and committed back into the interaction history.

Methodology

Model Architecture

Wan-Streamer represents language, audio, and video as an interleaved causal sequence processed by a single Transformer. The interaction is modeled as a continuous causal stream:

$$p_\theta(\mathbf{y}_{1:K} \mid \mathbf{u}_{1:K}) = \prod_{k=1}^K p_\theta\left( y_k^t, y_k^a, y_k^v \mid \mathbf{u}_{\le k}^t, \mathbf{u}_{\le k}^a, \mathbf{u}_{\le k}^v, \mathbf{y}_{<k}^t, \mathbf{y}_{<k}^a, \mathbf{y}_{<k}^v \right) \tag{1}$$

where $\mathbf{u}_k = (u_k^t, u_k^a, u_k^v)$ are the user observations and $\mathbf{y}_k = (y_k^t, y_k^a, y_k^v)$ are the agent responses at streaming unit $k$, with superscripts $t$, $a$, and $v$ denoting text, audio, and video. The language response consists of discrete tokens optimized with a cross-entropy loss; the audio and video responses are continuous latents generated jointly with conditional flow matching.

For a clean target latent $z_0^m$ of modality $m$ (audio or video), noise $\epsilon^m \sim \mathcal{N}(0, I)$, and flow time $\tau \in [0, 1]$:

$$z_\tau^m = (1-\tau) z_0^m + \tau \epsilon^m, \quad \frac{\partial z_\tau^m}{\partial \tau} = \epsilon^m - z_0^m \tag{2}$$

The velocity field is estimated by minimizing the loss:

$$\mathcal{L}_{FM}^m = \mathbb{E}_{\epsilon^m} \left\| f_\theta(z_\tau^a, z_\tau^v, c_k, \tau) - \frac{\partial z_\tau^m}{\partial \tau} \right\|_2^2 \tag{3}$$

where $c_k$ is the clean streaming context (user observations and agent responses already committed to history). The same clean context conditions both audio and video velocity predictions, coupling speech, motion, and appearance.
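The interpolation in Eq. (2) and the per-sample loss in Eq. (3) can be illustrated with a minimal sketch in plain Python. The function names, the toy latent values, and the use of flat lists in place of tensors are assumptions made for illustration; the network $f_\theta$ and the context $c_k$ are not modeled, so `pred` simply stands in for the network output.

```python
import random

def interpolate(z0, eps, tau):
    """Eq. (2): point on the straight path from clean latent z0 to noise eps."""
    return [(1.0 - tau) * a + tau * b for a, b in zip(z0, eps)]

def velocity_target(z0, eps):
    """Eq. (2): derivative of z_tau with respect to tau, i.e. eps - z0."""
    return [b - a for a, b in zip(z0, eps)]

def fm_loss(pred, z0, eps):
    """Eq. (3) for one sample: squared L2 distance between prediction and target."""
    target = velocity_target(z0, eps)
    return sum((p - t) ** 2 for p, t in zip(pred, target))

rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]                      # toy clean latent (hypothetical values)
eps = [rng.gauss(0.0, 1.0) for _ in z0]    # Gaussian noise sample
z_tau = interpolate(z0, eps, tau=0.3)      # noisy latent the network would see
perfect = velocity_target(z0, eps)         # prediction that attains zero loss
print(fm_loss(perfect, z0, eps))           # 0.0
```

Because the path is linear in $\tau$, the target velocity does not depend on $\tau$; only the noisy input $z_\tau^m$ does.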

Fully Causal Stack

Causal VAEs: strictly causal audio and video variational autoencoders for streaming latent coding.
Causal encoders and causal decoders for audio and video.
Block-causal attention in the Transformer for incremental streaming.
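As a rough illustration of block-causal attention, the sketch below builds a boolean mask in which tokens attend to every token in their own block and in all earlier blocks, but never to later blocks. Treating each block as the tokens of one streaming unit, and the block sizes used here, are assumptions; the summary does not specify the actual block layout.

```python
def block_causal_mask(block_sizes):
    """Return mask[i][j] = True when query token i may attend to key token j.

    Tokens in the same block see each other; tokens in later blocks also see
    every earlier block; no token sees a later block.
    """
    # Map each token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]

# Three hypothetical blocks of 2, 3, and 1 tokens.
mask = block_causal_mask([2, 3, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxxx.
# xxxxx.
# xxxxx.
# xxxxxx
```

Compared with a token-level causal mask, the only difference is the bidirectional visibility inside each block, which lets a whole streaming unit be processed in one pass while earlier units stay cacheable.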

Training Stages

Independent-task pretraining: Initialize from a language model; train the multimodal interface on understanding (image, audio, video, ASR, dialogue) and generation (image, audio, video, joint audio-visual) tasks.
End-to-end interaction training: Train on duplex interaction data where user inputs and agent outputs are interleaved in the same causal stream, learning response timing, active listening, interruption handling, and long-context consistency.
Distillation for low-latency streaming: Distill a stronger teacher (with classifier-free guidance, CFG, and more solver steps) into an efficient student. Use rolling distillation with self-forcing and distribution matching to mitigate long-horizon degradation.
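To make the cost gap between teacher and student concrete, the following sketch pairs the standard classifier-free guidance combination with a plain Euler solver that integrates from noise ($\tau = 1$) to the clean latent ($\tau = 0$). The step counts (32 and 4), the toy velocity field, and the assumption that the student runs without CFG are hypothetical; the summary does not state the actual solver or settings.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def euler_solve(velocity_fn, z_noise, num_steps):
    """Integrate dz/dtau = v from tau = 1 (noise) down to tau = 0 (clean latent)."""
    z = list(z_noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = 1.0 - i * dt
        v = velocity_fn(z, tau)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z

def function_evals(num_steps, uses_cfg):
    """Network forward passes per streaming unit; CFG doubles each step."""
    return num_steps * (2 if uses_cfg else 1)

z0 = [0.5, -1.0]       # hypothetical clean latent
noise = [1.5, 0.25]    # hypothetical starting noise

def toy_velocity(z, tau):
    # Exact velocity when z0 is the only possible clean latent: (z - z0) / tau.
    return [(zi - ti) / tau for zi, ti in zip(z, z0)]

teacher_out = euler_solve(toy_velocity, noise, num_steps=32)
student_out = euler_solve(toy_velocity, noise, num_steps=4)
print(function_evals(32, uses_cfg=True), function_evals(4, uses_cfg=False))  # 64 4
```

In this toy field the trajectory is a straight line, so both step counts recover `z0`. A learned field is generally curved, so reducing the step count loses quality, and that gap is what distillation is meant to close.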

Inference: Thinker-Performer Pipeline

The model is trained as a single network but deployed across separate thinker and performer GPUs so that understanding and generation can overlap:

Thinker: hosts the causal encoders, the short token-causal Transformer for language prediction and state update, KV-cache construction, and the causal decoders.
Performer: hosts only the latent generation path (flow-matching solver).

At each streaming step:

The thinker encodes the current user observations, updates the KV cache, and decodes the previous response latents for emission.
The new KV slice is sent to the performer.
The performer appends the KV slice and runs the flow-matching solver to produce the next clean latents.
The clean latents are returned to the thinker at the next step.

This arrangement pipelines perception and state update, decoding, communication, and latent denoising, enabling real-time streaming in 160 ms units.
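The data flow of the steps above can be sketched as a small simulation. It is a simplified illustration, not the deployed system: strings stand in for latents, lists stand in for KV caches, and the function and variable names are hypothetical. The simulation also runs sequentially, whereas in deployment the performer's work for one step would overlap with the thinker's work for the next.

```python
def run_pipeline(observations):
    """Simulate the thinker/performer hand-off for a list of user observations."""
    thinker_kv = []        # thinker's running context (stands in for the KV cache)
    performer_kv = []      # performer's copy, grown one slice per step
    pending_latent = None  # latent produced by the performer in the previous step
    emitted = []
    for step, obs in enumerate(observations):
        # Thinker: commit the previous response (if any) and emit its decoding,
        # then encode the new observation.
        kv_slice = []
        if pending_latent is not None:
            emitted.append((step, f"decoded({pending_latent})"))
            kv_slice.append(("agent", pending_latent))
        kv_slice.append(("user", obs))
        thinker_kv.extend(kv_slice)
        # Performer: append the received slice, then generate the next latent.
        performer_kv.extend(kv_slice)
        pending_latent = f"latent[{step}|ctx={len(performer_kv)}]"
    return emitted, thinker_kv, performer_kv

emitted, thinker_kv, performer_kv = run_pipeline(["u0", "u1", "u2"])
for step, output in emitted:
    print(step, output)
# 1 decoded(latent[0|ctx=1])
# 2 decoded(latent[1|ctx=3])
```

The one-step delay between generating a latent and emitting it is what lets the two GPUs work concurrently, and exchanging only the new slice keeps both caches identical without resending the full history.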

Empirical Validation / Results

Latency and Runtime Comparison

Wan-Streamer achieves ~200 ms model-side signal-to-signal latency and ~550 ms total interaction latency for a remote user, of which about 350 ms is bidirectional network delay.
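The reported figures fit together arithmetically, as the short check below shows. It uses the approximate values quoted in this summary (25 FPS, 160 ms units, ~200 ms model-side, ~350 ms network); the constant names are illustrative only.

```python
FPS = 25              # video frame rate
UNIT_MS = 160         # shortest streaming unit
MODEL_SIDE_MS = 200   # approximate model-side latency
NETWORK_MS = 350      # approximate bidirectional network delay

frame_ms = 1000 / FPS                   # duration of one video frame
frames_per_unit = UNIT_MS / frame_ms    # video frames per streaming unit
total_ms = MODEL_SIDE_MS + NETWORK_MS   # end-to-end interaction latency

print(frame_ms, frames_per_unit, total_ms)  # 40.0 4.0 550
```

Each 160 ms streaming unit therefore carries four video frames, and the network share accounts for most of the ~550 ms total.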

Table 1: Response-latency comparison for real-time speech and omni-modal interaction systems

System	Interaction	User-visible response	Other reported metric	Comparison boundary
Doubao Realtime Voice	speech-to-speech	~1 s overall	~700 ms bare-model latency	Speech-only product
Seeduplex	speech-to-speech	N/R	−250 ms endpoint latency	Speech-only; relative improvement
GPT-4o / Realtime API	speech-to-speech, audio/vision input	protocol-dependent	232/320 ms official audio; ~500 ms API TTFB	Mix of model/API/network
Hume EVI 3	speech-to-speech	0.9–1.4 s web-app	<300 ms model response	No visual output
Gemini Live API	speech-to-speech	1.2–3.6 s API	N/R model-side	Vendor benchmark
Moshi	speech-to-speech	N/R	160 ms theoretical, 200 ms practical	Native duplex speech; no visual
Wan-Streamer (ours)	text/audio/video in/out	~550 ms total (including 350 ms network)	~200 ms model-side; 25 FPS video	One end-to-end model; synchronized visual

Table 2: Runtime comparison with visual agents, streaming avatars, and audio-visual generators

System	Visual interaction scope	Reported runtime	Main difference
Body of Her	end-to-end humanoid agent	next frame ≤42 ms (24 FPS)	No deployed signal-to-signal latency
X-Streamer	open-ended video chat portrait	25 FPS on 2×A100	Absolute latency not disclosed
Avatar Forcing (Ki et al.)	interactive head-avatar	~500 ms reaction latency	No dialogue generation
Hallo-Live	text-driven joint A/V avatar	20.38 FPS, 0.94 s latency	Text-driven; no continuous perception
Wan-Streamer (ours)	text/audio/video interaction	25 FPS; ~550 ms total; ~200 ms model-side	Single causal Transformer

Wan-Streamer’s 550 ms total latency covers the full audio-visual response path (perception, reasoning, speech generation, synchronized visual output), unlike speech-only or component-level systems.

Naturalness

Idle state: maintains identity, gaze, posture, breathing, subtle facial motion over streaming history (no frozen portrait).
Listening state: produces responsive non-verbal feedback (gaze shifts, nods, micro-expressions) coupled with user speech/visual cues.
Speech-video synchronization: lip motion, facial dynamics, and prosody are synchronized natively via joint causal prediction before decoding.

Interruption and Proactive Speaking

Full-duplex behavior is learned from interleaved interaction data, not hand-crafted turn-taking rules. The model continues consuming user audio-video while generating its own response, enabling natural interruption handling. It can also initiate relevant comments based on salient visual events, moving from passive Q&A to proactive continuous exchange.

Theoretical and Practical Implications

Theoretical: Demonstrates that full-duplex audio-visual interaction can be modeled as a single causal stream across modalities, without requiring separate modules for perception, reasoning, generation, and synchronization. The block-causal attention and fully causal VAEs provide a principled framework for streaming multimodal sequence modeling.
Practical: Achieves sub-second interaction latency with synchronized audio-visual output, directly applicable to embodied assistants, real-time digital humans, live broadcasting, and interactive entertainment. The thinker-performer pipeline enables overlapping of compute-intensive tasks while preserving model state, making real-time deployment feasible on two GPUs.
Contrast to cascaded systems: Eliminates accumulated errors and waiting times at module boundaries. Response timing, turn management, identity preservation, and cross-modal consistency are learned jointly rather than engineered as post-hoc rules.

Conclusion

Wan-Streamer is a native-streaming, end-to-end foundation model for real-time full-duplex text, audio, and video interaction. By representing user inputs and agent outputs across all modalities as one causal stream processed by a single Transformer, it overcomes the limitations of cascaded systems. Fully causal VAEs, encoders/decoders, and block-causal attention enable sub-second interaction latency (~200 ms model-side, ~550 ms total) at 25 FPS with 160 ms streaming units. The current v0.1 results are a proof of concept at 192p resolution; scaling to higher resolutions is straightforward. The work suggests that real-time multimodal agents should be designed from the ground up as native full-duplex systems, where listening, seeing, speaking, and visible response are learned jointly rather than assembled as post-hoc modules.