Summary (Overview)

  • Wan-Streamer is a native-streaming, end-to-end foundation model for real-time, full-duplex text, audio, and video interaction, using a single Transformer with block-causal attention.
  • Unlike cascaded interactive systems, it jointly learns perception, reasoning, generation, response timing, turn management, and cross-modal synchronization without relying on external VAD, ASR, language, TTS, audio-driven animation, or video-generation modules.
  • The entire stack is designed for causality: strictly causal audio and video VAEs, causal encoders/decoders, block-causal multimodal attention, and full-history autoregressive streaming, enabling streaming units as short as 160 ms at 25 FPS.
  • A deployable thinker-performer inference pipeline preserves the unified model state via KV-cache exchange while overlapping understanding and generation, achieving approximately 200 ms model-side latency and approximately 550 ms total interaction latency (including about 350 ms of bidirectional network delay).
  • The model is trained in three stages (independent-task pretraining, end-to-end interaction training, distillation for low-latency streaming) using a broad mixture of understanding, generation, and duplex interaction data, reaching sub-second duplex audio-visual communication without module-boundary waiting times.
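The timing figures in the bullets above can be cross-checked with simple arithmetic. The sketch below only restates the quoted numbers (160 ms units, 25 FPS, about 200 ms model-side latency, about 350 ms network delay); the function and constant names are illustrative and do not come from the paper.

```python
# Illustrative arithmetic only: names are hypothetical, numbers are the ones quoted above.

def frames_per_unit(unit_ms: float, fps: float) -> float:
    """Number of video frames covered by one streaming unit."""
    return unit_ms * fps / 1000.0


def total_latency_ms(model_ms: float, network_ms: float) -> float:
    """Total interaction latency as model-side latency plus network delay."""
    return model_ms + network_ms


UNIT_MS = 160      # shortest streaming unit quoted in the summary
FPS = 25           # video frame rate quoted in the summary
MODEL_MS = 200     # approximate model-side latency
NETWORK_MS = 350   # approximate bidirectional network delay

FRAMES = frames_per_unit(UNIT_MS, FPS)
TOTAL_MS = total_latency_ms(MODEL_MS, NETWORK_MS)
```

Under these numbers one streaming unit spans four video frames, and the model-side and network figures add up to the quoted total of roughly 550 ms.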

Introduction and Theoretical Foundation

Human interaction is fundamentally streaming and full-duplex: we continuously watch, listen, speak, gesture, and interrupt, with perception and expression overlapping. Building artificial systems with the same pattern is increasingly important for embodied assistants, real-time digital humans, live broadcasting, and interactive world models. These applications require a model that continuously consumes audio-visual observations, maintains persistent world and dialogue state, decides when and how to respond, and expresses that response through synchronized language, speech, and video with very low latency.

Recent progress in multimodal language models and video generation has advanced several pieces of this goal, but systems are usually assembled as asymmetric or cascaded pipelines (e.g., separate VAD, ASR, LLM, TTS, avatar rendering). Such pipelines introduce waiting time at module boundaries, accumulate recognition and synchronization errors, and make response timing, turn management, and long-horizon consistency difficult to learn as one behavior.

The core difficulty is that real-time audio-visual interaction is intrinsically full-duplex: when the user is speaking, the agent should produce visible listening behavior; when the agent is responding, it should still perceive user feedback for interruption and adaptation. Different modalities have different token rates, representations, objectives, and latency constraints, yet they must be causally aligned within a single ongoing process. Wan-Streamer addresses this by design: every component operates causally, every new observation is usable immediately, and every generated unit is emitted and committed back into the interaction history.

Methodology

Model Architecture

Wan-Streamer represents language, audio, and video as an interleaved causal sequence processed by a single Transformer. The interaction is modeled as a continuous causal stream:

$$p_\theta(\mathbf{y}_{1:K} \mid \mathbf{u}_{1:K}) = \prod_{k=1}^K p_\theta\left( y_k^t, y_k^a, y_k^v \mid \mathbf{u}_{\le k}^t, \mathbf{u}_{\le k}^a, \mathbf{u}_{\le k}^v, \mathbf{y}_{<k}^t, \mathbf{y}_{<k}^a, \mathbf{y}_{<k}^v \right) \tag{1}$$

where $\mathbf{u}_k = (u_k^t, u_k^a, u_k^v)$ are the user observations and $\mathbf{y}_k = (y_k^t, y_k^a, y_k^v)$ are the agent responses at streaming unit $k$, with superscripts $t$, $a$, and $v$ denoting text, audio, and video. The language response consists of discrete tokens optimized with a cross-entropy loss; the audio and video responses are continuous latents generated jointly with conditional flow matching.
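To make the conditioning in Eq. (1) concrete, the following minimal sketch selects the history that the response at unit k may depend on: user observations up to and including unit k, and agent responses strictly before unit k. The `Unit` type and the `conditioning_context` helper are hypothetical names introduced for illustration; how the paper actually orders tokens inside a unit is not specified here.

```python
from typing import List, Sequence, Tuple

# Hypothetical placeholder: (text, audio, video) content of one streaming unit.
Unit = Tuple[str, str, str]


def conditioning_context(
    user_units: Sequence[Unit], agent_units: Sequence[Unit], k: int
) -> Tuple[List[Unit], List[Unit]]:
    """Return (u_{<=k}, y_{<k}) for the 1-indexed streaming unit k."""
    if not 1 <= k <= len(user_units):
        raise ValueError("k must be in 1..len(user_units)")
    # User observations include the current unit; agent responses exclude it.
    return list(user_units[:k]), list(agent_units[: k - 1])
```

For example, at k = 3 the response is conditioned on three user units but only the two agent units already committed to history.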

For a clean target latent $z_0^m$ of modality $m \in \{a, v\}$ and noise $\epsilon^m \sim \mathcal{N}(0, I)$, with flow time $\tau$:

$$z_\tau^m = (1-\tau) z_0^m + \tau \epsilon^m, \quad \frac{\partial z_\tau^m}{\partial \tau} = \epsilon^m - z_0^m \tag{2}$$

The velocity field is estimated with loss:

$$\mathcal{L}_{FM}^m = \mathbb{E}_{\epsilon^m} \left\| f_\theta(z_\tau^a, z_\tau^v, c_k, \tau) - \frac{\partial z_\tau^m}{\partial \tau} \right\|_2^2 \tag{3}$$

where $c_k$ is the clean streaming context (user observations and agent responses already committed to history). The same clean context conditions both audio and video velocity predictions, coupling speech, motion, and appearance.
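The interpolation in Eq. (2) and the loss in Eq. (3) can be illustrated with a single-sample estimate in plain Python. This is a minimal sketch under stated assumptions: latents are flat lists of floats, `predict` is a hypothetical stand-in for the model that returns one velocity per modality, and the expectation is approximated by one noise draw. None of these names come from the paper.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]
# Hypothetical model signature: (z_tau_audio, z_tau_video, context, tau) -> (v_audio, v_video)
Predictor = Callable[[Vector, Vector, object, float], Tuple[Vector, Vector]]


def interpolate(z0: Vector, eps: Vector, tau: float) -> Vector:
    """Noisy latent z_tau = (1 - tau) * z0 + tau * eps."""
    return [(1.0 - tau) * a + tau * b for a, b in zip(z0, eps)]


def target_velocity(z0: Vector, eps: Vector) -> Vector:
    """Derivative of z_tau with respect to tau, which is eps - z0 for every tau."""
    return [b - a for a, b in zip(z0, eps)]


def squared_error(x: Vector, y: Vector) -> float:
    """Squared L2 distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(x, y))


def flow_matching_losses(
    predict: Predictor, z0_a: Vector, z0_v: Vector, context: object,
    tau: float, rng: random.Random,
) -> Tuple[float, float]:
    """One-sample estimate of the audio and video flow-matching losses."""
    eps_a = [rng.gauss(0.0, 1.0) for _ in z0_a]
    eps_v = [rng.gauss(0.0, 1.0) for _ in z0_v]
    # Both noisy latents and the shared clean context feed one joint prediction.
    v_a, v_v = predict(interpolate(z0_a, eps_a, tau),
                       interpolate(z0_v, eps_v, tau), context, tau)
    return (squared_error(v_a, target_velocity(z0_a, eps_a)),
            squared_error(v_v, target_velocity(z0_v, eps_v)))
```

Because both modalities are predicted in one call on the same context, a single forward pass yields both loss terms, which mirrors the coupling described above.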

Fully Causal Stack

  • Causal VAEs: strictly causal audio and video variational autoencoders for streaming latent coding.
  • Causal encoders and causal decoders for audio and video.
  • Block-causal attention in the Transformer for incremental streaming.
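A block-causal attention mask lets each token attend to every token in its own block and in all earlier blocks, but never to later blocks. The sketch below builds such a mask in plain Python. It assumes that one block corresponds to one streaming unit, which is an illustrative choice; the summary does not state how the paper groups tokens into blocks, and `block_causal_mask` is a hypothetical helper name.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Tokens see their own block in full (bidirectionally) and all earlier blocks.
    """
    # Map every token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Two tokens in the first block, one token in the second.
EXAMPLE_MASK = block_causal_mask([2, 1])
```

With block size one throughout, the mask reduces to ordinary token-level causal attention, so the block size controls how much within-unit context is shared before the next unit arrives.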

Training Stages

  1. Independent-task pretraining: Initialize from a language model, then train the multimodal interface on understanding tasks (image, audio, video, ASR, dialogue) and generation tasks (image, audio, video, joint audio-visual).
  2. End-to-end interaction training: Train on duplex interaction data where user inputs and agent outputs are interleaved in the same causal stream, learning response timing, active listening, interruption handling, and long-context consistency.
  3. Distillation for low-latency streaming: Distill a stronger teacher (with CFG and more solver steps) into an efficient student. Use rolling distillation with self-forcing and distribution matching to mitigate long-horizon degradation.
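The cost gap that distillation targets can be seen in a toy flow-matching sampler. The sketch below uses a plain Euler integrator from tau = 1 (noise) to tau = 0 (clean latent), following the convention of Eq. (2), together with the commonly used classifier-free guidance combination. Both are assumptions made for illustration: the summary does not specify the solver, the guidance parameterization, or the step counts, and all names here are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def cfg_velocity(v_cond: Vector, v_uncond: Vector, scale: float) -> Vector:
    """Classifier-free guidance: move from the unconditional toward the conditional velocity."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_solve(velocity: VelocityFn, noise: Vector, num_steps: int) -> Vector:
    """Integrate dz/dtau = velocity(z, tau) from tau = 1 down to tau = 0 with Euler steps."""
    z = list(noise)
    for i in range(num_steps):
        tau = 1.0 - i / num_steps
        tau_next = 1.0 - (i + 1) / num_steps
        v = velocity(z, tau)
        # tau_next - tau is negative, so each step moves from noise toward the clean latent.
        z = [zi + (tau_next - tau) * vi for zi, vi in zip(z, v)]
    return z
```

In this picture a guided teacher evaluates the network twice per step (conditional and unconditional) over many steps, while a distilled student would call `euler_solve` with a small `num_steps` and no guidance, which is where the latency saving comes from.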

Inference: Thinker-Performer Pipeline

The model is trained as a single network but deployed across separate thinker and performer GPUs so that understanding and generation can overlap:

  • Thinker: hosts the causal encoders, the short token-causal Transformer for language prediction and state update, KV-cache construction, and the causal decoders.
  • Performer: hosts only the latent generation path (the flow-matching solver).

At each streaming step:

  1. The thinker encodes the current user observations, updates the KV cache, and decodes the previous response latents for emission.
  2. The thinker sends the new KV slice to the performer.
  3. The performer appends the KV slice and runs the flow-matching solver to produce the next clean latents.
  4. The clean latents are returned to the thinker at the next step.

This schedule pipelines perception and state update, decoding, communication, and latent denoising, which enables real-time streaming with 160 ms units.
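The one-step offset between generation and emission in this schedule can be traced with a small sequential simulation. It is only a sketch: strings stand in for KV slices and latents, the two devices are simulated in one loop rather than run concurrently, and `run_pipeline` is a hypothetical name rather than part of any released code.

```python
from typing import List, Optional


def run_pipeline(user_units: List[str]) -> List[str]:
    """Simulate the thinker-performer schedule and return the emitted outputs in order."""
    thinker_kv: List[str] = []             # thinker-side KV cache, one entry per unit
    performer_kv: List[str] = []           # performer-side copy, grown slice by slice
    pending_latent: Optional[str] = None   # latents produced during the previous step
    emitted: List[str] = []

    for k, obs in enumerate(user_units, start=1):
        # 1. Thinker: encode the observation, update the KV cache,
        #    and decode the latents that arrived from the previous step.
        kv_slice = f"kv({obs})"
        thinker_kv.append(kv_slice)
        if pending_latent is not None:
            emitted.append(f"decoded({pending_latent})")
        # 2. The new KV slice is sent to the performer.
        # 3. Performer: append the slice and solve for the next clean latents.
        performer_kv.append(kv_slice)
        latent = f"latent{k}|ctx={len(performer_kv)}"
        # 4. The clean latents reach the thinker at the next step.
        pending_latent = latent
    return emitted
```

Running it on three units emits two decoded outputs, each produced one step after its latents were generated, which is the overlap that lets decoding for unit k proceed while denoising for the next unit is under way.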

Empirical Validation / Results

Latency and Runtime Comparison

Wan-Streamer achieves ~200 ms model-side signal-to-signal latency and ~550 ms total interaction latency for a remote user, of which ~350 ms is bidirectional network delay.

Table 1: Response-latency comparison for real-time speech and omni-modal interaction systems

| System | Interaction | User-visible response | Other reported metric | Comparison boundary |
|---|---|---|---|---|
| Doubao Realtime Voice | speech-to-speech | ~1 s overall | ~700 ms bare-model latency | Speech-only product |
| Seeduplex | speech-to-speech | N/R | −250 ms endpoint latency | Speech-only; relative improvement |
| GPT-4o / Realtime API | speech-to-speech, audio/vision input | protocol-dependent | 232/320 ms official audio; ~500 ms API TTFB | Mix of model/API/network |
| Hume EVI 3 | speech-to-speech | 0.9–1.4 s web-app | <300 ms model response | No visual output |
| Gemini Live API | speech-to-speech | 1.2–3.6 s API | N/R model-side | Vendor benchmark |
| Moshi | speech-to-speech | N/R | 160 ms theoretical, 200 ms practical | Native duplex speech; no visual |
| Wan-Streamer (ours) | text/audio/video in/out | ~550 ms total (including 350 ms network) | ~200 ms model-side; 25 FPS video | One end-to-end model; synchronized visual |

Table 2: Runtime comparison with visual agents, streaming avatars, and audio-visual generators

| System | Visual interaction scope | Reported runtime | Main difference |
|---|---|---|---|
| Body of Her | end-to-end humanoid agent | next frame ≤42 ms (24 FPS) | No deployed signal-to-signal latency |
| X-Streamer | open-ended video chat portrait | 25 FPS on 2×A100 | Absolute latency not disclosed |
| Avatar Forcing (Ki et al.) | interactive head-avatar | ~500 ms reaction latency | No dialogue generation |
| Hallo-Live | text-driven joint A/V avatar | 20.38 FPS, 0.94 s latency | Text-driven; no continuous perception |
| Wan-Streamer (ours) | text/audio/video interaction | 25 FPS; ~550 ms total; ~200 ms model-side | Single causal Transformer |

Wan-Streamer’s 550 ms total latency covers the full audio-visual response path (perception, reasoning, speech generation, synchronized visual output), unlike speech-only or component-level systems.

Naturalness

  • Idle state: maintains identity, gaze, posture, breathing, subtle facial motion over streaming history (no frozen portrait).
  • Listening state: produces responsive non-verbal feedback (gaze shifts, nods, micro-expressions) coupled with user speech/visual cues.
  • Speech-video synchronization: lip motion, facial dynamics, and prosody are synchronized natively via joint causal prediction before decoding.

Interruption and Proactive Speaking

Full-duplex behavior is learned from interleaved interaction data, not hand-crafted turn-taking rules. The model continues consuming user audio-video while generating its own response, enabling natural interruption handling. It can also initiate relevant comments based on salient visual events, moving from passive Q&A to proactive continuous exchange.

Theoretical and Practical Implications

  • Theoretical: Demonstrates that full-duplex audio-visual interaction can be modeled as a single causal stream across modalities, without requiring separate modules for perception, reasoning, generation, and synchronization. The block-causal attention and fully causal VAEs provide a principled framework for streaming multimodal sequence modeling.
  • Practical: Achieves sub-second interaction latency with synchronized audio-visual output, directly applicable to embodied assistants, real-time digital humans, live broadcasting, and interactive entertainment. The thinker-performer pipeline enables overlapping of compute-intensive tasks while preserving model state, making real-time deployment feasible on two GPUs.
  • Contrast to cascaded systems: Eliminates accumulated errors and waiting times at module boundaries. Response timing, turn management, identity preservation, and cross-modal consistency are learned jointly rather than engineered as post-hoc rules.

Conclusion

Wan-Streamer is a native-streaming, end-to-end foundation model for real-time full-duplex text, audio, and video interaction. By representing user inputs and agent outputs across all modalities as one causal stream processed by a single Transformer, it overcomes the limitations of cascaded systems. Fully causal VAEs, encoders/decoders, and block-causal attention enable sub-second interaction latency (~200 ms model-side, ~550 ms total) at 25 FPS with 160 ms streaming units. The current v0.1 results are a proof of concept at 192p resolution; scaling to higher resolutions is straightforward. The work suggests that real-time multimodal agents should be designed from the ground up as native full-duplex systems, where listening, seeing, speaking, and visible response are learned jointly rather than assembled as post-hoc modules.
