# Stream-T1: Test-Time Scaling for Streaming Video Generation

## Summary (Overview)

  • Core Innovation: Introduces Stream-T1, the first comprehensive Test-Time Scaling (TTS) framework specifically designed for streaming (chunk-by-chunk) video generation, addressing the computational and temporal guidance limitations of existing video TTS methods.
  • Three Key Mechanisms: The framework consists of three novel components:
    1. Stream-Scaled Noise Propagation: Refines initial noise for each chunk using high-quality noise from previous chunks to establish temporal dependency.
    2. Stream-Scaled Reward Pruning: Evaluates candidates using a combined short-term (frame-level) and long-term (sliding window) reward to balance spatial aesthetics and temporal coherence.
    3. Stream-Scaled Memory Sinking: Dynamically updates the KV-cache memory sink (Discard, EMA-Sink, or Append-Sink) based on reward feedback to preserve long-term semantics.
  • Performance: Demonstrates significant improvements over state-of-the-art baselines (CausVid, Self-forcing, LongLive) on both 5-second and 30-second video benchmarks, achieving superior temporal consistency, motion smoothness, and visual quality.
  • Paradigm Shift: Moves from passive candidate selection (like Best-of-N) to an active optimization paradigm that guides generation by refining noise and memory.

## Introduction and Theoretical Foundation

The paper addresses key challenges in video generation: maintaining long-term semantic alignment, motion coherence, and temporal consistency. While scaling models during training is costly, Test-Time Scaling (TTS) offers a promising alternative by boosting performance during inference. However, existing video TTS methods (e.g., ImagerySearch) that generate the entire video simultaneously suffer from:

  • High Computational Cost: Searching in a global, high-dimensional space with multi-step denoising per candidate.
  • Lack of Fine-grained Temporal Control: Cannot inject temporal guidance or correct localized artifacts without rejecting the entire sequence.

The authors propose shifting focus to streaming video generation, which operates autoregressively in chunks with few denoising steps (e.g., 4 steps per chunk). This paradigm is intrinsically aligned with TTS principles, forming a "shallow search tree with wide branches" that lowers computational overhead and enables fine-grained temporal control.

Theoretical Foundation: The generation process is based on autoregressive video diffusion models. Given a text prompt $c$, the joint distribution of $N$ frames $x_{1:N} = (x_1, x_2, \dots, x_N)$ is factorized as:

$$
p_\theta(x_{1:N} | c) = \prod_{i=1}^{N} p_\theta(x_i | x_{<i}, c).
$$

Each conditional step $p_\theta(x_i | x_{<i}, c)$ is modeled by a few-step denoising diffusion model $G_\theta$. Starting from pure noise $x^i_{t_T} \sim \mathcal{N}(0, I)$, the model generates the $i$-th chunk by iteratively denoising:

$$
p_\theta(x^i | x_{<i}, c) = f_{\theta,t_1} \circ f_{\theta,t_2} \circ \cdots \circ f_{\theta,t_T}(x^i_{t_T}),
$$

where $f_{\theta,t_j}(x^i_{t_j}) = \Psi(G_\theta(x^i_{t_j}, t_j, x_{<i}, c), \epsilon_{t_{j-1}}, t_{j-1})$.

To manage context overload for long videos, methods like Self-forcing use a sliding window $p(x^i | x^{i-w+1:i-1})$, while LongLive additionally anchors the initial chunk, $p(x^i | x^1, x^{i-w+1:i-1})$. Reward Forcing uses an EMA-Sink to compress history but can blur features. Stream-T1 builds upon LongLive but introduces active, reward-guided mechanisms for noise initialization, candidate pruning, and memory management.
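To make the factorization above concrete, here is a minimal PyTorch-style sketch of chunk-by-chunk generation with a few denoising steps and a sliding-window context. The `G` and `psi` callables stand in for $G_\theta$ and $\Psi$, and the timestep grid, window size, and tensor layout are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def generate_streaming(G, psi, prompt_emb, num_chunks, chunk_shape,
                       timesteps=(1000, 750, 500, 250), window=9):
    """Illustrative chunk-by-chunk autoregressive generation with few-step denoising.

    G   : stand-in for G_theta(x_t, t, context, prompt) -> clean-chunk estimate
    psi : stand-in for Psi(x0_hat, eps, t_prev) -> re-noised latent at step t_prev
    """
    chunks = []
    for i in range(num_chunks):
        # Condition only on chunks inside the sliding window (Self-forcing style).
        context = chunks[max(0, i - window + 1):i]
        # Baseline: each chunk starts from pure Gaussian noise.
        x = torch.randn(chunk_shape)
        # A few denoising steps per chunk (e.g., T = 4).
        for j, t in enumerate(timesteps):
            x0_hat = G(x, t, context, prompt_emb)
            if j + 1 < len(timesteps):
                # Re-noise the estimate down to the next timestep.
                x = psi(x0_hat, torch.randn_like(x0_hat), timesteps[j + 1])
            else:
                x = x0_hat
        chunks.append(x)
    # Concatenate chunks along the (assumed) frame axis.
    return torch.cat(chunks, dim=0)
```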

## Methodology

The framework operates in three sequential phases per chunk generation (see Figure 1):

### 3.2 Stream-Scaled Noise Propagation

Instead of randomly sampling the initial noise $x^n_T$ for chunk $n$ from $\mathcal{N}(0, I)$, it is constructed via spherical interpolation with the optimal noise latent from the previous chunk $n-1$:

$$
x^0_T \sim \mathcal{N}(0, I), \qquad x^n_T = \beta x^{n-1}_T + \sqrt{1 - \beta^2}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),
$$

where $\beta \in (-1, 1)$ is an interpolation hyperparameter controlling temporal correlation. This construction keeps the marginal distribution of $x^n_T$ equal to $\mathcal{N}(0, I)$.
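A minimal sketch of this step, assuming the noise latent that produced the selected candidate for chunk $n-1$ is retained; the function name and latent shape are illustrative assumptions.

```python
import torch

def propagate_noise(prev_best_noise, beta=0.5):
    """Spherically interpolate the previous chunk's best initial noise with fresh noise.

    Since beta**2 + (1 - beta**2) = 1, the output keeps unit variance, so its
    marginal distribution remains N(0, I) while staying correlated with chunk n-1.
    """
    eps = torch.randn_like(prev_best_noise)
    return beta * prev_best_noise + (1.0 - beta ** 2) ** 0.5 * eps

# Usage: the first chunk starts from plain Gaussian noise, later chunks reuse it.
x_T_prev = torch.randn(16, 4, 60, 104)           # hypothetical latent shape, chunk n-1
x_T_next = propagate_noise(x_T_prev, beta=0.5)   # correlated initial noise for chunk n
```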

### 3.3 Stream-Scaled Reward Pruning

A beam search algorithm (beam size $K$, expansion $M$) is guided by a novel reward function that combines short-term (frame-level) and long-term (sliding-window) evaluations:

**Short Score** ($S^n_{\text{short}}$): Average of frame-level image rewards.

$$
S^n_{\text{short}} = \frac{1}{F} \sum_{f=1}^{F} \text{ImageReward}(x^n_0[:, f]),
$$

where $F$ is the number of frames in chunk $x^n_0$.

**Long Score** ($S^n_{\text{long}}$): Video reward over a sliding window of $w$ chunks.

$$
S^n_{\text{long}} = \text{VideoReward}(\text{output}[:, \max(0, n-w+1): n]).
$$
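For illustration, the two scores could be computed roughly as follows. `image_reward_fn` and `video_reward_fn` are hypothetical stand-ins for the frame-level and video-level reward models (e.g., ImageReward and VisionReward/VideoAlign); their interfaces and the windowing details are assumptions.

```python
import torch

def short_score(chunk_frames, image_reward_fn, prompt):
    """Average a frame-level image reward over the F decoded frames of one chunk."""
    scores = [image_reward_fn(frame, prompt) for frame in chunk_frames]
    return sum(scores) / len(scores)

def long_score(all_chunks, n, w, video_reward_fn, prompt):
    """Score a sliding window of the most recent chunks with a video reward model."""
    window = all_chunks[max(0, n - w + 1): n + 1]   # chunks n-w+1 ... n (inclusive end assumed)
    clip = torch.cat(window, dim=0)                 # concatenate along the frame axis
    return video_reward_fn(clip, prompt)
```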

**Final Score** ($S^n_{\text{final}}$): Dynamic weighted fusion with a threshold constraint $\tau$ to avoid frame repetition.

$$
S^n_{\text{final}} =
\begin{cases}
\frac{n}{N} \cdot S^n_{\text{short}} + \left(1 - \frac{n}{N}\right) \cdot S^n_{\text{long}}, & \frac{n}{N} \leq \tau, \\
\tau \cdot S^n_{\text{short}} + (1 - \tau) \cdot S^n_{\text{long}}, & \frac{n}{N} > \tau,
\end{cases}
$$

where $n$ is the current chunk index and $N$ is the total number of chunks.

### 3.4 Stream-Scaled Memory Sinking

A reward-guided mechanism dynamically routes the evicted KV-cache $(K^n, V^n)$ (pushed out of the sliding window of size $w$) into one of three pathways, based on conditions derived from the reward scores.

**Semantic Boundary Detection Conditions:**

1. **Quality Gate ($C_{\text{quality}}$):** Ensures only high-quality chunks enter the sink.
$$
C_{\text{quality}} := S^n_{\text{short}} - \bar{S}_{\text{short}} > \tau_{\text{short}},
$$
where $\bar{S}_{\text{short}}$ is the moving average of historical short scores.

2. **Transition Detector ($C_{\text{transition}}$):** Identifies scene/motion changes.
$$
C_{\text{transition}} := S^{n-1}_{\text{long}} - S^n_{\text{long}} > \tau_{\text{long}}.
$$

**Dynamic Memory Update Pathways:**

* **Discard:** If $\neg C_{\text{quality}}$, discard $(K^n, V^n)$:
$$
S^{n+w}_K = S^{n+w-1}_K, \quad S^{n+w}_V = S^{n+w-1}_V.
$$
* **EMA-Sink:** If $C_{\text{quality}} \wedge \neg C_{\text{transition}}$ (high quality, no transition), integrate into the latest sink entry via an exponential moving average with decay factor $\alpha$:
$$
S^{n+w}_K = [S^{n+w-1}_K[:-1];\ \alpha \cdot S^{n+w-1}_K[-1:] + (1-\alpha) \cdot K^n],
$$
$$
S^{n+w}_V = [S^{n+w-1}_V[:-1];\ \alpha \cdot S^{n+w-1}_V[-1:] + (1-\alpha) \cdot V^n].
$$
* **Append-Sink:** If $C_{\text{quality}} \wedge C_{\text{transition}}$ (high quality with a transition), append as a new discrete anchor:
$$
S^{n+w}_K = [S^{n+w-1}_K;\ K^n], \quad S^{n+w}_V = [S^{n+w-1}_V;\ V^n].
$$

During attention, the global context is $K^{n+w}_{\text{global}} = [S^{n+w}_K;\ K^{n+1:n+w}]$ and $V^{n+w}_{\text{global}} = [S^{n+w}_V;\ V^{n+1:n+w}]$.
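The routing logic above can be summarized in a short sketch. This is a minimal, illustrative implementation that stores per-chunk sink entries in Python lists and takes scalar reward scores as inputs; the function names, default thresholds, and decay factor are assumptions, not the paper's exact code.

```python
import torch

def update_memory_sink(sink_K, sink_V, K_n, V_n,
                       s_short, s_short_avg, s_long_prev, s_long,
                       tau_short=0.0, tau_long=0.0, alpha=0.9):
    """Reward-guided routing of an evicted KV pair (K_n, V_n) into the memory sink.

    sink_K / sink_V: lists of per-chunk sink tensors (S_K, S_V).
    """
    c_quality = (s_short - s_short_avg) > tau_short     # quality gate
    c_transition = (s_long_prev - s_long) > tau_long    # scene/motion transition detector

    if not c_quality:
        # Discard: a low-quality chunk never enters the sink.
        return sink_K, sink_V
    if c_transition or not sink_K:
        # Append-Sink: keep the chunk as a new discrete anchor after a transition
        # (also used here when the sink is still empty).
        sink_K.append(K_n)
        sink_V.append(V_n)
    else:
        # EMA-Sink: blend the chunk into the most recent sink entry.
        sink_K[-1] = alpha * sink_K[-1] + (1 - alpha) * K_n
        sink_V[-1] = alpha * sink_V[-1] + (1 - alpha) * V_n
    return sink_K, sink_V

def global_context(sink_K, sink_V, window_K, window_V):
    """Concatenate sink entries with the current sliding-window KV lists for attention."""
    K_global = torch.cat(sink_K + window_K, dim=0)
    V_global = torch.cat(sink_V + window_V, dim=0)
    return K_global, V_global
```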
## Empirical Validation / Results

**Implementation:** Evaluated on LongLive (based on Wan2.1-T2V-1.3B) with window size 9 and sink size 3. Short-sequence rewards used image models (HPSv3, ImageReward, MHP); long-sequence rewards used video models (VisionReward, VideoAlign, VideoLLaMA3) over a 10-chunk sliding window. Videos: 16 FPS, 832×480.

**Benchmarks:** 5s videos (946 prompts from VBench) and 30s videos (128 prompts from MovieGen). Evaluated with VBench/VBench-Long (Subject/Background Consistency, Motion Smoothness, Aesthetic/Imaging Quality) and VideoAlign (Visual Quality VQ, Motion Quality MQ, Text Alignment TA).

### Quantitative Results

**Table 1: 5s Video Generation Comparison**

| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CausVid | 96.33 | 95.56 | 98.66 | 69.69 | 62.90 | 0.433 | 0.550 | 1.02 |
| Self-forcing | 95.26 | 95.67 | 98.67 | 71.61 | 63.97 | 0.099 | 0.088 | 1.193 |
| LongLive | 97.00 | 96.78 | 99.12 | 71.28 | 65.28 | 0.285 | 0.350 | 1.193 |
| **Stream-T1 (on LongLive)** | **97.25** | **97.05** | **99.15** | 71.42 | **65.98** | 0.426 | **0.629** | **1.305** |
| △ | +0.26% | +0.28% | +0.03% | +0.2% | +1.07% | +49.47% | +79.71% | +9.39% |

**Table 2: 30s Video Generation Comparison**

| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CausVid | 97.91 | 96.74 | 98.15 | 66.32 | 59.71 | -0.144 | 0.328 | 0.501 |
| Self-forcing | 97.18 | 96.37 | 98.35 | 68.35 | 59.19 | -0.461 | -0.216 | 0.656 |
| LongLive | 97.90 | 96.82 | 98.78 | 68.99 | 61.56 | -0.169 | -0.002 | 1.073 |
| **Stream-T1 (on LongLive)** | **98.43** | **97.18** | **99.03** | **69.10** | **62.11** | **-0.073** | 0.226 | **1.170** |
| △ | +0.54% | +0.37% | +0.25% | +0.16% | +0.89% | +56.8% | +11400% | +9% |

**Table 3: Comparison with TTS Methods (30s)**

| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| LongLive | 97.90 | 96.82 | 98.78 | 68.99 | 61.56 | -0.169 | -0.002 | 1.073 |
| + Best of N | 98.13 | 96.88 | 98.86 | 69.34 | 61.97 | -0.083 | 0.062 | 1.160 |
| + BeamSearch | 98.28 | 97.03 | 98.90 | 69.05 | 61.85 | -0.077 | 0.165 | 1.159 |
| **+ Ours (Stream-T1)** | **98.43** | **97.18** | **99.03** | **69.10** | **62.11** | **-0.073** | **0.226** | **1.170** |

**Ablation Study (Table 4, 30s generation):**

* **Without Stream-Scaled Memory Sinking:** Gains in Imaging Quality but severe drops in Subject/Background Consistency.
* **Without Stream-Scaled Noise Propagation:** Uniform performance drop across all metrics.
* **Without Stream-Scaled Reward Pruning:** Marginal Imaging Quality increase but drastic decline in other metrics.

### Qualitative Results

As shown in Figure 3, baseline models suffer severe quality degradation (visual distortion, temporal inconsistency) in long sequences, while **Stream-T1** maintains high spatiotemporal coherence and visual aesthetics throughout 30s videos.

## Theoretical and Practical Implications

* **Theoretical:** Demonstrates that streaming video generation's chunk-level synthesis and few denoising steps are intrinsically suited to Test-Time Scaling, enabling efficient, fine-grained temporal control. Introduces an active optimization paradigm in place of passive selection.
* **Practical:** Provides a cost-effective framework that significantly enhances video generation quality (temporal consistency, motion smoothness, visual fidelity) without increasing training costs. The three-component design offers a blueprint for integrating TTS into autoregressive video diffusion models.
* **Memory Management:** The dynamic, reward-guided memory sinking mechanism addresses the spatial-temporal trade-off in streaming generation, overcoming the rigid frame copying of static sinks and the feature corruption of uniform EMA blending.
## Conclusion

**Stream-T1** is a novel TTS framework for streaming video generation that operates through three phases: noise propagation, reward pruning, and memory sinking. It actively guides generation by refining noise initialization, evaluating candidates with a combined reward, and dynamically updating memory based on semantic boundaries. Extensive experiments on 5s and 30s benchmarks show that it significantly improves temporal consistency, motion smoothness, and frame-level visual quality over state-of-the-art baselines. The framework establishes a new paradigm for efficient, high-quality long video generation.