# Stream-T1: Test-Time Scaling for Streaming Video Generation

> Stream-T1 introduces a test-time scaling framework that actively optimizes streaming video generation by refining noise, pruning candidates with multi-scale rewards, and dynamically managing memory for superior temporal coherence and visual quality.

- **Source:** [arXiv](https://arxiv.org/abs/2605.04461)
- **Published:** 2026-05-08
- **Permalink:** https://picx.dev/p/0KMZhc
- **Whiteboard:** https://picx.dev/p/0KMZhc/image

## Summary (Overview)
*   **Core Innovation:** Introduces **Stream-T1**, the first comprehensive Test-Time Scaling (TTS) framework specifically designed for streaming (chunk-by-chunk) video generation, addressing the computational and temporal guidance limitations of existing video TTS methods.
*   **Three Key Mechanisms:** The framework consists of three novel components:
    1.  **Stream-Scaled Noise Propagation:** Refines initial noise for each chunk using high-quality noise from previous chunks to establish temporal dependency.
    2.  **Stream-Scaled Reward Pruning:** Evaluates candidates using a combined short-term (frame-level) and long-term (sliding window) reward to balance spatial aesthetics and temporal coherence.
    3.  **Stream-Scaled Memory Sinking:** Dynamically updates the KV-cache memory sink (Discard, EMA-Sink, or Append-Sink) based on reward feedback to preserve long-term semantics.
*   **Performance:** Demonstrates significant improvements over state-of-the-art baselines (CausVid, Self-forcing, LongLive) on both 5-second and 30-second video benchmarks, achieving superior temporal consistency, motion smoothness, and visual quality.
*   **Paradigm Shift:** Moves from passive candidate selection (such as Best-of-N) to an **active optimization paradigm** that guides generation by refining noise and memory.

## Introduction and Theoretical Foundation
The paper addresses key challenges in video generation: maintaining long-term semantic alignment, motion coherence, and temporal consistency. While scaling models during training is costly, Test-Time Scaling (TTS) offers a promising alternative by boosting performance during inference. However, existing video TTS methods (e.g., ImagerySearch) that generate the entire video simultaneously suffer from:
*   **High Computational Cost:** Searching in a global, high-dimensional space with multi-step denoising per candidate.
*   **Lack of Fine-grained Temporal Control:** Cannot inject temporal guidance or correct localized artifacts without rejecting the entire sequence.

The authors propose shifting focus to **streaming video generation**, which operates autoregressively in chunks with few denoising steps (e.g., 4 steps per chunk). This paradigm is intrinsically aligned with TTS principles, forming a "shallow search tree with wide branches" that lowers computational overhead and enables fine-grained temporal control.

**Theoretical Foundation:** The generation process is based on autoregressive video diffusion models. Given a text prompt $c$, the joint distribution of $N$ frames $x^{1:N} = (x^1, x^2, \dots, x^N)$ is factorized as:
$$ p_\theta(x^{1:N} | c) = \prod_{i=1}^{N} p_\theta(x^i | x^{<i}, c). $$
Each conditional step $p_\theta(x^i | x^{<i}, c)$ is modeled by a few-step denoising diffusion model $G_\theta$. Starting from pure noise $x^i_{t_T} \sim \mathcal{N}(0, I)$, the model generates the $i$-th chunk by iteratively denoising:
$$ p_\theta(x^i | x^{<i}, c) = f_{\theta,t_1} \circ f_{\theta,t_2} \circ \cdots \circ f_{\theta,t_T}(x^i_{t_T}), $$
where $f_{\theta,t_j}(x^i_{t_j}) = \Psi(G_\theta(x^i_{t_j}, t_j, x^{<i}, c), \epsilon_{t_{j-1}}, t_{j-1})$.

To manage context overload for long videos, methods like Self-forcing use a sliding window $p(x^i | x^{i-w+1:i-1})$, while LongLive anchors initial chunks $p(x^i | x^1, x^{i-w+1:i-1})$. Reward Forcing uses EMA-Sink to compress history but can blur features. **Stream-T1** builds upon LongLive but introduces active, reward-guided mechanisms for noise initialization, candidate pruning, and memory management.
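
The two context schemes above differ only in which earlier chunks stay visible. A minimal sketch, assuming 1-based chunk indices; the function names `sliding_window_context` and `anchored_context` are hypothetical and not taken from Self-forcing or LongLive:

```python
def sliding_window_context(i, w):
    """Chunk indices visible to chunk i under a plain sliding window.

    Mirrors p(x^i | x^{i-w+1:i-1}): at most the w-1 most recent chunks.
    """
    start = max(1, i - w + 1)
    return list(range(start, i))


def anchored_context(i, w):
    """Chunk indices visible to chunk i when the first chunk is kept as an anchor.

    Mirrors p(x^i | x^1, x^{i-w+1:i-1}); chunk 1 is listed only once
    when it still falls inside the window.
    """
    window = sliding_window_context(i, w)
    if i > 1 and 1 not in window:
        return [1] + window
    return window


print(sliding_window_context(10, 4))  # [7, 8, 9]
print(anchored_context(10, 4))        # [1, 7, 8, 9]
```

Once the window has moved past the start of the video, the anchored variant still attends to chunk 1, which is the long-range reference that the plain sliding window loses.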

## Methodology
The framework operates in three sequential phases per chunk generation (see Figure 1):

### 3.2 Stream-Scaled Noise Propagation
Instead of sampling the initial noise $x^n_T$ for chunk $n$ independently from $\mathcal{N}(0, I)$, Stream-T1 constructs it via spherical interpolation with the optimal noise latent from the previous chunk $n-1$:
$$ x^0_T \sim \mathcal{N}(0, I), $$
$$ x^n_T = \beta x^{n-1}_T + \sqrt{1 - \beta^2} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), $$
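where $\beta \in (-1, 1)$ is an interpolation hyperparameter controlling temporal correlation. Because $\epsilon$ is independent of $x^{n-1}_T$ and $\beta^2 + (1 - \beta^2) = 1$, the marginal distribution of $x^n_T$ remains $\mathcal{N}(0, I)$, while each element of $x^n_T$ has correlation $\beta$ with the corresponding element of $x^{n-1}_T$.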
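
The variance-preserving property of this update can be checked numerically. The sketch below is an illustration only: it treats the noise latent as a flat list of scalars, and the value `beta = 0.7` is an arbitrary choice, not a setting reported in the paper.

```python
import math
import random


def propagate_noise(prev_noise, beta, rng):
    """Return beta * prev + sqrt(1 - beta^2) * eps elementwise, with eps ~ N(0, 1)."""
    scale = math.sqrt(1.0 - beta * beta)
    return [beta * p + scale * rng.gauss(0.0, 1.0) for p in prev_noise]


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)
beta = 0.7  # illustrative value, not taken from the paper
prev = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
curr = propagate_noise(prev, beta, rng)

# Empirical variance of the new noise and its correlation with the old noise.
var_prev = mean([p * p for p in prev]) - mean(prev) ** 2
var_curr = mean([c * c for c in curr]) - mean(curr) ** 2
cov = mean([p * c for p, c in zip(prev, curr)]) - mean(prev) * mean(curr)
corr = cov / math.sqrt(var_prev * var_curr)

print(f"variance ~ {var_curr:.3f}, correlation ~ {corr:.3f}")
```

The printed variance should stay close to 1 and the correlation close to `beta`, so a larger `beta` ties consecutive chunks more tightly without pushing the noise off the distribution the denoiser expects.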

### 3.3 Stream-Scaled Reward Pruning
A beam search algorithm (beam size $K$, expansion $M$) is guided by a novel reward function that combines **short-term** (frame-level) and **long-term** (sliding window) evaluations:

**Short Score ($S^n_{\text{short}}$):** Average of frame-level image rewards.
$$ S^n_{\text{short}} = \frac{1}{F} \sum_{f=1}^{F} \text{ImageReward}(x^n_0[:, f]), $$
where $F$ is the number of frames in chunk $x^n_0$.

**Long Score ($S^n_{\text{long}}$):** Video reward over a sliding window of $w$ chunks.
$$ S^n_{\text{long}} = \text{VideoReward}(\text{output}[:, \max(0, n-w+1): n]). $$

**Final Score ($S^n_{\text{final}}$):** Dynamic weighted fusion with a threshold constraint $\tau$ to avoid frame repetition.
$$ S^n_{\text{final}} = \begin{cases}
\frac{n}{N} \cdot S^n_{\text{short}} + (1 - \frac{n}{N}) \cdot S^n_{\text{long}}, & \frac{n}{N} \leq \tau, \\
\tau \cdot S^n_{\text{short}} + (1 - \tau) \cdot S^n_{\text{long}}, & \frac{n}{N} > \tau,
\end{cases} $$
where $n$ is the current chunk index and $N$ is the total number of chunks. The weight on the short-term score therefore grows linearly with $n/N$ and is capped at $\tau$.
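
The piecewise fusion reduces to clamping the short-term weight at $\tau$, after which pruning keeps the top-scoring candidates. A minimal sketch under stated assumptions: the scores are plain floats, candidates are `(name, score)` pairs, and `final_score` and `prune` are hypothetical helper names, not the authors' code.

```python
def final_score(n, total_chunks, s_short, s_long, tau):
    """Fuse short- and long-term rewards; the short-term weight is min(n/N, tau)."""
    weight = min(n / total_chunks, tau)
    return weight * s_short + (1.0 - weight) * s_long


def prune(candidates, beam_size):
    """Keep the beam_size highest-scoring (name, score) pairs."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]


# Early chunk: weight n/N = 0.2 on the short score.
early = final_score(n=2, total_chunks=10, s_short=1.0, s_long=0.0, tau=0.5)
# Late chunk: n/N = 0.8 exceeds tau, so the weight is capped at 0.5.
late = final_score(n=8, total_chunks=10, s_short=1.0, s_long=0.0, tau=0.5)

# Two beams, each expanded into two candidates (K = 2, M = 2), pruned back to K.
candidates = [("a0", 0.31), ("a1", 0.72), ("b0", 0.55), ("b1", 0.12)]
kept = prune(candidates, beam_size=2)
print(early, late, kept)
```

With the cap in place, the long-term score always retains a weight of at least $1 - \tau$, which is how the threshold guards against late chunks being selected on frame-level quality alone.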

### 3.4 Stream-Scaled Memory Sinking
A reward-guided mechanism dynamically routes each KV-cache entry $(K^n, V^n)$ evicted from the sliding window of size $w$ into one of three pathways, chosen by conditions derived from the reward scores:

**Semantic Boundary Detection Conditions:**

1.  **Quality Gate ($C_{\text{quality}}$):** Ensures only high-quality chunks enter the sink.
    $$ C_{\text{quality}} := S^n_{\text{short}} - \bar{S}_{\text{short}} > \tau_{\text{short}}, $$
    where $\bar{S}_{\text{short}}$ is the moving average of historical short scores.

2.  **Transition Detector ($C_{\text{transition}}$):** Identifies scene/motion changes.
    $$ C_{\text{transition}} := S^{n-1}_{\text{long}} - S^n_{\text{long}} > \tau_{\text{long}}. $$
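
Both conditions are simple threshold tests on reward differences. In the sketch below, the moving average $\bar{S}_{\text{short}}$ is approximated by the plain mean of the scores passed in; that averaging scheme, the function names, and all numeric values are assumptions made for illustration.

```python
def quality_gate(s_short, short_history, tau_short):
    """C_quality: the current short score exceeds the historical mean by tau_short."""
    baseline = sum(short_history) / len(short_history)
    return s_short - baseline > tau_short


def transition_detected(prev_long, curr_long, tau_long):
    """C_transition: the long score dropped by more than tau_long since the last chunk."""
    return prev_long - curr_long > tau_long


history = [0.5, 0.6, 0.7]  # made-up short scores of earlier chunks
print(quality_gate(0.9, history, tau_short=0.1))        # True
print(quality_gate(0.65, history, tau_short=0.1))       # False
print(transition_detected(0.8, 0.5, tau_long=0.2))      # True
print(transition_detected(0.8, 0.75, tau_long=0.2))     # False
```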

**Dynamic Memory Update Pathways:**

*   **Discard:** If $\neg C_{\text{quality}}$, discard $(K^n, V^n)$.
    $$ S^{n+w}_K = S^{n+w-1}_K, \quad S^{n+w}_V = S^{n+w-1}_V. $$

*   **EMA-Sink:** If $C_{\text{quality}} \wedge \neg C_{\text{transition}}$ (high quality, no transition), integrate into the latest sink via exponential moving average with decay factor $\alpha$.
    $$ S^{n+w}_K = [S^{n+w-1}_K[:-1]; \alpha \cdot S^{n+w-1}_K[-1:] + (1-\alpha) \cdot K^n], $$
    $$ S^{n+w}_V = [S^{n+w-1}_V[:-1]; \alpha \cdot S^{n+w-1}_V[-1:] + (1-\alpha) \cdot V^n]. $$

*   **Append-Sink:** If $C_{\text{quality}} \wedge C_{\text{transition}}$ (high quality with transition), append as a new discrete anchor.
    $$ S^{n+w}_K = [S^{n+w-1}_K; K^n], \quad S^{n+w}_V = [S^{n+w-1}_V; V^n]. $$

During attention, the global context is: $K^{n+w}_{\text{global}} = [S^{n+w}_K; K^{n+1:n+w}]$, $V^{n+w}_{\text{global}} = [S^{n+w}_V; V^{n+1:n+w}]$.
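
The three pathways and the global context can be sketched on toy data. This is an illustration under assumptions: each cache entry is a short list of floats rather than a tensor, the same routine would be applied to keys and values separately, and treating an empty sink as an append is a choice made here that the summary does not specify.

```python
def update_sink(sink, evicted, quality_ok, transition, alpha):
    """Return the new sink after routing one evicted entry.

    sink is a list of entries (each a list of floats); evicted is one entry.
    """
    if not quality_ok:
        return list(sink)                    # Discard: sink unchanged
    if transition or not sink:
        return list(sink) + [list(evicted)]  # Append-Sink: new discrete anchor
    # EMA-Sink: blend the evicted entry into the most recent sink entry.
    blended = [alpha * s + (1.0 - alpha) * e for s, e in zip(sink[-1], evicted)]
    return list(sink[:-1]) + [blended]


def global_context(sink, window):
    """Concatenate the sink with the entries still inside the sliding window."""
    return list(sink) + list(window)


sink = [[1.0, 1.0]]
evicted = [3.0, 5.0]
discarded = update_sink(sink, evicted, quality_ok=False, transition=False, alpha=0.5)
blended = update_sink(sink, evicted, quality_ok=True, transition=False, alpha=0.5)
appended = update_sink(sink, evicted, quality_ok=True, transition=True, alpha=0.5)
print(discarded)  # [[1.0, 1.0]]
print(blended)    # [[2.0, 3.0]]
print(appended)   # [[1.0, 1.0], [3.0, 5.0]]
print(global_context(appended, [[7.0, 7.0]]))
```

Only the Append-Sink pathway lengthens the sink, so how often the transition detector fires determines how many discrete anchors the attention context accumulates.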

## Empirical Validation / Results
**Implementation:** Stream-T1 is evaluated on top of LongLive (based on Wan2.1-T2V-1.3B) with window size 9 and sink size 3. Short-sequence rewards use image models (HPSv3, ImageReward, MHP). Long-sequence rewards use video models (VisionReward, VideoAlign, VideoLLaMA3) over a 10-chunk sliding window. Videos are generated at 16 FPS and 832×480 resolution.

**Benchmarks:** 5s videos (946 prompts from VBench), 30s videos (128 prompts from MovieGen). Evaluated with VBench/VBench-long (Subject/Background Consistency, Motion Smoothness, Aesthetic/Imaging Quality) and VideoAlign (Visual Quality VQ, Motion Quality MQ, Text Alignment TA).

### Quantitative Results

**Table 1: 5s Video Generation Comparison**
| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CausVid | 96.33 | 95.56 | 98.66 | 69.69 | 62.90 | 0.433 | 0.550 | 1.02 |
| Self-forcing | 95.26 | 95.67 | 98.67 | 71.61 | 63.97 | 0.099 | 0.088 | 1.193 |
| LongLive | 97.00 | 96.78 | 99.12 | 71.28 | 65.28 | 0.285 | 0.350 | 1.193 |
| **Stream-T1 (on LongLive)** | **97.25** | **97.05** | **99.15** | 71.42 | **65.98** | 0.426 | **0.629** | **1.305** |
| △ (vs. LongLive) | +0.26% | +0.28% | +0.03% | +0.2% | +1.07% | +49.47% | +79.71% | +9.39% |

**Table 2: 30s Video Generation Comparison**
| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CausVid | 97.91 | 96.74 | 98.15 | 66.32 | 59.71 | -0.144 | 0.328 | 0.501 |
| Self-forcing | 97.18 | 96.37 | 98.35 | 68.35 | 59.19 | -0.461 | -0.216 | 0.656 |
| LongLive | 97.90 | 96.82 | 98.78 | 68.99 | 61.56 | -0.169 | -0.002 | 1.073 |
| **Stream-T1 (on LongLive)** | **98.43** | **97.18** | **99.03** | **69.10** | **62.11** | **-0.073** | 0.226 | **1.170** |
| △ (vs. LongLive) | +0.54% | +0.37% | +0.25% | +0.16% | +0.89% | +56.8% | +11400% | +9% |
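
The △ rows are consistent with a relative change measured against the LongLive baseline and normalized by the baseline's magnitude. The formula below is inferred from the table values rather than stated in the source, and `relative_gain` is a hypothetical helper name:

```python
def relative_gain(baseline, ours):
    """Percentage change of `ours` relative to the magnitude of `baseline`."""
    return (ours - baseline) / abs(baseline) * 100.0


# 30s results, LongLive -> Stream-T1
print(round(relative_gain(97.90, 98.43), 2))    # Subject Consistency
print(round(relative_gain(-0.169, -0.073), 1))  # VQ
print(round(relative_gain(-0.002, 0.226)))      # MQ
```

The very large MQ figure follows from the near-zero baseline (-0.002) in the denominator, so the absolute change of 0.228 is the more informative number for that column.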

**Table 3: Comparison with TTS Methods (30s)**
| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| LongLive | 97.90 | 96.82 | 98.78 | 68.99 | 61.56 | -0.169 | -0.002 | 1.073 |
| + Best of N | 98.13 | 96.88 | 98.86 | 69.34 | 61.97 | -0.083 | 0.062 | 1.160 |
| + BeamSearch | 98.28 | 97.03 | 98.90 | 69.05 | 61.85 | -0.077 | 0.165 | 1.159 |
| **+ Ours (Stream-T1)** | **98.43** | **97.18** | **99.03** | 69.10 | **62.11** | **-0.073** | **0.226** | **1.170** |

**Ablation Study (Table 4, 30s generation):**
*   **Without Stream-Scaled Memory Sinking:** Gains in Imaging Quality but severe drops in Subject/Background Consistency.
*   **Without Stream-Scaled Noise Propagation:** Uniform performance drop across all metrics.
*   **Without Stream-Scaled Reward Pruning:** Marginal Imaging Quality increase but drastic decline in other metrics.

### Qualitative Results
As shown in Figure 3, baseline models suffer severe quality degradation (visual distortion, temporal inconsistency) in long sequences, while **Stream-T1** maintains high spatiotemporal coherence and visual aesthetics throughout 30s videos.

## Theoretical and Practical Implications
*   **Theoretical:** Demonstrates that streaming video generation's chunk-level synthesis and few denoising steps are intrinsically suited for Test-Time Scaling, enabling efficient, fine-grained temporal control. Introduces an active optimization paradigm over passive selection.
*   **Practical:** Provides a cost-effective framework to significantly enhance video generation quality (temporal consistency, motion smoothness, visual fidelity) without increasing training costs. The three-component design offers a blueprint for integrating TTS into autoregressive video diffusion models.
*   **Memory Management:** The dynamic, reward-guided memory sinking mechanism addresses the spatial-temporal trade-off in streaming generation, overcoming issues of rigid frame copying (static sinks) and feature corruption (uniform EMA blending).

## Conclusion
**Stream-T1** is a novel TTS framework for streaming video generation that operates through three phases: noise propagation, reward pruning, and memory sinking. It actively guides generation by refining noise initialization, evaluating candidates with a combined reward, and dynamically updating memory based on semantic boundaries. Extensive experiments on 5s and 30s benchmarks show it significantly improves temporal consistency, motion smoothness, and frame-level visual quality over state-of-the-art baselines. The framework establishes a new paradigm for efficient, high-quality long video generation.

