A Simple Baseline for Streaming Video Understanding - Summary

Summary (Overview)

  • Simple Baseline: The paper introduces SimpleStream, a minimal streaming baseline that feeds only the most recent N frames to an off-the-shelf Vision-Language Model (VLM) for each query, without any additional memory, retrieval, compression, or training.
  • Competitive Performance: Despite its simplicity, SimpleStream matches or surpasses the performance of 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. With only 4 recent frames, it achieves 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench.
  • Perception-Memory Trade-off: Controlled analyses reveal a consistent trade-off: adding more historical context can improve recall-oriented tasks (memory), but often weakens real-time perception tasks.
  • Non-Monotonic Context Benefit: The value of longer context is backbone-dependent, not uniformly increasing with model scale or window size. Performance often peaks at a modest window size (e.g., 4 frames).
  • Call for Better Evaluation: The authors argue that future streaming benchmarks should separate recent-scene perception from long-range memory, and that new methods should be required to clearly outperform SimpleStream to demonstrate meaningful progress from added complexity.

Introduction and Theoretical Foundation

Streaming video understanding research has increasingly focused on complex memory-centric designs (e.g., explicit memory banks, retrieval, compression) to handle long video streams under causal constraints. This trend is based on the assumption that strong performance requires increasingly sophisticated memory mechanisms. However, these designs have delivered modest gains.

This paper challenges that trend with a simple finding: a baseline that uses only a short sliding window of recent frames with a strong, off-the-shelf VLM is already highly competitive. The authors formalize this as SimpleStream. The theoretical motivation is to reframe streaming inference as a context-management problem: at query time t, the system must construct a bounded working context C_t from the observed history. While prior methods differ in how they expand C_t (external memory, retrieval, compression, latent memory), SimpleStream isolates the value of recent, uncompressed visual context alone. This serves as a strong reference point to test whether added memory complexity is truly necessary.

Methodology

SimpleStream is a deliberately simple inference-time input policy applied to an off-the-shelf VLM.

Given a video stream of frames f_i and a text question q_t at time t, SimpleStream constructs the model's input as:

SimpleStream(t) = VLM({f_{t−N+1}, ..., f_t}, q_t)

Key Design Choices:

  1. No Architectural Changes: Uses an off-the-shelf VLM (e.g., Qwen2.5-VL, Qwen3-VL) without modification.
  2. No Additional Modules: Omits memory banks, retrieval systems, or compression modules.
  3. No Training: Applied directly at inference time.
  4. Bounded Computation: Only the last N frames are retained, so memory and computation do not grow with stream length.
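
The policy above can be sketched in a few lines of Python. The `vlm` callable is a stand-in for any off-the-shelf VLM interface (e.g., a wrapper around Qwen2.5-VL); it is an assumption for illustration, not the paper's actual API.

```python
def simplestream_answer(vlm, frame_stream, question, t, n_frames=4):
    """Answer a query at time t using only the last n_frames frames.

    `vlm` is a placeholder for any off-the-shelf VLM callable that
    accepts a list of frames plus a text question.
    """
    # Keep only the most recent N frames: {f_{t-N+1}, ..., f_t}.
    start = max(0, t - n_frames + 1)
    window = frame_stream[start : t + 1]
    # No memory bank, retrieval, or compression: the window is the
    # entire visual context handed to the model.
    return vlm(window, question)
```

Because the window is recomputed per query, the policy requires no training and no persistent state beyond the last N frames.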

Experimental Setup:

  • Benchmarks: Evaluated on OVO-Bench (1,640 questions across 12 tasks) and StreamingBench (2,500 questions).
  • Compared Models: 6 offline video LLMs and 7 streaming video LLMs, covering major design paradigms.
  • SimpleStream Instantiations: Primarily uses Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct backbones. The recent window size N is varied in {2, 4, 8, 16} frames, sampled at 1 fps.

Empirical Validation / Results

Main Benchmark Performance

Table 1 shows SimpleStream outperforms or matches published streaming methods on both benchmarks.

Table 1: Main results on OVO-Bench and StreamingBench. (Abridged for key comparisons; Avg. is the mean of Real-Time and Backward category averages.)

Model                                  #Frames   StreamingBench   OVO-Bench Avg.
Human                                  —         91.46            92.77
Offline VLMs (e.g., Qwen2.5-VL-7B)     1 fps     73.31            52.28
Streaming VLMs (best: HERMES-7B †)     1 fps     79.44            59.20
SimpleStream (Ours):
  Qwen2.5-VL-7B, 4-frame window        4         78.47            65.13
  Qwen3-VL-8B, 4-frame window          4         80.59            67.70

  • On OVO-Bench, SimpleStream (Qwen3-VL, 4f) achieves 67.7%, exceeding the best published streaming method (HERMES at 59.2%) by 8.5 percentage points.
  • On StreamingBench, SimpleStream (Qwen3-VL, 4f) reaches 80.59%, surpassing HERMES (79.44%).

Model Scale and Window Size Ablation

Table 2 and Figure 5 show that the optimal recent window size is not monotonic with model scale; it varies across backbone families and checkpoints.

Table 2: Model scale effects under a fixed recent-window protocol on OVO-Bench. (Abridged for Qwen3-VL family; Avg. = (Bwd. + Real-Time)/2.)

Model          2 frames Avg.   4 frames Avg.   8 frames Avg.   16 frames Avg.
Qwen3-VL-2B    58.55           60.53           60.12           60.35
Qwen3-VL-4B    64.03           65.81           65.97           66.06
Qwen3-VL-8B    66.38           67.70           67.37           67.15
Qwen3-VL-32B   72.43           73.49           74.09           73.81

  • Performance typically improves from 2 to 4 frames, then often plateaus or declines with larger windows.
  • Larger models can sometimes benefit from longer windows, but the relationship is not uniform.

Efficiency

SimpleStream is highly efficient in terms of latency and memory.

  • Latency (TTFT): SimpleStream-4f achieves the second-lowest time-to-first-token, remaining competitive with specialized streaming methods (see Table 3 in paper).
  • Peak GPU Memory: SimpleStream maintains the lowest and flattest memory curve because its state does not accumulate with stream length (see Figure 3 in paper).
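
The flat memory curve follows directly from SimpleStream's bounded state. A minimal sketch (a plain Python ring buffer standing in for a buffer of frame tensors):

```python
from collections import deque

class FrameBuffer:
    """Sketch of SimpleStream's bounded state: a fixed-size buffer of
    recent frames. Once full, each new frame evicts the oldest, so the
    state never grows with stream length."""

    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def push(self, frame):
        # O(1) append; deque with maxlen drops the oldest item when full.
        self.frames.append(frame)

    def window(self):
        # The full visual context handed to the VLM at query time.
        return list(self.frames)
```

Whether the stream runs for a minute or an hour, the buffer holds exactly N frames, which is why peak memory stays flat while memory-accumulating baselines grow with stream length.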

Perception-Memory Trade-off Analysis

The authors define metrics to quantify the trade-off:

  • Perception Change: ΔP = RT_method − RT_SimpleStream, where RT is the Real-Time (perception) category score
  • Memory Gain: ΔM = ER_method − ER_SimpleStream, where ER = (EPM + ASI) / 2

Figure 6 (in the paper) visualizes this trade-off relative to a SimpleStream anchor. The dominant pattern is that external baselines incur a perception cost (ΔP < 0) while sometimes achieving a memory gain (ΔM > 0). For example:

  • StreamForest: ΔM = +8.9, but ΔP = −13.8
  • HERMES: ΔM = +2.4, but ΔP = −6.0
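
The two metrics are direct score differences and can be computed as follows (the scores passed in below are hypothetical, not the paper's reported numbers):

```python
def perception_change(rt_method, rt_simplestream):
    """Delta P: Real-Time (perception) score relative to SimpleStream."""
    return rt_method - rt_simplestream

def memory_gain(epm_method, asi_method, epm_ss, asi_ss):
    """Delta M: episodic-recall score ER = (EPM + ASI) / 2,
    compared against the SimpleStream anchor."""
    er_method = (epm_method + asi_method) / 2
    er_ss = (epm_ss + asi_ss) / 2
    return er_method - er_ss
```

A method sits in the problematic quadrant when memory_gain is positive but perception_change is negative, which is the dominant pattern the paper reports for external baselines.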

Visual-RAG Ablation

Appending retrieved historical chunks (+V-RAG) confirms the trade-off (Table 4). While it improves episodic memory (EPM +7.1) and action sequence identification (ASI +6.1), it degrades real-time perception tasks like object recognition (OJR -9.2) and optical character recognition (OCR -8.1), leading to an overall accuracy drop.

Theoretical and Practical Implications

  1. Strong Baselines are Essential: Future work on streaming video understanding should compare against strong recency baselines like SimpleStream. Gains from added complexity (memory, retrieval, compression) must be clearly demonstrated under matched protocols.
  2. Disaggregated Evaluation: Benchmarks should separate perception, memory recall, and hallucination robustness in reporting. Aggregate scores can mask the perception-memory trade-off, as they often overweight perception-heavy tasks.
  3. Redefining Progress: The paper shifts the focus from "how to add more memory" to "how to use history without degrading current-scene understanding." A promising design principle is recent-first, history-on-demand.
  4. Benchmark Limitations: The paper critiques OVO-Bench for conflating memory recall (EPM, ASI) with hallucination robustness (HLD) in its "Backward Tracing" category, and for having a macro-average that favors perception-heavy gains.
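
The recent-first, history-on-demand principle from point 3 could be sketched as a two-stage policy. Everything below is an illustrative assumption, not the paper's method: the `needs_history` flag and the `retrieve_history` helper are hypothetical interfaces.

```python
def recent_first_answer(vlm, recent_frames, question, retrieve_history):
    """Hypothetical recent-first, history-on-demand policy.

    `vlm` is assumed to return (answer, needs_history); both that flag
    and `retrieve_history` are illustrative, not from the paper.
    """
    # Stage 1: answer from the recent window alone, preserving
    # current-scene perception.
    answer, needs_history = vlm(recent_frames, question)
    if not needs_history:
        return answer
    # Stage 2: only now pay the perception cost of extra historical
    # context, and only for queries that actually require it.
    context = retrieve_history(question) + recent_frames
    answer, _ = vlm(context, question)
    return answer
```

The intent is that perception-heavy queries never see historical context (avoiding the ΔP cost), while recall-heavy queries still get it (retaining the ΔM benefit).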

Conclusion

SimpleStream demonstrates that a minimal baseline using only recent visual context is sufficient to achieve state-of-the-art performance on current streaming video understanding benchmarks. The analyses reveal a systematic perception-memory trade-off and show that longer context is not uniformly better, with benefits being backbone-dependent. These findings necessitate a change in evaluation practice: future work should adopt strong recency baselines and disaggregated reporting to ensure that claimed improvements from complex memory mechanisms are genuine and not merely artifacts of benchmark design. The central open problem is not adding more memory, but using it without harming present-scene understanding.