Summary (Overview)

  • TURBOSERVE is the first serving system designed specifically for streaming video generation workloads, which involve long-lived sessions that generate video chunks progressively under tight per-chunk latency targets.
  • It addresses two key challenges: session duration heterogeneity (sessions persist for vastly different times) and temporal user-demand heterogeneity (active sessions fluctuate sharply over time).
  • The system formulates serving as an online scheduling problem and introduces a closed-loop scheduling algorithm that jointly coordinates migration-aware session placement and load-driven GPU autoscaling.
  • Runtime support includes coalesced chunk processing for batching concurrent sessions, GPU-CPU offloading for session suspension/resumption, and NCCL-based GPU-GPU migration for online rebalancing.
  • Evaluated on real-world production traces from Shengshu Technology across multiple model sizes and GPU clusters (up to 64 NVIDIA B300 GPUs), TURBOSERVE reduces worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average.

Introduction and Theoretical Foundation

Background and Motivation

Streaming video generation is an emerging serving workload where users interact with long-lived sessions that generate video progressively, chunk by chunk. Unlike offline video generation (e.g., Sora, HunyuanVideo) or typical LLM serving, streaming video generation must:

  • Preserve session state across active and idle periods.
  • Repeatedly schedule ongoing sessions.
  • Deliver each chunk under a tight latency target.

Current diffusion-based video generation systems (e.g., FastVideo, xDiT, vLLM-Omni) are optimized for stateless, one-shot requests. Streaming video generation violates this assumption because sessions preserve prompt context and cached temporal states (e.g., KV cache). This mismatch introduces two primary dimensions of heterogeneity:

  1. Session duration heterogeneity: Sessions persist for vastly different durations (from seconds to tens of minutes). Systems that treat sessions as short-lived requests make placement decisions that become suboptimal as long-running sessions accumulate.
  2. Temporal user-demand heterogeneity: User activity alternates between bursts and idle periods; the number of active sessions fluctuates sharply. Static provisioning either over-provisions GPUs (wasting resources) or under-provisions (causing high latency during bursts).

Theoretical Basis

The paper formulates streaming video generation serving as an online scheduling problem that jointly determines session placement and GPU provisioning. The objective is to minimize a weighted sum of GPU operating cost and worst-case per-chunk latency. The authors show through a characterization study (Section 3.2) that:

  • Session migration reduces bottleneck latency by rebalancing load across GPUs.
  • GPU autoscaling improves resource efficiency by adapting to workload variation.
  • Joint coordination of migration and autoscaling yields the best latency-cost trade-off.

These insights motivate the closed-loop design of TURBOSERVE.

Methodology

Problem Formulation (Section 5.1)

The serving problem is modeled as an event-driven online scheduling problem with two coupled control components: session placement and cluster autoscaling.

Key notations (Table 1):

| Notation | Description |
| --- | --- |
| $S(t)$ | Active session set at event $t$ |
| $G(t)$ | Currently provisioned GPUs at event $t$ |
| $M(t)$ | Number of active GPUs at event $t$ |
| $C(t)$ | GPU operating cost at event $t$ |
| $K$ | Max. concurrent sessions per GPU |
| $\phi_i(t)$ | Placement of session $s_i$ at event $t$ |
| $\alpha_i(t)$ | User-activity indicator for session $s_i$ at event $t$ |
| $L(t)$ | Worst-case per-chunk serving latency |
| $L^*(M, t)$ | Min. worst-case latency under budget $M(t)$ |
| $\rho_{\max}(t)$ | Max. normalized GPU load after placement |
| $\hat{\rho}(t)$ | Target per-GPU utilization |
| $\lambda(t)$ | Latency weight in the optimization objective |
| $M_{\text{tar}}(t)$ | Target GPU budget computed by autoscaling |

The optimization objective at each event $t$ is:

$$
\begin{aligned}
\arg\min_{M(t),\, \phi(t)} \quad & C(t) + \lambda(t) \cdot L(t), \\
\text{s.t.} \quad & |\{i : \phi_i(t) = g_j\}| \leq K, \quad \forall g_j \in G(t), \\
& \alpha_i(t) = 1 \implies \phi_i(t) \neq \emptyset, \quad \forall s_i \in S(t).
\end{aligned}
$$

The first constraint enforces per-GPU capacity; the second ensures that any session receiving user input must be actively executed.
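The objective and its two constraints can be checked mechanically for a candidate placement. The sketch below is illustrative only: the dictionary-based representation and the names `is_feasible` and `objective` are assumptions made for this example, not part of TURBOSERVE.

```python
from collections import Counter
from typing import Dict, Optional, Set


def is_feasible(placement: Dict[str, Optional[str]],
                active: Dict[str, bool],
                gpus: Set[str],
                K: int) -> bool:
    """Check both constraints of the per-event objective.

    placement maps a session id to a GPU id, or None if the session is not placed.
    active maps a session id to its user-activity indicator alpha_i(t).
    """
    counts = Counter(g for g in placement.values() if g is not None)
    # Constraint 1: every used GPU is provisioned and hosts at most K sessions.
    if any(g not in gpus or n > K for g, n in counts.items()):
        return False
    # Constraint 2: a session with alpha_i(t) = 1 must be placed on some GPU.
    return all(placement.get(s) is not None for s, a in active.items() if a)


def objective(cost: float, lam: float, worst_latency: float) -> float:
    """Weighted objective C(t) + lambda(t) * L(t)."""
    return cost + lam * worst_latency
```

For instance, with two provisioned GPUs and `K = 2`, placing two active sessions on one GPU while an idle third session stays unplaced is feasible; lowering `K` to 1 or marking the third session active makes the same placement infeasible.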

Closed-Loop Scheduling Algorithm (Section 5.2)

The algorithm consists of two tightly coupled controllers:

Placement Controller (Section 5.2.1): Operates at each event $t$ to determine the session assignment $\phi(t)$ under a fixed GPU budget $M(t)$. It approximately solves:

$$L^*(M, t) = \min_{\phi(t) \text{ feasible under } M(t)} L(t).$$

The controller performs:

  • Session assignment: For newly activated sessions, select the GPU that minimizes the resulting bottleneck latency.
  • Migration-aware min-max rebalancing: Iteratively considers migrating sessions from the bottleneck GPU to reduce $L(t)$. The gain of a candidate move $(i, j')$, which migrates session $s_i$ to GPU $g_{j'}$, is:

$$\Gamma_{i,j'}(t) = L(t) - L'(t) - \eta \cdot \kappa_i(t),$$

where $L'(t)$ is the worst-case latency after the move, $\kappa_i(t)$ is the migration cost, and $\eta > 0$ controls the trade-off. Moves with positive gain are applied until no move yields a positive gain.
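The rebalancing step described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the paper's implementation: per-GPU latency is assumed to depend only on the number of hosted sessions (the `latency` callable), the migration cost is a per-session constant (`kappa`), and the function name `rebalance` and the `max_moves` guard are hypothetical.

```python
from typing import Callable, Dict, List


def rebalance(placement: Dict[str, str],
              gpus: List[str],
              K: int,
              latency: Callable[[int], float],
              kappa: Dict[str, float],
              eta: float,
              max_moves: int = 100) -> Dict[str, str]:
    """Greedy min-max rebalancing: repeatedly apply the best positive-gain move."""
    placement = dict(placement)  # work on a copy; the input is not mutated
    for _ in range(max_moves):
        load = {g: 0 for g in gpus}
        for g in placement.values():
            load[g] += 1
        L = max(latency(n) for n in load.values())  # current worst-case latency
        bottleneck = max(gpus, key=lambda g: latency(load[g]))

        best_gain, best_move = 0.0, None
        for s, g in placement.items():
            if g != bottleneck:
                continue
            for dst in gpus:
                # Skip the source GPU and any GPU already at capacity K.
                if dst == bottleneck or load[dst] >= K:
                    continue
                new_load = dict(load)
                new_load[bottleneck] -= 1
                new_load[dst] += 1
                L_new = max(latency(n) for n in new_load.values())
                gain = L - L_new - eta * kappa[s]  # Gamma_{i,j'}(t)
                if gain > best_gain:
                    best_gain, best_move = gain, (s, dst)

        if best_move is None:  # no move has positive gain
            break
        s, dst = best_move
        placement[s] = dst
    return placement
```

With four sessions on one GPU, an empty second GPU, and a latency of 100 ms per hosted session, a small `eta` lets the loop move two sessions and settle at two per GPU. A large `eta` makes every move's penalty exceed its latency benefit, so the placement stays unchanged.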

Autoscaling Controller (Section 5.2.2): Determines the GPU budget $M(t)$ based on load feedback $\rho_{\max}(t)$ from the placement controller. It uses:

  • Hysteresis-based scaling trigger: Scale-out when $\rho_{\max}(t) > \hat{\rho}(t) + \delta$; scale-in when $\rho_{\max}(t) < \hat{\rho}(t) - \delta$.
  • Proportional scaling adjustment: Target budget $M_{\text{tar}}(t) = \lceil N_{\text{req}}(t) / (K \hat{\rho}(t)) \rceil$, where $N_{\text{req}}(t)$ is the number of sessions requiring GPU execution.
  • Adaptive control parameters: $\lambda(t)$ and $\hat{\rho}(t)$ adjust based on workload volatility: larger $\lambda$ and smaller $\hat{\rho}$ during fluctuating periods, and vice versa during stable periods.
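The trigger and the proportional adjustment combine into a small decision function. The sketch below assumes that the budget jumps directly to $M_{\text{tar}}(t)$ whenever the trigger fires and never drops below one GPU. Both are simplifications for illustration, and the function names are hypothetical. The adaptive tuning of $\lambda(t)$ and $\hat{\rho}(t)$ is not modeled.

```python
import math


def target_budget(n_req: int, K: int, rho_hat: float) -> int:
    """M_tar = ceil(N_req / (K * rho_hat))."""
    return math.ceil(n_req / (K * rho_hat))


def autoscale(M: int, n_req: int, K: int,
              rho_max: float, rho_hat: float, delta: float) -> int:
    """Return the new GPU budget given load feedback rho_max."""
    scale_out = rho_max > rho_hat + delta
    scale_in = rho_max < rho_hat - delta
    if scale_out or scale_in:
        # Assumption: jump straight to the proportional target, at least 1 GPU.
        return max(1, target_budget(n_req, K, rho_hat))
    return M  # inside the hysteresis band: keep the current budget
```

As a worked example with $K = 4$, $\hat{\rho} = 0.75$, and $\delta = 0.1$: 24 sessions on 6 fully loaded GPUs give $\rho_{\max} = 1.0 > 0.85$, so the budget grows to $\lceil 24 / 3 \rceil = 8$. At 8 GPUs, a load of 0.8 falls inside the band $[0.65, 0.85]$ and the budget is left alone.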

The interaction forms a closed loop (Algorithm 1): placement provides load feedback, autoscaling updates the GPU budget, and rebalancing around scaling decisions ensures efficient resource use.
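To make the feedback path concrete, the toy simulation below replays a sequence of per-event session counts. It is not Algorithm 1 itself. It assumes a perfectly balanced placement, so the most loaded GPU hosts $\lceil N_{\text{req}} / M \rceil$ sessions, and it lets $\rho_{\max}$ exceed 1 as an overload signal instead of enforcing the capacity constraint. The function name `closed_loop` is hypothetical.

```python
import math
from typing import List


def closed_loop(n_req_per_event: List[int], M0: int, K: int,
                rho_hat: float, delta: float) -> List[int]:
    """Replay events and return the GPU budget chosen after each one."""
    M, history = M0, []
    for n_req in n_req_per_event:
        # Placement feedback: normalized load of the most loaded GPU
        # under an (assumed) perfectly balanced placement.
        rho_max = math.ceil(n_req / M) / K
        # Autoscaling: hysteresis trigger plus proportional target.
        if rho_max > rho_hat + delta or rho_max < rho_hat - delta:
            M = max(1, math.ceil(n_req / (K * rho_hat)))
        history.append(M)
    return history
```

Calling `closed_loop([8, 24, 24, 6], M0=4, K=4, rho_hat=0.75, delta=0.1)` shows the budget following a burst and the subsequent idle period, with no change on the third event because the load is already inside the hysteresis band.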

Runtime Support (Section 6)

  • Coalesced chunk processing: Groups ready sessions on the same GPU into a batch for efficient model execution.
  • GPU-CPU offloading: Suspends idle sessions by copying state to host memory, freeing GPU slots.
  • GPU-GPU migration: Uses NCCL-based one-sided memory access to transfer per-session state between GPUs at chunk boundaries, with a consistency protocol to avoid duplicated execution.
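Coalesced chunk processing amounts to grouping ready sessions by their host GPU before invoking the model. The sketch below shows only that grouping step. The `(session_id, gpu_id)` input format, the `max_batch` limit, and the function name `coalesce` are assumptions for illustration, and the actual model execution and state handling are out of scope.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def coalesce(ready: Iterable[Tuple[str, str]],
             max_batch: int) -> Dict[str, List[List[str]]]:
    """Group ready sessions per GPU and split each group into batches.

    ready yields (session_id, gpu_id) pairs for sessions whose next chunk
    can be generated now; max_batch caps the sessions per model invocation.
    """
    per_gpu: Dict[str, List[str]] = defaultdict(list)
    for session_id, gpu_id in ready:
        per_gpu[gpu_id].append(session_id)
    return {
        gpu_id: [ids[i:i + max_batch] for i in range(0, len(ids), max_batch)]
        for gpu_id, ids in per_gpu.items()
    }
```

Each inner list would correspond to one batched forward pass on that GPU, so three ready sessions on a GPU with `max_batch=2` produce one batch of two and one batch of one.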

Empirical Validation / Results

Experimental Setup

  • Hardware: Two clusters: 16 NVIDIA H20 GPUs (Cluster 1) and 64 NVIDIA B300 GPUs (Cluster 2), each with NVLink interconnects and RDMA InfiniBand.
  • Models: LongLive-style streaming video generation models (1.3B and 7B parameter variants).
  • Workloads: Six production traces (Trace 1-6) from Shengshu Technology with heterogeneous session durations and bursty activation patterns.
  • Baselines: TURBOSERVE_base (round-robin, no migration/autoscaling), TURBOSERVE_base+LAG (load-aware greedy), TURBOSERVE_base+MAG (memory-aware greedy). Ablation variants: TURBOSERVE (w/o autoscaling) and TURBOSERVE (w/o migration).

End-to-End Results (Section 7.2)

  • Latency: Under matched GPU cost, TURBOSERVE reduces worst-case per-chunk latency by 37.5% on average (up to 51.6%) compared to all baselines (Figure 7, rows 1-2).
  • Cost: Under matched latency constraints, TURBOSERVE reduces GPU operating cost by 37.2% on average (up to 49.0%) compared to all baselines (Figure 7, rows 3-4).

Ablation Studies (Section 7.3)

Removing either migration or autoscaling degrades cost efficiency:

  • Disabling migration increases GPU cost by 15.0% on average (up to 28.0%).
  • Disabling autoscaling increases GPU cost by 42.9% on average (up to 80.4%).
  • Full TURBOSERVE consistently achieves the best cost efficiency (Figure 8).

Scheduling Effectiveness (Section 7.4)

  • Placement quality: The migration-aware min-max rebalancing algorithm comes within 3.6% of an exhaustive-search oracle on average (max 6.5%), while reducing scheduling time by over 10× (Figure 9).
  • Autoscaling quality: TURBOSERVE stays within 6.1% of an offline cost-optimal oracle on average (max 8.3%) across three traces (Table 2).

Table 2: Autoscaling cost (USD) of TURBOSERVE compared with the offline oracle on Cluster 2.

| Method | Trace 1 | Trace 2 | Trace 3 |
| --- | --- | --- | --- |
| Oracle | 188.03 | 158.55 | 160.74 |
| TURBOSERVE | 196.87 (↑4.7%) | 171.71 (↑8.3%) | 152.36 (↑5.5%) |

Overhead Analysis (Section 7.5)

  • Migration overhead: 23–30 ms (2%–3% of per-chunk latency) across cluster and model configurations (Table 4). This is small relative to the latency penalty from persistent GPU imbalance.

Table 4: Session migration overhead across different cluster and model configurations on Trace 1.

| Metric | H20 (1.3B) | H20 (7B) | B300 (1.3B) | B300 (7B) |
| --- | --- | --- | --- | --- |
| Per-chunk Latency | 1054 ms | 1201 ms | 917 ms | 1181 ms |
| Migration Overhead | 23 ms (2%) | 24 ms (2%) | 24 ms (3%) | 30 ms (3%) |

  • Scheduling time: On clusters up to 64 GPUs, scheduling completes within 15 ms (<2% of per-chunk generation time); on 256 GPUs, within 0.1 s (Figure 9, Left).
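The relative migration overhead in Table 4 follows directly from the two rows of the table. The short check below recomputes it from the reported latencies; the variable names are chosen for this example only.

```python
# Values copied from Table 4 (Trace 1), in milliseconds.
latency_ms = {"H20 (1.3B)": 1054, "H20 (7B)": 1201,
              "B300 (1.3B)": 917, "B300 (7B)": 1181}
migration_ms = {"H20 (1.3B)": 23, "H20 (7B)": 24,
                "B300 (1.3B)": 24, "B300 (7B)": 30}

# Migration overhead as a percentage of per-chunk latency, rounded to integers.
overhead_pct = {
    cfg: round(100 * migration_ms[cfg] / latency_ms[cfg])
    for cfg in latency_ms
}
print(overhead_pct)
```

The rounded values reproduce the 2% to 3% range reported in the table.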

Theoretical and Practical Implications

  • Theoretical: The paper provides a formal problem formulation for streaming video generation serving, highlighting the need for joint optimization of placement and autoscaling. The closed-loop scheduling framework with adaptive control parameters offers a principled approach to balancing latency and cost under dynamic workloads.
  • Practical: TURBOSERVE demonstrates that production streaming video generation can be served efficiently with significant cost savings (37.2%) and latency improvements (37.5%). The system is deployable on existing GPU clusters with minimal overhead. The insights about migration and autoscaling coordination are applicable to other stateful serving workloads (e.g., interactive AI agents, real-time world models).
  • Key trade-off: The paper shows that treating placement and autoscaling independently leaves performance on the table; their tight coupling is essential for achieving both latency stability and cost efficiency.

Conclusion

TURBOSERVE is the first serving system designed specifically for streaming video generation workloads. It addresses the challenges of heterogeneous session durations and time-varying user demand through a closed-loop scheduling framework that jointly coordinates migration-aware session placement and load-driven GPU autoscaling. Runtime optimizations—coalesced chunk processing, GPU-CPU offloading, and NCCL-based GPU-GPU migration—enable efficient execution. Evaluations on real-world production traces from Shengshu Technology show that TURBOSERVE reduces worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average, demonstrating its effectiveness in delivering cost-efficient, latency-stable serving for dynamic streaming video generation. Future work may extend the framework to other interactive, stateful serving workloads and explore more advanced scheduling policies. The code is publicly available at https://github.com/shengshu-ai/TurboServe.
