Video Analysis and Generation via a Semantic Progress Function
Summary (Overview)
- Introduces the Semantic Progress Function (SPF): A novel, model-agnostic one-dimensional representation that quantifies the cumulative semantic evolution of a video sequence over time. Its slope reflects the instantaneous rate of semantic change.
- Identifies and diagnoses uneven pacing: The SPF reveals where generated or real video sequences exhibit non-linear semantic evolution—long stretches of minimal change followed by abrupt jumps—which undermines perceptual coherence.
- Proposes semantic linearization via ReTime: A method to reparameterize video sequences so that semantic progress increases at a constant rate. This is achieved by warping the model's temporal positional embeddings (RoPE) based on the measured SPF, redistributing temporal capacity.
- Enables arbitrary pacing control: The framework generalizes beyond linearization, allowing video sequences to be retimed to match any target semantic pacing profile (e.g., exponential acceleration/deceleration).
- Validates effectiveness: The method preserves visual fidelity (confirmed by VBench metrics and user studies) while significantly improving perceived semantic smoothness, as demonstrated on both generated and real cinematic videos.
Introduction and Theoretical Foundation
Generative models for image and video sequences often produce transformations whose semantics evolve in a highly non-linear manner. While the output may be temporally smooth, the meaning changes unevenly: prolonged periods with little variation are punctuated by sudden, abrupt jumps. This undermines perceptual coherence, reduces controllability, and complicates editing. Applications such as artistic VFX, cinematic transitions, and product reveals rely on smooth semantic evolution.
Prior work focuses on temporal smoothness or latent-space interpolation but lacks a principled measure to quantify the rate of semantic change itself. There is no tool to identify where abrupt semantic shifts occur or to compare the semantic pacing of sequences from different models.
This work introduces the Semantic Progress Function (SPF) to address this gap. The SPF is a one-dimensional function that represents the cumulative semantic state of a sequence. Constructed from pairwise semantic distances between frames, it makes semantic progression explicit and measurable. Departures of the SPF from a straight line directly reveal uneven semantic pacing. This model-agnostic representation provides both diagnostic insight and a foundation for corrective intervention, leading to the proposed semantic linearization procedure.
Methodology
1. Semantic Progress Function (SPF) Construction
Given a video with $T$ frames, the goal is to compute a scalar-valued SPF value $s_t$ for each frame index $t \in \{0, \dots, T-1\}$.
A. Frame-Level Semantic Distance: Each frame $t$ is embedded into a semantic latent vector $e_t$ using a pretrained model (SigLIP is chosen empirically). The semantic distance between frames $i$ and $j$ is computed using an angular metric:

$$d_{ij} = \arccos\left(\frac{e_i \cdot e_j}{\|e_i\|\,\|e_j\|}\right)$$
For computational efficiency, distances are computed only for temporally local frame pairs with $|i - j| \le w$ for a window size $w$, defining the constraint set $\mathcal{P} = \{(i, j) : 0 < j - i \le w\}$.
B. Fitting the SPF: The SPF vector $s = (s_0, \dots, s_{T-1})$ is estimated such that its pairwise differences approximate the semantic distances:

$$s_j - s_i \approx d_{ij}, \quad (i, j) \in \mathcal{P}$$
This can be expressed as a linear system $A s = d$, where each row of the matrix $A$ encodes one pairwise difference $s_j - s_i$ (a $+1$ in column $j$ and a $-1$ in column $i$) and the vector $d$ stacks the corresponding distances $d_{ij}$.
The solution is obtained via a regularized, weighted least-squares objective:

$$\hat{s} = \arg\min_{s} \; \|W^{1/2}(A s - d)\|_2^2 + \lambda \|s\|_2^2$$
Here, $W$ is a diagonal weighting matrix that emphasizes temporally local constraints using a Gaussian:

$$W_{(i,j),(i,j)} = \exp\left(-\frac{(j - i)^2}{2\sigma^2}\right)$$

and $\lambda$ is a regularization strength.
The closed-form solution is:

$$\hat{s} = (A^\top W A + \lambda I)^{-1} A^\top W d$$
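The SPF construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the angular distance, Gaussian weighting, and Tikhonov-style regularization follow the description, but `window`, `sigma`, and `lam` are assumed hyperparameter values.

```python
import numpy as np

def fit_spf(embeddings, window=8, sigma=4.0, lam=1e-4):
    """Fit a Semantic Progress Function to per-frame embeddings.

    Sketch of the weighted, regularized least-squares fit; `window`,
    `sigma`, and `lam` are illustrative, not the paper's values.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    T = len(embeddings)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    rows, dists, weights = [], [], []
    for i in range(T):
        for j in range(i + 1, min(i + window + 1, T)):
            a = np.zeros(T)
            a[j], a[i] = 1.0, -1.0                      # encodes s_j - s_i
            rows.append(a)
            cos = np.clip(E[i] @ E[j], -1.0, 1.0)
            dists.append(np.arccos(cos))                # angular distance d_ij
            weights.append(np.exp(-((j - i) ** 2) / (2 * sigma ** 2)))

    A, d, W = np.array(rows), np.array(dists), np.diag(weights)
    # Closed form: s = (A^T W A + lam * I)^{-1} A^T W d
    s = np.linalg.solve(A.T @ W @ A + lam * np.eye(T), A.T @ W @ d)
    return s - s[0]                                     # anchor s_0 = 0
```

For points moving at constant angular speed in embedding space, the fitted SPF comes out approximately linear, which matches the synthetic validation described later.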
2. Video Linearization via ReTime
The SPF is used to reparameterize time for constant semantic velocity.
A. Retiming of Generated Videos (Inference-time Intervention):
- Temporal Position Warping: The SPF is normalized to $[0, 1]$. For uniform semantic velocity, output frame $k$ should sit at progress $k/(T-1)$. The warped input position is found by inverting the SPF:

$$\tilde{t}_k = s^{-1}\!\left(\frac{k}{T-1}\right)$$

where $s^{-1}$ is evaluated by piecewise-linear interpolation.
- RoPE Integration: Video diffusion models use Rotary Position Embeddings (RoPE) along the temporal axis, rotating query/key features by angles $\theta_b \, t$ for a set of band frequencies $\theta_b$. Substituting the warped positions $\tilde{t}_k$ for the original linear indices $t_k = k$ warps the model's perceived time.
- Frequency-Aware Warping: RoPE uses multiple frequency bands, and a naive warp applied uniformly across all of them destabilizes generation. Instead, a blended position is used per band $b$:

$$t_{k,b} = (1 - \alpha_b)\, k + \alpha_b\, \tilde{t}_k$$

The blending strength $\alpha_b$ decays exponentially from low to high frequencies, correcting global pacing while preserving local motion smoothness.
- Timestep-Dependent Modulation: Warping strength is reduced during later denoising steps, which are devoted to detail refinement, via a decay multiplier $m(\tau)$ that scales the per-band blending strength to $\alpha_b \, m(\tau)$, where $\tau \in [0, 1]$ is the normalized diffusion timestep.
- Iterative Refinement: The process is iterated (typically 3 times) to converge to a linear SPF. At iteration $n$, the temporal correction is

$$\Delta t_k^{(n)} = s_{(n)}^{-1}\!\left(\frac{k}{T-1}\right) - t_k^{(n)}$$

and positions are updated per band: $t_{k,b}^{(n+1)} = t_{k,b}^{(n)} + \alpha_b\, m(\tau)\, \Delta t_k^{(n)}$.
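The warping steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the uniform-velocity target and piecewise-linear inversion follow the description, while `alpha0`, `gamma`, and the linear decay `m(tau) = 1 - tau` are stand-ins for the paper's actual schedules.

```python
import numpy as np

def retime_positions(spf, alpha0=1.0, gamma=0.5, n_bands=8, tau=0.0):
    """Warp RoPE temporal positions toward uniform semantic velocity.

    Returns an (n_bands, T) array of per-band positions. The decay
    m(tau) = 1 - tau is an illustrative choice, not the paper's.
    """
    spf = np.asarray(spf, dtype=float)
    T = len(spf)
    s = (spf - spf.min()) / (spf.max() - spf.min())   # normalize SPF to [0, 1]
    target = np.linspace(0.0, 1.0, T)                 # uniform progress k/(T-1)
    # Invert the (monotone) SPF by piecewise-linear interpolation.
    t_warp = np.interp(target, s, np.arange(T, dtype=float))

    t_lin = np.arange(T, dtype=float)
    m_tau = 1.0 - tau                                 # timestep-dependent decay
    # Blending strength decays exponentially from low to high bands.
    alphas = alpha0 * np.exp(-gamma * np.arange(n_bands)) * m_tau
    # Per-band blend: (1 - a_b) * k + a_b * t_warp[k]
    return (1 - alphas)[:, None] * t_lin[None, :] + alphas[:, None] * t_warp[None, :]
```

Low bands (large `alphas`) receive most of the warp, carrying the global pacing correction, while high bands stay close to the linear timeline that encodes local motion.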
B. Retiming of Existing Videos (Post-hoc Regeneration): For videos from closed-source models or real-world sources:
- Timeline Segmentation: The SPF is partitioned into contiguous, approximately linear segments using segmented least squares. The first and last frames of each segment become semantic keyframes.
- Intermediate Clip Regeneration: Using a video generator (e.g., Wan2.2 or LTX-2), new clips are generated between the keyframes of each segment. The length of each generated clip is set proportional to the semantic change in that segment: $T_m \propto \Delta s_m$, where $\Delta s_m$ is the SPF increase across segment $m$. The clips are concatenated to form the final linearized video.
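The proportional-length rule can be sketched as below; `breakpoints` (segment-boundary frame indices from the segmentation step) and the rounding policy are illustrative assumptions.

```python
import numpy as np

def allocate_clip_lengths(spf, breakpoints, total_frames):
    """Allocate frames to regenerated clips proportionally to semantic change.

    Sketch of the rule T_m ∝ Δs_m: segments covering more semantic
    change receive proportionally more frames.
    """
    s = np.asarray(spf, dtype=float)
    bp = np.asarray(breakpoints)
    delta = s[bp[1:]] - s[bp[:-1]]                 # semantic change per segment
    lengths = total_frames * delta / delta.sum()   # proportional allocation
    return np.maximum(np.round(lengths).astype(int), 1)
```

For an accelerating SPF, the later (semantically denser) segments receive more frames, which is exactly the redistribution that linearizes the output timeline.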
Empirical Validation / Results
1. Retiming Strategy Comparison
The method is compared against baselines on a challenging strawberry → bird transition.
- Linear Pixelwise Interpolation: Fails, producing ghosting artifacts.
- External Model (LTX-2) Keyframe Interpolation: Imposes a quality bottleneck limited by the external model's capabilities.
- Our ReTime Method: Operates directly on the input model's features, preserving its intrinsic quality and producing a coherent, smooth transition.
2. Real Cinematic Video Linearization
Applied to a transformation sequence from Stranger Things. The original features an abrupt, lighting-obscured change. The linearized version redistributes the semantic change over time, revealing smooth intermediate stages of the human → monster transformation.
Table 1: Video Quality Preservation (VBench Metrics)
| Model Type | Aesthetic Q. ↑ | Motion S. ↑ | Temporal F. ↑ |
|---|---|---|---|
| Wan2.2 Original | 0.630 ± 0.093 | 0.987 ± 0.010 | 0.978 ± 0.019 |
| Wan2.2 Retimed | 0.626 ± 0.090 | 0.987 ± 0.010 | 0.978 ± 0.019 |
| LTX-2 Original | 0.660 ± 0.085 | 0.994 ± 0.003 | 0.990 ± 0.008 |
| LTX-2 Retimed | 0.656 ± 0.087 | 0.993 ± 0.005 | — |
Retiming preserves visual fidelity, with scores within one standard deviation of the original.
3. Non-Linear Retiming
The framework supports arbitrary target pacing. Figure 8 demonstrates retiming a video to match rising and falling exponential curves, visually accelerating and decelerating the entry of the sun.
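Generalizing the inversion from a uniform target to any monotone pacing profile is a small change; the exponential profile below is an illustrative stand-in for the target curves described in the text.

```python
import numpy as np

def retime_to_profile(spf, target_profile):
    """Map frame indices onto an arbitrary target semantic pacing profile.

    `target_profile` is any monotone function on [0, 1]; linearization is
    the special case target_profile = identity.
    """
    spf = np.asarray(spf, dtype=float)
    T = len(spf)
    s = (spf - spf.min()) / (spf.max() - spf.min())   # normalize SPF to [0, 1]
    g = target_profile(np.linspace(0.0, 1.0, T))
    g = (g - g.min()) / (g.max() - g.min())           # normalize target to [0, 1]
    # Warped positions: frame k should have progress g(k/(T-1)).
    return np.interp(g, s, np.arange(T, dtype=float))

# Example: exponential acceleration (slow start, fast end)
# positions = retime_to_profile(spf, lambda u: np.expm1(3 * u))
```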
4. Synthetic Validation
Using a synthetic rotating-spot video with known angular velocity profiles (constant, rising, falling exponential), the recovered SPF (dotted lines) is shown to closely track the ground-truth angular position (solid lines), confirming the SPF accurately captures designed non-uniform pacing.
5. SPF Hyperparameter Ablation
- Pairwise Distance Model: Four embeddings were compared: OpenCLIP, SigLIP, DINO, and pixel-level distance. SigLIP showed superior fine-grained sensitivity (e.g., detecting the onset of a subject's anger) and was chosen as the default; the pixel-level metric failed to capture semantic shifts.
- Distance Power: Raising the pairwise distances to a power acts as a contrast modulator for the semantic curve. The default exponent is used for retiming, while a different setting was found superior for segmenting existing videos.
6. User Study
A subjective user study confirmed the method significantly improves semantic pacing (88% preference) while maintaining visual quality.
Theoretical and Practical Implications
- Theoretical: Provides a foundational, interpretable metric for analyzing temporal semantic behavior in generative models. It shifts focus from mere temporal smoothness to the evolution of meaning, enabling principled comparison across different models and generation techniques.
- Practical: Offers a plug-and-play tool for improving video generation. The ReTime method requires no model retraining or fine-tuning, making it readily applicable to existing diffusion-based video models. It enhances applications in VFX, cinematic transitions, and content creation by producing smoother, more predictable transformations. The ability to retime existing videos also opens up post-production editing possibilities.
Conclusion
The Semantic Progress Function provides a simple yet powerful framework for analyzing and controlling semantic evolution in video sequences. By making semantic pacing explicit and measurable, it enables the diagnosis of uneven transitions and the application of principled corrections like semantic linearization.
Limitations: The SPF can be influenced by strong non-semantic variations (e.g., rapid camera motion, lighting changes). Iterative refinement may push temporal embeddings out-of-distribution if over-applied. Disentangling motion, appearance, and semantics remains a challenge.
Future Work: Directions include developing motion-aware embeddings for robustness, extending the framework to multi-dimensional semantic factor analysis, and leveraging linearized sequences as training data for edit-strength controlled models. The SPF also has potential applications in video benchmarking, summarization, and thumbnailing.