Video Analysis and Generation via a Semantic Progress Function
Summary (Overview)
- Introduces the Semantic Progress Function (SPF): A novel, model-agnostic one-dimensional representation that quantifies the cumulative semantic evolution of a video sequence over time. Its slope reflects the instantaneous rate of semantic change.
- Identifies and diagnoses uneven pacing: The SPF reveals where generated or real video sequences exhibit non-linear semantic evolution—long stretches of minimal change followed by abrupt jumps—which undermines perceptual coherence.
- Proposes semantic linearization via ReTime: A method to reparameterize video sequences so that semantic progress increases at a constant rate. This is achieved by warping the model's temporal positional embeddings (RoPE) based on the measured SPF, redistributing temporal capacity.
- Enables arbitrary pacing control: The framework generalizes beyond linearization, allowing video sequences to be retimed to match any target semantic pacing profile (e.g., exponential acceleration/deceleration).
- Validates effectiveness: The method preserves visual fidelity (confirmed by VBench metrics and user studies) while significantly improving perceived semantic smoothness, as demonstrated on both generated and real cinematic videos.
Introduction and Theoretical Foundation
Generative models for image and video sequences often produce transformations whose semantics evolve in a highly non-linear manner. While the output may be temporally smooth, the meaning changes unevenly: prolonged periods with little variation are punctuated by sudden, abrupt jumps. This undermines perceptual coherence, reduces controllability, and complicates editing. Applications such as artistic VFX, cinematic transitions, and product reveals rely on smooth semantic evolution.
Prior work focuses on temporal smoothness or latent-space interpolation but lacks a principled measure to quantify the rate of semantic change itself. There is no tool to identify where abrupt semantic shifts occur or to compare the semantic pacing of sequences from different models.
This work introduces the Semantic Progress Function (SPF) to address this gap. The SPF is a one-dimensional function that represents the cumulative semantic state of a sequence. Constructed from pairwise semantic distances between frames, it makes semantic progression explicit and measurable. Departures of the SPF from a straight line directly reveal uneven semantic pacing. This model-agnostic representation provides both diagnostic insight and a foundation for corrective intervention, leading to the proposed semantic linearization procedure.
Methodology
1. Semantic Progress Function (SPF) Construction
Given a video with $T$ frames, the goal is to compute a scalar-valued SPF value $s_t$ for each frame index $t \in \{0, \dots, T-1\}$.
A. Frame-Level Semantic Distance: Each frame $t$ is embedded into a semantic latent vector $e_t$ using a pretrained model (SigLIP is chosen empirically). The semantic distance between frames $i$ and $j$ is computed using an angular metric:

$$d_{ij} = \arccos\left(\frac{e_i \cdot e_j}{\|e_i\|\,\|e_j\|}\right)$$
For computational efficiency, distances are computed only for temporally local frame pairs with $|i - j| \le w$ for a window size $w$, defining the constraint set $\mathcal{P} = \{(i, j) : 0 < j - i \le w\}$.
B. Fitting the SPF: The SPF vector $s = (s_0, \dots, s_{T-1})$ is estimated such that its pairwise differences approximate the semantic distances:

$$s_j - s_i \approx d_{ij}, \quad (i, j) \in \mathcal{P}$$
This can be expressed as a linear system $A s = d$, where each row of the matrix $A$ encodes one pairwise difference $s_j - s_i$ (a $+1$ in column $j$ and a $-1$ in column $i$) and the vector $d$ stacks the corresponding distances $d_{ij}$.
The solution is obtained via a regularized, weighted least-squares objective:

$$\hat{s} = \arg\min_{s} \; \|W^{1/2}(A s - d)\|_2^2 + \lambda \|s\|_2^2$$
Here, $W$ is a diagonal weighting matrix that emphasizes temporally local constraints using a Gaussian:

$$W_{(i,j),(i,j)} = \exp\left(-\frac{(j - i)^2}{2\sigma^2}\right)$$

and $\lambda$ is a regularization strength.
The closed-form solution is:

$$\hat{s} = (A^\top W A + \lambda I)^{-1} A^\top W d$$
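The SPF construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the angular distance, Gaussian weighting, and Tikhonov-style regularization follow the description, but `window`, `sigma`, and `lam` are assumed hyperparameter values.

```python
import numpy as np

def fit_spf(embeddings, window=8, sigma=4.0, lam=1e-4):
    """Fit a Semantic Progress Function to per-frame embeddings.

    Sketch of the weighted, regularized least-squares fit; `window`,
    `sigma`, and `lam` are illustrative, not the paper's values.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    T = len(embeddings)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    rows, dists, weights = [], [], []
    for i in range(T):
        for j in range(i + 1, min(i + window + 1, T)):
            a = np.zeros(T)
            a[j], a[i] = 1.0, -1.0                      # encodes s_j - s_i
            rows.append(a)
            cos = np.clip(E[i] @ E[j], -1.0, 1.0)
            dists.append(np.arccos(cos))                # angular distance d_ij
            weights.append(np.exp(-((j - i) ** 2) / (2 * sigma ** 2)))

    A, d, W = np.array(rows), np.array(dists), np.diag(weights)
    # Closed form: s = (A^T W A + lam * I)^{-1} A^T W d
    s = np.linalg.solve(A.T @ W @ A + lam * np.eye(T), A.T @ W @ d)
    return s - s[0]                                     # anchor s_0 = 0
```

For points moving at constant angular speed in embedding space, the fitted SPF comes out approximately linear, which matches the synthetic validation described later.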
2. Video Linearization via ReTime
The SPF is used to reparameterize time for constant semantic velocity.
A. Retiming of Generated Videos (Inference-time Intervention):
- Temporal Position Warping: The SPF is normalized to $[0, 1]$. For uniform semantic velocity, output frame $k$ should sit at progress $k/(T-1)$. The warped input position is found by inverting the SPF:

$$\tilde{t}_k = s^{-1}\!\left(\frac{k}{T-1}\right)$$

where $s^{-1}$ is evaluated by piecewise-linear interpolation.
- RoPE Integration: Video diffusion models use Rotary Position Embeddings (RoPE) along the temporal axis, rotating query/key features by angles $\theta_b \, t$ for a set of band frequencies $\theta_b$. Substituting the warped positions $\tilde{t}_k$ for the original linear indices $t_k = k$ warps the model's perceived time.
- Frequency-Aware Warping: RoPE uses multiple frequency bands, and a naive warp applied uniformly across all of them destabilizes generation. Instead, a blended position is used per band $b$:

$$t_{k,b} = (1 - \alpha_b)\, k + \alpha_b\, \tilde{t}_k$$

The blending strength $\alpha_b$ decays exponentially from low to high frequencies, correcting global pacing while preserving local motion smoothness.
- Timestep-Dependent Modulation: Warping strength is reduced during later denoising steps, which are devoted to detail refinement, via a decay multiplier $m(\tau)$ that scales the per-band blending strength to $\alpha_b \, m(\tau)$, where $\tau \in [0, 1]$ is the normalized diffusion timestep.
- Iterative Refinement: The process is iterated (typically 3 times) to converge to a linear SPF. At iteration $n$, the temporal correction is

$$\Delta t_k^{(n)} = s_{(n)}^{-1}\!\left(\frac{k}{T-1}\right) - t_k^{(n)}$$

and positions are updated per band: $t_{k,b}^{(n+1)} = t_{k,b}^{(n)} + \alpha_b\, m(\tau)\, \Delta t_k^{(n)}$.
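The warping steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the uniform-velocity target and piecewise-linear inversion follow the description, while `alpha0`, `gamma`, and the linear decay `m(tau) = 1 - tau` are stand-ins for the paper's actual schedules.

```python
import numpy as np

def retime_positions(spf, alpha0=1.0, gamma=0.5, n_bands=8, tau=0.0):
    """Warp RoPE temporal positions toward uniform semantic velocity.

    Returns an (n_bands, T) array of per-band positions. The decay
    m(tau) = 1 - tau is an illustrative choice, not the paper's.
    """
    spf = np.asarray(spf, dtype=float)
    T = len(spf)
    s = (spf - spf.min()) / (spf.max() - spf.min())   # normalize SPF to [0, 1]
    target = np.linspace(0.0, 1.0, T)                 # uniform progress k/(T-1)
    # Invert the (monotone) SPF by piecewise-linear interpolation.
    t_warp = np.interp(target, s, np.arange(T, dtype=float))

    t_lin = np.arange(T, dtype=float)
    m_tau = 1.0 - tau                                 # timestep-dependent decay
    # Blending strength decays exponentially from low to high bands.
    alphas = alpha0 * np.exp(-gamma * np.arange(n_bands)) * m_tau
    # Per-band blend: (1 - a_b) * k + a_b * t_warp[k]
    return (1 - alphas)[:, None] * t_lin[None, :] + alphas[:, None] * t_warp[None, :]
```

Low bands (large `alphas`) receive most of the warp, carrying the global pacing correction, while high bands stay close to the linear timeline that encodes local motion.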
B. Retiming of Existing Videos (Post-hoc Regeneration): For videos from closed-source models or real-world sources:
- Timeline Segmentation: The SPF is partitioned into contiguous, approximately linear segments using segmented least squares. The first and last frames of each segment become semantic keyframes.
- Intermediate Clip Regeneration: Using a video generator (e.g., Wan2.2 or LTX-2), new clips are generated between the keyframes of each segment. The length of each generated clip is set proportional to the semantic change in that segment: $T_m \propto \Delta s_m$, where $\Delta s_m$ is the SPF increase across segment $m$. The clips are concatenated to form the final linearized video.
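The proportional-length rule can be sketched as below; `breakpoints` (segment-boundary frame indices from the segmentation step) and the rounding policy are illustrative assumptions.

```python
import numpy as np

def allocate_clip_lengths(spf, breakpoints, total_frames):
    """Allocate frames to regenerated clips proportionally to semantic change.

    Sketch of the rule T_m ∝ Δs_m: segments covering more semantic
    change receive proportionally more frames.
    """
    s = np.asarray(spf, dtype=float)
    bp = np.asarray(breakpoints)
    delta = s[bp[1:]] - s[bp[:-1]]                 # semantic change per segment
    lengths = total_frames * delta / delta.sum()   # proportional allocation
    return np.maximum(np.round(lengths).astype(int), 1)
```

For an accelerating SPF, the later (semantically denser) segments receive more frames, which is exactly the redistribution that linearizes the output timeline.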
Empirical Validation / Results
1. Retiming Strategy Comparison
The method is compared against baselines on a challenging strawberry → bird transition.
- Linear Pixelwise Interpolation: Fails, producing ghosting artifacts.
- External Model (LTX-2) Keyframe Interpolation: Imposes a quality bottleneck limited by the external model's capabilities.
- Our ReTime Method: Operates directly on the input model's features, preserving its intrinsic quality and producing a coherent, smooth transition.
2. Real Cinematic Video Linearization
Applied to a transformation sequence from Stranger Things. The original features an abrupt, lighting-obscured change. The linearized version redistributes the semantic change over time, revealing smooth intermediate stages of the human → monster transformation.
Table 1: Video Quality Preservation (VBench Metrics)
| Model Type | Aesthetic Q. ↑ | Motion S. ↑ | Temporal F. ↑ |
|---|---|---|---|
| Wan2.2 Original | 0.630 ± 0.093 | 0.987 ± 0.010 | 0.978 ± 0.019 |
| Wan2.2 Retimed | 0.626 ± 0.090 | 0.987 ± 0.010 | 0.978 ± 0.019 |
| LTX-2 Original | 0.660 ± 0.085 | 0.994 ± 0.003 | 0.990 ± 0.008 |
| LTX-2 Retimed | 0.656 ± 0.087 | 0.993 ± 0.005 | — |
Retiming preserves visual fidelity, with scores within one standard deviation of the original.
3. Non-Linear Retiming
The framework supports arbitrary target pacing. Figure 8 demonstrates retiming a video to match rising and falling exponential curves, visually accelerating and decelerating the entry of the sun.
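Generalizing the inversion from a uniform target to any monotone pacing profile is a small change; the exponential profile below is an illustrative stand-in for the target curves described in the text.

```python
import numpy as np

def retime_to_profile(spf, target_profile):
    """Map frame indices onto an arbitrary target semantic pacing profile.

    `target_profile` is any monotone function on [0, 1]; linearization is
    the special case target_profile = identity.
    """
    spf = np.asarray(spf, dtype=float)
    T = len(spf)
    s = (spf - spf.min()) / (spf.max() - spf.min())   # normalize SPF to [0, 1]
    g = target_profile(np.linspace(0.0, 1.0, T))
    g = (g - g.min()) / (g.max() - g.min())           # normalize target to [0, 1]
    # Warped positions: frame k should have progress g(k/(T-1)).
    return np.interp(g, s, np.arange(T, dtype=float))

# Example: exponential acceleration (slow start, fast end)
# positions = retime_to_profile(spf, lambda u: np.expm1(3 * u))
```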
4. Synthetic Validation
Using a synthetic rotating-spot video with known angular velocity profiles (constant, rising, falling exponential), the recovered SPF (dotted lines) is shown to closely track the ground-truth angular position (solid lines), confirming the SPF accurately captures designed non-uniform pacing.
5. SPF Hyperparameter Ablation
- Pairwise Distance Model: Four embeddings were compared: OpenCLIP, SigLIP, DINO, and pixel-level distance. SigLIP showed superior fine-grained sensitivity (e.g., detecting the onset of a subject's anger) and was chosen as the default; the pixel-level metric failed to capture semantic shifts.
- Distance Power: Raising the pairwise distances to a power acts as a contrast modulator for the semantic curve. The default exponent is used for retiming, while a different setting was found superior for segmenting existing videos.
6. User Study
A subjective user study confirmed the method significantly improves semantic pacing (88% preference) while maintaining visual quality.
Theoretical and Practical Implications
- Theoretical: Provides a foundational, interpretable metric for analyzing temporal semantic behavior in generative models. It shifts focus from mere temporal smoothness to the evolution of meaning, enabling principled comparison across different models and generation techniques.
- Practical: Offers a plug-and-play tool for improving video generation. The ReTime method requires no model retraining or fine-tuning, making it readily applicable to existing diffusion-based video models. It enhances applications in VFX, cinematic transitions, and content creation by producing smoother, more predictable transformations. The ability to retime existing videos also opens up post-production editing possibilities.
Conclusion
The Semantic Progress Function provides a simple yet powerful framework for analyzing and controlling semantic evolution in video sequences. By making semantic pacing explicit and measurable, it enables the diagnosis of uneven transitions and the application of principled corrections like semantic linearization.
Limitations: The SPF can be influenced by strong non-semantic variations (e.g., rapid camera motion, lighting changes). Iterative refinement may push temporal embeddings out-of-distribution if over-applied. Disentangling motion, appearance, and semantics remains a challenge.
Future Work: Directions include developing motion-aware embeddings for robustness, extending the framework to multi-dimensional semantic factor analysis, and leveraging linearized sequences as training data for edit-strength controlled models. The SPF also has potential applications in video benchmarking, summarization, and thumbnailing.