LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Summary (Overview)

  • Unified Benchmark for Long-Form AV Generation: Introduces LongAV-Compass, the first benchmark dedicated to evaluating minute-scale (60-120 second) audio-visual generation across three conditioning modalities: Text-to-Audio-Video (T2AV), Image-to-Audio-Video (I2AV), and Video-to-Audio-Video (V2AV).
  • Diagnostic Evaluation Framework: Proposes a comprehensive evaluation framework with over 20 fine-grained dimensions, decomposing assessment into within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization.
  • Taxonomy-Guided Dataset: Constructs a curated dataset of 284 test cases organized by a two-dimensional taxonomy of application scenario (Vlog, Content-Creator, Performance Ads, Brand Ads) and generation complexity (L1-L4).
  • Systematic Model Analysis: Evaluates 11 representative models, revealing that current systems struggle with long-range identity drift, brittle event transitions, and audio-visual synchronization decay over extended durations. Proprietary models (Seedance 2.0, Kling 3.0, Veo 3.1) generally outperform open-source alternatives.
  • Human-Aligned Validation: Demonstrates strong correlation between automatic benchmark scores and human preferences, with Pearson correlations of 0.917 (content fidelity), 0.935 (visual quality), and 0.867 (long-video stability).

Introduction and Theoretical Foundation

Recent advances in video generation are pushing audio-visual (AV) generation beyond short clips towards minute-long content relevant for applications like vlogs, tutorials, and advertisements. Success in this long-form regime requires models to sustain subject identity, event continuity, scene transitions, and audio grounding over extended temporal horizons.

However, existing evaluation benchmarks (e.g., VBench, EvalCrafter, VABench, T2AV-Compass) remain largely confined to short-form settings (5-10 seconds) and provide fragmented coverage across input conditions. This creates three key limitations:

  1. Limited temporal scale for assessing minute-long coherence
  2. Fragmented coverage across T2AV, I2AV, and V2AV modalities
  3. Poor diagnostic visibility into long-range degradation (identity drift, weak continuation, unstable transitions)

As summarized in Table 1, LongAV-Compass addresses these gaps by providing unified X2AV coverage across all three modalities with average video duration > 1 minute.

Table 1: Benchmark Comparison

Benchmark       #Samples  T2V  T2AV  I2AV  V2A  V2AV  Unified X2AV  Avg. Video Duration > 1min
MSVBench        276
AVGen-Bench     235
T2AV-Compass    500
VABench         1,299
PhyAVBench      337
VinTAGe-Bench   636
LongAV-Compass  284

Methodology

3.1 Task Formulation

LongAV-Compass covers three long-form AV generation tasks under a unified framework:

  • T2AV: Generate minute-scale AV content from structured event scripts
  • I2AV: Generate long-form sequences conditioned on a reference image + event script, requiring consistent preservation of subject appearance
  • V2AV: Extend a reference video according to a continuation script while preserving style consistency and subject continuity

Table 2: Task Coverage

Task  #Samples  #Events  #Shots  Input
T2AV  128       879      2,115   Script (S)
I2AV  132       807      1,989   Reference Image (RI) + S
V2AV  41        235      731     Reference Video (RV) + S

3.2 Taxonomy and Benchmark Scope

The benchmark is organized by a two-dimensional taxonomy:

  1. Application Scenario: Vlog, Content-Creator, Performance Ads, Brand Ads
  2. Generation Complexity: L1 (simple interactions) to L4 (causal chains, physical plausibility)

3.3 Data Construction

A hybrid pipeline combines real-video transcription (60%) with LLM-template generation (40%) using Gemini 3.1 Pro:

  • T2AV: 128 cases from real videos (Creative Commons) and scenario templates
  • I2AV: 115 cases with reference images from permissively licensed repositories
  • V2AV: 41 cases with 10-15s reference clips + continuation scripts

3.4 Unified Annotation Format

Each case has dual representations:

  1. Global description: Overall narrative structure
  2. Event sequence: Temporally aligned sub-events with:
    • Temporal span
    • Action summary
    • Completion criterion
    • Key visual elements
    • Expected audio content
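
The dual representation above can be pictured as a small record type. The sketch below is illustrative only: the class and field names (`Case`, `Event`, `span_s`, `reference_path`) and the validation rules (ordered, non-overlapping events; a reference input for I2AV and V2AV) are assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Event:
    """One temporally aligned sub-event (hypothetical field names)."""
    span_s: Tuple[float, float]           # temporal span (start, end) in seconds
    action_summary: str
    completion_criterion: str
    key_visual_elements: List[str] = field(default_factory=list)
    expected_audio: str = ""


@dataclass
class Case:
    """A test case: global description plus an ordered event sequence."""
    task: str                             # "T2AV", "I2AV", or "V2AV"
    global_description: str
    events: List[Event]
    reference_path: Optional[str] = None  # reference image or video, if any

    def validate(self) -> None:
        if self.task not in {"T2AV", "I2AV", "V2AV"}:
            raise ValueError(f"unknown task: {self.task}")
        if self.task != "T2AV" and self.reference_path is None:
            raise ValueError(f"{self.task} needs a reference input")
        prev_end = 0.0
        for ev in self.events:
            start, end = ev.span_s
            # Assumed rule: events are ordered and do not overlap.
            if not (prev_end <= start < end):
                raise ValueError(f"bad temporal span: {ev.span_s}")
            prev_end = end

    def duration_s(self) -> float:
        """End time of the last event, or 0.0 for an empty script."""
        return self.events[-1].span_s[1] if self.events else 0.0


if __name__ == "__main__":
    case = Case(
        task="T2AV",
        global_description="A morning vlog from waking up to leaving home.",
        events=[
            Event((0.0, 20.0), "Wakes up", "Subject is out of bed",
                  ["bedroom"], "alarm, rustling sheets"),
            Event((20.0, 65.0), "Makes coffee", "Cup is filled",
                  ["kitchen", "mug"], "kettle, pouring"),
        ],
    )
    case.validate()
    print(case.duration_s())
```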

3.5-3.7 Evaluation Metrics

The framework defines comprehensive metrics across video, audio, and task-specific dimensions:

Video Metrics (6 dimensions):

  1. Event Fulfillment (V_{QA}): MLLM-based QA verification (0-1 scale)
  2. Visual Quality (VQ): MLLM evaluation of motion naturalness, subject integrity, artifact control, visual fidelity (1-5 scale)
  3. Long-form Continuity (Cont.): Measures story continuity, subject consistency, scene coherence (1-5 scale)
  4. Transition Stability (Trans.): Evaluates event boundaries for black frames, flickering, freezing (1-5 scale)
  5. Holistic Presentation (Hol.): Overall presentation quality, watchability (1-5 scale)
  6. Text-Video Alignment (TVAlign): CLIP embedding similarity (0-1 scale)
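
One way a QA-based event fulfillment score could land on a 0-1 scale is to average yes/no verdicts. The sketch below assumes the MLLM answers a set of binary checks per event and that events are weighted equally; both the aggregation rule and the function name `event_fulfillment` are assumptions, not the paper's stated formula.

```python
from typing import Dict, List


def event_fulfillment(verdicts: Dict[str, List[bool]]) -> float:
    """Mean per-event pass rate.

    `verdicts` maps an event id to the yes/no outcomes of its QA checks
    (a hypothetical input format).
    """
    if not verdicts:
        raise ValueError("no events to score")
    per_event = []
    for event_id, answers in verdicts.items():
        if not answers:
            raise ValueError(f"event {event_id!r} has no QA checks")
        per_event.append(sum(answers) / len(answers))
    return sum(per_event) / len(per_event)


if __name__ == "__main__":
    # Toy verdicts, not benchmark data.
    print(event_fulfillment({"e1": [True, True], "e2": [True, False]}))
```

Averaging per event first keeps an event with many checks from dominating the score; a flat average over all checks would be an equally plausible choice.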

Audio Metrics (3 dimensions):

  1. Audio-Video Synchronization (AVS): Temporal alignment of sound with visible actions (1-5 scale)
  2. Audio Quality (AudQ): Realism and event-appropriateness (1-5 scale)
  3. Long-audio Coherence (AudL): Soundtrack continuity and stability (1-5 scale)

Task-Specific Metrics (I2AV):

  1. First-frame Image Anchoring (IV_1): MLLM rating of reference image preservation (1-5 scale)
  2. Image Alignment (ImgAlign): CLIP image-image similarity between reference and sampled frames (0-1 scale)
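
The embedding-similarity metrics (TVAlign, ImgAlign) reduce to cosine similarity between feature vectors. The sketch below shows an ImgAlign-style computation, assuming the reference image and the sampled frames have already been embedded by a CLIP image encoder; the toy vectors, the function names, and the plain mean over frames are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def image_alignment(ref: Sequence[float],
                    frames: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between a reference and sampled-frame embeddings."""
    if not frames:
        raise ValueError("no sampled frames")
    return sum(cosine(ref, f) for f in frames) / len(frames)


if __name__ == "__main__":
    # Toy 3-d vectors standing in for precomputed CLIP embeddings.
    reference = [0.9, 0.1, 0.0]
    sampled = [[0.8, 0.2, 0.0], [0.7, 0.1, 0.2], [0.5, 0.4, 0.3]]
    print(round(image_alignment(reference, sampled), 4))
```

Sampling frames across the whole clip rather than only near the start is what would let such a score reflect drift from the reference over a minute-long video.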

Empirical Validation / Results

4.3 Main Results

Table 3: T2AV Task Results

Model         Aud.  Event V_{QA}  VQ      Cont.   Trans.  Hol.    TVAlign  AVS     AudQ    AudL
Seedance 2.0  Yes   0.9023        3.7116  4.2649  4.0065  4.1128  0.6183   3.6038  3.7875  4.1845
Kling 3.0     Yes   0.9274        3.3893  4.4139  3.8502  3.8542  0.6185   3.4922  3.6049  3.7713
Veo 3.1       Yes   0.7784        2.8961  3.1348  4.0032  3.5759  0.6142   3.3490  3.2387  3.6931

Table 4: I2AV Task Results

Model         Aud.  V_{QA}  VQ      Cont.   Trans.  Hol.    TVAlign  IV_1    ImgAlign  AVS     AudQ    AudL
Seedance 2.0  Yes   0.9204  3.7651  4.9182  3.9625  3.8864  0.6145   0.9622  0.9027    3.5669  3.9113  4.2290
Kling 3.0     Yes   0.8939  3.2760  4.1244  4.0668  3.8526  0.6182   0.9960  0.8877    3.5081  3.8032  4.0164
Veo 3.1       Yes   0.8211  2.9266  3.8183  4.1414  3.6463  0.6156   0.9685  0.9051    3.3514  3.4484  4.1221

Table 5: V2AV Task Results

Model         Aud.  V_{QA}  VQ      Cont.   Trans.  Hol.    TVAlign  AVS     AudQ    AudL
Seedance 2.0  Yes   0.8753  3.8336  4.7636  3.9267  4.1705  0.9727   3.7591  4.4357  4.3129
Veo 3.1       Yes   0.8055  3.0869  1.8425  2.2815  3.3625  0.7100   3.4939  3.9485  3.2897

4.4 Analysis and Findings

Key Findings:

  1. Proprietary models dominate across all tasks, with Seedance 2.0 being the most consistent performer
  2. Task-specific alignment metrics (e.g., TVAlign, ImgAlign) often saturate, while event fulfillment, continuity, and holistic presentation provide more discriminative signals
  3. Performance Ads is the most challenging scenario, exposing weaknesses in product presentation and multi-step demonstration
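  4. Models generally degrade with increasing complexity (L1→L4) and event-chain length; in Table 6 the drop is steepest for open-source and agent-based models, while proprietary models peak at L2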

Table 6: Per-Difficulty Analysis (Average Balanced Score)

Family               L1    L2    L3    L4
Proprietary Models   70.6  75.2  74.5  73.9
Open-Source Models   57.9  52.9  52.8  51.4
Agent-Based Models   47.3  47.4  43.2  41.2
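
The per-difficulty trend can be read off Table 6 with a few lines of arithmetic. The sketch below copies the table values and reports, for each family, the change from L1 to L4 and the level with the lowest score; the helper names are illustrative.

```python
# Average Balanced Score per difficulty level, copied from Table 6.
TABLE_6 = {
    "Proprietary Models": {"L1": 70.6, "L2": 75.2, "L3": 74.5, "L4": 73.9},
    "Open-Source Models": {"L1": 57.9, "L2": 52.9, "L3": 52.8, "L4": 51.4},
    "Agent-Based Models": {"L1": 47.3, "L2": 47.4, "L3": 43.2, "L4": 41.2},
}


def l1_to_l4_change(scores: dict) -> float:
    """Score at L4 minus score at L1, rounded to one decimal."""
    return round(scores["L4"] - scores["L1"], 1)


def lowest_level(scores: dict) -> str:
    """Difficulty level with the lowest score."""
    return min(scores, key=scores.get)


if __name__ == "__main__":
    for family, scores in TABLE_6.items():
        print(f"{family}: L1->L4 change {l1_to_l4_change(scores):+.1f}, "
              f"lowest at {lowest_level(scores)}")
```

On these numbers the open-source and agent-based families lose about six points from L1 to L4, whereas the proprietary family scores lowest at L1 and highest at L2.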

4.5 Human Alignment

Strong correlation between automatic scores and human preferences:

  • Content Fidelity: Pearson r = 0.917
  • Visual Quality: Pearson r = 0.935
  • Long-Video Stability: Pearson r = 0.867
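
For reference, the Pearson coefficient reported above is the covariance of the two score lists divided by the product of their standard deviations. The sketch below computes it from scratch; the paired scores in the usage example are made-up placeholders, not the study's data, and the pairing unit (per model or per video) is not specified here.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical automatic scores and mean human ratings for four systems.
    automatic = [62.0, 71.5, 80.4, 55.3]
    human = [3.1, 3.6, 4.2, 2.9]
    print(round(pearson_r(automatic, human), 3))
```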

4.6 Input Format Sensitivity

Table 7: Input-Format Sensitivity Analysis

Model         V2AV  I2AV  T2AV
Seedance 2.0  80.4  83.9  83.6
Veo 3.1       57.4  71.8  68.1
LongCat       39.8  40.4  41.2
Helios (14B)  40.5  34.4  34.6

Optimal input format is model-dependent, with no universally superior conditioning modality.
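
The model-dependence noted above can be checked directly against Table 7. The sketch below copies the table values and picks the best-scoring input format per model, along with the gap between each model's best and worst format; the helper names are illustrative.

```python
# Scores per input format, copied from Table 7.
TABLE_7 = {
    "Seedance 2.0": {"V2AV": 80.4, "I2AV": 83.9, "T2AV": 83.6},
    "Veo 3.1": {"V2AV": 57.4, "I2AV": 71.8, "T2AV": 68.1},
    "LongCat": {"V2AV": 39.8, "I2AV": 40.4, "T2AV": 41.2},
    "Helios (14B)": {"V2AV": 40.5, "I2AV": 34.4, "T2AV": 34.6},
}


def best_input(scores: dict) -> str:
    """Input format with the highest score for one model."""
    return max(scores, key=scores.get)


def spread(scores: dict) -> float:
    """Gap between a model's best and worst input format, one decimal."""
    return round(max(scores.values()) - min(scores.values()), 1)


if __name__ == "__main__":
    for model, scores in TABLE_7.items():
        print(f"{model}: best with {best_input(scores)}, "
              f"spread {spread(scores)}")
```

On these values Seedance 2.0 and Veo 3.1 score highest with I2AV, LongCat with T2AV, and Helios (14B) with V2AV, so no single format wins for every model.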

Theoretical and Practical Implications

Theoretical Implications:

  1. Long-form evaluation requires multi-dimensional assessment: Single scores (e.g., FVD, CLIP score) are insufficient for minute-scale generation
  2. Conditioning modality affects generation stability: Different models excel with different input formats (text, image, or video)
  3. Audio-visual synchronization decays over time: Maintaining AV alignment becomes increasingly challenging with duration

Practical Implications:

  1. Benchmark as diagnostic tool: LongAV-Compass helps identify specific failure modes (identity drift, transition artifacts) rather than just ranking models
  2. Guidance for model development: Highlights need for better long-range temporal modeling and cross-event consistency mechanisms
  3. Application-specific evaluation: Different scenarios (Vlog vs. Performance Ads) stress different model capabilities

Conclusion

LongAV-Compass establishes a unified benchmark for minute-scale audio-visual generation across T2AV, I2AV, and V2AV. Key takeaways:

  1. Current models cannot be characterized by single scores – strong long-form generation requires joint success across event completion, temporal continuity, visual quality, semantic alignment, and audio-visual synchronization
  2. Proprietary models lead but have clear weaknesses – particularly in product-oriented scenarios and complex event chains
  3. Audio generation remains challenging – native audio support doesn't guarantee synchronized, coherent soundtracks over minute durations
  4. Benchmark enables systematic diagnosis – revealing where models fail as temporal scope, conditioning diversity, and cross-modal coupling increase

Future Directions: Extending to even longer durations (5-10 minutes), incorporating more complex narrative structures, and developing better automatic metrics for long-range consistency assessment.