LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
Summary (Overview)
- Unified Benchmark for Long-Form AV Generation: Introduces LongAV-Compass, the first benchmark dedicated to evaluating minute-scale (60-120 second) audio-visual generation across three conditioning modalities: Text-to-Audio-Video (T2AV), Image-to-Audio-Video (I2AV), and Video-to-Audio-Video (V2AV).
- Diagnostic Evaluation Framework: Proposes a comprehensive evaluation framework with over 20 fine-grained dimensions, decomposing assessment into within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization.
- Taxonomy-Guided Dataset: Constructs a curated dataset of 284 test cases organized by a two-dimensional taxonomy of application scenario (Vlog, Content-Creator, Performance Ads, Brand Ads) and generation complexity (L1-L4).
- Systematic Model Analysis: Evaluates 11 representative models, revealing that current systems struggle with long-range identity drift, brittle event transitions, and audio-visual synchronization decay over extended durations. Proprietary models (Seedance 2.0, Kling 3.0, Veo 3.1) generally outperform open-source alternatives.
- Human-Aligned Validation: Demonstrates strong correlation between automatic benchmark scores and human preferences, with Pearson correlations of 0.917 (content fidelity), 0.935 (visual quality), and 0.867 (long-video stability).
Introduction and Theoretical Foundation
Recent advances in video generation are pushing audio-visual (AV) generation beyond short clips towards minute-long content relevant for applications like vlogs, tutorials, and advertisements. Success in this long-form regime requires models to sustain subject identity, event continuity, scene transitions, and audio grounding over extended temporal horizons.
However, existing evaluation benchmarks (e.g., VBench, EvalCrafter, VABench, T2AV-Compass) remain largely confined to short-form settings (5-10 seconds) and provide fragmented coverage across input conditions. This creates three key limitations:
- Limited temporal scale for assessing minute-long coherence
- Fragmented coverage across T2AV, I2AV, and V2AV modalities
- Poor diagnostic visibility into long-range degradation (identity drift, weak continuation, unstable transitions)
As summarized in Table 1, LongAV-Compass addresses these gaps: it is the only listed benchmark that offers unified X2AV coverage across all three modalities together with an average video duration above one minute.
Table 1: Benchmark Comparison
| Benchmark | #Samples | T2V | T2AV | I2AV | V2A | V2AV | Unified X2AV | Avg. Video Duration > 1min |
|---|---|---|---|---|---|---|---|---|
| MSVBench | 276 | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| AVGen-Bench | 235 | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| T2AV-Compass | 500 | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| VABench | 1,299 | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| PhyAVBench | 337 | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| VinTAGe-Bench | 636 | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| LongAV-Compass | 284 | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |
Methodology
3.1 Task Formulation
LongAV-Compass covers three long-form AV generation tasks under a unified framework:
- T2AV: Generate minute-scale AV content from structured event scripts
- I2AV: Generate long-form sequences conditioned on a reference image + event script, requiring consistent preservation of subject appearance
- V2AV: Extend a reference video according to a continuation script while preserving style consistency and subject continuity
Table 2: Task Coverage
| Task | #Samples | #Events | #Shots | Input |
|---|---|---|---|---|
| T2AV | 128 | 879 | 2,115 | Script (S) |
| I2AV | 115 | 807 | 1,989 | Reference Image (RI) + S |
| V2AV | 41 | 235 | 731 | Reference Video (RV) + S |
3.2 Taxonomy and Benchmark Scope
The benchmark is organized by a two-dimensional taxonomy:
- Application Scenario: Vlog, Content-Creator, Performance Ads, Brand Ads
- Generation Complexity: L1 (simple interactions) to L4 (causal chains, physical plausibility)
3.3 Data Construction
A hybrid pipeline combines real-video transcription (60%) with LLM-template generation (40%) using Gemini 3.1 Pro:
- T2AV: 128 cases from real videos (Creative Commons) and scenario templates
- I2AV: 115 cases with reference images from permissively licensed repositories
- V2AV: 41 cases with 10-15s reference clips + continuation scripts
3.4 Unified Annotation Format
Each case has dual representations:
- Global description: Overall narrative structure
- Event sequence: Temporally aligned sub-events, each with:
  - Temporal span
  - Action summary
  - Completion criterion
  - Key visual elements
  - Expected audio content
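The dual representation above can be pictured as a small data structure. The following is a minimal sketch, not the benchmark's actual schema: the class names, field names, the use of seconds for the temporal span, and the example case are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One temporally aligned sub-event (field names are hypothetical)."""
    start_s: float                      # temporal span start, in seconds
    end_s: float                        # temporal span end, in seconds
    action_summary: str
    completion_criterion: str
    key_visual_elements: List[str] = field(default_factory=list)
    expected_audio: str = ""

    def duration(self) -> float:
        return self.end_s - self.start_s


@dataclass
class BenchmarkCase:
    """A test case: global description plus an ordered event sequence."""
    task: str                           # "T2AV", "I2AV", or "V2AV"
    global_description: str
    events: List[Event]

    def validate(self) -> None:
        # Assumption of this sketch: events are ordered and do not overlap.
        for e in self.events:
            if e.end_s <= e.start_s:
                raise ValueError("event has non-positive duration")
        for prev, nxt in zip(self.events, self.events[1:]):
            if nxt.start_s < prev.end_s:
                raise ValueError("events overlap or are out of order")

    def total_duration(self) -> float:
        return sum(e.duration() for e in self.events)


# Invented example content, for illustration only.
case = BenchmarkCase(
    task="T2AV",
    global_description="A vlogger prepares and tastes a cup of coffee.",
    events=[
        Event(0.0, 8.0, "Grind beans", "Grinder is shown running",
              ["grinder", "beans"], "grinding noise"),
        Event(8.0, 20.0, "Pour water", "Water visibly reaches the filter",
              ["kettle", "filter"], "pouring water"),
    ],
)
case.validate()
print(case.total_duration())
```

Keeping the completion criterion and expected audio on each event, rather than only in the global description, is what would let an evaluator check events one at a time.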
3.5-3.7 Evaluation Metrics
The framework defines comprehensive metrics across video, audio, and task-specific dimensions:
Video Metrics (6 dimensions):
- Event Fulfillment (Event): MLLM-based QA verification (0-1 scale)
- Visual Quality (VQ): MLLM evaluation of motion naturalness, subject integrity, artifact control, visual fidelity (1-5 scale)
- Long-form Continuity (Cont.): Measures story continuity, subject consistency, scene coherence (1-5 scale)
- Transition Stability (Trans.): Evaluates event boundaries for black frames, flickering, freezing (1-5 scale)
- Holistic Presentation (Hol.): Overall presentation quality, watchability (1-5 scale)
- Text-Video Alignment (TVAlign): CLIP embedding similarity (0-1 scale)
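Event Fulfillment is described above only as MLLM-based QA verification on a 0-1 scale; the exact aggregation is not given here. One plausible reading, assumed purely for illustration, is the fraction of verification questions that the MLLM answers affirmatively. The function name and the boolean-answer interface below are hypothetical.

```python
from typing import Sequence


def event_fulfillment(answers: Sequence[bool]) -> float:
    """Fraction of verification questions answered 'yes' (assumed aggregation)."""
    if not answers:
        raise ValueError("need at least one verification answer")
    return sum(answers) / len(answers)


# Hypothetical MLLM verdicts for four completion-criterion questions.
score = event_fulfillment([True, True, False, True])
print(score)  # 0.75
```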
Audio Metrics (3 dimensions):
- Audio-Video Synchronization (AVS): Temporal alignment of sound with visible actions (1-5 scale)
- Audio Quality (AudQ): Realism and event-appropriateness (1-5 scale)
- Long-audio Coherence (AudL): Soundtrack continuity and stability (1-5 scale)
Task-Specific Metrics (I2AV):
- First-frame Image Anchoring: MLLM rating of reference image preservation (1-5 scale)
- Image Alignment (ImgAlign): CLIP image-image similarity between reference and sampled frames (0-1 scale)
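The embedding-based metrics (TVAlign and ImgAlign) reduce to a similarity between vectors. The sketch below assumes cosine similarity averaged over sampled frames, which is a common way to compare CLIP embeddings; the summary does not state the exact aggregation or how scores are mapped to the 0-1 range. The embeddings here are toy 2-D vectors rather than real CLIP outputs, and the function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def image_alignment(reference: Vector, frames: Sequence[Vector]) -> float:
    """Mean cosine similarity between a reference embedding and frame embeddings."""
    if not frames:
        raise ValueError("need at least one sampled frame")
    sims = [cosine_similarity(reference, f) for f in frames]
    return sum(sims) / len(sims)


# Toy embeddings: the first frame matches the reference, the second is orthogonal.
reference = [1.0, 0.0]
frames = [[2.0, 0.0], [0.0, 3.0]]
print(image_alignment(reference, frames))  # 0.5
```

Averaging over frames means a video that drifts away from the reference late in the sequence still receives partial credit for its early frames, which may be one reason such metrics discriminate weakly between strong models.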
Empirical Validation / Results
4.3 Main Results
Table 3: T2AV Task Results
| Model | Aud. | Event | VQ | Cont. | Trans. | Hol. | TVAlign | AVS | AudQ | AudL |
|---|---|---|---|---|---|---|---|---|---|---|
| Seedance 2.0 | Yes | 0.9023 | 3.7116 | 4.2649 | 4.0065 | 4.1128 | 0.6183 | 3.6038 | 3.7875 | 4.1845 |
| Kling 3.0 | Yes | 0.9274 | 3.3893 | 4.4139 | 3.8502 | 3.8542 | 0.6185 | 3.4922 | 3.6049 | 3.7713 |
| Veo 3.1 | Yes | 0.7784 | 2.8961 | 3.1348 | 4.0032 | 3.5759 | 0.6142 | 3.3490 | 3.2387 | 3.6931 |
Table 4: I2AV Task Results
| Model | Aud. | Event | VQ | Cont. | Trans. | Hol. | TVAlign | First-frame Anchoring | ImgAlign | AVS | AudQ | AudL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Seedance 2.0 | Yes | 0.9204 | 3.7651 | 4.9182 | 3.9625 | 3.8864 | 0.6145 | 0.9622 | 0.9027 | 3.5669 | 3.9113 | 4.2290 |
| Kling 3.0 | Yes | 0.8939 | 3.2760 | 4.1244 | 4.0668 | 3.8526 | 0.6182 | 0.9960 | 0.8877 | 3.5081 | 3.8032 | 4.0164 |
| Veo 3.1 | Yes | 0.8211 | 2.9266 | 3.8183 | 4.1414 | 3.6463 | 0.6156 | 0.9685 | 0.9051 | 3.3514 | 3.4484 | 4.1221 |
Table 5: V2AV Task Results
| Model | Aud. | Event | VQ | Cont. | Trans. | Hol. | TVAlign | AVS | AudQ | AudL |
|---|---|---|---|---|---|---|---|---|---|---|
| Seedance 2.0 | Yes | 0.8753 | 3.8336 | 4.7636 | 3.9267 | 4.1705 | 0.9727 | 3.7591 | 4.4357 | 4.3129 |
| Veo 3.1 | Yes | 0.8055 | 3.0869 | 1.8425 | 2.2815 | 3.3625 | 0.7100 | 3.4939 | 3.9485 | 3.2897 |
4.4 Analysis and Findings
Key Findings:
- Proprietary models dominate across all tasks, with Seedance 2.0 being the most consistent performer
- Task-specific alignment metrics (e.g., TVAlign, ImgAlign) often saturate, while event fulfillment, continuity, and holistic presentation provide more discriminative signals
- Performance Ads is the most challenging scenario, exposing weaknesses in product presentation and multi-step demonstration
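- Performance generally degrades as complexity (L1→L4) and event-chain length increase; in Table 6 the decline is steady for open-source and agent-based models, while proprietary models peak at L2 and then decline slightly through L4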
Table 6: Per-Difficulty Analysis (Average Balanced Score)
| Family | L1 | L2 | L3 | L4 |
|---|---|---|---|---|
| Proprietary Models | 70.6 | 75.2 | 74.5 | 73.9 |
| Open-Source Models | 57.9 | 52.9 | 52.8 | 51.4 |
| Agent-Based Models | 47.3 | 47.4 | 43.2 | 41.2 |
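To make the trend in Table 6 concrete, the short script below computes the change in Average Balanced Score from L1 to L4 for each family. The numbers are copied from the table; the helper name is hypothetical.

```python
# Average Balanced Score per model family (L1, L2, L3, L4), copied from Table 6.
table6 = {
    "Proprietary Models": [70.6, 75.2, 74.5, 73.9],
    "Open-Source Models": [57.9, 52.9, 52.8, 51.4],
    "Agent-Based Models": [47.3, 47.4, 43.2, 41.2],
}


def l1_to_l4_change(scores):
    """Return the L4 score minus the L1 score, rounded to one decimal."""
    return round(scores[-1] - scores[0], 1)


changes = {family: l1_to_l4_change(s) for family, s in table6.items()}
for family, delta in changes.items():
    print(f"{family}: {delta:+.1f}")
```

By these figures, open-source and agent-based models lose 6.5 and 6.1 points respectively between L1 and L4, whereas the proprietary row ends 3.3 points above its L1 value after peaking at L2.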
4.5 Human Alignment
Strong correlation between automatic scores and human preferences:
- Content Fidelity: Pearson r = 0.917
- Visual Quality: Pearson r = 0.935
- Long-Video Stability: Pearson r = 0.867
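For readers unfamiliar with the statistic, the Pearson coefficient between automatic scores and human ratings can be computed as below. This is a generic sketch of the standard formula; the score lists are invented toy data, not the benchmark's human-study results, and the function name is hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Invented toy data: automatic metric scores vs. mean human ratings per model.
automatic = [3.7, 3.4, 2.9, 2.1, 1.8]
human = [4.1, 3.9, 3.0, 2.4, 2.0]
print(round(pearson_r(automatic, human), 3))
```

A coefficient near 1 indicates that the automatic metric ranks and spaces systems much as human raters do; it does not by itself show that the two scales agree in absolute terms.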
4.6 Input Format Sensitivity
Table 7: Input-Format Sensitivity Analysis
| Model | V2AV | I2AV | T2AV |
|---|---|---|---|
| Seedance 2.0 | 80.4 | 83.9 | 83.6 |
| Veo 3.1 | 57.4 | 71.8 | 68.1 |
| LongCat | 39.8 | 40.4 | 41.2 |
| Helios (14B) | 40.5 | 34.4 | 34.6 |
Optimal input format is model-dependent, with no universally superior conditioning modality.
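The model-dependence claim can be checked directly against Table 7 by picking the highest-scoring input format for each model. The values below are copied from the table; the variable names are hypothetical.

```python
# Scores per input format, copied from Table 7.
table7 = {
    "Seedance 2.0": {"V2AV": 80.4, "I2AV": 83.9, "T2AV": 83.6},
    "Veo 3.1": {"V2AV": 57.4, "I2AV": 71.8, "T2AV": 68.1},
    "LongCat": {"V2AV": 39.8, "I2AV": 40.4, "T2AV": 41.2},
    "Helios (14B)": {"V2AV": 40.5, "I2AV": 34.4, "T2AV": 34.6},
}

# For each model, select the input format with the highest score.
best_format = {model: max(scores, key=scores.get) for model, scores in table7.items()}

for model, fmt in best_format.items():
    print(f"{model}: {fmt}")
```

Seedance 2.0 and Veo 3.1 score highest on I2AV, LongCat on T2AV, and Helios (14B) on V2AV, so no single conditioning modality wins for every model.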
Theoretical and Practical Implications
Theoretical Implications:
- Long-form evaluation requires multi-dimensional assessment: Single scores (e.g., FVD, CLIP score) are insufficient for minute-scale generation
- Conditioning modality affects generation stability: Different models excel with different input formats (text, image, or video)
- Audio-visual synchronization decays over time: Maintaining AV alignment becomes increasingly challenging with duration
Practical Implications:
- Benchmark as diagnostic tool: LongAV-Compass helps identify specific failure modes (identity drift, transition artifacts) rather than just ranking models
- Guidance for model development: Highlights need for better long-range temporal modeling and cross-event consistency mechanisms
- Application-specific evaluation: Different scenarios (Vlog vs. Performance Ads) stress different model capabilities
Conclusion
LongAV-Compass establishes a unified benchmark for minute-scale audio-visual generation across T2AV, I2AV, and V2AV. Key takeaways:
- Current models cannot be characterized by single scores – strong long-form generation requires joint success across event completion, temporal continuity, visual quality, semantic alignment, and audio-visual synchronization
- Proprietary models lead but have clear weaknesses – particularly in product-oriented scenarios and complex event chains
- Audio generation remains challenging – native audio support doesn't guarantee synchronized, coherent soundtracks over minute durations
- Benchmark enables systematic diagnosis – revealing where models fail as temporal scope, conditioning diversity, and cross-modal coupling increase
Future Directions: Extending to even longer durations (5-10 minutes), incorporating more complex narrative structures, and developing better automatic metrics for long-range consistency assessment.