LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Summary (Overview)

Unified Benchmark for Long-Form AV Generation: Introduces LongAV-Compass, the first benchmark dedicated to evaluating minute-scale (60-120 second) audio-visual generation across three conditioning modalities: Text-to-Audio-Video (T2AV), Image-to-Audio-Video (I2AV), and Video-to-Audio-Video (V2AV).
Diagnostic Evaluation Framework: Proposes a comprehensive evaluation framework with over 20 fine-grained dimensions, decomposing assessment into within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization.
Taxonomy-Guided Dataset: Constructs a curated dataset of 284 test cases organized by a two-dimensional taxonomy of application scenario (Vlog, Content-Creator, Performance Ads, Brand Ads) and generation complexity (L1-L4).
Systematic Model Analysis: Evaluates 11 representative models, revealing that current systems struggle with long-range identity drift, brittle event transitions, and audio-visual synchronization decay over extended durations. Proprietary models (Seedance 2.0, Kling 3.0, Veo 3.1) generally outperform open-source alternatives.
Human-Aligned Validation: Demonstrates strong correlation between automatic benchmark scores and human preferences, with Pearson correlations of 0.917 (content fidelity), 0.935 (visual quality), and 0.867 (long-video stability).

Introduction and Theoretical Foundation

Recent advances in video generation are pushing audio-visual (AV) generation beyond short clips towards minute-long content relevant for applications like vlogs, tutorials, and advertisements. Success in this long-form regime requires models to sustain subject identity, event continuity, scene transitions, and audio grounding over extended temporal horizons.

However, existing evaluation benchmarks (e.g., VBench, EvalCrafter, VABench, T2AV-Compass) remain largely confined to short-form settings (5-10 seconds) and provide fragmented coverage across input conditions. This creates three key limitations:

Limited temporal scale for assessing minute-long coherence
Fragmented coverage across T2AV, I2AV, and V2AV modalities
Poor diagnostic visibility into long-range degradation (identity drift, weak continuation, unstable transitions)

As summarized in Table 1, LongAV-Compass addresses these gaps by providing unified X2AV coverage across all three modalities with average video duration > 1 minute.

Table 1: Benchmark Comparison

Benchmark	#Samples	T2V	T2AV	I2AV	V2A	V2AV	Unified X2AV	Avg. Video Duration > 1min
MSVBench	276	✓	✗	✗	✗	✗	✗	✗
AVGen-Bench	235	✗	✓	✗	✗	✗	✗	✗
T2AV-Compass	500	✗	✓	✗	✗	✗	✗	✗
VABench	1,299	✗	✓	✓	✗	✗	✗	✗
PhyAVBench	337	✗	✓	✓	✓	✗	✗	✗
VinTAGe-Bench	636	✗	✗	✗	✓	✗	✗	✗
LongAV-Compass	284	✗	✓	✓	✗	✓	✓	✓

Methodology

3.1 Task Formulation

LongAV-Compass covers three long-form AV generation tasks under a unified framework:

T2AV: Generate minute-scale AV content from structured event scripts
I2AV: Generate long-form sequences conditioned on a reference image + event script, requiring consistent preservation of subject appearance
V2AV: Extend a reference video according to a continuation script while preserving style consistency and subject continuity

Table 2: Task Coverage

Task	#Samples	#Events	#Shots	Input
T2AV	128	879	2,115	Script (S)
I2AV	115	807	1,989	Reference Image (RI) + S
V2AV	41	235	731	Reference Video (RV) + S

3.2 Taxonomy and Benchmark Scope

The benchmark is organized by a two-dimensional taxonomy:

Application Scenario: Vlog, Content-Creator, Performance Ads, Brand Ads
Generation Complexity: L1 (simple interactions) to L4 (causal chains, physical plausibility)

3.3 Data Construction

A hybrid pipeline combines real-video transcription (60%) with LLM-template generation (40%) using Gemini 3.1 Pro:

T2AV: 128 cases from real videos (Creative Commons) and scenario templates
I2AV: 115 cases with reference images from permissively licensed repositories
V2AV: 41 cases with 10-15s reference clips + continuation scripts

3.4 Unified Annotation Format

Each case has dual representations:

Global description: Overall narrative structure
Event sequence: Temporally aligned sub-events with:
- Temporal span
- Action summary
- Completion criterion
- Key visual elements
- Expected audio content
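
The dual representation above can be pictured as a small data schema. The following is a minimal sketch, not the benchmark's actual file format: the class names, field names, the seconds-based span, and the example case are all hypothetical and chosen only to mirror the five listed event attributes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Event:
    """One temporally aligned sub-event (field names are hypothetical)."""
    span_sec: Tuple[float, float]  # temporal span as (start, end) in seconds
    action_summary: str
    completion_criterion: str
    key_visual_elements: List[str] = field(default_factory=list)
    expected_audio: str = ""


@dataclass
class Case:
    """A benchmark case: global description plus an ordered event sequence."""
    global_description: str
    events: List[Event]

    def is_temporally_ordered(self) -> bool:
        """True if every span is well-formed and events do not overlap."""
        prev_end = 0.0
        for ev in self.events:
            start, end = ev.span_sec
            if start < prev_end or end <= start:
                return False
            prev_end = end
        return True


# Illustrative case only; not taken from the dataset.
example = Case(
    global_description="A creator unboxes a product and demonstrates it.",
    events=[
        Event((0.0, 20.0), "Open the box", "Box lid is fully removed",
              ["box", "hands"], "cardboard rustling"),
        Event((20.0, 65.0), "Demonstrate the product", "Product is switched on",
              ["product", "table"], "power-on chime"),
    ],
)
```

Keeping the completion criterion and expected audio on each event is what would let a per-event check such as event fulfillment or audio-video synchronization be scored against a specific temporal span.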

3.5-3.7 Evaluation Metrics

The framework defines comprehensive metrics across video, audio, and task-specific dimensions:

Video Metrics (6 dimensions):

Event Fulfillment ( $V_{QA}$ ): MLLM-based QA verification (0-1 scale)
Visual Quality (VQ): MLLM evaluation of motion naturalness, subject integrity, artifact control, visual fidelity (1-5 scale)
Long-form Continuity (Cont.): Measures story continuity, subject consistency, scene coherence (1-5 scale)
Transition Stability (Trans.): Evaluates event boundaries for black frames, flickering, freezing (1-5 scale)
Holistic Presentation (Hol.): Overall presentation quality, watchability (1-5 scale)
Text-Video Alignment (TVAlign): CLIP embedding similarity (0-1 scale)
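
As an illustration of how a QA-based score can land on a 0-1 scale, the sketch below assumes that Event Fulfillment is the fraction of QA checks an MLLM judges as satisfied. That aggregation rule is an assumption made for illustration; the summary only states that the metric is MLLM-based QA verification on a 0-1 scale. The function name and the verdicts are hypothetical.

```python
from typing import Sequence


def event_fulfillment(verdicts: Sequence[bool]) -> float:
    """Fraction of QA checks judged as satisfied, in [0, 1].

    Assumes each check yields a binary verdict; the benchmark's real
    aggregation may differ.
    """
    if not verdicts:
        raise ValueError("at least one QA verdict is required")
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for one video: 3 of 4 checks pass.
score = event_fulfillment([True, True, False, True])
```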

Audio Metrics (3 dimensions):

Audio-Video Synchronization (AVS): Temporal alignment of sound with visible actions (1-5 scale)
Audio Quality (AudQ): Realism and event-appropriateness (1-5 scale)
Long-audio Coherence (AudL): Soundtrack continuity and stability (1-5 scale)

Task-Specific Metrics (I2AV):

First-frame Image Anchoring ( $IV_1$ ): MLLM rating of reference image preservation (1-5 scale; the $IV_1$ values in Table 4 appear to be rescaled to 0-1)
Image Alignment (ImgAlign): CLIP image-image similarity between reference and sampled frames (0-1 scale)
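
An embedding-similarity metric such as ImgAlign can be sketched as the mean cosine similarity between the reference-image embedding and each sampled-frame embedding. The code below is a sketch under stated assumptions: embeddings are taken as precomputed vectors (no CLIP model is loaded here), negative similarities are clipped to 0 to match the stated 0-1 scale, and frames are averaged uniformly. None of these choices is confirmed by the summary.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm vector has no defined cosine similarity")
    return dot / (norm_a * norm_b)


def image_alignment(ref: Sequence[float],
                    frames: Sequence[Sequence[float]]) -> float:
    """Mean clipped cosine similarity between a reference and sampled frames."""
    if not frames:
        raise ValueError("at least one frame embedding is required")
    sims = [max(0.0, cosine(ref, frame)) for frame in frames]  # clip: assumption
    return sum(sims) / len(sims)


# Toy 2-D embeddings: one frame identical to the reference, one orthogonal.
toy_score = image_alignment([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because frames that stay close to the reference all produce similarities near 1, a metric of this form tends to compress differences between strong models, which fits the saturation noted in the findings.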

Empirical Validation / Results

4.3 Main Results

Table 3: T2AV Task Results

Model	Aud.	$V_{QA}$	VQ	Cont.	Trans.	Hol.	TVAlign	AVS	AudQ	AudL
Seedance 2.0	Yes	0.9023	3.7116	4.2649	4.0065	4.1128	0.6183	3.6038	3.7875	4.1845
Kling 3.0	Yes	0.9274	3.3893	4.4139	3.8502	3.8542	0.6185	3.4922	3.6049	3.7713
Veo 3.1	Yes	0.7784	2.8961	3.1348	4.0032	3.5759	0.6142	3.3490	3.2387	3.6931

Table 4: I2AV Task Results

Model	Aud.	$V_{QA}$	VQ	Cont.	Trans.	Hol.	TVAlign	$IV_1$	ImgAlign	AVS	AudQ	AudL
Seedance 2.0	Yes	0.9204	3.7651	4.9182	3.9625	3.8864	0.6145	0.9622	0.9027	3.5669	3.9113	4.2290
Kling 3.0	Yes	0.8939	3.2760	4.1244	4.0668	3.8526	0.6182	0.9960	0.8877	3.5081	3.8032	4.0164
Veo 3.1	Yes	0.8211	2.9266	3.8183	4.1414	3.6463	0.6156	0.9685	0.9051	3.3514	3.4484	4.1221

Table 5: V2AV Task Results

Model	Aud.	$V_{QA}$	VQ	Cont.	Trans.	Hol.	TVAlign	AVS	AudQ	AudL
Seedance 2.0	Yes	0.8753	3.8336	4.7636	3.9267	4.1705	0.9727	3.7591	4.4357	4.3129
Veo 3.1	Yes	0.8055	3.0869	1.8425	2.2815	3.3625	0.7100	3.4939	3.9485	3.2897

4.4 Analysis and Findings

Key Findings:

Proprietary models dominate across all tasks, with Seedance 2.0 being the most consistent performer
Task-specific alignment metrics (e.g., TVAlign, ImgAlign) often saturate, while event fulfillment, continuity, and holistic presentation provide more discriminative signals
Performance Ads is the most challenging scenario, exposing weaknesses in product presentation and multi-step demonstration
Models generally degrade with increasing complexity (L1→L4) and event-chain length; in Table 6 the decline is clearest for open-source and agent-based models

Table 6: Per-Difficulty Analysis (Average Balanced Score)

Family	L1	L2	L3	L4
Proprietary Models	70.6	75.2	74.5	73.9
Open-Source Models	57.9	52.9	52.8	51.4
Agent-Based Models	47.3	47.4	43.2	41.2
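
The per-difficulty trend can be read directly off Table 6 with a little arithmetic. The sketch below copies the table values and computes, for each family, the L1-to-L4 change and whether the score falls at every step; the helper names are illustrative.

```python
# Average balanced scores from Table 6, ordered L1..L4.
table6 = {
    "Proprietary Models": [70.6, 75.2, 74.5, 73.9],
    "Open-Source Models": [57.9, 52.9, 52.8, 51.4],
    "Agent-Based Models": [47.3, 47.4, 43.2, 41.2],
}


def l1_to_l4_change(row):
    """Score at L4 minus score at L1, rounded to one decimal place."""
    return round(row[-1] - row[0], 1)


def strictly_decreasing(row):
    """True if the score drops at every step from L1 to L4."""
    return all(later < earlier for earlier, later in zip(row, row[1:]))


changes = {family: l1_to_l4_change(row) for family, row in table6.items()}
monotone = {family: strictly_decreasing(row) for family, row in table6.items()}

for family in table6:
    print(f"{family}: change {changes[family]:+.1f}, "
          f"strictly decreasing: {monotone[family]}")
```

On these numbers, open-source models lose 6.5 points and agent-based models 6.1 points between L1 and L4, while proprietary models score 3.3 points higher at L4 than at L1, peaking at L2 and declining afterwards.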

4.5 Human Alignment

Strong correlation between automatic scores and human preferences:

Content Fidelity: Pearson $r = 0.917$
Visual Quality: Pearson $r = 0.935$
Long-Video Stability: Pearson $r = 0.867$
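
For reference, the Pearson coefficient reported above is the covariance of the two score series divided by the product of their standard deviations. A minimal standard-library sketch follows; the example inputs are toy values, not the study's human ratings or benchmark scores.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy data: automatic scores vs. human ratings for three hypothetical models.
r_perfect = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_inverse = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Pearson's r measures linear agreement, so values such as 0.917 indicate that the automatic scores track human preferences closely, though not that the two use the same scale.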

4.6 Input Format Sensitivity

Table 7: Input-Format Sensitivity Analysis

Model	V2AV	I2AV	T2AV
Seedance 2.0	80.4	83.9	83.6
Veo 3.1	57.4	71.8	68.1
LongCat	39.8	40.4	41.2
Helios (14B)	40.5	34.4	34.6

Optimal input format is model-dependent, with no universally superior conditioning modality.
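
The model-dependence is easy to verify from Table 7. The sketch below copies the table values and reports, for each model, its best-scoring input format and the spread between its best and worst formats; the variable names are illustrative.

```python
# Scores from Table 7, keyed by model and input format.
table7 = {
    "Seedance 2.0": {"V2AV": 80.4, "I2AV": 83.9, "T2AV": 83.6},
    "Veo 3.1": {"V2AV": 57.4, "I2AV": 71.8, "T2AV": 68.1},
    "LongCat": {"V2AV": 39.8, "I2AV": 40.4, "T2AV": 41.2},
    "Helios (14B)": {"V2AV": 40.5, "I2AV": 34.4, "T2AV": 34.6},
}

# Best input format per model (argmax over the three formats).
best_format = {
    model: max(scores, key=scores.get) for model, scores in table7.items()
}

# Sensitivity: gap between the best and worst format, to one decimal place.
spread = {
    model: round(max(scores.values()) - min(scores.values()), 1)
    for model, scores in table7.items()
}

for model in table7:
    print(f"{model}: best={best_format[model]}, spread={spread[model]}")
```

Three different formats come out on top across the four models (I2AV for Seedance 2.0 and Veo 3.1, T2AV for LongCat, V2AV for Helios), and Veo 3.1 shows the largest spread at 14.4 points.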

Theoretical and Practical Implications

Theoretical Implications:

Long-form evaluation requires multi-dimensional assessment: Single scores (e.g., FVD, CLIP score) are insufficient for minute-scale generation
Conditioning modality affects generation stability: Different models excel with different input formats (text, image, or video)
Audio-visual synchronization decays over time: Maintaining AV alignment becomes increasingly challenging with duration

Practical Implications:

Benchmark as diagnostic tool: LongAV-Compass helps identify specific failure modes (identity drift, transition artifacts) rather than just ranking models
Guidance for model development: Highlights need for better long-range temporal modeling and cross-event consistency mechanisms
Application-specific evaluation: Different scenarios (Vlog vs. Performance Ads) stress different model capabilities

Conclusion

LongAV-Compass establishes a unified benchmark for minute-scale audio-visual generation across T2AV, I2AV, and V2AV. Key takeaways:

Current models cannot be characterized by single scores – strong long-form generation requires joint success across event completion, temporal continuity, visual quality, semantic alignment, and audio-visual synchronization
Proprietary models lead but have clear weaknesses – particularly in product-oriented scenarios and complex event chains
Audio generation remains challenging – native audio support doesn't guarantee synchronized, coherent soundtracks over minute durations
Benchmark enables systematic diagnosis – revealing where models fail as temporal scope, conditioning diversity, and cross-modal coupling increase

Future Directions: Extending to even longer durations (5-10 minutes), incorporating more complex narrative structures, and developing better automatic metrics for long-range consistency assessment.