Summary of "When Vision Speaks for Sound"
Summary (Overview)
- Identifies the Audio-Visual Clever Hans Effect: Current video-capable multimodal large language models (MLLMs) often answer sound-related questions by exploiting visual-semantic priors (e.g., a crash implies a "thud") rather than genuinely verifying the presence, timing, or consistency of the audio stream.
- Introduces the THUD Diagnostic Framework: A systematic probing method using three counterfactual audio interventions—Shift (temporal displacement), Mute (silence), and Swap (audio substitution)—to break natural audio-visual correlations and expose shortcut reliance.
- Proposes a Two-Stage Alignment Recipe: A training method combining intervention-derived preference pairs (to teach audio verification) with general video data (to prevent over-specialization). The best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points while maintaining or slightly improving performance on general video and audio-visual QA benchmarks.
Introduction and Theoretical Foundation
The paper investigates whether state-of-the-art video-capable MLLMs perform genuine audio-visual grounding or merely hallucinate acoustic information from visual cues. This behavior is characterized as an audio-visual Clever Hans effect, analogous to the horse that appeared to solve arithmetic by picking up on its trainer's subtle cues. In the multimodal context, models exploit the strong natural correlations between visual events and their likely sounds (e.g., a barking dog, a breaking object) to produce plausible-sounding answers without verifying the actual audio evidence.
The authors argue that standard evaluations, which use naturally correlated videos, fail to expose this pseudo-alignment. To diagnose it, evaluation must use controlled interventions that systematically break these correlations, forcing the model to rely on actual audio-visual reasoning.
Methodology
1. The THUD Diagnostic Protocol
THUD constructs a probing space by applying three physical interventions to the audio track of natural videos while keeping the visual stream fixed. Let a video be represented as $x = (v, a)$, where $v$ is the visual stream and $a$ is the audio track. The intervened video is:
$$\tilde{x} = (v, \mathcal{T}(a)), \qquad \mathcal{T} \in \{\mathrm{Shift}, \mathrm{Mute}, \mathrm{Swap}\}$$
The three operators are defined as:
- Shift (Temporal Synchronization): Displaces the audio track by an offset $\delta$.
- Mute (Audio Existence): Replaces the audio with silence.
- Swap (Audio-Visual Consistency): Replaces the original audio with a track $a'$ from another video $x' = (v', a')$.
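The three operators can be sketched on a toy waveform. This is a minimal illustration, not the paper's implementation: audio is a plain list of mono samples, and the function names, the zero-padding policy, and the sign convention (a positive offset delays the audio) are all assumptions.

```python
from typing import List

Audio = List[float]  # mono samples; a stand-in for a real waveform


def shift(audio: Audio, offset_s: float, sample_rate: int) -> Audio:
    """Delay (offset_s > 0) or advance (offset_s < 0) the audio, keeping its length.

    Samples pushed past either end are dropped; the gap is filled with zeros.
    """
    n = min(int(round(abs(offset_s) * sample_rate)), len(audio))
    if n == 0:
        return list(audio)
    pad = [0.0] * n
    if offset_s > 0:
        return pad + audio[: len(audio) - n]
    return audio[n:] + pad


def mute(audio: Audio) -> Audio:
    """Replace the track with silence of the same length."""
    return [0.0] * len(audio)


def swap(audio: Audio, donor: Audio) -> Audio:
    """Replace the track with a donor track, trimmed or zero-padded to match length."""
    n = len(audio)
    return (list(donor) + [0.0] * n)[:n]
```

In every case the visual stream is left untouched, so any change in the model's answer must come from the audio.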
2. Data Sourcing and Annotation
- Source Videos: The Oops dataset, containing "in-the-wild" videos of unintentional human actions (e.g., slips, crashes), is used due to its strong visual-acoustic event correlations.
- Annotation: Each source video is annotated with an event-time tuple $(e_v, t_v, e_a, t_a)$ representing the visual event, visual timestamp, audio event, and audio timestamp. Annotations are generated via cross-model verification (using Gemini, GPT, Claude) and human review, with strict agreement thresholds.
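One way to picture how the event-time tuple supports the Shift probe is sketched below. The class, the function names, the label strings, and the 0.2 s tolerance are all hypothetical, chosen only for illustration; the summary does not state the paper's actual thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventAnnotation:
    """Event-time tuple: visual event, visual time, audio event, audio time."""
    visual_event: str
    visual_time_s: float
    audio_event: str
    audio_time_s: float


def lag_after_shift(ann: EventAnnotation, shift_s: float) -> float:
    """Audio-minus-visual lag (seconds) once the audio is displaced by shift_s."""
    return (ann.audio_time_s + shift_s) - ann.visual_time_s


def sync_label(ann: EventAnnotation, shift_s: float, tolerance_s: float = 0.2) -> str:
    """Ground-truth label for a shifted clip; tolerance_s is an assumed value."""
    lag = lag_after_shift(ann, shift_s)
    if abs(lag) <= tolerance_s:
        return "synced"
    return "audio_late" if lag > 0 else "audio_early"
```

With the annotated timestamps in hand, the expected answer for any offset follows directly, which is what lets the probe be scored automatically.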
3. Alignment Recipe Construction
A two-stage post-training pipeline is designed to mitigate shortcuts:
- SFT Warm-up: Supervised Fine-Tuning on intervention data to establish audio-aware response patterns.
- Preference Optimization: Direct Preference Optimization (DPO) on a mixture of:
  - Intervention-derived preference pairs (chosen, rejected), where the chosen response verifies the audio-visual relation and the rejected response exhibits the visually plausible shortcut.
  - General video preference data (from FineVideo and LLaVA-Video) to regularize the model and preserve broad video understanding.
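For reference, the standard per-pair DPO objective used in the second stage can be sketched as below. This is the generic DPO loss, not the authors' training code; the inputs are summed log-probabilities of each response under the policy and a frozen reference model, and `beta=0.1` is an assumed value, not one reported in the summary.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed temperature, not taken from the paper
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to keep exp() from overflowing.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the audio-verified (chosen) response relative to the shortcut (rejected) response, measured against the reference model.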
Empirical Validation / Results
1. Diagnosis of Shortcut Reliance
The paper evaluates several leading open- and closed-source models (Gemini, Qwen3-Omni, MiniCPM-o, etc.) under the THUD protocol.
Table 1: Paired diagnostic accuracy (%) of video-capable multimodal models.
| Model | Size | Temporal Sync. (Orig. / Shift) | Audio Existence (Orig. / Mute) | Sound Consistency (Orig. / Swap) | Avg Gap |
|---|---|---|---|---|---|
| Gemini | N/A | 54.9 / 46.5 | 100.0 / 13.4 | 93.6 / 18.3 | 56.8 |
| MiniCPM-o-4.5 | 9B | 83.8 / 13.7 | 100.0 / 19.0 | 95.8 / 4.9 | 80.7 |
| Qwen3-Omni | 30B | 100.0* / 1.4 | 95.1 / 0.0 | 75.4 / 37.3 | 77.3 |
| MiMo-V2.5 | 311B | 73.9 / 9.9 | 99.3 / 2.1 | 89.4 / 15.3 | 78.4 |
Note: Avg Gap = average accuracy drop, in percentage points, from the Original to the intervention condition across the three dimensions; a larger gap indicates stronger shortcut reliance. * A perfect 100% on Original that collapses to near 0% under the intervention (e.g., Qwen3-Omni on Temporal Sync.: 100.0 to 1.4) reveals a "synchronized-default prior," not true grounding.
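The Avg Gap column can be recomputed from the paired accuracies in Table 1. The short sketch below does so; the dictionary layout and function name are illustrative, while the numbers are copied from the table.

```python
# (Original, Intervention) accuracy pairs per model, in the order
# Temporal Sync. (Shift), Audio Existence (Mute), Sound Consistency (Swap).
TABLE1 = {
    "Gemini": [(54.9, 46.5), (100.0, 13.4), (93.6, 18.3)],
    "MiniCPM-o-4.5": [(83.8, 13.7), (100.0, 19.0), (95.8, 4.9)],
    "Qwen3-Omni": [(100.0, 1.4), (95.1, 0.0), (75.4, 37.3)],
    "MiMo-V2.5": [(73.9, 9.9), (99.3, 2.1), (89.4, 15.3)],
}


def avg_gap(pairs):
    """Mean drop (percentage points) from Original to intervention accuracy."""
    return sum(orig - intervened for orig, intervened in pairs) / len(pairs)


for model, pairs in TABLE1.items():
    print(f"{model}: {avg_gap(pairs):.1f}")
```

Rounded to one decimal, the results match the Avg Gap column (56.8, 80.7, 77.3, 78.4).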
Key Findings:
- Models show large performance drops under interventions, indicating fragile performance on naturally correlated videos.
- Error analysis (Figures 3 & 4) reveals a uniform shortcut: models overwhelmingly hallucinate audio that matches the visuals (high Mute Hallucination and Swap False-Match rates) but rarely deny audio that is real (low False Silence and Swap False-Mismatch).
- Temporal perception is particularly poor, with models often missing offsets or guessing the wrong direction.
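The four error types named above can be read as simple conditional rates. The sketch below is one plausible formalization, not the paper's definition: the record layout `(task, condition, prediction)` and the label strings are assumptions made for illustration.

```python
# Each named error = a specific wrong answer under a specific (task, condition).
# These mappings are an assumed reading of the error names, not the paper's spec.
ERROR_DEFS = {
    "mute_hallucination": ("existence", "mute", "present"),     # hears sound in silence
    "false_silence": ("existence", "orig", "silent"),           # denies real audio
    "swap_false_match": ("consistency", "swap", "match"),       # accepts swapped audio
    "swap_false_mismatch": ("consistency", "orig", "mismatch"), # rejects original audio
}


def error_rates(records):
    """records: iterable of (task, condition, prediction) triples.

    Returns each error's rate among records of its (task, condition),
    or None when no such records exist.
    """
    records = list(records)
    rates = {}
    for name, (task, cond, wrong) in ERROR_DEFS.items():
        preds = [p for t, c, p in records if t == task and c == cond]
        rates[name] = sum(p == wrong for p in preds) / len(preds) if preds else None
    return rates
```

Under the asymmetry described above, the first and third rates would be high while the second and fourth stay low.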
2. Efficacy of Targeted Alignment
Using Qwen3-Omni-30B as a trainable backbone, the authors test various alignment recipes.
Table 2: Accuracy (%) under different alignment recipes.
| Recipe | Sync | VGGSync | V-MME | LVB | WS | DO | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B (Vanilla) | 34.3 | 36.8 | 69.2 | 49.1 | 50.3 | 68.2 | 51.3 |
| DPO w/ SP + FV-D | 82.2 | 55.4 | 69.1 | 51.5 | 49.8 | 68.0 | 62.7 |
| Ours (Final 10K Recipe) | 83.1 | 56.4 | 70.1 | 52.1 | 50.3 | 67.9 | 63.3 |
Note: OP: original-sync preferences; SP: SFT-policy negatives; CTP: counterfactual temporal preferences; FV-*: FineVideo data.
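As a consistency check on Table 2, the Avg. column and the per-benchmark change of the final recipe over the vanilla model can be recomputed from the reported scores. The variable and function names below are illustrative; the numbers are copied from the table.

```python
BENCHMARKS = ["Sync", "VGGSync", "V-MME", "LVB", "WS", "DO"]

TABLE2 = {
    "vanilla": [34.3, 36.8, 69.2, 49.1, 50.3, 68.2],
    "dpo_sp_fvd": [82.2, 55.4, 69.1, 51.5, 49.8, 68.0],
    "final_10k": [83.1, 56.4, 70.1, 52.1, 50.3, 67.9],
}


def mean(scores):
    """Unweighted mean over the six benchmarks."""
    return sum(scores) / len(scores)


def deltas(baseline, candidate):
    """Per-benchmark change in percentage points, rounded to one decimal."""
    return {
        name: round(c - b, 1)
        for name, b, c in zip(BENCHMARKS, baseline, candidate)
    }


for recipe, scores in TABLE2.items():
    print(f"{recipe}: avg = {mean(scores):.1f}")
print(deltas(TABLE2["vanilla"], TABLE2["final_10k"]))
```

The recomputed averages agree with the table (51.3, 62.7, 63.3), and the deltas show that the gain is concentrated in Sync (+48.8) and VGGSync (+19.6), with the remaining benchmarks moving by at most 3.0 points.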
Key Findings:
- The final recipe (DPO with CTP + FV-D + FV-A) improves temporal synchronization (Sync) from 34.3% to 83.1% while keeping the general benchmarks within 0.3 points of the vanilla model or better (V-MME +0.9, LVB +3.0, WS unchanged, DO -0.3), avoiding an "alignment tax."
- The model gains transferable temporal grounding, as shown by improved performance on the out-of-distribution VGGSoundSync benchmark (Figure 5).
- The improvement extends beyond coarse mismatch detection to fine-grained offset localization (Figure 6).
- Extending the recipe with small amounts of Mute/Swap SFT further improves performance on those dimensions, yielding a 28-percentage-point average gain over the vanilla model across all three interventions (Figure 7).
Theoretical and Practical Implications
- Theoretical: The work provides a formal framework (THUD) for diagnosing a specific failure mode—the Clever Hans effect—in multimodal grounding. It decomposes audio-visual understanding into three testable dimensions: temporal synchronization, audio existence, and source consistency.
- Practical: The results strongly suggest that current evaluations of video MLLMs are insufficient. Benchmarks must include counterfactual interventions to assess genuine cross-modal grounding rather than shortcut exploitation. The proposed alignment recipe offers a pathway to train more reliable, audio-verified models without sacrificing general capabilities.
Conclusion
The paper demonstrates that current video-capable MLLMs frequently rely on visual shortcuts rather than genuine audio verification, an illusion termed the audio-visual Clever Hans effect. The introduced THUD diagnostic framework effectively exposes this behavior via Shift, Mute, and Swap interventions. Furthermore, a two-stage alignment recipe using intervention-derived and general video preferences can significantly mitigate these shortcuts while preserving broad video understanding. The findings advocate for future model development and evaluation to move beyond naturally correlated videos and incorporate controlled, counterfactual audio-visual conditions.