Mega-ASR: Towards In-the-wild Speech Recognition via Scaling Up Real-world Acoustic Simulation

Summary (Overview)

Proposes Mega-ASR: A unified framework combining scalable compound-data construction (Voices-in-the-Wild-2M) with progressive acoustic-to-semantic optimization (A2S-SFT and DG-WGPO) to overcome the "acoustic robustness bottleneck" in real-world ASR.
Introduces Voices-in-the-Wild-2M: A large-scale dataset (~2.4M clips) covering 7 atomic acoustic phenomena and 54 physically plausible compound scenarios, generated via spectral-manipulation-based simulation to reflect complex, high-WER real-world conditions.
Develops Acoustic-to-Semantic Progressive Supervised Fine-Tuning (A2S-SFT): A three-phase training curriculum (encoder-aligner acoustic adaptation, LLM semantic adaptation, joint fine-tuning) to build robust perceptual recovery and semantic reconstruction capabilities incrementally.
Introduces Dual-Granularity WER-Gated Policy Optimization (DG-WGPO): A dynamic RL reward scheme that combines token-level refinement and sentence-level reconstruction rewards, with a WER-gated fusion strategy to address different error regimes (word-level vs. semantic-level failures).
Achieves State-of-the-Art Robustness: Mega-ASR significantly outperforms prior SOTA systems on adverse-condition benchmarks (e.g., 45.69% vs. 54.01% WER on VOiCES R4-B-F) and delivers over 30% relative WER reduction on complex compositional acoustic scenarios.

Introduction and Theoretical Foundation

Despite rapid advances in ASR and large audio-language models (LALMs), robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions (e.g., simultaneous noise, reverberation, echo, and frequency dropout). Performance drops sharply, with Word Error Rate (WER) rising to 10–30% or even 70% in harder cases.

Prior work on ASR-in-the-wild is limited by three key gaps:

D1: Limited scenario coverage. Models typically target isolated conditions (e.g., noise or far-field), requiring specialized models for different environments.
D2: Lack of compositional robustness. Robustness factors are studied independently, while large-scale data for realistic mixtures of effects is scarce.
D3: Mismatch between training data and real-world conditions. Existing training data emphasizes mild WER ranges (4–10%), not reflecting challenging settings where WER exceeds 30% and demands stronger semantic reasoning.

This work proposes Mega-ASR, a framework designed to push ASR capability under "in-the-wild 2" conditions: acoustic settings that are compositionally complex and much harder, rather than complex along a single factor. The core idea is to combine scalable compound-data construction with progressive acoustic-to-semantic optimization to bridge this robustness gap.

Methodology

1. Dataset: Voices-in-the-Wild-2M

To address the data scarcity for compositional, high-WER scenarios, the authors construct a large-scale dataset via a hierarchical, spectrogram-level simulation pipeline.

Atomic Acoustic Effects: Seven classic phenomena are implemented as dedicated spectral processing pipelines: {noise, far-field, obstructed, echo&reverb, recording, electronic distortion, transmission dropout}.
Compound Scenario Construction: Atomic effects are composed into 54 agent-validated configurations, retaining only physically plausible combinations (e.g., far-field with ambient noise in a church). A total of 2.4M synthesized clips are generated.
Difficulty Calibration & Filtering: A unified severity parameter $k \in [0, 1]$ is exposed for every effect. After comparing candidate distributions (Sqrt-Forward, Sqrt-Backward, Gaussian-Mid, Linear), the Linear distribution is adopted to provide a balanced difficulty profile. To ensure training stability, samples with WER above 70% are filtered out.
Evaluation Benchmark: Voices-in-the-Wild-Bench, a 5,000-clip evaluation set (3,500 synthetic, 1,500 real-world recordings) covering the same seven atomic phenomena, is released.
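
The 70% WER filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each sample already carries a transcript from some baseline recognizer (the `hypothesis` field) next to its ground-truth `reference`, and both field names and the `filter_by_wer` helper are hypothetical.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev is the DP row for the previous reference word.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # reference word dropped
                            curr[j - 1] + 1,          # extra hypothesis word
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(samples, max_wer=0.70):
    """Keep samples whose baseline WER does not exceed max_wer (assumed field names)."""
    return [s for s in samples
            if word_error_rate(s["hypothesis"], s["reference"]) <= max_wer]


samples = [
    {"reference": "turn on the kitchen lights", "hypothesis": "turn on the kitchen light"},
    {"reference": "please call me back tomorrow", "hypothesis": "peace"},
]
kept = filter_by_wer(samples)
```

Here the first sample has one substitution over five reference words (WER 0.2) and is kept, while the second loses almost every word (WER 1.0) and is dropped.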

2. Acoustic-to-Semantic Progressive Supervised Fine-Tuning (A2S-SFT)

A2S-SFT addresses two coupled bottlenecks in medium-to-high WER regimes: (i) extracting reliable acoustic evidence from corrupted waveforms, and (ii) leveraging the LLM's semantic prior to reconstruct the intended transcription.

The training proceeds in three progressive phases:

Encoder-Aligner Acoustic Adaptation: A WER-graded curriculum is applied to the encoder and aligner, successively expanding from WER < 30% to WER < 50% and finally to WER < 70%, to build acoustic perception incrementally.
LLM Semantic Adaptation: The LLM is fine-tuned on the full WER < 70% samples to activate semantic recovery under unreliable acoustic evidence.
Joint Acoustic-Semantic Adaptation: Encoder, aligner, and LLM are jointly fine-tuned for end-to-end alignment.
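
The three phases can be written down as a small schedule. The sketch below is only an illustration under stated assumptions: the `Stage` class, the module names, and the per-sample `wer` field are hypothetical; modules not listed in a stage are assumed frozen; and the joint phase is assumed to reuse the WER < 70% pool, which the summary does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple   # modules that receive gradient updates in this stage
    max_wer: float     # samples with WER strictly below this bound are used


A2S_SFT_SCHEDULE = [
    # Phase 1: WER-graded curriculum on the encoder and aligner.
    Stage("acoustic-1", ("encoder", "aligner"), 0.30),
    Stage("acoustic-2", ("encoder", "aligner"), 0.50),
    Stage("acoustic-3", ("encoder", "aligner"), 0.70),
    # Phase 2: LLM semantic adaptation on the full WER < 70% pool.
    Stage("semantic", ("llm",), 0.70),
    # Phase 3: joint fine-tuning (data range is an assumption).
    Stage("joint", ("encoder", "aligner", "llm"), 0.70),
]


def stage_subset(samples, stage):
    """Select the training samples admitted by a stage's WER bound."""
    return [s for s in samples if s["wer"] < stage.max_wer]


pool = [{"id": i, "wer": w} for i, w in enumerate([0.10, 0.40, 0.60, 0.90])]
sizes = [len(stage_subset(pool, stage)) for stage in A2S_SFT_SCHEDULE]
```

With this toy pool the admitted subset grows from one to three samples across the acoustic stages and then stays fixed, which mirrors the expanding WER range of the curriculum.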

3. Dual-Granularity WER-Gated Policy Optimization (DG-WGPO)

Building on Mega-ASR-Base from A2S-SFT, DG-WGPO applies reinforcement learning (using DAPO) with a novel dynamic reward scheme. The key observation is that error modes change at a WER threshold: errors for WER ≤ 30% are predominantly word-level confusions, but beyond this threshold they shift abruptly into sentence-level failures (hallucinations, omissions). The standard WER reward saturates under heavy degradation.

The DG-WGPO reward $R$ combines a static rule-based anchor with a dynamic, dual-granularity signal:

R = (1 - \alpha_{dyn}) R_{static} + \alpha_{dyn} R_{dynamic}

where $\alpha_{dyn} = 0.6$ .

Static Rule-Based Reward ( $R_{static}$ ): Provides a stable anchor.
- WER Reward: $R_{wer}(H, R) = 1 - WER(H, R)$ .
- Anti-Repetition Reward: $R_{rep}(H)$ zeros out rollouts containing repeated n-grams beyond a threshold.
$R_{static} = R_{rep} \cdot R_{wer}$
Dual-Granularity Dynamic Reward ( $R_{dynamic}$ ): The core innovation.
- Token-Level Refinement Reward ( $R_{fine}$ ): Targets word-level errors. Each substitution is classified as soft or hard by the character-level edit similarity between the hypothesis word $h$ and the reference word $r$ : $sim(h, r) = 1 - \frac{edit(h, r)}{\max(|h|, |r|)} \in [0, 1]$ . A substitution is soft if $sim(h, r) \geq 0.5$ and hard otherwise. The reward is $R_{fine} = \frac{n_C}{n_C + n_{hard} + \alpha_s n_{soft} + \epsilon}$ , where $n_C$ , $n_{hard}$ , $n_{soft}$ are the counts of correct tokens, hard errors, and soft errors; $\alpha_s = 0.4$ is the soft-error discount; and $\epsilon = 10^{-8}$ .
- Sentence-Level Reconstruction Reward ( $R_{struc}$ ): Targets semantic-level failures by scoring how well the hypothesis preserves the backbone of the reference: $R_{struc} = \frac{1}{2} \cdot \frac{LCS(H, R)}{|R|} + \frac{1}{2} \cdot \max\left(0, 1 - \frac{|H| - |R|}{|R|}\right)$ . The first term rewards Longest Common Subsequence (LCS) agreement with the reference; the second penalizes length deviation.
- WER-Gated Dynamic Fusion: The fusion weights flip at a threshold $\tau = 0.3$ to emphasize the regime-appropriate granularity. $$R_{dynamic} = \begin{cases} 0.75 R_{fine} + 0.25 R_{struc}, & \text{if } WER(H, R) < \tau \\ 0.25 R_{fine} + 0.75 R_{struc}, & \text{if } WER(H, R) \geq \tau \end{cases}$$
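
Taken together, the reward terms above can be sketched in plain Python. This is an illustrative reading of the formulas, not the authors' code, and it rests on several assumptions: inserted and deleted words count as hard errors; the length term uses the absolute difference between $|H|$ and $|R|$ so that short and long hypotheses are both penalized; and the anti-repetition rule (`n=3`, `max_repeats=2`) is a hypothetical threshold, since the summary does not give one.

```python
from collections import Counter

ALPHA_DYN, ALPHA_S, TAU, EPS = 0.6, 0.4, 0.3, 1e-8


def edit_table(a, b):
    """Levenshtein DP table between two sequences (word lists or strings)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d


def char_sim(h, r):
    """sim(h, r) = 1 - edit(h, r) / max(|h|, |r|)."""
    return 1.0 - edit_table(h, r)[-1][-1] / max(len(h), len(r), 1)


def align_counts(hyp, ref):
    """Backtrace the word alignment and return (n_correct, n_hard, n_soft)."""
    d = edit_table(hyp, ref)
    i, j = len(hyp), len(ref)
    n_c = n_hard = n_soft = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            if hyp[i - 1] == ref[j - 1]:
                n_c += 1
            elif char_sim(hyp[i - 1], ref[j - 1]) >= 0.5:
                n_soft += 1   # near-miss substitution
            else:
                n_hard += 1   # unrelated substitution
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            n_hard, i = n_hard + 1, i - 1   # inserted word (assumed hard)
        else:
            n_hard, j = n_hard + 1, j - 1   # deleted word (assumed hard)
    return n_c, n_hard, n_soft


def lcs_len(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rep_reward(hyp, n=3, max_repeats=2):
    """Return 0.0 if any n-gram occurs more than max_repeats times, else 1.0."""
    grams = Counter(tuple(hyp[k:k + n]) for k in range(len(hyp) - n + 1))
    return 0.0 if grams and max(grams.values()) > max_repeats else 1.0


def dg_wgpo_reward(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    n_ref = max(len(ref), 1)
    wer = edit_table(hyp, ref)[-1][-1] / n_ref
    r_static = rep_reward(hyp) * (1.0 - wer)
    n_c, n_hard, n_soft = align_counts(hyp, ref)
    r_fine = n_c / (n_c + n_hard + ALPHA_S * n_soft + EPS)
    r_struc = (0.5 * lcs_len(hyp, ref) / n_ref
               + 0.5 * max(0.0, 1.0 - abs(len(hyp) - len(ref)) / n_ref))
    w_fine = 0.75 if wer < TAU else 0.25   # WER gate flips the fusion weights
    r_dynamic = w_fine * r_fine + (1.0 - w_fine) * r_struc
    return (1.0 - ALPHA_DYN) * r_static + ALPHA_DYN * r_dynamic
```

As a worked case, `dg_wgpo_reward("the cat sit", "the cat sat")` has WER 1/3, so the gate selects the 0.25/0.75 weighting. The substitution "sit" for "sat" is soft (similarity 2/3), giving $R_{fine} = 2/2.4 \approx 0.833$ and $R_{struc} \approx 0.833$, and the total is $0.4 \cdot 0.667 + 0.6 \cdot 0.833 \approx 0.767$.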

4. Environment-Aware Routing for Inference

To preserve the original model's clean-speech and complementary capabilities (e.g., hotword recognition), a lightweight binary classifier is fine-tuned with LoRA to route each utterance. Clean inputs use the original Qwen3-ASR backbone; degraded inputs activate the Mega-ASR LoRA weights. This makes Mega-ASR a plug-and-play robustness module.
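
The routing decision itself is simple to sketch. The function below is a hypothetical illustration only: `degradation_score`, `base_asr`, `robust_asr`, and the 0.5 threshold are placeholder names and values, standing in for the LoRA-tuned classifier, the original backbone, and the backbone with Mega-ASR LoRA weights.

```python
from typing import Callable


def route_and_transcribe(
    audio: object,
    degradation_score: Callable[[object], float],
    base_asr: Callable[[object], str],
    robust_asr: Callable[[object], str],
    threshold: float = 0.5,
) -> str:
    """Use the robust path for inputs scored as degraded, the base path otherwise."""
    use_robust = degradation_score(audio) >= threshold
    return robust_asr(audio) if use_robust else base_asr(audio)


# Toy stand-ins so the sketch runs end to end.
clean_out = route_and_transcribe("clip-a", lambda a: 0.1,
                                 lambda a: "base:" + a, lambda a: "robust:" + a)
noisy_out = route_and_transcribe("clip-b", lambda a: 0.9,
                                 lambda a: "base:" + a, lambda a: "robust:" + a)
```

Keeping the two recognizers behind the same call signature is what lets the robust path act as a drop-in module: the caller never needs to know which branch handled the utterance.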

Empirical Validation / Results

Experiments initialize from Qwen3-ASR-1.7B and train on Voices-in-the-Wild-2M. Evaluation covers three axes: (i) Standard ASR benchmarks, (ii) Adverse-condition benchmarks, and (iii) Compound conditions (Voices-in-the-Wild-Bench). Baselines include 12 representative closed- and open-source systems (e.g., Whisper-Large-v3, Gemini-3-Flash, Qwen2.5-Omni).

Key Findings:

Competitive General ASR with Adaptive Routing: Mega-ASR remains highly competitive on clean and multilingual benchmarks. With routing, it improves performance (e.g., LibriSpeech test WER from 1.78/3.57 to 1.63/3.37) and preserves clean-domain capabilities.
State-of-the-Art Robustness under Acoustic Perturbations: Mega-ASR achieves the best overall robustness on CHiME-4, VOiCES, and NOIZEUS.

Table 2: Performance comparison on noisy and robust ASR benchmarks (WER %, lower is better).

Model	CHiME-4 Avg.	VOiCES Avg.	NOIZEUS Avg.	Avg.
Closed-source models
Gemini3-Flash	6.125	13.81	26.82	15.59
GPT-4o-trans.	6.47	22.65	22.94	17.35
Open-source models
Whisper-L-v3	7.02	11.79	13.34	10.72
Qwen2.5-Omni	7.37	18.87	19.18	15.14
Qwen3-ASR	5.39	8.94	9.45	7.93
Our model
Mega-ASR	5.23	7.35	7.52	6.70
Mega-ASR w/ router	5.00	7.37	7.90	6.76

Under extreme conditions (NOIZEUS at 0 dB SNR), Mega-ASR reduces WER to 19.80%, compared with 23.97% for Qwen3-ASR and 55.78% for Gemini3-Flash.

Superior Robustness in Compositional Real-World Environments: On Voices-in-the-Wild-Bench, Mega-ASR consistently achieves the strongest performance across all degradation types, especially under mixed degradations.

Table 4: Breakdown results on Voices-in-the-Wild-Bench by acoustic scenario (WER %, lower is better). Selected results for Mixed scenario.

Model	Mixed (Real)	Mixed (Sim.)
Gemini3-Flash	7.99	9.62
Whisper-L-v3	8.91	14.79
Qwen3-ASR	3.30	5.39
Mega-ASR	2.73	4.57

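On the Mixed scenario, this corresponds to a 65.8% relative WER reduction over Gemini3-Flash on real recordings and a 69.1% relative reduction over Whisper-L-v3 on simulated clips.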
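
The relative reductions can be recomputed directly from the Mixed columns of Table 4 as (baseline WER minus Mega-ASR WER) divided by baseline WER. The short sketch below does this for every listed baseline; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline_wer: float, model_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - model) / baseline."""
    return 100.0 * (baseline_wer - model_wer) / baseline_wer


# (Mixed Real, Mixed Sim.) WER values copied from Table 4.
TABLE4 = {
    "Gemini3-Flash": (7.99, 9.62),
    "Whisper-L-v3": (8.91, 14.79),
    "Qwen3-ASR": (3.30, 5.39),
}
MEGA_ASR = (2.73, 4.57)

reductions = {
    name: tuple(round(relative_reduction(b, m), 1) for b, m in zip(wers, MEGA_ASR))
    for name, wers in TABLE4.items()
}
```

With these values the reductions come out at about 65.8% (Real) and 52.5% (Sim.) over Gemini3-Flash, 69.4% and 69.1% over Whisper-L-v3, and 17.3% and 15.2% over Qwen3-ASR.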

Ablation Studies & Analysis

[Obs.1] Semantic-Level Gains: Mega-ASR's improvements generalize beyond WER to semantic-level metrics (e.g., missed-content drops from 14.2 to 5.9, hallucinations from 18.7 to 11.8).
[Obs.2] Component Ablation: Removing A2S-SFT's progressive stages or DG-WGPO components (especially $R_{struc}$ ) causes consistent degradation, confirming their value.
[Obs.3] Efficient Reward Design: The rule-based DG-WGPO reward matches the performance of an LLM-as-judge reward while taking roughly 3.2× less time per training step (19.57 s vs. 62.23 s).
[Obs.4] Hyperparameter Sensitivity: The dynamic reward weight $\alpha_{dyn}$ governs a more sensitive trade-off than the soft-error discount $\alpha_s$ . The default $(\alpha_{dyn}, \alpha_s)=(0.6, 0.4)$ works best.
[Obs.5] Case Study: Qualitative examples show Mega-ASR converting catastrophic failures (empty outputs, cross-lingual hallucinations, severe semantic drift) into correct or near-correct transcriptions, while SOTA baselines often produce fluent but incorrect content.

Theoretical and Practical Implications

Theoretical: The work formally identifies and addresses the "acoustic robustness bottleneck" and the shift in error regimes (word-level to semantic-level) at high WER. It demonstrates the necessity of compositional data simulation and progressive, granularity-aware optimization for true in-the-wild robustness.
Practical: Mega-ASR establishes a scalable paradigm for building robust ASR systems that can handle the complex, multi-factor degradations commonplace in real deployments (e.g., vehicles, public spaces). The released Voices-in-the-Wild-2M dataset and benchmark provide essential resources for future research. The environment-aware routing offers a practical deployment strategy to maintain clean-speech performance while activating robustness only when needed.

Conclusion

Mega-ASR presents a unified framework to overcome the acoustic robustness bottleneck in ASR under severe, compositional distortions. Central to its success are:

The large-scale, realistically simulated Voices-in-the-Wild-2M dataset.
The Acoustic-to-Semantic Progressive Supervised Fine-Tuning (A2S-SFT) curriculum.
The Dual-Granularity WER-Gated Policy Optimization (DG-WGPO) reward scheme.

Extensive experiments show that Mega-ASR achieves significant improvements over prior SOTA systems, especially under challenging real-world conditions where relative WER reductions can exceed 30%. The results highlight the critical importance of modeling compound acoustic environments at scale and provide a robust, scalable pathway for ASR-in-the-wild.