Visual Summary | MMAE: A Massive Multitask Audio Editing Benchmark

Summary (Overview)

First comprehensive audio editing benchmark: MMAE is the first general-purpose, instruction-based audio editing benchmark covering sound, speech, music, and their mixtures across 7 modalities, 6 complexity levels, 2 granularities, and 8 operation types.
Rubric-based evaluation paradigm: Instead of coarse metrics, MMAE decomposes each editing task into 17,741 fine-grained, verifiable multiple-choice rubrics (avg. 8.87 per sample) that assess both instruction following (IFR) and consistency (CR), enabling objective, multi-dimensional diagnosis.
Systematic taxonomy: The benchmark organizes tasks orthogonally along modality, complexity, and operation dimensions, ensuring broad coverage and balanced distribution (2,000 samples).
Dramatic model failures: Current audio editing systems achieve Exact Match Rates (EMR) below 5% across the board, dropping to 0% on complex mixed-modality tasks, exposing severe bottlenecks in precise execution and structural robustness.
Key insights: Higher complexity and mixed modalities degrade performance; IFR and CR present a fundamental trade-off; average competency (IFR/CR) decouples from flawless execution (EMR); agent-guided planners show limited improvement and may harm consistency.

Introduction and Theoretical Foundation

Background and Motivation

Intelligent editing has advanced rapidly in visual domains (e.g., image editing with Nano-banana 2, video editing with Gemini-Omni), spurring interest in instruction-based audio editing models. These models allow users to alter speech, music, or sound effects via natural language. However, the evaluation infrastructure has lagged behind:

Fragmented data coverage: Existing benchmarks are restricted to specific modalities (speech-only or sound-only) or basic operations (addition, removal, replacement).
Inadequate metrics: Traditional signal-level metrics (FAD, LSD, CLAP similarity) or generic MOS ratings fail to explicitly assess editing correctness, especially for open-ended, multi-faceted instructions.

Theoretical Basis

The paper argues that a next-generation evaluation framework must achieve breakthroughs in two critical dimensions:

Data coverage: Comprehensively cover speech, music, and sound effects, including complex scenarios that stress-test perception, reasoning, and generation.
Evaluation paradigm: Replace coarse metrics with rubric-based evaluation, a paradigm validated in text reinforcement learning [10], audio reasoning [11], and image editing [12]. Rubrics decompose free-form tasks into structured, verifiable criteria.

MMAE is designed to bridge this gap by establishing the first universal benchmark for instruction-based audio editing, requiring models to integrate three core capabilities: perception (understanding source audio context), reasoning (interpreting implicit user intent), and generation (executing high-fidelity edits).

Methodology

Taxonomy

MMAE characterizes each editing task along three orthogonal dimensions:

Modality: 7 categories – sound, music, speech, sound-music, sound-speech, music-speech, sound-music-speech.
Complexity: 6 levels – Single, Multi-part, Multi-instruction, Multi-audio, Multi-round, Multi-hop.
Operation: 2 granularities – Local (addition, removal, replacement, extraction, alteration) and Global (background change, foreground change, global alteration), for 8 operation types in total. Each sample may involve a single operation or an arbitrary composition of several.
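
As a rough illustration of the taxonomy above, the three dimensions can be written down as plain Python data. This is only a sketch: the names `BASE_TYPES`, `MODALITIES`, `COMPLEXITY_LEVELS`, `OPERATIONS`, and `is_valid_label` are hypothetical and do not come from the benchmark's release. It shows that the 7 modalities are exactly the non-empty combinations of the three base audio types, and that the two granularities hold 8 operation types in total.

```python
from itertools import combinations

# The three base audio types named in the summary.
BASE_TYPES = ("sound", "music", "speech")

# Every non-empty combination of base types is one modality (3 + 3 + 1 = 7).
MODALITIES = [
    "-".join(combo)
    for size in range(1, len(BASE_TYPES) + 1)
    for combo in combinations(BASE_TYPES, size)
]

# The six complexity levels, as listed in the summary.
COMPLEXITY_LEVELS = [
    "Single", "Multi-part", "Multi-instruction",
    "Multi-audio", "Multi-round", "Multi-hop",
]

# Operation types grouped by granularity.
OPERATIONS = {
    "local": ["addition", "removal", "replacement", "extraction", "alteration"],
    "global": ["background change", "foreground change", "global alteration"],
}


def is_valid_label(modality, complexity, operations):
    """Check a (hypothetical) sample label against the three dimensions."""
    known_ops = {op for ops in OPERATIONS.values() for op in ops}
    return (
        modality in MODALITIES
        and complexity in COMPLEXITY_LEVELS
        and len(operations) >= 1
        and all(op in known_ops for op in operations)
    )
```

A label such as `("sound-speech", "Multi-hop", ["removal", "background change"])` passes this check, reflecting that one sample may compose several operations.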

Evaluation Paradigm

The rubric-based paradigm evaluates models along two core dimensions:

Instruction Following (IF): Whether the model precisely performs the requested modifications.
Consistency: Whether all acoustic elements irrelevant to the instruction remain unaltered.

Each rubric is a multiple-choice question with one correct option. An external judge (audio language model) selects an option; a binary score (1 or 0) is assigned based on correctness. Rubrics are designed to satisfy four principles: Completeness, Atomicity, Orthogonality, Objectivity.
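
The scoring rule described above can be sketched in a few lines. The `Rubric` class, its field names, and the example question are assumptions made for illustration; the summary does not specify the benchmark's actual data schema or how the judge is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """One multiple-choice rubric (hypothetical schema)."""
    question: str
    options: tuple       # answer options shown to the judge
    answer_index: int    # index of the single correct option
    dimension: str       # "IF" or "consistency"


def score_rubric(rubric, judge_choice):
    """Return 1 if the judge picked the correct option, otherwise 0."""
    if not 0 <= judge_choice < len(rubric.options):
        raise ValueError("judge_choice is not a valid option index")
    return 1 if judge_choice == rubric.answer_index else 0


# Invented example rubric, not taken from the benchmark.
example = Rubric(
    question="Is the dog bark still audible in the edited audio?",
    options=("Yes, clearly", "No, it has been removed", "Cannot be determined"),
    answer_index=1,
    dimension="IF",
)
```

Here `score_rubric(example, 1)` yields 1 and any other valid choice yields 0, which is the binary score the paradigm aggregates.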

Metrics:

Instruction Following Rate (IFR): Average rubric score across IF rubrics for a sample.
Consistency Rate (CR): Average rubric score across consistency rubrics.
Exact Match Rate (EMR): Proportion of samples where all rubrics are answered correctly.
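
A minimal sketch of how the three metrics could be computed from binary rubric scores follows. Two points are assumptions rather than facts from the source: benchmark-level IFR and CR are taken as the mean of the per-sample rates, and every sample is assumed to have at least one rubric of each kind. The function names are hypothetical.

```python
from statistics import mean


def sample_metrics(if_scores, cr_scores):
    """Per-sample IFR, CR, and whether every rubric was answered correctly.

    Both lists hold binary scores (1 or 0) and are assumed non-empty.
    """
    ifr = mean(if_scores)
    cr = mean(cr_scores)
    exact = all(score == 1 for score in list(if_scores) + list(cr_scores))
    return ifr, cr, exact


def benchmark_metrics(samples):
    """Aggregate over samples; each sample is an (if_scores, cr_scores) pair.

    Assumption: overall IFR and CR are averages of the per-sample rates.
    All values are returned as percentages.
    """
    per_sample = [sample_metrics(i, c) for i, c in samples]
    return {
        "IFR": 100 * mean(p[0] for p in per_sample),
        "CR": 100 * mean(p[1] for p in per_sample),
        "EMR": 100 * sum(p[2] for p in per_sample) / len(per_sample),
    }
```

The sketch makes the decoupling discussed later easy to see: a sample with one wrong rubric out of nine still contributes a high IFR or CR but counts as a failure for EMR.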

Data Curation Pipeline (5 stages)

Brainstorming: Expert sessions to collect diverse editing scenarios.
Taxonomy & Paradigm Construction: Define task taxonomy and rubric-based framework.
Instruction-Centric Data Collection: Audio is manually searched for and collected from online videos; annotators write instructions and label metadata (modality, complexity, operation). The sample distribution is balanced dynamically across these dimensions during collection.
Rubrics Annotation: A human-agent collaborative workflow in which the Omni-Detective agent extracts captions, an LLM generates initial rubric drafts, human annotators refine them, and an LLM normalizes the result.
Quality Inspection: Cross-review by blind inspectors; iterative revision or rejection.

Statistics

Statistic	Value
Total Samples	2,000
Total Rubrics	17,741
Avg. Operations / Sample	1.22
Avg. Audio Duration / Sample	14.46 sec
Avg. Instruction Length	14.00 words
Avg. Rubrics / Sample	8.87
Avg. IF Rubrics / Sample	3.58
Avg. Consistency Rubrics / Sample	5.29
Avg. Choices / Rubric	3.53
Avg. Rubric Question Length	25.45 words
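
The reported figures are internally consistent, which the short check below confirms. It only re-derives values from the table above; the constant and function names are illustrative. The totals implied by the rounded averages can differ from the reported total by a rubric or so because the averages are given to two decimals.

```python
# Values copied from the statistics table.
TOTAL_SAMPLES = 2000
TOTAL_RUBRICS = 17741
AVG_IF_RUBRICS = 3.58
AVG_CONSISTENCY_RUBRICS = 5.29


def avg_rubrics_per_sample():
    """Total rubrics divided by total samples (reported as 8.87)."""
    return TOTAL_RUBRICS / TOTAL_SAMPLES


def implied_rubric_totals():
    """Approximate IF and consistency rubric counts implied by the averages."""
    return (
        round(AVG_IF_RUBRICS * TOTAL_SAMPLES),
        round(AVG_CONSISTENCY_RUBRICS * TOTAL_SAMPLES),
    )


# The overall average matches the table after rounding to two decimals.
assert round(avg_rubrics_per_sample(), 2) == 8.87
# IF and consistency averages add up to the overall average.
assert round(AVG_IF_RUBRICS + AVG_CONSISTENCY_RUBRICS, 2) == 8.87
```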

Empirical Validation / Results

Benchmarking Candidates

Five recent models are evaluated: Step-Audio-EditX, Ming-UniAudio, MMEdit, Audio-Omni, and SmartDJ (with and without its planner). Two reference baselines are included: Identity (returns the input unchanged) and Noise (outputs Gaussian noise). Due to input length constraints, MMEdit, Audio-Omni, and SmartDJ are evaluated only on samples of at most 10 s (801 of the 2,000 samples).

Main Results (Table 2)

Model	Overall IFR (%)	Overall CR (%)	Overall EMR (%)
Identity	27.37	94.13	4.60
Noise	32.08	15.68	0.00
Step-Audio-EditX	44.86	58.88	3.05
Ming-UniAudio	29.82	52.71	3.20
MMEdit*	43.12	47.64	3.50
Audio-Omni*	50.73	56.93	4.99
SmartDJ* w/o planner	38.20	55.41	4.62
SmartDJ* w/ planner	42.26	48.33	3.12
* Evaluated only on samples of at most 10 s (801 of 2,000), per the input length constraint noted above.

Key findings:

EMR consistently below 5%; in complex mixed-modality tasks (e.g., Sound-Music-Speech) EMR drops to 0% for several models.
Performance degradation with complexity: All models show clear drops from Single tasks to the Multiple (multi-*) complexity levels (e.g., Audio-Omni: IFR 58.43% → 41.70%, CR 64.57% → 47.94%).
Domain biases: Step-Audio-EditX better on speech; Audio-Omni better on sound and music; mixed-modality tasks hardest.
IFR-CR trade-off: Identity has near-perfect CR (94.13%) but poor IFR (27.37%); Noise has moderate IFR (32.08%) but very low CR (15.68%). The evaluated models struggle to score well on both at once.
Decoupling of average vs. exact match: Step-Audio-EditX outperforms Ming-UniAudio on IFR/CR but has lower EMR (3.05% vs. 3.20%), suggesting mean-seeking vs. mode-seeking behavior.
Agent planners limited: SmartDJ w/ planner improves IFR (42.26% vs. 38.20%) but harms CR (48.33% vs. 55.41%) and EMR (3.12% vs. 4.62%), due to cascaded errors and accumulated artifacts.
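
The comparisons in the findings above can be reproduced directly from the overall scores in the main results table. The sketch below copies those numbers into a dictionary and computes pairwise differences; `RESULTS` and `delta` are hypothetical names. The starred models were evaluated on the shorter-sample subset, so differences between starred and unstarred rows should be read with that caveat.

```python
# Overall (IFR, CR, EMR) in percent, copied from the main results table.
RESULTS = {
    "Identity": (27.37, 94.13, 4.60),
    "Noise": (32.08, 15.68, 0.00),
    "Step-Audio-EditX": (44.86, 58.88, 3.05),
    "Ming-UniAudio": (29.82, 52.71, 3.20),
    "MMEdit*": (43.12, 47.64, 3.50),
    "Audio-Omni*": (50.73, 56.93, 4.99),
    "SmartDJ* w/o planner": (38.20, 55.41, 4.62),
    "SmartDJ* w/ planner": (42.26, 48.33, 3.12),
}


def delta(model_a, model_b):
    """Return (IFR, CR, EMR) of model_a minus model_b, in percentage points."""
    return tuple(
        round(a - b, 2) for a, b in zip(RESULTS[model_a], RESULTS[model_b])
    )


# Effect of adding the planner to SmartDJ: IFR up, CR and EMR down.
planner_effect = delta("SmartDJ* w/ planner", "SmartDJ* w/o planner")

# Decoupling: higher IFR and CR for Step-Audio-EditX, yet lower EMR.
decoupling = delta("Step-Audio-EditX", "Ming-UniAudio")

# No row in the table reaches an overall EMR of 5%.
all_below_five = all(emr < 5.0 for _, _, emr in RESULTS.values())
```

With these numbers, `planner_effect` is a gain of 4.06 points in IFR against losses of 7.08 in CR and 1.50 in EMR, and `decoupling` is +15.04 IFR and +6.17 CR against -0.15 EMR.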

Theoretical and Practical Implications

Theoretical Implications

Need for atomic fidelity: Models fail to precisely execute individual operations while preserving context, indicating a fundamental gap in generative audio editing capability.
Complexity bottleneck: The sharp degradation with multi-step and mixed-modality tasks reveals that current architectures lack structural robustness for compositional, cross-domain reasoning.
Metric design matters: The decoupling between average metrics (IFR/CR) and exact match (EMR) shows that optimizing for average performance does not guarantee reliable, flawless editing. Evaluation must report both average and perfect-execution metrics.

Practical Implications

Diagnostic roadmap: MMAE provides clear failure modes: models either miss intended modifications (low IFR) or inadvertently alter preserved content (low CR). This guides researchers to focus on instruction adherence and context preservation simultaneously.
Modality unification: Current models show domain-specific strengths; true general-purpose editing requires universal support across sound, speech, music, and their mixtures.
Agentic planning caution: Decomposing instructions via high-level planners is insufficient if the base editor cannot reliably execute atomic steps. Future work should prioritize improving base model fidelity before relying on symbolic planning.
Standardized evaluation: MMAE establishes a reproducible, interpretable evaluation paradigm (rubric-based, scored by an MLLM judge) that can track progress and serve as a community resource.

Conclusion

MMAE is the first comprehensive benchmark for instruction-based audio editing, covering sound, speech, music, and their mixtures through a systematic taxonomy (modality, complexity, operation) and a rubric-based evaluation framework. The benchmark comprises 2,000 high-fidelity samples and 17,741 fine-grained rubrics, enabling objective assessment of both instruction following and content consistency.

Evaluation of five leading models reveals that current systems, while showing basic capabilities, remain far from reliable editing: Exact Match Rates fall below 5% overall and hit 0% on complex mixed-modality tasks. Key bottlenecks include:

Balancing precise modification with context preservation.
Handling increasing complexity and cross-domain synchronization.
Achieving flawless execution beyond average competency.
Obtaining reliable gains from external agentic planners, which are currently limited by fragile base generation.

MMAE highlights critical future directions: improving atomic editing fidelity, developing models with universal modality support, and advancing robust agent-guided systems for compositional editing. The benchmark is publicly released to serve as a diagnostic roadmap and standardized evaluation paradigm for next-generation audio editing systems.