Summary (Overview)
- First comprehensive audio editing benchmark: MMAE is the first general-purpose, instruction-based audio editing benchmark covering sound, speech, music, and their mixtures across 7 modalities, 6 complexity levels, 2 granularities, and 8 operation types.
- Rubric-based evaluation paradigm: Instead of coarse metrics, MMAE decomposes each editing task into 17,741 fine-grained, verifiable multiple-choice rubrics (avg. 8.87 per sample) that assess both instruction following (IFR) and consistency (CR), enabling objective, multi-dimensional diagnosis.
- Systematic taxonomy: The benchmark organizes tasks orthogonally along modality, complexity, and operation dimensions, ensuring broad coverage and balanced distribution (2,000 samples).
- Dramatic model failures: Current audio editing systems achieve Exact Match Rates (EMR) below 5% across the board, dropping to 0% on complex mixed-modality tasks, exposing severe bottlenecks in precise execution and structural robustness.
- Key insights: Higher complexity and mixed modalities degrade performance; IFR and CR present a fundamental trade-off; average competency (IFR/CR) decouples from flawless execution (EMR); agent-guided planners show limited improvement and may harm consistency.
Introduction and Theoretical Foundation
Background and Motivation
Intelligent editing has advanced rapidly in visual domains (e.g., image editing with Nano-banana 2, video editing with Gemini-Omni), spurring interest in instruction-based audio editing models. These models allow users to alter speech, music, or sound effects via natural language. However, the evaluation infrastructure has lagged behind:
- Fragmented data coverage: Existing benchmarks are restricted to specific modalities (speech-only or sound-only) or basic operations (addition, removal, replacement).
- Inadequate metrics: Traditional signal-level metrics (FAD, LSD, CLAP similarity) or generic MOS ratings fail to explicitly assess editing correctness, especially for open-ended, multi-faceted instructions.
Theoretical Basis
The paper argues that a next-generation evaluation framework must achieve breakthroughs in two critical dimensions:
- Data coverage: Comprehensively cover speech, music, and sound effects, including complex scenarios that stress-test perception, reasoning, and generation.
- Evaluation paradigm: Replace coarse metrics with rubric-based evaluation, a paradigm validated in text reinforcement learning [10], audio reasoning [11], and image editing [12]. Rubrics decompose free-form tasks into structured, verifiable criteria.
MMAE is designed to bridge this gap by establishing the first universal benchmark for instruction-based audio editing, requiring models to integrate three core capabilities: perception (understanding source audio context), reasoning (interpreting implicit user intent), and generation (executing high-fidelity edits).
Methodology
Taxonomy
MMAE characterizes each editing task along three orthogonal dimensions:
- Modality: 7 categories – sound, music, speech, sound-music, sound-speech, music-speech, sound-music-speech.
- Complexity: 6 levels – Single, Multi-part, Multi-instruction, Multi-audio, Multi-round, Multi-hop.
- Operation: 8 types across 2 granularities – Local (addition, removal, replacement, extraction, alteration) and Global (background change, foreground change, global alteration). Each sample may involve a single operation or an arbitrary composition of operations.
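As an illustration, the three dimensions above can be written down as a small label type. This is a minimal sketch, not the benchmark's actual schema: the class names `Modality`, `Complexity`, and `TaskLabel`, the field names, and the example label are all assumptions made here; only the category names come from the taxonomy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set, Tuple


class Modality(Enum):
    SOUND = "sound"
    MUSIC = "music"
    SPEECH = "speech"
    SOUND_MUSIC = "sound-music"
    SOUND_SPEECH = "sound-speech"
    MUSIC_SPEECH = "music-speech"
    SOUND_MUSIC_SPEECH = "sound-music-speech"


class Complexity(Enum):
    SINGLE = "Single"
    MULTI_PART = "Multi-part"
    MULTI_INSTRUCTION = "Multi-instruction"
    MULTI_AUDIO = "Multi-audio"
    MULTI_ROUND = "Multi-round"
    MULTI_HOP = "Multi-hop"


# Operation types grouped by granularity, as listed in the taxonomy.
LOCAL_OPERATIONS = ("addition", "removal", "replacement", "extraction", "alteration")
GLOBAL_OPERATIONS = ("background change", "foreground change", "global alteration")


@dataclass(frozen=True)
class TaskLabel:
    """Hypothetical per-sample label; field names are illustrative."""
    modality: Modality
    complexity: Complexity
    operations: Tuple[str, ...]  # one operation or a composition of several

    def __post_init__(self) -> None:
        known = LOCAL_OPERATIONS + GLOBAL_OPERATIONS
        unknown = [op for op in self.operations if op not in known]
        if not self.operations or unknown:
            raise ValueError(f"invalid operations: {self.operations!r}")

    def granularities(self) -> Set[str]:
        """Return which granularities the sample's operations touch."""
        return {
            "local" if op in LOCAL_OPERATIONS else "global"
            for op in self.operations
        }


# Invented example: a sound-speech sample composing a local and a global edit.
label = TaskLabel(
    Modality.SOUND_SPEECH,
    Complexity.MULTI_INSTRUCTION,
    ("removal", "background change"),
)
print(sorted(label.granularities()))  # ['global', 'local']
```

Keeping the three fields independent mirrors the orthogonality claim: any modality can be paired with any complexity level and any operation set.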
Evaluation Paradigm
The rubric-based paradigm evaluates models along two core dimensions:
- Instruction Following (IF): Whether the model precisely performs the requested modifications.
- Consistency: Whether all acoustic elements irrelevant to the instruction remain unaltered.
Each rubric is a multiple-choice question with one correct option. An external judge (audio language model) selects an option; a binary score (1 or 0) is assigned based on correctness. Rubrics are designed to satisfy four principles: Completeness, Atomicity, Orthogonality, Objectivity.
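The binary scoring described above can be sketched as follows. The `Rubric` class, its field names, and the example question are hypothetical and chosen only for illustration; the judge is represented simply by the index of the option it selected, since the summary does not specify how the judge is called.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rubric:
    """One multiple-choice rubric with exactly one correct option (illustrative)."""
    dimension: str            # "IF" or "consistency"
    question: str
    options: Tuple[str, ...]
    answer_index: int         # index of the single correct option


def score_rubric(rubric: Rubric, judge_choice: int) -> int:
    """Return 1 if the judge picked the correct option, otherwise 0."""
    if not 0 <= judge_choice < len(rubric.options):
        raise ValueError("judge_choice is not a valid option index")
    return 1 if judge_choice == rubric.answer_index else 0


# Invented example rubric; not taken from the benchmark.
rubric = Rubric(
    dimension="IF",
    question="After the edit, which sound is heard at the start of the clip?",
    options=("A dog barking", "A door closing", "Silence"),
    answer_index=1,
)
print(score_rubric(rubric, judge_choice=1))  # 1
print(score_rubric(rubric, judge_choice=0))  # 0
```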
Metrics:
- Instruction Following Rate (IFR): Average rubric score across IF rubrics for a sample.
- Consistency Rate (CR): Average rubric score across consistency rubrics for a sample.
- Exact Match Rate (EMR): Proportion of samples where all rubrics are answered correctly.
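A minimal sketch of how the three metrics could be computed from binary rubric scores. It assumes that every sample has at least one IF rubric and one consistency rubric, and that benchmark-level IFR and CR are the mean of the per-sample rates; the summary does not state the aggregation rule, so that part is an assumption, as are the function names and toy scores.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

# One sample = (IF rubric scores, consistency rubric scores), each score 0 or 1.
Sample = Tuple[Sequence[int], Sequence[int]]


def rate(scores: Sequence[int]) -> float:
    """Average binary rubric score for one sample (assumes a non-empty list)."""
    return sum(scores) / len(scores)


def benchmark_metrics(samples: List[Sample]) -> Dict[str, float]:
    """Return IFR, CR, and EMR as percentages over a list of samples."""
    ifr = mean(rate(if_scores) for if_scores, _ in samples)
    cr = mean(rate(cr_scores) for _, cr_scores in samples)
    # EMR counts a sample only if every rubric of both kinds is correct.
    emr = mean(
        1.0 if all(if_scores) and all(cr_scores) else 0.0
        for if_scores, cr_scores in samples
    )
    return {"IFR": 100 * ifr, "CR": 100 * cr, "EMR": 100 * emr}


# Toy scores, invented for illustration.
toy_samples: List[Sample] = [
    ([1, 1], [1, 1, 1]),  # flawless edit: counts toward EMR
    ([1, 0], [1, 1]),     # one missed instruction
    ([0, 0], [1, 0]),     # instruction ignored, context partly damaged
]
metrics = benchmark_metrics(toy_samples)
print({name: round(value, 2) for name, value in metrics.items()})
# {'IFR': 50.0, 'CR': 83.33, 'EMR': 33.33}
```

Because EMR requires every rubric of a sample to be correct, a single wrong answer removes the sample from EMR while only slightly lowering IFR or CR, which is why the averages and the exact-match rate can diverge.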
Data Curation Pipeline (5 stages)
- Brainstorming: Expert sessions to collect diverse editing scenarios.
- Taxonomy & Paradigm Construction: Define task taxonomy and rubric-based framework.
- Instruction-Centric Data Collection: Manual search and collection from online videos; annotators write instructions and label metadata (modality, complexity, operation). Dynamic balancing across dimensions.
- Rubrics Annotation: Human-agent collaborative workflow – Omni-Detective agent extracts captions; LLM generates initial rubric drafts; human annotators refine; LLM normalizes.
- Quality Inspection: Cross-review by blind inspectors; iterative revision or rejection.
Statistics
| Statistic | Value |
|---|---|
| Total Samples | 2,000 |
| Total Rubrics | 17,741 |
| Avg. Operations / Sample | 1.22 |
| Avg. Audio Duration / Sample | 14.46 sec |
| Avg. Instruction Length | 14.00 words |
| Avg. Rubrics / Sample | 8.87 |
| Avg. IF Rubrics / Sample | 3.58 |
| Avg. Consistency Rubrics / Sample | 5.29 |
| Avg. Choices / Rubric | 3.53 |
| Avg. Rubric Question Length | 25.45 words |
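The averages in the table are mutually consistent, which a short check makes explicit. The variable names are chosen here for illustration; the numbers are copied from the table.

```python
TOTAL_SAMPLES = 2_000
TOTAL_RUBRICS = 17_741
AVG_IF_RUBRICS = 3.58
AVG_CONSISTENCY_RUBRICS = 5.29

# Total rubrics divided by total samples gives the average rubrics per sample.
avg_rubrics = TOTAL_RUBRICS / TOTAL_SAMPLES
print(round(avg_rubrics, 2))  # 8.87

# The two per-dimension averages add up to the same figure.
print(round(AVG_IF_RUBRICS + AVG_CONSISTENCY_RUBRICS, 2))  # 8.87
```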
Empirical Validation / Results
Benchmarking Candidates
Five recent models are evaluated: Step-Audio-EditX, Ming-UniAudio, MMEdit, Audio-Omni, and SmartDJ (with and without its planner). Two reference baselines are included: Identity (returns the input unchanged) and Noise (outputs Gaussian noise). Because of input length constraints, MMEdit, Audio-Omni, and SmartDJ are evaluated only on samples of at most 10 s (801 of 2,000); these models are marked with * in the results table.
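The two reference baselines are simple enough to sketch. The version below treats a waveform as a plain list of float samples and ignores the instruction entirely. The function names, the noise standard deviation, the fixed seed, and the choice to match the input length are all assumptions made for this sketch; the summary does not give these details.

```python
import random
from typing import List


def identity_baseline(waveform: List[float], instruction: str) -> List[float]:
    """Return a copy of the input audio; the instruction is ignored."""
    return list(waveform)


def noise_baseline(
    waveform: List[float],
    instruction: str,
    std: float = 0.1,   # assumed noise scale, not taken from the paper
    seed: int = 0,      # fixed seed so the sketch is reproducible
) -> List[float]:
    """Return zero-mean Gaussian noise with the same length as the input."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in waveform]


source = [0.0, 0.25, -0.5, 0.25, 0.0]  # toy waveform, invented for illustration
unchanged = identity_baseline(source, "remove the dog bark")
noisy = noise_baseline(source, "remove the dog bark")
print(unchanged == source, len(noisy) == len(source))  # True True
```

Identity keeps everything and edits nothing, while Noise destroys everything, so the two bracket the behavior of a real editor on the consistency axis.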
Main Results (Table 2)
| Model | Overall IFR (%) | Overall CR (%) | Overall EMR (%) |
|---|---|---|---|
| Identity | 27.37 | 94.13 | 4.60 |
| Noise | 32.08 | 15.68 | 0.00 |
| Step-Audio-EditX | 44.86 | 58.88 | 3.05 |
| Ming-UniAudio | 29.82 | 52.71 | 3.20 |
| MMEdit* | 43.12 | 47.64 | 3.50 |
| Audio-Omni* | 50.73 | 56.93 | 4.99 |
| SmartDJ* w/o planner | 38.20 | 55.41 | 4.62 |
| SmartDJ* w/ planner | 42.26 | 48.33 | 3.12 |
Key findings:
- EMR consistently below 5%; in complex mixed-modality tasks (e.g., Sound-Music-Speech) EMR drops to 0% for several models.
- Performance degradation with complexity: All models show clear drops from Single to Multiple tasks (e.g., Audio-Omni IFR 58.43% → 41.70%; CR 64.57% → 47.94%).
- Domain biases: Step-Audio-EditX better on speech; Audio-Omni better on sound and music; mixed-modality tasks hardest.
- IFR-CR trade-off: Identity has near-perfect CR (94.13%) but poor IFR (27.37%); Noise reaches a slightly higher IFR (32.08%) but very low CR (15.68%). Models struggle to balance both.
- Decoupling of average vs. exact match: Step-Audio-EditX outperforms Ming-UniAudio on IFR/CR but has lower EMR (3.05% vs. 3.20%), suggesting mean-seeking vs. mode-seeking behavior.
- Agent planners limited: SmartDJ w/ planner improves IFR (42.26% vs. 38.20%) but harms CR (48.33% vs. 55.41%) and EMR (3.12% vs. 4.62%), due to cascaded errors and accumulated artifacts.
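The pairwise comparisons quoted in the findings can be reproduced directly from the overall results table. In this sketch the numbers are copied from that table, while the `RESULTS` dictionary and the `compare` helper are names introduced here for illustration.

```python
from typing import Dict, Tuple

# (IFR, CR, EMR) in percent, copied from the overall results table.
# Starred models were evaluated only on the shorter-sample subset, so
# starred and unstarred rows are not directly comparable.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "Identity": (27.37, 94.13, 4.60),
    "Noise": (32.08, 15.68, 0.00),
    "Step-Audio-EditX": (44.86, 58.88, 3.05),
    "Ming-UniAudio": (29.82, 52.71, 3.20),
    "MMEdit*": (43.12, 47.64, 3.50),
    "Audio-Omni*": (50.73, 56.93, 4.99),
    "SmartDJ* w/o planner": (38.20, 55.41, 4.62),
    "SmartDJ* w/ planner": (42.26, 48.33, 3.12),
}


def compare(model_a: str, model_b: str) -> Dict[str, float]:
    """Return model_a minus model_b for each metric, in percentage points."""
    a, b = RESULTS[model_a], RESULTS[model_b]
    return {
        name: round(x - y, 2)
        for name, x, y in zip(("IFR", "CR", "EMR"), a, b)
    }


# Planner effect: IFR rises while CR and EMR fall.
planner_delta = compare("SmartDJ* w/ planner", "SmartDJ* w/o planner")
print(planner_delta)  # {'IFR': 4.06, 'CR': -7.08, 'EMR': -1.5}

# Decoupling: higher average rates, yet a lower exact-match rate.
decoupling_delta = compare("Step-Audio-EditX", "Ming-UniAudio")
print(decoupling_delta)  # {'IFR': 15.04, 'CR': 6.17, 'EMR': -0.15}
```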
Theoretical and Practical Implications
Theoretical Implications
- Need for atomic fidelity: Models fail to precisely execute individual operations while preserving context, indicating a fundamental gap in generative audio editing capability.
- Complexity bottleneck: The sharp degradation with multi-step and mixed-modality tasks reveals that current architectures lack structural robustness for compositional, cross-domain reasoning.
- Metric design matters: The decoupling between average metrics (IFR/CR) and exact match (EMR) shows that optimizing for average performance does not guarantee reliable, flawless editing. Evaluation must report both average and perfect-execution metrics.
Practical Implications
- Diagnostic roadmap: MMAE provides clear failure modes: models either miss intended modifications (low IFR) or inadvertently alter preserved content (low CR). This guides researchers to focus on instruction adherence and context preservation simultaneously.
- Modality unification: Current models show domain-specific strengths; true general-purpose editing requires universal support across sound, speech, music, and their mixtures.
- Agentic planning caution: Decomposing instructions via high-level planners is insufficient if the base editor cannot reliably execute atomic steps. Future work should prioritize improving base model fidelity before relying on symbolic planning.
- Standardized evaluation: MMAE establishes a reproducible, interpretable evaluation paradigm (rubric-based with MLLM judge) that can benchmark progress and serve as a community resource.
Conclusion
MMAE is the first comprehensive benchmark for instruction-based audio editing, covering sound, speech, music, and their mixtures through a systematic taxonomy (modality, complexity, operation) and a rubric-based evaluation framework. The benchmark comprises 2,000 high-fidelity samples and 17,741 fine-grained rubrics, enabling objective assessment of both instruction following and content consistency.
Evaluation of five leading models reveals that current systems, while showing basic capabilities, remain far from reliable editing: Exact Match Rates fall below 5% overall and hit 0% on complex mixed-modality tasks. Key bottlenecks include:
- Balancing precise modification with context preservation.
- Handling increasing complexity and cross-domain synchronization.
- Achieving flawless execution beyond average competency.
- Limited improvement from external agentic planners due to fragile base generation.
MMAE highlights critical future directions: improving atomic editing fidelity, developing models with universal modality support, and advancing robust agent-guided systems for compositional editing. The benchmark is publicly released to serve as a diagnostic roadmap and standardized evaluation paradigm for next-generation audio editing systems.