# MMAE: A Massive Multitask Audio Editing Benchmark

> Current audio editing systems achieve exact match rates below 5%, dropping to 0% on complex mixed-modality tasks.

- **Source:** [arXiv](https://arxiv.org/abs/2606.07229)
- **Published:** 2026-06-09
- **Permalink:** https://picx.dev/p/jsi1mM
- **Whiteboard:** https://picx.dev/p/jsi1mM/image

## Summary (Overview)

- **First comprehensive audio editing benchmark**: MMAE is the first general-purpose, instruction-based audio editing benchmark covering sound, speech, music, and their mixtures across 7 modalities, 6 complexity levels, 2 granularities, and 8 operation types.
- **Rubric-based evaluation paradigm**: Instead of coarse metrics, MMAE decomposes each editing task into 17,741 fine-grained, verifiable multiple-choice rubrics (avg. 8.87 per sample) that assess both instruction following (IFR) and consistency (CR), enabling objective, multi-dimensional diagnosis.
- **Systematic taxonomy**: The benchmark organizes tasks orthogonally along modality, complexity, and operation dimensions, ensuring broad coverage and balanced distribution (2,000 samples).
- **Dramatic model failures**: Current audio editing systems achieve Exact Match Rates (EMR) below 5% across the board, dropping to 0% on complex mixed-modality tasks, exposing severe bottlenecks in precise execution and structural robustness.
- **Key insights**: Higher complexity and mixed modalities degrade performance; IFR and CR present a fundamental trade-off; average competency (IFR/CR) decouples from flawless execution (EMR); agent-guided planners show limited improvement and may harm consistency.

## Introduction and Theoretical Foundation

### Background and Motivation
Intelligent editing has advanced rapidly in visual domains (e.g., image editing with Nano-banana 2, video editing with Gemini-Omni), spurring interest in instruction-based audio editing models. These models allow users to alter speech, music, or sound effects via natural language. However, the evaluation infrastructure has lagged behind:
- **Fragmented data coverage**: Existing benchmarks are restricted to specific modalities (speech-only or sound-only) or basic operations (addition, removal, replacement).
- **Inadequate metrics**: Traditional signal-level metrics (FAD, LSD, CLAP similarity) or generic MOS ratings fail to explicitly assess editing correctness, especially for open-ended, multi-faceted instructions.

### Theoretical Basis
The paper argues that a next-generation evaluation framework must achieve breakthroughs in two critical dimensions:
1. **Data coverage**: Comprehensively cover speech, music, and sound effects, including complex scenarios that stress-test perception, reasoning, and generation.
2. **Evaluation paradigm**: Replace coarse metrics with rubric-based evaluation, a paradigm validated in text reinforcement learning [10], audio reasoning [11], and image editing [12]. Rubrics decompose free-form tasks into structured, verifiable criteria.

MMAE is designed to bridge this gap by establishing the first universal benchmark for instruction-based audio editing, requiring models to integrate three core capabilities: **perception** (understanding source audio context), **reasoning** (interpreting implicit user intent), and **generation** (executing high-fidelity edits).

## Methodology

### Taxonomy
MMAE characterizes each editing task along three orthogonal dimensions:
- **Modality**: 7 categories – sound, music, speech, sound-music, sound-speech, music-speech, sound-music-speech.
- **Complexity**: 6 levels – Single, Multi-part, Multi-instruction, Multi-audio, Multi-round, Multi-hop.
- **Operation**: 2 granularities – Local (addition, removal, replacement, extraction, alteration) and Global (background change, foreground change, global alteration). Each sample may involve a single operation or arbitrary compositions.
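
The three dimensions above can be sketched as a small labelling schema. This is only an illustration of the taxonomy as summarized here; the names `Modality`, `Complexity`, `OPERATIONS`, and `TaskLabel` are hypothetical and are not taken from the benchmark's release.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    SOUND = "sound"
    MUSIC = "music"
    SPEECH = "speech"
    SOUND_MUSIC = "sound-music"
    SOUND_SPEECH = "sound-speech"
    MUSIC_SPEECH = "music-speech"
    SOUND_MUSIC_SPEECH = "sound-music-speech"


class Complexity(Enum):
    SINGLE = "Single"
    MULTI_PART = "Multi-part"
    MULTI_INSTRUCTION = "Multi-instruction"
    MULTI_AUDIO = "Multi-audio"
    MULTI_ROUND = "Multi-round"
    MULTI_HOP = "Multi-hop"


# Operation types grouped by granularity, as listed in the taxonomy.
OPERATIONS = {
    "local": ("addition", "removal", "replacement", "extraction", "alteration"),
    "global": ("background change", "foreground change", "global alteration"),
}


@dataclass(frozen=True)
class TaskLabel:
    """Hypothetical per-sample label: one modality, one complexity, 1+ operations."""
    modality: Modality
    complexity: Complexity
    operations: tuple

    def __post_init__(self):
        known = {op for ops in OPERATIONS.values() for op in ops}
        unknown = [op for op in self.operations if op not in known]
        if not self.operations or unknown:
            raise ValueError(f"invalid operations: {self.operations!r}")


# A sample that composes one local and one global operation.
label = TaskLabel(
    Modality.SOUND_SPEECH,
    Complexity.MULTI_INSTRUCTION,
    ("removal", "background change"),
)
```

Keeping the dimensions as separate fields mirrors the orthogonality claim: any modality can in principle be paired with any complexity level and any composition of operations.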

### Evaluation Paradigm
The rubric-based paradigm evaluates models along two core dimensions:
- **Instruction Following (IF)**: Whether the model precisely performs the requested modifications.
- **Consistency**: Whether all acoustic elements irrelevant to the instruction remain unaltered.

Each rubric is a multiple-choice question with one correct option. An external judge (audio language model) selects an option; a binary score (1 or 0) is assigned based on correctness. Rubrics are designed to satisfy four principles: **Completeness**, **Atomicity**, **Orthogonality**, **Objectivity**.
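
A minimal sketch of the binary scoring described above, assuming the judge's answer arrives as an option index. The `Rubric` type, its field names, and the example question are hypothetical; the source does not specify the rubric file format or the judge interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    question: str
    options: tuple
    correct_index: int   # exactly one option is correct
    dimension: str       # "IF" or "consistency"


def score_rubric(rubric, judged_index):
    """Return 1 if the judge picked the correct option, else 0."""
    if not 0 <= judged_index < len(rubric.options):
        raise ValueError("judge selected an option that does not exist")
    return 1 if judged_index == rubric.correct_index else 0


# Hypothetical rubric for an instruction such as "remove the dog bark".
example = Rubric(
    question="After the edit, is the dog bark still audible?",
    options=("Yes", "No", "Cannot be determined"),
    correct_index=1,
    dimension="IF",
)

print(score_rubric(example, judged_index=1))  # 1: judge agrees with the key
print(score_rubric(example, judged_index=0))  # 0: judge picked a wrong option
```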

**Metrics**:
- **Instruction Following Rate (IFR)**: Average rubric score across IF rubrics for a sample.
- **Consistency Rate (CR)**: Average rubric score across consistency rubrics for a sample.
- **Exact Match Rate (EMR)**: Proportion of samples where all rubrics are answered correctly.
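
The three metrics can be sketched from per-rubric binary scores as follows. Two assumptions are made here that the summary does not state: the benchmark-level IFR and CR are taken as the mean of the per-sample rates, and every sample has at least one rubric of each kind. The function names are illustrative.

```python
def _mean(values):
    return sum(values) / len(values)


def sample_scores(if_scores, consistency_scores):
    """Per-sample IFR, CR, and whether every rubric was answered correctly."""
    return {
        "ifr": _mean(if_scores),
        "cr": _mean(consistency_scores),
        "exact": all(s == 1 for s in list(if_scores) + list(consistency_scores)),
    }


def benchmark_scores(samples):
    """Aggregate (if_scores, consistency_scores) pairs into percentages."""
    per_sample = [sample_scores(i, c) for i, c in samples]
    return {
        "IFR": 100 * _mean([s["ifr"] for s in per_sample]),
        "CR": 100 * _mean([s["cr"] for s in per_sample]),
        "EMR": 100 * _mean([1 if s["exact"] else 0 for s in per_sample]),
    }


toy = [
    ([1, 1], [1, 1, 1]),  # flawless edit: counts toward EMR
    ([1, 0], [1, 0]),     # partial edit that also damaged context
]
scores = benchmark_scores(toy)
print(scores)  # {'IFR': 75.0, 'CR': 75.0, 'EMR': 50.0}
```

The toy data shows why EMR is the strictest of the three: a single wrong rubric removes a sample from the exact-match count even when its average rates stay high.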

### Data Curation Pipeline (5 stages)
1. **Brainstorming**: Expert sessions to collect diverse editing scenarios.
2. **Taxonomy & Paradigm Construction**: Define task taxonomy and rubric-based framework.
3. **Instruction-Centric Data Collection**: Manual search and collection from online videos; annotators write instructions and label metadata (modality, complexity, operation). Dynamic balancing across dimensions.
4. **Rubrics Annotation**: Human-agent collaborative workflow – Omni-Detective agent extracts captions; LLM generates initial rubric drafts; human annotators refine; LLM normalizes.
5. **Quality Inspection**: Cross-review by blind inspectors; iterative revision or rejection.

### Statistics
| Statistic | Value |
|-----------|-------|
| Total Samples | 2,000 |
| Total Rubrics | 17,741 |
| Avg. Operations / Sample | 1.22 |
| Avg. Audio Duration / Sample | 14.46 sec |
| Avg. Instruction Length | 14.00 words |
| Avg. Rubrics / Sample | 8.87 |
| Avg. IF Rubrics / Sample | 3.58 |
| Avg. Consistency Rubrics / Sample | 5.29 |
| Avg. Choices / Rubric | 3.53 |
| Avg. Rubric Question Length | 25.45 words |
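
As a quick sanity check, the reported averages in the table are mutually consistent. The snippet below only re-derives values already listed; the variable names are arbitrary.

```python
total_samples = 2000
total_rubrics = 17741
avg_if_rubrics = 3.58
avg_consistency_rubrics = 5.29

# Total rubrics divided by samples should match "Avg. Rubrics / Sample".
avg_rubrics = round(total_rubrics / total_samples, 2)

# IF and consistency rubrics together should account for all rubrics.
avg_from_parts = round(avg_if_rubrics + avg_consistency_rubrics, 2)

print(avg_rubrics, avg_from_parts)  # 8.87 8.87
```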

## Empirical Validation / Results

### Benchmarking Candidates
Five recent models are evaluated: Step-Audio-EditX, Ming-UniAudio, MMEdit, Audio-Omni, and SmartDJ (with and without its planner). Two reference baselines bracket the results: Identity (returns the input unchanged) and Noise (outputs Gaussian noise). Because of input length constraints, MMEdit, Audio-Omni, and SmartDJ are evaluated only on samples of 10 s or shorter (801 of 2,000, about 40%; marked with * in Table 2).
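
The two reference baselines are simple enough to sketch directly. This is an illustration only: it assumes a waveform is a plain list of float samples, and the noise standard deviation and seed are arbitrary choices, since the summary gives no parameters for the Noise baseline.

```python
import random


def identity_baseline(waveform):
    """Return the input audio unchanged (as a copy)."""
    return list(waveform)


def noise_baseline(waveform, std=0.1, seed=0):
    """Return Gaussian noise of the same length as the input.

    `std` and `seed` are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in waveform]


wave = [0.0, 0.5, -0.5, 0.25]
unchanged = identity_baseline(wave)
noise = noise_baseline(wave)
```

Identity preserves everything and edits nothing, while Noise preserves nothing, which is why the pair is useful for calibrating how CR and IFR behave at the extremes.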

### Main Results (Table 2)
| Model | Overall IFR | Overall CR | Overall EMR |
|-------|-------------|------------|-------------|
| Identity | 27.37 | 94.13 | 4.60 |
| Noise | 32.08 | 15.68 | 0.00 |
| Step-Audio-EditX | **44.86** | **58.88** | 3.05 |
| Ming-UniAudio | 29.82 | 52.71 | 3.20 |
| MMEdit* | 43.12 | 47.64 | 3.50 |
| Audio-Omni* | **50.73** | 56.93 | **4.99** |
| SmartDJ* w/o planner | 38.20 | 55.41 | 4.62 |
| SmartDJ* w/ planner | 42.26 | 48.33 | 3.12 |

Key findings:
- **EMR consistently below 5%**; in complex mixed-modality tasks (e.g., Sound-Music-Speech) EMR drops to 0% for several models.
- **Performance degradation with complexity**: All models show clear drops from Single to Multiple tasks (e.g., Audio-Omni IFR 58.43% → 41.70%; CR 64.57% → 47.94%).
- **Domain biases**: Step-Audio-EditX better on speech; Audio-Omni better on sound and music; mixed-modality tasks hardest.
- **IFR-CR trade-off**: Identity reaches a very high CR (94.13%) but a low IFR (27.37%), while Noise reaches an IFR of 32.08% with a CR of only 15.68%. The evaluated models fall between these extremes and struggle to score well on both.
- **Decoupling of average vs. exact match**: Step-Audio-EditX outperforms Ming-UniAudio on IFR/CR but has lower EMR (3.05% vs. 3.20%), suggesting mean-seeking vs. mode-seeking behavior.
- **Agent planners limited**: SmartDJ w/ planner improves IFR (42.26% vs. 38.20%) but harms CR (48.33% vs. 55.41%) and EMR (3.12% vs. 4.62%), due to cascaded errors and accumulated artifacts.
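
The planner comparison in the last finding can be reproduced from the two SmartDJ rows of Table 2. The dictionary layout below is just a convenient way to hold the reported numbers.

```python
results = {
    "SmartDJ* w/o planner": {"IFR": 38.20, "CR": 55.41, "EMR": 4.62},
    "SmartDJ* w/ planner": {"IFR": 42.26, "CR": 48.33, "EMR": 3.12},
}

without = results["SmartDJ* w/o planner"]
with_planner = results["SmartDJ* w/ planner"]

# Positive delta: the planner helps; negative delta: the planner hurts.
delta = {
    metric: round(with_planner[metric] - without[metric], 2)
    for metric in ("IFR", "CR", "EMR")
}
print(delta)  # {'IFR': 4.06, 'CR': -7.08, 'EMR': -1.5}
```

On these figures the planner gains about 4 points of IFR while losing about 7 points of CR, so the drop in consistency outweighs the improvement in instruction following.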

## Theoretical and Practical Implications

### Theoretical Implications
- **Need for atomic fidelity**: Models fail to precisely execute individual operations while preserving context, indicating a fundamental gap in generative audio editing capability.
- **Complexity bottleneck**: The sharp degradation with multi-step and mixed-modality tasks reveals that current architectures lack structural robustness for compositional, cross-domain reasoning.
- **Metric design matters**: The decoupling between average metrics (IFR/CR) and exact match (EMR) shows that optimizing for average performance does not guarantee reliable, flawless editing. Evaluation must report both average and perfect-execution metrics.

### Practical Implications
- **Diagnostic roadmap**: MMAE provides clear failure modes: models either miss intended modifications (low IFR) or inadvertently alter preserved content (low CR). This guides researchers to focus on instruction adherence and context preservation simultaneously.
- **Modality unification**: Current models show domain-specific strengths; true general-purpose editing requires universal support across sound, speech, music, and their mixtures.
- **Agentic planning caution**: Decomposing instructions via high-level planners is insufficient if the base editor cannot reliably execute atomic steps. Future work should prioritize improving base model fidelity before relying on symbolic planning.
- **Standardized evaluation**: MMAE establishes a reproducible, interpretable evaluation paradigm (rubric-based with MLLM judge) that can benchmark progress and serve as a community resource.

## Conclusion

MMAE is the first comprehensive benchmark for instruction-based audio editing, covering sound, speech, music, and their mixtures through a systematic taxonomy (modality, complexity, operation) and a rubric-based evaluation framework. The benchmark comprises 2,000 high-fidelity samples and 17,741 fine-grained rubrics, enabling objective assessment of both instruction following and content consistency.

Evaluation of five leading models reveals that current systems, while showing basic capabilities, remain far from reliable editing: Exact Match Rates fall below 5% overall and hit 0% on complex mixed-modality tasks. Key bottlenecks include:
- Balancing precise modification with context preservation.
- Handling increasing complexity and cross-domain synchronization.
- Achieving flawless execution beyond average competency.
- Limited improvement from external agentic planners due to fragile base generation.

MMAE highlights critical future directions: improving atomic editing fidelity, developing models with universal modality support, and advancing robust agent-guided systems for compositional editing. The benchmark is publicly released to serve as a diagnostic roadmap and standardized evaluation paradigm for next-generation audio editing systems.
