Summary (Overview)

  • First comprehensive audio editing benchmark: MMAE is the first general-purpose, instruction-based audio editing benchmark covering sound, speech, music, and their mixtures across 7 modalities, 6 complexity levels, 2 granularities, and 8 operation types.
  • Rubric-based evaluation paradigm: Instead of coarse metrics, MMAE decomposes each editing task into 17,741 fine-grained, verifiable multiple-choice rubrics (avg. 8.87 per sample) that assess both instruction following (IFR) and consistency (CR), enabling objective, multi-dimensional diagnosis.
  • Systematic taxonomy: The benchmark organizes tasks orthogonally along modality, complexity, and operation dimensions, ensuring broad coverage and balanced distribution (2,000 samples).
  • Dramatic model failures: Current audio editing systems achieve Exact Match Rates (EMR) below 5% across the board, dropping to 0% on complex mixed-modality tasks, exposing severe bottlenecks in precise execution and structural robustness.
  • Key insights: Higher complexity and mixed modalities degrade performance; IFR and CR present a fundamental trade-off; average competency (IFR/CR) decouples from flawless execution (EMR); agent-guided planners show limited improvement and may harm consistency.

Introduction and Theoretical Foundation

Background and Motivation

Intelligent editing has advanced rapidly in visual domains (e.g., image editing with Nano-banana 2, video editing with Gemini-Omni), spurring interest in instruction-based audio editing models. These models allow users to alter speech, music, or sound effects via natural language. However, the evaluation infrastructure has lagged behind:

  • Fragmented data coverage: Existing benchmarks are restricted to specific modalities (speech-only or sound-only) or basic operations (addition, removal, replacement).
  • Inadequate metrics: Traditional signal-level metrics (FAD, LSD, CLAP similarity) or generic MOS ratings fail to explicitly assess editing correctness, especially for open-ended, multi-faceted instructions.

Theoretical Basis

The paper argues that a next-generation evaluation framework must achieve breakthroughs in two critical dimensions:

  1. Data coverage: Comprehensively cover speech, music, and sound effects, including complex scenarios that stress-test perception, reasoning, and generation.
  2. Evaluation paradigm: Replace coarse metrics with rubric-based evaluation, a paradigm validated in text reinforcement learning [10], audio reasoning [11], and image editing [12]. Rubrics decompose free-form tasks into structured, verifiable criteria.

MMAE is designed to bridge this gap by establishing the first universal benchmark for instruction-based audio editing, requiring models to integrate three core capabilities: perception (understanding source audio context), reasoning (interpreting implicit user intent), and generation (executing high-fidelity edits).

Methodology

Taxonomy

MMAE characterizes each editing task along three orthogonal dimensions:

  • Modality: 7 categories – sound, music, speech, sound-music, sound-speech, music-speech, sound-music-speech.
  • Complexity: 6 levels – Single, Multi-part, Multi-instruction, Multi-audio, Multi-round, Multi-hop.
  • Operation: 2 granularities – Local (addition, removal, replacement, extraction, alteration) and Global (background change, foreground change, global alteration). Each sample may involve a single operation or arbitrary compositions.
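The three dimensions above can be written down as a small label schema. This is a minimal sketch, not the benchmark's actual metadata format: the names `Modality`, `Complexity`, `OPERATIONS`, and `TaskLabel` are assumptions introduced for illustration; only the category values come from the taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    SOUND = "sound"
    MUSIC = "music"
    SPEECH = "speech"
    SOUND_MUSIC = "sound-music"
    SOUND_SPEECH = "sound-speech"
    MUSIC_SPEECH = "music-speech"
    SOUND_MUSIC_SPEECH = "sound-music-speech"


class Complexity(Enum):
    SINGLE = "Single"
    MULTI_PART = "Multi-part"
    MULTI_INSTRUCTION = "Multi-instruction"
    MULTI_AUDIO = "Multi-audio"
    MULTI_ROUND = "Multi-round"
    MULTI_HOP = "Multi-hop"


# Operation types grouped by granularity, as listed in the taxonomy.
OPERATIONS = {
    "local": ("addition", "removal", "replacement", "extraction", "alteration"),
    "global": ("background change", "foreground change", "global alteration"),
}


@dataclass(frozen=True)
class TaskLabel:
    """Hypothetical per-sample label; a sample may compose several operations."""
    modality: Modality
    complexity: Complexity
    operations: tuple

    def __post_init__(self):
        known = {op for ops in OPERATIONS.values() for op in ops}
        unknown = [op for op in self.operations if op not in known]
        if not self.operations or unknown:
            raise ValueError(f"invalid operations: {self.operations!r}")


# Illustrative label: a two-operation edit on a sound-speech mixture.
label = TaskLabel(
    Modality.SOUND_SPEECH,
    Complexity.MULTI_INSTRUCTION,
    ("removal", "background change"),
)
```

Keeping the three axes as separate fields mirrors their orthogonality: any modality can in principle be paired with any complexity level and any composition of operations.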

Evaluation Paradigm

The rubric-based paradigm evaluates models along two core dimensions:

  • Instruction Following (IF): Whether the model precisely performs the requested modifications.
  • Consistency: Whether all acoustic elements irrelevant to the instruction remain unaltered.

Each rubric is a multiple-choice question with one correct option. An external judge (audio language model) selects an option; a binary score (1 or 0) is assigned based on correctness. Rubrics are designed to satisfy four principles: Completeness, Atomicity, Orthogonality, Objectivity.
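The binary scoring rule can be sketched as follows. The `Rubric` fields, the `score_rubric` helper, and the example question are all hypothetical and only illustrate the shape of a multiple-choice rubric with one correct option; the judge's selection is represented here as a plain option index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical container for one multiple-choice rubric."""
    question: str
    choices: tuple
    correct_index: int
    dimension: str  # "IF" or "consistency"


def score_rubric(rubric, judge_choice):
    """Return 1 if the judge selected the correct option, otherwise 0."""
    if not 0 <= judge_choice < len(rubric.choices):
        raise ValueError("judge_choice is out of range")
    return 1 if judge_choice == rubric.correct_index else 0


# Illustrative rubric for an instruction such as "remove the dog bark".
example = Rubric(
    question="Is a dog bark audible in the edited audio?",
    choices=("Yes", "No", "Cannot be determined"),
    correct_index=1,
    dimension="IF",
)

print(score_rubric(example, 1))  # 1: judge picked the correct option
print(score_rubric(example, 0))  # 0: judge picked a wrong option
```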

Metrics:

  • Instruction Following Rate (IFR): Average rubric score across IF rubrics for a sample.
  • Consistency Rate (CR): Average rubric score across consistency rubrics.
  • Exact Match Rate (EMR): Proportion of samples where all rubrics are answered correctly.
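A minimal sketch of how the three metrics could be computed from binary rubric scores. Two points are assumptions rather than details from the paper: the benchmark-level IFR and CR are taken as the mean of the per-sample rates, and every sample is assumed to have at least one rubric of each kind. The function names and the toy scores are illustrative.

```python
from statistics import mean


def sample_metrics(if_scores, cr_scores):
    """Per-sample IFR, CR, and whether every rubric was answered correctly."""
    ifr = mean(if_scores)
    cr = mean(cr_scores)
    exact = all(s == 1 for s in list(if_scores) + list(cr_scores))
    return ifr, cr, exact


def benchmark_metrics(samples):
    """Aggregate over samples; each sample is (if_scores, cr_scores).

    Assumption: overall IFR/CR are averages of the per-sample rates.
    """
    per_sample = [sample_metrics(i, c) for i, c in samples]
    return {
        "IFR": round(100 * mean(m[0] for m in per_sample), 2),
        "CR": round(100 * mean(m[1] for m in per_sample), 2),
        "EMR": round(100 * mean(1 if m[2] else 0 for m in per_sample), 2),
    }


# Toy data: three samples with binary rubric scores.
toy = [
    ([1, 1], [1, 1, 1]),     # all rubrics correct: counts toward EMR
    ([1, 0], [1, 1, 0, 1]),  # partially correct
    ([0, 0], [1, 1]),        # nothing edited, everything preserved
]
print(benchmark_metrics(toy))  # {'IFR': 50.0, 'CR': 91.67, 'EMR': 33.33}
```

The toy data shows why EMR is much stricter than IFR or CR: a single wrong rubric removes the whole sample from the exact-match count, even when its average scores are high.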

Data Curation Pipeline (5 stages)

  1. Brainstorming: Expert sessions to collect diverse editing scenarios.
  2. Taxonomy & Paradigm Construction: Define task taxonomy and rubric-based framework.
  3. Instruction-Centric Data Collection: Manual search and collection from online videos; annotators write instructions and label metadata (modality, complexity, operation). Dynamic balancing across dimensions.
  4. Rubrics Annotation: Human-agent collaborative workflow – Omni-Detective agent extracts captions; LLM generates initial rubric drafts; human annotators refine; LLM normalizes.
  5. Quality Inspection: Cross-review by blind inspectors; iterative revision or rejection.

Statistics

Statistic                           Value
Total Samples                       2,000
Total Rubrics                       17,741
Avg. Operations / Sample            1.22
Avg. Audio Duration / Sample        14.46 sec
Avg. Instruction Length             14.00 words
Avg. Rubrics / Sample               8.87
Avg. IF Rubrics / Sample            3.58
Avg. Consistency Rubrics / Sample   5.29
Avg. Choices / Rubric               3.53
Avg. Rubric Question Length         25.45 words
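The reported averages are internally consistent, which a short arithmetic check makes explicit. The variable names are illustrative; the figures are the ones reported in the statistics.

```python
total_samples = 2000
total_rubrics = 17741
avg_if_rubrics = 3.58
avg_consistency_rubrics = 5.29

# Total rubrics divided by total samples gives the average per sample.
avg_rubrics = round(total_rubrics / total_samples, 2)
print(avg_rubrics)  # 8.87

# IF and consistency rubrics together account for all rubrics per sample.
avg_from_parts = round(avg_if_rubrics + avg_consistency_rubrics, 2)
print(avg_from_parts)  # 8.87
```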

Empirical Validation / Results

Benchmarking Candidates

Five recent models are evaluated: Step-Audio-EditX, Ming-UniAudio, MMEdit, Audio-Omni, and SmartDJ (with and without its planner). Two reference baselines are included: Identity, which returns the input unchanged, and Noise, which outputs Gaussian noise. Because of input length constraints, MMEdit, Audio-Omni, and SmartDJ are evaluated only on samples of at most 10 s (801 of the 2,000 samples); these models are marked with * in the results.
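The two reference baselines and the short-sample subset are simple enough to sketch. This is an illustration under stated assumptions, not the paper's implementation: audio is represented as a plain list of float samples, the noise standard deviation and seed are arbitrary, the noise output is assumed to match the input length, and the `duration` field name is hypothetical.

```python
import random


def identity_baseline(waveform):
    """Identity: return the input audio unchanged (as a copy)."""
    return list(waveform)


def noise_baseline(waveform, std=0.1, seed=0):
    """Noise: ignore the input content and emit Gaussian noise.

    Assumption: output has the same number of samples as the input.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in waveform]


def short_subset(samples, max_seconds=10.0):
    """Keep samples that fit the input limit of length-constrained models."""
    return [s for s in samples if s["duration"] <= max_seconds]


source = [0.0, 0.25, -0.5, 0.125]
print(identity_baseline(source) == source)  # True
print(len(noise_baseline(source)))          # 4

clips = [{"duration": 9.5}, {"duration": 10.0}, {"duration": 14.46}]
print(len(short_subset(clips)))             # 2
```

The baselines bracket the two failure modes: Identity preserves everything and edits nothing, while Noise preserves nothing, so neither can score well on both instruction following and consistency.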

Main Results (Table 2)

Model                   Overall IFR (%)   Overall CR (%)   Overall EMR (%)
Identity                27.37             94.13            4.60
Noise                   32.08             15.68            0.00
Step-Audio-EditX        44.86             58.88            3.05
Ming-UniAudio           29.82             52.71            3.20
MMEdit*                 43.12             47.64            3.50
Audio-Omni*             50.73             56.93            4.99
SmartDJ* w/o planner    38.20             55.41            4.62
SmartDJ* w/ planner     42.26             48.33            3.12

Key findings:

  • EMR consistently below 5%; in complex mixed-modality tasks (e.g., Sound-Music-Speech) EMR drops to 0% for several models.
  • Performance degradation with complexity: All models show clear drops from Single to Multiple tasks (e.g., Audio-Omni IFR 58.43% → 41.70%; CR 64.57% → 47.94%).
  • Domain biases: Step-Audio-EditX better on speech; Audio-Omni better on sound and music; mixed-modality tasks hardest.
  • IFR-CR trade-off: Identity has very high CR (94.13%) but poor IFR (27.37%); Noise has moderate IFR (32.08%) but very low CR (15.68%). Models struggle to balance both.
  • Decoupling of average vs. exact match: Step-Audio-EditX outperforms Ming-UniAudio on IFR (44.86% vs. 29.82%) and CR (58.88% vs. 52.71%) but has lower EMR (3.05% vs. 3.20%), suggesting mean-seeking vs. mode-seeking behavior.
  • Agent planners limited: SmartDJ w/ planner improves IFR (42.26% vs. 38.20%) but harms CR (48.33% vs. 55.41%) and EMR (3.12% vs. 4.62%), due to cascaded errors and accumulated artifacts.
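The planner and decoupling findings above can be checked directly against the Table 2 figures. The sketch below only subtracts reported numbers; the `results` dictionary layout and the `delta` helper are illustrative choices.

```python
# Overall scores (%) as reported in Table 2.
results = {
    "Step-Audio-EditX": {"IFR": 44.86, "CR": 58.88, "EMR": 3.05},
    "Ming-UniAudio": {"IFR": 29.82, "CR": 52.71, "EMR": 3.20},
    "SmartDJ w/o planner": {"IFR": 38.20, "CR": 55.41, "EMR": 4.62},
    "SmartDJ w/ planner": {"IFR": 42.26, "CR": 48.33, "EMR": 3.12},
}


def delta(model_a, model_b):
    """Score of model_a minus score of model_b, in percentage points."""
    return {
        metric: round(results[model_a][metric] - results[model_b][metric], 2)
        for metric in ("IFR", "CR", "EMR")
    }


# Adding the planner raises IFR but lowers CR and EMR.
planner_delta = delta("SmartDJ w/ planner", "SmartDJ w/o planner")
print(planner_delta)  # {'IFR': 4.06, 'CR': -7.08, 'EMR': -1.5}

# Higher average scores do not imply a higher exact-match rate.
decoupling_delta = delta("Step-Audio-EditX", "Ming-UniAudio")
print(decoupling_delta)  # {'IFR': 15.04, 'CR': 6.17, 'EMR': -0.15}
```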

Theoretical and Practical Implications

Theoretical Implications

  • Need for atomic fidelity: Models fail to precisely execute individual operations while preserving context, indicating a fundamental gap in generative audio editing capability.
  • Complexity bottleneck: The sharp degradation with multi-step and mixed-modality tasks reveals that current architectures lack structural robustness for compositional, cross-domain reasoning.
  • Metric design matters: The decoupling between average metrics (IFR/CR) and exact match (EMR) shows that optimizing for average performance does not guarantee reliable, flawless editing. Evaluation must report both average and perfect-execution metrics.

Practical Implications

  • Diagnostic roadmap: MMAE provides clear failure modes: models either miss intended modifications (low IFR) or inadvertently alter preserved content (low CR). This guides researchers to focus on instruction adherence and context preservation simultaneously.
  • Modality unification: Current models show domain-specific strengths; true general-purpose editing requires universal support across sound, speech, music, and their mixtures.
  • Agentic planning caution: Decomposing instructions via high-level planners is insufficient if the base editor cannot reliably execute atomic steps. Future work should prioritize improving base model fidelity before relying on symbolic planning.
  • Standardized evaluation: MMAE establishes a reproducible, interpretable evaluation paradigm (rubric-based with MLLM judge) that can benchmark progress and serve as a community resource.

Conclusion

MMAE is the first comprehensive benchmark for instruction-based audio editing, covering sound, speech, music, and their mixtures through a systematic taxonomy (modality, complexity, operation) and a rubric-based evaluation framework. The benchmark comprises 2,000 high-fidelity samples and 17,741 fine-grained rubrics, enabling objective assessment of both instruction following and content consistency.

Evaluation of five leading models reveals that current systems, while showing basic capabilities, remain far from reliable editing: Exact Match Rates fall below 5% overall and hit 0% on complex mixed-modality tasks. Key bottlenecks include:

  • Balancing precise modification with context preservation.
  • Handling increasing complexity and cross-domain synchronization.
  • Achieving flawless execution beyond average competency.
  • Limited improvement from external agentic planners due to fragile base generation.

MMAE highlights critical future directions: improving atomic editing fidelity, developing models with universal modality support, and advancing robust agent-guided systems for compositional editing. The benchmark is publicly released to serve as a diagnostic roadmap and standardized evaluation paradigm for next-generation audio editing systems.
