StepAudio 2.5 Technical Report - Summary
Summary (Overview)
- Unified Audio-Language Foundation: StepAudio 2.5 is a single multimodal foundation model that achieves state-of-the-art performance across three core speech capabilities: Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Realtime spoken interaction.
- Operational Regime Thesis: The model operates on the principle that once text and audio share a common representational space, task specialization is achieved not through distinct architectures, but through different operational regimes: data construction, optimization targets, and decoding constraints.
- RLHF-Centric Alignment: The model advances beyond standard supervised fine-tuning, using Reinforcement Learning from Human Feedback (RLHF) as the primary mechanism to define complex, nuanced optimization targets for TTS and Realtime interaction.
- Specialized Decoding Strategies: The ASR branch incorporates a novel Multi-Token Prediction (MTP) head with autoregressive verification for highly efficient transcription, while TTS and Realtime branches use tailored RLHF and data pipelines for control and naturalness.
- State-of-the-Art Results: The model demonstrates superior performance on standard benchmarks against both leading unified models and specialized systems in all three domains (ASR, TTS, Realtime), validating the unified foundation approach.
Introduction and Theoretical Foundation
Large language models (LLMs) are driving the convergence of speech systems by treating speech as another sequence type within the same modeling framework. Traditional cascaded pipelines (ASR → LM → TTS) discard paralinguistic information when they reduce speech to text. A unified audio-language foundation preserves this information end-to-end, allowing cues like emotion and context to directly influence recognition, synthesis, and dialogue.
While models like Step-Audio 2, Qwen3-Omni, and commercial systems (GPT-4o, Gemini) have moved towards this unified direction, simultaneously meeting the distinct deployment requirements of ASR (accuracy/efficiency), TTS (control/expressivity), and Realtime interaction (low-latency/persona consistency) within a single model remains challenging.
StepAudio 2.5 is built on the central thesis:
Once text and audio share a well-shaped representational space, the differences among downstream tasks migrate away from architecture toward operational regimes: data, objectives, and decoding constraints.
The model refines a shared multimodal prior through a unified alignment paradigm, moving beyond basic supervised fine-tuning (SFT) to establish RLHF as the central mechanism for capturing nuanced human preferences.
Methodology
2.1 Shared Backbone Architecture
The architecture follows an audio-encoder–adapter–LLM-decoder pattern:
- A frozen audio encoder converts waveforms into compact acoustic embeddings.
- A lightweight adapter maps these embeddings into the hidden space of a large decoder initialized from a text LLM.
- The decoder operates over a unified sequence where conventional text tokens and newly introduced audio tokens can both appear.
This design is intentionally asymmetric: the encoder handles stable acoustic abstraction, while the decoder carries semantics, context management, and generation.
2.2 Task Specialization as Directional Inference
The foundation supports three primary inference directions:
- ASR: Audio embeddings condition the decoder to generate transcript tokens.
- TTS: Text and control instructions condition the decoder to generate audio tokens.
- Realtime: The model couples audio understanding and response generation under strict latency constraints.
3. Shared Data Engine and Foundation Pretraining
The model is initialized from a textual Mixture-of-Experts (MoE) LLM and continually pre-trained on 2.2T tokens of text and audio data via a staged curriculum:
- Alignment Phase (3B tokens): Only the adapter is trained on ASR data to align speech and text feature spaces.
- Multimodal Training (1.6T tokens): The vocabulary is expanded with speech tokens. Training includes ASR, TTS, translation, and conversational data. This phase has a 128B-token warmup stage and a main training stage.
- Cooldown Phase (600B tokens): Sequence length is increased to 32K, focusing on high-quality multimodal supervision (Audio Caption, Instruct TTS) for capability refinement.
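The staged token budget above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `Stage` dataclass, its field names, and the helper functions are hypothetical, and the 128B-token warmup is counted inside the 1.6T multimodal stage, as the text describes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    name: str      # hypothetical label for the curriculum stage
    tokens: float  # token budget as stated in the summary
    note: str      # short description taken from the summary


CURRICULUM: List[Stage] = [
    Stage("alignment", 3e9, "adapter only, ASR data"),
    Stage("multimodal", 1.6e12, "vocabulary expanded with speech tokens"),
    Stage("cooldown", 600e9, "sequence length raised to 32K"),
]


def total_tokens(stages: List[Stage]) -> float:
    """Sum the token budgets of all stages."""
    return sum(s.tokens for s in stages)


def budget_share(stages: List[Stage]) -> Dict[str, float]:
    """Fraction of the total budget spent in each stage."""
    total = total_tokens(stages)
    return {s.name: s.tokens / total for s in stages}


if __name__ == "__main__":
    print(f"total: {total_tokens(CURRICULUM) / 1e12:.3f}T tokens")
    for name, frac in budget_share(CURRICULUM).items():
        print(f"{name}: {frac:.1%}")
```

The three stages sum to 2.203T tokens, which rounds to the 2.2T figure quoted above.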
4. ASR Specialization
The ASR branch augments the shared backbone with a verifiable Multi-Token Prediction (MTP-5) head. At decoding position $t$, the main branch predicts the next token $x_{t+1}$, while the $k$-th MTP branch predicts $x_{t+1+k}$ for $k = 1, \dots, 5$. During inference, proposals are accepted only as a verified prefix: if a proposed future token disagrees with the normal decoding path, that token and all later ones are rejected.
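The verified-prefix rule can be sketched as follows. This is a minimal illustration, not the report's implementation: `accept_verified_prefix`, `next_token`, and `toy_next_token` are hypothetical names, tokens are plain integers, and verification is done one token at a time for clarity, whereas a real system would presumably verify all proposals in a single batched forward pass.

```python
from typing import Callable, List, Sequence


def accept_verified_prefix(
    context: Sequence[int],
    proposals: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted in one decoding step.

    proposals[0] is the main-branch token; proposals[k] is the k-th MTP
    branch's guess for the token k positions further ahead.
    """
    if not proposals:
        raise ValueError("at least the main-branch token is required")
    accepted: List[int] = [proposals[0]]  # main-branch token is always kept
    for guess in proposals[1:]:
        # Token the ordinary autoregressive path would emit at this position.
        verified = next_token(list(context) + accepted)
        if guess != verified:
            break  # first disagreement rejects this and all later proposals
        accepted.append(guess)
    return accepted


def toy_next_token(tokens: Sequence[int]) -> int:
    """Stand-in for the decoder: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(accept_verified_prefix([1, 2, 3], [4, 5, 6, 9, 8, 1], toy_next_token))
```

With the toy decoder, the proposals `[4, 5, 6, 9, 8, 1]` yield `[4, 5, 6]`: the fourth proposal disagrees with the verified path, so it and everything after it are dropped. Because only verified tokens are emitted, the output matches what plain autoregressive decoding would produce.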
Training Pipeline:
- ASR SFT: Supervised fine-tuning turns the model into a reliable autoregressive recognizer.
- MTP Training: A two-stage process:
  - Frozen-branch alignment: Only the newly appended MTP blocks are optimized.
  - Joint calibration: The adapter and LLM decoder are unfrozen for joint optimization with the MTP blocks.
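The training objective at each position combines the standard next-token loss with weighted MTP losses:
$$\mathcal{L}_t = -\log p^{(0)}_t(x_{t+1}) - \sum_{k=1}^{K} \lambda_k \log p^{(k)}_t(x_{t+1+k})$$
where $x_{t+1}$ is the next token after position $t$, $p^{(0)}_t$ and $p^{(k)}_t$ are the distributions from the main and auxiliary branches, $K$ is the number of MTP branches, and the branch weights $\lambda_k$ decay exponentially with the branch index $k$.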
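As a worked example of this kind of objective, the sketch below adds exponentially decayed auxiliary cross-entropy terms to the main next-token loss. The function names and the `base` and `decay` values are placeholders chosen for illustration; they are not the values used in the report.

```python
import math
from typing import List, Sequence


def branch_weights(num_branches: int, base: float, decay: float) -> List[float]:
    """Exponentially decayed weights: base, base*decay, base*decay^2, ..."""
    return [base * decay ** k for k in range(num_branches)]


def mtp_loss(
    main_prob: float,
    branch_probs: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Main-branch cross-entropy plus weighted MTP cross-entropies.

    main_prob: probability the main branch assigns to the true next token.
    branch_probs[k]: probability MTP branch k+1 assigns to its true target.
    """
    if len(branch_probs) != len(weights):
        raise ValueError("one weight per MTP branch is required")
    loss = -math.log(main_prob)
    for weight, prob in zip(weights, branch_probs):
        loss += weight * -math.log(prob)
    return loss


if __name__ == "__main__":
    # Placeholder hyperparameters, not taken from the report.
    weights = branch_weights(num_branches=5, base=0.5, decay=0.5)
    probs = [0.8, 0.7, 0.6, 0.5, 0.4]  # made-up branch probabilities
    print(weights)
    print(round(mtp_loss(0.9, probs, weights), 4))
```

Decaying the weights means that distant-future predictions, which are harder and less often accepted, contribute less to the gradient than the main next-token term.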
Data: Uses ~100K hours of short-form SFT data and a 50K-hour long-form dataset created via a multi-system verification pipeline (ROVER fusion [30]) and LLM-based refinement for cross-segment consistency.
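To give a feel for the multi-system voting step, here is a deliberately simplified sketch. ROVER proper first aligns the hypotheses into a word transition network and then votes; the code below skips alignment entirely and assumes the hypotheses are already aligned slot by slot, with an empty string marking a gap. The function name `vote_aligned` is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def vote_aligned(hypotheses: Sequence[Sequence[str]]) -> List[str]:
    """Majority vote per slot over pre-aligned hypotheses ('' marks a gap)."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    if len({len(h) for h in hypotheses}) != 1:
        raise ValueError("hypotheses must be pre-aligned to equal length")
    fused: List[str] = []
    for slot in zip(*hypotheses):
        # Ties go to the word seen first among the systems.
        word, _count = Counter(slot).most_common(1)[0]
        if word:  # a winning gap means no word is emitted for this slot
            fused.append(word)
    return fused


if __name__ == "__main__":
    systems = [
        ["the", "cat", "sat"],
        ["the", "cap", "sat"],
        ["the", "cat", "sat"],
    ]
    print(vote_aligned(systems))
```

In this toy case two of three systems agree on "cat", so the fused transcript is `['the', 'cat', 'sat']`.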
5. TTS Specialization
The TTS branch eliminates the encoder-adapter module, relying solely on the LLM backbone. Audio tokens are treated as a new "language," reformulating synthesis as a pure next-token prediction task.
Training Pipeline:
- SFT: A two-stage process for controllability.
  - Large-scale zero-shot TTS training with global instruction supervision (speaker, style, prosody).
  - Training on high-quality in-house data with both global and inline instructions for fine-grained segment-level control.
- Reinforcement Learning: RLHF is applied to align generated audio with textual descriptions based on human preferences. A Generative Reward Model (GRM) is trained to evaluate a candidate response $y$ against a reference $y_{\text{ref}}$ under prompt $x$. The reward for policy optimization is $r(x, y) = f(\mathrm{GRM}(x, y, y_{\text{ref}}))$, where $f$ is a reward-shaping transformation.
SFT Data: Combines model-synthesized data (using Step-Audio-EditX [39]) for global control and recorded speech data annotated with natural-language global and inline descriptions for hierarchical expressive modeling.
6. Realtime Specialization
The Realtime branch inherits the core foundation architecture without modification but targets multi-turn spoken interaction under latency constraints.
Training Pipeline:
- Audio-Centric Mid-Training: Inherited from foundation, provides baseline audio-grounded perception.
- Progressive SFT: A curriculum injecting interactive capabilities across three dimensions:
  - Conversational Alignment: For multi-turn continuity and spoken-language artifacts.
  - Persona and Stylistic Control: Using scalable, persona-conditioned data from a "persona matrix."
  - Paralinguistic Sensitivity: Training to recognize and respond to non-verbal cues (hesitation, laughter, etc.).
  A dynamic rehearsal schedule interleaves interaction data with general-purpose tasks to prevent catastrophic forgetting.
- RLHF with Generated Rewards: Uses a PPO-style objective [40] with KL regularization. A generative reward model scores candidates, utilizing both preference comparisons and explicit interaction rubrics for aspects like coherence and faithfulness.
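A PPO-style objective with KL regularization can be sketched as below. The summary does not give the exact form used, so everything specific here is an assumption: the clipped surrogate, the per-sample log-ratio estimate of the KL term against a reference policy, and the `clip_eps` and `kl_coef` defaults are standard textbook choices, and the function name is hypothetical.

```python
import math
from typing import Sequence


def ppo_kl_objective(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    logp_ref: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,   # assumed clipping range
    kl_coef: float = 0.05,   # assumed KL penalty weight
) -> float:
    """Mean clipped PPO surrogate minus a KL penalty (to be maximized)."""
    n = len(advantages)
    if n == 0 or not (len(logp_new) == len(logp_old) == len(logp_ref) == n):
        raise ValueError("all inputs must be non-empty and of equal length")
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)  # pi_new / pi_old for the sampled response
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * adv, clipped * adv)
        kl_estimate = ln - lr      # single-sample estimate of KL(new || ref)
        total += surrogate - kl_coef * kl_estimate
    return total / n


if __name__ == "__main__":
    print(ppo_kl_objective([-1.0, -2.0], [-1.2, -1.9], [-1.1, -2.0], [0.5, -0.3]))
```

The clipping keeps a single update from moving the policy too far from the one that generated the samples, while the KL term discourages drift away from the reference model, which is one common way to limit reward hacking against a learned reward model.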
Data: Organized into three streams: natural multi-turn dialogues, scaled persona-conditioned dialogues, and paralinguistic cue-labeled dialogues, interleaved with general-capability data.
Empirical Validation / Results
4.3 ASR Evaluation
StepAudio 2.5 ASR was evaluated against baselines (VibeVoice-ASR, FunASR-Nano, Doubao-ASR-2603, Qwen3-ASR-1.7B) on Chinese, English, and long-form benchmarks.
Table 1: ASR results on Chinese, English, and long-form benchmarks (Error Rate, %). Lower is better. The second-best results are underlined.
| Category | Test Set | VibeVoice-ASR | FunASR-Nano | Doubao-ASR-2603 | Qwen3-ASR-1.7B | StepAudio 2.5 ASR | StepAudio 2.5 ASR w/o MTP |
|---|---|---|---|---|---|---|---|
| Chinese | AISHELL-1 | 5.19 | 1.88 | 2.07 | 1.49 | 0.71 | 0.79 |
| | AISHELL-2 ios | 5.10 | 2.61 | 2.70 | 2.50 | 2.29 | 2.30 |
| | WenetSpeech testnet | 14.79 | 5.30 | 4.03 | 4.44 | 4.54 | 4.57 |
| | WenetSpeech testmeeting | 17.09 | 5.31 | 5.09 | 4.66 | 4.70 | 4.73 |
| | FLEURS zh | 8.77 | 3.19 | 2.83 | 2.74 | 2.63 | 2.63 |
| | Average | 10.19 | 3.66 | 3.34 | 3.17 | 2.97 | 3.00 |
| English | LibriSpeech clean | 2.30 | 1.80 | 2.94 | 1.69 | 1.38 | 1.40 |
| | LibriSpeech other | 5.79 | 4.43 | 5.98 | 3.57 | 3.16 | 3.14 |
| | Common Voice v11 en | 20.03 | 11.05 | 14.06 | 7.50 | 7.57 | 7.62 |
| | FLEURS en | 5.20 | 4.96 | 6.74 | 3.23 | 3.55 | 3.74 |
| | VoxPopuli cleaned AA | 2.38 | 3.97 | 3.61 | 3.28 | 2.76 | 3.23 |
| | Average | 7.14 | 5.24 | 6.67 | 3.85 | 3.68 | 3.83 |
| Long-form | LibriSpeech clean long | 1.66 | 2.34 | 2.81 | 1.95 | 1.27 | 1.27 |
| | LibriSpeech other long | 3.48 | 4.89 | 5.59 | 3.81 | 2.90 | 2.81 |
| | WenetSpeech testnet long | 8.73 | 4.74 | 3.72 | 4.15 | 4.09 | 4.09 |
| | Earnings22 cleaned AA | 5.62 | 10.38 | 12.33 | 6.90 | 6.52 | 6.34 |
| | Average | 4.87 | 5.59 | 6.11 | 4.20 | 3.70 | 3.63 |
Key Findings:
- StepAudio 2.5 ASR achieves state-of-the-art average error rates, lower than every baseline in each category: 2.97% (Chinese), 3.68% (English), and 3.70% (long-form).
- The addition of MTP-5 leaves accuracy largely unchanged: in Table 1 the category averages with and without MTP differ by at most 0.15 points (3.68 vs. 3.83 on English), supporting the stability of the verification mechanism.
Decoding Efficiency:
Table 2: RTF comparison.
| Model | VibeVoice-ASR | FunASR-Nano | Doubao-ASR-2603 | Qwen3-ASR-1.7B | StepAudio 2.5 ASR |
|---|---|---|---|---|---|
| RTF | 0.1039 | 0.0591 | 0.0640 | 0.0094 | 0.0053 |
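StepAudio 2.5 ASR achieves a Real-Time Factor (RTF) of 0.0053, about 1.8 times faster than the next-fastest baseline (Qwen3-ASR-1.7B, 0.0094) and roughly 20 times faster than the slowest (VibeVoice-ASR, 0.1039), a gain the report attributes to MTP decoding.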
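RTF is the ratio of processing time to audio duration, so lower is faster. The short sketch below recomputes relative speedups from the Table 2 values; the helper names are hypothetical and the timing in the usage example is made up to illustrate the definition.

```python
from typing import Dict

# RTF values copied from Table 2.
RTF_TABLE: Dict[str, float] = {
    "VibeVoice-ASR": 0.1039,
    "FunASR-Nano": 0.0591,
    "Doubao-ASR-2603": 0.0640,
    "Qwen3-ASR-1.7B": 0.0094,
    "StepAudio 2.5 ASR": 0.0053,
}


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def speedup_over(baseline: str, model: str = "StepAudio 2.5 ASR") -> float:
    """How many times faster `model` is than `baseline`, from their RTFs."""
    return RTF_TABLE[baseline] / RTF_TABLE[model]


if __name__ == "__main__":
    # Illustrative timing: 19.08 s of compute for one hour of audio.
    print(real_time_factor(19.08, 3600.0))
    for name in RTF_TABLE:
        if name != "StepAudio 2.5 ASR":
            print(f"{name}: {speedup_over(name):.1f}x")
```

At an RTF of 0.0053, one hour of audio takes about 19 seconds to transcribe.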
MTP Acceptance Behavior:
Table 3: Strict per-position MTP acceptance rate and average accepted length.
| Config | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | Avg. Length |
|---|---|---|---|---|---|---|---|---|
| MTP 3 | 0.96 | 0.88 | 0.80 | – | – | – | – | 3.6 / 4 |
| MTP 5 | 0.95 | 0.88 | 0.80 | 0.71 | 0.64 | – | – | 5.0 / 6 |
| MTP 7 | 0.96 | 0.88 | 0.80 | 0.72 | 0.65 | 0.59 | 0.53 | 6.1 / 8 |
Acceptance rates decay by a factor of roughly 0.9 per additional branch. MTP-5 was chosen as the best efficiency-complexity trade-off: it provides a 39% gain in average accepted length over MTP-3 (5.0 vs. 3.6 tokens), while MTP-7 adds only about 22% more (6.1 vs. 5.0 tokens) at the cost of two extra branches.
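The average lengths in Table 3 can be reproduced from the per-position rates. The sketch below assumes that each rate is the probability that the given position is accepted at all (that is, that every earlier proposal was accepted too), so the expected length is one main-branch token plus the sum of the rates. This reading is an assumption, but it matches the tabulated averages; the function names are hypothetical.

```python
from typing import Dict, List, Sequence

# Per-position acceptance rates copied from Table 3.
ACCEPTANCE: Dict[str, List[float]] = {
    "MTP 3": [0.96, 0.88, 0.80],
    "MTP 5": [0.95, 0.88, 0.80, 0.71, 0.64],
    "MTP 7": [0.96, 0.88, 0.80, 0.72, 0.65, 0.59, 0.53],
}


def expected_accepted_length(rates: Sequence[float]) -> float:
    """One always-emitted main-branch token plus each position's acceptance rate."""
    return 1.0 + sum(rates)


def step_decay(rates: Sequence[float]) -> List[float]:
    """Ratio between consecutive acceptance rates."""
    return [later / earlier for earlier, later in zip(rates, rates[1:])]


def relative_gain(new_length: float, old_length: float) -> float:
    """Fractional increase of new_length over old_length."""
    return new_length / old_length - 1.0


if __name__ == "__main__":
    for config, rates in ACCEPTANCE.items():
        print(config, round(expected_accepted_length(rates), 1))
    print([round(r, 2) for r in step_decay(ACCEPTANCE["MTP 5"])])
    print(f"{relative_gain(5.0, 3.6):.0%}")
```

Rounded to one decimal place, the computed lengths are 3.6, 5.0, and 6.1, and the 39% figure follows from the rounded table values (5.0 / 3.6).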
5.3 TTS Evaluation
Due to limitations of traditional metrics, evaluation used an arena-style pairwise framework with human judges. StepAudio 2.5-TTS was compared against three strong baselines: MiniMax-2.8-HD, Elevenlabs-v3, and Gemini-3.1-Flash-TTS on 774 prompts.
Result: StepAudio 2.5-TTS achieved an overall win rate of 67.6% in pairwise evaluations, demonstrating superior controllable generation capabilities.
6.3 Realtime Evaluation
The Realtime branch was evaluated in interactive settings across five suites that combine subjective human evaluation (via a mobile app) with objective API-based tests.
Figure 5: Realtime interaction evaluation. Higher is better. Best results are in bold. (The report indicates StepAudio 2.5 Realtime consistently outperformed competitive baselines across all suites, with a +10.0 margin on subjective human evaluation and a +16.6 margin on the Step-SPQA audio QA benchmark.)
Key Findings:
- Validates the efficacy of persona and naturalness conditioning.
- Demonstrates that paralinguistic conditioning enhances acoustic comprehension without degrading general reasoning.
- Confirms the effectiveness of the rehearsal schedule in balancing specialized interaction training with foundational capabilities.
Theoretical and Practical Implications
- Unified Modeling Viability: StepAudio 2.5 shows that a single audio-language foundation can internalize the distinct objectives of speech understanding, generation, and live interaction, challenging the need for completely separate specialized systems.
- Grounding Enables Efficiency: The ASR results demonstrate that grounded generation tasks (where an external modality like audio reduces semantic branching) can be accelerated more aggressively than free-form text generation, using techniques like verifiable multi-token decoding.
- RLHF for Speech: The work advances the application of RLHF from text to complex audio generation and interaction tasks, showing its utility in aligning models with nuanced human preferences for expressivity, naturalness, and conversational quality.
- Data and Regime over Architecture: The core thesis is validated: with a well-shaped shared representational space, specialization is effectively achieved through tailored data, objectives (RLHF), and decoding constraints, not through architectural changes.
Conclusion
StepAudio 2.5 is a unified audio-language foundation model whose recognition (ASR), synthesis (TTS), and realtime interaction abilities emerge through different optimization and deployment regimes applied to a shared backbone. The model achieves state-of-the-art results across all three capabilities, demonstrating that the gap between unified and specialized speech systems can be closed. The shared foundation is learned through a staged multimodal curriculum and then specialized via:
- ASR: Verifiable multi-token decoding for efficiency.
- TTS: Semantic-to-audio alignment strengthened by context-rich supervision and RLHF.
- Realtime: Low-latency dialogue with persona stability and paralinguistic sensitivity via progressive SFT and RLHF.
The model is best understood not as a collection of isolated endpoints, but as a coherent foundation whose diverse capabilities are unlocked through targeted operational regimes.