Summary (Overview)
- New paradigm: Proposes the Large Audio Interaction Model (LAIM) as a unified, always-on streaming audio language model that replaces isolated offline LALMs and task-specific streaming models.
- SoundFlow framework: An end-to-end pipeline covering streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference, enabling a perceive–decide–respond loop.
- StreamAudio-2M dataset: A 2.6M-item, 302k-hour streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, with sparse, context-dependent response cues.
- Proactive-Sound-Bench: A new benchmark for evaluating proactive audio intervention with 644 human-designed events across 6 categories.
- Competitive performance: Audio-Interaction matches or surpasses state-of-the-art on 8 benchmarks (e.g., MMAU 58.15 under audio instructions), while unlocking streaming capabilities like real-time ASR, audio instruction following, and proactive assistance that offline LALMs cannot achieve.
Introduction and Theoretical Foundation
Background
Audio is inherently a continuous, real-time modality. Humans perceive sound moment-by-moment and decide when to react. However, current Large Audio Language Models (LALMs) operate offline: they process complete audio clips and produce a single response. Streaming models exist but are siloed into narrow tasks (e.g., streaming ASR, voice dialogue), each requiring a separate model trained from scratch.
Motivation
The paper identifies two fundamental challenges in moving to an always-on interactive regime:
- (C1) Comprehension-grounded response triggering: An interactive model must decide whether to respond or remain silent based on semantic understanding of the unfolding context, not surface-level acoustic cues. Supervision for this decision is sparse and temporally ambiguous.
- (C2) Real-time context continuity under chunked inference: Chunking audio breaks temporal continuity; the model must reconstruct context across chunks without inflating the inference window or stalling.
The authors formalize a new paradigm: Large Audio Interaction Models (LAIMs), where audio is consumed chunk-by-chunk, and at each step $t$ the model outputs both a decision token $d_t$ and a response $r_t$:
$$(d_t, r_t) = f_\theta(a_t \mid a_{<t}),$$
where $a_t$ is the current audio chunk, $d_t$ the streaming intervention decision, and $r_t$ the generated response.
This loop unifies traditional tasks (ASR, translation, dialogue) and streaming-native capabilities (simultaneous interpretation, proactive intervention) within a single model.
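A minimal sketch of the perceive–decide–respond loop may help make the formulation concrete. It is not the authors' implementation: the `interaction_loop` and `toy_policy` names, the `SILENT`/`RESPOND` labels, and the use of strings as stand-ins for audio chunks are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

SILENT = "silent"      # hypothetical label for "keep listening"
RESPOND = "response"   # hypothetical label for "start generating"

# A policy maps the chunks heard so far to (decision, response_text).
Policy = Callable[[List[str]], Tuple[str, str]]


def interaction_loop(chunks: Iterable[str], policy: Policy) -> Iterator[Tuple[int, str]]:
    """Consume chunks one at a time; yield (step, response) whenever the policy speaks."""
    history: List[str] = []
    for step, chunk in enumerate(chunks):
        history.append(chunk)                 # perceive
        decision, response = policy(history)  # decide
        if decision == RESPOND:               # respond
            yield step, response


def toy_policy(history: List[str]) -> Tuple[str, str]:
    """Toy stand-in for the model: respond only when the latest chunk is an alarm."""
    if history[-1] == "alarm":
        return RESPOND, f"Alarm heard after {len(history)} chunks."
    return SILENT, ""


if __name__ == "__main__":
    stream = ["speech", "noise", "alarm", "speech"]
    for step, text in interaction_loop(stream, toy_policy):
        print(step, text)  # prints: 2 Alarm heard after 3 chunks.
```

The point of the sketch is that silence is an explicit output of every step, so a task such as ASR and a capability such as proactive intervention differ only in when the policy chooses to respond.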
Methodology
1. Streaming Data Construction: SoundFlow
- Time-Frequency Joint Preprocessing (TFJP): An iterative algorithm that clips excessive silence, estimates background noise, locates core informative spans, and smooths boundaries with half-chunk alignment to the audio chunk length and short-window spectral smoothing. This ensures seamless stitching of short clips into natural long-form recordings.
- Hierarchical Audio Event Selection: Instead of random concatenation, a multi-step pipeline uses an LLM to plan a scenario, refine it into a sequence of concrete audio events, and ground each event by retrieval from a database or generation via an audio model. This maintains semantic coherence and environmental plausibility across multi-turn streams.
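Two of the TFJP steps, silence clipping and half-chunk alignment, can be illustrated with a small sketch. The summary describes the algorithm only at a high level, so everything here is an assumption: the function names, the amplitude-threshold definition of silence, and the reading of "half-chunk alignment" as zero-padding a clip to a multiple of half a chunk. Noise estimation and spectral smoothing are not modeled.

```python
from typing import List


def clip_silence(samples: List[float], threshold: float, max_run: int) -> List[float]:
    """Keep at most `max_run` consecutive samples whose magnitude is below `threshold`."""
    out: List[float] = []
    run = 0
    for s in samples:
        if abs(s) < threshold:
            run += 1
            if run > max_run:
                continue  # drop silence beyond the allowed run length
        else:
            run = 0
        out.append(s)
    return out


def pad_to_half_chunk(samples: List[float], chunk_len: int) -> List[float]:
    """Zero-pad so the length is a multiple of half a chunk (assumes chunk_len >= 2)."""
    half = chunk_len // 2
    remainder = len(samples) % half
    if remainder == 0:
        return list(samples)
    return list(samples) + [0.0] * (half - remainder)


if __name__ == "__main__":
    clip = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.3]
    trimmed = clip_silence(clip, threshold=0.1, max_run=2)
    aligned = pad_to_half_chunk(trimmed, chunk_len=4)
    print(trimmed)  # [0.5, 0.0, 0.0, 0.4, 0.3]
    print(aligned)  # [0.5, 0.0, 0.0, 0.4, 0.3, 0.0]
```

Aligning clip boundaries to a fixed grid is one plausible way to keep stitched clips from splitting a chunk at an arbitrary offset; the actual procedure in the paper may differ.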
2. Streaming Training
- Chunk-wise processing: Audio is consumed in fixed-length chunks (400 ms by default). At each step, the model predicts a special token <Spe_token> that signals either silent or response. If silent, it continues listening; if response, it switches to autoregressive generation.
- Context Memory and Comprehension-Aware Silence Training: Two failure modes are addressed:
- Context forgetting → history review training: insert questions about preceding content at later positions.
- False triggering → comprehension-aware silence training: include a large amount of silent audio (verified by agents in Proactive-Sound-Bench) that warrants no response.
- Dual-loss Training: A dedicated streaming loss is added to the standard language modeling loss: $$\mathcal{L} = -\sum_t \log p(y_t \mid c_t) - \lambda \sum_t \log p(s_t \mid c_t),$$ where $y_t$ is the target text token, $s_t$ the streaming control token, $c_t$ the decoding context, and $\lambda$ the weighting factor (set to 1.0 after ablation).
- Four-stage training pipeline: (1) Format training (offline data, teach <Spe_token>), (2) Adapter training, (3) Large-scale streaming supervised training, (4) Instruction-following fine-tuning with interleaved streaming sequences.
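The dual-loss objective described above can be sketched as a weighted sum of two negative log-likelihood terms, one over text tokens and one over streaming control tokens. This is an illustrative sketch, not the authors' training code: the function names are hypothetical, the inputs are the probabilities the model assigned to the correct targets, and averaging (rather than summing) each term is an assumption.

```python
import math
from typing import Sequence


def nll(probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the probabilities assigned to the correct targets."""
    return -sum(math.log(p) for p in probs) / len(probs)


def dual_loss(
    text_probs: Sequence[float],
    control_probs: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Language modeling loss plus a weighted streaming-control loss."""
    return nll(text_probs) + weight * nll(control_probs)


if __name__ == "__main__":
    # Two text tokens predicted with p=0.5 each, one control token with p=0.25.
    loss = dual_loss([0.5, 0.5], [0.25], weight=1.0)
    print(round(loss, 4))  # 2.0794, i.e. log(2) + log(4)
```

Separating the control term makes the sparse silent/response decision visible to the optimizer even when text tokens vastly outnumber control tokens in a sequence.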
3. Asynchronous Inference via FIFO Scheduling
To avoid stalling due to encoder-decoder synchronization, the encoder continuously appends acoustic features to a temporal first-in, first-out (FIFO) queue. Decoding is triggered conditionally on the last generated token: depending on that token, the model either consumes queued features or waits for more audio. In the ablation, this cuts average first-chunk latency from 831 ms to 392 ms and lowers the stall rate from 5.2% to 0.0%.
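The queueing idea can be illustrated with a single-threaded simulation. It is a sketch under stated assumptions, not the paper's scheduler: the `FifoScheduler` class and its methods are hypothetical, features are represented as strings, and the token-based trigger condition is not modeled. In a real system the encoder and decoder would run concurrently.

```python
from collections import deque
from typing import Callable, Deque, List


class FifoScheduler:
    """Decouples feature production (encoder) from feature consumption (decoder)."""

    def __init__(self, decode_step: Callable[[str], str]) -> None:
        self.queue: Deque[str] = deque()
        self.decode_step = decode_step
        self.outputs: List[str] = []

    def push(self, feature: str) -> None:
        """Encoder side: append a feature without waiting for the decoder."""
        self.queue.append(feature)

    def drain(self) -> int:
        """Decoder side: consume queued features in arrival order; return the count."""
        consumed = 0
        while self.queue:
            feature = self.queue.popleft()
            self.outputs.append(self.decode_step(feature))
            consumed += 1
        return consumed


if __name__ == "__main__":
    scheduler = FifoScheduler(decode_step=str.upper)
    scheduler.push("chunk-a")   # features arrive while the decoder is busy
    scheduler.push("chunk-b")
    print(scheduler.drain())    # 2
    print(scheduler.outputs)    # ['CHUNK-A', 'CHUNK-B']
    print(scheduler.drain())    # 0, nothing queued, so the decoder waits
```

Because `push` never blocks, incoming audio is buffered rather than dropped or delayed while the decoder is generating, which is the property the stall-rate figure measures.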
Empirical Validation / Results
Benchmarks and Baselines
Audio-Interaction (initialized from Qwen2.5-Omni-3B) is evaluated on 8 benchmarks: MMAU (general audio understanding), four spoken-dialogue benchmarks, LibriSpeech (ASR), CoVoST2 (speech translation), and Proactive-Sound-Bench.
Main Results
Enhancement 1: Retained audio understanding under streaming training.
| Model | Size | Stream. | Multi-turn | Text instruction | Audio instruction (Avg.) |
|---|---|---|---|---|---|
| Qwen2.5-Omni-3B | 3B | ✗ | ✓ | 57.81 | 42.51 |
| Audio-Interaction | 3B | ✓ | ✓ | 55.68 | 58.15 |
Under audio instructions, Audio-Interaction reaches 58.15, outperforming its offline initialization (42.51) and competing with larger 7B models.
Enhancement 2: Competitive performance on core speech tasks.
| Model | Size | ASR WER (LibriSpeech clean, lower is better) | S2TT en-zh (BLEU) | S2TT zh-en (BLEU) |
|---|---|---|---|---|
| Qwen2.5-Omni-3B | 3B | 2.87 | 39.50 | 18.17 |
| Audio-Interaction | 3B | 3.17 | 55.22 | 35.21 |
Audio-Interaction improves translation BLEU by +15.72/+17.04 over its initialization, with only marginal ASR regression (the cost of switching from utterance-level to chunk-wise decoding).
Enhancement 3: Unlocked streaming capabilities (Proactive-Sound-Bench, by event category).
| Category | Single event | Multiple events |
|---|---|---|
| Human | 56.4 | 64.9 |
| Daily | 68.1 | 65.8 |
| Equipment | 57.1 | 55.7 |
| Traffic | 64.9 | 69.0 |
| Nature | 61.8 | 61.8 |
| Music | 66.7 | 60.0 |
| Average | 61.2 | 62.8 |
Audio-Interaction achieves balanced proactive triggering across categories, while offline baselines collapse under multiple concatenated events.
Additional Analysis
- Continuity ratio: The encoder output has low cross-chunk continuity (0.25), but Layer 0 of the decoder lifts it to 0.80, showing that context continuity is reconstructed via cross-chunk KV-cache access at the earliest decoder layer.
- Attention head specialization: A single head (L35H14) dominates the streaming-control-token decision across all tasks, indicating a narrow, task-independent pathway learned through the streaming objective.
Ablation Studies
| Ablation | Metric | Value |
|---|---|---|
| w/o FIFO scheduling | Avg. first-chunk latency / Stall rate | 831 ms / 5.2% vs. 392 ms / 0.0% |
| w/o TFJP preprocessing | Trigger accuracy | 85.35% vs. 92.42% (with streaming SFT) |
| w/o hierarchical event selection | Trigger accuracy | 88.51% vs. 92.42% |
| Full Audio-Interaction | Trigger accuracy | 96.77% |
| Chunk size 0.2s vs. 0.4s vs. 0.8s | MMAU / Latency | 49.74/258ms vs. 58.15/392ms vs. 59.13/786ms |
| Dual-loss weight (two settings compared) | MMAU / Trigger Acc. | 58.2/96.7 vs. 57.3/96.9 |
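The latency and chunk-size rows lend themselves to a quick arithmetic check. The sketch below only recomputes quantities from the numbers in the table; the function names are hypothetical.

```python
from typing import List, Tuple

# (chunk size in seconds, MMAU score, average first-chunk latency in ms),
# copied from the ablation table.
CHUNK_ABLATION: List[Tuple[float, float, float]] = [
    (0.2, 49.74, 258.0),
    (0.4, 58.15, 392.0),
    (0.8, 59.13, 786.0),
]


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


def marginal_gains(rows: List[Tuple[float, float, float]]) -> List[Tuple[float, float]]:
    """For each consecutive pair of settings, return (MMAU gain, added latency in ms)."""
    return [
        (round(b[1] - a[1], 2), round(b[2] - a[2], 1))
        for a, b in zip(rows, rows[1:])
    ]


if __name__ == "__main__":
    # FIFO scheduling: 831 ms without, 392 ms with.
    print(f"{relative_reduction(831.0, 392.0):.1%}")  # 52.8%
    print(marginal_gains(CHUNK_ABLATION))  # [(8.41, 134.0), (0.98, 394.0)]
```

Going from 0.2 s to 0.4 s chunks buys 8.41 MMAU points for 134 ms of extra latency, while going from 0.4 s to 0.8 s buys only 0.98 points for 394 ms, which is consistent with 0.4 s being chosen as the default.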
Theoretical and Practical Implications
- Unified framework: The perceive–decide–respond loop provides a principled formulation for streaming audio interaction, subsuming traditional offline tasks and streaming-native capabilities under one model. This eliminates the need for separate models per task.
- Comprehension-grounded triggering: The work shows that effective response triggering requires semantic understanding of the stream, not just acoustic cues. The hierarchical event selection and TFJP preprocessing are essential for training such a model.
- Practical deployment: The FIFO-scheduled asynchronous inference design reduces latency and eliminates stalling, making real-time deployment feasible. The 400 ms chunk size balances accuracy and responsiveness.
- Dataset and benchmark: StreamAudio-2M and Proactive-Sound-Bench fill a critical gap, providing the first large-scale resources for training and evaluating streaming audio interaction models, including proactive assistance.
- Scaling insights: The attention analysis reveals that the streaming decision is concentrated in a single head, suggesting that the training objective effectively routes this capability through a narrow, task-independent pathway. This may guide future architectural design for streaming models.
Conclusion
- Main contribution: Formalized the Large Audio Interaction Model (LAIM) paradigm and introduced Audio-Interaction, a unified streaming model that performs always-on audio interaction via a perceive–decide–respond loop.
- SoundFlow framework: End-to-end support from streaming data construction (TFJP + hierarchical event selection), through comprehension-aware training (history review, silence training, dual-loss), to asynchronous inference (FIFO scheduling).
- Resources: StreamAudio-2M (2.6M items, 302k hours, 7 abilities, 28 sub-tasks) and Proactive-Sound-Bench (644 events) are released to the community.
- Results: Competitive performance on 8 benchmarks while unlocking capabilities inaccessible to offline LALMs: real-time ASR, streaming audio instruction following, and proactive help.
- Future directions: The authors hope the LAIM formulation, SoundFlow framework, and released resources can serve as a foundation for future research on unified streaming audio intelligence, with potential extensions to lower latency, larger model scales, and more complex multi-modal streaming scenarios.