Summary (Overview)

  • New paradigm: Proposes the Large Audio Interaction Model (LAIM), a unified, always-on streaming audio language model that replaces isolated offline LALMs and task-specific streaming models.
  • SoundFlow framework: An end-to-end pipeline covering streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference, enabling a perceive–decide–respond loop.
  • StreamAudio-2M dataset: A 2.6M-item, 302k-hour streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, with sparse, context-dependent response cues.
  • Proactive-Sound-Bench: A new benchmark for evaluating proactive audio intervention with 644 human-designed events across 6 categories.
  • Competitive performance: The resulting model, Audio-Interaction, matches or surpasses the state of the art on 8 benchmarks (e.g., MMAU 58.15 under audio instructions), while unlocking streaming capabilities that offline LALMs cannot provide: real-time ASR, audio instruction following, and proactive assistance.

Introduction and Theoretical Foundation

Background

Audio is inherently a continuous, real-time modality. Humans perceive sound moment-by-moment and decide when to react. However, current Large Audio Language Models (LALMs) operate offline: they process complete audio clips and produce a single response. Streaming models exist but are siloed into narrow tasks (e.g., streaming ASR, voice dialogue), each requiring a separate model trained from scratch.

Motivation

The paper identifies two fundamental challenges in moving to an always-on interactive regime:

  • (C1) Comprehension-grounded response triggering: An interactive model must decide whether to respond or remain silent based on semantic understanding of the unfolding context, not surface-level acoustic cues. Supervision for this decision is sparse and temporally ambiguous.
  • (C2) Real-time context continuity under chunked inference: Chunking audio breaks temporal continuity; the model must reconstruct context across chunks without inflating the inference window or stalling.

The authors formalize a new paradigm: Large Audio Interaction Models (LAIMs), where audio is consumed chunk-by-chunk, and at each step the model outputs both a decision token $d_t \in \{\text{<silent>}, \text{<response>}\}$ and a response $r_t$:

$$(d_t, r_t) = f(a_{\le t}, d_{<t}, r_{<t})$$

where $a_t$ is the current audio chunk, $d_t$ the streaming intervention decision, and $r_t$ the generated response.

This loop unifies traditional tasks (ASR, translation, dialogue) and streaming-native capabilities (simultaneous interpretation, proactive intervention) within a single model.
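
As an illustration of the perceive–decide–respond loop, the following is a minimal sketch, not the authors' implementation. The function `run_interaction_loop`, the `StepFn` signature, and the toy keyword-based `toy_step` are hypothetical stand-ins for the model f; the only parts taken from the summary are the two decision tokens and the dependence of each step on all audio so far plus past decisions and responses.

```python
from typing import Callable, List, Sequence, Tuple

SILENT = "<silent>"
RESPONSE = "<response>"

# Hypothetical signature for f: (audio so far, past decisions, past responses) -> (d_t, r_t)
StepFn = Callable[[Sequence[bytes], Sequence[str], Sequence[str]], Tuple[str, str]]


def run_interaction_loop(chunks: Sequence[bytes], step: StepFn) -> Tuple[List[str], List[str]]:
    """Feed audio chunk by chunk and record one (decision, response) pair per chunk."""
    audio_so_far: List[bytes] = []
    decisions: List[str] = []
    responses: List[str] = []
    for chunk in chunks:
        audio_so_far.append(chunk)                            # perceive
        d_t, r_t = step(audio_so_far, decisions, responses)   # decide
        if d_t == SILENT:
            r_t = ""                                          # silence carries no response text
        decisions.append(d_t)
        responses.append(r_t)                                 # respond (possibly empty)
    return decisions, responses


def toy_step(audio: Sequence[bytes], decisions: Sequence[str], responses: Sequence[str]) -> Tuple[str, str]:
    """Toy stand-in for the model: respond only when the newest chunk contains b'alarm'."""
    if b"alarm" in audio[-1]:
        return RESPONSE, f"Heard an alarm at chunk {len(audio) - 1}."
    return SILENT, ""


decisions, responses = run_interaction_loop([b"hum", b"alarm!", b"hum"], toy_step)
```

In this toy run the loop stays silent on the first and third chunks and responds on the second, which mirrors the sparse, context-dependent triggering the paradigm asks for.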


Methodology

1. Streaming Data Construction: SoundFlow

  • Time-Frequency Joint Preprocessing (TFJP): An iterative algorithm that clips excessive silence, estimates background noise, locates core informative spans, and smooths boundaries with half-chunk alignment ($\delta = \frac{1}{2}$ of the audio chunk) and short-window spectral smoothing ($\omega$). This ensures seamless stitching of short clips into natural long-form recordings.
  • Hierarchical Audio Event Selection: Instead of random concatenation, a multi-step pipeline uses an LLM to plan a scenario, refine it into a sequence of concrete audio events, and ground each event by retrieval from a database or generation via an audio model. This maintains semantic coherence and environmental plausibility across multi-turn streams.
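
The summary does not say how the half-chunk alignment in TFJP is computed. One possible reading, sketched below under stated assumptions, is that the boundaries of an informative span are snapped outward to a grid of half-chunk steps so that stitched clips start and end on predictable positions. The function name `snap_span_to_half_chunk`, the 16 kHz sample rate, and the outward-snapping rule are all assumptions made for illustration; only the 400 ms chunk length and the half-chunk step come from the text.

```python
from typing import Tuple


def snap_span_to_half_chunk(
    start: int,
    end: int,
    total: int,
    sample_rate: int = 16000,  # assumed sample rate, not stated in the summary
    chunk_ms: int = 400,       # default chunk length from the summary
) -> Tuple[int, int]:
    """Snap a sample span [start, end) outward to the half-chunk grid, clamped to the clip."""
    half = sample_rate * chunk_ms // 1000 // 2   # samples per half chunk (3200 at 16 kHz)
    snapped_start = (start // half) * half       # round the start down to the grid
    snapped_end = -(-end // half) * half         # round the end up to the grid
    return snapped_start, min(total, snapped_end)


span = snap_span_to_half_chunk(5000, 9000, total=20000)
```

With these assumed settings a span of samples 5000 to 9000 widens to 3200 to 9600, that is, from the first to the third half-chunk boundary.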

2. Streaming Training

  • Chunk-wise processing: Audio is consumed in fixed-length chunks (400 ms by default). At each step, the model predicts a special token $d_t \in \{\text{<silent>}, \text{<response>}\}$. If silent, it continues listening; if response, it switches to autoregressive generation.
  • Context Memory and Comprehension-Aware Silence Training: Two failure modes are addressed:
    • Context forgetting → history review training: insert questions about preceding content at later positions.
    • False triggering → comprehension-aware silence training: include a large amount of silent audio (verified by agents in Proactive-Sound-Bench) that warrants no response.
  • Dual-loss Training: A dedicated streaming loss is added to the standard language modeling loss:
    $$\mathcal{L} = \frac{1}{N}\sum_{j=1}^{N}\left( \underbrace{-\log P_\theta(t_j \mid H_j)}_{\mathcal{L}_{\text{LM}}} + \lambda \, \underbrace{\bigl(-\log P_\theta(s_j \mid H_j)\bigr)}_{\mathcal{L}_{\text{stream}}} \right)$$
    where $t_j$ is the target text token, $s_j$ the streaming control token, $H_j$ the decoding context, and $\lambda$ the weighting factor (set to 1.0 after ablation).
  • Four-stage training pipeline: (1) Format training (offline data, teach <Spe_token>), (2) Adapter training, (3) Large-scale streaming supervised training, (4) Instruction-following fine-tuning with interleaved streaming sequences.
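
To make the dual-loss bullet concrete, here is a small numeric sketch. It assumes the model's probabilities for the correct text token and the correct streaming control token at each position are already available as plain floats; the function name `dual_loss` and its arguments are hypothetical, and a real training setup would compute the same quantity from logits inside a deep-learning framework.

```python
import math
from typing import Sequence


def dual_loss(p_text: Sequence[float], p_stream: Sequence[float], lam: float = 1.0) -> float:
    """Average over N positions of: -log P(t_j | H_j) + lam * (-log P(s_j | H_j)).

    p_text[j]   : probability assigned to the target text token at position j
    p_stream[j] : probability assigned to the streaming control token at position j
    lam         : weight on the streaming term (1.0 is the value reported in the summary)
    """
    if len(p_text) != len(p_stream) or not p_text:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(p_text)
    lm_term = sum(-math.log(p) for p in p_text)        # language modeling loss, summed
    stream_term = sum(-math.log(p) for p in p_stream)  # streaming loss, summed
    return (lm_term + lam * stream_term) / n


loss_default = dual_loss([0.5], [0.5])            # lam = 1.0
loss_heavier = dual_loss([0.5], [0.5], lam=2.0)   # streaming term weighted twice
```

Raising lam only scales the streaming term, so with identical probabilities the second call returns 1.5 times the first.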

3. Asynchronous Inference via FIFO Scheduling

To avoid stalling due to encoder-decoder synchronization, the encoder continuously appends acoustic features to a temporal queue. Decoding is triggered conditionally based on the last generated token: if $r_{t-1} \in \{\text{<eos>}, \text{<silent>}\}$, the model consumes queued features; otherwise, it waits for more audio. This reduces first-frame latency by $4.5\times$ and eliminates stalls.
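
The queueing rule can be sketched with a standard FIFO. This is a simplified, single-threaded illustration under stated assumptions, not the authors' scheduler: the class `FifoScheduler` and its method names are hypothetical, and the handling of the non-consuming case (features simply stay queued while a response is still being generated) is one reading of the description.

```python
from collections import deque
from typing import Deque, List

EOS = "<eos>"
SILENT = "<silent>"


class FifoScheduler:
    """Toy FIFO between an always-running encoder and a conditionally triggered decoder."""

    def __init__(self) -> None:
        self.queue: Deque[str] = deque()
        self.last_token: str = SILENT  # assumption: the decoder starts in the idle state

    def push_features(self, features: str) -> None:
        """Encoder side: append features without ever waiting on the decoder."""
        self.queue.append(features)

    def record_token(self, token: str) -> None:
        """Decoder side: remember the most recently generated token."""
        self.last_token = token

    def pop_ready_features(self) -> List[str]:
        """Drain the queue in arrival order, but only if the last token was <eos> or <silent>."""
        if self.last_token not in (EOS, SILENT):
            return []                  # mid-response: leave features queued
        drained = list(self.queue)
        self.queue.clear()
        return drained


sched = FifoScheduler()
sched.push_features("f0")
sched.push_features("f1")
first = sched.pop_ready_features()    # idle, so both chunks are consumed
sched.record_token("hello")           # decoder is now mid-response
sched.push_features("f2")
blocked = sched.pop_ready_features()  # nothing is consumed yet
sched.record_token(EOS)
after_eos = sched.pop_ready_features()
```

Because the encoder only appends and the decoder only drains, neither side has to block on the other, which is the property the asynchronous design relies on.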


Empirical Validation / Results

Benchmarks and Baselines

Audio-Interaction (initialized from Qwen2.5-Omni-3B) is evaluated on 8 benchmarks: MMAU (general audio understanding), four spoken-dialogue benchmarks, LibriSpeech (ASR), CoVoST2 (speech translation), and Proactive-Sound-Bench.

Main Results

Enhancement 1: Retained audio understanding under streaming training.

Model              Size  Stream.  Multi-turn  Text instruction  Audio instruction (Avg.)
Qwen2.5-Omni-3B    3B                         57.81             42.51
Audio-Interaction  3B                         55.68             58.15

Under audio instructions, Audio-Interaction reaches 58.15, outperforming its offline initialization (42.51) and competing with larger 7B models.

Enhancement 2: Competitive performance on core speech tasks.

Model              Size  ASR WER (LibriSpeech clean)  S2TT en-zh (BLEU)  S2TT zh-en (BLEU)
Qwen2.5-Omni-3B    3B    2.87                         39.50              18.17
Audio-Interaction  3B    3.17                         55.22              35.21

Audio-Interaction improves translation BLEU by +15.72 (en-zh) and +17.04 (zh-en) over its initialization, with only a marginal ASR regression (WER 2.87 to 3.17), the cost of switching from utterance-level to chunk-wise decoding.

Enhancement 3: Unlocked streaming capabilities.

Category   Single  Multi
Human      56.4    64.9
Daily      68.1    65.8
Equipment  57.1    55.7
Traffic    64.9    69.0
Nature     61.8    61.8
Music      66.7    60.0
Average    61.2    62.8

Audio-Interaction achieves balanced proactive triggering across categories, while offline baselines collapse under multiple concatenated events.

Additional Analysis

  • Continuity ratio: The encoder output has low cross-chunk continuity (0.25), but Layer 0 of the decoder lifts it to 0.80, showing that context continuity is reconstructed via cross-chunk KV-cache access at the earliest decoder layer.
  • Attention head specialization: A single head (L35H14) dominates the streaming-control-token decision across all tasks, indicating a narrow, task-independent pathway learned through the streaming objective.

Ablation Studies

Ablation                                               Metric                                 Value
w/o FIFO scheduling                                    Avg. first-chunk latency / Stall rate  831 ms / 5.2% vs. 392 ms / 0.0%
w/o TFJP preprocessing                                 Trigger accuracy                       85.35% vs. 92.42% (with streaming SFT)
w/o hierarchical event selection                       Trigger accuracy                       88.51% vs. 92.42%
Full Audio-Interaction                                 Trigger accuracy                       96.77%
Chunk size 0.2 s vs. 0.4 s vs. 0.8 s                   MMAU / Latency                         49.74 / 258 ms vs. 58.15 / 392 ms vs. 59.13 / 786 ms
Dual-loss weight $\lambda = 1.0$ vs. $\lambda = 2.0$   MMAU / Trigger Acc.                    58.2 / 96.7 vs. 57.3 / 96.9

Theoretical and Practical Implications

  • Unified framework: The perceive–decide–respond loop provides a principled formulation for streaming audio interaction, subsuming traditional offline tasks and streaming-native capabilities under one model. This eliminates the need for separate models per task.
  • Comprehension-grounded triggering: The work shows that effective response triggering requires semantic understanding of the stream, not just acoustic cues. The hierarchical event selection and TFJP preprocessing are essential for training such a model.
  • Practical deployment: The FIFO-scheduled asynchronous inference design reduces latency and eliminates stalling, making real-time deployment feasible. The 400 ms chunk size balances accuracy and responsiveness.
  • Dataset and benchmark: StreamAudio-2M and Proactive-Sound-Bench fill a critical gap, providing the first large-scale resources for training and evaluating streaming audio interaction models, including proactive assistance.
  • Scaling insights: The attention analysis reveals that the streaming decision is concentrated in a single head, suggesting that the training objective effectively routes this capability through a narrow, task-independent pathway. This may guide future architectural design for streaming models.

Conclusion

  • Main contribution: Formalized the Large Audio Interaction Model (LAIM) paradigm and introduced Audio-Interaction, a unified streaming model that performs always-on audio interaction via a perceive–decide–respond loop.
  • SoundFlow framework: End-to-end support from streaming data construction (TFJP + hierarchical event selection), through comprehension-aware training (history review, silence training, dual-loss), to asynchronous inference (FIFO scheduling).
  • Resources: StreamAudio-2M (2.6M items, 302k hours, 7 abilities, 28 sub-tasks) and Proactive-Sound-Bench (644 events) are released to the community.
  • Results: Competitive performance on 8 benchmarks while unlocking capabilities inaccessible to offline LALMs: real-time ASR, streaming audio instruction following, and proactive help.
  • Future directions: The authors hope the LAIM formulation, SoundFlow framework, and released resources can serve as a foundation for future research on unified streaming audio intelligence, with potential extensions to lower latency, larger model scales, and more complex multi-modal streaming scenarios.
