Visual Summary | Audio Interaction Model

Summary (Overview)

New paradigm: Proposes the Large Audio Interaction Model (LAIM) as a unified, always-on streaming audio language model that replaces isolated offline LALMs and task-specific streaming models.
SoundFlow framework: An end-to-end pipeline covering streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference, enabling a perceive–decide–respond loop.
StreamAudio-2M dataset: A 2.6M-item, 302k-hour streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, with sparse, context-dependent response cues.
Proactive-Sound-Bench: A new benchmark for evaluating proactive audio intervention with 644 human-designed events across 6 categories.
Competitive performance: Audio-Interaction matches or surpasses state-of-the-art on 8 benchmarks (e.g., MMAU 58.15 under audio instructions), while unlocking streaming capabilities like real-time ASR, audio instruction following, and proactive assistance that offline LALMs cannot achieve.

Introduction and Theoretical Foundation

Background

Audio is inherently a continuous, real-time modality. Humans perceive sound moment-by-moment and decide when to react. However, current Large Audio Language Models (LALMs) operate offline: they process complete audio clips and produce a single response. Streaming models exist but are siloed into narrow tasks (e.g., streaming ASR, voice dialogue), each requiring a separate model trained from scratch.

Motivation

The paper identifies two fundamental challenges in moving to an always-on interactive regime:

(C1) Comprehension-grounded response triggering: An interactive model must decide whether to respond or remain silent based on semantic understanding of the unfolding context, not surface-level acoustic cues. Supervision for this decision is sparse and temporally ambiguous.
(C2) Real-time context continuity under chunked inference: Chunking audio breaks temporal continuity; the model must reconstruct context across chunks without inflating the inference window or stalling.

The authors formalize a new paradigm: Large Audio Interaction Models (LAIMs), where audio is consumed chunk-by-chunk, and at each step $t$ the model outputs both a decision token $d_t \in \{\text{<silent>}, \text{<response>}\}$ and a response $r_t$:

$$(d_t, r_t) = f(a_{\le t}, d_{<t}, r_{<t})$$

where $a_t$ is the current audio chunk, $d_t$ the streaming intervention decision, and $r_t$ the generated response.

This loop unifies traditional tasks (ASR, translation, dialogue) and streaming-native capabilities (simultaneous interpretation, proactive intervention) within a single model.
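The perceive–decide–respond loop can be illustrated with a minimal Python sketch. It is not the authors' implementation: `run_stream`, `toy_step`, and the byte-string chunks are hypothetical names chosen for illustration, and `toy_step` is a stand-in for the model $f$ that simply responds on every third chunk.

```python
from typing import Callable, List, Tuple

SILENT = "<silent>"
RESPONSE = "<response>"

# (audio history, decision history, response history) -> (decision, response)
StepFn = Callable[[List[bytes], List[str], List[str]], Tuple[str, str]]


def run_stream(chunks: List[bytes], step: StepFn) -> List[Tuple[int, str]]:
    """Feed chunks one at a time and collect (chunk index, response) pairs."""
    audio: List[bytes] = []      # a_{<=t}
    decisions: List[str] = []    # d_{<t}
    responses: List[str] = []    # r_{<t}
    emitted: List[Tuple[int, str]] = []
    for t, chunk in enumerate(chunks):
        audio.append(chunk)
        d_t, r_t = step(audio, decisions, responses)
        if d_t == SILENT:
            r_t = ""             # staying silent emits nothing
        else:
            emitted.append((t, r_t))
        decisions.append(d_t)
        responses.append(r_t)
    return emitted


def toy_step(audio, decisions, responses):
    """Hypothetical stand-in for the model: respond on every third chunk."""
    if len(audio) % 3 == 0:
        return RESPONSE, f"heard {len(audio)} chunks"
    return SILENT, ""
```

Calling `run_stream([b"a"] * 6, toy_step)` emits responses only at chunk indices 2 and 5; at every other step the decision is `<silent>` and the loop keeps listening.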

Methodology

1. Streaming Data Construction: SoundFlow

Time-Frequency Joint Preprocessing (TFJP): An iterative algorithm that clips excessive silence, estimates background noise, locates core informative spans, and smooths boundaries using half-chunk alignment ($\delta = \frac{1}{2}$ of the audio chunk) and short-window spectral smoothing ($\omega$). This ensures seamless stitching of short clips into natural long-form recordings.
Hierarchical Audio Event Selection: Instead of random concatenation, a multi-step pipeline uses an LLM to plan a scenario, refine it into a sequence of concrete audio events, and ground each event by retrieval from a database or generation via an audio model. This maintains semantic coherence and environmental plausibility across multi-turn streams.
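One plausible reading of the half-chunk alignment step in TFJP is that clip boundaries are snapped to a grid whose spacing is half the chunk length. The sketch below assumes that reading; the function name `align_span`, the outward rounding direction, and the use of integer milliseconds are all illustrative assumptions, not details given in the summary. The 400 ms default matches the chunk length used for training.

```python
from typing import Tuple


def align_span(start_ms: int, end_ms: int, chunk_ms: int = 400) -> Tuple[int, int]:
    """Snap a span outward to the half-chunk grid (delta = chunk_ms / 2)."""
    delta = chunk_ms // 2
    aligned_start = (start_ms // delta) * delta    # round start down to the grid
    aligned_end = -(-end_ms // delta) * delta      # round end up (ceiling division)
    return aligned_start, aligned_end
```

Under these assumptions a span of 130 ms to 910 ms becomes 0 ms to 1000 ms, so stitched clips always meet on a multiple of 200 ms.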

2. Streaming Training

Chunk-wise processing: Audio is consumed in fixed-length chunks (400 ms by default). At each step, the model predicts a special token $d_t \in \{\text{<silent>}, \text{<response>}\}$. On <silent>, it continues listening; on <response>, it switches to autoregressive generation.
Context Memory and Comprehension-Aware Silence Training: Two failure modes are addressed:
- Context forgetting → history review training: insert questions about preceding content at later positions.
- False triggering → comprehension-aware silence training: include a large amount of silent audio (verified by agents in Proactive-Sound-Bench) that warrants no response.
Dual-loss Training: A dedicated streaming loss is added to the standard language modeling loss: $\mathcal{L} = \frac{1}{N}\sum_{j=1}^{N}\left( \underbrace{-\log P_\theta(t_j \mid H_j)}_{\mathcal{L}_{\text{LM}}} + \lambda \cdot \underbrace{\left(-\log P_\theta(s_j \mid H_j)\right)}_{\mathcal{L}_{\text{stream}}} \right)$, where $t_j$ is the target text token, $s_j$ the streaming control token, $H_j$ the decoding context, and $\lambda$ the weighting factor (set to 1.0 after ablation).
Four-stage training pipeline: (1) Format training (offline data, teach <Spe_token>), (2) Adapter training, (3) Large-scale streaming supervised training, (4) Instruction-following fine-tuning with interleaved streaming sequences.
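The dual-loss objective can be checked numerically with a small sketch. It assumes the per-position probabilities $P_\theta(t_j \mid H_j)$ and $P_\theta(s_j \mid H_j)$ are already available as plain floats; the function name `dual_loss` and its arguments are hypothetical, and a real training setup would compute these terms from logits inside a deep learning framework.

```python
import math
from typing import Sequence


def dual_loss(p_text: Sequence[float], p_stream: Sequence[float], lam: float = 1.0) -> float:
    """Mean over N positions of -log P(t_j|H_j) + lam * (-log P(s_j|H_j))."""
    if len(p_text) != len(p_stream) or not p_text:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for pt, ps in zip(p_text, p_stream):
        lm_term = -math.log(pt)        # language modeling term
        stream_term = -math.log(ps)    # streaming control token term
        total += lm_term + lam * stream_term
    return total / len(p_text)
```

With a single position where both probabilities are 0.5, the loss is $2\ln 2$ at $\lambda = 1.0$ and $3\ln 2$ at $\lambda = 2.0$, which shows how $\lambda$ scales only the streaming term.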

3. Asynchronous Inference via FIFO Scheduling

To avoid stalling due to encoder-decoder synchronization, the encoder continuously appends acoustic features to a temporal queue. Decoding is triggered conditionally based on the last generated token: if $r_{t-1} \in \{\text{<eos>}, \text{<silent>}\}$, the model consumes queued features; otherwise, it waits for more audio. This reduces first-frame latency by $4.5\times$ and eliminates stalls.
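The gating rule can be sketched with a plain FIFO buffer. This is an illustration under stated assumptions, not the paper's scheduler: the class `FeatureQueue`, its method names, and the choice to drain the whole queue at once are hypothetical, features are represented as strings, and whatever the decoder does when the gate is closed is left out of the sketch.

```python
from collections import deque
from typing import Deque, List, Optional

# Tokens after which the decoder may take new audio features.
CONSUME_AFTER = {"<eos>", "<silent>"}


class FeatureQueue:
    """FIFO buffer between a hypothetical encoder and decoder."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()

    def push(self, feature: str) -> None:
        """Encoder side: append features without waiting for the decoder."""
        self._queue.append(feature)

    def pop_if_ready(self, last_token: str) -> Optional[List[str]]:
        """Decoder side: drain the queue in arrival order when the gate is open."""
        if last_token not in CONSUME_AFTER or not self._queue:
            return None
        drained = list(self._queue)
        self._queue.clear()
        return drained
```

Because `push` never blocks, the encoder keeps buffering audio while a response is being generated, and the buffered features are handed over in order once the last token is `<eos>` or `<silent>`.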

Empirical Validation / Results

Benchmarks and Baselines

Audio-Interaction (initialized from Qwen2.5-Omni-3B) is evaluated on 8 benchmarks: MMAU (general audio understanding), four spoken-dialogue benchmarks, LibriSpeech (ASR), CoVoST2 (speech translation), and Proactive-Sound-Bench.

Main Results

Enhancement 1: Retained audio understanding under streaming training.

Model	Size	Stream.	Multi-turn	Text instruction	Audio instruction (Avg.)
Qwen2.5-Omni-3B	3B	✗	✓	57.81	42.51
Audio-Interaction	3B	✓	✓	55.68	58.15

Under audio instructions, Audio-Interaction reaches 58.15 on MMAU, well above its offline initialization (42.51) and competitive with larger 7B models, at the cost of a 2.13-point drop under text instructions (57.81 to 55.68).

Enhancement 2: Competitive performance on core speech tasks.

Model	Size	ASR (LibriSpeech clean, WER)	S2TT (en-zh BLEU)	S2TT (zh-en BLEU)
Qwen2.5-Omni-3B	3B	2.87	39.50	18.17
Audio-Interaction	3B	3.17	55.22	35.21

Audio-Interaction improves translation BLEU by +15.72 (en-zh) and +17.04 (zh-en) over its initialization, with only a marginal ASR regression (WER 2.87 to 3.17, the cost of switching from utterance-level to chunk-wise decoding).
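The reported gains follow directly from the table. A short check, with dictionary keys chosen here purely for illustration:

```python
# Values copied from the table above.
baseline = {"wer": 2.87, "en_zh": 39.50, "zh_en": 18.17}     # Qwen2.5-Omni-3B
streaming = {"wer": 3.17, "en_zh": 55.22, "zh_en": 35.21}    # Audio-Interaction

# Positive deltas mean higher BLEU (better) or higher WER (worse).
deltas = {k: round(streaming[k] - baseline[k], 2) for k in baseline}
```

Here `deltas` holds a WER increase of 0.3 alongside BLEU gains of 15.72 and 17.04.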

Enhancement 3: Unlocked streaming capabilities.

Category	Single	Multi
Human	56.4	64.9
Daily	68.1	65.8
Equipment	57.1	55.7
Traffic	64.9	69.0
Nature	61.8	61.8
Music	66.7	60.0
Average	61.2	62.8

Audio-Interaction achieves balanced proactive triggering across categories, while offline baselines collapse under multiple concatenated events.

Additional Analysis

Continuity ratio: The encoder output has low cross-chunk continuity (0.25), but Layer 0 of the decoder lifts it to 0.80, showing that context continuity is reconstructed via cross-chunk KV-cache access at the earliest decoder layer.
Attention head specialization: A single head (L35H14) dominates the streaming-control-token decision across all tasks, indicating a narrow, task-independent pathway learned through the streaming objective.

Ablation Studies

Ablation	Metric	Value
w/o FIFO scheduling	Avg. first-chunk latency / Stall rate	831 ms / 5.2% vs. 392 ms / 0.0%
w/o TFJP preprocessing	Trigger accuracy	85.35% vs. 92.42% (with streaming SFT)
w/o hierarchical event selection	Trigger accuracy	88.51% vs. 92.42%
Full Audio-Interaction	Trigger accuracy	96.77%
Chunk size 0.2s vs. 0.4s vs. 0.8s	MMAU / Latency	49.74/258ms vs. 58.15/392ms vs. 59.13/786ms
Dual-loss weight $\lambda=1.0$ vs. $\lambda=2.0$	MMAU / Trigger Acc.	58.2/96.7 vs. 57.3/96.9

Theoretical and Practical Implications

Unified framework: The perceive–decide–respond loop provides a principled formulation for streaming audio interaction, subsuming traditional offline tasks and streaming-native capabilities under one model. This eliminates the need for separate models per task.
Comprehension-grounded triggering: The work shows that effective response triggering requires semantic understanding of the stream, not just acoustic cues. The hierarchical event selection and TFJP preprocessing are essential for training such a model.
Practical deployment: The FIFO-scheduled asynchronous inference design reduces latency and eliminates stalling, making real-time deployment feasible. The 400 ms chunk size balances accuracy and responsiveness.
Dataset and benchmark: StreamAudio-2M and Proactive-Sound-Bench fill a critical gap, providing the first large-scale resources for training and evaluating streaming audio interaction models, including proactive assistance.
Scaling insights: The attention analysis reveals that the streaming decision is concentrated in a single head, suggesting that the training objective effectively routes this capability through a narrow, task-independent pathway. This may guide future architectural design for streaming models.

Conclusion

Main contribution: Formalized the Large Audio Interaction Model (LAIM) paradigm and introduced Audio-Interaction, a unified streaming model that performs always-on audio interaction via a perceive–decide–respond loop.
SoundFlow framework: End-to-end support from streaming data construction (TFJP + hierarchical event selection), through comprehension-aware training (history review, silence training, dual-loss), to asynchronous inference (FIFO scheduling).
Resources: StreamAudio-2M (2.6M items, 302k hours, 7 abilities, 28 sub-tasks) and Proactive-Sound-Bench (644 events) are released to the community.
Results: Competitive performance on 8 benchmarks while unlocking capabilities inaccessible to offline LALMs: real-time ASR, streaming audio instruction following, and proactive assistance.
Future directions: The authors hope the LAIM formulation, SoundFlow framework, and released resources can serve as a foundation for future research on unified streaming audio intelligence, with potential extensions to lower latency, larger model scales, and more complex multi-modal streaming scenarios.
