# Audio Interaction Model

> Audio-Interaction, a unified streaming audio model, matches offline LALMs on benchmarks while unlocking real-time ASR and proactive assistance.

- **Source:** [arXiv](https://arxiv.org/abs/2606.05121)
- **Published:** 2026-06-05
- **Permalink:** https://picx.dev/p/fh0Jks
- **Whiteboard:** https://picx.dev/p/fh0Jks/image

## Summary (Overview)

- **New paradigm**: Proposes the **Large Audio Interaction Model (LAIM)** as a unified, always-on streaming audio language model that replaces isolated offline LALMs and task-specific streaming models.
- **SoundFlow framework**: An end-to-end pipeline covering streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference, enabling a *perceive–decide–respond* loop.
- **StreamAudio-2M dataset**: A 2.6M-item, 302k-hour streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, with sparse, context-dependent response cues.
- **Proactive-Sound-Bench**: A new benchmark for evaluating proactive audio intervention with 644 human-designed events across 6 categories.
- **Competitive performance**: Audio-Interaction matches or surpasses the state of the art on 8 benchmarks (e.g., MMAU 58.15 under audio instructions), while unlocking streaming capabilities that offline LALMs cannot provide: real-time ASR, audio instruction following, and proactive assistance.

---

## Introduction and Theoretical Foundation

### Background

Audio is inherently a continuous, real-time modality. Humans perceive sound moment by moment and decide when to react. However, current Large Audio Language Models (LALMs) operate **offline**: they process complete audio clips and produce a single response. Streaming models exist but are siloed into narrow tasks (e.g., streaming ASR, voice dialogue), each requiring a separate model trained from scratch.
### Motivation

The paper identifies two fundamental challenges in moving to an always-on interactive regime:

- **(C1) Comprehension-grounded response triggering**: An interactive model must decide whether to respond or remain silent based on semantic understanding of the unfolding context, not surface-level acoustic cues. Supervision for this decision is sparse and temporally ambiguous.
- **(C2) Real-time context continuity under chunked inference**: Chunking audio breaks temporal continuity; the model must reconstruct context across chunks without inflating the inference window or stalling.

The authors formalize a new paradigm: **Large Audio Interaction Models (LAIMs)**, where audio is consumed chunk by chunk, and at each step $t$ the model outputs both a decision token $d_t \in \{\text{silent}, \text{response}\}$ and a response $r_t$:

$$(d_t, r_t) = f(a_{\le t}, d_{<t}, \dots)$$

[…] $\{\text{silent}, \text{response}\}$. If silent, the model continues listening; if response, it switches to autoregressive generation.

- **Context Memory and Comprehension-Aware Silence Training**: Two failure modes are addressed:
  - Context forgetting → *history review training*: insert questions about preceding content at later positions.
  - False triggering → *comprehension-aware silence training*: include a large amount of silent audio (verified by agents in Proactive-Sound-Bench) that warrants no response.
- **Dual-loss Training**: A dedicated streaming loss is added to the standard language modeling loss:

  $$\mathcal{L} = \frac{1}{N}\sum_{j=1}^{N}\left( \underbrace{-\log P_\theta(t_j \mid H_j)}_{\mathcal{L}_{\text{LM}}} + \lambda \underbrace{\left(-\log P_\theta(s_j \mid H_j)\right)}_{\mathcal{L}_{\text{stream}}} \right)$$

  where $t_j$ is the target text token, $s_j$ the streaming control token, $H_j$ the decoding context, $N$ the number of positions, and $\lambda$ the weighting factor (set to 1.0 after ablation).
- **Four-stage training pipeline**: (1) Format training (offline data, teach […]), (2) Adapter training, (3) Large-scale streaming supervised training, (4) Instruction-following fine-tuning with interleaved streaming sequences.

### 3. Asynchronous Inference via FIFO Scheduling

To avoid stalls caused by encoder-decoder synchronization, the encoder continuously appends acoustic features to a temporal queue. Decoding is triggered conditionally on the last generated token: if $r_{t-1}$ is one of two designated tokens, the model consumes queued features; otherwise, it waits for more audio. This reduces first-frame latency by $4.5\times$ and eliminates stalls.

---

## Empirical Validation / Results

### Benchmarks and Baselines

Audio-Interaction (initialized from Qwen2.5-Omni-3B) is evaluated on 8 benchmarks: MMAU (general audio understanding), four spoken-dialogue benchmarks, LibriSpeech (ASR), CoVoST2 (speech translation), and Proactive-Sound-Bench.

### Main Results

Enhancement 1: Retained audio understanding under streaming training.

| Model | Size | Streaming | Multi-turn | MMAU (text instruction) | MMAU (audio instruction, avg.) |
|-------|------|-----------|------------|-------------------------|--------------------------------|
| Qwen2.5-Omni-3B | 3B | ✗ | ✓ | 57.81 | 42.51 |
| **Audio-Interaction** | 3B | ✓ | ✓ | 55.68 | **58.15** |

Under audio instructions, Audio-Interaction reaches 58.15, outperforming its offline initialization (42.51) and competing with larger 7B models. Under text instructions it trails the initialization slightly (55.68 vs. 57.81).

Enhancement 2: Competitive performance on core speech tasks.

| Model | Size | ASR WER (LibriSpeech clean) | S2TT BLEU (en-zh) | S2TT BLEU (zh-en) |
|-------|------|-----------------------------|-------------------|-------------------|
| Qwen2.5-Omni-3B | 3B | 2.87 | 39.50 | 18.17 |
| **Audio-Interaction** | 3B | 3.17 | **55.22** | **35.21** |

Audio-Interaction improves translation BLEU by 15.72 (en-zh) and 17.04 (zh-en) over its initialization, with only marginal ASR regression (the cost of switching from utterance-level to chunk-wise decoding).

Enhancement 3: Unlocked streaming capabilities.

| Category | Single | Multi |
|----------|--------|-------|
| Human | 56.4 | 64.9 |
| Daily | 68.1 | 65.8 |
| Equipment | 57.1 | 55.7 |
| Traffic | 64.9 | 69.0 |
| Nature | 61.8 | 61.8 |
| Music | 66.7 | 60.0 |
| **Average** | **61.2** | **62.8** |

Audio-Interaction achieves balanced proactive triggering across categories, while offline baselines collapse under multiple concatenated events.

### Additional Analysis

- **Continuity ratio**: The encoder output has low cross-chunk continuity (0.25), but Layer 0 of the decoder lifts it to 0.80, showing that context continuity is reconstructed via cross-chunk KV-cache access at the earliest decoder layer.
- **Attention head specialization**: A single head (L35H14) dominates the streaming-control-token decision across all tasks, indicating a narrow, task-independent pathway learned through the streaming objective.

### Ablation Studies

| Ablation | Metric | Value |
|----------|--------|-------|
| w/o FIFO scheduling | Avg. first-chunk latency / Stall rate | 831 ms / 5.2% vs. **392 ms / 0.0%** |
| w/o TFJP preprocessing | Trigger accuracy | 85.35% vs. **92.42%** (with streaming SFT) |
| w/o hierarchical event selection | Trigger accuracy | 88.51% vs. **92.42%** |
| Full Audio-Interaction | Trigger accuracy | **96.77%** |
| Chunk size 0.2 s vs. 0.4 s vs. 0.8 s | MMAU / Latency | 49.74 / 258 ms vs. **58.15 / 392 ms** vs. 59.13 / 786 ms |
| Dual-loss weight $\lambda=1.0$ vs. $\lambda=2.0$ | MMAU / Trigger Acc. | **58.2 / 96.7** vs. 57.3 / 96.9 |

---

## Theoretical and Practical Implications

- **Unified framework**: The perceive–decide–respond loop provides a principled formulation for streaming audio interaction, subsuming traditional offline tasks and streaming-native capabilities under one model. This eliminates the need for separate models per task.
- **Comprehension-grounded triggering**: The work shows that effective response triggering requires semantic understanding of the stream, not just acoustic cues. The hierarchical event selection and TFJP preprocessing are essential for training such a model.
- **Practical deployment**: The FIFO-scheduled asynchronous inference design reduces latency and eliminates stalling, making real-time deployment feasible. The 400 ms chunk size balances accuracy and responsiveness.
- **Dataset and benchmark**: StreamAudio-2M and Proactive-Sound-Bench fill a critical gap, providing the first large-scale resources for training and evaluating streaming audio interaction models, including proactive assistance.
- **Scaling insights**: The attention analysis reveals that the streaming decision is concentrated in a single head, suggesting that the training objective effectively routes this capability through a narrow, task-independent pathway. This may guide future architectural design for streaming models.

---

## Conclusion

- **Main contribution**: Formalized the Audio Interaction Model paradigm and introduced Audio-Interaction, a unified streaming model that performs always-on audio interaction via a perceive–decide–respond loop.
- **SoundFlow framework**: End-to-end support from streaming data construction (TFJP + hierarchical event selection), through comprehension-aware training (history review, silence training, dual-loss), to asynchronous inference (FIFO scheduling).
- **Resources**: StreamAudio-2M (2.6M items, 302k hours, 7 abilities, 28 sub-tasks) and Proactive-Sound-Bench (644 events) are released to the community.
- **Results**: Competitive performance on 8 benchmarks while unlocking capabilities inaccessible to offline LALMs: real-time ASR, streaming audio instruction following, and proactive help.
- **Future directions**: The authors hope the LAIM formulation, SoundFlow framework, and released resources can serve as a foundation for future research on unified streaming audio intelligence, with potential extensions to lower latency, larger model scales, and more complex multi-modal streaming scenarios.

---

_Markdown view of https://picx.dev/p/fh0Jks, served by PicX (AI-generated visual whiteboard summaries of research papers)._
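
As a worked illustration of the dual-loss objective described under Dual-loss Training, the sketch below averages the language-modeling term and the $\lambda$-weighted streaming term over $N$ positions. The function name `dual_loss`, its parameters, and the toy probabilities are illustrative assumptions, not taken from the paper; a real training loop would derive both probabilities from model logits.

```python
import math
from typing import Sequence


def dual_loss(
    lm_probs: Sequence[float],
    stream_probs: Sequence[float],
    lam: float = 1.0,
) -> float:
    """Mean over N positions of -log P(t_j|H_j) + lam * (-log P(s_j|H_j)).

    lm_probs[j]     : probability assigned to the target text token t_j
    stream_probs[j] : probability assigned to the streaming control token s_j
    lam             : weighting factor (the summary reports 1.0 after ablation)
    All three names are hypothetical, chosen for this sketch only.
    """
    if len(lm_probs) != len(stream_probs) or len(lm_probs) == 0:
        raise ValueError("need two non-empty sequences of equal length")

    total = 0.0
    for p_text, p_ctrl in zip(lm_probs, stream_probs):
        loss_lm = -math.log(p_text)        # language-modeling term
        loss_stream = -math.log(p_ctrl)    # streaming-control term
        total += loss_lm + lam * loss_stream
    return total / len(lm_probs)


# Toy values (assumed for illustration): two decoding positions.
toy_lm_probs = [0.5, 0.25]
toy_stream_probs = [0.5, 0.5]

loss_lam1 = dual_loss(toy_lm_probs, toy_stream_probs, lam=1.0)
loss_lam2 = dual_loss(toy_lm_probs, toy_stream_probs, lam=2.0)
print(f"lambda=1.0: {loss_lam1:.4f}")
print(f"lambda=2.0: {loss_lam2:.4f}")
```

With these toy inputs the loss is $2.5\ln 2 \approx 1.733$ for $\lambda = 1.0$ and $3.5\ln 2 \approx 2.426$ for $\lambda = 2.0$: raising $\lambda$ increases only the contribution of the streaming-control term.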