SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
Summary (Overview)
- Problem Focus: Addresses the challenge of generating expressive, long-form, multi-speaker dialogue in zero-shot TTS, moving beyond single-turn synthesis. The common workaround of stitching single-speaker outputs breaks acoustic consistency, conversational coherence, and affective continuity.
- Core Contributions: Introduces two main components:
  - SwanData-Speech: A comprehensive data processing pipeline for constructing high-quality monologue and dialogue training corpora from in-the-wild audio, featuring pause-aware alignment and pronunciation-hard synthetic data.
  - SwanVoice: A zero-shot TTS model for 1–4 speakers that combines a 25 Hz VAE, raw-text conditioning, and a flow-matching DiT with speaker-turn conditioning, trained via a curriculum and post-trained with DiffusionNFT.
- Key Findings: On the SwanBench-Speech benchmark, SwanVoice achieves the highest expressive richness and expressive hierarchy scores among all evaluated open-source models, in both monologue synthesis (3.81 and 3.62) and dialogue synthesis (3.62 and 3.71).
- Main Limitation: While excelling in expressiveness and acoustic metrics, the model's content accuracy (higher CER/WER) remains a weaker point compared to some top baselines.
- Architectural Choice: Employs a non-autoregressive (NAR), flow-matching generative model, chosen over autoregressive designs to avoid sequential latency and exposure-bias failures (e.g., word skipping/repetition) in long dialogue generation.
Introduction and Theoretical Foundation
Recent zero-shot TTS has advanced single-speaker synthesis, but applications like podcasts, dramas, and chatbots require modeling full multi-party conversations as a single generation problem. Synthesizing turns separately and concatenating them leads to inconsistencies in room acoustics, background noise, speaking intensity, and pause timing, making the output sound artificially assembled.
Dialogue-capable TTS models must therefore:
- Maintain a stable acoustic environment.
- Keep speaker turns separable, even for similar voices.
- Preserve affective continuity across turns.
- Not degrade monologue synthesis quality when trained on dialogue data.
These challenges are tightly coupled with data construction (turn boundaries, pauses, labels) and model architecture. While some dialogue TTS models use autoregressive (AR) designs, they suffer from sequential latency and exposure bias in long contexts. This paper argues that non-autoregressive (NAR) generative modeling is better suited, as it reduces latency and conditions on the full text and speaker-turn sequence simultaneously.
The paper identifies two central bottlenecks:
- Dialogue Data Requirements: Need speaker-consistent segments, pause-aware transcripts (not just semantic punctuation), quality filtering, and sufficient non-neutral speech for affective variation.
- Training Stability: Dialogue training should not erase monologue ability. Fine-tuning a monologue model on dialogue data often improves turn control but can weaken monologue quality and cause pronunciation drift.
Methodology
The methodology is built on two pillars: the data pipeline (SwanData-Speech) and the synthesis model (SwanVoice).
1. Data Processing Pipeline: SwanData-Speech
This pipeline processes ~2.59M hours of in-the-wild audio (podcasts, dramas, etc.) into monologue and dialogue subsets.
Key Components:
- Sources: ~2.24M hours Chinese, ~0.35M hours English internal resources, plus open-source datasets.
- RobustMegaTTS3: A pronunciation-hard synthetic subset. An LLM generates example sentences for rare words/polyphonic characters. This text is synthesized using MegaTTS 3 (a phoneme-based model) to provide dictionary-level pronunciation knowledge.
- Pipeline Stages (Fig. 1):
  - Speech Enhancement & Speaker Diarization: Uses a vocal separator and the 3D-Speaker toolkit for VAD and clustering. Segments are merged (monologue ≤60s, dialogue 2-4 speakers ≤120s).
  - Transcription & Alignment: ASR via SenseVoice-Small. Crucially, punctuation is corrected using a forced aligner to match acoustic pauses, not just semantics.
    - Pause < 0.08s: Ignored.
    - Pause 0.08s–0.18s: Insert `<|sp|>` token.
    - Pause 0.18s–0.45s: Insert comma.
    - Pause > 0.45s: Insert period/exclamation/question mark.
  - Data Filtering: Uses non-intrusive metrics (DNSMOS, PESQ, STOI) for quality. Emotion2vec+ classifies emotion to create a high-expressiveness subset.
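The pause thresholds in the Transcription & Alignment stage can be read as a simple lookup from pause duration to an inserted mark. The following is a minimal sketch, not the pipeline's actual code: the function names, the handling of the exact boundary values (0.18s and 0.45s are treated as commas here), and the way the sentence-final mark is chosen are all assumptions made for illustration.

```python
SP_TOKEN = "<|sp|>"  # dedicated short-pause token named in the text


def pause_to_mark(pause_sec: float, sentence_end: str = ".") -> str:
    """Map an aligner-measured pause (seconds) to the mark inserted after a word.

    Boundary handling is an assumption: the text only gives the ranges.
    """
    if pause_sec < 0.08:
        return ""            # too short to be marked
    if pause_sec < 0.18:
        return SP_TOKEN      # short pause
    if pause_sec <= 0.45:
        return ","           # medium pause
    return sentence_end      # long pause: period, exclamation or question mark


def insert_pauses(words, pauses, sentence_end="."):
    """words[i] is followed by a silence of pauses[i] seconds."""
    return " ".join(w + pause_to_mark(p, sentence_end) for w, p in zip(words, pauses))


example = insert_pauses(["hello", "there", "my", "friend"], [0.05, 0.10, 0.30, 0.90])
print(example)
```

The choice between period, exclamation mark, and question mark is left to the caller here, since the text does not say how that decision is made.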
2. Synthesis Model: SwanVoice
An overview of the training and inference procedure is shown in Figure 2.
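a) VAE (Variational Autoencoder)
Compresses the waveform $x$ to a latent representation $z$ at 25 Hz (25 frames/sec). The encoder downsamples the waveform by a fixed factor (the sampling rate divided by 25), and the decoder reconstructs the waveform $\hat{x}$ from $z$.
Training Objective:
$$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{rec}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}$$
where $\mathcal{L}_{\text{rec}}$ is a spectrogram reconstruction loss, $\mathcal{L}_{\text{KL}}$ is a light KL regularizer, $\mathcal{L}_{\text{adv}}$ is an LSGAN-style adversarial loss using MPD, MSD, and MRD discriminators, and $\lambda_{\text{KL}}$, $\lambda_{\text{adv}}$ are loss weights.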
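To make the 25 Hz latent rate concrete, the arithmetic below relates audio duration and sampling rate to latent sequence length. It is a small sketch under stated assumptions: the 24 kHz sampling rate is a hypothetical value chosen for illustration (the text does not give one), and the helper names are invented here.

```python
import math

LATENT_RATE_HZ = 25  # latent frames per second, as stated in the text


def downsample_factor(sample_rate_hz: int) -> int:
    """Waveform samples represented by one latent frame."""
    if sample_rate_hz % LATENT_RATE_HZ != 0:
        raise ValueError("sample rate must be a multiple of the latent rate")
    return sample_rate_hz // LATENT_RATE_HZ


def num_latent_frames(duration_sec: float) -> int:
    """Latent sequence length for an utterance of the given duration."""
    return math.ceil(duration_sec * LATENT_RATE_HZ)


# Hypothetical 24 kHz audio: each latent frame covers 960 samples.
print(downsample_factor(24000))
# The 60 s monologue and 120 s dialogue caps map to these sequence lengths.
print(num_latent_frames(60), num_latent_frames(120))
```

At this rate, even the longest 120 s dialogue segment is a sequence of only a few thousand latent frames, which helps make full-context generation practical.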
b) Tokenizer & Conditioning
- Uses the CosyVoice tokenizer on raw text (no separate G2P).
- Adds a dedicated pause token `<|sp|>`.
- Augments vocabulary with 1,549 pinyin syllables. During training, Chinese characters are randomly replaced with pinyin to improve pronunciation robustness.
- Speaker-turn conditioning: Text is wrapped with tags `<S{id}>` and `</S{id}>`. A speaker label sequence of the same length as the text tokens is constructed, indicating the speaker ID for each token.
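The speaker-turn conditioning described above can be sketched as follows. This is an illustrative sketch only: `build_turn_conditioning` is a hypothetical helper, whitespace splitting stands in for the CosyVoice tokenizer, and treating each tag as a single token labeled with its own speaker is an assumption.

```python
def build_turn_conditioning(turns, tokenize=str.split):
    """Build tagged text, tokens, and a per-token speaker label sequence.

    turns: list of (speaker_id, text) pairs in dialogue order.
    tokenize: stand-in tokenizer (assumption; the real model uses CosyVoice's).
    """
    tagged_parts, tokens, labels = [], [], []
    for spk, text in turns:
        open_tag, close_tag = f"<S{spk}>", f"</S{spk}>"
        tagged_parts.append(f"{open_tag}{text}{close_tag}")
        turn_tokens = [open_tag, *tokenize(text), close_tag]
        tokens.extend(turn_tokens)
        # One speaker label per token, so both sequences have equal length.
        labels.extend([spk] * len(turn_tokens))
    return "".join(tagged_parts), tokens, labels


tagged, tokens, labels = build_turn_conditioning([(1, "hi there"), (2, "hello")])
print(tagged)
print(list(zip(tokens, labels)))
```

Keeping the label sequence aligned one-to-one with the tokens lets a speaker embedding be added to each token embedding.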
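c) Flow-based Transformer (DiT)
A Diffusion Transformer (DiT) $v_\theta$ estimates the velocity field between noise $x_0 \sim \mathcal{N}(0, I)$ and the clean target latent $x_1$.
Flow-Matching Loss:
$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\,x_0,\,x_1}\left[\left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2\right]$$
where the interpolated latent is:
$$x_t = (1 - t)\,x_0 + t\,x_1, \quad t \in [0, 1]$$
and $c$ denotes the full conditioning (text tokens, speaker-turn embeddings, reference speech latent).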
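A single training step of a flow-matching objective can be sketched in plain Python. The sketch assumes the common linear path in which noise sits at t = 0 and the clean latent at t = 1, so the regression target is their difference. `model` is a hypothetical callable standing in for the DiT, and latents are flat lists of floats for simplicity.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to clean latent x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight path: d/dt x_t = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(model, x1, cond, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise
    t = rng.random()                         # uniform time in [0, 1)
    pred = model(interpolate(x0, x1, t), t, cond)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x1)


# Toy usage: a dummy "model" that always predicts zero velocity.
rng = random.Random(0)
dummy_model = lambda xt, t, cond: [0.0] * len(xt)
loss = flow_matching_loss(dummy_model, [0.5, -1.0, 2.0], cond=None, rng=rng)
print(loss)
```

Because the path is linear, the target velocity does not depend on t, so the network only has to learn a direction from the current point toward the data.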
The model uses RMSNorm and AdaLN-based global adapters for stability. Text/turn conditions are processed by a lightweight Transformer stack before interacting with the speech latent, improving in-context conditioning.
d) Three-Stage Curriculum Learning
- Monologue Pretraining: Train from scratch on ~2M hours of monologue speech + RobustMegaTTS3 synthetic data. Establishes basic synthesis and alignment.
- Mixed Conversational Training: Train the pretrained model on monologue data + concatenated 2–4-speaker data. Learns speaker switching without inheriting the errors of real dialogue data.
- Supervised Fine-Tuning (SFT): Train on monologue data + real 2–4-speaker conversational data (from movies, TV, podcasts). Learns higher-level dialogue consistency and affective variation.
e) Post-Training with DiffusionNFT
After supervised training, the model is fine-tuned using DiffusionNFT, an online RL method, to optimize for pronunciation and speaker similarity rewards.
Reward Models:
- Phone Consistency Reward: Based on phone-level WER between generated speech and target text.
- Speaker Similarity Reward: Cosine similarity in a speaker embedding space.
- Aggregate Reward: Combines the phone consistency and speaker similarity rewards into a single scalar.
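The two reward signals can be illustrated with a small self-contained sketch. Several parts are assumptions rather than the paper's definitions: mapping the phone error rate to a reward as `1 - error` clipped at zero, combining the rewards as a weighted sum, and the equal default weights. Phone sequences and speaker embeddings are passed in directly, standing in for the outputs of a phone recognizer and a speaker encoder.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def phone_consistency_reward(ref_phones, hyp_phones):
    """Assumed mapping: 1 minus the phone-level error rate, clipped at 0."""
    if not ref_phones:
        raise ValueError("reference phone sequence is empty")
    error_rate = edit_distance(ref_phones, hyp_phones) / len(ref_phones)
    return max(0.0, 1.0 - error_rate)


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def aggregate_reward(r_phone, r_spk, w_phone=0.5, w_spk=0.5):
    """Assumed aggregation: weighted sum (weights are hypothetical)."""
    return w_phone * r_phone + w_spk * r_spk


r_phone = phone_consistency_reward(["n", "i", "h", "ao"], ["n", "i", "h", "a"])
r_spk = cosine_similarity([1.0, 0.0], [1.0, 0.0])
print(r_phone, r_spk, aggregate_reward(r_phone, r_spk))
```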
DiffusionNFT Objective: The policy is optimized using a loss that combines a reward-weighted term and a reference-policy regularizer.
f) Inference Procedure
- Takes a reference speech segment and target text as input.
- Uses a duration model for target length estimation.
- Employs sway sampling to improve alignment.
- Introduces staircase classifier-free guidance (CFG) to separately control text content and reference speaker/style influence, with a separate guidance scale for each.
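Two of the inference steps lend themselves to a short sketch. The first function is the sway sampling time warp as commonly stated for flow-matching TTS. The second is a generic nested two-scale CFG combination, shown only as an assumption of how text and reference guidance might be separated; the paper's exact staircase formulation is not reproduced here. The names `w_ref` and `w_text` and the order of the conditions are hypothetical.

```python
import math


def sway_sample(u: float, s: float = -1.0) -> float:
    """Warp a uniform time u in [0, 1]; negative s spends more steps near t = 0."""
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def time_grid(num_steps: int, s: float = -1.0):
    """Warped ODE time steps from 0 to 1."""
    return [sway_sample(i / num_steps, s) for i in range(num_steps + 1)]


def two_scale_cfg(v_uncond, v_ref, v_full, w_ref, w_text):
    """Nested guidance (assumed form).

    v_uncond: velocity with no conditioning
    v_ref:    velocity with reference speech only
    v_full:   velocity with reference speech and text
    """
    return [
        u + w_ref * (r - u) + w_text * (f - r)
        for u, r, f in zip(v_uncond, v_ref, v_full)
    ]


grid = time_grid(4)
guided = two_scale_cfg([0.0], [1.0], [3.0], w_ref=2.0, w_text=1.5)
print(grid)
print(guided)
```

With both scales set to 1, the nested form reduces to the fully conditioned prediction, so the two scales act as independent knobs around that baseline.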
Empirical Validation / Results
Models are evaluated on the SwanBench-Speech benchmark across three axes: Acoustics, Semantics, and Expressiveness.
Evaluation Metrics:
- Acoustics: Timbre Consistency (↑), Reverb Consistency (↓ std of SRMR), Sound Fidelity (↑ PESQ).
- Semantics: Content Error (↓ CER/WER), Prosodic Coherence (↑ 1-5 scale via SpeechJudge).
- Expressiveness: Richness (mean expressiveness score per 10s chunk) and Hierarchy (overall score for emotional variation, dynamics, scene appropriateness). Both scored 1-5 by Gemini-3-Pro as an MLLM judge.
1. Zero-Shot Monologue TTS
Table 1: Evaluation results of long-form TTS models across multi-dimensional metrics.
| Model | Timbre(↑) | Reverb(↓) | Sound Fidelity(↑) | Content Error(↓) | Prosody(↑) | Richness(↑) | Hierarchy(↑) |
|---|---|---|---|---|---|---|---|
| Open-Source Models | |||||||
| CosyVoice-2 | 0.93 | 2.37 | 3.58 | 0.106 | 2.81 | 2.02 | 2.59 |
| CosyVoice-3 | 0.93 | 2.73 | 3.80 | 0.077 | 3.26 | 2.64 | 2.47 |
| FishSpeech | 0.93 | 2.00 | 4.09 | 0.066 | 3.77 | 2.37 | 2.90 |
| F5TTS | 0.92 | 2.12 | 2.60 | 0.085 | 2.87 | 2.77 | 2.97 |
| GLM-TTS | 0.94 | 1.64 | 3.90 | 0.074 | 3.28 | 1.57 | 2.39 |
| IndexTTS-2 | 0.93 | 1.77 | 2.78 | 0.077 | 3.63 | 3.32 | 2.94 |
| MegaTTS-3 | 0.93 | 2.07 | 3.52 | 0.072 | 3.22 | 2.40 | 3.01 |
| SparkTTS | 0.92 | 2.04 | 3.53 | 0.314 | 2.35 | 2.23 | 2.22 |
| VibeVoice | 0.92 | 2.45 | 3.47 | 0.092 | 3.75 | 3.42 | 3.06 |
| ZipVoice | 0.89 | 2.10 | 3.53 | 0.213 | 2.97 | 2.11 | 2.05 |
| Average | 0.92 | 2.13 | 3.48 | 0.12 | 3.19 | 2.49 | 2.66 |
| SwanVoice | 0.93 | 2.06 | 3.60 | 0.172 | 3.56 | 3.81 | 3.62 |
Key Result: SwanVoice achieves the highest Richness (3.81) and Hierarchy (3.62) scores, outperforming the strongest baseline (VibeVoice) by 0.39 and 0.56 points, respectively. Its acoustic and prosodic scores are competitive, but its content error (0.172) is above the baseline average (0.12) and higher than every baseline except SparkTTS (0.314) and ZipVoice (0.213).
2. Zero-Shot Dialogue TTS
Table 2: Results of dialogue generation models across SwanBench-Speech metrics.
| Model | Timbre(↑) | Reverb(↓) | Sound Fidelity(↑) | Content Error(↓) | Prosody(↑) | Richness(↑) | Hierarchy(↑) |
|---|---|---|---|---|---|---|---|
| Open-Source Models | |||||||
| FireRedTTS-2 | 0.91 | 3.54 | 2.54 | 0.148 | 2.93 | 2.52 | 2.65 |
| MoonCast | 0.90 | 3.29 | 2.60 | 0.284 | 2.93 | 2.42 | 2.54 |
| MOSS-TTSD | 0.89 | 3.52 | 2.83 | 0.227 | 2.57 | 3.04 | 2.86 |
| SoulX-Podcast | 0.92 | 3.23 | 3.98 | 0.101 | 3.89 | 2.80 | 3.15 |
| VibeVoice | 0.89 | 2.09 | 2.75 | 0.204 | 3.00 | 3.09 | 2.83 |
| ZipVoice-Dialog | 0.90 | 3.49 | 2.48 | 0.116 | 3.46 | 2.88 | 2.93 |
| Average | 0.90 | 3.19 | 2.86 | 0.180 | 3.13 | 2.79 | 2.83 |
| SwanVoice | 0.92 | 3.02 | 3.77 | 0.145 | 3.70 | 3.62 | 3.71 |
Key Result: SwanVoice again achieves the highest expressiveness scores (3.62 Richness, 3.71 Hierarchy), outperforming the best baseline on each metric by 0.53 points (VibeVoice, Richness) and 0.56 points (SoulX-Podcast, Hierarchy). It ranks second only to SoulX-Podcast in Sound Fidelity (3.77 vs. 3.98) and Prosody (3.70 vs. 3.89), and its content error (0.145) is below the baseline average (0.180).
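The expressiveness margins quoted for Tables 1 and 2 can be checked directly from the table values. The snippet below copies the Richness and Hierarchy columns and computes SwanVoice's lead over the best baseline on each metric; the helper name `margin_over_best` is introduced here for illustration.

```python
# (Richness, Hierarchy) per baseline, copied from Tables 1 and 2.
MONOLOGUE = {
    "CosyVoice-2": (2.02, 2.59), "CosyVoice-3": (2.64, 2.47),
    "FishSpeech": (2.37, 2.90), "F5TTS": (2.77, 2.97),
    "GLM-TTS": (1.57, 2.39), "IndexTTS-2": (3.32, 2.94),
    "MegaTTS-3": (2.40, 3.01), "SparkTTS": (2.23, 2.22),
    "VibeVoice": (3.42, 3.06), "ZipVoice": (2.11, 2.05),
}
DIALOGUE = {
    "FireRedTTS-2": (2.52, 2.65), "MoonCast": (2.42, 2.54),
    "MOSS-TTSD": (3.04, 2.86), "SoulX-Podcast": (2.80, 3.15),
    "VibeVoice": (3.09, 2.83), "ZipVoice-Dialog": (2.88, 2.93),
}
SWANVOICE = {"monologue": (3.81, 3.62), "dialogue": (3.62, 3.71)}


def margin_over_best(baselines, ours, idx):
    """Return (best baseline name, our margin over it) for metric column idx."""
    name = max(baselines, key=lambda n: baselines[n][idx])
    return name, round(ours[idx] - baselines[name][idx], 2)


for label, table in (("monologue", MONOLOGUE), ("dialogue", DIALOGUE)):
    for idx, metric in enumerate(("Richness", "Hierarchy")):
        print(label, metric, margin_over_best(table, SWANVOICE[label], idx))
```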
Theoretical and Practical Implications
- Holistic Dialogue Modeling: SwanVoice demonstrates the importance of treating long-form dialogue as a full-context generation problem, not a sequence of isolated turns. This approach is crucial for maintaining acoustic consistency, speaker separability, and affective continuity.
- Data-Centric Approach: The success of SwanVoice is heavily attributed to the SwanData-Speech pipeline. It highlights that high-quality, pause-aware, and expressively filtered data is a prerequisite for training models capable of natural long-form synthesis. The pipeline directly addresses failure modes (wrong prosody, speaker split errors) that become audible in long speech.
- Curriculum & Post-Training: The three-stage curriculum (monologue → mixed → real dialogue) effectively balances the acquisition of dialogue skills with the preservation of monologue quality. The subsequent DiffusionNFT post-training shows that reward-driven optimization can effectively target specific shortcomings (pronunciation, speaker similarity) without degrading overall quality.
- Expressiveness Benchmarking: The paper introduces a robust protocol for evaluating expressiveness using an MLLM-as-a-judge (Gemini-3-Pro) for both sentence-level richness and paragraph-level hierarchy. SwanVoice's top scores in these metrics validate its design goals.
- Practical Deployment: The model supports 1–4 speakers, raw text input, and pinyin hints for pronunciation control, making it suitable for practical applications like podcast generation, audiobooks, and interactive dramas.
Conclusion
SwanVoice presents a comprehensive solution for expressive long-form zero-shot TTS for both monologue and dialogue. By combining a data construction pipeline (SwanData-Speech) with a NAR flow-matching model (SwanVoice) trained via curriculum and reinforcement learning, the system achieves the highest expressiveness scores among the evaluated open-source models on SwanBench-Speech.
Main Takeaways:
- Expressive long-form synthesis requires modeling full conversations, not just turns.
- High-quality, pause-aware, and expressively labeled data is foundational.
- A curriculum from monologue to dialogue, combined with reward-based post-training, effectively balances multiple objectives.
Limitations & Future Work:
- Content accuracy remains the main limitation, with CER/WER higher than some baselines.
- Speaker switching can fail for acoustically close voices or short prompts.
- Future work should focus on: improving pronunciation control, refining alignment and pause modeling, and developing more robust speaker-turn