SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Summary (Overview)

Problem Focus: Addresses the challenge of generating expressive, long-form, multi-speaker dialogue in zero-shot TTS, moving beyond single-turn synthesis. The common workaround of stitching single-speaker outputs breaks acoustic consistency, conversational coherence, and affective continuity.
Core Contributions: Introduces two main components:
1. SwanData-Speech: A comprehensive data processing pipeline for constructing high-quality monologue and dialogue training corpora from in-the-wild audio, featuring pause-aware alignment and pronunciation-hard synthetic data.
2. SwanVoice: A zero-shot TTS model for 1–4 speakers that combines a 25 Hz VAE, raw-text conditioning, and a flow-matching DiT with speaker-turn conditioning, trained via a curriculum and post-trained with DiffusionNFT.
Key Findings: On the SwanBench-Speech benchmark, SwanVoice achieves the highest expressive richness and expressive hierarchy scores among the evaluated open-source baselines for both monologue (3.81 and 3.62) and dialogue (3.62 and 3.71) synthesis.
Main Limitation: While the model excels in expressiveness and is competitive on acoustic metrics, content accuracy remains its weaker point: CER/WER is higher than that of the top baselines, most visibly in monologue synthesis (0.172 against a baseline average of 0.12).
Architectural Choice: Employs a non-autoregressive (NAR), flow-matching generative model, chosen over autoregressive designs to avoid sequential latency and exposure-bias failures (e.g., word skipping/repetition) in long dialogue generation.

Introduction and Theoretical Foundation

Recent zero-shot TTS has advanced single-speaker synthesis, but applications like podcasts, dramas, and chatbots require modeling full multi-party conversations as a single generation problem. Synthesizing turns separately and concatenating them leads to inconsistencies in room acoustics, background noise, speaking intensity, and pause timing, making the output sound artificially assembled.

Dialogue-capable TTS models must therefore:

Maintain a stable acoustic environment.
Keep speaker turns separable, even for similar voices.
Preserve affective continuity across turns.
Not degrade monologue synthesis quality when trained on dialogue data.

These challenges are tightly coupled with data construction (turn boundaries, pauses, labels) and model architecture. While some dialogue TTS models use autoregressive (AR) designs, they suffer from sequential latency and exposure bias in long contexts. This paper argues that non-autoregressive (NAR) generative modeling is better suited, as it reduces latency and conditions on the full text and speaker-turn sequence simultaneously.

The paper identifies two central bottlenecks:

Dialogue Data Requirements: Need speaker-consistent segments, pause-aware transcripts (not just semantic punctuation), quality filtering, and sufficient non-neutral speech for affective variation.
Training Stability: Dialogue training should not erase monologue ability. Fine-tuning a monologue model on dialogue data often improves turn control but can weaken monologue quality and cause pronunciation drift.

Methodology

The methodology is built on two pillars: the data pipeline (SwanData-Speech) and the synthesis model (SwanVoice).

1. Data Processing Pipeline: SwanData-Speech

This pipeline processes ~2.59M hours of in-the-wild audio (podcasts, dramas, etc.) into monologue and dialogue subsets.

Key Components:

Sources: ~2.24M hours Chinese, ~0.35M hours English internal resources, plus open-source datasets.
RobustMegaTTS3: A pronunciation-hard synthetic subset. An LLM generates example sentences for rare words/polyphonic characters. This text is synthesized using MegaTTS 3 (a phoneme-based model) to provide dictionary-level pronunciation knowledge.
Pipeline Stages (Fig. 1):
1. Speech Enhancement & Speaker Diarization: Uses a vocal separator and the 3D-Speaker toolkit for VAD and clustering. Segments are merged (monologue ≤60s, dialogue 2-4 speakers ≤120s).
2. Transcription & Alignment: ASR via SenseVoice-Small. Crucially, punctuation is corrected using a forced aligner to match acoustic pauses, not just semantics.
  - Pause < 0.08s: Ignored.
  - Pause 0.08s–0.18s: Insert <|sp|> token.
  - Pause 0.18s–0.45s: Insert comma.
  - Pause > 0.45s: Insert period/exclamation/question mark.
3. Data Filtering: Uses non-intrusive metrics (DNSMOS, PESQ, STOI) for quality. Emotion2vec+ classifies emotion to create a high-expressiveness subset.
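
The pause thresholds in stage 2 can be sketched as a small mapping function. This is a minimal illustration, not the authors' implementation: the function names, the handling of values that fall exactly on a boundary (0.18 s is treated as a comma, 0.45 s as a comma), and the default sentence-final mark are assumptions.

```python
def pause_to_mark(pause_s, sentence_final="."):
    """Map an aligned pause length (seconds) to the mark inserted at that position."""
    if pause_s < 0.08:
        return ""              # too short to mark
    if pause_s < 0.18:
        return "<|sp|>"        # short pause token
    if pause_s <= 0.45:
        return ","             # comma-level pause
    return sentence_final      # long pause: ".", "!" or "?"


def repunctuate(words, pauses_s, sentence_final="."):
    """Interleave words with pause-derived marks.

    words[i] is followed by a pause of pauses_s[i] seconds, as measured
    by a forced aligner (hypothetical input format).
    """
    out = []
    for word, pause in zip(words, pauses_s):
        out.append(word)
        mark = pause_to_mark(pause, sentence_final)
        if mark:
            out.append(mark)
    return out


tokens = repunctuate(["well", "I", "think", "so"], [0.10, 0.02, 0.30, 0.60])
print(tokens)
```

Which sentence-final mark to use for long pauses is not specified in the summary; one plausible choice is to keep the ASR's own terminal mark and pass it in as `sentence_final`.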

2. Synthesis Model: SwanVoice

An overview of the training and inference procedure is shown in Figure 2.

a) VAE (Variational Autoencoder)

Compresses the waveform to a latent representation $z$ at 25 Hz (25 frames/sec). The encoder $E$ downsamples by factor $d$, and the decoder $D$ reconstructs the waveform $\hat{s} = D(E(s))$.
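
To make the 25 Hz rate concrete, the arithmetic below converts clip durations into latent sequence lengths and derives the downsampling factor $d$ for a given waveform sample rate. The 24 kHz sample rate is an assumption used only for illustration; the summary does not state the actual rate.

```python
LATENT_RATE_HZ = 25  # latent frame rate stated in the text


def latent_frames(duration_s):
    """Number of latent frames for a clip of the given duration."""
    return round(duration_s * LATENT_RATE_HZ)


def downsample_factor(sample_rate_hz):
    """Waveform samples per latent frame (the factor d)."""
    if sample_rate_hz % LATENT_RATE_HZ:
        raise ValueError("sample rate must be a multiple of the latent rate")
    return sample_rate_hz // LATENT_RATE_HZ


print(latent_frames(60))         # longest monologue segment (60 s)
print(latent_frames(120))        # longest dialogue segment (120 s)
print(downsample_factor(24000))  # assumed 24 kHz audio
```

Under these figures, a maximum-length dialogue segment corresponds to 3000 latent frames that the DiT has to model in one pass.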

Training Objective:

$$L = L_{rec} + L_{KL} + L_{Adv}$$

where $L_{rec} = \|\Phi(s) - \Phi(\hat{s})\|_2^2$ is a spectrogram reconstruction loss, $L_{KL}$ is a light KL regularizer, and $L_{Adv}$ is an LSGAN-style adversarial loss using MPD, MSD, and MRD discriminators.
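
A minimal sketch of how the three terms combine, written with plain Python lists in place of tensors. The helper names, the flattened-spectrogram input, the diagonal-Gaussian posterior parameterised by `mu` and `log_var`, and the weights `w_kl` and `w_adv` are assumptions; the summary gives only the unweighted sum and describes the KL term as light.

```python
import math


def rec_loss(spec_ref, spec_hat):
    """Squared L2 distance between two flattened spectrograms."""
    return sum((a - b) ** 2 for a, b in zip(spec_ref, spec_hat))


def kl_loss(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def lsgan_generator_loss(disc_scores_fake):
    """LSGAN generator term: push discriminator scores on reconstructions toward 1."""
    return sum((d - 1.0) ** 2 for d in disc_scores_fake) / len(disc_scores_fake)


def vae_loss(spec_ref, spec_hat, mu, log_var, disc_scores_fake, w_kl=1.0, w_adv=1.0):
    """Weighted sum of the three terms; weights of 1.0 reproduce the plain sum."""
    return (rec_loss(spec_ref, spec_hat)
            + w_kl * kl_loss(mu, log_var)
            + w_adv * lsgan_generator_loss(disc_scores_fake))


print(vae_loss([1.0, 2.0], [1.0, 4.0], [0.0], [0.0], [0.0]))
```

In a real system `disc_scores_fake` would collect the outputs of the MPD, MSD, and MRD discriminators on the reconstructed waveform.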

b) Tokenizer & Conditioning

Uses the CosyVoice tokenizer on raw text (no separate G2P).
Adds a dedicated pause token <|sp|>.
Augments vocabulary with 1,549 pinyin syllables. During training, Chinese characters are randomly replaced with pinyin to improve pronunciation robustness.
Speaker-turn conditioning: Text is wrapped with tags <S{id}> and </S{id}>. A speaker label sequence of the same length as the text tokens is constructed, indicating the speaker ID for each token.
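
The speaker-turn conditioning described above can be sketched as follows. Whitespace splitting stands in for the CosyVoice tokenizer, and assigning each tag token its own speaker's ID is an assumption; the summary only states that the label sequence has the same length as the text tokens.

```python
def build_turn_inputs(turns, tokenize=str.split):
    """Build parallel token and speaker-label sequences.

    turns: list of (speaker_id, text) pairs in conversation order.
    tokenize: stand-in tokenizer (hypothetical; the real model uses the
    CosyVoice tokenizer on raw text).
    """
    tokens, labels = [], []
    for speaker_id, text in turns:
        piece = [f"<S{speaker_id}>"] + tokenize(text) + [f"</S{speaker_id}>"]
        tokens.extend(piece)
        # One speaker label per token, so both sequences stay aligned.
        labels.extend([speaker_id] * len(piece))
    return tokens, labels


tokens, labels = build_turn_inputs([(1, "hi there"), (2, "hello")])
print(tokens)
print(labels)
```

Keeping a per-token label sequence, rather than relying on the tags alone, gives the model an explicit speaker signal at every text position.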

c) Flow-based Transformer (DiT)

A Diffusion Transformer (DiT) estimates the velocity field between noise and the clean target latent $z^\star$.

Flow-Matching Loss:

$$L_{flow} = \mathbb{E}_{t \sim \mathcal{U}(0,1),\, z^\star \sim p_{data},\, \epsilon \sim \mathcal{N}(0, I)} \left[ \| u_\theta(z_t, t, c) - (z^\star - \epsilon) \|_2^2 \right]$$

where the interpolated latent is:

$$z_t = (1 - t)\epsilon + t z^\star$$

and $c$ denotes the full conditioning (text tokens, speaker-turn embeddings, reference speech latent).
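
The regression target $z^\star - \epsilon$ is the time derivative of $z_t$, which is why the network is trained to predict it. The sketch below computes a one-sample estimate of the loss with plain Python lists; `model` is a hypothetical callable standing in for $u_\theta$, and the oracle at the end exists only to check that a perfect velocity predictor yields a loss near zero.

```python
import random


def interpolate(eps, z_star, t):
    """z_t = (1 - t) * eps + t * z_star, elementwise."""
    return [(1.0 - t) * e + t * z for e, z in zip(eps, z_star)]


def target_velocity(eps, z_star):
    """d z_t / dt = z_star - eps, the regression target."""
    return [z - e for e, z in zip(eps, z_star)]


def flow_matching_loss(model, z_star, cond, rng):
    """Single-sample Monte Carlo estimate of the flow-matching loss."""
    t = rng.random()                              # t ~ U(0, 1)
    eps = [rng.gauss(0.0, 1.0) for _ in z_star]   # eps ~ N(0, I)
    z_t = interpolate(eps, z_star, t)
    pred = model(z_t, t, cond)
    target = target_velocity(eps, z_star)
    return sum((p - q) ** 2 for p, q in zip(pred, target))


z_star = [0.5, -1.0, 2.0]
# Oracle that knows z_star: solving z_t for eps gives v = (z_star - z_t) / (1 - t).
oracle = lambda z_t, t, c: [(zs - z) / (1.0 - t) for zs, z in zip(z_star, z_t)]
print(flow_matching_loss(oracle, z_star, None, random.Random(0)))
```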

The model uses RMSNorm and AdaLN-based global adapters for stability. Text/turn conditions are processed by a lightweight Transformer stack before interacting with the speech latent, improving in-context conditioning.

d) Three-Stage Curriculum Learning

Monologue Pretraining: Train from scratch on ~2M hours of monologue speech + RobustMegaTTS3 synthetic data. Establishes basic synthesis and alignment.
Mixed Conversational Training: Continue training the pretrained model on monologue data + concatenated 2–4-speaker data. Learns speaker switching from clean turn boundaries, without exposure to the labeling errors of real dialogue data.
Supervised Fine-Tuning (SFT): Train on monologue data + real 2–4-speaker conversational data (from movies, TV, podcasts). Learns higher-level dialogue consistency and affective variation.
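
One way to picture the concatenated data used in the second stage is to interleave monologue utterances from several speakers into a pseudo-dialogue. The sketch below is an assumption about how such samples could be built, not the authors' recipe: the input format, the rule that consecutive turns come from different speakers, and the reuse of the 120 s dialogue cap from the data pipeline are all illustrative choices.

```python
import random


def make_pseudo_dialogue(monologues, rng, max_total_s=120.0):
    """Interleave monologue utterances into a 2-4 speaker pseudo-dialogue.

    monologues: dict mapping speaker name -> list of (text, duration_s).
    Returns a list of (speaker_id, text, duration_s) with IDs starting at 1.
    """
    if len(monologues) < 2:
        raise ValueError("need at least two speakers")
    n_speakers = rng.randint(2, min(4, len(monologues)))
    speakers = rng.sample(sorted(monologues), n_speakers)
    queues = {s: list(monologues[s]) for s in speakers}
    turns, total, prev = [], 0.0, None
    while True:
        # Only speakers with utterances left, and never the same speaker twice in a row.
        candidates = [s for s in speakers if queues[s] and s != prev]
        if not candidates:
            break
        spk = rng.choice(candidates)
        text, dur = queues[spk].pop(0)
        if total + dur > max_total_s:
            break
        turns.append((speakers.index(spk) + 1, text, dur))
        total += dur
        prev = spk
    return turns


data = {name: [(f"{name}{i}", 10.0) for i in range(5)] for name in ("a", "b", "c")}
for turn in make_pseudo_dialogue(data, random.Random(0)):
    print(turn)
```

Because the turn boundaries are known by construction, the speaker labels in such samples are exact, which is the property the second stage relies on.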

e) Post-Training with DiffusionNFT

After supervised training, the model is fine-tuned using DiffusionNFT, an online RL method, to optimize for pronunciation and speaker similarity rewards.

Reward Models:

Phone Consistency Reward ($r_{phone}$): Based on phone-level WER between generated speech and target text, where $S$, $D$, and $I$ count substitutions, deletions, and insertions. $\text{WER}(u_{ref}, u_{hyp}) = \frac{S + D + I}{\max(1, |u_{ref}|)}, \quad r_{phone} = \exp\left(-\text{WER}(u_{ref}, u_{hyp})\right)$
Speaker Similarity Reward ($r_{sim}$): Cosine similarity in a speaker embedding space. $r_{sim}(\hat{x}, x_{ref}) = \cos\left(f_{spk}(\hat{x}), f_{spk}(x_{ref})\right)$
Aggregate Reward: $r = \frac{1}{2}(r_{phone} + r_{sim})$
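
The three reward formulas can be sketched directly. The edit distance below is the standard Levenshtein distance over phone sequences; how phones are obtained from the generated audio, and the speaker embedding model $f_{spk}$, are outside this sketch, so the embeddings are passed in as plain lists (hypothetical inputs).

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution or match
        prev = cur
    return prev[-1]


def phone_wer(ref, hyp):
    """(S + D + I) / max(1, |ref|) over phone sequences."""
    return edit_distance(ref, hyp) / max(1, len(ref))


def r_phone(ref, hyp):
    return math.exp(-phone_wer(ref, hyp))


def r_sim(emb_gen, emb_ref):
    """Cosine similarity between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_gen, emb_ref))
    norm = math.sqrt(sum(a * a for a in emb_gen)) * math.sqrt(sum(b * b for b in emb_ref))
    return dot / norm


def aggregate_reward(ref_phones, hyp_phones, emb_gen, emb_ref):
    return 0.5 * (r_phone(ref_phones, hyp_phones) + r_sim(emb_gen, emb_ref))


print(aggregate_reward(["n", "i", "h", "ao"], ["n", "i", "h"], [1.0, 0.0], [1.0, 0.0]))
```

Since $r_{phone}$ lies in $(0, 1]$ and cosine similarity in $[-1, 1]$, the equal-weight average keeps both signals on a comparable scale.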

DiffusionNFT Objective: The policy is optimized using a loss that combines a reward-weighted term and a reference-policy regularizer:

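$$L = L_{NFT} + \lambda_{ref} L_{ref}, \quad L_{ref} = \mathbb{E}\left[ \| v_\theta - \text{sg}(v_{ref}) \|_2^2 \right]$$

where $L_{NFT}$ is the reward-weighted term, $v_{ref}$ is the velocity predicted by the reference policy, $\text{sg}(\cdot)$ denotes stop-gradient, and $\lambda_{ref}$ weights the regularizer.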

f) Inference Procedure

Takes a reference speech segment and target text as input.
Uses a duration model for target length estimation.
Employs sway sampling to improve alignment.
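Introduces staircase classifier-free guidance (CFG) to separately control text content and reference speaker/style influence: $\tilde{v}_t = v_\emptyset + \omega_{text}(v_{text} - v_\emptyset) + \omega_{ref}(v_{full} - v_{text})$ where $\omega_{text}$ and $\omega_{ref}$ are guidance scales. Read from the formula, $v_\emptyset$ is the unconditional velocity estimate, $v_{text}$ is conditioned on text only, and $v_{full}$ is conditioned on both text and reference speech.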
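
The staircase guidance rule is a two-step extrapolation: first from the unconditional estimate toward the text-conditioned one, then from the text-conditioned estimate toward the fully conditioned one. A minimal sketch with plain lists follows; the three velocity estimates would come from three forward passes of the model, and the guidance scale values used here are hypothetical.

```python
def staircase_cfg(v_null, v_text, v_full, w_text, w_ref):
    """Combine three velocity estimates with separate text and reference scales."""
    return [n + w_text * (t - n) + w_ref * (f - t)
            for n, t, f in zip(v_null, v_text, v_full)]


v_null, v_text, v_full = [0.0, 1.0], [1.0, 1.0], [3.0, 0.0]

# With both scales at 1 the steps telescope to the fully conditioned estimate.
print(staircase_cfg(v_null, v_text, v_full, 1.0, 1.0))

# Larger scales push further along each step (values chosen for illustration).
print(staircase_cfg(v_null, v_text, v_full, 2.0, 1.5))
```

Setting $\omega_{ref} = 0$ drops the reference step entirely, which makes the separation between content guidance and speaker/style guidance explicit.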

Empirical Validation / Results

Models are evaluated on the SwanBench-Speech benchmark across three axes: Acoustics, Semantics, and Expressiveness.

Evaluation Metrics:

Acoustics: Timbre Consistency (↑), Reverb Consistency (↓ std of SRMR), Sound Fidelity (↑ PESQ).
Semantics: Content Error (↓ CER/WER), Prosodic Coherence (↑ 1-5 scale via SpeechJudge).
Expressiveness: Richness (mean expressiveness score per 10s chunk) and Hierarchy (overall score for emotional variation, dynamics, scene appropriateness). Both scored 1-5 by Gemini-3-Pro as an MLLM judge.
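
Two of these metrics are simple aggregates over per-segment values, which the sketch below illustrates. The per-chunk scores and SRMR values are assumed to be computed elsewhere (by the MLLM judge and an SRMR estimator respectively); the use of the population standard deviation and the handling of a final short chunk are assumptions.

```python
import statistics


def chunk_bounds(duration_s, chunk_s=10.0):
    """Split a clip into consecutive chunks of at most chunk_s seconds."""
    bounds, start = [], 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return bounds


def richness(chunk_scores):
    """Mean per-chunk expressiveness score (1-5 scale)."""
    return statistics.fmean(chunk_scores)


def reverb_consistency(srmr_values):
    """Standard deviation of SRMR across segments; lower means more stable reverb."""
    return statistics.pstdev(srmr_values)


print(chunk_bounds(25.0))
print(richness([3, 4, 5]))
print(reverb_consistency([5.0, 5.0, 5.0]))
```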

1. Zero-Shot Monologue TTS

Table 1: Evaluation results of long-form TTS models across multi-dimensional metrics.

Model	Timbre(↑)	Reverb(↓)	Sound Fidelity(↑)	Content Error(↓)	Prosody(↑)	Richness(↑)	Hierarchy(↑)
Open-Source Models
CosyVoice-2	0.93	2.37	3.58	0.106	2.81	2.02	2.59
CosyVoice-3	0.93	2.73	3.80	0.077	3.26	2.64	2.47
FishSpeech	0.93	2.00	4.09	0.066	3.77	2.37	2.90
F5TTS	0.92	2.12	2.60	0.085	2.87	2.77	2.97
GLM-TTS	0.94	1.64	3.90	0.074	3.28	1.57	2.39
IndexTTS-2	0.93	1.77	2.78	0.077	3.63	3.32	2.94
MegaTTS-3	0.93	2.07	3.52	0.072	3.22	2.40	3.01
SparkTTS	0.92	2.04	3.53	0.314	2.35	2.23	2.22
VibeVoice	0.92	2.45	3.47	0.092	3.75	3.42	3.06
ZipVoice	0.89	2.10	3.53	0.213	2.97	2.11	2.05
Average	0.92	2.13	3.48	0.12	3.19	2.49	2.66
SwanVoice	0.93	2.06	3.60	0.172	3.56	3.81	3.62

Key Result: SwanVoice achieves the highest Richness (3.81) and Hierarchy (3.62) scores, outperforming the strongest baseline on both metrics (VibeVoice, 3.42 and 3.06) by 0.39 and 0.56 points, respectively. Its acoustic and prosodic scores are competitive, but its content error (0.172) is above the baseline average (0.12) and higher than that of every baseline except ZipVoice (0.213) and SparkTTS (0.314).
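
The margins quoted for Table 1 can be reproduced from the table itself. The snippet below copies the Richness and Hierarchy columns and computes the gap to the best baseline on each; the data structure and function name are illustrative only.

```python
# (Richness, Hierarchy) per baseline, copied from Table 1.
TABLE1_BASELINES = {
    "CosyVoice-2": (2.02, 2.59), "CosyVoice-3": (2.64, 2.47),
    "FishSpeech": (2.37, 2.90), "F5TTS": (2.77, 2.97),
    "GLM-TTS": (1.57, 2.39), "IndexTTS-2": (3.32, 2.94),
    "MegaTTS-3": (2.40, 3.01), "SparkTTS": (2.23, 2.22),
    "VibeVoice": (3.42, 3.06), "ZipVoice": (2.11, 2.05),
}
SWANVOICE = (3.81, 3.62)


def margin_over_best(baselines, ours, col):
    """Return the best baseline on a column and our margin over it."""
    best = max(baselines, key=lambda name: baselines[name][col])
    return best, round(ours[col] - baselines[best][col], 2)


print(margin_over_best(TABLE1_BASELINES, SWANVOICE, 0))  # Richness
print(margin_over_best(TABLE1_BASELINES, SWANVOICE, 1))  # Hierarchy
```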

2. Zero-Shot Dialogue TTS

Table 2: Results of dialogue generation models across SwanBench-Speech metrics.

Model	Timbre(↑)	Reverb(↓)	Sound Fidelity(↑)	Content Error(↓)	Prosody(↑)	Richness(↑)	Hierarchy(↑)
Open-Source Models
FireRedTTS-2	0.91	3.54	2.54	0.148	2.93	2.52	2.65
MoonCast	0.90	3.29	2.60	0.284	2.93	2.42	2.54
MOSS-TTSD	0.89	3.52	2.83	0.227	2.57	3.04	2.86
SoulX-Podcast	0.92	3.23	3.98	0.101	3.89	2.80	3.15
VibeVoice	0.89	2.09	2.75	0.204	3.00	3.09	2.83
ZipVoice-Dialog	0.90	3.49	2.48	0.116	3.46	2.88	2.93
Average	0.90	3.19	2.86	0.180	3.13	2.79	2.83
SwanVoice	0.92	3.02	3.77	0.145	3.70	3.62	3.71

Key Result: SwanVoice again achieves the highest expressiveness scores (3.62 Richness, 3.71 Hierarchy), outperforming the best baseline on each metric by 0.53 (VibeVoice, 3.09) and 0.56 (SoulX-Podcast, 3.15) points. It ranks second to SoulX-Podcast in Sound Fidelity (3.77 vs. 3.98) and Prosody (3.70 vs. 3.89), and its content error (0.145) is below the baseline average (0.180).

Theoretical and Practical Implications

Holistic Dialogue Modeling: SwanVoice demonstrates the importance of treating long-form dialogue as a full-context generation problem, not a sequence of isolated turns. This approach is crucial for maintaining acoustic consistency, speaker separability, and affective continuity.
Data-Centric Approach: The success of SwanVoice is heavily attributed to the SwanData-Speech pipeline. It highlights that high-quality, pause-aware, and expressively filtered data is a prerequisite for training models capable of natural long-form synthesis. The pipeline directly addresses failure modes (wrong prosody, speaker split errors) that become audible in long speech.
Curriculum & Post-Training: The three-stage curriculum (monologue → mixed → real dialogue) effectively balances the acquisition of dialogue skills with the preservation of monologue quality. The subsequent DiffusionNFT post-training shows that reward-driven optimization can effectively target specific shortcomings (pronunciation, speaker similarity) without degrading overall quality.
Expressiveness Benchmarking: The paper introduces a robust protocol for evaluating expressiveness using an MLLM-as-a-judge (Gemini-3-Pro) for both sentence-level richness and paragraph-level hierarchy. SwanVoice's top scores in these metrics validate its design goals.
Practical Deployment: The model supports 1–4 speakers, raw text input, and pinyin hints for pronunciation control, making it suitable for practical applications like podcast generation, audiobooks, and interactive dramas.

Conclusion

SwanVoice presents a comprehensive solution for expressive long-form zero-shot TTS for both monologue and dialogue. By combining a data construction pipeline (SwanData-Speech) with a NAR flow-matching model (SwanVoice) trained via curriculum and reinforcement learning, the system achieves the highest expressiveness scores among the evaluated open-source baselines on SwanBench-Speech.

Main Takeaways:

Expressive long-form synthesis requires modeling full conversations, not just turns.
High-quality, pause-aware, and expressively labeled data is foundational.
A curriculum from monologue to dialogue, combined with reward-based post-training, effectively balances multiple objectives.

Limitations & Future Work:

Content accuracy remains the main limitation, with CER/WER higher than some baselines.
Speaker switching can fail for acoustically close voices or short prompts.
Future work should focus on: improving pronunciation control, refining alignment and pause modeling, and developing **more robust speaker-turn