StableForm-TTS:

Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation

Changjin Han, Seokgi Lee, Gyuhyeon Nam, Gyeongsu Chae

DeepBrain AI Inc.


Abstract. Diffusion models have achieved remarkable success in text-to-speech (TTS), even in zero-shot scenarios. Recent efforts aim to address the trade-off between inference speed and sound quality, which is often considered the primary drawback of diffusion models. However, we find that a critical mispronunciation issue is being overlooked. Our preliminary study reveals that the diffusion process produces unstable pronunciation. Based on this observation, we introduce StableForm-TTS, a novel zero-shot speech synthesis framework designed to produce robust pronunciation while maintaining the advantages of diffusion modeling. By pioneering the adoption of source-filter theory in diffusion TTS, we propose an elaborate architecture for stable formant generation. Experimental results on unseen speakers show that our model outperforms the state-of-the-art method in pronunciation accuracy and naturalness, with comparable speaker similarity. Moreover, our model scales effectively as both data and model sizes increase.

Model Overview

Figure: Overall architecture of StableForm-TTS. For brevity, the phoneme, pitch, and energy embedding layers are omitted.

Source-Filter Decomposition

StableForm-TTS synthesizes audio through separate excitation and formant pathways and combines them to produce the final output. Below are inference samples for one seen and one unseen speaker.

Columns: Speaker, Text, Reference, Ground Truth, Excitation (𝜇), Refined Excitation, Formant, Final Output

LibriTTS-R 1183 (seen): The best physicians in America could do nothing for him.
VCTK 238 (unseen): Five years later, the deception continued.
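
The source-filter idea behind the excitation and formant pathways can be illustrated with a small signal-processing sketch. This is a textbook example, not the StableForm-TTS architecture: the sample rate, the 133-sample pitch period, and the single 700 Hz resonance are arbitrary assumptions, and `resonator_impulse_response` is a hypothetical helper written for this illustration. It shows that when a source (excitation) is convolved with a filter (formant resonance), the magnitude spectrum of the result is the product of the two magnitude spectra, so the two components add in the log-spectral domain.

```python
import numpy as np

SR = 16000      # sample rate in Hz (assumed for illustration)
N_FFT = 2048    # FFT size, at least as long as the convolved signal

def resonator_impulse_response(freq_hz, bandwidth_hz, sr, length):
    """Impulse response of a two-pole resonator, a toy stand-in for one formant."""
    r = np.exp(-np.pi * bandwidth_hz / sr)   # pole radius from bandwidth
    theta = 2.0 * np.pi * freq_hz / sr       # pole angle from centre frequency
    n = np.arange(length)
    return r ** n * np.sin((n + 1) * theta) / np.sin(theta)

# Source: impulse train with a 133-sample period (about 120 Hz at 16 kHz).
excitation = np.zeros(1024)
excitation[::133] = 1.0

# Filter: a single resonance near 700 Hz with a 100 Hz bandwidth.
formant_filter = resonator_impulse_response(700.0, 100.0, SR, 512)

# Time domain: the output is the source convolved with the filter.
speech = np.convolve(excitation, formant_filter)

# Frequency domain: the magnitude spectra multiply, so their logs add.
mag_speech = np.abs(np.fft.rfft(speech, N_FFT))
mag_source = np.abs(np.fft.rfft(excitation, N_FFT))
mag_filter = np.abs(np.fft.rfft(formant_filter, N_FFT))

max_error = np.max(np.abs(mag_speech - mag_source * mag_filter))
peak_hz = np.argmax(mag_filter) * SR / N_FFT
print(f"max |S - E*H| = {max_error:.2e}, filter peak near {peak_hz:.0f} Hz")
```

In this toy setting the source carries the pitch harmonics and the filter carries the resonance that determines which sound is heard, which is the intuition for modeling the two separately.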

Visual Comparison

Examples of pronunciation improvement in zero-shot scenarios. The text beneath each mel-spectrogram is the ASR model's transcription of the region enclosed by the red dotted box. All samples use the ML solver with 10 steps on models trained on LibriTTS-R.

Columns: Speaker, Text, Ground Truth, Grad-StyleSpeech, StableForm-TTS (ours)

245: We think all other measures are not exhausted.
261: Corporate banking would be based in Edinburgh.
238: People have been wonderful beyond belief.
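
One common way to turn an ASR-based pronunciation check into a number is the word error rate between the input text and the ASR transcription. The sketch below is a minimal illustration under that assumption; this page does not state which ASR model or scoring procedure was used, `normalize` and `word_error_rate` are hypothetical helpers, and the transcription in the usage example is invented for illustration, not an actual model output.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Invented transcription with one substituted word (not a real ASR result).
target = "People have been wonderful beyond belief."
made_up_transcript = "people have been wonderful beyond relief"
print(round(word_error_rate(target, made_up_transcript), 3))
```

A lower value means the transcription matches the input text more closely; a value of 0 means every word was recognized as written.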

Zero-Shot Speech Synthesis

* LT-460: trained on LibriTTS train-clean-460; LT-R: trained on LibriTTS-R.

Columns: Speaker, Text, Reference, Ground Truth, Ground Truth (voc.), Grad-StyleSpeech (LT-460), StableForm-TTS (LT-460), Grad-StyleSpeech (LT-R), StableForm-TTS (LT-R)

For each of the four model columns, every sentence has eight samples: the PF and ML solvers, each with 5, 10, 50, and 100 steps.

238: The early physical reports were clear.
248: Naturally enough, the letter in question was an E.
302: Of course we make mistakes, but we don't make too many.
225: I always felt that I was in control of the match.
347: Some have accepted it as a miracle without physical explanation.

Ablation Study

The following ablation samples were generated with the ML solver using 10 steps, from models trained on LibriTTS-R.

Columns: Speaker, Text, Reference, Ground Truth, StableForm-TTS (ours), w/o E-F generators, w/o Energy

302: The difference in the rainbow depends considerably upon the size of the drops, and the width of the colored band increases as the size of the drop increases.
225: The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.
234: Yet the performance was not entirely convincing.
248: It's also important that they are not seen as a soft option.

Scalability Test

To examine the scalability of StableForm-TTS, we expand the English training data to 19,000 hours by combining LibriTTS-R, LJSpeech, DailyTalk, HiFi-TTS, Common Voice, and MLS, and we double the model size to create StableForm-large (69.30M parameters). We also create GSS-large (68.13M parameters) to verify the effect of the same scale-up on Grad-StyleSpeech. Below we provide samples comparing StableForm-large with five publicly available large-scale TTS models. For Bark, Tortoise, XTTS-v2, and YourTTS we use the pretrained checkpoints from the Coqui TTS toolkit; for VoiceCraft we use the official code.

Columns: Speaker, Text, Reference, Ground Truth, StableForm-large (ours), GSS-large, Bark, Tortoise, VoiceCraft, XTTS-v2, YourTTS

261: He put some colour into Scottish history.
302: We're just a family working hard, working seven days a week.
335: Clearly, she says, she would never resort to such devices.
326: Another suggested the company should carry only pedestrians.
294: This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.

Citation

To cite our work, please use:

	@article{han2024improving,
		title={Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation},
		author={Han, Changjin and Lee, Seokgi and Nam, Gyuhyeon and Chae, Gyeongsu},
		journal={arXiv preprint arXiv:2409.09311},
		year={2024}
	}