StableForm-TTS:
Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation
Abstract. Diffusion models have achieved remarkable success in text-to-speech (TTS), even in zero-shot scenarios. Recent efforts aim to address the trade-off between inference speed and sound quality, which is often considered the primary drawback of diffusion models. However, we find that a critical mispronunciation issue is being overlooked: our preliminary study reveals unstable pronunciation resulting from the diffusion process. Based on this observation, we introduce StableForm-TTS, a novel zero-shot speech synthesis framework designed to produce robust pronunciation while maintaining the advantages of diffusion modeling. By pioneering the adoption of source-filter theory in diffusion-based TTS, we propose an architecture for stable formant generation. Experimental results on unseen speakers show that our model outperforms the state-of-the-art method in pronunciation accuracy and naturalness, with comparable speaker similarity. Moreover, our model scales effectively as both data and model sizes increase.
Model Overview
Overall architecture of StableForm-TTS. For brevity, the phoneme, pitch, and energy embedding layers are omitted.
Source-Filter Decomposition
StableForm-TTS synthesizes audio through separate excitation and formant pathways and combines them to produce the final output. Below are inference samples for a seen speaker and an unseen speaker.
Speaker | Text | Reference | Ground Truth | Excitation (𝜇) | Refined Excitation | Formant | Final Output |
---|---|---|---|---|---|---|---|
LibriTTS-R: 1183 (seen) | The best physicians in America could do nothing for him. | ||||||
VCTK: 238 (unseen) | Five years later, the deception continued. |
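For readers unfamiliar with the source-filter view behind the excitation and formant pathways, the following is a minimal sketch of the classical signal-processing formulation: a periodic excitation is shaped by resonant filters placed at formant frequencies. It is not the StableForm-TTS architecture, which operates with learned neural modules. All function names are hypothetical, the formant centers are textbook values for an open vowel, and the bandwidths are illustrative assumptions.

```python
import math

def impulse_train(f0_hz, sample_rate, n_samples):
    """Excitation (source): one unit impulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Formant (filter): two-pole resonator y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, r < 1 keeps it stable
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle sets the resonance
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def synthesize_vowel(f0_hz=120.0,
                     formants=((730.0, 90.0), (1090.0, 110.0)),  # (center Hz, bandwidth Hz), assumed
                     sample_rate=16000,
                     duration_s=0.05):
    """Pass the excitation through a cascade of formant resonators."""
    n_samples = round(sample_rate * duration_s)
    speech = impulse_train(f0_hz, sample_rate, n_samples)
    for center_hz, bandwidth_hz in formants:
        speech = resonator(speech, center_hz, bandwidth_hz, sample_rate)
    return speech

if __name__ == "__main__":
    samples = synthesize_vowel()
    print(len(samples), "samples synthesized")
```

In this toy version the excitation carries pitch and the resonators carry vowel identity, which is the same division of labor the excitation and formant columns in the table above are meant to illustrate.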
Visual Comparison
Examples of pronunciation improvement in zero-shot scenarios. The text beneath each mel-spectrogram shows the ASR model’s transcription of the region enclosed in the red dotted box. We use the ML solver with 10 steps on models trained on LibriTTS-R.
Speaker | Text | Ground Truth | Grad-StyleSpeech | StableForm-TTS (ours) |
---|---|---|---|---|
245 | We think all other measures are not exhausted. | |||
261 | Corporate banking would be based in Edinburgh. | |||
238 | People have been wonderful beyond belief. |
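The comparison between the input text and an ASR transcription can be quantified as a word error rate. The following is a minimal sketch under assumptions: the function names and the text normalization are illustrative, the ASR hypothesis is a made-up string, and this is not necessarily the metric or tooling behind the reported results.

```python
import re

def normalize(text):
    """Lowercase and keep only word-like tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference text contains no words")
    return edit_distance(ref, hyp) / len(ref)

if __name__ == "__main__":
    target = "People have been wonderful beyond belief."
    asr_output = "people have been wonderful beyond relief"  # hypothetical transcription
    # One substituted word out of six reference words.
    print(word_error_rate(target, asr_output))
```

A lower value means the synthesized speech was transcribed closer to the input text, so mispronounced words show up directly as substitutions, insertions, or deletions.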
Zero-Shot Speech Synthesis
* LT-460: trained on LibriTTS train-clean-460, LT-R: trained on LibriTTS-R
Speaker | Text | Reference | Ground Truth | Ground Truth (voc.) | Grad-StyleSpeech (LT-460) | StableForm-TTS (LT-460) | Grad-StyleSpeech (LT-R) | StableForm-TTS (LT-R) |
---|---|---|---|---|---|---|---|---|
238 | The early physical reports were clear. | | | | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 |
248 | Naturally enough, the letter in question was an E. | | | | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 |
302 | Of course we make mistakes, but we don't make too many. | | | | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 |
225 | I always felt that I was in control of the match. | | | | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 |
347 | Some have accepted it as a miracle without physical explanation. | | | | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 | PF 5/10/50/100, ML 5/10/50/100 |
Ablation Study
We showcase samples from the ablation study, generated using the ML solver with 10 steps on models trained on LibriTTS-R.
Speaker | Text | Reference | Ground Truth | StableForm-TTS (ours) | w/o E-F generators | w/o Energy |
---|---|---|---|---|---|---|
302 | The difference in the rainbow depends considerably upon the size of the drops, and the width of the colored band increases as the size of the drop increases. | |||||
225 | The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky. | |||||
234 | Yet the performance was not entirely convincing. | |||||
248 | It's also important that they are not seen as a soft option. |
Scalability Test
To examine the scalability of StableForm-TTS, we increase the English training dataset to 19,000 hours by including LibriTTS-R, LJSpeech, DailyTalk, HiFi-TTS, Common Voice, and MLS, and double the model size to create StableForm-large (69.30M parameters). We also create GSS-large (68.13M parameters) to verify the effect of scaling up Grad-StyleSpeech. We provide several samples comparing StableForm-large with five publicly available large-scale TTS models. We use the pretrained checkpoints for Bark, Tortoise, XTTS-v2, and YourTTS from the Coqui TTS toolkit, and the official code for VoiceCraft.
Speaker | Text | Reference | Ground Truth | StableForm-large (ours) | GSS-large | Bark | Tortoise | VoiceCraft | XTTS-v2 | YourTTS |
---|---|---|---|---|---|---|---|---|---|---|
261 | He put some colour into Scottish history. | |||||||||
302 | We're just a family working hard, working seven days a week. | |||||||||
335 | Clearly, she says, she would never resort to such devices. | |||||||||
326 | Another suggested the company should carry only pedestrians. | |||||||||
294 | This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue. |
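As a rough illustration of how a baseline sample could be produced from a pretrained Coqui TTS checkpoint, the sketch below wraps the toolkit's Python API for two of the compared systems. It rests on several assumptions: the model identifiers are those listed in the Coqui TTS model zoo and may not be the exact checkpoints used for the samples above, the helper names and file paths are hypothetical, and the Coqui TTS package must be installed separately. Bark and Tortoise are omitted because their voice-cloning setup differs.

```python
# Model identifiers from the Coqui TTS model zoo (assumed, not confirmed by this page).
COQUI_MODELS = {
    "XTTS-v2": "tts_models/multilingual/multi-dataset/xtts_v2",
    "YourTTS": "tts_models/multilingual/multi-dataset/your_tts",
}

def build_request(system, text, reference_wav, out_path, language="en"):
    """Collect the arguments for one zero-shot synthesis call (hypothetical helper)."""
    if system not in COQUI_MODELS:
        raise KeyError(f"no Coqui checkpoint configured for {system!r}")
    return {
        "model_name": COQUI_MODELS[system],
        "synthesis": {
            "text": text,
            "speaker_wav": reference_wav,  # reference utterance of the target speaker
            "language": language,
            "file_path": out_path,
        },
    }

def synthesize(system, text, reference_wav, out_path):
    """Run zero-shot synthesis; downloads the checkpoint on first use."""
    from TTS.api import TTS  # imported lazily so the module loads without Coqui TTS installed

    request = build_request(system, text, reference_wav, out_path)
    tts = TTS(model_name=request["model_name"])
    tts.tts_to_file(**request["synthesis"])

if __name__ == "__main__":
    # Hypothetical paths; replace with a real reference recording.
    synthesize("YourTTS", "He put some colour into Scottish history.",
               "reference_261.wav", "yourtts_261.wav")
```

Keeping the request construction separate from the model call makes it easy to run the same text and reference pair through each baseline.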
Citation
If you want to cite our work, please use:
@article{han2024improving,
  title={Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation},
  author={Han, Changjin and Lee, Seokgi and Nam, Gyuhyeon and Chae, Gyeongsu},
  journal={arXiv preprint arXiv:2409.09311},
  year={2024}
}