Abstract
We propose a novel architecture and improved training objectives for non-parallel voice conversion. Our proposed CycleGAN-based model performs a shape-preserving transformation directly on a high-frequency-resolution magnitude spectrogram, converting its style (i.e., speaker identity) while preserving the speech content. Throughout the entire conversion process, the model does not resort to compressed intermediate representations of any sort (e.g., mel spectrogram, low-resolution spectrogram, decomposed network feature). We propose an efficient axial residual block architecture to support this expensive procedure, along with several modifications to the CycleGAN losses that stabilize training. Experiments show that our model outperforms Scyclone and performs comparably to, or better than, CycleGAN-VC2 even without employing a neural vocoder.
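The key idea of the axial factorization can be illustrated with a minimal sketch: instead of filtering the high-resolution spectrogram with full 2D operations, each block filters along the frequency axis and then along the time axis, and adds a skip connection so the transformation stays shape-preserving. The sketch below (plain NumPy, hypothetical function names, and without the normalization, activations, and learned channels of the actual model) only demonstrates this factorization idea — see the paper for the real block design.

```python
import numpy as np

def conv1d_along_axis(x, kernel, axis):
    """'same'-padded 1D convolution applied independently along one axis."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, x)

def axial_residual_block(spec, freq_kernel, time_kernel):
    """Hypothetical sketch of an axial residual block:
    filter along frequency, then along time, then add the skip
    connection so the output keeps the input's shape."""
    h = conv1d_along_axis(spec, freq_kernel, axis=0)  # frequency axis
    h = conv1d_along_axis(h, time_kernel, axis=1)     # time axis
    return spec + h

# Example: a (freq_bins, frames) magnitude spectrogram
spec = np.random.rand(513, 128)
smooth = np.array([0.25, 0.5, 0.25])
out = axial_residual_block(spec, smooth, smooth)
print(out.shape)  # (513, 128) -- shape-preserving
```

The appeal of the axial split is cost: two 1D passes touch k + k weights per position instead of the k × k of a full 2D kernel, which is what makes operating directly on a 513-bin spectrogram affordable.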
Audio Samples
Note: Please refer to the paper for experimental details.
VCTK dataset (English)
Source and Target are the source and target speech samples, provided as references. Note that these data were not used during training.
| (source, target) | Source | Target | Scyclone | CycleGAN-VC2 | Ours |
|---|---|---|---|---|---|
| (p299, p301) | | | | | |
| (p301, p299) | | | | | |
| (p299, p311) | | | | | |
| (p311, p299) | | | | | |
| (p311, p360) | | | | | |
| (p360, p311) | | | | | |
KSS & Internal dataset (Korean)
Target samples are omitted since KSS and our internal dataset are non-parallel.
| (source, target) | Source | Scyclone | CycleGAN-VC2 | Ours |
|---|---|---|---|---|
| (JEY, KSS) | | | | |
| (KSS, JEY) | | | | |
Citation
@misc{you2021axial,
      title={Axial Residual Networks for CycleGAN-based Voice Conversion},
      author={Jaeseong You and Gyuhyeon Nam and Dalhyun Kim and Gyeongsu Chae},
      year={2021},
      eprint={2102.08075},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}