Audio samples from "Parallel Tacotron: Non-Autoregressive and Controllable TTS"

Paper: arXiv

Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron J. Weiss, Yonghui Wu.

Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their efficiency during inference. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem. To further improve naturalness, we introduce an iterative spectrogram loss, inspired by iterative refinement, and lightweight convolutions, which efficiently capture local contexts. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective naturalness with significantly decreased inference time.
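
As a rough illustration of the iterative spectrogram loss mentioned in the abstract, the sketch below sums an L1 reconstruction loss over several intermediate spectrogram predictions, so that each decoder stage is pushed toward the same target. This is a minimal PyTorch sketch under assumptions of our own (the function name, the plain L1 distance, and the tensor shapes are illustrative), not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def iterative_spectrogram_loss(predictions, target):
        # Sum an L1 loss over every intermediate spectrogram prediction,
        # so that each decoder stage is trained to refine the previous one.
        # Illustrative sketch only; the paper's exact loss may differ.
        return sum(F.l1_loss(pred, target) for pred in predictions)

    # Hypothetical usage: three intermediate predictions of an 80-bin mel spectrogram.
    target = torch.randn(4, 200, 80)                       # (batch, frames, mel bins)
    predictions = [torch.randn(4, 200, 80) for _ in range(3)]
    loss = iterative_spectrogram_loss(predictions, target)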

Click here for more from the Tacotron team.

Comparison among systems: Evaluation set 1

Random samples from Evaluation set 1 in the paper.

Systems compared: Tacotron 2, Global VAE, Global VAE w/o iterative loss, Fine grained, Xform, Xform w/o iterative loss, No VAE, No VAE w/o iterative loss.

[Audio samples 1–7 for each system]

Comparison among systems: Evaluation set 2

Random samples from Evaluation set 2 in the paper, compared with real human speech.

Systems compared: Human Speech, Tacotron 2, Global VAE, Fine grained.

[Audio samples 1–8 for each system]