Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron J. Weiss, Yonghui Wu.
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their inference efficiency.
This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder.
This model, called Parallel Tacotron, is highly parallelizable during both training and inference.
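The residual encoder mentioned above is a variational autoencoder: it maps a reference spectrogram to a posterior distribution over a latent, samples from it with the reparameterization trick, and is regularized by a KL term. Below is a minimal NumPy sketch of those two pieces only; the function names, the 8-dimensional latent, and the standard-normal prior are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def reparameterize(mean, log_var, rng):
    """Sample z = mean + sigma * eps (the VAE reparameterization trick).

    Sampling this way, instead of drawing from N(mean, sigma^2) directly,
    keeps the sample differentiable with respect to mean and log_var.
    """
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

def kl_divergence(mean, log_var):
    """KL(N(mean, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mean ** 2 - np.exp(log_var))

rng = np.random.default_rng(0)
mean = np.zeros(8)     # hypothetical 8-dim global latent
log_var = np.zeros(8)  # zero mean, unit variance -> KL term is exactly 0
z = reparameterize(mean, log_var, rng)
print(z.shape)                        # (8,)
print(kl_divergence(mean, log_var))   # 0.0
```

A "Global VAE" variant would produce one such latent per utterance, while a "Fine grained" variant would produce one per phoneme; the sampling and KL computation are the same in either case.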
Random sample from Evaluation set 1 in the paper.
| Tacotron2 | Global VAE | Global VAE w/o iterative loss | Fine grained | Xform | Xform w/o iterative loss | No VAE | No VAE w/o iterative loss |
|---|---|---|---|---|---|---|---|
Random sample from Evaluation set 2 in the paper. Comparison with real human speech.
| Human Speech | Tacotron2 | Global VAE | Fine grained |
|---|---|---|---|