Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron J. Weiss, Yonghui Wu.
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their inference efficiency.
This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder.
This model, called Parallel Tacotron, is highly parallelizable during both training and inference.
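The residual encoder mentioned above is a variational autoencoder: it maps a reference spectrogram to a posterior distribution over a latent, samples from it with the reparameterization trick, and is regularized by a KL term. Below is a minimal NumPy sketch of those two pieces only; the function names, the 8-dimensional latent, and the standard-normal prior are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def reparameterize(mean, log_var, rng):
    """Sample z = mean + sigma * eps (the VAE reparameterization trick).

    Sampling this way, instead of drawing from N(mean, sigma^2) directly,
    keeps the sample differentiable with respect to mean and log_var.
    """
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

def kl_divergence(mean, log_var):
    """KL(N(mean, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mean ** 2 - np.exp(log_var))

rng = np.random.default_rng(0)
mean = np.zeros(8)     # hypothetical 8-dim global latent
log_var = np.zeros(8)  # zero mean, unit variance -> KL term is exactly 0
z = reparameterize(mean, log_var, rng)
print(z.shape)                        # (8,)
print(kl_divergence(mean, log_var))   # 0.0
```

A "Global VAE" variant would produce one such latent per utterance, while a "Fine grained" variant would produce one per phoneme; the sampling and KL computation are the same in either case.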
Random sample from Evaluation set 1 in the paper.
| Tacotron2 | Global VAE | Global VAE w/o iterative loss | Fine grained | Xform | Xform w/o iterative loss | No VAE | No VAE w/o iterative loss |
|---|---|---|---|---|---|---|---|
Random sample from Evaluation set 2 in the paper. Comparison with real human speech.
| Human Speech | Tacotron2 | Global VAE | Fine grained |
|---|---|---|---|