Paper: arXiv
Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron J. Weiss, Yonghui Wu.
Abstract:
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their inference efficiency.
This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder.
This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve naturalness, the model uses lightweight convolutions, which can efficiently capture local contexts in speech, and an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.
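To make the residual-encoder idea concrete, here is a minimal sketch of a global VAE residual encoder in PyTorch. Everything below is an illustrative assumption, not the paper's implementation: the module choice, the layer sizes, and the names `GlobalVAEResidualEncoder`, `n_mels`, and `latent_dim` are ours. The idea it shows: a reference mel spectrogram is compressed into a single per-utterance latent vector, which can then be broadcast over the text encodings before the non-autoregressive decoder.

```python
import torch
import torch.nn as nn

class GlobalVAEResidualEncoder(nn.Module):
    """Hypothetical global VAE residual encoder: compresses a reference
    mel spectrogram into one latent vector per utterance.
    All sizes are illustrative, not the paper's."""
    def __init__(self, n_mels=80, hidden=256, latent_dim=32):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)

    def forward(self, mel):                      # mel: [B, T, n_mels]
        _, h = self.rnn(mel)                     # h: [1, B, hidden]
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL divergence to a standard normal prior (the VAE regularizer).
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl                             # z: [B, latent_dim]

# The latent is broadcast along the text time axis and concatenated
# with the text encodings before duration prediction / decoding:
#   text_enc: [B, N, D], z: [B, latent_dim]
#   dec_in = torch.cat([text_enc, z.unsqueeze(1).expand(-1, N, -1)], dim=-1)
```

At inference time, when no reference spectrogram is available, such a global latent is typically replaced by the prior mean (a zero vector).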
Random samples from Evaluation set 1 in the paper.
Systems compared: Tacotron2, Global VAE, Global VAE w/o iterative loss, Fine-grained VAE, Xform, Xform w/o iterative loss, No VAE, No VAE w/o iterative loss.
[Seven audio samples per system; the embedded audio players are not reproduced here.]
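The "w/o iterative loss" columns ablate the iterative spectrogram loss mentioned in the abstract, in which the decoder emits a spectrogram prediction after each of its blocks and training sums the loss over all of them. A minimal sketch follows; the function name and the L1 criterion are our assumptions:

```python
import torch

def iterative_spectrogram_loss(predictions, target):
    """Sum an L1 spectrogram loss over every intermediate decoder
    prediction (one per decoder block), not just the final one.

    predictions: list of [B, T, n_mels] tensors, one per iteration.
    target:      [B, T, n_mels] ground-truth mel spectrogram.
    """
    return sum(torch.nn.functional.l1_loss(p, target) for p in predictions)
```

Supervising every iteration gives each decoder block a direct training signal, in the spirit of iterative refinement.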
Random samples from Evaluation set 2 in the paper. Comparison with real human speech.
Systems compared: Human Speech, Tacotron2, Global VAE, Fine-grained VAE.
[Eight audio samples per system; the embedded audio players are not reproduced here.]