Paper: arXiv
Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron J. Weiss, Yonghui Wu.
Abstract:
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their efficiency during inference.
This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder.
This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware.
Random samples from Evaluation set 1 in the paper.
[Audio players not included here: 7 sample utterances, each synthesized by Tacotron2, Global VAE, Global VAE w/o iterative loss, Fine-grained, Xform, Xform w/o iterative loss, No VAE, and No VAE w/o iterative loss.]
Random samples from Evaluation set 2 in the paper, compared with real human speech.
[Audio players not included here: 8 sample utterances, each available as recorded Human Speech and as synthesis from Tacotron2, Global VAE, and Fine-grained.]