Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Paper: arXiv

Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, RJ Skerry-Ryan, Yonghui Wu

Abstract: This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model. The model features fully differentiable duration modeling, a learned attention-based upsampling mechanism, and an iterative reconstruction loss based on Soft Dynamic Time Warping (Soft-DTW). These features allow the model to learn alignments and token durations automatically, without requiring supervised duration signals. Experimental results show that Parallel Tacotron 2 outperforms the baselines in naturalness across several diverse multi-speaker evaluations. Its duration control capability is also demonstrated.
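The Soft-DTW reconstruction loss mentioned above replaces the hard minimum in classic DTW with a differentiable soft minimum, so alignment cost can be backpropagated. As a rough illustration only, here is a minimal, unbatched soft-DTW recursion in plain Python; the paper's actual loss operates on batched spectrogram tensors inside the training loop, and the function names here are hypothetical:

```python
import math

def soft_min(vals, gamma):
    # Differentiable soft minimum via the log-sum-exp trick.
    # As gamma -> 0 this approaches the ordinary hard minimum.
    m = min(vals)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in vals))

def soft_dtw(X, Y, gamma=0.1):
    """Soft-DTW alignment cost between two sequences of feature vectors.

    X, Y: lists of equal-dimension feature vectors (e.g. spectrogram frames).
    gamma: smoothing temperature; larger values give a smoother loss surface.
    """
    n, m = len(X), len(Y)
    cost = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y))
    INF = float("inf")
    # R[i][j] holds the soft-minimal cost of aligning X[:i] with Y[:j].
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = cost(X[i - 1], Y[j - 1]) + soft_min(
                (R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]), gamma)
    return R[n][m]
```

Because every operation is smooth, gradients flow through the alignment itself, which is what lets the model learn durations without supervised alignments.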



Main evaluation set

Hold-out evaluation set

Hard lines evaluation set

Rapid Reprompts evaluation set

Questions evaluation set

Global Duration/Pace Control

Last word Duration Control

Individual Word Duration Control

Comparison among systems: Main Evaluation Set

Random sample from Main Evaluation Set in the paper.

Systems: Parallel Tacotron 2 | Parallel Tacotron Global VAE | Parallel Tacotron Fine VAE | Tacotron 2
(Audio samples 1–8 for each system.)

Comparison among systems: Hold-out

Random sample from Hold-out set in the paper.

Systems: Ground Truth | Parallel Tacotron 2 | Parallel Tacotron Global VAE | Parallel Tacotron Fine VAE | Tacotron 2
(Audio samples 1–8 for each system.)

Comparison among systems: Hard Lines

Random sample from Hard Lines in the paper.

Systems: Parallel Tacotron 2 | Parallel Tacotron Global VAE | Parallel Tacotron Fine VAE | Tacotron 2
(Audio samples 1–8 for each system.)

Comparison among systems: Rapid Reprompts

Random sample from Rapid Reprompts in the paper.

Systems: Parallel Tacotron 2 | Parallel Tacotron Global VAE | Parallel Tacotron Fine VAE | Tacotron 2
(Audio samples 1–8 for each system.)

Comparison among systems: Questions

Random sample from Questions in the paper.

Systems: Parallel Tacotron 2 | Parallel Tacotron Global VAE | Parallel Tacotron Fine VAE | Tacotron 2
(Audio samples 1–8 for each system.)

Duration Controllability: Global pace control

Global pace control with duration factors [0.75, 1.0, 1.25]

Duration factor (Parallel Tacotron 2 samples):
0.75
1.0
1.25
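Global pace control works by multiplying every predicted token duration by the same factor before upsampling. A minimal sketch, assuming durations are per-token frame counts (the function name and data are hypothetical; in the actual model the scaled durations feed the learned upsampling network):

```python
def scale_pace(durations, factor):
    """Global pace control: scale every predicted token duration
    (in frames) by one factor. factor > 1.0 slows speech down,
    factor < 1.0 speeds it up."""
    return [d * factor for d in durations]

# Example: predicted durations for five tokens, stretched to 1.25x length.
durations = [3.2, 5.1, 2.0, 7.4, 4.3]
slower = scale_pace(durations, 1.25)
```

The total utterance length scales by the same factor, which is exactly the [0.75, 1.0, 1.25] sweep demonstrated above.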

Duration Controllability: Last word speed up and slow down

The final words, "Big Basin," are synthesized with duration factors [0.5, 1.0, 1.5].

Duration factor (Parallel Tacotron 2 samples):
0.5
1.0
1.5

Duration Controllability: Per word slow down

In each sample, a different word of the sentence is slowed down by a factor of 1.5.

Slowed word (Parallel Tacotron 2 samples):
saddened
hear about
devastation
Big Basin
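Per-word control follows the same idea as global pace control, but scales only the durations of the tokens that make up the chosen word. A minimal sketch, assuming we already know which token span corresponds to the word (the function name and span bookkeeping are hypothetical):

```python
def scale_span(durations, start, end, factor):
    """Scale only the token durations in the half-open range
    [start, end) -- e.g. the tokens of one word -- leaving the
    rest of the utterance at its predicted pace."""
    return [d * factor if start <= i < end else d
            for i, d in enumerate(durations)]

# Example: slow tokens 2-3 (a hypothetical word) by a factor of 1.5.
durations = [3.0, 4.0, 2.0, 5.0, 3.0]
edited = scale_span(durations, 2, 4, 1.5)
```

Only the targeted word stretches; the surrounding speech keeps its original timing, as in the samples above.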