Audio samples from "Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis"

Paper: arXiv, slides, poster

Authors: Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma

Abstract: We describe a sequence-to-sequence neural network which can directly generate speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples. The interdependencies of waveform samples within each frame are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding frames. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features nor additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed.
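The abstract describes the output waveform as a sequence of non-overlapping fixed-length frames. A minimal sketch of that framing step is shown here, written in plain Python for illustration. The function name `split_into_frames` and the zero-padding of the final frame are assumptions, not details taken from the paper.

```python
def split_into_frames(waveform, frame_len):
    """Split a 1-D sequence of samples into non-overlapping frames.

    Each frame holds exactly frame_len samples. Zero-padding the last
    frame is an assumption made for this sketch.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    frames = []
    for start in range(0, len(waveform), frame_len):
        frame = list(waveform[start:start + frame_len])
        # Pad the final, possibly short, frame with zeros.
        frame += [0.0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


# Toy example: 10 samples split into frames of 4 samples each.
samples = [0.1 * i for i in range(10)]
frames = split_into_frames(samples, 4)
```

In the model described above, each such frame would be produced by one decoder step, with the flow handling the samples inside the frame.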

Contents

1. Architecture comparison on single speaker proprietary dataset

Text | Ground Truth | Input | Tacotron + Griffin-Lim | Tacotron + WaveRNN | Tacotron + FlowCoder | Wave-Tacotron
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year. char
phone
Tajima Airport serves Toyooka. char
phone
If you were going to space, would you be nervous? char
phone

2. Architecture comparison on single speaker LJSpeech dataset

Input is char in all settings.

Text | Ground Truth | Tacotron + Griffin-Lim | Tacotron + WaveRNN | Tacotron + FlowCoder | Wave-Tacotron
Even the Caslon type when enlarged shows great shortcomings in this respect:
while at sea the captain of the ship was responsible for the security of the prisoner.
Marina Oswald appeared before the Commission again on June 11, 1964,

3. Ablations: temperature

Input is phone in all settings.

Text | T = 0.6 | T = 0.7 | Base (T = 0.8) | T = 0.9
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year.
Tajima Airport serves Toyooka.
If you were going to space, would you be nervous?
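The temperature T in the table above controls how much randomness enters the flow at synthesis time. A rough sketch follows, assuming the common convention for flow models in which noise is drawn from a zero-mean Gaussian prior whose standard deviation is scaled by T. The helper name `sample_prior` is hypothetical, and the frame size of 960 is borrowed from the base setting of K on this page.

```python
import random


def sample_prior(num_samples, temperature, rng):
    """Draw latent noise z ~ N(0, temperature^2) for one frame.

    Assumption: temperature scales the standard deviation of a
    standard normal prior.
    """
    return [rng.gauss(0.0, 1.0) * temperature for _ in range(num_samples)]


# Same seed, two temperatures: the lower temperature yields noise that is
# a scaled-down copy of the higher-temperature draw.
z_base = sample_prior(960, 0.8, random.Random(0))
z_cool = sample_prior(960, 0.6, random.Random(0))
```

Under this assumption, lower temperatures keep samples closer to the mode of the prior, trading variability for stability.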

4. Ablations: architecture tweaks

Input is phone in all settings.

Text | Base (pre-emp + pos-emb + skip con) | no pre-emphasis | no position embedding | no skip connection
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year.
Tajima Airport serves Toyooka.
If you were going to space, would you be nervous?
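One of the ablations above removes pre-emphasis. For readers unfamiliar with the term, here is a sketch of a standard first-order pre-emphasis filter and its inverse. The coefficient value 0.9 and the function names are illustrative assumptions and are not stated on this page.

```python
def pre_emphasis(x, coeff=0.9):
    """Apply y[n] = x[n] - coeff * x[n-1], boosting high frequencies."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - coeff * prev)
        prev = sample
    return y


def de_emphasis(y, coeff=0.9):
    """Invert pre_emphasis: x[n] = y[n] + coeff * x[n-1]."""
    x = []
    prev = 0.0
    for sample in y:
        current = sample + coeff * prev
        x.append(current)
        prev = current
    return x


# Round trip on a toy signal: de-emphasis recovers the input.
signal = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
restored = de_emphasis(pre_emphasis(signal))
```

A model trained on pre-emphasized targets would apply the inverse filter to its output before playback.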

5. Ablations: flow parameterization

Input is phone in all settings.

Text | Base (256 channels, 60 steps, 5 stages) | 128 flow channels | 30 steps, 5 stages | 60 steps, 4 stages | 60 steps, 3 stages
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year.
Tajima Airport serves Toyooka.
If you were going to space, would you be nervous?
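The settings above vary the number of flow channels, steps and stages. As a generic illustration of what a single invertible flow step can look like, the sketch below implements an affine coupling layer, a common building block of normalizing flows. This page does not confirm the exact layer type used, and `toy_conditioner` is a hypothetical stand-in for a learned network.

```python
import math


def toy_conditioner(x_a):
    """Hypothetical stand-in for a learned network: x_a -> (log_scale, shift)."""
    mean = sum(x_a) / len(x_a)
    return math.tanh(mean), 0.5 * mean


def coupling_forward(x):
    """Transform the second half of x, conditioned on the untouched first half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = toy_conditioner(x_a)
    z_b = [v * math.exp(log_scale) + shift for v in x_b]
    # Log-determinant of the Jacobian: one log_scale term per transformed value.
    log_det = log_scale * len(x_b)
    return x_a + z_b, log_det


def coupling_inverse(z):
    """Exactly undo coupling_forward, reusing the unchanged first half."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_scale, shift = toy_conditioner(z_a)
    x_b = [(v - shift) * math.exp(-log_scale) for v in z_b]
    return z_a + x_b


x = [0.2, -0.4, 1.0, 0.3]
z, log_det = coupling_forward(x)
x_restored = coupling_inverse(z)
```

Because the first half passes through unchanged, the inverse can recompute the same scale and shift, which is what makes the step cheap to invert and its likelihood tractable.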

6. Ablations: waveform segment length per decoder step (K)

Input is phone in all settings.

Text | K = 320 | K = 640 | Base (K = 960) | K = 1280
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year.
Tajima Airport serves Toyooka.
If you were going to space, would you be nervous?
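To make the trade-off behind K concrete, the sketch below computes how long one frame lasts and how many decoder steps an utterance would need for each setting. The 24 kHz sample rate and the 3-second utterance length are assumptions chosen for illustration, and `frame_stats` is a hypothetical helper.

```python
import math


def frame_stats(frame_len, sample_rate_hz, utterance_seconds):
    """Return (frame duration in ms, number of decoder steps) for one utterance."""
    frame_ms = 1000.0 * frame_len / sample_rate_hz
    total_samples = int(round(utterance_seconds * sample_rate_hz))
    # One decoder step emits one frame of frame_len samples.
    steps = math.ceil(total_samples / frame_len)
    return frame_ms, steps


# Assumed values: 24 kHz audio and a 3-second utterance.
for k in (320, 640, 960, 1280):
    frame_ms, steps = frame_stats(k, 24000, 3.0)
    print(f"K = {k}: {frame_ms:.1f} ms per frame, {steps} decoder steps")
```

Larger K means fewer autoregressive steps per utterance, at the cost of asking the flow to model more samples jointly within each frame.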

7. Unconditional generation

Samples generated by an unconditional model, obtained by removing the encoder and attention. The model is capable of generating coherent syllables.

unconditional