Audio samples from "Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis"

Authors: Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma

Abstract: We describe a sequence-to-sequence neural network which can directly generate speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples. The interdependencies of waveform samples within each frame are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding frames. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed.
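
The framing described in the abstract (a waveform treated as a sequence of non-overlapping fixed-length frames) can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_into_frames` and `join_frames`, the zero-padding of the last frame, and the 24 kHz sample rate of the test tone are all assumptions made for illustration. Only the frame length K = 960 is taken from the base setting listed in section 6.

```python
import math


def split_into_frames(waveform, frame_len):
    """Split a 1-D sample sequence into non-overlapping frames of frame_len samples.

    The last frame is zero-padded to full length (a padding choice assumed
    for this sketch, not taken from the paper).
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    num_frames = math.ceil(len(waveform) / frame_len)
    pad = num_frames * frame_len - len(waveform)
    padded = list(waveform) + [0.0] * pad
    return [padded[i * frame_len:(i + 1) * frame_len] for i in range(num_frames)]


def join_frames(frames, num_samples):
    """Concatenate frames back into one waveform and drop the padding."""
    flat = [sample for frame in frames for sample in frame]
    return flat[:num_samples]


K = 960  # base segment length per decoder step (section 6)
SAMPLE_RATE = 24000  # assumed sample rate, used only to build a test tone

# A short 220 Hz test tone standing in for real speech.
waveform = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(2500)]

frames = split_into_frames(waveform, K)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

In the model, each such frame would be produced by one decoder step, so a larger K means fewer autoregressive steps for the same utterance.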

Contents

1. Architecture comparison on single speaker proprietary dataset
2. Architecture comparison on single speaker LJSpeech dataset
3. Ablations: temperature
4. Ablations: architecture tweaks
5. Ablations: flow parameterization
6. Ablations: waveform segment length per decoder step (K)
7. Unconditional generation

1. Architecture comparison on single speaker proprietary dataset

Text	Ground Truth	Input	Tacotron + Griffin-Lim	Tacotron + WaveRNN	Tacotron + FlowCoder	Wave-Tacotron
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year.		char
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year.		phone
Tajima Airport serves Toyooka.		char
Tajima Airport serves Toyooka.		phone
If you were going to space, would you be nervous?		char
If you were going to space, would you be nervous?		phone

2. Architecture comparison on single speaker LJSpeech dataset

Input is char in all settings.

Text	Ground Truth	Tacotron + Griffin-Lim	Tacotron + WaveRNN	Tacotron + FlowCoder	Wave-Tacotron
Even the Caslon type when enlarged shows great shortcomings in this respect:
while at sea the captain of the ship was responsible for the security of the prisoner.
Marina Oswald appeared before the Commission again on June 11, 1964,

3. Ablations: temperature

Input is phone in all settings.

Text	T = 0.6	T = 0.7	Base (T = 0.8)	T = 0.9
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year.
Tajima Airport serves Toyooka.
If you were going to space, would you be nervous?
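
The effect of the temperature values compared above can be sketched as follows, assuming (as is common for flow-based generative models) that the temperature T scales the standard deviation of the Gaussian prior from which noise is drawn before it is passed through the flow. The page itself does not spell out this mechanism, and `toy_flow_inverse` is a hypothetical affine stand-in for the learned flow, so the sketch only shows how T changes the spread of the generated samples.

```python
import random


def sample_prior(num_samples, temperature, rng):
    """Draw noise from a zero-mean Gaussian whose std dev equals the temperature."""
    return [rng.gauss(0.0, temperature) for _ in range(num_samples)]


def toy_flow_inverse(z, scale=0.5, shift=0.0):
    """Hypothetical stand-in for the learned flow: an invertible affine map."""
    return [scale * value + shift for value in z]


def rms(values):
    """Root-mean-square amplitude of a sequence of samples."""
    return (sum(v * v for v in values) / len(values)) ** 0.5


rng = random.Random(0)
for temperature in (0.6, 0.7, 0.8, 0.9):
    # One frame of K = 960 samples per temperature setting.
    frame = toy_flow_inverse(sample_prior(960, temperature, rng))
    print(f"T = {temperature}: RMS of toy frame = {rms(frame):.3f}")
```

Lower temperatures concentrate the noise near the mode of the prior, which trades sample diversity for more conservative output.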

4. Ablations: architecture tweaks

Input is phone in all settings.

Text	Base (pre-emp + pos-emb + skip con)	no pre-emphasis	no position embedding	no skip connection
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year.
Tajima Airport serves Toyooka.
If you were going to space, would you be nervous?

5. Ablations: flow parameterization

Input is phone in all settings.

Text	Base (256 channels, 60 steps, 5 stages)	128 flow channels	30 steps, 5 stages	60 steps, 4 stages	60 steps, 3 stages
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year.
Tajima Airport serves Toyooka.
If you were going to space, would you be nervous?

6. Ablations: waveform segment length per decoder step (K)

Input is phone in all settings.

Text	K = 320	K = 640	Base (K = 960)	K = 1280
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year.
Tajima Airport serves Toyooka.
If you were going to space, would you be nervous?

7. Unconditional generation

Samples generated by an unconditional model, obtained by removing the encoder and attention. The model is capable of generating coherent syllables.

unconditional