Authors: Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma
Abstract: We describe a sequence-to-sequence neural network which can directly generate speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples. The interdependencies of waveform samples within each frame are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding frames. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed.
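The framing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the zero-padding of the last frame, and the 24 kHz sample rate in the usage example are assumptions made for illustration. The default frame size of 960 samples matches the base setting (K = 960) listed in the tables on this page.

```python
from typing import List, Sequence, Tuple


def frame_waveform(samples: Sequence[float], frame_size: int = 960) -> List[List[float]]:
    """Split a waveform into non-overlapping frames of `frame_size` samples.

    The final frame is zero-padded to full length (an assumption of this sketch).
    """
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    frames = []
    for start in range(0, len(samples), frame_size):
        frame = list(samples[start:start + frame_size])
        # Pad the tail so every frame has exactly frame_size samples.
        frame.extend([0.0] * (frame_size - len(frame)))
        frames.append(frame)
    return frames


def autoregressive_contexts(
    frames: List[List[float]],
) -> List[Tuple[List[List[float]], List[float]]]:
    """Pair each frame with the frames that precede it.

    In the model, each frame's flow is conditioned on preceding frames;
    here the pairing is only made explicit as plain data.
    """
    return [(frames[:t], frames[t]) for t in range(len(frames))]


if __name__ == "__main__":
    waveform = [0.0] * 2400  # 0.1 s of audio at a hypothetical 24 kHz
    frames = frame_waveform(waveform, frame_size=960)
    print(len(frames), len(frames[0]))  # prints "3 960"
```

Samples inside one frame are generated together by the flow, while the loop over frames is sequential, so the number of sequential steps scales with the number of frames rather than the number of samples.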
Text | Ground Truth | Input | Tacotron + Griffin-Lim | Tacotron + WaveRNN | Tacotron + FlowCoder | Wave-Tacotron |
---|---|---|---|---|---|---|
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year. | | char | | | | |
 | | phone | | | | |
Tajima Airport serves Toyooka. | | char | | | | |
 | | phone | | | | |
If you were going to space, would you be nervous? | | char | | | | |
 | | phone | | | | |
Input is char in all settings.
Text | Ground Truth | Tacotron + Griffin-Lim | Tacotron + WaveRNN | Tacotron + FlowCoder | Wave-Tacotron |
---|---|---|---|---|---|
Even the Caslon type when enlarged shows great shortcomings in this respect: | |||||
while at sea the captain of the ship was responsible for the security of the prisoner. | |||||
Marina Oswald appeared before the Commission again on June 11, 1964, | |||||
Input is phone in all settings.
Text | T = 0.6 | T = 0.7 | Base (T = 0.8) | T = 0.9 |
---|---|---|---|---|
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year. | ||||
Tajima Airport serves Toyooka. | ||||
If you were going to space, would you be nervous? | ||||
Input is phone in all settings.
Text | Base (pre-emp + pos-emb + skip con) | no pre-emphasis | no position embedding | no skip connection |
---|---|---|---|---|
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year. | ||||
Tajima Airport serves Toyooka. | ||||
If you were going to space, would you be nervous? | ||||
Input is phone in all settings.
Text | Base (256 channels, 60 steps, 5 stages) | 128 flow channels | 30 steps, 5 stages | 60 steps, 4 stages | 60 steps, 3 stages |
---|---|---|---|---|---|
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year. | |||||
Tajima Airport serves Toyooka. | |||||
If you were going to space, would you be nervous? | |||||
Input is phone in all settings.
Text | K = 320 | K = 640 | Base (K = 960) | K = 1280 |
---|---|---|---|---|
Talib Kweli confirmed to AllHipHop that he will be releasing an album in the next year. | ||||
Tajima Airport serves Toyooka. | ||||
If you were going to space, would you be nervous? | ||||
Samples generated by an unconditional model, obtained by removing the encoder and attention. The model is still capable of generating coherent syllables.
unconditional |
---|