Abstract: Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent results, they typically require a sizable set of high-quality <text, audio> pairs for training, which are expensive to collect.
In this paper, we propose a semi-supervised training framework to improve the data efficiency of Tacotron.
The idea is to allow Tacotron to utilize textual and acoustic knowledge contained in large, publicly available text and speech corpora.
Importantly, these external data are unpaired and potentially noisy.
Specifically, we first embed each word in the input text into a word vector and condition the Tacotron encoder on these vectors.
We then use an unpaired speech corpus to pre-train the Tacotron decoder in the acoustic domain.
Finally, we fine-tune the model using available paired data.
We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.
Since the focus of this work is to enable Tacotron training on small data, all audio samples on this demo page were synthesized using the Griffin-Lim algorithm for fast experiment cycles.
These samples refer to Section 3.1 of our paper, which searches for the smallest amount of training data a baseline Tacotron needs to produce intelligible speech.
We varied the amount of data using "shards" as a unit, where each shard contains about 24 minutes of speech.
As you can hear, audio produced by Tacotron trained on 100 shards (about 40 hours) and on 25 shards (about 10 hours) sounds almost equally good.
When trained on 5 shards (about 2 hours) down to 1.5 shards (about 36 minutes), Tacotron is still able to produce intelligible speech.
When trained on less than 1 shard (about 24 minutes), Tacotron fails to produce intelligible speech.
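For reference, the shard-to-time conversions quoted above can be reproduced with a few lines of Python (a minimal sketch; the 24-minutes-per-shard figure is the approximation used throughout this page):

```python
# Sanity-check the shard arithmetic quoted above (~24 minutes per shard).
MINUTES_PER_SHARD = 24

for shards in [0.5, 1, 1.5, 2, 2.5, 3, 5, 25, 100]:
    minutes = shards * MINUTES_PER_SHARD
    print(f"{shards:>5} shards = {minutes:g} min (~{minutes / 60:g} h)")
```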
0.5 shards
1 shard
1.5 shards
2 shards
2.5 shards
3 shards
5 shards
25 shards
100 shards
Text: I wish I had arms so I could give you a hug. But for now maybe a joke or some music might help.
Text: I don't think you've told me. What's his name?
These samples refer to Section 3.2 of our paper.
The four encoder conditioning configurations are as follows (a code sketch of the two conditioning mechanisms appears after the list):
Input-concat: Concatenate the word vectors to the encoder input token embeddings to form the new encoder input.
Top-concat: Concatenate the word vectors to the encoder top features to form the new text representations fed to the decoder.
Input-attention: Use a separate attention head between the word vectors and the encoder input token embeddings. The attention takes each token embedding as an attention query and produces a context vector, which is the weighted sum of all the word vectors. The context vector and the token embedding are then concatenated to form the new encoder input.
Top-attention: Same as Input-attention, but the attention is now applied between the word vectors and the encoder top features.
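To make the two mechanisms concrete, here is a minimal PyTorch sketch of the attention-based conditioning; the module name, dimensions, and learned projections are illustrative assumptions, not the paper's implementation. The concat variants need no attention at all: with one word vector aligned per position, conditioning reduces to a single concatenation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordVectorAttention(nn.Module):
    """Sketch of Input-/Top-attention conditioning (names/dims assumed).

    Each encoder feature (a token embedding for Input-attention, an
    encoder-top feature for Top-attention) acts as a query over all word
    vectors; the resulting context vector is concatenated to the feature.
    """

    def __init__(self, d_feat=256, d_wv=300, d_attn=128):
        super().__init__()
        self.query_proj = nn.Linear(d_feat, d_attn)  # project queries
        self.key_proj = nn.Linear(d_wv, d_attn)      # project keys

    def forward(self, features, word_vecs):
        # features: (T, d_feat), word_vecs: (W, d_wv)
        q = self.query_proj(features)            # (T, d_attn)
        k = self.key_proj(word_vecs)             # (W, d_attn)
        weights = F.softmax(q @ k.t(), dim=-1)   # (T, W) attention weights
        context = weights @ word_vecs            # (T, d_wv) weighted sum
        return torch.cat([features, context], dim=-1)

attn = WordVectorAttention()
out = attn(torch.randn(12, 256), torch.randn(9, 300))  # shape (12, 556)

# Input-/Top-concat instead aligns one word vector to each position and does:
#   conditioned = torch.cat([features, aligned_word_vecs], dim=-1)
```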
All four configurations were trained on 1 shard of data (24 minutes), which is the threshold found above.
As you can hear, by conditioning on pre-trained word vectors at either location (encoder input or top) using either mechanism (concatenation or attention), Tacotron is able to produce intelligible speech, significantly outperforming the baseline Tacotron.
Among the four configurations, top-concat performs best according to our informal listening and objective tests.
Baseline Tacotron
Input-concat
Top-concat
Input-attention
Top-attention
"Text: I am sorry, I don't support Navigation. You can ask me on your mobile device and I can get you to your destination."
"Text: I'm a bit of an outsider on the food chain, but I find the culinary sciences fascinating According to Wikipedia, almonds are stone fruits related to cherries, plums and peaches. They aren't actually true nuts – their fruit is called a "drupe""
These samples demonstrate the effectiveness of decoder pre-training, which improves data efficiency and also enables fast adaptation to the fine-tuning data.
Here the speech is synthesized by models trained for only about 1,000 steps (a very early stage of training) on 1 shard of data (24 minutes).
As you can hear, speech produced by the Tacotron with a pre-trained decoder significantly outperforms that produced by the baseline Tacotron in terms of both intelligibility and audio quality.
Note that even after the training converges, the baseline Tacotron is still unable to produce intelligible speech.
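The training schedule behind these samples can be summarized as a two-stage loop. Below is a minimal sketch under our own assumptions; the function names, the disabled text context, and the L2 next-frame loss are placeholders rather than the exact training code.

```python
import torch

def pretrain_then_finetune(model, unpaired_mels, paired_data, pre_steps=1000):
    """Sketch of the two-stage schedule: pre-train decoder, then fine-tune.

    Stage 1 trains the decoder as an autoregressive next-frame predictor on
    unpaired spectrograms, with no text conditioning, so it learns acoustic
    structure before seeing any paired data. Stage 2 fine-tunes the full
    model on the small paired corpus. All names here are illustrative.
    """
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stage 1: acoustic-only pre-training on unpaired speech.
    for _, mel in zip(range(pre_steps), unpaired_mels):
        pred = model.decode(mel[:-1], text_context=None)  # no text input
        loss = torch.mean((pred - mel[1:]) ** 2)          # next-frame L2 loss
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: fine-tune end to end on the small <text, audio> corpus.
    for text, mel in paired_data:
        pred = model(text, mel[:-1])                      # teacher-forced pass
        loss = torch.mean((pred - mel[1:]) ** 2)
        opt.zero_grad(); loss.backward(); opt.step()
```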
Baseline Tacotron
Tacotron w/ pre-trained decoder
"Text: It's fifty seven degrees with light showers and snow."
"Text: Sodium hydride is used as a strong base in organic synthesis."
We present audio samples produced by all Tacotron variants trained on small amounts of data, ranging from about 12 minutes to about 72 minutes (recall that each shard is about 24 minutes).
Each row below corresponds to a Tacotron variant and each column to the amount of training data.
0.5 shards
1 shard
1.5 shards
2 shards
2.5 shards
3 shards
Text: Here's what a monkey sounds like.
Baseline Tacotron
Tacotron w/ only encoder conditioning
Tacotron w/ only pre-trained decoder
Tacotron w/ both encoder conditioning and pre-trained decoder
Text: Because of their moves, some rules for pairs' skating were changed.
Baseline Tacotron
Tacotron w/ only encoder conditioning
Tacotron w/ only pre-trained decoder
Tacotron w/ both encoder conditioning and pre-trained decoder