Abstract: Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent results, they typically require a sizable set of high-quality <text, audio> pairs for training, which are expensive to collect.
In this paper, we propose a semi-supervised training framework to improve the data efficiency of Tacotron.
The idea is to allow Tacotron to utilize textual and acoustic knowledge contained in large, publicly available text and speech corpora.
Importantly, these external data are unpaired and potentially noisy.
Specifically, we first embed each word in the input text into a word vector and condition the Tacotron encoder on these vectors.
We then use an unpaired speech corpus to pre-train the Tacotron decoder in the acoustic domain.
Finally, we fine-tune the model using available paired data.
We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.
Since the focus of this work is to enable Tacotron training on small data, all audio samples on this demo page were synthesized using the Griffin-Lim algorithm for fast experiment cycles.
These samples refer to Section 3.1 of our paper, which searches for the smallest amount of training data a baseline Tacotron needs to produce intelligible speech.
We varied the amount of data using "shards" as a unit, where each shard contains about 24 minutes of speech.
As you can hear, audio produced by Tacotron trained on 100 shards (about 40 hours) and on 25 shards (about 10 hours) sounds almost equally good.
When trained on 5 shards (about 2 hours) down to 1.5 shards (about 36 minutes), Tacotron is still able to produce intelligible speech.
When trained on less than 1 shard (about 24 minutes), Tacotron fails to produce intelligible speech.
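For reference, the shard-to-time conversions quoted above can be reproduced with a few lines of Python (a minimal sketch; the 24-minutes-per-shard figure is the approximation used throughout this page):

```python
# Sanity-check the shard arithmetic quoted above (~24 minutes per shard).
MINUTES_PER_SHARD = 24

for shards in [0.5, 1, 1.5, 2, 2.5, 3, 5, 25, 100]:
    minutes = shards * MINUTES_PER_SHARD
    print(f"{shards:>5} shards = {minutes:g} min (~{minutes / 60:g} h)")
```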
0.5 shards
1 shard
1.5 shards
2 shards
2.5 shards
3 shards
5 shards
25 shards
100 shards
Text: I wish I had arms so I could give you a hug. But for now maybe a joke or some music might help.
Text: I don't think you've told me. What's his name?
These samples refer to Section 3.2 of our paper.
The four encoder conditioning configurations are as follows (a code sketch of the two conditioning mechanisms appears after the list):
Input-concat: Concatenate the word vectors to the encoder input token embeddings to form the new encoder input.
Top-concat: Concatenate the word vectors to the encoder top features to form the new text representations fed to the decoder.
Input-attention: Use a separate attention head between the word vectors and the encoder input token embeddings. The attention takes each token embedding as an attention query and produces a context vector, which is the weighted sum of all the word vectors. The context vector and the token embedding are then concatenated to form the new encoder input.
Top-attention: Same as Input-attention, but the attention is now applied between the word vectors and the encoder top features.
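To make the two mechanisms concrete, here is a minimal PyTorch sketch of the attention-based conditioning; the module name, dimensions, and learned projections are illustrative assumptions, not the paper's implementation. The concat variants need no attention at all: with one word vector aligned per position, conditioning reduces to a single concatenation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordVectorAttention(nn.Module):
    """Sketch of Input-/Top-attention conditioning (names/dims assumed).

    Each encoder feature (a token embedding for Input-attention, an
    encoder-top feature for Top-attention) acts as a query over all word
    vectors; the resulting context vector is concatenated to the feature.
    """

    def __init__(self, d_feat=256, d_wv=300, d_attn=128):
        super().__init__()
        self.query_proj = nn.Linear(d_feat, d_attn)  # project queries
        self.key_proj = nn.Linear(d_wv, d_attn)      # project keys

    def forward(self, features, word_vecs):
        # features: (T, d_feat), word_vecs: (W, d_wv)
        q = self.query_proj(features)            # (T, d_attn)
        k = self.key_proj(word_vecs)             # (W, d_attn)
        weights = F.softmax(q @ k.t(), dim=-1)   # (T, W) attention weights
        context = weights @ word_vecs            # (T, d_wv) weighted sum
        return torch.cat([features, context], dim=-1)

attn = WordVectorAttention()
out = attn(torch.randn(12, 256), torch.randn(9, 300))  # shape (12, 556)

# Input-/Top-concat instead aligns one word vector to each position and does:
#   conditioned = torch.cat([features, aligned_word_vecs], dim=-1)
```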
All four configurations were trained on 1 shard of data (24 minutes), which is the threshold found above.
As you can hear, by conditioning on pre-trained word vectors at either location (encoder input or top) using either mechanism (concatenation or attention), Tacotron is able to produce intelligible speech, significantly outperforming the baseline Tacotron.
Among the four configurations, top-concat performs best according to our informal listening and objective tests.
Baseline Tacotron
Input-concat
Top-concat
Input-attention
Top-attention
"Text: I am sorry, I don't support Navigation. You can ask me on your mobile device and I can get you to your destination."
"Text: I'm a bit of an outsider on the food chain, but I find the culinary sciences fascinating According to Wikipedia, almonds are stone fruits related to cherries, plums and peaches. They aren't actually true nuts – their fruit is called a "drupe""
These samples demonstrate the effectiveness of decoder pre-training, which improves data efficiency and also enables fast adaptation to the fine-tuning data.
Here the speech is synthesized by models trained for only about 1,000 steps (a very early stage of training) on 1 shard of data (24 minutes).
As you can hear, speech produced by the Tacotron with a pre-trained decoder significantly outperforms that produced by the baseline Tacotron in terms of both intelligibility and audio quality.
Note that even after the training converges, the baseline Tacotron is still unable to produce intelligible speech.
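The training schedule behind these samples can be summarized as a two-stage loop. Below is a minimal sketch under our own assumptions; the function names, the disabled text context, and the L2 next-frame loss are placeholders rather than the exact training code.

```python
import torch

def pretrain_then_finetune(model, unpaired_mels, paired_data, pre_steps=1000):
    """Sketch of the two-stage schedule: pre-train decoder, then fine-tune.

    Stage 1 trains the decoder as an autoregressive next-frame predictor on
    unpaired spectrograms, with no text conditioning, so it learns acoustic
    structure before seeing any paired data. Stage 2 fine-tunes the full
    model on the small paired corpus. All names here are illustrative.
    """
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stage 1: acoustic-only pre-training on unpaired speech.
    for _, mel in zip(range(pre_steps), unpaired_mels):
        pred = model.decode(mel[:-1], text_context=None)  # no text input
        loss = torch.mean((pred - mel[1:]) ** 2)          # next-frame L2 loss
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: fine-tune end to end on the small <text, audio> corpus.
    for text, mel in paired_data:
        pred = model(text, mel[:-1])                      # teacher-forced pass
        loss = torch.mean((pred - mel[1:]) ** 2)
        opt.zero_grad(); loss.backward(); opt.step()
```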
Baseline Tacotron
Tacotron w/ pre-trained decoder
"Text: It's fifty seven degrees with light showers and snow."
"Text: Sodium hydride is used as a strong base in organic synthesis."
We present audio samples produced by all Tacotron variants trained on small amounts of data, ranging from about 12 minutes to about 72 minutes (recall that each shard is about 24 minutes).
Each row below corresponds to a Tacotron variant and each column to the amount of training data.
0.5 shards
1 shard
1.5 shards
2 shards
2.5 shards
3 shards
Text: Here's what a monkey sounds like.
Baseline Tacotron
Tacotron w/ only encoder conditioning
Tacotron w/ only pre-trained decoder
Tacotron w/ both encoder conditioning and pre-trained decoder
Text: Because of their moves, some rules for pairs' skating were changed.
Baseline Tacotron
Tacotron w/ only encoder conditioning
Tacotron w/ only pre-trained decoder
Tacotron w/ both encoder conditioning and pre-trained decoder