Paper: arXiv
Authors: Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran.
Abstract: This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages, and 2) they can synthesize reasonably intelligible and naturally sounding speech for unseen languages where no high-quality paired TTS data is available.
All samples are at a 16 kHz sampling rate. The synthetic speech samples are generated from predicted speech features using a WaveGrad [1] vocoder.
This demonstration compares Virtuoso TTS models with two baseline models: Tacotron2 [2] and Maestro fine-tuning [3]. Tacotron2 is trained only on the paired TTS data. Maestro-Finetuning uses all the paired and unpaired data for pretraining and is then fine-tuned only on the paired TTS data. The Virtuoso models below additionally use paired ASR data and unpaired data for joint pretraining. Refer to Table 1 of the paper.
Spanish (Seen)
Sample | Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Tamil (Unseen)
Sample | Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Slovenian (Seen)
Sample | Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Bulgarian (Unseen)
Sample | Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
English (Seen)
Sample | Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Turkish (Unseen)
Sample | Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Thai (Seen)
Sample | Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Afrikaans (Unseen)
Sample | Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
This demonstration compares speech generated for an unseen language using either a speaker of a related language or an English speaker. The related languages in the left-hand columns are closer to the target language in writing system and phonetics, and thus tend to produce more intelligible speech. English, in contrast, typically offers a larger amount of high-fidelity training data, but the synthetic speech from the English speaker tends to retain the prosody of the speaker's original language.
Tamil
Sample | Natural | Hindi speaker | English speaker
---|---|---|---
1 | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio)
Bulgarian
Sample | Natural | Russian speaker | English speaker
---|---|---|---
1 | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio)
Turkish
Sample | Natural | French speaker | English speaker
---|---|---|---
1 | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio)
Afrikaans
Sample | Natural | Dutch speaker | English speaker
---|---|---|---
1 | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio)
This demonstration compares all the Virtuoso variants. See Tables 1 and 3 in the paper for the corresponding results.
Spanish (Seen)
Sample | Natural | Virtuoso (Grapheme, Paired TTS) | Virtuoso (Grapheme, Paired ASR+TTS) | Virtuoso (Grapheme, All data) | Virtuoso (Grapheme+LID, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Tamil (Unseen)
Sample | Natural | Virtuoso (Grapheme, Paired TTS) | Virtuoso (Grapheme, Paired ASR+TTS) | Virtuoso (Grapheme, All data) | Virtuoso (Grapheme+LID, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Slovenian (Seen)
Sample | Natural | Virtuoso (Grapheme, Paired TTS) | Virtuoso (Grapheme, Paired ASR+TTS) | Virtuoso (Grapheme, All data) | Virtuoso (Grapheme+LID, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Bulgarian (Unseen)
Sample | Natural | Virtuoso (Grapheme, Paired TTS) | Virtuoso (Grapheme, Paired ASR+TTS) | Virtuoso (Grapheme, All data) | Virtuoso (Grapheme+LID, All data) | Virtuoso (Byte+LID, All data)
---|---|---|---|---|---|---
1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)