Audio samples from "Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech"

Paper: arXiv

Authors: Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran.

Abstract: This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages, and 2) they can synthesize reasonably intelligible and naturally sounding speech for unseen languages where no high-quality paired TTS data is available.


All samples are sampled at 16 kHz. The synthetic speech samples are generated from predicted speech features using a WaveGrad [1] vocoder.
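For readers who download the audio and want to confirm the stated sampling rate, the following is a minimal sketch using only the Python standard library. The helper names (`wav_sample_rate`, `write_test_tone`, `demo_sample_rate_check`) are hypothetical, the check assumes the files are saved as PCM WAV, and a synthetic tone stands in for a real sample because no file paths are given on this page.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 16_000  # sampling rate stated on this page


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def write_test_tone(path, rate_hz=EXPECTED_RATE_HZ, seconds=0.1, freq_hz=440.0):
    """Write a short mono 16-bit sine tone (stand-in audio, not a real sample)."""
    n_frames = int(rate_hz * seconds)
    frames = b"".join(
        struct.pack(
            "<h",
            int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / rate_hz)),
        )
        for i in range(n_frames)
    )
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(rate_hz)
        wav_file.writeframes(frames)


def demo_sample_rate_check():
    """Write a stand-in tone to a temporary file and read back its rate."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "tone.wav"
        write_test_tone(path)
        return wav_sample_rate(path)


if __name__ == "__main__":
    rate = demo_sample_rate_check()
    print("sampling rate matches:", rate == EXPECTED_RATE_HZ)
```

To check a downloaded sample instead, call `wav_sample_rate` on its path and compare the result with `EXPECTED_RATE_HZ`.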

Comparison of Virtuoso and Baseline methods

This section compares different Virtuoso TTS models with two baseline models: Tacotron2 [2] and a finetuned Maestro [3]. Tacotron2 is trained only on the paired TTS data. Maestro-Finetuning uses all the paired and unpaired data for pretraining and is then finetuned only on the paired TTS data. The Virtuoso model shown below uses paired ASR and unpaired data for joint pretraining. Refer to Table 1 of the paper.
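The system names in the tables on this page are labelled with a text input representation, "Grapheme" or "Byte". As a rough illustration of the difference, the sketch below assumes that "Byte" means the UTF-8 byte sequence of the text and approximates graphemes by Unicode code points. Both function names are hypothetical, this is not the paper's actual text front end, and the language ID ("LID") input is not modeled here.

```python
def grapheme_tokens(text):
    """Approximate graphemes as Unicode code points, one token per character."""
    return list(text)


def byte_tokens(text):
    """Represent text as its UTF-8 byte values (integers in 0..255)."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    word = "a\u00f1o"  # Spanish "año", written with an escape for the ñ
    print(grapheme_tokens(word))
    print(byte_tokens(word))
```

Under this assumption, a byte-based input never needs more than 256 distinct symbols regardless of the language or script, whereas a grapheme inventory grows with every new writing system covered.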

Spanish (Seen)

Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Tamil (Unseen)

Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Slovenian (Seen)

Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Bulgarian (Unseen)

Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

English (Seen)

Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Turkish (Unseen)

Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Thai (Seen)

Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Afrikaans (Unseen)

Natural | Tacotron2 (Grapheme) | Maestro-Finetuning (Grapheme) | Virtuoso (Grapheme, Paired data) | Virtuoso (Grapheme, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Comparison of speaker IDs for unseen languages

This demonstration compares speech generated for an unseen language using either a speaker from a similar language or an English speaker. The similar languages (left-hand side) can be closer to the target language in writing system and phonetics, and can thus produce more intelligible speech. Although English typically has a large amount of higher-fidelity training data, the synthetic speech tends to keep the prosody of the original language.

Tamil

Natural | Hindi speaker | English speaker
1:
2:
3:

Bulgarian

Natural | Russian speaker | English speaker
1:
2:
3:

Turkish

Natural | French speaker | English speaker
1:
2:
3:

Afrikaans

Natural | Dutch speaker | English speaker
1:
2:
3:

Ablation study on Virtuoso

This demonstration compares all the Virtuoso methods. See Tables 1 and 3 in the paper for the results.

Spanish (Seen)

Natural | Virtuoso (Grapheme, Paired TTS) | Virtuoso (Grapheme, Paired ASR+TTS) | Virtuoso (Grapheme, All data) | Virtuoso (Grapheme+LID, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Tamil (Unseen)

Natural | Virtuoso (Grapheme, Paired TTS) | Virtuoso (Grapheme, Paired ASR+TTS) | Virtuoso (Grapheme, All data) | Virtuoso (Grapheme+LID, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Slovenian (Seen)

Natural | Virtuoso (Grapheme, Paired TTS) | Virtuoso (Grapheme, Paired ASR+TTS) | Virtuoso (Grapheme, All data) | Virtuoso (Grapheme+LID, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Bulgarian (Unseen)

Natural | Virtuoso (Grapheme, Paired TTS) | Virtuoso (Grapheme, Paired ASR+TTS) | Virtuoso (Grapheme, All data) | Virtuoso (Grapheme+LID, All data) | Virtuoso (Byte+LID, All data)
1:
2:
3:

Reference