Paper: Coming soon
Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov.
Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground truth, and achieve naturalness scores that match the ground truth in several languages.
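The intelligibility figures in the abstract are stated as a character error rate (CER) difference to ground truth. This page does not spell out how that is computed, so the following is only a minimal sketch under two assumptions: that CER is the character-level Levenshtein distance between a reference transcript and an ASR transcript, divided by the reference length, and that the "difference" is the CER of synthesized speech minus the CER of the ground-truth recording. The function names are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of an ASR hypothesis against a reference text."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cer_difference(ref: str, asr_ground_truth: str, asr_synthesized: str) -> float:
    """Assumed definition: CER(synthesized) minus CER(ground-truth recording)."""
    return cer(ref, asr_synthesized) - cer(ref, asr_ground_truth)
```

Under these assumptions, a CER difference below 0.10 for a language would correspond to the "<10%" threshold quoted in the abstract. In practice the score would be averaged over an evaluation set rather than computed for a single utterance.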
All samples are sampled at 16 kHz.
Below are audio samples from Group B languages, where we compare across a few different settings.
Spanish
Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised |
---|---|---|---|---|
1: | ||||
2: | ||||
3: | ||||
4: | ||||
Zulu
Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised |
---|---|---|---|---|
1: | ||||
2: | ||||
3: | ||||
4: | ||||
Arabic
Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised |
---|---|---|---|---|
1: | ||||
2: | ||||
3: | ||||
4: | ||||
Tamil
Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised |
---|---|---|---|---|
1: | ||||
2: | ||||
3: | ||||
4: | ||||
Uzbek
Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised |
---|---|---|---|---|
1: | ||||
2: | ||||
3: | ||||
4: | ||||
Icelandic
Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised |
---|---|---|---|---|
1: | ||||
2: | ||||
3: | ||||
4: | ||||
Javanese
Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised |
---|---|---|---|---|
1: | ||||
2: | ||||
3: | ||||
4: | ||||