Audio samples from "Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data"

Paper: Coming soon

Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov.

Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground truth, and achieve naturalness scores that match the ground truth in several languages.
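The intelligibility figures above are character error rate (CER) differences. As a rough illustration only (not the paper's evaluation pipeline), CER can be computed as the Levenshtein edit distance between a reference transcript and an ASR hypothesis, normalized by the reference length:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (single-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance for prefixes so far
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]  # dp[i-1][j] before overwrite
            dp[j] = min(
                dp[j] + 1,                        # deletion
                dp[j - 1] + 1,                    # insertion
                prev + (ref[i - 1] != hyp[j - 1]) # substitution / match
            )
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

# Example: one substitution in a 4-character reference gives CER 0.25.
print(cer("abcd", "abed"))
```

In practice, evaluation toolkits normalize text (case, punctuation, whitespace) before scoring; this sketch omits that step.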

Click here for more from the Tacotron team.

All samples are at a 16 kHz sampling rate.

Selected FLEURS Group B Languages

Below are audio samples from Group B languages, comparing several system settings against the ground truth.

Spanish

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Zulu

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Arabic

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Tamil

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Uzbek

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Icelandic

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Javanese

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4: