Audio samples from "Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data"

Paper: Coming soon

Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov.

Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground truth, and achieve naturalness scores that match the ground truth in several languages.
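The intelligibility figures above are character error rate (CER) differences. As a rough illustration only (not the paper's evaluation pipeline), CER can be computed as the Levenshtein edit distance between a reference transcript and an ASR hypothesis, normalized by the reference length:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (single-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance for prefixes so far
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]  # dp[i-1][j] before overwrite
            dp[j] = min(
                dp[j] + 1,                        # deletion
                dp[j - 1] + 1,                    # insertion
                prev + (ref[i - 1] != hyp[j - 1]) # substitution / match
            )
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

# Example: one substitution in a 4-character reference gives CER 0.25.
print(cer("abcd", "abed"))
```

In practice, evaluation toolkits normalize text (case, punctuation, whitespace) before scoring; this sketch omits that step.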

Click here for more from the Tacotron team.

All samples are at a 16 kHz sampling rate.

Selected FLEURS Group B Languages

Below are audio samples from Group B languages, comparing several system settings against the ground truth.

Spanish

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Zulu

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Arabic

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Tamil

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Uzbek

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Icelandic

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4:

Javanese

Ground Truth | Supervised Baseline | Zero Proposed | 15m Proposed | Supervised
1:
2:
3:
4: