Audio samples from "Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning"

Paper: arXiv

Authors: Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

Abstract:
We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related languages, e.g. English and Mandarin.

Critical to achieving this result are: 1. using a phonemic input representation to encourage sharing of model capacity across languages, and 2. incorporating an adversarial loss term to encourage the model to disentangle its representation of speaker identity (which is perfectly correlated with language in the training data) from the speech content. Further scaling up the model by training on multiple speakers of each language, and incorporating an autoencoding input to help stabilize attention during training, results in a model which can be used to consistently synthesize intelligible speech for training speakers in all languages seen during training, and in native or foreign accents.
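The adversarial loss term mentioned in the abstract can be illustrated with a small sketch. This page does not spell out the exact formulation, so the code below assumes one common setup: a speaker classifier is trained on the text encoding, and the encoder is penalized whenever that classifier succeeds. The function names and the `adv_weight` parameter are hypothetical, chosen to echo the "adv weight" labels in the accent control samples.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one example given unnormalized class scores."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[target_index]


def encoder_objective(reconstruction_loss, speaker_logits, speaker_index, adv_weight):
    """Loss seen by the text encoder under the assumed adversarial setup.

    The speaker classifier itself would be trained to minimize its
    cross-entropy. The encoder is instead rewarded when the classifier
    fails, so the classifier loss enters with a negative sign, scaled
    by adv_weight (an assumed name, not taken from the paper's code).
    """
    speaker_loss = softmax_cross_entropy(speaker_logits, speaker_index)
    return reconstruction_loss - adv_weight * speaker_loss


if __name__ == "__main__":
    # Two equally likely speakers: the classifier loss is ln(2).
    print(encoder_objective(1.0, [0.0, 0.0], 0, adv_weight=0.1))
```

Under this assumption, a larger `adv_weight` pushes the encoder harder toward speaker-independent text encodings.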

Note: For the best quality, we strongly recommend listening to the audio samples with headphones.


We used a proprietary dataset consisting of speech in 3 languages: (1) 385 hours of high-quality English speech from 84 professional voice talents with accents from the United States, Great Britain, Australia, and Singapore; (2) 97 hours of Spanish speech from 3 female speakers, including Castilian Spanish and American Spanish; (3) 68 hours of Mandarin speech from 5 speakers.
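To make the imbalance between languages concrete, the short sketch below totals the hours and speakers quoted above and computes each language's share of the corpus. The figures come from the dataset description. The dictionary layout and variable names are only illustrative.

```python
# Hours and speaker counts as stated in the dataset description.
corpus = {
    "English": {"hours": 385, "speakers": 84},
    "Spanish": {"hours": 97, "speakers": 3},
    "Mandarin": {"hours": 68, "speakers": 5},
}

total_hours = sum(entry["hours"] for entry in corpus.values())
total_speakers = sum(entry["speakers"] for entry in corpus.values())

# Fraction of all training audio contributed by each language.
share = {lang: entry["hours"] / total_hours for lang, entry in corpus.items()}

# Average amount of audio available per speaker, in hours.
hours_per_speaker = {
    lang: entry["hours"] / entry["speakers"] for lang, entry in corpus.items()
}

if __name__ == "__main__":
    print(total_hours, total_speakers)
    for lang in corpus:
        print(f"{lang}: {share[lang]:.1%} of audio, "
              f"{hours_per_speaker[lang]:.1f} h per speaker")
```

English accounts for 385 of the 550 hours (70%), while the Spanish speakers each contribute far more audio on average than the English ones.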

All of the phrases below are unseen during training.

Multilingual speech synthesis

English

Text: The first commercial flights took place between the United States and Canada in 1919.
Speaker 1
Speaker 2
Speaker 3

Spanish

Text: Una rival que había tenido una historia amorosa con Christopher.
Speaker 1
Speaker 2
Speaker 3

Mandarin

Text: 第一班商业航班在一九一九年来往于美国和加拿大.
Speaker 1
Speaker 2
Speaker 3

Cross-language voice cloning

English speakers speaking fluent Spanish and Mandarin

Reference
Text: Una rival que había tenido una historia amorosa con Christopher.
Text: No somos seres aislados, formamos parte de la cadena universal de la vida.
Speaker 1
Speaker 2
Speaker 3
Reference
Text: 第一班商业航班在一九一九年来往于美国和加拿大.
Text: 他的动机更加务实和政治化.
Speaker 1
Speaker 2
Speaker 3

Accent Control

Text: No somos seres aislados, formamos parte de la cadena universal de la vida.
Text: Esta demostración es presentada por el equipo de Google.
Reference Speaker
Language id=Mandarin, adv weight=0.02
Language id=English, adv weight=0.02
Language id=English, adv weight=0.05
Language id=English, adv weight=0.1
Accent level (light to strong)
Text: 美国人学习中文, 很难区分不同音调.
Text: 第一班商业航班在一九一九年来往于美国和加拿大.
Reference Speaker
Language id=Mandarin, adv weight=0.02
Language id=English, adv weight=0.02
Language id=English, adv weight=0.05
Language id=English, adv weight=0.1

Code-Switching

Text: Use the left 2 lanes to take the exit toward Puerto Lápice.
Text: Exit the roundabout onto Carretera Eivissa-Sant Antoni.
Text: En ochocientos metros, Toma la salida 3B hacia Illinois 3 North en dirección a Kansas City.
Text: En doscientos metros, Toma la salida Interstate 40 West hacia 八达岭长城.
Text: In 300 meters, Continue onto Avenida Almadén.
EN Speaker 1
EN Speaker 2
ES Speaker 1
CN Speaker 1
Text: 今天真的是很high.
Text: Use the left 2 lanes to take the exit toward 八达岭长城.
Text: 使用左侧车道向 San Jose 方向行驶.
Text: By the way, 请给我十颗球.
Text: Happy Birthday, 我的宝贝.
EN Speaker 1
EN Speaker 2
ES Speaker 1
CN Speaker 1
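The code-switching prompts above mix scripts within one sentence. As a rough illustration of where the language boundaries fall, the sketch below splits a prompt into runs of Chinese characters and runs of other letters. It is a hypothetical helper, not part of the model described here, which takes phonemic input. It also assumes that any non-CJK letter is Latin, which is enough for these examples but not in general.

```python
def is_cjk(ch):
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def split_by_script(text):
    """Split text into (tag, chunk) runs tagged "cjk" or "latin".

    Spaces, digits, and punctuation are neutral: they attach to the run
    that precedes them, or to the first run if they open the text.
    """
    segments = []  # list of [tag, chunk]
    pending = ""   # neutral characters seen before any letter
    for ch in text:
        if is_cjk(ch):
            tag = "cjk"
        elif ch.isalpha():
            tag = "latin"  # assumption: every non-CJK letter is Latin
        else:
            tag = None
        if tag is None:
            if segments:
                segments[-1][1] += ch
            else:
                pending += ch
        elif segments and segments[-1][0] == tag:
            segments[-1][1] += ch
        else:
            segments.append([tag, pending + ch])
            pending = ""
    if pending:  # the text contained no letters at all
        return [("other", pending)]
    return [(tag, chunk) for tag, chunk in segments]


if __name__ == "__main__":
    print(split_by_script("今天真的是很high."))
    print(split_by_script("By the way, 请给我十颗球."))
```

For "今天真的是很high." this yields one "cjk" run followed by the "latin" run "high.". Prompts that switch between English and Spanish stay in a single run, because both languages use the Latin script.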