Zero-shot Cross-lingual Voice Transfer for TTS, and Its Application to Voice Restoration for Accessibility and Inclusion

We present a zero-shot voice transfer (VT) module that can be easily plugged into a state-of-the-art TTS system to restore the voices of input speakers. This page collects the audio samples that accompany our research blog post. Our paper is available on arXiv.

Zero-shot Examples using Typical Reference Speech

The zero-shot examples below use typical reference speech, simulating the scenario in which the speaker's voice was recorded before any voice degradation occurred. We demonstrate the zero-shot capability using samples from the VCTK corpus:

Reference | TTS with Zero-shot VT (VAE) | TTS with Zero-shot VT (SharedGST) | TTS with Zero-shot VT (MultiGST) | TTS with Zero-shot VT (SegmentGST)
Female (P257)
Male (P256)
Female (P244)
Male (P243)
Female (P253)
Male (P285)
Female (P231)
Male (P246)
Female (P282)
Male (P271)
Female (P303)

Cross-lingual Zero-shot Examples using Typical Reference Speech

We also evaluate the cross-lingual capability of our TTS zero-shot model on typical reference speech across six different languages, using English reference speakers from the VCTK corpus. The transcripts and their translations were automatically generated using Gemini:

Cross-lingual Zero-shot Examples on VAE


Reference | U.S. English | Chinese Mandarin | Spanish | Arabic | French | Japanese | German | Italian | Hindi
Male (P246)
Female (P303)
Male (P256)
Female (P244)
Male (P243)
Female (P231)
Male (P285)
Female (P282)
Speaker Similarity | 85% ± 6% | 68% ± 4% | 67% ± 14% | 72% ± 7% | 90% ± 7% | 88% ± 5% | 77% ± 5% | 85% ± 6% | 78% ± 6% | 57% ± 6%
MOS Naturalness | 3.3 ± 0.29 | 3.7 ± 0.04 | 3.7 ± 0.08 | 3.5 ± 0.05 | 4.2 ± 0.03 | 4.0 ± 0.03 | 3.8 ± 0.05 | 4.1 ± 0.04 | 3.6 ± 0.05 | 3.9 ± 0.04

Cross-lingual Zero-shot Examples on SharedGST


Reference | U.S. English | Chinese Mandarin | Spanish | Arabic | French | Japanese | German | Italian | Hindi
Male (P246)
Female (P303)
Male (P256)
Female (P244)
Male (P243)
Female (P231)
Male (P285)
Female (P282)
Speaker Similarity | 85% ± 6% | 48% ± 4% | 70% ± 14% | 65% ± 12% | 90% ± 7% | 86% ± 7% | 68% ± 6% | 69% ± 8% | 72% ± 7% | 38% ± 5%
MOS Naturalness | 3.3 ± 0.29 | 3.6 ± 0.05 | 3.7 ± 0.06 | 3.8 ± 0.05 | 4.2 ± 0.03 | 4.3 ± 0.03 | 3.4 ± 0.06 | 4.1 ± 0.04 | 3.7 ± 0.05 | 4.1 ± 0.03

Cross-lingual Zero-shot Examples on MultiGST


Reference | U.S. English | Chinese Mandarin | Spanish | Arabic | French | Japanese | German | Italian | Hindi
Male (P246)
Female (P303)
Male (P256)
Female (P244)
Male (P243)
Female (P231)
Male (P285)
Female (P282)
Speaker Similarity | 85% ± 6% | 58% ± 4% | 70% ± 14% | 62% ± 12% | 74% ± 11% | 75% ± 7% | 70% ± 6% | 70% ± 6% | 70% ± 6% | 42% ± 6%
MOS Naturalness | 3.3 ± 0.29 | 3.5 ± 0.05 | 3.9 ± 0.05 | 3.6 ± 0.06 | 4.2 ± 0.03 | 4.1 ± 0.04 | 3.6 ± 0.05 | 4.0 ± 0.04 | 3.6 ± 0.05 | 4.1 ± 0.04

Cross-lingual Zero-shot Examples on SegmentGST


Reference | U.S. English | Chinese Mandarin | Spanish | Arabic | French | Japanese | German | Italian | Hindi
Male (P246)
Female (P303)
Male (P256)
Female (P244)
Male (P243)
Female (P231)
Male (P285)
Female (P282)
Speaker Similarity | 85% ± 6% | 64% ± 4% | 68% ± 15% | 68% ± 14% | 90% ± 6% | 87% ± 5% | 73% ± 5% | 82% ± 6% | 76% ± 6% | 47% ± 6%
MOS Naturalness | 3.3 ± 0.29 | 3.7 ± 0.05 | 3.9 ± 0.05 | 3.6 ± 0.04 | 4.2 ± 0.03 | 4.0 ± 0.03 | 3.7 ± 0.05 | 4.1 ± 0.04 | 3.7 ± 0.04 | 4.1 ± 0.04

Case study: Atypical Speech as a Reference

We work with two Googlers, Dimitri Kanevsky and Aubrie Lee, to synthesize the videos below, using as reference only 12 seconds of Dimitri's atypical speech and 14 seconds of Aubrie's atypical speech, respectively.

Participants | Original Videos | Videos with VT Outputs
Dimitri Kanevsky
Aubrie Lee

We also test whether our model, given the same atypical English reference speech from Dimitri and Aubrie together with non-English text, can generalize and transfer their voices to other languages.
Below are VT outputs in six different languages for Dimitri (French, Spanish, Italian, Arabic, German, Russian):

French Spanish Italian
Arabic German Russian

Below are VT outputs in six different languages for Aubrie (French, Spanish, Italian, Arabic, Hindi, Norwegian):

French Spanish Italian
Arabic Hindi Norwegian