Zero-shot Cross-lingual Voice Transfer for TTS, and Its Application to Voice Restoration for Accessibility and Inclusion

We present a zero-shot voice transfer (VT) module that can be easily plugged into a state-of-the-art TTS system to restore the voices of input speakers. The audio samples below accompany our research blog post. Our paper is available on arXiv.

Zero-shot Examples using Typical Reference Speech

The zero-shot examples below use typical reference speech, simulating the scenario in which the speaker's voice was recorded before any voice degradation occurred. We demonstrate the zero-shot capability using samples from the VCTK corpus:

	Reference	TTS with Zero-shot VT (VAE)	TTS with Zero-shot VT (SharedGST)	TTS with Zero-shot VT (MultiGST)	TTS with Zero-shot VT (SegmentGST)
Female (P257)
Male (P256)
Female (P244)
Male (P243)
Female (P253)
Male (P285)
Female (P231)
Male (P246)
Female (P282)
Male (P271)
Female (P303)

Cross-lingual Zero-shot Examples using Typical Reference Speech

We also evaluate the cross-lingual capability of our zero-shot TTS model on typical reference speech across six different languages, using English reference speakers from the VCTK corpus. The transcripts and their translations were generated automatically using Gemini:

Cross-lingual Zero-shot Examples on VAE

	Reference	U.S. English	Chinese Mandarin	Spanish	Arabic	French	Japanese	German	Italian	Hindi
Male (P246)
Female (P303)
Male (P256)
Female (P244)
Male (P243)
Female (P231)
Male (P285)
Female (P282)
Speaker Similarity	85% ± 6%	68% ± 4%	67% ± 14%	72% ± 7%	90% ± 7%	88% ± 5%	77% ± 5%	85% ± 6%	78% ± 6%	57% ± 6%
MOS Naturalness	3.3 ± .29	3.7 ± .04	3.7 ± .08	3.5 ± .05	4.2 ± .03	4.0 ± .03	3.8 ± .05	4.1 ± .04	3.6 ± .05	3.9 ± .04

Cross-lingual Zero-shot Examples on SharedGST

	Reference	U.S. English	Chinese Mandarin	Spanish	Arabic	French	Japanese	German	Italian	Hindi
Male (P246)
Female (P303)
Male (P256)
Female (P244)
Male (P243)
Female (P231)
Male (P285)
Female (P282)
Speaker Similarity	85% ± 6%	48% ± 4%	70% ± 14%	65% ± 12%	90% ± 7%	86% ± 7%	68% ± 6%	69% ± 8%	72% ± 7%	38% ± 5%
MOS Naturalness	3.3 ± .29	3.6 ± .05	3.7 ± .06	3.8 ± .05	4.2 ± .03	4.3 ± .03	3.4 ± .06	4.1 ± .04	3.7 ± .05	4.1 ± .03

Cross-lingual Zero-shot Examples on MultiGST

	Reference	U.S. English	Chinese Mandarin	Spanish	Arabic	French	Japanese	German	Italian	Hindi
Male (P246)
Female (P303)
Male (P256)
Female (P244)
Male (P243)
Female (P231)
Male (P285)
Female (P282)
Speaker Similarity	85% ± 6%	58% ± 4%	70% ± 14%	62% ± 12%	74% ± 11%	75% ± 7%	70% ± 6%	70% ± 6%	70% ± 6%	42% ± 6%
MOS Naturalness	3.3 ± .29	3.5 ± .05	3.9 ± .05	3.6 ± .06	4.2 ± .03	4.1 ± .04	3.6 ± .05	4.0 ± .04	3.6 ± .05	4.1 ± .04

Cross-lingual Zero-shot Examples on SegmentGST

	Reference	U.S. English	Chinese Mandarin	Spanish	Arabic	French	Japanese	German	Italian	Hindi
Male (P246)
Female (P303)
Male (P256)
Female (P244)
Male (P243)
Female (P231)
Male (P285)
Female (P282)
Speaker Similarity	85% ± 6%	64% ± 4%	68% ± 15%	68% ± 14%	90% ± 6%	87% ± 5%	73% ± 5%	82% ± 6%	76% ± 6%	47% ± 6%
MOS Naturalness	3.3 ± .29	3.7 ± .05	3.9 ± .05	3.6 ± .04	4.2 ± .03	4.0 ± .03	3.7 ± .05	4.1 ± .04	3.7 ± .04	4.1 ± .04
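
As a rough way to compare the four models across languages, the speaker-similarity point estimates reported in the tables above can be averaged per model. The sketch below is illustrative only. It assumes that the first value in each Speaker Similarity row belongs to the Reference column (it is identical in all four tables) and excludes it, it ignores the ± ranges, and the unweighted mean is a summary chosen here for illustration, not a figure reported with these results.

```python
# Speaker-similarity point estimates (%) copied from the tables above,
# one value per language column. Assumption: the first number in each
# Speaker Similarity row belongs to the Reference column, so it is omitted.
LANGUAGES = [
    "U.S. English", "Chinese Mandarin", "Spanish", "Arabic", "French",
    "Japanese", "German", "Italian", "Hindi",
]

SIMILARITY = {
    "VAE":        [68, 67, 72, 90, 88, 77, 85, 78, 57],
    "SharedGST":  [48, 70, 65, 90, 86, 68, 69, 72, 38],
    "MultiGST":   [58, 70, 62, 74, 75, 70, 70, 70, 42],
    "SegmentGST": [64, 68, 68, 90, 87, 73, 82, 76, 47],
}


def mean_similarity(scores):
    """Unweighted mean of the per-language point estimates (ignores the ± ranges)."""
    return sum(scores) / len(scores)


# Average each model over the nine language columns, then rank models
# from highest to lowest mean similarity.
means = {model: mean_similarity(scores) for model, scores in SIMILARITY.items()}
ranking = sorted(means, key=means.get, reverse=True)

for model in ranking:
    print(f"{model:<11} {means[model]:.1f}%")
```

Because the reported ranges overlap for several languages, small differences between these means should not be over-interpreted.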

Case study: Atypical Speech as a Reference

We worked with two Googlers, Dimitri Kanevsky and Aubrie Lee, to synthesize the videos below, using only 12 seconds of Dimitri's atypical speech and 14 seconds of Aubrie's atypical speech as reference, respectively.

Participants	Original Videos	Videos with VT Outputs
Dimitri Kanevsky
Aubrie Lee

We also test whether our model, using the same atypical English reference speech from Dimitri and Aubrie, can generalize and transfer their voices to other languages when given non-English text.

Below are VT outputs in six different languages for Dimitri (French, Spanish, Italian, Arabic, German, Russian):


French	Spanish	Italian

Arabic	German	Russian

Below are VT outputs in six different languages for Aubrie (French, Spanish, Italian, Arabic, Hindi, Norwegian):


French	Spanish	Italian

Arabic	Hindi	Norwegian