Zero-shot Cross-lingual Voice Transfer for TTS, and Its Application to Voice Restoration for Accessibility and Inclusion
We present a zero-shot voice transfer (VT) module that can be easily plugged into a state-of-the-art TTS system to restore the voices of input speakers.
We include the following audio samples for our research blogpost.
Our paper is available on arXiv.
Zero-shot Examples using Typical Reference Speech
Below are zero-shot examples using typical reference speech, simulating the scenario in which the speaker's voice was recorded before any voice degradation occurred. We demonstrate the zero-shot capability using samples from the VCTK corpus:
Columns: Reference | TTS with Zero-shot VT (VAE) | TTS with Zero-shot VT (SharedGST) | TTS with Zero-shot VT (MultiGST) | TTS with Zero-shot VT (SegmentGST)
Speakers (one row each): Female (P257), Male (P256), Female (P244), Male (P243), Female (P253), Male (P285), Female (P231), Male (P246), Female (P282), Male (P271), Female (P303)
Cross-lingual Zero-shot Examples using Typical Reference Speech
We also evaluate the cross-lingual capability of our TTS zero-shot model on typical reference speech across six different languages, using English reference speakers from the VCTK corpus. The transcripts and their translations were automatically generated using Gemini:
Cross-lingual Zero-shot Examples on VAE
Speakers (one row each): Male (P246), Female (P303), Male (P256), Female (P244), Male (P243), Female (P231), Male (P285), Female (P282)

| Metric | Reference | U.S. English | Chinese Mandarin | Spanish | Arabic | French | Japanese | German | Italian | Hindi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Speaker Similarity | 85% ± 6% | 68% ± 4% | 67% ± 14% | 72% ± 7% | 90% ± 7% | 88% ± 5% | 77% ± 5% | 85% ± 6% | 78% ± 6% | 57% ± 6% |
| MOS Naturalness | 3.3 ± .29 | 3.7 ± .04 | 3.7 ± .08 | 3.5 ± .05 | 4.2 ± .03 | 4.0 ± .03 | 3.8 ± .05 | 4.1 ± .04 | 3.6 ± .05 | 3.9 ± .04 |
Cross-lingual Zero-shot Examples on SharedGST
Speakers (one row each): Male (P246), Female (P303), Male (P256), Female (P244), Male (P243), Female (P231), Male (P285), Female (P282)

| Metric | Reference | U.S. English | Chinese Mandarin | Spanish | Arabic | French | Japanese | German | Italian | Hindi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Speaker Similarity | 85% ± 6% | 48% ± 4% | 70% ± 14% | 65% ± 12% | 90% ± 7% | 86% ± 7% | 68% ± 6% | 69% ± 8% | 72% ± 7% | 38% ± 5% |
| MOS Naturalness | 3.3 ± .29 | 3.6 ± .05 | 3.7 ± .06 | 3.8 ± .05 | 4.2 ± .03 | 4.3 ± .03 | 3.4 ± .06 | 4.1 ± .04 | 3.7 ± .05 | 4.1 ± .03 |
Cross-lingual Zero-shot Examples on MultiGST
Speakers (one row each): Male (P246), Female (P303), Male (P256), Female (P244), Male (P243), Female (P231), Male (P285), Female (P282)

| Metric | Reference | U.S. English | Chinese Mandarin | Spanish | Arabic | French | Japanese | German | Italian | Hindi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Speaker Similarity | 85% ± 6% | 58% ± 4% | 70% ± 14% | 62% ± 12% | 74% ± 11% | 75% ± 7% | 70% ± 6% | 70% ± 6% | 70% ± 6% | 42% ± 6% |
| MOS Naturalness | 3.3 ± .29 | 3.5 ± .05 | 3.9 ± .05 | 3.6 ± .06 | 4.2 ± .03 | 4.1 ± .04 | 3.6 ± .05 | 4.0 ± .04 | 3.6 ± .05 | 4.1 ± .04 |
Cross-lingual Zero-shot Examples on SegmentGST
Speakers (one row each): Male (P246), Female (P303), Male (P256), Female (P244), Male (P243), Female (P231), Male (P285), Female (P282)

| Metric | Reference | U.S. English | Chinese Mandarin | Spanish | Arabic | French | Japanese | German | Italian | Hindi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Speaker Similarity | 85% ± 6% | 64% ± 4% | 68% ± 15% | 68% ± 14% | 90% ± 6% | 87% ± 5% | 73% ± 5% | 82% ± 6% | 76% ± 6% | 47% ± 6% |
| MOS Naturalness | 3.3 ± .29 | 3.7 ± .05 | 3.9 ± .05 | 3.6 ± .04 | 4.2 ± .03 | 4.0 ± .03 | 3.7 ± .05 | 4.1 ± .04 | 3.7 ± .04 | 4.1 ± .04 |
Case study: Atypical Speech as a Reference
We worked with two Googlers, Dimitri Kanevsky and Aubrie Lee, to synthesize the videos below, using only 12 seconds of Dimitri's atypical voice and 14 seconds of Aubrie's atypical voice as reference, respectively.
Columns: Participants | Original Videos | Videos with VT Outputs
Participants (one row each): Dimitri Kanevsky, Aubrie Lee
We also test whether our model, given the same atypical English reference speech from Dimitri and Aubrie, can generalize and transfer their voices to other languages when the input text is non-English.
Below are VT outputs in six different languages for Dimitri (French, Spanish, Italian, Arabic, German, Russian):
French
Spanish
Italian
Arabic
German
Russian
Below are VT outputs in six different languages for Aubrie (French, Spanish, Italian, Arabic, Hindi, Norwegian):