Audio samples from "Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization"

Authors: Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang, Yonghui Wu, James Glass

Abstract: To leverage crowd-sourced data to train multi-speaker text-to-speech (TTS) models that can synthesize clean speech for all speakers, it is essential to learn disentangled representations which can independently control the speaker identity and background noise in generated signals. However, learning such representations can be challenging, due to the lack of labels describing the recording conditions of each training example, and the fact that speakers and recording conditions are often correlated, e.g. since users often make many recordings using the same equipment. This paper proposes three components to address this problem by: (1) formulating a conditional generative model with factorized latent variables, (2) using data augmentation to add noise that is not correlated with speaker identity and whose label is known during training, and (3) using adversarial factorization to improve disentanglement. Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers. Ablation studies verify the importance of each proposed component.

Click here for more from the Tacotron team.

Note: To obtain the best quality, we strongly recommend readers to listen to the audio samples with headphones.

Contents

Reference Audio for Inferring Latents

Speaker Latent Variable
Residual Latent Variable

Synthesized Audio (Griffin-Lim)

p225 (clean)
p272 (clean)
p244 (noisy)
p252 (noisy)

Comparing Griffin-Lim and WaveRNN Spectrogram Inversion

Reference Audio for Inferring Latents

In this section, we present reference audio used to infer latent speaker variables z_s and latent residual variables z_r. These latent variables are used to control the speaker identity and noise condition, respectively.

Speaker Latent Variable

z_s⁽¹⁾: p225_253 (clean)	z_s⁽²⁾: p272_016 (clean)	z_s⁽³⁾: p244_011 (noisy)	z_s⁽⁴⁾: p252_005 (noisy)

Residual Latent Variable

z_r⁽¹⁾: p265_163 (clean)	z_r⁽²⁾: p251_004 (clean)	z_r⁽³⁾: p230_017 (noisy)	z_r⁽⁴⁾: p246_112 (noisy)

Synthesized Audio (Griffin-Lim)

This section contains synthesized audio from the baseline model and the proposed model using Griffin-Lim algorithm. The baseline model utilizes a speaker embedding table to control speaker identity. The proposed model uses z_s to control speaker identity and z_r to control the acoustic condition. These examples demonstrate model's ability to synthesize clean speech for all speakers, regardless of the training data quality of the speaker. In contrast, the baseline model always generates noisy speech for speakers whose training data are noisy.

Text 1: Try these pages.
Text 2: In 2009 these reports were collected in the book Chambermaids and Soldiers.
Text 3: In later years in films she switched to playing character parts.
Text 4: The other bodies are juxtaposed in various unlit areas behind them.

p225 (clean)

	Baseline	Proposed: (z_s⁽¹⁾, z_r⁽¹⁾)	Proposed: (z_s⁽¹⁾, z_r⁽²⁾)	Proposed: (z_s⁽¹⁾, z_r⁽³⁾)	Proposed: (z_s⁽¹⁾, z_r⁽⁴⁾)
text 1
text 2
text 3
text 4

p272 (clean)

	Baseline	Proposed: (z_s⁽²⁾, z_r⁽¹⁾)	Proposed: (z_s⁽²⁾, z_r⁽²⁾)	Proposed: (z_s⁽²⁾, z_r⁽³⁾)	Proposed: (z_s⁽²⁾, z_r⁽⁴⁾)
text 1
text 2
text 3
text 4

p244 (noisy)

	Baseline	Proposed: (z_s⁽³⁾, z_r⁽¹⁾)	Proposed: (z_s⁽³⁾, z_r⁽²⁾)	Proposed: (z_s⁽³⁾, z_r⁽³⁾)	Proposed: (z_s⁽³⁾, z_r⁽⁴⁾)
text 1
text 2
text 3
text 4

p252 (noisy)

	Baseline	Proposed: (z_s⁽⁴⁾, z_r⁽¹⁾)	Proposed: (z_s⁽⁴⁾, z_r⁽²⁾)	Proposed: (z_s⁽⁴⁾, z_r⁽³⁾)	Proposed: (z_s⁽⁴⁾, z_r⁽⁴⁾)
text 1
text 2
text 3
text 4

Comparing Griffin-Lim and WaveRNN Spectrogram Inversion

In this section, we compare using Griffin-Lim and WaveRNN to synthesize waveforms using spectrograms generated by the proposed model. The results demonstrate that using WaveRNN can improve the audio fidelity.

Text: The Godfather Release date is March 15, 1972

	(z_s⁽¹⁾, z_r⁽¹⁾)	(z_s⁽²⁾, z_r⁽¹⁾)	(z_s⁽³⁾, z_r⁽¹⁾)	(z_s⁽⁴⁾, z_r⁽¹⁾)
Griffin-Lim
WaveRNN

Text: The Palladium-Item is the daily morning newspaper for Richmond, Indiana and surrounding areas.

	(z_s⁽¹⁾, z_r⁽¹⁾)	(z_s⁽²⁾, z_r⁽¹⁾)	(z_s⁽³⁾, z_r⁽¹⁾)	(z_s⁽⁴⁾, z_r⁽¹⁾)
Griffin-Lim
WaveRNN