Audio samples from "Semi-Supervised Generative Modeling for Controllable Speech Synthesis"

Paper: arXiv

Authors: Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby

Abstract: We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised methods. We demonstrate that our model is able to reliably discover and control important but rarely labeled attributes of speech, such as affect and speaking rate, with as little as 0.1% (3 minutes) supervision. Even at such low supervision levels we do not observe a degradation of synthesis quality compared to a state-of-the-art baseline.
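As a rough picture of the approach (a generic semi-supervised VAE objective, not necessarily the paper's exact formulation): the model is trained with the usual evidence lower bound on all utterances, plus a term that ties the supervised latent zs to its observed label ys on the small labeled subset:

```latex
% Schematic semi-supervised objective (generic form; \lambda weights the
% supervised term, D is all data, D_L the labeled subset).
\mathcal{L} = \sum_{x \in \mathcal{D}}
    \Big( \mathbb{E}_{q_\phi(z \mid x)} \big[ \log p_\theta(x \mid z) \big]
          - \mathrm{KL}\big( q_\phi(z \mid x) \,\|\, p(z) \big) \Big)
    + \lambda \sum_{(x,\, y_s) \in \mathcal{D}_L} \log q_\phi(z_s = y_s \mid x)
```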

This page contains a set of audio samples in support of the paper. All utterances are unseen during training. Sections 1-3 demonstrate control of speaking rate, pitch variation (continuous labels), and affect (discrete labels) via semi-supervision of these factors, at multiple supervision levels. Sections 4-6 demonstrate transfer of this controllability to unlabeled speakers, for whom no labels of the above factors are observed in training (i.e., domain transfer).

Click here for more from the Tacotron team.

Contents

1. Speaking Rate Control at Varying Levels of Supervision
2. Pitch Variation Control at Varying Levels of Supervision
3. Affect Control at Varying Levels of Supervision
4. Speaking Rate Control Generalization to Unlabeled Speakers
5. Pitch Variation Control Generalization to Unlabeled Speakers
6. Affect Control Generalization to Unlabeled Speakers

1. Speaking Rate Control at Varying Levels of Supervision

In this section, we demonstrate the degree to which we can control speaking rate at different levels of supervision. Speaking rate is normalized to match the standard normal prior (zs ~ N(μ=0, σ=1)), so the rows below sweep zs in multiples of the prior's standard deviation.
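Concretely, each row of the table below corresponds to fixing the speaking-rate latent at a chosen multiple of σ before decoding. A minimal sketch of that interface, where `model.synthesize` and its arguments are assumed names, not a published API:

```python
# Hypothetical sketch: sweeping the supervised speaking-rate latent zs.
# `model.synthesize` is an assumed interface, not a published API.

SIGMA_STEPS = [-5.0, -3.0, -1.0, 0.0, 1.0, 3.0, 5.0]

def sweep_speaking_rate(model, text):
    """Synthesize `text` once per zs value, holding everything else fixed."""
    # zs is standard normal under the prior, so a step of k is "k sigmas".
    return {k: model.synthesize(text, z_speaking_rate=k) for k in SIGMA_STEPS}
```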

                     Supervision %
zs (Speaking Rate)   0.1%       1%         10%
-5σ                  [audio]    [audio]    [audio]
-3σ                  [audio]    [audio]    [audio]
-1σ                  [audio]    [audio]    [audio]
 0                   [audio]    [audio]    [audio]
+1σ                  [audio]    [audio]    [audio]
+3σ                  [audio]    [audio]    [audio]
+5σ                  [audio]    [audio]    [audio]

2. Pitch Variation Control at Varying Levels of Supervision

In this section, we demonstrate the degree to which we can control F0 variation (variation in fundamental frequency, a proxy for arousal or excitement) at different levels of supervision. F0 variation is normalized to match the standard normal prior (zs ~ N(μ=0, σ=1)).
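As a point of reference, a pitch-variation label of this kind can be derived by tracking F0 per utterance, measuring its spread, and standardizing across the corpus. The sketch below uses librosa's pYIN tracker; it is one plausible label construction, not necessarily the paper's recipe:

```python
import numpy as np
import librosa

def f0_variation(wav_path):
    """Std. dev. of voiced F0 (Hz): one plausible 'pitch variation' label."""
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    return float(np.nanstd(f0[voiced]))

def standardize(labels):
    """Map raw labels to zero mean / unit variance, matching the N(0, 1) prior."""
    v = np.asarray(labels, dtype=float)
    return (v - v.mean()) / v.std()
```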

                       Supervision %
zs (Pitch Variation)   0.1%       1%         10%
-5σ                    [audio]    [audio]    [audio]
-3σ                    [audio]    [audio]    [audio]
-1σ                    [audio]    [audio]    [audio]
 0                     [audio]    [audio]    [audio]
+1σ                    [audio]    [audio]    [audio]
+3σ                    [audio]    [audio]    [audio]
+5σ                    [audio]    [audio]    [audio]

3. Affect Control at Varying Levels of Supervision

In this section, we demonstrate the degree to which we can control affect at different levels of supervision. Higher levels of supervision are required to control affect than to control speaking rate or pitch variation. The resulting variations are generally subtle, so as an upper bound we also include the fully supervised case (i.e., 100% supervision).
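Unlike the continuous factors above, affect is conditioned on discrete (arousal, valence) pairs, so the table below enumerates a small categorical grid. The `model.synthesize` interface is hypothetical, mirroring the Section 1 sketch:

```python
# Hypothetical sketch: enumerating the discrete affect conditions shown below.
AROUSAL = {-2: "low", 2: "high"}
VALENCE = {-2: "angry", -1: "sad", 2: "happy"}

def sweep_affect(model, text):
    """Synthesize `text` once per (arousal, valence) pair in the table."""
    return {
        (a, v): model.synthesize(text, affect=(a, v))
        for a in AROUSAL
        for v in VALENCE
    }
```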

                                        Supervision %
zs (Affect)                             1%         10%        20%        100%
Arousal=-2 (low), Valence=-2 (angry)    [audio]    [audio]    [audio]    [audio]
Arousal=-2 (low), Valence=-1 (sad)      [audio]    [audio]    [audio]    [audio]
Arousal=-2 (low), Valence=2 (happy)     [audio]    [audio]    [audio]    [audio]
Arousal=2 (high), Valence=-2 (angry)    [audio]    [audio]    [audio]    [audio]
Arousal=2 (high), Valence=-1 (sad)      [audio]    [audio]    [audio]    [audio]
Arousal=2 (high), Valence=2 (happy)     [audio]    [audio]    [audio]    [audio]

4. Speaking Rate Control Generalization to Unlabeled Speakers

In this section, we demonstrate the degree to which speaking rate control at 10% supervision generalizes to speakers for whom no speaking-rate labels were observed in training (the "w/o label" columns below).
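In other words, the same zs setting is reused unchanged while only the speaker conditioning changes; no speaking-rate labels are needed for the new speaker at synthesis time. A sketch under the same hypothetical interface as Section 1:

```python
# Hypothetical sketch: applying one zs setting across labeled and unlabeled
# speakers. `speaker=` selects a speaker identity; zs is shared by all of them.

def transfer_rate_control(model, text, z_rate, speakers):
    """Same text and zs, different speakers (labeled or not)."""
    return {s: model.synthesize(text, z_speaking_rate=z_rate, speaker=s)
            for s in speakers}
```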

                     Speaker
zs (Speaking Rate)   female      male        female      female      male
                     w/ label    w/ label    w/o label   w/o label   w/o label
-5σ                  [audio]     [audio]     [audio]     [audio]     [audio]
-3σ                  [audio]     [audio]     [audio]     [audio]     [audio]
-1σ                  [audio]     [audio]     [audio]     [audio]     [audio]
 0                   [audio]     [audio]     [audio]     [audio]     [audio]
+1σ                  [audio]     [audio]     [audio]     [audio]     [audio]
+3σ                  [audio]     [audio]     [audio]     [audio]     [audio]
+5σ                  [audio]     [audio]     [audio]     [audio]     [audio]

5. Pitch Variation Control Generalization to Unlabeled Speakers

In this section, we demonstrate the degree to which pitch variation control at 10% supervision generalizes to speakers for whom no pitch-variation labels were observed in training (the "w/o label" columns below).

                       Speaker
zs (Pitch Variation)   female      male        female      female      male
                       w/ label    w/ label    w/o label   w/o label   w/o label
-5σ                    [audio]     [audio]     [audio]     [audio]     [audio]
-3σ                    [audio]     [audio]     [audio]     [audio]     [audio]
-1σ                    [audio]     [audio]     [audio]     [audio]     [audio]
 0                     [audio]     [audio]     [audio]     [audio]     [audio]
+1σ                    [audio]     [audio]     [audio]     [audio]     [audio]
+3σ                    [audio]     [audio]     [audio]     [audio]     [audio]
+5σ                    [audio]     [audio]     [audio]     [audio]     [audio]

6. Affect Control Generalization to Unlabeled Speakers

In this section, we demonstrate the degree to which affect control at 10% supervision generalizes to speakers for whom no affect labels were observed in training (the "w/o label" columns below).

                                        Speaker
zs (Affect)                             male        female      female      female      male
                                        w/ label    w/ label    w/o label   w/o label   w/o label
Arousal=-2 (low), Valence=-2 (angry)    [audio]     [audio]     [audio]     [audio]     [audio]
Arousal=-2 (low), Valence=-1 (sad)      [audio]     [audio]     [audio]     [audio]     [audio]
Arousal=-2 (low), Valence=2 (happy)     [audio]     [audio]     [audio]     [audio]     [audio]
Arousal=2 (high), Valence=-2 (angry)    [audio]     [audio]     [audio]     [audio]     [audio]
Arousal=2 (high), Valence=-1 (sad)      [audio]     [audio]     [audio]     [audio]     [audio]
Arousal=2 (high), Valence=2 (happy)     [audio]     [audio]     [audio]     [audio]     [audio]