Audio samples from "Hierarchical Generative Modeling for Controllable Speech Synthesis"

Paper: arXiv

Authors: Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

Abstract: This paper proposes a neural end-to-end text-to-speech model which can control latent attributes in the generated speech that are rarely annotated in the training data (e.g. speaking styles, accents, background noise level, and recording conditions). The model is formulated as a conditional generative model with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled, fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation of the proposed model demonstrates its ability to control the aforementioned attributes. In particular, it is capable of consistently synthesizing high-quality clean speech regardless of the quality of the training data for the target speaker.
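For concreteness, the two-level prior described above corresponds to the following sampling procedure. This is a minimal NumPy sketch with placeholder parameters (the sizes K and D, the GMM parameter values, and the diagonal covariance are our assumptions, not values from the paper):

    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholder GMM prior parameters; stand-ins for the learned values.
    K, D = 10, 16                            # mixture components, latent dims
    pi = np.full(K, 1.0 / K)                 # mixture weights
    mu = rng.normal(size=(K, D))             # per-component means
    sigma = np.abs(rng.normal(size=(K, D)))  # per-dim std devs (diagonal)

    # Level 1: a categorical variable selects an attribute group
    # (e.g. clean/noisy, or a speaker cluster).
    y = rng.choice(K, p=pi)

    # Level 2: conditioned on y, a multivariate Gaussian gives the specific
    # attribute configuration. Marginally, z follows a Gaussian mixture.
    z = rng.normal(mu[y], sigma[y])

Fixing y to a particular component index, rather than sampling it, yields the per-component samples shown in Section 4.1 below.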

Click here for more from the Tacotron team.

Note: To obtain the best quality, we strongly recommend listening to the audio samples with headphones.


Multi-Speaker English Corpus (Section 4.1)

We used a proprietary dataset of 385 hours of high-quality English speech from 84 professional voice talents with accents from the United States (US), Great Britain (GB), Australia (AU), and Singapore (SG). Speaker labels were not seen during training, and were only used for evaluation.

Random Samples by Mixture Component

We present random samples drawn from several mixture components, each of which models a speaker cluster. These samples correspond to Appendix D.1. For text conditioning, we chose a passage that emphasizes accent differences.
Text: The fake lawyer from New Orleans is caught again.
Component 1: male
Component 3: US female (low-pitched)
Component 4: GB/AU female
Component 5: US female (high-pitched)
Component 7: US/SG male
Component 8: US/SG female


Control of Latent Attributes

We demonstrate the ability to independently control a latent attribute by changing a single dimension of the latent attribute representation while keeping the other dimensions fixed. The values assigned to the target dimension are shown in the first row, where μ and σ denote the mean and standard deviation of that dimension, respectively. These samples correspond to Appendix D.2.
Text: The fake lawyer from New Orleans is caught again.
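Concretely, each traversal below overwrites one coordinate of a sampled representation with μ + cσ for c in {-2, 0, +2} while keeping every other coordinate fixed. A minimal sketch with placeholder statistics (the helper name and values are ours; real μ and σ come from the chosen mixture component):

    import numpy as np

    def traverse_dimension(z, dim, mu_d, sigma_d, scales=(-2.0, 0.0, 2.0)):
        """Return copies of z with dimension `dim` set to mu_d + c * sigma_d
        for each scale c, keeping all other dimensions fixed."""
        variants = []
        for c in scales:
            v = z.copy()
            v[dim] = mu_d + c * sigma_d
            variants.append(v)
        return variants

    rng = np.random.default_rng(0)
    z = rng.normal(size=16)   # a sampled latent attribute representation
    # Vary dimension 2 (speed in this model) with placeholder statistics.
    fast, neutral, slow = traverse_dimension(z, dim=2, mu_d=0.0, sigma_d=1.0)

Each variant is then fed to the decoder alongside the same text.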

Dimension 2: Speed (Fast -> Slow)

Val = μ - 2σ | μ | μ + 2σ
Sample 1
Sample 2

Dimension 3: Accent (US -> British)

Val = μ - 2σ | μ | μ + 2σ
Sample 1
Sample 2

Dimension 9: Pitch (High -> Low)

Sample 1
Sample 2


Noisy Multi-Speaker English Corpus (Section 4.2)

We artificially generated training sets by using a room simulator to add background noise and reverberation to clean speech from the multi-speaker English corpus above. Noise was added to a random 50% of each speaker's utterances, except for two held-out speakers (one male and one female) whose utterances were all noisified. In this experiment, we provided speaker labels as input to the decoder, and expected the latent attribute representations to capture only the acoustic condition of each utterance.
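As a rough illustration of the noisification step (not the paper's room simulator, which also applies reverberation via convolution with a room impulse response), background noise can be mixed in at a chosen signal-to-noise ratio:

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Mix `noise` into `speech` at the given signal-to-noise ratio in dB."""
        noise = np.resize(noise, speech.shape)  # tile/trim noise to match length
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + gain * noise

    # Synthetic example; real use would load speech and noise waveforms.
    rng = np.random.default_rng(0)
    speech = rng.normal(size=24000)
    noisy = mix_at_snr(speech, rng.normal(size=24000), snr_db=10.0)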

Random Samples from Noisy and Clean Components

These samples correspond to Section 4.2.1 / Appendix E.1. As shown in Figure 3 of our paper, the mixture components form two distinct clusters, one for clean speech and one for noisified speech. Here we select one component from each cluster and draw three latent attribute representation samples from each component. For each sample, we synthesize one utterance conditioned on each of three speakers, two of which are the held-out speakers. All three samples drawn from the noisy component generate noisy speech, where the type of noise is consistent regardless of the conditioned speaker. All three samples drawn from the clean component generate clean speech, even for the two held-out speakers that have no clean training data, as discussed in Section 4.2.3 of our paper.
Text: This model is trained on multi-speaker English data.

Samples from a Noisy Component

Sample 1 (Loud Wideband Noise) | Sample 2 (Low-Frequency Noise, Reverb) | Sample 3 (Musical Noise)
Clean Speaker 1
Noisy Speaker A
Noisy Speaker B

Samples from a Clean Component

Sample 1 | Sample 2 | Sample 3
Clean Speaker 1
Noisy Speaker A
Noisy Speaker B


Control of Background Noise Level

These samples correspond to Section 4.2.2 / Appendix E.2. We demonstrate that the level of noise can be controlled by changing one dimension, which is automatically identified by per-dimension linear discriminant analysis. The values assigned to the noise-level dimension are shown in the first row.
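Our reading of the per-dimension linear discriminant analysis: for each latent dimension, compare the separation between clean and noisy utterances to the within-group variance, and pick the dimension with the largest ratio. A sketch under that assumption (shapes and names are ours):

    import numpy as np

    def per_dim_fisher_ratio(z_clean, z_noisy):
        """Per-dimension Fisher ratio: squared mean gap over summed variances.
        Inputs have shape (num_utterances, latent_dim) and hold latent
        attribute representations inferred from clean/noisy utterances."""
        mean_gap = z_clean.mean(axis=0) - z_noisy.mean(axis=0)
        within = z_clean.var(axis=0) + z_noisy.var(axis=0) + 1e-12
        return mean_gap ** 2 / within

    # Synthetic example where dimension 3 is made discriminative on purpose.
    rng = np.random.default_rng(0)
    z_clean = rng.normal(size=(100, 16))
    z_noisy = rng.normal(size=(100, 16))
    z_noisy[:, 3] += 2.0
    noise_dim = int(np.argmax(per_dim_fisher_ratio(z_clean, z_noisy)))  # -> 3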
Text: Traversing the noise level dimension.
Noise-Level Dim = -0.6 | -0.2 | 0.2
Sample 1: Speaker 1
Sample 2: Speaker 1
Sample 3: Speaker A
Sample 4: Speaker A


Single-Speaker Audiobook Corpus (Section 4.3)

We evaluated the ability of the proposed model to sample and control speaking styles. For training, we used a 147-hour single-speaker US English audiobook dataset from the 2013 Blizzard Challenge, recorded by the professional speaker Catherine Byers.

Non-Parallel Style Transfer

These samples correspond to Appendix F.2 in our paper. We demonstrate the ability of our model to synthesize speech that resembles the prosody or style of a given reference utterance. The first row of audio samples contains the reference utterances. Conditioning on each style, we synthesize four utterances whose text content differs from that of the reference; they are shown in the same column as the corresponding reference utterance.
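Mechanically, non-parallel transfer conditions the synthesizer on a latent attribute representation inferred from the reference while supplying unrelated text. The sketch below uses hypothetical stand-in functions (latent_encoder, synthesize) for the model's posterior network and synthesizer; neither name comes from the paper:

    import numpy as np

    def latent_encoder(reference_mel):
        """Hypothetical stand-in for the variational posterior network; maps
        reference mel spectrogram features to a latent attribute vector."""
        return np.zeros(16)                  # placeholder latent

    def synthesize(text, z_latent):
        """Hypothetical stand-in for text encoder + decoder + vocoder."""
        return np.zeros(24000)               # placeholder waveform

    # Infer the style from a reference, then reuse it with different text.
    reference_mel = np.zeros((80, 400))      # placeholder reference features
    z_style = latent_encoder(reference_mel)
    audio = synthesize("By water in the midst of water!", z_style)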
Reference 1 Text: I haven't the least idea what you're talking about, said Alice.
Reference 2 Text: "That's made a-purpose," said the Djinn, "all because you missed those three days.
Reference 3 Text: "And I shall get my courage," said the Lion thoughtfully.
Reference 4 Text: She was not going through any acute mental process or reasoning with herself, nor was she striving to explain to her satisfaction the motive of her action.
Synthesized 1 Text: By water in the midst of water!
Synthesized 2 Text: We must burn the house down! said the Rabbit's voice; and Alice called out as loud as she could, If you do.
Synthesized 3 Text: And she began fancying the sort of thing that would happen: Miss Alice!
Synthesized 4 Text: She tasted a bite, and she read a word or two, and she sipped the amber wine and wiggled her toes in the silk stockings.
Style 1 (Tremulous, High-Pitched) | Style 2 (Prolonged, Medium-Low-Pitched) | Style 3 (Rough, Low-Pitched) | Style 4 (Narrative)
Reference
Synthesized 1
Synthesized 2
Synthesized 3
Synthesized 4


Random Style Samples

Here we draw ten random latent attribute representations from the prior and synthesize three utterances for each sample. These samples refer to Section 4.3.1 / Appendix F.3 in our paper. Each column conditions on the same text, and each row conditions on the same latent attribute representation. The ten samples show wide variation in speaking rate, rhythm, pitch, and tone. When conditioning on the same latent attribute representation, the style is consistent across utterances with different texts.
Text 1: Lady Jane Grey had carried fashion to the point of knowing Hebrew.
Text 2: And how soon was the alarm raised along the countryside?
Text 3: And as she raised one slim white hand to brush back some wisps that floated by her face, I saw distinctly the webs between her fingers.
Text 1 | Text 2 | Text 3
Random Style 1
Random Style 2
Random Style 3
Random Style 4
Random Style 5
Random Style 6
Random Style 7
Random Style 8
Random Style 9
Random Style 10


Control of Style Attributes

These samples refer to Section 4.3.1 / Appendix F.4 in our paper. We show that several aspects of speaking style/prosody can be controlled by changing the value of one dimension of the latent attribute representation. For each property, we generate two random latent attribute representations from the prior as seeds (sample 1 and sample 2). For each seed, we set the target dimension to each of three values, shown in the first row, and synthesize two utterances (text 1 and text 2) per value.
Text 1: "Luck has taken us into its own hands," Eric laughed.
Text 2: Mrs. Lynde drove home, meeting several people on the road and stopping to tell them about the hall.

Dimension 3: Deepness, Masculinity (Less -> More)

Dim 3 = μ - 4σ | μ | μ + 4σ
Sample 1, Text 1
Sample 1, Text 2
Sample 2, Text 1
Sample 2, Text 2

Dimension 7: Speed, Emphasis of Ending (Fast -> Slow)

Dim 7 = μ - 4σ | μ | μ + 4σ
Sample 1, Text 1
Sample 1, Text 2
Sample 2, Text 1
Sample 2, Text 2

Dimension 8: Excitement (Less -> More)

Dim 8 = μ - 4σ | μ | μ + 4σ
Sample 1, Text 1
Sample 1, Text 2
Sample 2, Text 1
Sample 2, Text 2

Dimension 14: Roughness (More -> Less)

Dim 14 = μ - 4σ | μ | μ + 4σ
Sample 1, Text 1
Sample 1, Text 2
Sample 2, Text 1
Sample 2, Text 2


Crowd-Sourced Audiobook Corpus (Section 4.4)

We used an audiobook dataset derived from the same subset of LibriVox audiobooks used for the LibriSpeech corpus (Panayotov et al., 2015), but sampled at 24kHz and segmented differently, making it appropriate for TTS instead of speech recognition. The corpus contains recordings from thousands of speakers, with wide variation in recording conditions and speaking style. Speaker identity is often highly correlated with the recording channel and background noise level, since many speakers tended to use the same microphone in a consistent recording environment.

Control of Style, Channel, and Noise Attributes

These samples correspond to Appendix G.1 in our paper. We demonstrate the ability of our model to control a wide variety of attributes, each of which can be independently manipulated by changing the value of one dimension. To traverse each dimension, we use the mean of the same mixture component as the seed, and set the target dimension to three different values. For each value, four utterances are synthesized, conditioned on two speakers and two texts.
Text 1: How many times am I to be compelled to beg that of you!
Text 2: I halted at a window farther down the street and studied him; then returned to pass him again, and watched him patiently.

Dimension 0: Pitch (Low -> High)

Val = μ - 4σ | μ | μ + 4σ
Text 1, Speaker 1
Text 1, Speaker 2
Text 2, Speaker 1
Text 2, Speaker 2

Dimension 1: Band-Pass Filter (High-Pass -> Low-Pass)

Val = μ - 4σ | μ | μ + 4σ
Text 1, Speaker 1
Text 1, Speaker 2
Text 2, Speaker 1
Text 2, Speaker 2

Dimension 2: Reverberation (More -> Less)

Val = μ - 4σ | μ | μ + 4σ
Text 1, Speaker 1
Text 1, Speaker 2
Text 2, Speaker 1
Text 2, Speaker 2

Dimension 4: Noise (More -> Less)

Val = μ - 4σ | μ | μ + 4σ
Text 1, Speaker 1
Text 1, Speaker 2
Text 2, Speaker 1
Text 2, Speaker 2

Dimension 12: Length of Pause between Sentences (Short -> Long)

Val = μ - 2σ | μ | μ + 2σ
Text 1, Speaker 1
Text 1, Speaker 2
Text 2, Speaker 1
Text 2, Speaker 2

Dimension 13: Speaking Rate (High -> Low)

Val = μ - 4σ | μ | μ + 4σ
Text 1, Speaker 1
Text 1, Speaker 2
Text 2, Speaker 1
Text 2, Speaker 2


Synthesizing High-Quality Speech for Low-Quality Speakers

These samples refer to Section 4.4 and Appendix G.2 in our paper. We select several low-quality speakers from the training set whose recordings contain a perceivable amount of background noise. We present the original recording from each speaker, along with synthesized results of that speaker speaking different text content, conditioned on different latent attribute representations. The results of our proposed high-quality synthesis methods are shown in the "Synthesized — Denoised Latent" and "Synthesized — Component 5 Mean" columns. In addition, results of conditioning on inferred latent attribute representations are presented in the "Synthesized — Latent" column. For all three synthesis methods, the observed attribute representation (which encodes speaker information) is inferred from the noisy audio shown in the leftmost column. Our proposed methods generate clean speech for these low-quality speakers.
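Our understanding of the "Denoised Latent" method, sketched below: keep the inferred latent but overwrite the acoustic-condition dimensions (band-pass, reverberation, and noise in this model, per the traversals above) with the corresponding values from a clean component's mean; "Component 5 Mean" instead conditions on that mean directly. The helper name and exact dimension indices are illustrative:

    import numpy as np

    def denoise_latent(z_inferred, clean_mean, noise_dims=(1, 2, 4)):
        """Overwrite acoustic-condition dimensions of an inferred latent with
        the clean component's values, keeping style dimensions intact."""
        z = z_inferred.copy()
        z[list(noise_dims)] = clean_mean[list(noise_dims)]
        return z

    # Placeholders: z_inferred would come from the posterior network run on
    # the speaker's noisy recording; clean_mean is the mean of a clean
    # mixture component (component 5 in the paper).
    z_inferred = np.zeros(16)
    clean_mean = np.zeros(16)
    z_denoised = denoise_latent(z_inferred, clean_mean)  # "Denoised Latent"
    z_component5 = clean_mean                            # "Component 5 Mean"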
Text 1: Kenneth decided that he was ill at ease and in a state of dogged self-repression.
Text 2: The hotel is situated at an elevation of thirty-five hundred feet above the sea, and was at that time forty miles from the railroad.
Original | Synthesized — Latent | Synthesized — Denoised Latent | Synthesized — Component 5 Mean
Speaker 1447, Text 1
Speaker 1447, Text 2
Speaker 426, Text 1
Speaker 426, Text 2
Speaker 7178, Text 1
Speaker 7178, Text 2
Speaker 1578, Text 1
Speaker 1578, Text 2
Speaker 78, Text 1
Speaker 78, Text 2
Speaker 669, Text 1
Speaker 669, Text 2


Inferring Speaker Representations of Unseen Speakers

These samples refer to Section 4.4 and Appendix G.2 in our paper. We demonstrate that our model can generate utterances that resemble the voice of a reference utterance by conditioning generation on the observed attribute representation inferred from it. We present examples of unseen speakers: each row corresponds to one speaker, showing the reference utterance and two synthesized utterances from left to right.
Text 1: Kenneth decided that he was ill at ease and in a state of dogged self-repression.
Text 2: The hotel is situated at an elevation of thirty-five hundred feet above the sea, and was at that time forty miles from the railroad.
Original | Synthesized, Text 1 | Synthesized, Text 2
Speaker 3570
Speaker 3575
Speaker 4970
Speaker 4992
Speaker 1320
Speaker 7021
Speaker 7729
Speaker 8230