Authors: Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang
Abstract:
This paper proposes a neural end-to-end text-to-speech model which can control latent attributes in the generated speech that are rarely annotated in the training data (e.g. speaking styles, accents, background noise level, and recording conditions). The model is formulated as a conditional generative model with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled, fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation of the proposed model demonstrates its ability to control the aforementioned attributes. In particular, it is capable of consistently synthesizing high-quality clean speech regardless of the quality of the training data for the target speaker.
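The two-level latent prior described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the number of components, the latent dimensionality, and the prior parameters are placeholder values (in the real model they are learned during training).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not from the paper: K mixture components,
# D-dimensional latent attribute representation.
K, D = 10, 16

# Stand-ins for learned prior parameters.
mixture_weights = np.full(K, 1.0 / K)          # p(y): categorical over attribute groups
means = rng.normal(size=(K, D))                # per-component means mu_k
stds = np.abs(rng.normal(size=(K, D))) + 0.1   # per-component diagonal stds sigma_k

def sample_latent():
    """Two-level sampling: first the attribute group y,
    then z | y ~ N(mu_y, diag(sigma_y^2))."""
    y = rng.choice(K, p=mixture_weights)
    z = means[y] + stds[y] * rng.normal(size=D)
    return y, z

y, z = sample_latent()
```

Marginalizing over the categorical variable `y` gives exactly a GMM prior over `z`, which is why the abstract describes the hierarchy as a Gaussian mixture latent distribution.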
We used a proprietary dataset of 385 hours of high-quality English speech from 84 professional voice talents with accents from the United States (US), Great Britain (GB), Australia (AU), and Singapore (SG). Speaker labels were not seen during training, and were only used for evaluation.
We present random samples drawn from several mixture components, each of which models a speaker cluster. These samples correspond to Appendix D.1. For text conditioning, we choose a passage that emphasizes accent differences.
Text: The fake lawyer from New Orleans is caught again.
We demonstrate the ability to independently control a latent attribute by changing a single dimension of a latent attribute representation while keeping all other dimensions fixed. The values set to the target dimension are shown in the first row, where μ and σ denote the mean and standard deviation of that dimension, respectively. These samples correspond to Appendix D.2.
Text: The fake lawyer from New Orleans is caught again.
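The single-dimension traversal above can be sketched as follows. The dimension index and component statistics are illustrative placeholders; in the real model, μ and σ come from the learned mixture component.

```python
import numpy as np

# Hypothetical component statistics (in the real model, the learned mean and
# std of one mixture component); sizes and indices are illustrative.
D = 16
mu = np.zeros(D)
sigma = np.ones(D)

def traverse(z, dim, k):
    """Return a copy of z with dimension `dim` set to mu[dim] + k * sigma[dim],
    leaving every other dimension fixed."""
    z_new = z.copy()
    z_new[dim] = mu[dim] + k * sigma[dim]
    return z_new

# Seed latent, then sweep one dimension through mu - 4*sigma, mu, mu + 4*sigma.
z = mu + sigma * np.random.default_rng(1).normal(size=D)
variants = [traverse(z, dim=3, k=k) for k in (-4, 0, 4)]
```

Because only one coordinate changes between variants, any audible difference between the synthesized utterances can be attributed to that single latent dimension.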
We artificially generate training sets using a room simulator to add background noise and reverberation to clean speech from the multi-speaker English corpus above. Noise was added to a random selection of 50% of utterances by each speaker, holding out two speakers (one male and one female) for whom noise was added to all of their utterances. In this experiment, we provided speaker labels as input to the decoder, and only expect the latent attribute representations to capture the acoustic condition of each utterance.
These samples correspond to Section 4.2.1 / Appendix E.1. As shown in Figure 3 in our paper, the mixture components form two distinct clusters, one for clean speech and one for noisified speech. Here we select one component from each cluster, and draw three latent attribute representation samples from each component. For each sample, we synthesize one utterance conditioning on three speakers, two of which are the held-out speakers. All three samples drawn from a noisy component generate noisy speech, where the type of noise is consistent regardless of the conditioned speaker. All three samples drawn from a clean component generate clean speech, even for the two held-out speakers that have no clean training data, as discussed in Section 4.2.3 in our paper.
Text: This model is trained on multi-speaker English data.
These samples correspond to Section 4.2.2 / Appendix E.2. We demonstrate that the level of noise can be controlled by changing one dimension, which is automatically determined by per-dimension linear discriminant analysis. The values set to the noise-level dimension are shown in the first row.
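A per-dimension linear discriminant (Fisher) score can identify which latent dimension separates clean from noisy utterances. The sketch below uses toy data in which, by construction, only dimension 5 shifts with noise; the real procedure would score latent representations inferred from actual clean and noisy utterances.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for inferred latent representations: by construction, the
# noise level only moves dimension 5 (an assumption for illustration).
D, n = 16, 200
z_clean = rng.normal(0.0, 1.0, size=(n, D))
z_noisy = rng.normal(0.0, 1.0, size=(n, D))
z_noisy[:, 5] += 6.0  # noisy utterances shift along one dimension

def fisher_scores(a, b):
    """Per-dimension Fisher discriminant score:
    squared between-class separation over within-class scatter."""
    mu_a, mu_b = a.mean(0), b.mean(0)
    var_a, var_b = a.var(0), b.var(0)
    return (mu_a - mu_b) ** 2 / (var_a + var_b + 1e-8)

scores = fisher_scores(z_clean, z_noisy)
noise_dim = int(np.argmax(scores))  # the most class-discriminative dimension
```

The dimension with the highest score is then the one traversed to control the noise level while leaving the other attributes untouched.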
We evaluated the ability of the proposed model to sample and control speaking styles. A 147-hour single-speaker US English audiobook dataset from the 2013 Blizzard Challenge, recorded by the professional speaker Catherine Byers, is used for training.
These samples correspond to Appendix F.2 in our paper. We demonstrate the ability of our model to synthesize speech that resembles the prosody or style of a given reference utterance. The first row of audio samples contains the reference utterances. Conditioning on each style, we synthesize four utterances with text content different from that of the reference; they are shown in the same column as the reference utterance.
Reference 1 Text: I haven't the least idea what you're talking about, said Alice.
Reference 2 Text: "That's made a-purpose," said the Djinn, "all because you missed those three days.
Reference 3 Text: "And I shall get my courage," said the Lion thoughtfully.
Reference 4 Text: She was not going through any acute mental process or reasoning with herself, nor was she striving to explain to her satisfaction the motive of her action.
Synthesized 1 Text: By water in the midst of water!
Synthesized 2 Text: We must burn the house down! said the Rabbit's voice; and Alice called out as loud as she could, If you do.
Synthesized 3 Text: And she began fancying the sort of thing that would happen: Miss Alice!
Synthesized 4 Text: She tasted a bite, and she read a word or two, and she sipped the amber wine and wiggled her toes in the silk stockings.
Here we draw ten random latent attribute representations from the prior, and synthesize three utterances for each sample. These samples refer to Section 4.3.1 / Appendix F.3 in our paper.
Each column conditions on the same text, and each row conditions on the same latent attribute representation.
The ten samples show wide variation of speaking rate, rhythm, pitch, and tone.
When conditioning on the same latent attribute representation, the style is consistent across utterances of different texts.
Text 1: Lady Jane Grey had carried fashion to the point of knowing Hebrew.
Text 2: And how soon was the alarm raised along the countryside?
Text 3: And as she raised one slim white hand to brush back some wisps that floated by her face, I saw distinctly the webs between her fingers.
These samples refer to Section 4.3.1 / Appendix F.4 in our paper. We show that several aspects of speaking style/prosody can be controlled by changing the value of one dimension of the latent attribute representation. For each property, we generate two random latent attribute representations from the prior as seeds (sample 1 and sample 2). For each seed, we set the target dimension to three different values, which are shown in the first row, and synthesize two utterances (text 1 and text 2) for each value.
Text 1: "Luck has taken us into its own hands," Eric laughed.
Text 2: Mrs. Lynde drove home, meeting several people on the road and stopping to tell them about the hall.
Dimension 3: Deepness, Masculinity (Less -> More)
[Audio grid — rows: Sample 1, Text 1; Sample 1, Text 2; Sample 2, Text 1; Sample 2, Text 2 — columns: Dim 3 = μ − 4σ, μ, μ + 4σ.]
Dimension 7: Speed, Emphasis of Ending (Fast -> Slow)
We used an audiobook dataset derived from the same subset of LibriVox audiobooks used for the LibriSpeech corpus (Panayotov et al., 2015), but sampled at 24kHz and segmented differently, making it appropriate for TTS instead of speech recognition. The corpus contains recordings from thousands of speakers, with wide variation in recording conditions and speaking style. Speaker identity is often highly correlated with the recording channel and background noise level, since many speakers tended to use the same microphone in a consistent recording environment.
These samples correspond to Appendix G.1 in our paper. We demonstrate the ability of our model to control a wide variety of attributes, each of which can be independently manipulated by changing the value of one dimension. To traverse each dimension, we use the mean of the same mixture component as the seed, and set the target dimension to three different values. For each value, four utterances are synthesized, conditioning on two speakers and two texts.
Text 1: How many times am I to be compelled to beg that of you!
Text 2: I halted at a window farther down the street and studied him; then returned to pass him again, and watched him patiently.
These samples refer to Section 4.4 and Appendix G.2 in our paper. We select several low-quality speakers from the training set, whose recordings contain a perceivable amount of background noise. We present the original recording from each speaker, along with synthesized results of that speaker speaking different text content, conditioned on different latent attribute representations. The results of our proposed high-quality synthesis methods are shown in the "Synthesized — Denoised Latent" and "Synthesized — Component 5 mean" rows. In addition, results of conditioning on inferred latent attribute representations are presented in the "Synthesized — Latent" row. For all three synthesis methods, the observed attribute representation (which encodes speaker information) is inferred from the noisy audio shown in the leftmost column. Our proposed methods can generate clean speech for these low-quality speakers.
Text 1: Kenneth decided that he was ill at ease and in a state of dogged self-repression.
Text 2: The hotel is situated at an elevation of thirty-five hundred feet above the sea, and was at that time forty miles from the railroad.
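The two high-quality synthesis strategies above can be sketched as latent-space edits. This is a plausible reading of the row labels, not the paper's exact implementation: the noise-dimension index, the latent values, and the clean-component mean are all placeholders.

```python
import numpy as np

D = 16
noise_dim = 5   # dimension identified by the per-dimension LDA (assumed index)
z_inferred = np.random.default_rng(3).normal(size=D)  # latent inferred from a noisy recording
mu_clean = np.zeros(D)  # mean of a clean mixture component (placeholder values)

# "Denoised Latent": keep the inferred latent, but overwrite the noise
# dimension with its value under the clean component's mean.
z_denoised = z_inferred.copy()
z_denoised[noise_dim] = mu_clean[noise_dim]

# "Component mean": discard the inferred latent entirely and condition on
# the clean component's mean instead.
z_component_mean = mu_clean.copy()
```

In both cases the observed (speaker) representation inferred from the noisy audio is left untouched, so the synthesized voice matches the target speaker while the acoustic condition is steered toward clean speech.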
These samples refer to Section 4.4 and Appendix G.2 in our paper. We demonstrate that our model can generate utterances resembling the voice of a reference utterance by conditioning generation on the observed attribute representation inferred from it. We present examples of ten unseen speakers. Each row corresponds to one speaker, showing the reference utterance and two synthesized utterances from left to right.
Text 1: Kenneth decided that he was ill at ease and in a state of dogged self-repression.
Text 2: The hotel is situated at an elevation of thirty-five hundred feet above the sea, and was at that time forty miles from the railroad.