Audio samples from "Fully-hierarchical Fine-grained Prosody Modelling for Interpretable Speech Synthesis"

Paper: arXiv

Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu

Abstract: This paper proposes a fully-hierarchical fine-grained prosody modelling component under the Tacotron 2 framework, which adopts hierarchical latent representations of prosody attributes with interpretation and control at each level. The hierarchical conditioning in this work is not only imposed between different levels, but also introduced across the latent dimensions via an auto-regressive factorization. Reconstruction performance is evaluated with the $F_0$ frame error (FFE) and the mel-cepstral distortion (MCD), which illustrate that the new structure does not degrade the model. Moreover, in addition to qualitative interpretation with spectrograms at different levels, quantitative evaluations that measure the nominated attributes are also provided, which demonstrate improved disentanglement.
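For readers unfamiliar with the FFE metric named in the abstract, the following is a minimal sketch of how it is commonly computed: the fraction of frames that have either a voicing decision error or a pitch deviation of more than 20%. The function name `f0_frame_error`, the convention that unvoiced frames have an F0 of 0, and the assumption that both tracks are already frame-aligned are illustrative choices, not details taken from the paper.

```python
import numpy as np

def f0_frame_error(f0_ref, f0_syn, tol=0.2):
    """Fraction of frames with a voicing error or a pitch error above `tol`.

    Assumptions (not from the paper): both tracks are frame-aligned,
    have equal length, and mark unvoiced frames with F0 == 0.
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)

    voiced_ref = f0_ref > 0
    voiced_syn = f0_syn > 0

    # Voicing decision error: one track is voiced, the other is not.
    voicing_error = voiced_ref != voiced_syn

    # Pitch error: both voiced, but F0 deviates by more than tol (relative).
    both = voiced_ref & voiced_syn
    pitch_error = np.zeros_like(both)
    pitch_error[both] = np.abs(f0_syn[both] - f0_ref[both]) > tol * f0_ref[both]

    return float(np.mean(voicing_error | pitch_error))

# Toy example: frame 2 is a pitch error, frame 4 is a voicing error.
ffe = f0_frame_error([100.0, 100.0, 0.0, 0.0], [100.0, 130.0, 0.0, 90.0])
```

In this toy example two of the four frames are in error, so `ffe` is 0.5. A real evaluation would first extract F0 tracks from the ground-truth and synthesized audio with a pitch tracker.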

Click here for more from the Tacotron team.

 
Note: For the best quality, we strongly recommend listening to the audio samples with headphones.


 

Phone and Word Level Energy Control


Controlled text (phone): The plenty around me, the ease and independence gave me a delightful sense of comfort.
Controlled text (word): The plenty around me, the ease and independence gave me a delightful sense of comfort.
Ground Truth:
Speaker 1 (male):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 1 (male):
Phone Level Energy Control
Speaker 1 (male):
Word Level Energy Control
 
 
Speaker 2 (male):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 2 (male):
Phone Level Energy Control
Speaker 2 (male):
Word Level Energy Control
 
 
Speaker 3 (female):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 3 (female):
Phone Level Energy Control
Speaker 3 (female):
Word Level Energy Control

 

Controlled text (phone): This would have changed the grand result of the war.
Controlled text (word): This would have changed the grand result of the war.
Ground Truth:
Speaker 1 (male):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 1 (male):
Phone Level Energy Control
Speaker 1 (male):
Word Level Energy Control
 
Speaker 2 (male):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 2 (male):
Phone Level Energy Control
Speaker 2 (male):
Word Level Energy Control
 
Speaker 3 (female):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 3 (female):
Phone Level Energy Control
Speaker 3 (female):
Word Level Energy Control

Phone and Word Level Duration Control


Controlled text (phone): The plenty around me, the ease and independence gave me a delightful sense of comfort.
Controlled text (word): The plenty around me, the ease and independence gave me a delightful sense of comfort.
Ground Truth:
Speaker 1 (male):
Copy Synthesis
z = [0, -1, 0] z = [0, 0, 0] z = [0, +1, 0]
Speaker 1 (male):
Phone Level Duration Control
Speaker 1 (male):
Word Level Duration Control
 
 
Speaker 2 (male):
Copy Synthesis
z = [0, -1, 0] z = [0, 0, 0] z = [0, +1, 0]
Speaker 2 (male):
Phone Level Duration Control
Speaker 2 (male):
Word Level Duration Control
 
 
Speaker 3 (female):
Copy Synthesis
z = [0, -1, 0] z = [0, 0, 0] z = [0, +1, 0]
Speaker 3 (female):
Phone Level Duration Control
Speaker 3 (female):
Word Level Duration Control

 

Controlled text (phone): This would have changed the grand result of the war.
Controlled text (word): This would have changed the grand result of the war.
Ground Truth:
Speaker 1 (male):
Copy Synthesis
z = [0, -1, 0] z = [0, 0, 0] z = [0, +1, 0]
Speaker 1 (male):
Phone Level Duration Control
Speaker 1 (male):
Word Level Duration Control
 
Speaker 2 (male):
Copy Synthesis
z = [0, -1, 0] z = [0, 0, 0] z = [0, +1, 0]
Speaker 2 (male):
Phone Level Duration Control
Speaker 2 (male):
Word Level Duration Control
 
Speaker 3 (female):
Copy Synthesis
z = [0, -1, 0] z = [0, 0, 0] z = [0, +1, 0]
Speaker 3 (female):
Phone Level Duration Control
Speaker 3 (female):
Word Level Duration Control

Phone and Word Level F0 Control


Controlled text (phone): The plenty around me, the ease and independence gave me a delightful sense of comfort.
Controlled text (word): The plenty around me, the ease and independence gave me a delightful sense of comfort.
Ground Truth:
Speaker 1 (male):
Copy Synthesis
z = [0, 0, -1] z = [0, 0, 0] z = [0, 0, +1]
Speaker 1 (male):
Phone Level F0 Control
Speaker 1 (male):
Word Level F0 Control
 
 
Speaker 2 (male):
Copy Synthesis
z = [0, 0, -1] z = [0, 0, 0] z = [0, 0, +1]
Speaker 2 (male):
Phone Level F0 Control
Speaker 2 (male):
Word Level F0 Control
 
 
Speaker 3 (female):
Copy Synthesis
z = [0, 0, -1] z = [0, 0, 0] z = [0, 0, +1]
Speaker 3 (female):
Phone Level F0 Control
Speaker 3 (female):
Word Level F0 Control

 

Controlled text (phone): This would have changed the grand result of the war.
Controlled text (word): This would have changed the grand result of the war.
Ground Truth:
Speaker 1 (male):
Copy Synthesis
z = [0, 0, -1] z = [0, 0, 0] z = [0, 0, +1]
Speaker 1 (male):
Phone Level F0 Control
Speaker 1 (male):
Word Level F0 Control
 
Speaker 2 (male):
Copy Synthesis
z = [0, 0, -1] z = [0, 0, 0] z = [0, 0, +1]
Speaker 2 (male):
Phone Level F0 Control
Speaker 2 (male):
Word Level F0 Control
 
Speaker 3 (female):
Copy Synthesis
z = [0, 0, -1] z = [0, 0, 0] z = [0, 0, +1]
Speaker 3 (female):
Phone Level F0 Control
Speaker 3 (female):
Word Level F0 Control

Phone and Word Level Silence Control


Controlled text (phone): The plenty around me, "sil" the ease and independence gave me a delightful sense of comfort.
Controlled text (word): The plenty around me, "sil" the ease and independence gave me a delightful sense of comfort.
Ground Truth:
Speaker 1 (male):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 1 (male):
Phone Level Silence Control
Speaker 1 (male):
Word Level Silence Control
 
Speaker 2 (male):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 2 (male):
Phone Level Silence Control
Speaker 2 (male):
Word Level Silence Control
 
Speaker 3 (female):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 3 (female):
Phone Level Silence Control
Speaker 3 (female):
Word Level Silence Control

 

Controlled text (word): This "sil" would have changed "sil" the grand result of the war.
Ground Truth:
Speaker 1 (male):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 1 (male):
Word Level Silence Control
 
Speaker 2 (male):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 2 (male):
Word Level Silence Control
 
Speaker 3 (female):
Copy Synthesis
z = [-1, 0, 0] z = [0, 0, 0] z = [+1, 0, 0]
Speaker 3 (female):
Word Level Silence Control

 

Phone and Word Level Sampling Comparison


Random F0 Samples at Phone and Word Level

The value of the last latent dimension is randomly sampled from a standard normal distribution.
Text: The plenty around me, the ease and independence gave me a delightful sense of comfort.
Speaker 1 (male) Phone Level
Speaker 1 (male) Word Level

Random Duration Samples at Phone and Word Level

The value of the second latent dimension is randomly sampled from a standard normal distribution.
Text: The plenty around me, the ease and independence gave me a delightful sense of comfort.
Speaker 1 (male) Phone Level
Speaker 1 (male) Word Level
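
The sampling setup described above can be sketched as follows: one latent dimension is drawn from a standard normal distribution while the others stay at zero. This is only an illustration under stated assumptions. The dimension ordering [energy, duration, F0] is inferred from the control vectors shown on this page, and the helper `sample_one_dimension`, the one-latent-vector-per-phone-or-word layout, and the unit count of 5 are hypothetical rather than taken from the paper's implementation.

```python
import numpy as np

# Assumed ordering, inferred from the control vectors on this page:
# z = [energy, duration, F0].
ATTRIBUTES = ("energy", "duration", "f0")

def sample_one_dimension(num_units, attribute, rng):
    """Return latents of shape (num_units, 3) with one sampled dimension.

    Hypothetical helper: each row is the latent for one phone or word.
    The chosen attribute is drawn from N(0, 1); the others are left at 0.
    """
    z = np.zeros((num_units, len(ATTRIBUTES)))
    z[:, ATTRIBUTES.index(attribute)] = rng.standard_normal(num_units)
    return z

rng = np.random.default_rng(0)

# "Last latent dimension" sampled: random F0 variation.
z_f0 = sample_one_dimension(5, "f0", rng)

# "Second latent dimension" sampled: random duration variation.
z_dur = sample_one_dimension(5, "duration", rng)
```

The same layout covers the fixed control settings used in the earlier sections: for example, setting the first column to +1 and the rest to 0 corresponds to z = [+1, 0, 0].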