Audio samples from "Generating Natural and Diverse Text-to-Speech Samples Using A Quantized Fine-grained VAE and Auto-regressive Prosody Prior"

Paper: arXiv

Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu

Abstract: This paper proposes a vector-quantization approach for generating natural and diverse audio samples, based on the fine-grained VAE structure in end-to-end TTS synthesis systems such as Tacotron 2. The fine-grained VAE structure extracts latent prosody features at the phoneme level, and vector quantization is applied to those latent features. In addition, prosody discontinuity across phonemes during generation is mitigated by sampling from an auto-regressive (AR) prior instead of an independent standard Gaussian. The AR prior is trained in either the continuous or the discrete latent space, separately from the training of the posterior. This page presents audio samples generated by representative models covered in the paper.
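
As a rough illustration of the vector-quantization step described in the abstract, the sketch below maps each phoneme-level latent vector to its nearest codebook entry. It is a minimal sketch only: the names `quantize` and `squared_distance`, the toy codebook, the two-dimensional latents, and the use of squared Euclidean distance are all assumptions made for illustration and are not taken from the paper's implementation.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Replace each per-phoneme latent with its nearest codebook entry.

    Returns the chosen codebook indices and the quantized vectors.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]
    return indices, [codebook[k] for k in indices]


# Hypothetical toy data: three codebook entries, one latent per phoneme.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
latents = [(0.9, 1.2), (-0.8, 0.4), (0.1, -0.1)]

indices, quantized = quantize(latents, codebook)
print(indices)    # [1, 2, 0]
print(quantized)  # [(1.0, 1.0), (-1.0, 0.5), (0.0, 0.0)]
```

In a trained model the codebook would be learned jointly with the encoder; here it is fixed by hand so that the nearest-neighbour lookup is easy to follow.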


 
Note: For the best quality, we strongly recommend listening to the audio samples with headphones.


 

Baseline Fine-grained VAE with Independent Sampling
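
The "Scale" values listed for the samples in this section are not defined on this page. The sketch below assumes that the scale is a multiplier on the standard deviation of the independent standard Gaussian prior from which each phoneme-level latent is drawn. Under that assumption, a scale of 1.0 samples from the full prior, 0.2 stays close to the prior mean, and 0.0 collapses every latent to the mean. The function name `sample_independent_latents` and the dimensions used are hypothetical.

```python
import random
from typing import List


def sample_independent_latents(
    num_phonemes: int, latent_dim: int, scale: float, seed: int = 0
) -> List[List[float]]:
    """Draw one latent vector per phoneme, independently of its neighbours.

    Each component is a standard Gaussian draw multiplied by `scale`
    (assumed here to scale the prior's standard deviation).
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(latent_dim)]
        for _ in range(num_phonemes)
    ]


for scale in (1.0, 0.2, 0.0):
    latents = sample_independent_latents(num_phonemes=4, latent_dim=3, scale=scale)
    spread = max(abs(v) for z in latents for v in z)
    print(f"scale={scale}: largest |latent value| = {spread:.3f}")
```

Because each phoneme's latent is drawn without reference to its neighbours, larger scales can produce abrupt prosody changes from one phoneme to the next, which is the discontinuity the abstract says the AR prior is meant to mitigate.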


Sampled text: The plenty around me, the ease and independence gave me a delightful sense of comfort.
 
Speaker 1 (male):
Copy Synthesis
Scale = 1.0:
Scale = 0.2:
Scale = 0.0:
 
Speaker 2 (male):
Copy Synthesis
Scale = 1.0:
Scale = 0.2:
Scale = 0.0:
 
Speaker 3 (female):
Copy Synthesis
Scale = 1.0:
Scale = 0.2:
Scale = 0.0:

 

Sampled text: Because, I don't know where it ends, and also because it is full of poachers.
 
Speaker 1 (male):
Copy Synthesis
Scale = 0.2:
Scale = 0.0:
 
Speaker 2 (male):
Copy Synthesis
Scale = 0.2:
Scale = 0.0:
 
Speaker 3 (female):
Copy Synthesis
Scale = 0.2:
Scale = 0.0:

 

Sampled text: This would have changed the grand result of the war.
 
Speaker 1 (male):
Copy Synthesis
Scale = 0.2:
Scale = 0.0:
 
Speaker 2 (male):
Copy Synthesis
Scale = 0.2:
Scale = 0.0:
 
Speaker 3 (female):
Copy Synthesis
Scale = 0.2:
Scale = 0.0:

Baseline Fine-grained VAE with Sampling from AR Prior


Sampled text: The plenty around me, the ease and independence gave me a delightful sense of comfort.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

 

Sampled text: Because, I don't know where it ends, and also because it is full of poachers.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

 

Sampled text: This would have changed the grand result of the war.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

VQVAE with Independent Samples


Sampled text: The plenty around me, the ease and independence gave me a delightful sense of comfort.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

 

Sampled text: Because, I don't know where it ends, and also because it is full of poachers.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

 

Sampled text: This would have changed the grand result of the war.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

VQVAE with AR Prior in the Continuous Space


Sampled text: The plenty around me, the ease and independence gave me a delightful sense of comfort.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

 

Sampled text: Because, I don't know where it ends, and also because it is full of poachers.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

 

Sampled text: This would have changed the grand result of the war.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

VQVAE with AR Prior in the Discrete Space


Sampled text: The plenty around me, the ease and independence gave me a delightful sense of comfort.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

 

Sampled text: Because, I don't know where it ends, and also because it is full of poachers.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples:

 

Sampled text: This would have changed the grand result of the war.
 
Speaker 1 (male):
Copy Synthesis
Random Samples:
Speaker 2 (male):
Copy Synthesis
Random Samples:
Speaker 3 (female):
Copy Synthesis
Random Samples: