Paper: arXiv
Authors: Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, and Tom Bagby
Abstract: Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style transfer, and generation of natural-sounding prior samples. For multi-speaker models, Capacitron is able to preserve target speaker identity during inter-speaker prosody transfer and when drawing samples from the latent prior. Lastly, we introduce a method for decomposing embedding capacity hierarchically across two sets of latents, allowing a portion of the latent variability to be specified and the remaining variability sampled from a learned prior. Audio examples are available on the web.
This page contains a set of audio samples in support of the paper; we suggest listening to the samples while reading the paper. All reference and target utterances were unseen during training.
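For orientation, the capacity values, C, referenced throughout these samples come from constraining the KL term of the variational objective. Below is a minimal sketch (PyTorch; the names capacity_constrained_loss and lagrange_param are ours, not the paper's code) of one common way to hold the KL near a target capacity with a learned Lagrange multiplier, in the spirit of the constrained objective the paper describes.

```python
import torch
import torch.nn.functional as F

def capacity_constrained_loss(log_likelihood, kl, lagrange_param, capacity_c):
    """Negative ELBO with the KL term pushed toward a target capacity C.

    beta = softplus(lagrange_param) keeps the multiplier non-negative.
    The detach() calls split the optimization in two: model parameters
    are trained against a fixed beta, while beta grows when KL exceeds C
    and shrinks when KL falls below it.
    """
    beta = F.softplus(lagrange_param)
    primary = -log_likelihood + beta.detach() * kl
    dual = -beta * (kl.detach() - capacity_c)
    return primary + dual
```

Minimizing this loss jointly over the model parameters and lagrange_param drives the KL, and hence the embedding capacity, toward C.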
Here we show that models with and without text dependencies in the variational posterior (Var+Txt and Var, respectively) exhibit increasing similarity to the reference as embedding capacity, C, is increased. The baseline model does not use a reference embedding.
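For concreteness, here is a minimal sketch of the two posterior variants compared in this section (the class name and the summary-vector interface are assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class ReferencePosterior(nn.Module):
    """Var:     q(z | audio)        when use_text=False.
    Var+Txt: q(z | audio, text)  when use_text=True, matching the form
    of the true posterior, which depends on everything the decoder sees."""
    def __init__(self, audio_dim, text_dim, latent_dim, use_text):
        super().__init__()
        self.use_text = use_text
        in_dim = audio_dim + (text_dim if use_text else 0)
        self.to_params = nn.Linear(in_dim, 2 * latent_dim)

    def forward(self, audio_summary, text_summary=None):
        h = (torch.cat([audio_summary, text_summary], dim=-1)
             if self.use_text else audio_summary)
        mu, log_scale = self.to_params(h).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_scale.exp())
```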
Reference text:"Not in that way. Reference Baseline |
|
|||||||||||||||
Reference text:That's right! shouted the Queen. Reference Baseline |
|
|||||||||||||||
Reference text:"Never again shall Eleanor Lavish be a friend of mine." Reference Baseline |
|
|||||||||||||||
Reference text:'Try her again, then.' Reference Baseline |
|
|||||||||||||||
Reference text:How dare you!" Reference Baseline |
|
|||||||||||||||
Reference text:'Speak, won't you!' cried the King. 'How are they getting on with the fight?' Reference Baseline |
|
|||||||||||||||
Reference text:"Oh, Gabriel, how could you serve me so unkindly!" Reference Baseline |
|
|||||||||||||||
Reference text:And how the parson would pray! Reference Baseline |
|
|||||||||||||||
Reference text:"I've swallowed a pollywog. Reference Baseline |
|
In this section, we compare models with and without text dependencies in the variational posterior (Var+Txt and Var, respectively) when transferring between utterances with different text. The examples show how the quality of the samples from Var deteriorates as the capacity, C, is increased: without text conditioning, the embedding must capture information that is entangled with the reference text, and at high capacity this conflicts with the target text. This is especially noticeable when there is a significant mismatch between the lengths of the two utterances.
Reference text:"Not in that way. Reference Target text:And the distinction is not quite so much against the candour and common sense of the world as appears at first; for a very narrow income has a tendency to contract the mind, and sour the temper. Baseline |
|
|||||||||||||||
Reference text:Quick, now! Reference Target text:Yet he might not have been so perfectly humane, so thoughtful in his generosity, so full of kindness and tenderness amidst his passion for adventurous exploit, had she not unfolded to him the real loveliness of beneficence and made the doing good the end and aim of his soaring ambition. Baseline |
|
|||||||||||||||
Reference text:That's right! shouted the Queen. Reference Target text:CHAPTER 1 Baseline |
|
|||||||||||||||
Reference text:"Never again shall Eleanor Lavish be a friend of mine." Reference Target text:He left them, carefully closing the front door; and when they looked through the hall window, they saw him go up the drive and begin to climb the slopes of withered fern behind the house. Baseline |
|
|||||||||||||||
Reference text:"Oh, Gabriel, how could you serve me so unkindly!" Reference Target text:Was it that the far clear voice had meant? Baseline |
|
Here we sample the latent embedding from the prior distribution for models with different capacities. The quality of the samples from the Var model deteriorates as the capacity, C, is increased, while Var+Txt remains stable. This is especially noticeable for very short and very long utterances.
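Mechanically, these samples replace posterior inference with a draw from the prior. A hedged sketch (the decoder callable and the standard-normal prior are assumptions; a learned prior would replace the N(0, I) below):

```python
import torch

def sample_from_prior(decoder, text_encoding, latent_dim, n_samples=1):
    # No reference audio is used: each latent is drawn from the prior and
    # decoded together with the text encoding.
    prior = torch.distributions.Normal(torch.zeros(latent_dim),
                                       torch.ones(latent_dim))
    return [decoder(text_encoding, prior.sample()) for _ in range(n_samples)]
```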
[Audio players omitted: for each text below, the page provides the Baseline output and prior samples from Var / Var+Txt at increasing capacity C.]

Text: CHAPTER 1
Text: With my sharp and long sight, as I look up, I have seen it distinctly; now if it happens to hurt the young lady, and I think it must, here am I, here are my file, my punch, my nippers; I will make it round and blunt, if her ladyship pleases; no longer the tooth of a fish, but of a beautiful young lady as she is.
Text: I charge you by all that is sacred, not to attempt concealment.
Text: "Never again shall Eleanor Lavish be a friend of mine."
Text: How dare you!"
In this experiment, we demonstrate prosody transfer between speakers. When the posterior is conditioned on the reference speaker (Var+Txt+Spk), the embedding does not need to encode speaker identity, so the output preserves the pitch range of the target speaker; without speaker conditioning (Var+Txt), the output tends to match the pitch of the reference speaker. As the capacity, C, is increased, the difference becomes more apparent.
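A sketch of the inter-speaker transfer setup (all names are illustrative): the posterior sees the reference speaker, while the decoder is conditioned on the target speaker, so the latent need not carry identity.

```python
def transfer_prosody(posterior, decoder, ref, target):
    # Var+Txt+Spk: infer z with the *reference* speaker in the posterior...
    z = posterior(ref.audio_enc, ref.text_enc, ref.speaker_emb).sample()
    # ...then decode with the *target* speaker embedding.
    return decoder(target.text_enc, z, target.speaker_emb)
```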
[Audio players omitted: for each reference text below, the page provides the reference recording and Var+Txt / Var+Txt+Spk outputs at increasing capacity C.]

Reference text: I ventured onto the platform.
Reference text: Quick, now!
Reference text: Good job! Sounds like Snuffles is stuffed!
Reference text: How many licks does it take to get to the center of a Tootsie Pop?
Reference text: Moby Dick's House of Kabob is at US two eight one Santo.
Reference text: "Half-past six o'clock.
Reference text: "I've swallowed a pollywog.
Here we synthesize from multi-speaker models by sampling the latent embeddings from the prior. The model that includes speaker dependencies in the posterior (Var+Txt+Spk) does a better job of preserving target speaker identity and pitch range than the model without posterior speaker dependencies (Var+Txt). Notice also how the samples become more expressive (and more erratic) as embedding capacity is increased.
[Audio players omitted: for each text below, the page provides prior samples from Var+Txt / Var+Txt+Spk at increasing capacity C.]

Text: Kanoute, the biggest Malian star at the moment, currently plays for Sevilla F C in Spain's La Liga.
Text: It was a superb adaptation.
Text: Does your shirt have holes in it? No? Then how did you put it on?
In this experiment, we use models with hierarchical latents (high-level latents, ZH, and low-level latents, ZL) and demonstrate the effect of varying CH and CL when transferring via ZH. Increasing CH increases similarity to the reference, while increasing CL increases variation between samples generated using the same ZH. We include groups of 3 samples generated using the same ZH inferred from the reference.
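A sketch of transfer via ZH under this hierarchical decomposition (function and attribute names are assumptions):

```python
def transfer_via_zh(posterior_h, prior_l, decoder, ref, text_enc, n=3):
    # Infer the high-level latent (capacity CH) from the reference once...
    z_h = posterior_h(ref.audio_enc, ref.text_enc).sample()
    # ...then draw the remaining variability (capacity CL) from the learned
    # conditional prior p(z_l | z_h), yielding n distinct realizations.
    return [decoder(text_enc, z_h, prior_l(z_h).sample()) for _ in range(n)]
```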
Reference text:"I've swallowed a pollywog. Reference: Baseline: |
|
||||||||||||
Reference text:"Oh, very well," said the latter resignedly, and the footman proceeded to open the folding tea-table and set out its complicated appointments. Reference: Baseline: |
|
||||||||||||
Reference text:"Not in that way. Reference: Baseline: |
|
||||||||||||
Reference text:I charge you by all that is sacred, not to attempt concealment. Reference: Baseline: |
|
||||||||||||
Reference text:Quick, now! Reference: Baseline: |
|
||||||||||||
Reference text:That's right! shouted the Queen. Reference: Baseline: |
|
In this experiment, we use models with hierarchical latents (ZH and ZL) and demonstrate the effect of varying CH and CL when transferring via ZL (instead of ZH). As the sum CH + CL increases, similarity to the reference increases, while variation across samples generated from the same reference remains low. We include groups of 3 samples generated using different ZL's inferred from the same reference.
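The corresponding sketch for transfer via ZL (again, names assumed): both latents are inferred from the reference, so the effective transfer capacity is roughly CH + CL.

```python
def transfer_via_zl(posterior_h, posterior_l, decoder, ref, text_enc, n=3):
    # Infer both levels from the reference; each of the n outputs uses a
    # different draw of z_l from the same low-level posterior.
    z_h = posterior_h(ref.audio_enc, ref.text_enc).sample()
    q_l = posterior_l(ref.audio_enc, ref.text_enc, z_h)
    return [decoder(text_enc, z_h, q_l.sample()) for _ in range(n)]
```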
Reference text:"I've swallowed a pollywog. Reference Baseline |
|
||||||||||||
Reference text:"Oh, very well," said the latter resignedly, and the footman proceeded to open the folding tea-table and set out its complicated appointments. Reference Baseline |
|
||||||||||||
Reference text:"Not in that way. Reference Baseline |
|
||||||||||||
Reference text:I charge you by all that is sacred, not to attempt concealment. Reference Baseline |
|
||||||||||||
Reference text:Quick, now! Reference Baseline |
|
||||||||||||
Reference text:That's right! shouted the Queen. Reference Baseline |
|