Audio samples from "Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis"

Paper: arXiv

Authors: Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, and Tom Bagby

Abstract: Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style transfer, and generation of natural-sounding prior samples. For multi-speaker models, Capacitron is able to preserve target speaker identity during inter-speaker prosody transfer and when drawing samples from the latent prior. Lastly, we introduce a method for decomposing embedding capacity hierarchically across two sets of latents, allowing a portion of the latent variability to be specified and the remaining variability sampled from a learned prior. Audio examples are available on the web.

This page contains audio samples that accompany the paper; we suggest listening to them while reading it. All reference and target utterances were unseen during training.

Click here for more from the Tacotron team.

Contents

1. Single-speaker Same-text Prosody Transfer

Here we show that models with and without text dependencies in the variational posterior (Var+Txt and Var, respectively) exhibit increasing similarity to the reference as embedding capacity, C, is increased. The baseline model does not use a reference embedding.
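As a reminder of why a variational model can limit capacity at all, the identity below is a sketch in our own notation (x for the audio, z for the embedding, q(z|x) for the variational posterior, p(z) for the prior, q(z) for the aggregated posterior); it is a standard result and not copied from the paper, whose exact objective may differ in form.

```latex
% Notation here is an assumption made for illustration, not the paper's.
\begin{align}
\mathbb{E}_{x}\!\left[ D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big) \right]
  &= I_q(X; Z) + D_{\mathrm{KL}}\big(q(z)\,\|\,p(z)\big) \\
  &\ge I_q(X; Z)
\end{align}
```

Since the second KL term is non-negative, keeping the average KL term at or below a limit C also keeps the mutual information between the data and the embedding at or below C, which is the sense in which C is an upper bound on embedding capacity.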

Reference text:
"Not in that way.
Reference

Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
That's right! shouted the Queen.
Reference

Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
"Never again shall Eleanor Lavish be a friend of mine."
Reference

Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
'Try her again, then.'
Reference

Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
How dare you!"
Reference

Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
'Speak, won't you!' cried the King. 'How are they getting on with the fight?'
Reference

Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
"Oh, Gabriel, how could you serve me so unkindly!"
Reference

Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
And how the parson would pray!
Reference

Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
"I've swallowed a pollywog.
Reference

Baseline
Var Var+Txt
C=10
C=50
C=100
C=300

2. Single-speaker Inter-text Style Transfer

In this section, we compare models with and without text dependencies in the variational posterior (Var+Txt and Var, respectively) when transferring between utterances with different text. The examples show how the quality of the samples from Var deteriorates as the capacity, C, is increased. This is especially noticeable when there is a significant mismatch in the length of the two utterances.

Reference text:
"Not in that way.
Reference

Target text:
And the distinction is not quite so much against the candour and common sense of the world as appears at first; for a very narrow income has a tendency to contract the mind, and sour the temper.
Baseline

Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
Quick, now!
Reference

Target text:
Yet he might not have been so perfectly humane, so thoughtful in his generosity, so full of kindness and tenderness amidst his passion for adventurous exploit, had she not unfolded to him the real loveliness of beneficence and made the doing good the end and aim of his soaring ambition.
Baseline

Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
That's right! shouted the Queen.
Reference

Target text:
CHAPTER 1
Baseline

Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
"Never again shall Eleanor Lavish be a friend of mine."
Reference

Target text:
He left them, carefully closing the front door; and when they looked through the hall window, they saw him go up the drive and begin to climb the slopes of withered fern behind the house.
Baseline

Var Var+Txt
C=10
C=50
C=100
C=300
Reference text:
"Oh, Gabriel, how could you serve me so unkindly!"
Reference

Target text:
Was it that the far clear voice had meant?
Baseline

Var Var+Txt
C=10
C=50
C=100
C=300

3. Single-speaker Prior Samples

Here we sample the latent embedding from the prior distribution for models with different capacities. The quality of the samples from the Var model deteriorates as capacity, C, is increased, while Var+Txt remains stable. This is especially noticeable for very short and very long utterances.
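A minimal sketch of what drawing an embedding from the prior can look like, assuming a standard normal prior with independent dimensions. The function name, the embedding size of 32, and the seeds are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def sample_prior_embedding(embedding_dim, seed=None):
    """Draw one embedding from an assumed standard normal prior N(0, I)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]


# Hypothetical embedding size, chosen only for this example.
EMBEDDING_DIM = 32

# Three independent prior samples; each would condition one synthesized
# rendition of the same text.
prior_samples = [sample_prior_embedding(EMBEDDING_DIM, seed=s) for s in range(3)]
```

No reference audio is involved here: the embedding comes only from the prior, so differences between renditions of the same text reflect the variation the prior allows.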

Text:
CHAPTER 1
Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Text:
With my sharp and long sight, as I look up, I have seen it distinctly; now if it happens to hurt the young lady, and I think it must, here am I, here are my file, my punch, my nippers; I will make it round and blunt, if her ladyship pleases; no longer the tooth of a fish, but of a beautiful young lady as she is.
Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Text:
I charge you by all that is sacred, not to attempt concealment.
Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Text:
"Never again shall Eleanor Lavish be a friend of mine."
Baseline
Var Var+Txt
C=10
C=50
C=100
C=300
Text:
How dare you!"
Baseline
Var Var+Txt
C=10
C=50
C=100
C=300

4. Multi-speaker Same-text Prosody Transfer

In this experiment, we demonstrate prosody transfer between speakers. When the variational posterior is also conditioned on the reference speaker (Var+Txt+Spk), the output preserves the pitch range of the target speaker, whereas without speaker conditioning (Var+Txt), the output tends to match the pitch of the reference speaker. As the capacity, C, is increased, the difference becomes more apparent.

Reference text:
I ventured onto the platform.
Reference:

Australian female British female North American male British male
Base
Var+Txt (C=50)
Var+Txt+Spk (C=50)
Var+Txt (C=150)
Var+Txt+Spk (C=150)
Var+Txt (C=300)
Var+Txt+Spk (C=300)
Reference text:
Quick, now!
Reference:

Australian female British female North American male British male
Base
Var+Txt (C=50)
Var+Txt+Spk (C=50)
Var+Txt (C=150)
Var+Txt+Spk (C=150)
Var+Txt (C=300)
Var+Txt+Spk (C=300)
Reference text:
Good job! Sounds like Snuffles is stuffed!
Reference:

Australian female British female North American male British male
Base
Var+Txt (C=50)
Var+Txt+Spk (C=50)
Var+Txt (C=150)
Var+Txt+Spk (C=150)
Var+Txt (C=300)
Var+Txt+Spk (C=300)
Reference text:
How many licks does it take to get to the center of a Tootsie Pop?
Reference:

Australian female British female North American male British male
Base
Var+Txt (C=50)
Var+Txt+Spk (C=50)
Var+Txt (C=150)
Var+Txt+Spk (C=150)
Var+Txt (C=300)
Var+Txt+Spk (C=300)
Reference text:
Moby Dick's House of Kabob is at US two eight one Santo.
Reference:

Australian female British female North American male British male
Base
Var+Txt (C=50)
Var+Txt+Spk (C=50)
Var+Txt (C=150)
Var+Txt+Spk (C=150)
Var+Txt (C=300)
Var+Txt+Spk (C=300)
Reference text:
"Half-past six o'clock.
Reference:

Australian female British female North American male British male
Base
Var+Txt (C=50)
Var+Txt+Spk (C=50)
Var+Txt (C=150)
Var+Txt+Spk (C=150)
Var+Txt (C=300)
Var+Txt+Spk (C=300)
Reference text:
"I've swallowed a pollywog.
Reference:

Australian female British female North American male British male
Base
Var+Txt (C=50)
Var+Txt+Spk (C=50)
Var+Txt (C=150)
Var+Txt+Spk (C=150)
Var+Txt (C=300)
Var+Txt+Spk (C=300)

5. Multi-speaker Prior Samples

Here we synthesize from multi-speaker models by sampling the latent embeddings from the prior. The model that includes speaker dependencies in the posterior (Var+Txt+Spk) does a better job of preserving target speaker identity and pitch range, compared to the model without posterior speaker dependencies (Var+Txt). Notice also how the samples become more expressive (and erratic) as embedding capacity is increased.

Text:
Kanoute, the biggest Malian star at the moment, currently plays for Sevilla F C in Spain's La Liga.
Australian female British female North American male British male
Base
Var+Txt (C=50)
Var+Txt+Spk (C=50)
Var+Txt (C=150)
Var+Txt+Spk (C=150)
Var+Txt (C=300)
Var+Txt+Spk (C=300)
Text:
It was a superb adaptation.
Australian female British female North American male British male
Base
Var+Txt (C=50)
Var+Txt+Spk (C=50)
Var+Txt (C=150)
Var+Txt+Spk (C=150)
Var+Txt (C=300)
Var+Txt+Spk (C=300)
Text:
Does your shirt have holes in it? No? Then how did you put it on?
Australian female British female North American male British male
Base
Var+Txt (C=50)
Var+Txt+Spk (C=50)
Var+Txt (C=150)
Var+Txt+Spk (C=150)
Var+Txt (C=300)
Var+Txt+Spk (C=300)

6. Single-speaker Hierarchical Style Transfer

In this experiment, we use models with hierarchical latents (high-level latents, ZH, and low-level latents, ZL) and demonstrate the effect of varying CH and CL when transferring via ZH. When increasing CH, similarity to the reference increases. When increasing CL, inter-sample variation increases (between samples generated using the same ZH). We include groups of 3 samples generated using the same ZH inferred from the reference.
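The grouping described above (one ZH, several ZL draws) can be pictured with a toy sketch. It assumes a made-up Gaussian conditional prior centered on ZH with a fixed spread; the real model's conditional prior is learned, and every name and number here is hypothetical.

```python
import random


def sample_low_level(z_high, spread, rng):
    """Draw ZL from a toy conditional prior N(z_high, spread^2) per dimension."""
    return [rng.gauss(mean, spread) for mean in z_high]


rng = random.Random(0)

# Stands in for a ZH inferred once from the reference (values are invented).
z_high = [0.5, -1.0, 0.25]

# A group of 3 samples that share the same ZH but use different ZL draws.
group = [sample_low_level(z_high, spread=0.1, rng=rng) for _ in range(3)]
```

All three draws are tied to the same ZH, so they share the high-level style, while the independent ZL draws account for the variation heard within a group.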

Reference text:
"I've swallowed a pollywog.
Reference:

Baseline:
CH=20 CH=50 CH=100
CL=50
CL=100
Reference text:
"Oh, very well," said the latter resignedly, and the footman proceeded to open the folding tea-table and set out its complicated appointments.
Reference:

Baseline:
CH=20 CH=50 CH=100
CL=50
CL=100
Reference text:
"Not in that way.
Reference:

Baseline:
CH=20 CH=50 CH=100
CL=50
CL=100
Reference text:
I charge you by all that is sacred, not to attempt concealment.
Reference:

Baseline:
CH=20 CH=50 CH=100
CL=50
CL=100
Reference text:
Quick, now!
Reference:

Baseline:
CH=20 CH=50 CH=100
CL=50
CL=100
Reference text:
That's right! shouted the Queen.
Reference:

Baseline:
CH=20 CH=50 CH=100
CL=50
CL=100

7. Single-speaker Hierarchical Prosody Transfer

In this experiment, we use models with hierarchical latents (ZH and ZL) and demonstrate the effect of varying CH and CL when transferring via ZL (instead of ZH). We can see that as the sum of CH and CL increases, similarity to the reference increases. The variation across samples generated using the same reference is low. We include groups of 3 samples generated using different ZL samples inferred from the same reference.
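The total capacities listed for each setting below are the sums of CH and CL. The following small sketch reproduces that arithmetic for the capacity values used on this page; the variable and function names are our own.

```python
from itertools import product

# Capacity values that appear in the settings on this page.
HIGH_LEVEL_CAPACITIES = [20, 50, 100]  # CH
LOW_LEVEL_CAPACITIES = [50, 100]       # CL


def total_capacity(c_h, c_l):
    """Total capacity C when transferring via ZL: the sum of CH and CL."""
    return c_h + c_l


# Order the (CH, CL) settings by increasing total capacity.
settings = sorted(
    product(HIGH_LEVEL_CAPACITIES, LOW_LEVEL_CAPACITIES),
    key=lambda pair: total_capacity(*pair),
)

for c_h, c_l in settings:
    print(f"CH={c_h}, CL={c_l} -> C={total_capacity(c_h, c_l)}")
```

Two settings (CH=50, CL=100 and CH=100, CL=50) reach the same total of 150, which makes them a useful pair for checking whether similarity to the reference tracks the sum rather than the split.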

Reference text:
"I've swallowed a pollywog.
Reference

Baseline
CH=20, CL=50
➡ C=70
CH=50, CL=50
➡ C=100
CH=20, CL=100
➡ C=120
CH=50, CL=100
➡ C=150
CH=100, CL=50
➡ C=150
CH=100, CL=100
➡ C=200
Reference text:
"Oh, very well," said the latter resignedly, and the footman proceeded to open the folding tea-table and set out its complicated appointments.
Reference

Baseline
CH=20, CL=50
➡ C=70
CH=50, CL=50
➡ C=100
CH=20, CL=100
➡ C=120
CH=50, CL=100
➡ C=150
CH=100, CL=50
➡ C=150
CH=100, CL=100
➡ C=200
Reference text:
"Not in that way.
Reference

Baseline
CH=20, CL=50
➡ C=70
CH=50, CL=50
➡ C=100
CH=20, CL=100
➡ C=120
CH=50, CL=100
➡ C=150
CH=100, CL=50
➡ C=150
CH=100, CL=100
➡ C=200
Reference text:
I charge you by all that is sacred, not to attempt concealment.
Reference

Baseline
CH=20, CL=50
➡ C=70
CH=50, CL=50
➡ C=100
CH=20, CL=100
➡ C=120
CH=50, CL=100
➡ C=150
CH=100, CL=50
➡ C=150
CH=100, CL=100
➡ C=200
Reference text:
Quick, now!
Reference

Baseline
CH=20, CL=50
➡ C=70
CH=50, CL=50
➡ C=100
CH=20, CL=100
➡ C=120
CH=50, CL=100
➡ C=150
CH=100, CL=50
➡ C=150
CH=100, CL=100
➡ C=200
Reference text:
That's right! shouted the Queen.
Reference

Baseline
CH=20, CL=50
➡ C=70
CH=50, CL=50
➡ C=100
CH=20, CL=100
➡ C=120
CH=50, CL=100
➡ C=150
CH=100, CL=50
➡ C=150
CH=100, CL=100
➡ C=200