Paper: arXiv
Authors: Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, and Tom Bagby
Abstract: Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style transfer, and generation of natural-sounding prior samples. For multi-speaker models, Capacitron is able to preserve target speaker identity during inter-speaker prosody transfer and when drawing samples from the latent prior. Lastly, we introduce a method for decomposing embedding capacity hierarchically across two sets of latents, allowing a portion of the latent variability to be specified and the remaining variability sampled from a learned prior. Audio examples are available on the web.
This page contains a set of audio samples in support of the paper; we suggest listening to the samples while reading the paper. All reference and target utterances were unseen during training.
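For orientation, the capacity values, C, referenced throughout these samples come from constraining the KL term of the variational objective. Below is a minimal sketch (PyTorch; the names capacity_constrained_loss and lagrange_param are ours, not the paper's code) of one common way to hold the KL near a target capacity with a learned Lagrange multiplier, in the spirit of the constrained objective the paper describes.

```python
import torch
import torch.nn.functional as F

def capacity_constrained_loss(log_likelihood, kl, lagrange_param, capacity_c):
    """Negative ELBO with the KL term pushed toward a target capacity C.

    beta = softplus(lagrange_param) keeps the multiplier non-negative.
    The detach() calls split the optimization in two: model parameters
    are trained against a fixed beta, while beta grows when KL exceeds C
    and shrinks when KL falls below it.
    """
    beta = F.softplus(lagrange_param)
    primary = -log_likelihood + beta.detach() * kl
    dual = -beta * (kl.detach() - capacity_c)
    return primary + dual
```

Minimizing this loss jointly over the model parameters and lagrange_param drives the KL, and hence the embedding capacity, toward C.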
Here we show that models with and without text dependencies in the variational posterior (Var+Txt and Var, respectively) exhibit increasing similarity to the reference as embedding capacity, C, is increased. The baseline model does not use a reference embedding.
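For concreteness, here is a minimal sketch of the two posterior variants compared in this section (the class name and the summary-vector interface are assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class ReferencePosterior(nn.Module):
    """Var:     q(z | audio)        when use_text=False.
    Var+Txt: q(z | audio, text)  when use_text=True, matching the form
    of the true posterior, which depends on everything the decoder sees."""
    def __init__(self, audio_dim, text_dim, latent_dim, use_text):
        super().__init__()
        self.use_text = use_text
        in_dim = audio_dim + (text_dim if use_text else 0)
        self.to_params = nn.Linear(in_dim, 2 * latent_dim)

    def forward(self, audio_summary, text_summary=None):
        h = (torch.cat([audio_summary, text_summary], dim=-1)
             if self.use_text else audio_summary)
        mu, log_scale = self.to_params(h).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_scale.exp())
```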
Reference text:"Not in that way. Reference Baseline |
|
|||||||||||||||
Reference text:That's right! shouted the Queen. Reference Baseline |
|
|||||||||||||||
Reference text:"Never again shall Eleanor Lavish be a friend of mine." Reference Baseline |
|
|||||||||||||||
Reference text:'Try her again, then.' Reference Baseline |
|
|||||||||||||||
Reference text:How dare you!" Reference Baseline |
|
|||||||||||||||
Reference text:'Speak, won't you!' cried the King. 'How are they getting on with the fight?' Reference Baseline |
|
|||||||||||||||
Reference text:"Oh, Gabriel, how could you serve me so unkindly!" Reference Baseline |
|
|||||||||||||||
Reference text:And how the parson would pray! Reference Baseline |
|
|||||||||||||||
Reference text:"I've swallowed a pollywog. Reference Baseline |
|
In this section, we compare models with and without text dependencies in the variational posterior (Var+Txt and Var, respectively) when transferring between utterances with different text. The examples show how the quality of the samples from Var deteriorates as the capacity, C, is increased: without text conditioning, the embedding must capture information that is entangled with the reference text, and at high capacity this conflicts with the target text. This is especially noticeable when there is a significant mismatch between the lengths of the two utterances.
Reference text:"Not in that way. Reference Target text:And the distinction is not quite so much against the candour and common sense of the world as appears at first; for a very narrow income has a tendency to contract the mind, and sour the temper. Baseline |
|
|||||||||||||||
Reference text:Quick, now! Reference Target text:Yet he might not have been so perfectly humane, so thoughtful in his generosity, so full of kindness and tenderness amidst his passion for adventurous exploit, had she not unfolded to him the real loveliness of beneficence and made the doing good the end and aim of his soaring ambition. Baseline |
|
|||||||||||||||
Reference text:That's right! shouted the Queen. Reference Target text:CHAPTER 1 Baseline |
|
|||||||||||||||
Reference text:"Never again shall Eleanor Lavish be a friend of mine." Reference Target text:He left them, carefully closing the front door; and when they looked through the hall window, they saw him go up the drive and begin to climb the slopes of withered fern behind the house. Baseline |
|
|||||||||||||||
Reference text:"Oh, Gabriel, how could you serve me so unkindly!" Reference Target text:Was it that the far clear voice had meant? Baseline |
|
Here we sample the latent embedding from the prior distribution for models with different capacities. The quality of the samples from the Var model deteriorates as the capacity, C, is increased, while Var+Txt remains stable. This is especially noticeable for very short and very long utterances.
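Mechanically, these samples replace posterior inference with a draw from the prior. A hedged sketch (the decoder callable and the standard-normal prior are assumptions; a learned prior would replace the N(0, I) below):

```python
import torch

def sample_from_prior(decoder, text_encoding, latent_dim, n_samples=1):
    # No reference audio is used: each latent is drawn from the prior and
    # decoded together with the text encoding.
    prior = torch.distributions.Normal(torch.zeros(latent_dim),
                                       torch.ones(latent_dim))
    return [decoder(text_encoding, prior.sample()) for _ in range(n_samples)]
```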
[Audio players omitted: for each text below, the page provides the Baseline output and prior samples from Var / Var+Txt at increasing capacity C.]

Text: CHAPTER 1
Text: With my sharp and long sight, as I look up, I have seen it distinctly; now if it happens to hurt the young lady, and I think it must, here am I, here are my file, my punch, my nippers; I will make it round and blunt, if her ladyship pleases; no longer the tooth of a fish, but of a beautiful young lady as she is.
Text: I charge you by all that is sacred, not to attempt concealment.
Text: "Never again shall Eleanor Lavish be a friend of mine."
Text: How dare you!"
In this experiment, we demonstrate prosody transfer between speakers. When the posterior is conditioned on the reference speaker (Var+Txt+Spk), the embedding does not need to encode speaker identity, so the output preserves the pitch range of the target speaker; without speaker conditioning (Var+Txt), the output tends to match the pitch of the reference speaker. As the capacity, C, is increased, the difference becomes more apparent.
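A sketch of the inter-speaker transfer setup (all names are illustrative): the posterior sees the reference speaker, while the decoder is conditioned on the target speaker, so the latent need not carry identity.

```python
def transfer_prosody(posterior, decoder, ref, target):
    # Var+Txt+Spk: infer z with the *reference* speaker in the posterior...
    z = posterior(ref.audio_enc, ref.text_enc, ref.speaker_emb).sample()
    # ...then decode with the *target* speaker embedding.
    return decoder(target.text_enc, z, target.speaker_emb)
```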
[Audio players omitted: for each reference text below, the page provides the reference recording and Var+Txt / Var+Txt+Spk outputs at increasing capacity C.]

Reference text: I ventured onto the platform.
Reference text: Quick, now!
Reference text: Good job! Sounds like Snuffles is stuffed!
Reference text: How many licks does it take to get to the center of a Tootsie Pop?
Reference text: Moby Dick's House of Kabob is at US two eight one Santo.
Reference text: "Half-past six o'clock.
Reference text: "I've swallowed a pollywog.
Here we synthesize from multi-speaker models by sampling the latent embeddings from the prior. The model that includes speaker dependencies in the posterior (Var+Txt+Spk) does a better job of preserving target speaker identity and pitch range than the model without posterior speaker dependencies (Var+Txt). Notice also how the samples become more expressive (and more erratic) as embedding capacity is increased.
[Audio players omitted: for each text below, the page provides prior samples from Var+Txt / Var+Txt+Spk at increasing capacity C.]

Text: Kanoute, the biggest Malian star at the moment, currently plays for Sevilla F C in Spain's La Liga.
Text: It was a superb adaptation.
Text: Does your shirt have holes in it? No? Then how did you put it on?
In this experiment, we use models with hierarchical latents (high-level latents, ZH, and low-level latents, ZL) and demonstrate the effect of varying CH and CL when transferring via ZH. Increasing CH increases similarity to the reference, while increasing CL increases variation between samples generated using the same ZH. We include groups of 3 samples generated using the same ZH inferred from the reference.
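A sketch of transfer via ZH under this hierarchical decomposition (function and attribute names are assumptions):

```python
def transfer_via_zh(posterior_h, prior_l, decoder, ref, text_enc, n=3):
    # Infer the high-level latent (capacity CH) from the reference once...
    z_h = posterior_h(ref.audio_enc, ref.text_enc).sample()
    # ...then draw the remaining variability (capacity CL) from the learned
    # conditional prior p(z_l | z_h), yielding n distinct realizations.
    return [decoder(text_enc, z_h, prior_l(z_h).sample()) for _ in range(n)]
```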
Reference text:"I've swallowed a pollywog. Reference: Baseline: |
|
||||||||||||
Reference text:"Oh, very well," said the latter resignedly, and the footman proceeded to open the folding tea-table and set out its complicated appointments. Reference: Baseline: |
|
||||||||||||
Reference text:"Not in that way. Reference: Baseline: |
|
||||||||||||
Reference text:I charge you by all that is sacred, not to attempt concealment. Reference: Baseline: |
|
||||||||||||
Reference text:Quick, now! Reference: Baseline: |
|
||||||||||||
Reference text:That's right! shouted the Queen. Reference: Baseline: |
|
In this experiment, we use models with hierarchical latents (ZH and ZL) and demonstrate the effect of varying CH and CL when transferring via ZL (instead of ZH). As the sum CH + CL increases, similarity to the reference increases, while variation across samples generated from the same reference remains low. We include groups of 3 samples generated using different ZL's inferred from the same reference.
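The corresponding sketch for transfer via ZL (again, names assumed): both latents are inferred from the reference, so the effective transfer capacity is roughly CH + CL.

```python
def transfer_via_zl(posterior_h, posterior_l, decoder, ref, text_enc, n=3):
    # Infer both levels from the reference; each of the n outputs uses a
    # different draw of z_l from the same low-level posterior.
    z_h = posterior_h(ref.audio_enc, ref.text_enc).sample()
    q_l = posterior_l(ref.audio_enc, ref.text_enc, z_h)
    return [decoder(text_enc, z_h, q_l.sample()) for _ in range(n)]
```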
Reference text:"I've swallowed a pollywog. Reference Baseline |
|
||||||||||||
Reference text:"Oh, very well," said the latter resignedly, and the footman proceeded to open the folding tea-table and set out its complicated appointments. Reference Baseline |
|
||||||||||||
Reference text:"Not in that way. Reference Baseline |
|
||||||||||||
Reference text:I charge you by all that is sacred, not to attempt concealment. Reference Baseline |
|
||||||||||||
Reference text:Quick, now! Reference Baseline |
|
||||||||||||
Reference text:That's right! shouted the Queen. Reference Baseline |
|