Audio samples from "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron"

Paper: arXiv

Talk: ICML 2018

Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, Rif A. Saurous

Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.

This page contains a set of audio samples in support of the conclusions in our paper: we suggest that the reader listen to the samples in conjunction with reading the paper. All utterances were unseen in training.

Click here for more from the Tacotron team.

Contents

1. Prosody transfer from unseen speakers onto a single-speaker model
2. Prosody transfer from speakers seen during training
3. Prosody transfer from an unseen speaker onto a multi-speaker model
4. Varying prosody embedding bottleneck dimension and activation
5. Same speaker prosody transfer with text perturbations

1. Prosody transfer from unseen speakers onto a single-speaker model

A single-speaker model trained on audiobook recordings, conditioned on utterances from an expressive corpus of digital assistant responses. During training, the model saw neither these speakers nor any digital-assistant-style utterances. The base model uses no prosody embedding, while the tanh-128 model uses a 128-dimensional tanh-scaled prosody embedding computed from the reference utterance.
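This page does not describe how the 128-dimensional tanh-scaled embedding is computed. As a rough sketch only, assume the reference utterance has already been summarized into a fixed-length vector, and that a single dense layer followed by tanh produces the embedding. The function name `tanh_bottleneck`, the summary size of 32, and the random weights are all hypothetical stand-ins, not values from the paper.

```python
import math
import random


def tanh_bottleneck(reference_summary, weights, bias):
    """Project a fixed-length summary vector to a tanh-bounded embedding.

    Hypothetical helper: one dense layer followed by tanh, so every
    component of the result lies in [-1, 1].
    """
    embedding = []
    for row, b in zip(weights, bias):
        pre_activation = sum(w * x for w, x in zip(row, reference_summary)) + b
        embedding.append(math.tanh(pre_activation))
    return embedding


rng = random.Random(0)
summary_dim = 32      # assumed size of the reference summary (illustrative)
embedding_dim = 128   # matches the "tanh-128" label used on this page

# Random stand-ins for a learned summary vector and learned layer weights.
reference_summary = [rng.gauss(0.0, 1.0) for _ in range(summary_dim)]
weights = [[rng.gauss(0.0, 0.1) for _ in range(summary_dim)]
           for _ in range(embedding_dim)]
bias = [0.0] * embedding_dim

prosody_embedding = tanh_bottleneck(reference_summary, weights, bias)
print(len(prosody_embedding), min(prosody_embedding), max(prosody_embedding))
```

In a trained model the weights would be learned jointly with the synthesizer. The sketch only shows the shape of the output and the bounded range that tanh imposes.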

Reference:
Text: How do bureaucrats wrap presents? With lots of red tape.

Voice	North American female
base:
tanh-128:

Reference:
Text: Why are libraries so strict? They have to go by the book.

Voice	North American female
base:
tanh-128:

Reference:
Text: Why are fish so smart? Because they hang out in schools so much.

Voice	North American female
base:
tanh-128:

Reference:
Text: Heaps of things. Like fairy bread, how the surf is today and why magpies swoop.

Voice	North American female
base:
tanh-128:

Reference:
Text: The past, the present, and the future walk into a bar. It was tense.

Voice	North American female
base:
tanh-128:

Reference:
Text: I usually down a cup of java script. Then I put on nature sounds and run a few strenuous searches to improve my speed

Voice	North American female
base:
tanh-128:

Reference:
Text: I don't have eyes, but I don't need them to know the vibe in here feels good

Voice	North American female
base:
tanh-128:

Reference:
Text: What time do you go to the dentist? At tooth-hurty!

Voice	North American female
base:
tanh-128:

Reference:
Text: Sweet dreams are made of these. Friendly Assistants who work hard to please

Voice	North American female
base:
tanh-128:

Reference:
Text: You are what you eat. So I guess I'm a whole lot of data and a little bit of pizza recipes.

Voice	North American female
base:
tanh-128:

Reference:
Text: Men say they know many things; But lo! they have taken wings, The arts and sciences, And a thousand appliances; The wind that blows Is all that any body knows.

Voice	North American female
base:
tanh-128:

Reference:
Text: Do you prefer chocolate or jelly? Which would you like in your belly? You could make a good case, For a cool ice cream base, But I'd argue against vermicelli

Voice	North American female
base:
tanh-128:

Reference:
Text: Halloween Edition it is! Remember to follow the moves as I say them.

Voice	North American female
base:
tanh-128:

Reference:
Text: Why are archaeologists so annoyed? They always have a bone to pick.

Voice	North American female
base:
tanh-128:

Reference:
Text: That one sailed RIGHT over my head.

Voice	North American female
base:
tanh-128:

Reference:
Text: Wear your heart on your sleeve. It'll terrify people.

Voice	North American female
base:
tanh-128:

2. Prosody transfer from speakers seen during training

Using our 44-speaker dataset, we trained a baseline model and a prosody-encoder model. Below, the reference utterances whose prosody we want to capture are on the left. On the right are utterances synthesized with this prosody, but in a voice already in the dataset (these are labeled tanh-128). We also provide a baseline from a Tacotron model with no prosody encoder. Notice how the model captures the prosody of the reference speaker while keeping the characteristics of the target voice.

Reference:
Text: The group didn't *WALK* through Central Park, they *RAN* through it.

Voice	North American female	North American male	Australian female	British female	British male	Indian female
base:
tanh-128:

Reference:
Text: That'd be some cosmic conversation.

Voice	North American female	North American male	Australian female	British female	British male	Indian female
base:
tanh-128:

Reference:
Text: No no, dear reader.

Voice	North American female	North American male	Australian female	British female	British male	Indian female
base:
tanh-128:

Reference:
Text: *IS* that Utah travel agency?

Voice	North American female	North American male	Australian female	British female	British male	Indian female
base:
tanh-128:

Reference:
Text: Happy Halloween.

Voice	North American female	North American male	Australian female	British female	British male	Indian female
base:
tanh-128:

Reference:
Text: No one at fort griswold had been expecting anything especially after there had been six years of false alarms

Voice	North American female	North American male	Australian female	British female	British male	Indian female
base:
tanh-128:

Reference:
Text: Only one was deployed, while they need a hundred teams.

Voice	North American female	North American male	Australian female	British female	British male	Indian female
base:
tanh-128:

Reference:
Text: Quick as he himself thought, he was to keep the batsman on toes.

Voice	North American female	North American male	Australian female	British female	British male	Indian female
base:
tanh-128:

3. Prosody transfer from an unseen speaker onto a multi-speaker model

A 44-speaker model conditioned on utterances from an emotive corpus of audiobook recordings. During training, the 44-speaker model was exposed to neither the audiobook speaker nor any audiobook-style recordings. We are able to transfer the extreme variations in prosody from the audiobook speaker to our 44 speakers, even though the training dataset contains far less prosodic variation.

Reference:
Text: "Wait a minute!" called the Scarecrow.

Voice	Australian female	British female	North American male	North American female
base:
tanh-128:

Reference:
Text: It will be good for both of you.

Voice	Australian female	British female	North American male	North American female
base:
tanh-128:

Reference:
Text: "Oh, now and then," said Lucy, who had rather enjoyed herself.

Voice	Australian female	British female	North American male	North American female
base:
tanh-128:

Reference:
Text: I've swallowed a pollywog.

Voice	Australian female	British female	North American male	North American female
base:
tanh-128:

Reference:
Text: Not in that way.

Voice	Australian female	British female	North American male	North American female
base:
tanh-128:

Reference:
Text: I charge you by all that is sacred, not to attempt concealment.

Voice	Australian female	British female	North American male	North American female
base:
tanh-128:

Reference:
Text: I have escaped; and that I should escape, may be a matter of grateful wonder to you and myself.

Voice	Australian female	British female	North American male	North American female
base:
tanh-128:

Reference:
Text: His had a cold, ugly look of dislike and contempt, and indifference to what would happen.

Voice	Australian female	British female	North American male	North American female
base:
tanh-128:

Reference:
Text: I hate helping to hang heavy, hot, hairy hides on them.

Voice	Australian female	British female	North American male	North American female
base:
tanh-128:

4. Varying prosody embedding bottleneck dimension and activation

This experiment examines the effect of changing the bottleneck size and activation of the prosody encoder. In general, the larger the bottleneck, the more faithful the reproduction of the reference prosody, but some of the speaker characteristics start leaking through. Interestingly, the softmax activation enforces a more aggressive bottlenecking of information.
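One possible intuition for the difference between the two activations, offered as an illustration and not as an explanation given on this page: tanh bounds each component independently, whereas softmax also forces the components to be non-negative and sum to 1, and it discards any constant offset shared by all inputs. The toy 4-dimensional values and function names below are made up for the example.

```python
import math


def tanh_activation(pre_activations):
    """Squash each component independently into (-1, 1)."""
    return [math.tanh(v) for v in pre_activations]


def softmax_activation(pre_activations):
    """Map the vector onto the probability simplex (non-negative, sums to 1)."""
    shift = max(pre_activations)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in pre_activations]
    total = sum(exps)
    return [e / total for e in exps]


pre_activations = [2.0, -1.0, 0.5, 0.0]           # toy values
shifted = [v + 3.0 for v in pre_activations]      # same vector plus a constant

tanh_out = tanh_activation(pre_activations)
softmax_out = softmax_activation(pre_activations)
tanh_shifted = tanh_activation(shifted)
softmax_shifted = softmax_activation(shifted)

print("tanh:           ", tanh_out)
print("tanh (shifted): ", tanh_shifted)
print("softmax:        ", softmax_out)
print("softmax shifted:", softmax_shifted)
print("softmax sum:    ", sum(softmax_out))
```

The shifted input changes the tanh output but leaves the softmax output unchanged, so a softmax embedding of a given size carries one fewer degree of freedom than a tanh embedding of the same size. Whether this fully accounts for the audible difference in the samples is not something the sketch can establish.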

Reference:
Text: 'It's spoilt, of course!' Here he looked at Tweedledee, who immediately sat down on the ground, and tried to hide himself under the umbrella.

Voice	North American female
base:
softmax-128:
softmax-16:
tanh-16:
tanh-32:
tanh-64:
tanh-128:

Reference:
Text: So, to punish it, she held it up to the Looking-glass, that it might see how sulky it was -- 'and if you're not good directly,' she added, 'I'll put you through into Looking-glass House.

Voice	North American female
base:
softmax-128:
softmax-16:
tanh-16:
tanh-32:
tanh-64:
tanh-128:

5. Same speaker prosody transfer with text perturbations

A single-speaker model trained on audiobook recordings. In these examples, the left-hand column contains an unseen reference utterance from the same speaker. On the right-hand side, we synthesize a perturbed version of that text: on top is synthesis with no conditioning, and on the bottom is the single-speaker model conditioned on the reference utterance.
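Conditioning on a reference while synthesizing different text is possible because the prosody embedding has a fixed length that does not depend on the text. The sketch below assumes, purely for illustration, that the embedding is copied onto every text-encoder timestep by concatenation. The function name `condition_on_prosody`, the toy dimensions, and the placeholder values are hypothetical and do not come from this page.

```python
def condition_on_prosody(text_encoder_outputs, prosody_embedding):
    """Append the same fixed-length embedding to every text timestep.

    Hypothetical helper: the embedding is broadcast along the text axis,
    so the text may have any length.
    """
    return [list(step) + list(prosody_embedding) for step in text_encoder_outputs]


# Toy 3-dimensional embedding standing in for one computed from a reference.
prosody_embedding = [0.3, -0.7, 0.1]

# Placeholder encoder states for two texts of different lengths.
reference_text_states = [[0.0, 1.0] for _ in range(6)]   # 6 timesteps
perturbed_text_states = [[1.0, 0.0] for _ in range(9)]   # 9 timesteps

conditioned_reference = condition_on_prosody(reference_text_states, prosody_embedding)
conditioned_perturbed = condition_on_prosody(perturbed_text_states, prosody_embedding)

print(len(conditioned_reference), len(conditioned_reference[0]))
print(len(conditioned_perturbed), len(conditioned_perturbed[0]))
```

Both texts receive the same embedding even though they differ in length, which is the property the perturbed-text examples rely on.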

Reference:
Reference text: 'I can now,' said the Leopard.
Perturbed text: 'I can now,' said the Porcupine.

Voice	North American female
base:
tanh-128:

Reference:
Reference text: For the first time in her life she had been danced tired.
Perturbed text: For the last time in his life he had been handily embarrassed.

Voice	North American female
base:
tanh-128:

Reference:
Reference text: Second--Her family was very ancient and noble.
Perturbed text: First--Her family was very sarcastic and horrible.

Voice	North American female
base:
tanh-128:

Reference:
Reference text: Never again shall Eleanor Lavish be a friend of mine.
Perturbed text: Never again shall Bartholomew Bigglesby be a son of mine.

Voice	North American female
base:
tanh-128:

Reference:
Reference text: Now it is all dark.
Perturbed text: Later it will all be lost.

Voice	North American female
base:
tanh-128:

Reference:
Reference text: Never!
Perturbed text: Forever!

Voice	North American female
base:
tanh-128:

Reference:
Reference text: So I did.
Perturbed text: So I didn't.

Voice	North American female
base:
tanh-128:

Reference:
Reference text: Oh, yes.
Perturbed text: Oh, no.

Voice	North American female
base:
tanh-128:

Reference:
Reference text: I cannot think why this wall is here, nor what it is made of.
Perturbed text: I do not know why this guy is here, or what his name is.

Voice	North American female
base:
tanh-128:

Reference:
Reference text: Alice was not much surprised at this, she was getting so used to queer things happening.
Perturbed text: Eric was not much surprised at this, he was getting so used to TensorFlow breaking.

Voice	North American female
base:
tanh-128: