Audio samples from "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron"

Paper: arXiv

Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, Rif A. Saurous

Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.

This page contains a set of audio samples in support of the conclusions in our paper. We suggest that the reader listen to the samples while reading the paper. All utterances were unseen in training. Click here for more from the Tacotron team.


1. Prosody transfer from unseen speakers onto a single-speaker model

A single-speaker model trained on audiobook recordings, conditioned on utterances from an expressive corpus of digital assistant responses. The model saw neither these speakers nor digital-assistant-style utterances during training. The base model uses no prosody embedding, while the tanh-128 model uses a 128-dimensional tanh-scaled prosody embedding computed from the reference utterance.
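The difference between the base and tanh-128 systems can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the prosody embedding is a single fixed-length vector that is concatenated onto every text-encoder timestep, which this page does not state. The function name `condition_on_prosody` and the toy dimensions are hypothetical.

```python
from typing import List, Optional

Vector = List[float]


def condition_on_prosody(
    text_encodings: List[Vector],
    prosody_embedding: Optional[Vector] = None,
) -> List[Vector]:
    """Attach one fixed-length prosody embedding to every text timestep.

    Assumption (not stated on this page): conditioning is done by
    concatenating the same embedding to each encoder output frame.
    With prosody_embedding=None this mimics the "base" model, which
    sees only the text encodings.
    """
    if prosody_embedding is None:
        return [list(frame) for frame in text_encodings]
    return [list(frame) + list(prosody_embedding) for frame in text_encodings]


# Toy example: 3 text timesteps of width 4, and a 128-dimensional embedding
# whose entries lie in (-1, 1), as a tanh output would.
text_encodings = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
prosody_embedding = [0.5] * 128

base_inputs = condition_on_prosody(text_encodings)
tanh128_inputs = condition_on_prosody(text_encodings, prosody_embedding)

print(len(base_inputs[0]))     # 4
print(len(tanh128_inputs[0]))  # 132
```

Under this assumption the decoder of the tanh-128 model receives the same prosody vector at every text position, so the embedding itself must encode how prosody varies over time.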

Reference:
Text: How do bureaucrats wrap presents? With lots of red tape.
Voice North American female
base:
tanh-128:
Reference:
Text: Why are libraries so strict? They have to go by the book.
Voice North American female
base:
tanh-128:
Reference:
Text: Why are fish so smart? Because they hang out in schools so much.
Voice North American female
base:
tanh-128:
Reference:
Text: Heaps of things. Like fairy bread, how the surf is today and why magpies swoop.
Voice North American female
base:
tanh-128:
Reference:
Text: The past, the present, and the future walk into a bar. It was tense.
Voice North American female
base:
tanh-128:
Reference:
Text: I usually down a cup of java script. Then I put on nature sounds and run a few strenuous searches to improve my speed
Voice North American female
base:
tanh-128:
Reference:
Text: I don't have eyes, but I don't need them to know the vibe in here feels good
Voice North American female
base:
tanh-128:
Reference:
Text: What time do you go to the dentist? At tooth-hurty!
Voice North American female
base:
tanh-128:
Reference:
Text: Sweet dreams are made of these. Friendly Assistants who work hard to please
Voice North American female
base:
tanh-128:
Reference:
Text: You are what you eat. So I guess I'm a whole lot of data and a little bit of pizza recipes.
Voice North American female
base:
tanh-128:
Reference:
Text: Men say they know many things; But lo! they have taken wings, The arts and sciences, And a thousand appliances; The wind that blows Is all that any body knows.
Voice North American female
base:
tanh-128:
Reference:
Text: Do you prefer chocolate or jelly? Which would you like in your belly? You could make a good case, For a cool ice cream base, But I'd argue against vermicelli
Voice North American female
base:
tanh-128:
Reference:
Text: Halloween Edition it is! Remember to follow the moves as I say them.
Voice North American female
base:
tanh-128:
Reference:
Text: Why are archaeologists so annoyed? They always have a bone to pick.
Voice North American female
base:
tanh-128:
Reference:
Text: That one sailed RIGHT over my head.
Voice North American female
base:
tanh-128:
Reference:
Text: Wear your heart on your sleeve. It'll terrify people.
Voice North American female
base:
tanh-128:

2. Prosody transfer from speakers seen during training

Using our 44-speaker dataset, we trained a baseline model and a prosody encoder model. Below, on the left, are a set of reference utterances whose prosody we want to capture. On the right are utterances synthesized with this prosody, but in a voice already in the dataset (these are labeled tanh-128). We also provide a baseline from a Tacotron model with no prosody encoder. Notice how the synthesized audio captures the prosody of the reference speaker while keeping the voice characteristics of the target voice.

Reference:
Text: The group didn't *WALK* through Central Park, they *RAN* through it.
Voice North American female, North American male, Australian female, British female, British male, Indian female
base:
tanh-128:
Reference:
Text: That'd be some cosmic conversation.
Voice North American female, North American male, Australian female, British female, British male, Indian female
base:
tanh-128:
Reference:
Text: No no, dear reader.
Voice North American female, North American male, Australian female, British female, British male, Indian female
base:
tanh-128:
Reference:
Text: *IS* that Utah travel agency?
Voice North American female, North American male, Australian female, British female, British male, Indian female
base:
tanh-128:
Reference:
Text: Happy Halloween.
Voice North American female, North American male, Australian female, British female, British male, Indian female
base:
tanh-128:
Reference:
Text: No one at fort griswold had been excepting anything especially after there had been six years of false alarms
Voice North American female, North American male, Australian female, British female, British male, Indian female
base:
tanh-128:
Reference:
Text: Only one was deployed, while they need a hundred teams.
Voice North American female, North American male, Australian female, British female, British male, Indian female
base:
tanh-128:
Reference:
Text: Quick as he himself thought, he was to keep the batsman on toes.
Voice North American female, North American male, Australian female, British female, British male, Indian female
base:
tanh-128:

3. Prosody transfer from an unseen speaker onto a multi-speaker model

A 44-speaker model conditioned on utterances from an emotive corpus of audiobook recordings. The 44-speaker model was exposed neither to the audiobook speaker nor to any audiobook-style recordings in training. We are able to transfer the extreme variations in prosody from the audiobook speaker to our 44 speakers, even though the training dataset contains far less prosodic variation.

Reference:
Text: "Wait a minute!" called the Scarecrow.
Voice Australian female, British female, North American male, North American female
base:
tanh-128:
Reference:
Text: It will be good for both of you.
Voice Australian female, British female, North American male, North American female
base:
tanh-128:
Reference:
Text: "Oh, now and then," said Lucy, who had rather enjoyed herself.
Voice Australian female, British female, North American male, North American female
base:
tanh-128:
Reference:
Text: I've swallowed a pollywog.
Voice Australian female, British female, North American male, North American female
base:
tanh-128:
Reference:
Text: Not in that way.
Voice Australian female, British female, North American male, North American female
base:
tanh-128:
Reference:
Text: I charge you by all that is sacred, not to attempt concealment.
Voice Australian female, British female, North American male, North American female
base:
tanh-128:
Reference:
Text: I have escaped; and that I should escape, may be a matter of grateful wonder to you and myself.
Voice Australian female, British female, North American male, North American female
base:
tanh-128:
Reference:
Text: His had a cold, ugly look of dislike and contempt, and indifference to what would happen.
Voice Australian female, British female, North American male, North American female
base:
tanh-128:
Reference:
Text: I hate helping to hang heavy, hot, hairy hides on them.
Voice Australian female, British female, North American male, North American female
base:
tanh-128:

4. Varying prosody embedding bottleneck dimension and activation

This experiment examines the effect of changing the bottleneck size and activation of the prosody encoder. In general, the larger the bottleneck, the more faithful the reproduction of the reference prosody, but some of the speaker characteristics start leaking through. Interestingly, the softmax activation enforces a more aggressive bottlenecking of information.
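The two activations constrain the embedding in different ways, which a small sketch can make concrete. The code below is illustrative only: the single linear layer, the random untrained weights, the 32-dimensional input, and the names `project`, `tanh_bottleneck`, `softmax_bottleneck`, and `random_layer` are assumptions, not details taken from the paper's implementation. It shows that a tanh bottleneck bounds each dimension independently to (-1, 1), whereas a softmax bottleneck forces the entries to be non-negative and to sum to 1, which is one plausible reason it passes less information at the same dimensionality.

```python
import math
import random
from typing import List

Vector = List[float]


def project(summary: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Linear layer: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, summary)) + b
        for row, b in zip(weights, bias)
    ]


def tanh_bottleneck(pre_activation: Vector) -> Vector:
    """Squash each dimension independently into (-1, 1)."""
    return [math.tanh(v) for v in pre_activation]


def softmax_bottleneck(pre_activation: Vector) -> Vector:
    """Map onto the probability simplex (entries >= 0, sum == 1)."""
    peak = max(pre_activation)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in pre_activation]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng: random.Random, out_dim: int, in_dim: int):
    """Hypothetical untrained weights, used only to produce example values."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


rng = random.Random(0)
# Stand-in for a fixed-length summary of the reference utterance.
summary = [rng.gauss(0.0, 1.0) for _ in range(32)]

for dim in (16, 128):
    weights, bias = random_layer(rng, dim, len(summary))
    pre = project(summary, weights, bias)
    t = tanh_bottleneck(pre)
    s = softmax_bottleneck(pre)
    print(dim, "tanh range:", (min(t), max(t)), "softmax sum:", sum(s))
```

Because the softmax outputs must sum to 1, a softmax embedding of size n has only n - 1 free dimensions, and every entry competes with the others for the same unit of mass.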

Reference:
Text: 'It's spoilt, of course!' Here he looked at Tweedledee, who immediately sat down on the ground, and tried to hide himself under the umbrella.
Voice North American female
base:
softmax-128:
softmax-16:
tanh-16:
tanh-32:
tanh-64:
tanh-128:
Reference:
Text: So, to punish it, she held it up to the Looking-glass, that it might see how sulky it was -- 'and if you're not good directly,' she added, 'I'll put you through into Looking-glass House.
Voice North American female
base:
softmax-128:
softmax-16:
tanh-16:
tanh-32:
tanh-64:
tanh-128:

5. Same-speaker prosody transfer with text perturbations

A single-speaker model trained on audiobook recordings. In these examples, the left-hand column contains an unseen reference utterance from the same speaker. On the right-hand side, we synthesize a perturbed version of that text: on top is synthesis with no conditioning, and on the bottom is the single-speaker model conditioned on the reference utterance.

Reference:
Reference text: 'I can now,' said the Leopard.
Perturbed text: 'I can now,' said the Porcupine.
Voice North American female
base:
tanh-128:
Reference:
Reference text: For the first time in her life she had been danced tired.
Perturbed text: For the last time in his life he had been handily embarrassed.
Voice North American female
base:
tanh-128:
Reference:
Reference text: Second--Her family was very ancient and noble.
Perturbed text: First--Her family was very sarcastic and horrible.
Voice North American female
base:
tanh-128:
Reference:
Reference text: Never again shall Eleanor Lavish be a friend of mine.
Perturbed text: Never again shall Bartholomew Bigglesby be a son of mine.
Voice North American female
base:
tanh-128:
Reference:
Reference text: Now it is all dark.
Perturbed text: Later it will all be lost.
Voice North American female
base:
tanh-128:
Reference:
Reference text: Never!
Perturbed text: Forever!
Voice North American female
base:
tanh-128:
Reference:
Reference text: So I did.
Perturbed text: So I didn't.
Voice North American female
base:
tanh-128:
Reference:
Reference text: Oh, yes.
Perturbed text: Oh, no.
Voice North American female
base:
tanh-128:
Reference:
Reference text: I cannot think why this wall is here, nor what it is made of.
Perturbed text: I do not know why this guy is here, or what his name is.
Voice North American female
base:
tanh-128:
Reference:
Reference text: Alice was not much surprised at this, she was getting so used to queer things happening.
Perturbed text: Eric was not much surprised at this, he was getting so used to TensorFlow breaking.
Voice North American female
base:
tanh-128: