Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, Rif A. Saurous
Abstract:
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.
We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different.
Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance.
We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
This page contains a set of audio samples in support of the conclusions in our paper: we suggest that the reader listen to the samples in conjunction with reading the paper.
All utterances were unseen in training.
A single-speaker model trained on audiobook recordings, conditioned on utterances from an expressive corpus of digital assistant responses. During training, the model saw neither these speakers nor digital-assistant-style utterances. The base model uses no prosody embedding, while the tanh-128 model uses a 128-dimensional tanh-scaled prosody embedding computed from the reference utterance.
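As a rough illustration of what "128-dimensional tanh-scaled" means, the sketch below projects a fixed-length summary of a reference utterance to 128 dimensions and squashes each coordinate with tanh. The summary vector, its size (`SUMMARY_DIM`), and the random weights are all hypothetical stand-ins. This page does not describe how the reference encoder produces its summary, so this is a minimal sketch under those assumptions rather than the model's actual code.

```python
import math
import random


def tanh_bottleneck(summary, weights, bias):
    """Linearly project a summary vector, then bound each output to [-1, 1]."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, summary)) + b)
        for row, b in zip(weights, bias)
    ]


rng = random.Random(0)
SUMMARY_DIM = 32   # hypothetical size of the reference-encoder summary
EMBED_DIM = 128    # matches the "tanh-128" label used on this page

# Stand-ins for a learned summary and learned projection parameters.
summary = [rng.gauss(0.0, 1.0) for _ in range(SUMMARY_DIM)]
weights = [[rng.gauss(0.0, 0.1) for _ in range(SUMMARY_DIM)]
           for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM

prosody_embedding = tanh_bottleneck(summary, weights, bias)
```

Whatever the length of the reference utterance, the result is a single vector of 128 values, each between -1 and 1.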
Reference: Text: How do bureaucrats wrap presents? With lots of red tape.
Voice
North American female
base:
tanh-128:
Reference: Text: Why are libraries so strict? They have to go by the book.
Voice
North American female
base:
tanh-128:
Reference: Text: Why are fish so smart? Because they hang out in schools so much.
Voice
North American female
base:
tanh-128:
Reference: Text: Heaps of things. Like fairy bread, how the surf is today and why magpies swoop.
Voice
North American female
base:
tanh-128:
Reference: Text: The past, the present, and the future walk into a bar. It was tense.
Voice
North American female
base:
tanh-128:
Reference: Text: I usually down a cup of java script. Then I put on nature sounds and run a few strenuous searches to improve my speed
Voice
North American female
base:
tanh-128:
Reference: Text: I don't have eyes, but I don't need them to know the vibe in here feels good
Voice
North American female
base:
tanh-128:
Reference: Text: What time do you go to the dentist? At tooth-hurty!
Voice
North American female
base:
tanh-128:
Reference: Text: Sweet dreams are made of these. Friendly Assistants who work hard to please
Voice
North American female
base:
tanh-128:
Reference: Text: You are what you eat. So I guess I'm a whole lot of data and a little bit of pizza recipes.
Voice
North American female
base:
tanh-128:
Reference: Text: Men say they know many things; But lo! they have taken wings, The arts and sciences, And a thousand appliances; The wind that blows Is all that any body knows.
Voice
North American female
base:
tanh-128:
Reference: Text: Do you prefer chocolate or jelly? Which would you like in your belly? You could make a good case, For a cool ice cream base, But I'd argue against vermicelli
Voice
North American female
base:
tanh-128:
Reference: Text: Halloween Edition it is! Remember to follow the moves as I say them.
Voice
North American female
base:
tanh-128:
Reference: Text: Why are archaeologists so annoyed? They always have a bone to pick.
Voice
North American female
base:
tanh-128:
Reference: Text: That one sailed RIGHT over my head.
Voice
North American female
base:
tanh-128:
Reference: Text: Wear your heart on your sleeve. It'll terrify people.
Using our 44-speaker dataset, we trained a baseline model and a prosody encoder model. Below, on the left is a set of reference utterances whose prosody we want to capture. On the right are utterances synthesized with this prosody, but with a voice already in the dataset (these are labeled tanh-128). We also provide a baseline from a Tacotron model with no prosody encoder. Notice how the synthesized audio captures the prosody of the reference speaker while keeping the voice characteristics of the target voice.
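One plausible way to combine a target voice with a reference prosody is to append the same speaker vector and the same prosody vector to every timestep of the text encoder output. The sketch below assumes that scheme. The function name, the toy dimensions, and all values are hypothetical, and this page does not confirm that the model conditions in exactly this way.

```python
def condition_encoder_outputs(text_states, speaker_embedding, prosody_embedding):
    """Append the same speaker and prosody vectors to every text timestep."""
    return [state + speaker_embedding + prosody_embedding
            for state in text_states]


# Toy sizes, chosen only for readability.
text_states = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
speaker_embedding = [1.0, 0.0]          # hypothetical target-voice vector
prosody_embedding = [-0.5, 0.25, 0.9]   # hypothetical reference-prosody vector

conditioned = condition_encoder_outputs(
    text_states, speaker_embedding, prosody_embedding
)
```

Because both vectors have a fixed length, the voice can be swapped by changing only `speaker_embedding` while the reference prosody vector stays the same.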
Reference: Text: The group didn't *WALK* through Central Park, they *RAN* through it.
Voice
North American female
North American male
Australian female
British female
British male
Indian female
base:
tanh-128:
Reference: Text: That'd be some cosmic conversation.
Voice
North American female
North American male
Australian female
British female
British male
Indian female
base:
tanh-128:
Reference: Text: No no, dear reader.
Voice
North American female
North American male
Australian female
British female
British male
Indian female
base:
tanh-128:
Reference: Text: *IS* that Utah travel agency?
Voice
North American female
North American male
Australian female
British female
British male
Indian female
base:
tanh-128:
Reference: Text: Happy Halloween.
Voice
North American female
North American male
Australian female
British female
British male
Indian female
base:
tanh-128:
Reference: Text: No one at fort griswold had been excepting anything especially after there had been six years of false alarms
Voice
North American female
North American male
Australian female
British female
British male
Indian female
base:
tanh-128:
Reference: Text: Only one was deployed, while they need a hundred teams.
Voice
North American female
North American male
Australian female
British female
British male
Indian female
base:
tanh-128:
Reference: Text: Quick as he himself thought, he was to keep the batsman on toes.
A 44-speaker model conditioned on utterances from an emotive corpus of audiobook recordings. In training, the 44-speaker model was exposed to neither the audiobook speaker nor any audiobook-style recordings. We are able to transfer the extreme variations in prosody from the audiobook speaker to our 44 speakers, even though the training dataset has far less prosodic variation.
Reference: Text: "Wait a minute!" called the Scarecrow.
Voice
Australian female
British female
North American male
North American female
base:
tanh-128:
Reference: Text: It will be good for both of you.
Voice
Australian female
British female
North American male
North American female
base:
tanh-128:
Reference: Text: "Oh, now and then," said Lucy, who had rather enjoyed herself.
Voice
Australian female
British female
North American male
North American female
base:
tanh-128:
Reference: Text: I've swallowed a pollywog.
Voice
Australian female
British female
North American male
North American female
base:
tanh-128:
Reference: Text: Not in that way.
Voice
Australian female
British female
North American male
North American female
base:
tanh-128:
Reference: Text: I charge you by all that is sacred, not to attempt concealment.
Voice
Australian female
British female
North American male
North American female
base:
tanh-128:
Reference: Text: I have escaped; and that I should escape, may be a matter of grateful wonder to you and myself.
Voice
Australian female
British female
North American male
North American female
base:
tanh-128:
Reference: Text: His had a cold, ugly look of dislike and contempt, and indifference to what would happen.
Voice
Australian female
British female
North American male
North American female
base:
tanh-128:
Reference: Text: I hate helping to hang heavy, hot, hairy hides on them.
This experiment examines the effect of changing the bottleneck size and activation of the prosody encoder. In general, the larger the bottleneck, the more faithful the reproduction of the reference prosody, but some of the speaker characteristics start leaking through. Interestingly, the softmax activation enforces a more aggressive bottlenecking of information.
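One way to see why softmax might constrain the embedding more than tanh is to apply the two activations to the same pre-activation vector. The values below are made up for illustration. The sketch shows only a general property of the two functions, not a measurement from the models on this page.

```python
import math


def tanh_activation(z):
    """Squash each coordinate independently into [-1, 1]."""
    return [math.tanh(v) for v in z]


def softmax_activation(z):
    """Map the vector to non-negative values that sum to 1."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


pre_activation = [2.0, -1.0, 0.5, -3.0]  # made-up values

tanh_out = tanh_activation(pre_activation)
softmax_out = softmax_activation(pre_activation)

# Adding a constant to every input leaves the softmax output unchanged,
# so one direction of the input carries no information through it.
shifted_out = softmax_activation([v + 10.0 for v in pre_activation])
```

With tanh, each coordinate can take either sign independently of the others. With softmax, the coordinates are tied together by the sum-to-one constraint, so a softmax embedding of a given size has one fewer degree of freedom than a tanh embedding of the same size.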
Reference: Text: 'It's spoilt, of course!' Here he looked at Tweedledee, who immediately sat down on the ground, and tried to hide himself under the umbrella.
Voice
North American female
base:
softmax-128:
softmax-16:
tanh-16:
tanh-32:
tanh-64:
tanh-128:
Reference: Text: So, to punish it, she held it up to the Looking-glass, that it might see how sulky it was -- 'and if you're not good directly,' she added, 'I'll put you through into Looking-glass House.
A single-speaker model trained on audiobook recordings. In these examples, the left-hand column contains an unseen reference utterance from the same speaker. On the right-hand side, we synthesize a perturbed version of that text: on top is synthesis with no conditioning, and on the bottom is the single-speaker model conditioned on the reference utterance.
Reference: Reference text: 'I can now,' said the Leopard. Perturbed text: 'I can now,' said the Porcupine.
Voice
North American female
base:
tanh-128:
Reference: Reference text: For the first time in her life she had been danced tired. Perturbed text: For the last time in his life he had been handily embarrassed.
Voice
North American female
base:
tanh-128:
Reference: Reference text: Second--Her family was very ancient and noble. Perturbed text: First--Her family was very sarcastic and horrible.
Voice
North American female
base:
tanh-128:
Reference: Reference text: Never again shall Eleanor Lavish be a friend of mine. Perturbed text: Never again shall Bartholomew Bigglesby be a son of mine.
Voice
North American female
base:
tanh-128:
Reference: Reference text: Now it is all dark. Perturbed text: Later it will all be lost.
Reference: Reference text: I cannot think why this wall is here, nor what it is made of. Perturbed text: I do not know why this guy is here, or what his name is.
Voice
North American female
base:
tanh-128:
Reference: Reference text: Alice was not much surprised at this, she was getting so used to queer things happening. Perturbed text: Eric was not much surprised at this, he was getting so used to TensorFlow breaking.