Audio samples from "Predicting Expressive Speaking Style From Text in End-to-End Speech Synthesis"

Paper: arXiv

Authors: Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

Abstract: Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicting Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as "virtual" speaking style labels within Tacotron. TP-GST learns to predict stylistic renderings from text alone, requiring neither explicit labels during training, nor auxiliary inputs for inference. We show that, when trained on an expressive speech dataset, our system can render text with more pitch and energy variation than two state-of-the-art baseline models. We further demonstrate that TP-GSTs can synthesize speech with background noise removed, and corroborate these analyses with positive results on human-rated listener preference audiobook tasks. Finally, we demonstrate that multi-speaker TP-GST models successfully factorize speaker identity and speaking style.
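As a rough illustration of the idea described above (not the authors' implementation), the sketch below shows how a text-side predictor might produce either GST combination weights (TPCW) or a style embedding directly (TPSE) from a fixed-length summary of the text encoder states. It is a minimal sketch assuming PyTorch; the module name, dimensions, and layer choices are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextPredictedGST(nn.Module):
    """Hypothetical sketch of the two TP-GST prediction heads (not the paper's code).

    Given a fixed-length summary of the Tacotron text-encoder states, predict either
    (a) combination weights over the global style tokens (the TPCW variant), or
    (b) the style embedding directly (the TPSE variant).
    """

    def __init__(self, text_dim=256, num_tokens=10, style_dim=256):
        super().__init__()
        # Learned bank of global style tokens, as in GST-Tacotron.
        self.tokens = nn.Parameter(torch.randn(num_tokens, style_dim))
        # TPCW head: text summary -> logits over the style tokens.
        self.tpcw_head = nn.Linear(text_dim, num_tokens)
        # TPSE head: text summary -> style embedding.
        self.tpse_head = nn.Sequential(
            nn.Linear(text_dim, style_dim),
            nn.Tanh(),
            nn.Linear(style_dim, style_dim),
        )

    def forward(self, text_summary, mode="tpse"):
        if mode == "tpcw":
            # Predict combination weights, then mix the token bank with them.
            weights = F.softmax(self.tpcw_head(text_summary), dim=-1)
            return weights @ self.tokens
        # Predict the style embedding directly from text.
        return self.tpse_head(text_summary)
```

During training, the combination weights and style embedding computed by the reference-audio GST pathway would serve as the prediction targets for these heads (the "virtual" labels mentioned in the abstract), so no human style labels are needed; at inference time only the text is required.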

Click here for more from the Tacotron team.

Single Speaker Experiments

These examples refer to Section 4.1 of our paper, "Single Speaker Experiments".

1. Style token variation

These samples show the effect of conditioning a single-speaker TP-GST model on an individual style token. They demonstrate that, as in a GST-Tacotron, a TP-GST system learns a rich set of style tokens during training.

These samples use the Griffin-Lim algorithm to produce waveforms, showing that style tokens learn to represent speaking-style variation independently of the vocoder used.
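For context, Griffin-Lim reconstructs a waveform from a magnitude spectrogram by iteratively estimating a consistent phase. Below is a minimal, self-contained sketch using librosa; the toy input signal and STFT parameters are illustrative, not those used in the paper.

```python
import numpy as np
import librosa

# Toy input: a one-second 440 Hz tone standing in for model output.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

# In Tacotron the magnitude spectrogram is predicted by the model; here we
# compute one from the toy signal so the example is self-contained.
n_fft, hop_length = 1024, 256
magnitude = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length))

# Griffin-Lim iteratively estimates a phase consistent with the magnitudes
# and inverts the STFT to recover a waveform.
waveform = librosa.griffinlim(magnitude, n_iter=60,
                              hop_length=hop_length, win_length=n_fft)
```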
Text: And I have found both freedom of loneliness and the safety from being understood, for those who understand us enslave something in us.
Token A
Token B
Token C
Token D
Token E
Token F
Token G

2. Synthesizing with text-predicted style based on GST combination weights (TPCW-GST)

The following samples compare synthesis from a baseline Tacotron with a system that predicts style from text via GST combination weights at inference time ("TPCW-GST"). Note how text-predicted style often leads to clearer, more expressive speech.

These samples were generated using a WaveNet vocoder to illustrate the maximum spectral quality possible in both systems.
Text: And without a backward glance at Harry, Filch ran flat-footed from the office, Mrs. Norris streaking alongside him. Peeves was the school poltergeist, a grinning, airborne menace who lived to cause havoc and distress.
Tacotron | TPCW-GST
Text: "I expect they've let it rot to give it a stronger flavor," said Hermione knowledgeably, pinching her nose and leaning closer to look at the putrid haggis. "Can we move? I feel sick," said Ron.
Tacotron | TPCW-GST
Text: There had been a flying motorcycle in it. He had a funny feeling he'd had the same dream before. His aunt was back outside the door. "Are you up yet?" she demanded.
Tacotron | TPCW-GST
Text: Uncle Vernon now came in, smiling jovially as he shut the door. "Tea, Marge?" he said. "And what will Ripper take?"
Tacotron | TPCW-GST
Text: "Have you - did you read -?" he sputtered. "No," Harry lied quickly. Filch's knobbly hands were twisting together.
Tacotron | TPCW-GST
Text: "This is boring," Dudley moaned. He shuffled away. Harry moved in front of the tank and looked intently at the snake.
Tacotron | TPCW-GST

3. Synthesizing with text-predicted style based on GST style embedding (TPSE-GST)

The following samples compare synthesis from a baseline Tacotron with a system that predicts a style embedding directly from text at inference time ("TPSE-GST"). Anecdotally, this results in even clearer, more expressive speech than the TPCW-GST system.

These samples were generated using a WaveNet vocoder to illustrate the maximum spectral quality possible in both systems.
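Continuing the hypothetical TextPredictedGST sketch near the top of the page, the two inference modes differ only in which head supplies the style embedding; neither needs reference audio or style labels at synthesis time. The snippet below is illustrative only.

```python
import torch

# Hypothetical usage of the TextPredictedGST sketch above; `text_summary`
# would come from the Tacotron text encoder for the input sentence.
predictor = TextPredictedGST()
text_summary = torch.randn(1, 256)

style_tpcw = predictor(text_summary, mode="tpcw")  # weighted mix of the learned token bank
style_tpse = predictor(text_summary, mode="tpse")  # style embedding predicted directly
# Either vector then conditions the Tacotron decoder in place of a style
# embedding derived from reference audio.
```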
Text: "Thirty-six," he said, looking up at his mother and father. "That's two less than last year." "Darling, you haven't counted Auntie Marge's present, see, it's here under this big one from Mommy and Daddy."
Tacotron | TPSE-GST
Text: Harry sat up and gasped; the glass front of the boa constrictor's tank had vanished. The great snake was uncoiling itself rapidly, slithering out onto the floor. People throughout the reptile house screamed and started running for the exits.
Tacotron | TPSE-GST
Text: But nobody heard much more. Sir Patrick and the rest of the Headless Hunt had just started a game of Head Hockey and the crowd were turning to watch. Nearly Headless Nick tried vainly to recapture his audience, but gave up as Sir Patrick's head went sailing past him to loud cheers.
Tacotron | TPSE-GST
Text: "Harry, what was that all about?" said Ron, wiping sweat off his face. "I couldn't hear anything-- ." But Hermione gave a sudden gasp, pointing down the corridor. "Look!"
Tacotron | TPSE-GST
Text: Go with Errol. Ron'll look after you. I'll write him a note, explaining. And don't look at me like that" - Hedwig's large amber eyes were reproachful - "it's not my fault. It's the only way I'll be allowed to visit Hogsmeade with Ron and Hermione."
Tacotron | TPSE-GST
Text: "Do something about your hair!" Aunt Petunia snapped as he reached the hall. Harry couldn't see the point of trying to make his hair lie flat. Aunt Marge loved criticizing him, so the untidier he looked, the happier she would be.
Tacotron | TPSE-GST

4. Automatically removing background noise

These samples refer to Section 4.1.4 of our paper, "Automatic Denoising".

About 10% of the recordings used to train these models contain high-frequency background noise. The samples below show that, while the baseline Tacotron model reproduces this noise, the TP-GST model removes it without any supervision.

These samples were generated using a WaveNet vocoder to illustrate the maximum spectral quality possible in both systems.
Text: When he was dressed he went down the hall into the kitchen. The table was almost hidden beneath all Dudley's birthday presents. It looked as though Dudley had gotten the new computer he wanted, not to mention the second television and the racing bike.
Tacotron | TP-GST
Text: "Thirty-six," he said, looking up at his mother and father. "That's two less than last year." "Darling, you haven't counted Auntie Marge's present, see, it's here under this big one from Mommy and Daddy."
Tacotron | TP-GST
Text: With a quick glance at the door to check that Filch wasn't on his way back, Harry picked up the envelope and read: kwikspell A Correspondence Course in Beginners' Magic. Intrigued, Harry flicked the envelope open and pulled out the sheaf of parchment inside.
Tacotron | TP-GST
Text: Harry was at the point of telling Ron and Hermione about Filch and the Kwikspell course when the salamander suddenly whizzed into the air, emitting loud sparks and bangs as it whirled wildly round the room.
Tacotron | TP-GST
Text: The horses galloped into the middle of the dance floor and halted, rearing and plunging. At the front of the pack was a large ghost who held his bearded head under his arm, from which position he was blowing the horn. The ghost leapt down, lifted his head high in the air so he could see over the crowd -- everyone laughed --, and strode over to Nearly Headless Nick, squashing his head back onto his neck.
Tacotron | TP-GST

Multi-Speaker Experiments

These samples refer to Section 4.2 of our paper, "Multiple Speaker Experiments".

Speaker-independent style tokens

Like the single-speaker examples above, these samples show the effect of conditioning a multi-speaker TP-GST model on the same style token across voices. The output below comes from a multi-speaker model trained on one expressive dataset and 21 neutral-prosody datasets. Note that conditioning on a speaker ID yields audio that preserves that speaker's voice while adopting the speaking style the token learned during training.

These samples use the Griffin-Lim algorithm to produce waveforms, showing that style tokens learn to represent speaking-style variation independently of the vocoder used.
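As a rough illustration of how speaker identity and style-token conditioning might be combined (hypothetical names and dimensions; the paper does not publish code), one simple approach is to look up a learned speaker embedding and concatenate it, together with the style embedding, onto the encoder states that condition the decoder:

```python
import torch
import torch.nn as nn


class MultiSpeakerConditioning(nn.Module):
    """Hypothetical sketch: combine a learned speaker embedding with a style
    embedding (a single token here, or a text-predicted one) before decoding."""

    def __init__(self, num_speakers=22, speaker_dim=64, style_dim=256):
        super().__init__()
        self.speaker_table = nn.Embedding(num_speakers, speaker_dim)

    def forward(self, encoder_states, speaker_id, style_embedding):
        # encoder_states:  (batch, time, enc_dim) from the Tacotron text encoder
        # speaker_id:      (batch,) integer speaker IDs
        # style_embedding: (batch, style_dim), e.g. one token from the GST bank
        spk = self.speaker_table(speaker_id)
        cond = torch.cat([spk, style_embedding], dim=-1)
        # Broadcast the conditioning vector across time and attach it to every
        # encoder state seen by the attention/decoder.
        cond = cond.unsqueeze(1).expand(-1, encoder_states.size(1), -1)
        return torch.cat([encoder_states, cond], dim=-1)
```

The 22 speakers simply mirror the one-expressive-plus-21-neutral setup described above; the exact conditioning mechanism used in the paper may differ.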
Text: "Why do you bore me with these dreams of yours? They get more childish every time! You can't dream anything but sentimental nonsense!"
Token | Expressive voice | Neutral voice 1 | Neutral voice 2
Token A
Token B
Token C
Token D
Token E
Token F
Token G
Token H
Token I

Synthesizing with text-predicted style

This section compares a baseline Tacotron with TP-GST. The TP-GST model is the same as in the previous section, and the baseline Tacotron was trained using the same multi-speaker setup. The samples below show each model conditioned on one expressive voice and two neutral voices.

Note that, when conditioned on the expressive dataset's speaker ID, the multi-speaker TP-GST systems yield higher-quality output and more stylistic variation than a multi-speaker Tacotron. This is consistent with the single-speaker model results. As expected, when conditioned on the prosodically neutral speaker IDs, the style predicted by the multi-speaker TP-GST matches the limited dynamic range of those datasets.

These samples use the Griffin-Lim algorithm to produce waveforms.
Text: "Oh, to travel, to travel!" cried he; "there is no greater happiness in the world: it is the height of my ambition."
System | Baseline Tacotron | TPCW-GST | TPSE-GST
Expressive voice
Neutral voice 1
Neutral voice 2
Text: "Then commit them over again," he said gravely. "To get back one's youth, one has merely to repeat one's follies."
System | Baseline Tacotron | TPCW-GST | TPSE-GST
Expressive voice
Neutral voice 1
Neutral voice 2
Text: "Who do you think you are?" he said, in a harsh voice. "How dare you insult my sister?"
System | Baseline Tacotron | TPCW-GST | TPSE-GST
Expressive voice
Neutral voice 1
Neutral voice 2
Text: How the sunshine cheers me, and how sweet and refreshing is the rain; my happiness overpowers me, no one in the world can feel happier than I am.
System | Baseline Tacotron | TPCW-GST | TPSE-GST
Expressive voice
Neutral voice 1
Neutral voice 2