Audio samples from "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

Paper: arXiv

Authors: Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

Abstract: In this work, we propose "Global Style Tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style — independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabelled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Contents

Style Tokens with neural vocoding

Style Selection
Prosody vs Audio Quality

Style control

Style selection
Style scaling
Text-side style control/morphing

Style transfer

Parallel transfer
Non-parallel transfer

Artificial noisy data

Tokens learned from a noisy dataset
Synthesizing with a clean Style Token

Style Tokens with neural vocoding

The experiments in our paper use the Griffin-Lim algorithm to convert the neural network's predictions (spectrograms) to audio waveforms. We use Griffin-Lim because it's relatively fast; it also demonstrates that the choice of vocoder doesn't affect Tacotron's prosody. Before demonstrating the wide range of GST capabilities using only Griffin-Lim vocoding, however, we first provide GST-augmented Tacotron samples with a neural (WaveNet) vocoder. These show the combined power of good prosodic style and high audio fidelity.

Style Selection

1. Long-form audiobook phrases

Here we show the effect of conditioning a GST-augmented Tacotron on individual Style Tokens. As you can hear, each token corresponds to a distinct "style", and synthesizing different phrases with the same token results in the same style.

Text: Thinking that he should probably wait for Filch to come back, Harry sank into a moth-eaten chair next to the desk. There was only one thing on it apart from his half-completed form: a large, glossy, purple envelope with silver lettering on the front.

Style 1
Style 2
Style 3
Style 4
Style 5

Text: She got up and went to the table to measure herself by it, and found that, as nearly as she could guess, she was now about two feet high, and was going on shrinking rapidly: she soon found out that the cause of this was the fan she was holding, and she dropped it hastily, just in time to avoid shrinking away altogether.

Style 1
Style 2
Style 3
Style 4
Style 5

Text: Only the photographs on the mantelpiece really showed how much time had passed. Ten years ago, there had been lots of pictures of what looked like a large pink beach ball wearing different-colored bonnets - but Dudley Dursley was no longer a baby, and now the photographs showed a large blond boy riding his first bicycle, on a carousel at the fair, playing a computer game with his father, being hugged and kissed by his mother.

Style 1
Style 2
Style 3
Style 4
Style 5

2. Short web search responses

Here we again condition a GST-augmented Tacotron on individual Style Tokens. Note that these short web search phrases are completely different from the long-form audiobook phrases above, demonstrating that GSTs control style independently of the text domain.

Text: There are several listings for gas station.

Style 1
Style 2
Style 3
Style 4
Style 5

Text: A subspace is a space that is wholly contained in another space.

Style 1
Style 2
Style 3
Style 4
Style 5

Text: United Airlines five six three from Los Angeles to New Orleans has Landed.

Style 1
Style 2
Style 3
Style 4
Style 5

Prosody vs Audio Quality

While WaveNet vocoding leads to high-fidelity audio, Global Style Tokens learn to capture stylistic variation entirely during Tacotron training, independently of the vocoding technique used afterwards. Here we include some samples to demonstrate that Tacotron models prosody, while WaveNet provides last-mile audio quality.

Baseline Tacotron (Griffin-Lim vocoding)
Baseline Tacotron (WaveNet vocoding)
GST Tacotron (Griffin-Lim vocoding) (Style 1)
GST Tacotron (WaveNet vocoding) (Style 1)

Style control

As noted above, the rest of the samples on this page use the Griffin-Lim algorithm to produce waveforms. These again show that Global Style Tokens model style differences entirely within Tacotron.

1. Style selection

These samples refer to Section 6.1.1 of our paper, "Style selection". They show the effect of conditioning the model on an individual style token. Note how the same token yields the same style for different text inputs.

Text: Here you go, a link for Biondo Racing Products and other related pages.

Token A
Token B
Token C
Token D
Token E

Text: The forecast for San Mateo tomorrow is sixty one degrees and Mostly Sunny.

Token A
Token B
Token C
Token D
Token E
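
As a rough illustration of what conditioning on an individual token could look like, the sketch below treats the token bank as a list of vectors and forms the style embedding as a weighted sum of tokens, so that selecting one token is a one-hot weighting. The bank here is random rather than trained, and `make_token_bank`, `style_embedding`, and `one_hot` are hypothetical names used only for this example, not the model's actual code.

```python
import random


def make_token_bank(num_tokens, dim, seed=0):
    """Hypothetical stand-in for a trained token bank: random vectors."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_tokens)]


def style_embedding(bank, weights):
    """Combine the tokens into one style embedding via a weighted sum."""
    dim = len(bank[0])
    return [sum(w * token[d] for w, token in zip(weights, bank))
            for d in range(dim)]


def one_hot(index, size):
    """Weights that put all mass on a single token."""
    return [1.0 if i == index else 0.0 for i in range(size)]


bank = make_token_bank(num_tokens=5, dim=4)

# Conditioning on "Token A" alone: the style embedding is just that token.
token_a = style_embedding(bank, one_hot(0, len(bank)))

# The same embedding would accompany every input text.
texts = [
    "Here you go, a link for Biondo Racing Products and other related pages.",
    "The forecast for San Mateo tomorrow is sixty one degrees and Mostly Sunny.",
]
conditioned_inputs = [(text, token_a) for text in texts]
```

Because the embedding depends only on the chosen token and not on the text, the same vector is reused for every phrase, which is consistent with one token yielding one style across different inputs.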

2. Style scaling

These samples refer to Section 6.1.2 of our paper, "Style scaling". They show that multiplying a token embedding by a scalar value intensifies its style effect. Note how the effect is decreased by negative values, despite the fact that they are not observed during training.

Text: Here you go, a link for Biondo Racing Products and other related pages.

Token A (faster speaking rate)

scale=-0.3
scale=0.1
scale=0.3
scale=0.5

Token B (more animated speech)

scale=-0.3
scale=0.1
scale=0.3
scale=0.5
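
The scaling described above amounts to multiplying every component of a token embedding by the same scalar. Here is a minimal sketch, assuming a token is a plain vector; the values in `token_a` are made up for illustration, and `scale_token` is a hypothetical helper name.

```python
def scale_token(token, scale):
    """Multiply each component of a token embedding by a scalar."""
    return [scale * value for value in token]


# Hypothetical token embedding; a trained token would come from the model.
token_a = [0.5, -0.25, 1.0, 0.0]

# The same scale values used for the samples above.
scales = [-0.3, 0.1, 0.3, 0.5]
scaled = {scale: scale_token(token_a, scale) for scale in scales}
```

A negative scale flips the sign of every component, which places the conditioning vector outside the range this sketch would produce with non-negative weights.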

3. Text-side style control/morphing

These samples refer to Section 6.1.4 of our paper, "Text-side style control/morphing". Here we show how style can be "morphed" by conditioning the encoder on different tokens as text input progresses. The top two samples demonstrate each token alone, and are followed by two morphed variations.

Text: Computer phone calls, which do everything from selling magazine subscriptions to reminding people about meetings have become the telephone equivalent of junk mail.

Token A alone (fast)
Token B alone (low pitched)
Token A → B
Token A → B → A
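
One way to picture morphing is as a schedule that assigns a style token to each position of the input text. The sketch below assumes the text is split into equal-length contiguous segments, one per token in the sequence. That segmentation rule, the placeholder token vectors, and the name `morph_schedule` are assumptions made for illustration; the paper may switch tokens differently.

```python
def morph_schedule(num_positions, token_sequence):
    """Return one token per input position, switching in equal segments."""
    segments = len(token_sequence)
    schedule = []
    for pos in range(num_positions):
        # Integer arithmetic maps position to segment index 0..segments-1.
        segment = pos * segments // num_positions
        schedule.append(token_sequence[segment])
    return schedule


text = ("Computer phone calls, which do everything from selling magazine "
        "subscriptions to reminding people about meetings have become the "
        "telephone equivalent of junk mail.")

# Hypothetical token embeddings standing in for Token A and Token B.
token_a = [0.9, 0.1]
token_b = [-0.2, 0.7]

# Token A → B: first half of the characters use A, second half use B.
schedule = morph_schedule(len(text), [token_a, token_b])

# Token A → B → A: three equal segments.
schedule_aba = morph_schedule(len(text), [token_a, token_b, token_a])
```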

Style transfer

Style transfer is an active area of research that aims to synthesize a phrase in the prosodic style of a given audio clip.

1. Parallel transfer

These samples refer to Section 6.2.1 of our paper, "Parallel Style Transfer". In parallel style transfer, the synthesizer is given an audio clip matching the text it's asked to synthesize (i.e. the source and target text are the same). Global Style Tokens are designed to be robust, so they don't transfer prosody perfectly. In these samples, prosodic style is transferred, though without fine time-aligned variations.

Text: Pull his canoe home with your line, Fisherman.

Raw audio (reference)
Baseline Tacotron
GST Tacotron

Text: You are not so important after all, Pau Amma, he said.

Raw audio (reference)
Baseline Tacotron
GST Tacotron
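
To make the idea of transferring style from a reference clip more concrete, the sketch below assumes the clip has already been encoded into a fixed-length vector, and that soft weights over the token bank come from a softmax over dot-product similarities. This single-head dot-product form is a simplification chosen for illustration and is not claimed to match the model's actual attention mechanism; all names and values here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_from_reference(reference_vec, bank):
    """Turn a reference embedding into token weights and a style embedding."""
    # Similarity of the reference to each token (dot product).
    scores = [sum(r * t for r, t in zip(reference_vec, token))
              for token in bank]
    weights = softmax(scores)
    dim = len(bank[0])
    # The style embedding is the weighted sum of the tokens.
    embedding = [sum(w * token[d] for w, token in zip(weights, bank))
                 for d in range(dim)]
    return weights, embedding


# Hypothetical two-token bank and reference embedding.
bank = [[1.0, 0.0], [0.0, 1.0]]
reference_vec = [5.0, 0.0]

weights, embedding = style_from_reference(reference_vec, bank)
```

Because the whole clip is summarized by one set of weights, a sketch like this carries over a global style but not fine time-aligned variations, in line with the samples above.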

2. Non-parallel transfer

These samples refer to Section 6.2.2 of our paper, "Non-Parallel Style Transfer". In non-parallel style transfer, the TTS system must transfer prosodic style when the source and target text are completely different. Below, compare the monotonous prosody of the baseline with long-form GST synthesis that uses a narrative source style.

Source 1: Something, however, happened this time that had not happened before; his stare into my face, through the glass and across the room, was as deep and hard as then, but it quitted me for a moment during which I could still watch it, see it fix successively several other things.

Target 1: He was pale as smoke, and Harry could see right through him to the dark sky and torrential rain outside. "You look troubled, young Potter," said Nick, folding a transparent letter as he spoke and tucking it inside his doublet. "So do you," said Harry.

Baseline Tacotron
GST Tacotron

Source 2: She got up and went to the table to measure herself by it, and found that, as nearly as she could guess, she was now about two feet high, and was going on shrinking rapidly: she soon found out that the cause of this was the fan she was holding, and she dropped it hastily, just in time to avoid shrinking away altogether.

Target 2: Harry got slowly out of bed and started looking for socks. He found a pair under his bed and, after pulling a spider off one of them, put them on. Harry was used to spiders, because the cupboard under the stairs was full of them, and that was where he slept.

Baseline Tacotron
GST Tacotron

Artificial noisy data

These samples refer to Section 7.1 of our paper, "Artificial Noisy Data".

1. Tokens learned from a noisy dataset

Here we play samples conditioned on tokens learned by training on a noisy dataset. The GST-augmented Tacotron learned to disentangle various acoustic factors, with the resulting tokens roughly corresponding to music, reverberation, noise, and clean speech.

Text: The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh and a large stone.

Token A (music)
Token B (reverb)
Token C (noise)
Token D (clean)

2. Synthesizing with a clean Style Token

Here we play samples from a baseline Tacotron and a GST Tacotron, both trained on data of which 90% had artificial noise added. The GST output is synthesized by conditioning on a clean token, manually identified after training.

Text: The Blue Lagoon is a nineteen eighty American romance and adventure film directed by Randal Kleiser.

Baseline Tacotron
GST Tacotron

Text: The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh and a large stone.

Baseline Tacotron
GST Tacotron