Authors: Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous
Abstract: In this work, we propose "Global Style Tokens" (GSTs), a bank of
embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech
synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large
range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft
interpretable "labels" they generate can be used to control synthesis in novel ways, such as
varying speed and speaking style, independently of the text content. They can also be used for
style transfer, replicating the speaking style of a single audio clip across an entire long-form
text corpus. When trained on noisy, unlabelled found data, GSTs learn to factorize noise and
speaker identity, providing a path towards highly scalable but robust speech synthesis.
The experiments in our paper use the Griffin-Lim algorithm to convert the neural network's
predictions (spectrograms) to audio waveforms. We use Griffin-Lim because it's
relatively fast; it also demonstrates that the choice of vocoder doesn't affect Tacotron's prosody.
Before demonstrating the wide range of GST capabilities using only Griffin-Lim
vocoding, however, we first provide GST-augmented Tacotron samples with a neural (WaveNet) vocoder.
These show the combined power of good prosodic style and high audio fidelity.
Here we show the effect of conditioning a GST-augmented Tacotron on
individual Style Tokens. As you can hear, each token corresponds to a distinct "style", and
synthesizing different phrases with the same token results in the same style.
Text: Thinking that he should probably wait for Filch to come back, Harry sank into a
moth-eaten chair next to the desk. There was only one thing on it apart from his half-completed form: a large, glossy, purple envelope with silver lettering on the
front.
Style 1
Style 2
Style 3
Style 4
Style 5
Text: She got up and went to the table to measure herself by it, and found that, as nearly
as she could guess, she was now about two feet high, and was going on shrinking rapidly: she
soon found out that the cause of this was the fan she was holding, and she dropped it hastily,
just in time to avoid shrinking away altogether.
Style 1
Style 2
Style 3
Style 4
Style 5
Text: Only the photographs on the mantelpiece really showed how much time had passed. Ten
years ago, there had been lots of pictures of what looked like a large pink beach ball wearing
different-colored bonnets - but Dudley Dursley was no longer a baby, and now the photographs
showed a large blond boy riding his first bicycle, on a carousel at the fair, playing a
computer game with his father, being hugged and kissed by his mother.
Here we again condition a GST-augmented Tacotron with individual Style Tokens. Note that
these short web search phrases are completely different from the long-form audiobook phrases
above, demonstrating that GSTs control style independently of the text domain.
Text: There are several listings for gas station.
Style 1
Style 2
Style 3
Style 4
Style 5
Text: A subspace is a space that is wholly contained in another space.
Style 1
Style 2
Style 3
Style 4
Style 5
Text: United Airlines five six three from Los Angeles to New Orleans has Landed.
While WaveNet vocoding leads to high-fidelity audio, Global Style Tokens learn to capture
stylistic variation entirely during Tacotron training, independently of the vocoding
technique used afterwards. Here we include some samples to demonstrate that Tacotron models
prosody, while WaveNet provides last-mile audio quality.
Text: She got up and went to the table to measure herself by it, and
found that, as nearly as she could guess, she was now about two feet high, and was going on
shrinking rapidly: she soon found out that the cause of this was the fan she was holding, and
she dropped it hastily, just in time to avoid shrinking away altogether.
As noted above, the rest of the samples on this page use the Griffin-Lim algorithm to
produce waveforms. These again show that Global Style Tokens model style differences
entirely within Tacotron.
These samples refer to Section 6.1.1 of our paper, "Style selection".
They show the effect of conditioning the model on an individual style token.
Note how the same token yields the same style for different text inputs.
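To make the idea of conditioning on a single token concrete, here is a minimal sketch. It assumes the chosen token embedding is simply added to every text-encoder state. The bank `TOKEN_BANK`, the function `condition_on_token`, and all numeric values are hypothetical and chosen only for illustration; real encoder states and tokens come from the trained model.

```python
# Hypothetical bank of learned style tokens (3 tokens, 4 dimensions; values invented).
TOKEN_BANK = [
    [0.5, -0.25, 0.125, 0.0],
    [-0.5, 0.75, 0.0, 0.25],
    [0.25, 0.25, -0.5, 0.5],
]


def condition_on_token(encoder_states, token_bank, token_index):
    """Add one style token to every text-encoder state (assumed conditioning scheme)."""
    style = token_bank[token_index]
    return [[value + offset for value, offset in zip(state, style)]
            for state in encoder_states]


# Two stand-in "phrases" of different lengths; real states come from the text encoder.
short_phrase = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
long_phrase = [[1.0, 1.0, 1.0, 1.0] for _ in range(5)]

# The same token index is applied to both phrases.
styled_short = condition_on_token(short_phrase, TOKEN_BANK, 1)
styled_long = condition_on_token(long_phrase, TOKEN_BANK, 1)
```

Because the same offset is applied at every step regardless of the text, the style signal is identical across phrases, which matches the behaviour heard in the samples.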
Text: Here you go, a link for Biondo Racing Products and other related pages.
Token A
Token B
Token C
Token D
Token E
Text: The forecast for San Mateo tomorrow is sixty one degrees and Mostly Sunny.
These samples refer to Section 6.1.2 of our paper, "Style scaling".
They show that multiplying a token embedding by a scalar value intensifies its style effect.
Note how the effect is decreased by negative values, despite the fact that they are not observed during training.
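The scaling operation itself is simple enough to sketch in a few lines. The token values and the helper name `scale_token` below are hypothetical; only the multiplication by a scalar is taken from the description above.

```python
def scale_token(token, scale):
    """Multiply every dimension of a style token embedding by a scalar."""
    return [scale * value for value in token]


# Hypothetical 4-dimensional token embedding; values invented for illustration.
token = [0.5, -0.25, 0.125, 0.0]

# A scale of 1.0 leaves the token unchanged; negative scales lie outside
# the range observed during training.
scaled = {scale: scale_token(token, scale) for scale in (-1.0, 0.5, 1.0, 2.0)}
```

The scaled embedding would then be used for conditioning in place of the original token.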
Text: Here you go, a link for Biondo Racing Products and other related pages.
These samples refer to Section 6.1.4 of our paper, "Text-side style control/morphing".
Here we show how style can be "morphed" by conditioning the encoder on different
tokens as text input progresses. The top two samples demonstrate each token alone,
and are followed by two morphed variations.
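One way to picture this kind of morphing is as a per-step schedule of tokens. The sketch below assumes a hard switch between tokens at chosen encoder steps; the function `per_step_styles`, the schedule format, and the token values are hypothetical and not taken from the paper.

```python
def per_step_styles(num_steps, schedule):
    """Return one style token per encoder step.

    schedule is a list of (start_step, token) pairs sorted by start_step;
    each token applies from its start_step until the next entry begins.
    """
    if not schedule or schedule[0][0] != 0:
        raise ValueError("schedule must start at step 0")
    styles = []
    current = 0
    for step in range(num_steps):
        # Advance to the next schedule entry once its start step is reached.
        while current + 1 < len(schedule) and step >= schedule[current + 1][0]:
            current += 1
        styles.append(schedule[current][1])
    return styles


# Hypothetical 2-dimensional tokens; values invented for illustration.
token_a = [0.5, -0.25]
token_b = [-0.5, 0.75]

# First half of the text uses token A, second half uses token B.
morphed = per_step_styles(6, [(0, token_a), (3, token_b)])
```

Each entry of `morphed` would condition the encoder state at the matching text position, so the style changes partway through the sentence.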
Text: Computer phone calls, which do everything from selling magazine subscriptions to reminding
people about meetings have become the telephone equivalent of junk mail.
These samples refer to Section 6.2.1 of our paper, "Parallel Style Transfer".
In parallel style transfer, the synthesizer is given an audio clip matching the text it's asked to synthesize
(i.e. the source and target text are the same). Global Style Tokens are designed to be
robust, so they don't transfer prosody perfectly. In these samples, prosodic style
is transferred, though without fine time-aligned variations.
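This page does not describe how a reference clip is turned into a style. In the paper, a reference encoder summarises the clip into a fixed-length vector, and attention over the token bank produces combination weights. The sketch below is a simplified single-head, dot-product version of that idea, not the paper's exact attention module. `TOKEN_BANK`, `reference_embedding`, `style_from_reference`, and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def style_from_reference(reference_embedding, token_bank):
    """Attend over the token bank and return (weights, style embedding)."""
    # Dot-product similarity between the reference summary and each token.
    scores = [sum(r * t for r, t in zip(reference_embedding, token))
              for token in token_bank]
    weights = softmax(scores)
    dim = len(token_bank[0])
    # The style embedding is the weighted sum of the tokens.
    style = [sum(w * token[d] for w, token in zip(weights, token_bank))
             for d in range(dim)]
    return weights, style


# Hypothetical 2-dimensional token bank and reference summary; values invented.
TOKEN_BANK = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
reference_embedding = [4.0, 0.0]

weights, style = style_from_reference(reference_embedding, TOKEN_BANK)
```

Under this reading the whole clip is reduced to a single vector of token weights, which is consistent with the note above that fine time-aligned variations are not transferred.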
Text: Pull his canoe home with your line, Fisherman.
Raw audio (reference)
Baseline Tacotron
GST Tacotron
Text: You are not so important after all, Pau Amma, he said.
These samples refer to Section 6.2.2 of our paper, "Non-Parallel Style Transfer".
In non-parallel style transfer, the TTS system must transfer prosodic style
when the source and target text are completely different.
Below, contrast the monotonous prosody of the baseline with examples of long-form
synthesis with a narrative source style.
Source 1: Something, however, happened this time that had not happened before;
his stare into my face, through the glass and across the room, was as deep and hard as then,
but it quitted me for a moment during which I could still watch it, see it fix successively several other things.
Target 1: He was pale as smoke, and Harry could see right through him to the dark sky and
torrential rain outside. "You look troubled, young Potter," said Nick, folding a transparent
letter as he spoke and tucking it inside his doublet. "So do you," said Harry.
Baseline Tacotron
GST Tacotron
Source 2: She got up and went to the table to measure herself by it, and found that, as
nearly as she could guess, she was now about two feet high, and was going on shrinking
rapidly: she soon found out that the cause of this was the fan she was holding, and she
dropped it hastily, just in time to avoid shrinking away altogether.
Target 2: Harry got slowly out of bed and started looking for socks. He found a pair
under his bed and, after pulling a spider off one of them, put them on. Harry was used to
spiders, because the cupboard under the stairs was full of them, and that was where he slept.
Here we play samples conditioned on tokens learned from training on a noisy dataset.
Tacotron learned to disentangle various acoustic factors, with the resulting tokens roughly corresponding to music, reverberation, noise, and clean speech.
Text: The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh and a large stone.
Here we play samples from both a baseline and GST Tacotron where 90% of the training data had artificial noise added to it.
The GST output is synthesized by conditioning on a clean token,
manually identified after training.
Text: The Blue Lagoon is a nineteen eighty American romance and adventure film directed by Randal Kleiser.
Baseline Tacotron
GST Tacotron
Text: The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh and a large stone.