Audio samples from "Uncovering Latent Style Factors for Expressive Speech Synthesis"

Paper: arXiv
Authors: Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous
Abstract: Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way.


"neutral prosody": baseline model that is not trained with style tokens.
"token id": force the style attention head to only attend to the specified style token. Note: this method can lead to unintelligible speech since the training process learns to rely on a mixture of tokens, but it is a useful technique for getting a quick idea of the prosodic style each token corresponds to.
"mix w/ token id(s)": broadcast-add the embedding vector of a token to the full style embedding matrix to bias the overall style. Multiple styles can be mixed by consecutively applying the operation. It is also possible to do more sophisticated mixing and style recreation (e.g. time-varying), which is not shown.

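The two control methods described above ("token id" and "mix w/ token id(s)") can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the array shapes (10 tokens, 4-dimensional embeddings, 5 decoder steps), the random token table, and the `scale` parameter are all assumptions made for illustration only.

```python
import numpy as np

def one_hot_attention(num_tokens, token_id):
    """Attention weights that attend only to one style token ("token id")."""
    weights = np.zeros(num_tokens)
    weights[token_id] = 1.0
    return weights

def style_from_weights(weights, token_embeddings):
    """Weighted sum of token embeddings: (num_tokens,) @ (num_tokens, dim)."""
    return weights @ token_embeddings

def mix_with_token(style_matrix, token_embeddings, token_id, scale=1.0):
    """Broadcast-add one token's embedding to every row of a style matrix.

    style_matrix is assumed to have shape (steps, dim); `scale` is a
    hypothetical knob for the strength of the bias.
    """
    return style_matrix + scale * token_embeddings[token_id]

# Hypothetical token table: 10 style tokens, 4 dimensions each.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))

# "token id": the style embedding collapses to a single token.
forced_style = style_from_weights(one_hot_attention(10, 3), tokens)

# "mix w/ token 1+4": apply the broadcast-add twice, once per token.
frames = np.zeros((5, 4))  # placeholder for the full style embedding matrix
mixed = mix_with_token(mix_with_token(frames, tokens, 1), tokens, 4)
```

With one-hot weights, the style embedding equals the selected token's vector exactly, which matches the note that this setting departs from the token mixtures seen during training. Mixing is additive, so applying the operation for token 1 and then token 4 biases every row by the sum of the two embeddings.
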
"The Blue Lagoon is a 1980 American romance and adventure film directed by Randal Kleiser."
neutral prosody
token 0
token 1
token 2
token 3
token 4
token 5
token 6
token 7
token 8
token 9
mix w/ token 1 (sloppy)
mix w/ token 4 (high pitched)
mix w/ token 7 (prominence)
mix w/ token 1+4 (sloppy & high pitched)

"The forecast for San Mateo tomorrow is sixty one degrees and mostly sunny."
neutral prosody
token 0
token 1
token 2
token 3
token 4
token 5
token 6
token 7
token 8
token 9
mix w/ token 1 (sloppy)
mix w/ token 4 (high pitched)
mix w/ token 7 (prominence)
mix w/ token 1+4 (sloppy & high pitched)