"neutral prosody": baseline model that is not trained with style tokens. |
"token id": force the style attention head to only attend to the specified style token. Note: this method can lead to unintelligible speech since the training process learns to rely on a mixture of tokens, but it is a useful technique for getting a quick idea of the prosodic style each token corresponds to. |
"mix w/ token id(s)": broadcast-add the embedding vector of a token to the full style embedding matrix to bias the overall style. Multiple styles can be mixed by consecutively applying the operation. It is also possible to do more sophisticated mixing and style recreation (e.g. time-varying), which is not shown. |
neutral prosody | |
---|---|
token 0 | |
token 1 | |
token 2 | |
token 3 | |
token 4 | |
token 5 | |
token 6 | |
token 7 | |
token 8 | |
token 9 | |
mix w/ token 1 (sloppy) | |
mix w/ token 4 (high pitched) | |
mix w/ token 7 (prominence) | |
mix w/ token 1+4 (sloppy & high pitched) | |
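
The "token id" and "mix w/ token id(s)" operations listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Tacotron team's implementation: the function names, the toy bank of ten 2-dimensional token embeddings, and the choice to give the style embedding matrix one row per time step are all hypothetical and made only for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def attend_single_token(tokens: Matrix, token_id: int) -> Vector:
    """Style embedding when attention is forced onto a single token.

    Assumption: forcing the attention head is modeled as one-hot
    attention weights over the token bank.
    """
    weights = [1.0 if i == token_id else 0.0 for i in range(len(tokens))]
    dim = len(tokens[0])
    # Weighted sum of token embeddings; with one-hot weights this
    # reduces to the selected token's embedding.
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


def mix_with_token(style: Matrix, token: Vector) -> Matrix:
    """Broadcast-add one token vector to every row of the style matrix.

    Assumption: `style` has one row per time step, so adding the same
    vector to each row biases the overall style.
    """
    return [[s + t for s, t in zip(row, token)] for row in style]


# Hypothetical toy data: 10 tokens with 2-dimensional embeddings.
tokens = [[float(i), float(-i)] for i in range(10)]

# Hypothetical style embedding matrix with 3 time steps.
style = [[0.5, 0.5], [0.25, 0.75], [0.0, 1.0]]

# "token 4": attend only to token 4.
forced = attend_single_token(tokens, 4)

# "mix w/ token 1+4": apply the mixing operation twice in a row.
mixed = mix_with_token(mix_with_token(style, tokens[1]), tokens[4])

print(forced)
print(mixed)
```

Forcing attention onto token 4 returns that token's embedding unchanged. Chaining two `mix_with_token` calls corresponds to the "mix w/ token 1+4" row: every time step is shifted by the sum of both token vectors.
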