Paper: arXiv
Authors: Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Kao
Abstract: Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
This page contains audio samples in support of the paper; we suggest listening to them while reading the paper. All utterances were unseen during training.
The systems being compared are Tacotron-GMMA, NAT, T5-TTS, and VAT (Very Attentive Tacotron, the proposed system).
Examples of synthesized speech for models trained on an internal multi-speaker dataset, with the exception of Tacotron-GMMA, which was trained only on the Lessac data. The utterances are taken from the collection of 885 utterances used in naturalness evaluations. Samples that deviate from the transcript are marked in red.
Text | Ground Truth | Tacotron-GMMA | T5-TTS | VAT (proposed) |
---|---|---|---|---|
That I can't remember, said the Hatter. | ||||
"I--I am very sorry." | ||||
He was occupied in his cigar, and in holding back the pliant boughs. | ||||
"No, but what'll Mrs. Deyo think tomorrow night? | ||||
He would stroll round the precincts of the court and call out: "I say, listen to this, Lucy. Three split infinitives." | ||||
Yet he might not have been so perfectly humane, so thoughtful in his generosity, so full of kindness and tenderness amidst his passion for adventurous exploit, had she not unfolded to him the real loveliness of beneficence and made the doing good the end and aim of his soaring ambition. | ||||
Examples of synthesized speech for models trained on the clean-460 subset of LibriTTS. The utterances are taken from the collection of 900 utterances used in naturalness evaluations. Samples that deviate from the transcript are marked in red.
Text | Ground Truth | NAT | T5-TTS | VAT (proposed) |
---|---|---|---|---|
"But don't you know anyone in London?" he asked in a sensible postscript. | ||||
Early in the afternoon a message came from the ship to say that all stores had been landed. | ||||
"The Ruler of a country ought to be treated with great respec'," declared Trot, a little indignantly, for she thought the pretty little queen was not being properly deferred to. | ||||
She picked him up in her arms, and the minute his head touched her shoulder he was sound asleep, the music at last hushed in his head. | ||||
For us, as connected with the idea of summer, it had a singular charm; and we watched its progress with excited feelings until nearly sunset, when the sky cleared off brightly, and we saw a shining line of water directing its course towards another, a broader and larger sheet. | ||||
Examples of synthesis of long utterances for models trained on the internal multi-speaker dataset. The maximum training lengths were 9.6 seconds and 192 phonemes. Samples that deviate from the transcript are marked in red.
Text | Tacotron-GMMA | T5-TTS | VAT (proposed) |
---|---|---|---|
Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys' front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen that fateful news report about the owls. | |||
The room held no sign at all that another boy lived in the house, too. Yet Harry Potter was still there, asleep at the moment, but not for long. His Aunt Petunia was awake and it was her shrill voice that made the first noise of the day. | |||
"Up! Get up! Now!" Harry woke with a start. His aunt rapped on the door again. "Up!" she screeched. | |||
Raindrops the size of bullets thundered on the castle windows for days on end; the lake rose, the flower beds turned into muddy streams, and Hagrid's pumpkins swelled to the size of garden sheds. Oliver Wood's enthusiasm for regular training sessions, however, was not dampened, which was why Harry was to be found, late one stormy Saturday afternoon a few days before Halloween, returning to Gryffindor Tower, drenched to the skin and splattered with mud. Even aside from the rain and wind it hadn't been a happy practice session. Fred and George, who had been spying on the Slytherin team, had seen for themselves the speed of those new Nimbus Two Thousand and Ones. They reported that the Slytherin team was no more than seven greenish blurs, shooting through the air like missiles. As Harry squelched along the deserted corridor he came across somebody who looked just as preoccupied as he was. Nearly Headless Nick, the ghost of Gryffindor Tower, was staring morosely out of a window, muttering under his breath, ". . . don't fulfill their requirements . . . half an inch, if that . . ." |
Examples of synthesis of long utterances for models trained on LibriTTS. The maximum training lengths were 9.6 seconds and 192 phonemes. Samples that deviate from the transcript are marked in red.
Text | NAT | T5-TTS | VAT (proposed) |
---|---|---|---|
Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys' front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen that fateful news report about the owls. | |||
The room held no sign at all that another boy lived in the house, too. Yet Harry Potter was still there, asleep at the moment, but not for long. His Aunt Petunia was awake and it was her shrill voice that made the first noise of the day. | |||
"Up! Get up! Now!" Harry woke with a start. His aunt rapped on the door again. "Up!" she screeched. | |||
Raindrops the size of bullets thundered on the castle windows for days on end; the lake rose, the flower beds turned into muddy streams, and Hagrid's pumpkins swelled to the size of garden sheds. Oliver Wood's enthusiasm for regular training sessions, however, was not dampened, which was why Harry was to be found, late one stormy Saturday afternoon a few days before Halloween, returning to Gryffindor Tower, drenched to the skin and splattered with mud. Even aside from the rain and wind it hadn't been a happy practice session. Fred and George, who had been spying on the Slytherin team, had seen for themselves the speed of those new Nimbus Two Thousand and Ones. They reported that the Slytherin team was no more than seven greenish blurs, shooting through the air like missiles. As Harry squelched along the deserted corridor he came across somebody who looked just as preoccupied as he was. Nearly Headless Nick, the ghost of Gryffindor Tower, was staring morosely out of a window, muttering under his breath, ". . . don't fulfill their requirements . . . half an inch, if that . . ." |
Examples of synthesis of utterances containing repeated words for models trained on the internal multi-speaker dataset. Samples that deviate from the transcript are marked in red. The number of occurrences of the repeated word in the synthesized audio is indicated next to the play button.
Text | T5-TTS | VAT (proposed) |
---|---|---|
I am really, really, super duper tired. | (2x) | (2x) |
I am really, really, really, really, really, really, super duper tired. | (7x) | (6x) |
Wow! That's pretty, pretty, pretty good! | (3x) | (3x) |
Wow! That's pretty, pretty, pretty, pretty, pretty good! | (7x) | (5x) |
Wow! That's pretty, pretty, pretty, pretty, pretty, pretty, pretty, pretty good! | (3x) | (8x) |
My phone number is 1, 800, 9, 9, 2. | (3x) | (2x) |
My phone number is 1, 800, 9, 9, 9, 2. | (7x) | (3x) |
My phone number is 1, 800, 9, 9, 9, 9, 9, 9, 9, 9, 2. | (29x) | (8x) |
My phone number is 1, 800, 9, 9, 9, 9, 9, 9, 9, 9, 9, 2. | (52x) | (9x) |
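The counts in the table can be compared against the number of times the word is repeated in the input text. The following is a minimal sketch, not part of the paper: the function name `longest_repeat_run` and the regex tokenization rule are assumptions made for illustration. It finds the longest run of a consecutively repeated word in a transcript.

```python
import re
from itertools import groupby
from typing import Tuple


def longest_repeat_run(text: str) -> Tuple[str, int]:
    """Return (word, count) for the longest run of one word repeated back to back.

    Hypothetical helper for illustration only. Tokens are lowercase runs of
    letters, digits, and apostrophes; punctuation is ignored.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    best_token, best_count = "", 0
    # groupby collapses consecutive identical tokens into one group each.
    for token, group in groupby(tokens):
        count = sum(1 for _ in group)
        if count > best_count:
            best_token, best_count = token, count
    return best_token, best_count


if __name__ == "__main__":
    rows = [
        "I am really, really, really, really, really, really, super duper tired.",
        "My phone number is 1, 800, 9, 9, 9, 9, 9, 9, 9, 9, 2.",
    ]
    for row in rows:
        print(longest_repeat_run(row))
```

For the second row of the table, this gives `("really", 6)`, which matches the count shown for VAT, whereas the T5-TTS sample contains 7 repetitions.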
Examples of synthesized speech from additional speakers for the T5-TTS and VAT models in section 1. The utterances are taken from a held-out subset of the internal multi-speaker dataset.
Text | Ground Truth | T5-TTS | VAT (proposed) |
---|---|---|---|
I don't have any plans today, I'm honestly more than happy to help you out. | |||
Thanks for understanding, you're right I guess. | |||
In discussing its fourth quarter earnings and its announcement that it would buy Rite Aid for more than $9 billion, one analyst pressed Walgreens management on whether now was the time to stop tobacco sales. | |||
Karim Benzema runs so fast, that the fans see him solely as a mirage. | |||
It's like you don't even want to graduate from college. | |||
That sucks. | |||
Honestly, I don't have that much faith in you anymore. | |||
You can't be so negative. | |||