Audio samples from "Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech"

Paper: arXiv

Authors: Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Kao

Abstract: Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.

This page contains audio samples supporting the paper. We suggest listening to the samples while reading the paper. All utterances were unseen during training.

The systems being compared are:

Ground Truth: Human-produced audio.
Tacotron-GMMA: Tacotron with GMM-based attention.
- Recurrent autoregressive decoder with cross-attention to the encoded phoneme sequence predicting mel spectrogram frames (L1 loss).
NAT: Non-attentive Tacotron with unsupervised durations.
- Recurrent autoregressive decoder with Gaussian upsampling of the encoded phoneme sequence predicting mel spectrogram frames (L1+L2 loss).
T5 Baseline: Baseline transformer-based system.
- T5 transformer decoder with repeated multi-head cross-attentions to the encoded phoneme sequence predicting discrete audio tokens (negative log likelihood loss).
VAT (proposed): Very Attentive Tacotron.
- An extension of the T5 baseline model where the multi-head cross-attentions are informed by a single monotonic alignment position.
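
The VAT entry above describes cross-attention that is informed by a single monotonic alignment position. The following is a rough illustration of that general idea only, not the paper's formulation: it assumes a simple linear distance penalty around the alignment position, and the names `location_biased_attention`, `align_pos`, and `slope` are hypothetical. It shows how a location term can separate two encoder positions with identical content, which is the situation that arises with repeated words.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def location_biased_attention(scores, align_pos, slope=1.0):
    """Attention weights over encoder positions with a relative-location bias.

    scores:    content-based logits, one per encoder position (illustrative input).
    align_pos: current alignment position, a float index into the encoder sequence.
    slope:     hypothetical penalty per unit of distance from align_pos
               (an assumption made for this sketch, not taken from the paper).
    """
    biased = [s - slope * abs(j - align_pos) for j, s in enumerate(scores)]
    return softmax(biased)


# Positions 2 and 5 carry identical content (e.g. a repeated word), so
# content scores alone cannot tell them apart.
scores = [0.0, 0.0, 3.0, 0.0, 0.0, 3.0, 0.0]

unbiased = softmax(scores)
near_first = location_biased_attention(scores, align_pos=2.0)
near_second = location_biased_attention(scores, align_pos=5.0)
```

Without the bias, the two matching positions receive equal weight. With the bias, the weight concentrates on whichever match lies closer to the alignment position, so an alignment position that only moves forward would visit the repeats in order.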

Contents

1. Test Set Samples: Lessac Voice
2. Test Set Samples: LibriTTS
3. Generalization to Long Utterances: Lessac Voice
4. Generalization to Long Utterances: LibriTTS
5. Repeated Words
6. Additional Speakers: Internal Multi-speaker Dataset

1. Test Set Samples: Lessac Voice

Examples of synthesized speech for models trained on an internal multi-speaker dataset, with the exception of Tacotron-GMMA, which was trained only on the Lessac data. The utterances are taken from the collection of 885 utterances used in naturalness evaluations. Samples that deviate from the transcript are marked in red. Note that:

The T5 baseline and VAT both produce expressive, high-quality speech on short utterances without repeated words.
Tacotron-GMMA lacks expressivity and sometimes produces low-quality speech.

Text	Ground Truth	Tacotron-GMMA	T5 Baseline	VAT (proposed)
That I can't remember, said the Hatter.
"I--I am very sorry."
He was occupied in his cigar, and in holding back the pliant boughs.
"No, but what'll Mrs. Deyo think tomorrow night?
He would stroll round the precincts of the court and call out: "I say, listen to this, Lucy. Three split infinitives."
Yet he might not have been so perfectly humane, so thoughtful in his generosity, so full of kindness and tenderness amidst his passion for adventurous exploit, had she not unfolded to him the real loveliness of beneficence and made the doing good the end and aim of his soaring ambition.

2. Test Set Samples: LibriTTS

Examples of synthesized speech for models trained on the clean-460 subset of LibriTTS. The utterances are taken from the collection of 900 utterances used in naturalness evaluations. Samples which deviate from the transcript are marked in red. Note that:

The T5 baseline and VAT are similarly expressive on short utterances without repeated words.
NAT lacks expressivity and often has highly robotic pacing.

Text	Ground Truth	NAT	T5 Baseline	VAT (proposed)
"But don't you know anyone in London?" he asked in a sensible postscript.
Early in the afternoon a message came from the ship to say that all stores had been landed.
"The Ruler of a country ought to be treated with great respec'," declared Trot, a little indignantly, for she thought the pretty little queen was not being properly deferred to.
She picked him up in her arms, and the minute his head touched her shoulder he was sound asleep, the music at last hushed in his head.
For us, as connected with the idea of summer, it had a singular charm; and we watched its progress with excited feelings until nearly sunset, when the sky cleared off brightly, and we saw a shining line of water directing its course towards another, a broader and larger sheet.

3. Generalization to Long Utterances: Lessac Voice

Examples of synthesis of long utterances for models trained on the internal multi-speaker dataset. The maximum training lengths were 9.6 seconds and 192 phonemes. Samples which deviate from the transcript are marked in red. Note that:

The T5 baseline deviates from the transcript for utterances longer than those on which it was trained.
VAT adheres to the transcript, even for extremely long utterances.
Tacotron-GMMA adheres to the transcript but lacks expressivity.

Text	Tacotron-GMMA	T5 Baseline	VAT (proposed)
Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys' front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen that fateful news report about the owls.
The room held no sign at all that another boy lived in the house, too. Yet Harry Potter was still there, asleep at the moment, but not for long. His Aunt Petunia was awake and it was her shrill voice that made the first noise of the day.
"Up! Get up! Now!" Harry woke with a start. His aunt rapped on the door again. "Up!" she screeched.
Raindrops the size of bullets thundered on the castle windows for days on end; the lake rose, the flower beds turned into muddy streams, and Hagrid's pumpkins swelled to the size of garden sheds. Oliver Wood's enthusiasm for regular training sessions, however, was not dampened, which was why Harry was to be found, late one stormy Saturday afternoon a few days before Halloween, returning to Gryffindor Tower, drenched to the skin and splattered with mud. Even aside from the rain and wind it hadn't been a happy practice session. Fred and George, who had been spying on the Slytherin team, had seen for themselves the speed of those new Nimbus Two Thousand and Ones. They reported that the Slytherin team was no more than seven greenish blurs, shooting through the air like missiles. As Harry squelched along the deserted corridor he came across somebody who looked just as preoccupied as he was. Nearly Headless Nick, the ghost of Gryffindor Tower, was staring morosely out of a window, muttering under his breath, ". . . don't fulfill their requirements . . . half an inch, if that . . ."

4. Generalization to Long Utterances: LibriTTS

Examples of synthesis of long utterances for models trained on LibriTTS. The maximum training lengths were 9.6 seconds and 192 phonemes. Samples which deviate from the transcript are marked in red. Note that:

The T5 baseline deviates from the transcript for utterances longer than those on which it was trained.
VAT adheres to the transcript, even for extremely long utterances.
NAT adheres to the transcript but lacks expressivity and often has robotic pacing.

Text	NAT	T5 Baseline	VAT (proposed)
Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys' front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen that fateful news report about the owls.
The room held no sign at all that another boy lived in the house, too. Yet Harry Potter was still there, asleep at the moment, but not for long. His Aunt Petunia was awake and it was her shrill voice that made the first noise of the day.
"Up! Get up! Now!" Harry woke with a start. His aunt rapped on the door again. "Up!" she screeched.
Raindrops the size of bullets thundered on the castle windows for days on end; the lake rose, the flower beds turned into muddy streams, and Hagrid's pumpkins swelled to the size of garden sheds. Oliver Wood's enthusiasm for regular training sessions, however, was not dampened, which was why Harry was to be found, late one stormy Saturday afternoon a few days before Halloween, returning to Gryffindor Tower, drenched to the skin and splattered with mud. Even aside from the rain and wind it hadn't been a happy practice session. Fred and George, who had been spying on the Slytherin team, had seen for themselves the speed of those new Nimbus Two Thousand and Ones. They reported that the Slytherin team was no more than seven greenish blurs, shooting through the air like missiles. As Harry squelched along the deserted corridor he came across somebody who looked just as preoccupied as he was. Nearly Headless Nick, the ghost of Gryffindor Tower, was staring morosely out of a window, muttering under his breath, ". . . don't fulfill their requirements . . . half an inch, if that . . ."

5. Repeated Words

Examples of synthesis of utterances containing repeated words for models trained on the internal multi-speaker dataset. Samples which deviate from the transcript are marked in red. The number of occurrences of the repeated word in the synthesized audio is indicated next to the play button. Note that:

The T5 baseline deviates from the transcript for utterances with repeated words.
VAT adheres to the transcript, even for utterances with many repeated words.

Text	T5 Baseline	VAT (proposed)
I am really, really, super duper tired.	(2x)	(2x)
I am really, really, really, really, really, really, super duper tired.	(7x)	(6x)
Wow! That's pretty, pretty, pretty good!	(3x)	(3x)
Wow! That's pretty, pretty, pretty, pretty, pretty good!	(7x)	(5x)
Wow! That's pretty, pretty, pretty, pretty, pretty, pretty, pretty, pretty good!	(3x)	(8x)
My phone number is 1, 800, 9, 9, 2.	(3x)	(2x)
My phone number is 1, 800, 9, 9, 9, 2.	(7x)	(3x)
My phone number is 1, 800, 9, 9, 9, 9, 9, 9, 9, 9, 2.	(29x)	(8x)
My phone number is 1, 800, 9, 9, 9, 9, 9, 9, 9, 9, 9, 2.	(52x)	(9x)
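
The counts in the table above can be checked against the transcripts. The sketch below is an illustration rather than part of the paper's evaluation: it finds the longest run of a consecutively repeated token in each transcript and compares that expected count with the counts listed for each system. The tokenization rule and the helper names `longest_repeat` and `compare` are assumptions made for this example.

```python
import re
from itertools import groupby


def longest_repeat(text):
    """Return (token, run_length) for the longest run of identical consecutive tokens."""
    # Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    runs = [(tok, len(list(group))) for tok, group in groupby(tokens)]
    return max(runs, key=lambda run: run[1])


# (transcript, count listed for T5 Baseline, count listed for VAT), copied from the table.
ROWS = [
    ("I am really, really, super duper tired.", 2, 2),
    ("I am really, really, really, really, really, really, super duper tired.", 7, 6),
    ("Wow! That's pretty, pretty, pretty good!", 3, 3),
    ("Wow! That's pretty, pretty, pretty, pretty, pretty good!", 7, 5),
    ("Wow! That's pretty, pretty, pretty, pretty, pretty, pretty, pretty, pretty good!", 3, 8),
    ("My phone number is 1, 800, 9, 9, 2.", 3, 2),
    ("My phone number is 1, 800, 9, 9, 9, 2.", 7, 3),
    ("My phone number is 1, 800, 9, 9, 9, 9, 9, 9, 9, 9, 2.", 29, 8),
    ("My phone number is 1, 800, 9, 9, 9, 9, 9, 9, 9, 9, 9, 2.", 52, 9),
]


def compare(rows):
    """Return (word, expected, t5_matches, vat_matches) for each table row."""
    results = []
    for text, t5_count, vat_count in rows:
        word, expected = longest_repeat(text)
        results.append((word, expected, t5_count == expected, vat_count == expected))
    return results


for word, expected, t5_ok, vat_ok in compare(ROWS):
    print(f"{word!r} x{expected}: T5 match={t5_ok}, VAT match={vat_ok}")
```

This compares repetition counts only. A matching count does not by itself show that a sample follows the rest of the transcript.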

6. Additional Speakers: Internal Multi-speaker Dataset

Examples of synthesized speech from additional speakers for the T5 baseline and VAT models from Section 1. The utterances are taken from a held-out subset of the internal multi-speaker dataset.

Text	Ground Truth	T5 Baseline	VAT (proposed)
I don't have any plans today, I'm honestly more than happy to help you out.
Thanks for understanding, you're right I guess.
In discussing its fourth quarter earnings and its announcement that it would buy Rite Aid for more than $9 billion, one analyst pressed Walgreens management on whether now was the time to stop tobacco sales.
Karim Benzema runs so fast, that the fans see him solely as a mirage.
It's like you don't even want to graduate from college.
That sucks.
Honestly, I don't have that much faith in you anymore.
You can't be so negative.