Audio samples from "PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS"

Paper: arXiv

Authors: Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu.

Abstract: This paper introduces PnG BERT, a new encoder model for neural TTS. This model is augmented from the original BERT model, by taking both phoneme and grapheme representations of text as input, as well as the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner, and fine-tuned in a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training. Subjective side-by-side preference evaluations show that raters have no statistically significant preference between the speech synthesized using a PnG BERT and ground truth recordings from professional speakers.


The example from Section 1

In Section 1 of the paper, we described an example input, "To cancel the payment, press one; or to continue, two.", a pattern that is used frequently by conversational AI agents for call centers. In the phoneme representation of this sentence, the trailing "..., two." can be easily confused with "..., too.", which is used more frequently in English. However, in natural speech, different prosody is expected at the comma positions in these two patterns.

The baseline NAT is a state-of-the-art duration-based neural TTS model that takes phonemes as input. Because homophones are identical in the phoneme representation, it suffers from this ambiguity and produces speech that sounds confusing. In contrast, using PnG BERT as the encoder for NAT produces natural speech without confusion.
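The ambiguity described above can be illustrated with a minimal sketch. It assumes a hypothetical toy lexicon with ARPAbet-style symbols (`TOY_LEXICON`) and hypothetical helpers `to_phonemes` and `to_graphemes`; none of these come from the paper, whose text front end is not described on this page. Splitting words into characters is also a simplification of whatever grapheme tokenization the model actually uses, and punctuation is omitted.

```python
# Toy pronunciation lexicon with ARPAbet-style symbols. This is a hypothetical
# stand-in for a real pronunciation front end, used only for illustration.
TOY_LEXICON = {
    "to": ["T", "UW1"],
    "continue": ["K", "AH0", "N", "T", "IH1", "N", "Y", "UW0"],
    "two": ["T", "UW1"],
    "too": ["T", "UW1"],
}


def to_phonemes(words):
    """Concatenate the phonemes of each word (a phoneme-only input)."""
    return [p for w in words for p in TOY_LEXICON[w]]


def to_graphemes(words):
    """Split each word into characters (a simplified grapheme view)."""
    return [list(w) for w in words]


with_two = ["to", "continue", "two"]
with_too = ["to", "continue", "too"]

# The two phrases collapse to the same phoneme sequence...
same_phonemes = to_phonemes(with_two) == to_phonemes(with_too)
# ...but remain distinct in the grapheme sequence.
same_graphemes = to_graphemes(with_two) == to_graphemes(with_too)

print("phoneme inputs identical: ", same_phonemes)
print("grapheme inputs identical:", same_graphemes)
```

In this sketch, a model that sees only the phoneme sequence has no way to tell the two phrases apart, while a model that also sees graphemes does.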

NAT w/ PnG BERT NAT (baseline)

NAT w/ PnG BERT vs. NAT baseline

These examples correspond to the side-by-side preference test in Table 2 in the paper. They are cherry-picked to be representative.

Positive examples
NAT w/ PnG BERT NAT (baseline)
1: Here's a birthday present.
2: That soccer mom is also a piano teacher.
3: Did you put it in a salad bowl?
4: Take your Vitamin B complex when you wake up.
5: Window shopping can be such a tease.
Negative examples
NAT w/ PnG BERT NAT (baseline)
1: Meet Mr. Potato Head.
2: You need to take the long view, career-wise.
3: How about a back rub?
4: Performance reviews are stressful, time-consuming, and often meaningless.
5: Hand me the movie reviews.

NAT w/ PnG BERT vs. human recordings

These examples correspond to the side-by-side preference test in Table 3 in the paper. They are cherry-picked to be representative.

Positive examples
NAT w/ PnG BERT Human recording
1: Dirty bomb. A nuclear weapon improvised from radioactive nuclear waste material and conventional explosives.
2: The Omnibus Cartel Repeal Act Law Number Six will be on the bar exam.
3: My food is in the oven, since I'm working from home today.
4: Is Canada assuming him to be guilty?
5: Tina Fey's children are Alice Zenobia Richmond and Penelope Athena Richmond.
Negative examples
NAT w/ PnG BERT Human recording
1: Finn. A native or inhabitant of Finland or a person of Finnish descent.
2: Ok, pictures of teal nail polish with white polka dots.
3: According to English Grammar Revolution, there is an adverb describing the verb eat.
4: Bubble, bubble, toil, and... this thing.
5: I wish I had arms so I could give you a hug. But for now maybe a joke or some music might help.