Audio samples from "Speaker Generation"

Paper: arXiv

Authors: Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, and David Kao

Abstract: This work explores the task of synthesizing speech in non-existent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity.

This page contains a set of audio samples in support of the paper: we suggest that readers listen to the samples in conjunction with reading the paper.

Note that the TacoSpawn systems described in the paper extend a baseline Tacotron -- that is, they do not use any additional acoustic or prosody embeddings to improve the naturalness of synthesized speech.

Click here for more from the Tacotron team.

Contents

Novel speakers
Speaker Distance
Interactive t-SNE plot

Novel speakers

The audio clips below show examples of generated speakers. These were synthesized by a TacoSpawn system trained on the 1468-speaker English dataset described in our paper.
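To make the sampling step concrete, here is a minimal NumPy sketch of drawing a novel speaker embedding, assuming the learned distribution over the speaker embedding space is a diagonal Gaussian mixture. All names and shapes below are illustrative and are not taken from the paper's codebase.

```python
import numpy as np

# Illustrative sketch only: TacoSpawn trains its speaker prior jointly with
# the TTS model. Here we assume a pre-fitted diagonal Gaussian mixture with
# weights w (K,), means mu (K, D), and stddevs sigma (K, D).

def sample_speaker_embedding(w, mu, sigma, rng):
    """Draw one novel speaker embedding from a diagonal Gaussian mixture."""
    k = rng.choice(len(w), p=w)                     # pick a mixture component
    return mu[k] + sigma[k] * rng.standard_normal(mu.shape[1])

rng = np.random.default_rng(0)
w = np.array([0.5, 0.5])                            # toy 2-component mixture
mu = np.zeros((2, 128))                             # toy 128-dim embeddings
sigma = np.ones((2, 128))
embedding = sample_speaker_embedding(w, mu, sigma, rng)
# The TTS decoder would then be conditioned on `embedding` to synthesize
# speech in the corresponding novel voice.
```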

[Audio table: seven clips, numbered 1-7, for each speaker category below.]

British Male
British Female
American Male
American Female
Australian Male
Australian Female

Speaker Distance

Here we provide examples to accompany the "Speaker Distance" section of our paper. Speaker distance is measured using cosine similarity in d-vector space, and the audio clips below present the "closest" and "furthest away" speakers relative to a given generated or training set speaker. Samples are from a TacoSpawn model trained on the 1100-speaker US English dataset described in our paper.
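As a concrete reference for how such neighbors can be selected, here is a small NumPy sketch that ranks candidate speakers by cosine distance in d-vector space; the helper names are ours, not the paper's.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two d-vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nearest_and_furthest(query, candidates):
    """Indices of the closest and furthest candidates to a query d-vector.

    query:      (D,) d-vector of the reference speaker.
    candidates: (N, D) d-vectors of the same-gender comparison speakers.
    """
    dists = np.array([cosine_distance(query, c) for c in candidates])
    return int(np.argmin(dists)), int(np.argmax(dists))
```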

Starting from generated speakers

The table below shows a generated TacoSpawn speaker in the left-hand column, followed by "closest" and "furthest away" speaker samples of the same gender.

From left to right, the columns mean:

gen: a sample from the generated speaker
nearest synth: the same-gender training set speaker (synthesized) closest to it in d-vector space
furthest synth: the same-gender training set speaker (synthesized) furthest from it in d-vector space
nearest gen: the same-gender generated speaker closest to it in d-vector space
furthest gen: the same-gender generated speaker furthest from it in d-vector space

Rows: Male 1, Male 2, Male 3, Female 1, Female 2, Female 3.

Starting from training speakers

The table below shows a ground-truth utterance from a training set speaker in the left-hand column, followed by "closest" and "furthest away" speaker samples of the same gender.

From left to right, the columns mean:

ground-truth: a ground-truth recording of the training set speaker
synth: the same training set speaker, synthesized by the model
nearest synth: the same-gender training set speaker (synthesized) closest to it in d-vector space
furthest synth: the same-gender training set speaker (synthesized) furthest from it in d-vector space
nearest gen: the same-gender generated speaker closest to it in d-vector space
furthest gen: the same-gender generated speaker furthest from it in d-vector space

Rows: Male 1, Male 2, Male 3, Female 1, Female 2, Female 3.

Interactive t-SNE plot

The interactive t-SNE plot below shows speakers represented in d-vector space, colored by region. Each audio sample is from either a generated speaker ("gen") or a training set speaker ("synth"), synthesized by a TacoSpawn system trained on the 1468-speaker English dataset described in our paper. The two t-SNE clusters show a natural separation of speakers by gender.
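For readers who want to reproduce a static version of such a plot, a minimal sketch using scikit-learn's t-SNE on a matrix of d-vectors might look like the following; the file names and label array are placeholders, not artifacts released with the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder inputs: an (N, D) matrix of speaker d-vectors and an (N,)
# array of region labels. The paper's plot is interactive; this is static.
dvectors = np.load("dvectors.npy")
regions = np.load("regions.npy", allow_pickle=True)

# Cosine metric matches the distance used in d-vector space above.
coords = TSNE(n_components=2, metric="cosine", init="pca",
              random_state=0).fit_transform(dvectors)

for region in np.unique(regions):
    mask = regions == region
    plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=str(region))
plt.legend()
plt.show()
```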