Audio samples from "Speaker Generation"

Paper: arXiv

Authors: Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, and David Kao

Abstract: This work explores the task of synthesizing speech in non-existent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity.

This page contains a set of audio samples in support of the paper: we suggest that readers listen to the samples in conjunction with reading the paper.

Note that the TacoSpawn systems described in the paper extend a baseline Tacotron -- that is, they do not use any additional acoustic or prosody embeddings to improve the naturalness of synthesized speech.

Click here for more from the Tacotron team.

Contents

Novel speakers
Speaker Distance
Interactive t-SNE plot

Novel speakers

The audio clips below show examples of generated speakers. These were synthesized by a TacoSpawn system trained on the 1468-speaker English dataset described in our paper.
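To make the sampling step concrete, here is a minimal NumPy sketch of drawing a novel speaker embedding, assuming the learned distribution over the speaker embedding space is a diagonal Gaussian mixture. All names and shapes below are illustrative and are not taken from the paper's codebase.

```python
import numpy as np

# Illustrative sketch only: TacoSpawn trains its speaker prior jointly with
# the TTS model. Here we assume a pre-fitted diagonal Gaussian mixture with
# weights w (K,), means mu (K, D), and stddevs sigma (K, D).

def sample_speaker_embedding(w, mu, sigma, rng):
    """Draw one novel speaker embedding from a diagonal Gaussian mixture."""
    k = rng.choice(len(w), p=w)                     # pick a mixture component
    return mu[k] + sigma[k] * rng.standard_normal(mu.shape[1])

rng = np.random.default_rng(0)
w = np.array([0.5, 0.5])                            # toy 2-component mixture
mu = np.zeros((2, 128))                             # toy 128-dim embeddings
sigma = np.ones((2, 128))
embedding = sample_speaker_embedding(w, mu, sigma, rng)
# The TTS decoder would then be conditioned on `embedding` to synthesize
# speech in the corresponding novel voice.
```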

[Audio table: seven clips, numbered 1-7, for each speaker category below.]

British Male
British Female
American Male
American Female
Australian Male
Australian Female

Speaker Distance

Here we provide examples to accompany the "Speaker Distance" section of our paper. Speaker distance is measured using cosine similarity in d-vector space, and the audio clips below present the "closest" and "furthest away" speakers relative to a given generated or training set speaker. Samples are from a TacoSpawn model trained on the 1100-speaker US English dataset described in our paper.
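As a concrete reference for how such neighbors can be selected, here is a small NumPy sketch that ranks candidate speakers by cosine distance in d-vector space; the helper names are ours, not the paper's.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two d-vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nearest_and_furthest(query, candidates):
    """Indices of the closest and furthest candidates to a query d-vector.

    query:      (D,) d-vector of the reference speaker.
    candidates: (N, D) d-vectors of the same-gender comparison speakers.
    """
    dists = np.array([cosine_distance(query, c) for c in candidates])
    return int(np.argmin(dists)), int(np.argmax(dists))
```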

Starting from generated speakers

The table below shows a generated TacoSpawn speaker in the left-hand column, followed by "closest" and "furthest away" speaker samples of the same gender.

From left to right, the columns mean:

gen: a sample from the generated speaker
nearest synth: the same-gender training set speaker (synthesized) closest to it in d-vector space
furthest synth: the same-gender training set speaker (synthesized) furthest from it in d-vector space
nearest gen: the same-gender generated speaker closest to it in d-vector space
furthest gen: the same-gender generated speaker furthest from it in d-vector space

Rows: Male 1, Male 2, Male 3, Female 1, Female 2, Female 3.

Starting from training speakers

The table below shows a ground-truth utterance from a training set speaker in the left-hand column, followed by "closest" and "furthest away" speaker samples of the same gender.

From left to right, the columns mean:

ground-truth: a ground-truth recording of the training set speaker
synth: the same training set speaker, synthesized by the model
nearest synth: the same-gender training set speaker (synthesized) closest to it in d-vector space
furthest synth: the same-gender training set speaker (synthesized) furthest from it in d-vector space
nearest gen: the same-gender generated speaker closest to it in d-vector space
furthest gen: the same-gender generated speaker furthest from it in d-vector space

Rows: Male 1, Male 2, Male 3, Female 1, Female 2, Female 3.

Interactive t-SNE plot

The interactive t-SNE plot below shows speakers represented in d-vector space, colored by region. Each audio sample is from either a generated speaker ("gen") or a training set speaker ("synth"), synthesized by a TacoSpawn system trained on the 1468-speaker English dataset described in our paper. The two t-SNE clusters show a natural separation of speakers by gender.
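For readers who want to reproduce a static version of such a plot, a minimal sketch using scikit-learn's t-SNE on a matrix of d-vectors might look like the following; the file names and label array are placeholders, not artifacts released with the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder inputs: an (N, D) matrix of speaker d-vectors and an (N,)
# array of region labels. The paper's plot is interactive; this is static.
dvectors = np.load("dvectors.npy")
regions = np.load("regions.npy", allow_pickle=True)

# Cosine metric matches the distance used in d-vector space above.
coords = TSNE(n_components=2, metric="cosine", init="pca",
              random_state=0).fit_transform(dvectors)

for region in np.unique(regions):
    mask = regions == region
    plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=str(region))
plt.legend()
plt.show()
```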