Audio samples from "Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation"

Paper: arXiv

Authors: Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanevsky, Ye Jia

Abstract: We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, and a vocoder that synthesizes a time-domain waveform. We demonstrate that this model can be trained to normalize speech from any speaker, regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from a deaf speaker, resulting in significant improvements in intelligibility and naturalness, measured via a speech recognizer and listening tests. Finally, demonstrating the utility of this model on other speech tasks, we show that the same model architecture can be trained to perform a speech separation task.
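As a rough illustration of the data flow the abstract describes, the sketch below pushes a mel spectrogram through a shared encoder and two decoder heads (spectrogram and phonemes). Everything here is a placeholder: the layer sizes, the random weights, and the frame-wise linear layers are illustrative assumptions only; the actual model is an attention-based sequence-to-sequence network whose configuration is given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- NOT the paper's actual sizes.
N_MELS, ENC_DIM, N_PHONEMES, OUT_BINS = 80, 256, 45, 1025

def relu(x):
    return np.maximum(x, 0.0)

# Random weights stand in for trained parameters.
W_enc = rng.normal(scale=0.01, size=(N_MELS, ENC_DIM))
W_spec = rng.normal(scale=0.01, size=(ENC_DIM, OUT_BINS))
W_phon = rng.normal(scale=0.01, size=(ENC_DIM, N_PHONEMES))

def parrotron_forward(mel):
    """mel: (T, N_MELS) input spectrogram -> (output spectrogram, phoneme logits)."""
    h = relu(mel @ W_enc)             # shared encoder
    spec = h @ W_spec                 # spectrogram decoder (fed to the vocoder)
    phoneme_logits = h @ W_phon       # auxiliary phoneme decoder used during training
    return spec, phoneme_logits

mel = rng.normal(size=(120, N_MELS))
spec, logits = parrotron_forward(mel)
```

The output spectrogram would then be handed to a vocoder (Griffin-Lim on this page) to produce a waveform; the phoneme decoder acts only as a training-time auxiliary objective.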

Click here for more from the Tacotron team.

Note: To obtain the best quality, we strongly recommend listening to the audio samples with headphones.

Contents

Section 3.1: Voice normalization

In this section, we present examples of running Parrotron to directly normalize speech to a TTS voice. Examples are from the VCTK corpus, licensed under the Open Data Commons Attribution License (ODC-By) v1.0. Waveforms are synthesized with a Griffin-Lim vocoder.
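The Griffin-Lim vocoder used for these samples recovers a waveform from a magnitude spectrogram by alternately enforcing the known magnitudes and the phase consistency of a valid STFT. A minimal NumPy sketch is below; the FFT size, hop length, window, and iteration count are arbitrary choices for illustration, not the settings used for this page.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Magnitude-and-phase STFT: (T, n_fft // 2 + 1) complex frames."""
    win = np.hanning(n_fft)
    frames = [np.fft.rfft(win * x[s:s + n_fft])
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)

def istft(S, n_fft=512, hop=128):
    """Overlap-add inverse, normalized by the accumulated squared window."""
    win = np.hanning(n_fft)
    out_len = (S.shape[0] - 1) * hop + n_fft
    x = np.zeros(out_len)
    norm = np.zeros(out_len)
    for t in range(S.shape[0]):
        frame = np.fft.irfft(S[t], n=n_fft)
        x[t * hop : t * hop + n_fft] += win * frame
        norm[t * hop : t * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=50, n_fft=512, hop=128):
    """Estimate phase for a magnitude spectrogram by fixed-point iteration."""
    rng = np.random.default_rng(0)
    S = mag * np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        x = istft(S, n_fft, hop)
        # Keep the phase implied by a real signal, re-impose the known magnitude.
        S = mag * np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(S, n_fft, hop)
```

Griffin-Lim needs no training, which makes it convenient for sample pages, at the cost of some audible phase artifacts compared with neural vocoders.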

Input Output

Extra

Examples of running the normalization model, which is trained on American English speech, on non-English inputs.

Input Output


Section 3.2: Normalization of hearing-impaired speech

In this section, we present examples of running Parrotron to convert atypical speech from a deaf speaker into fluent speech. The model is the same as the normalization model above, but trained with a male target speaker's voice. We include examples before and after adapting the model on 13.5 hours of speech from the deaf speaker.
 

After adaptation


Input Output    Reference Transcript
   Here is some information about Oklahoma
   When do the girls get to the party?
   You can use your regular name outside the game.
   Here are listings for Clarks Village near Ann Arbor.
   Maybe something happened to them.
   Amber India restaurant is open until two A M tomorrow.
   Here are your directions.
   OK, three hours 45 minutes.
   What is the weather tomorrow in Mountain View?
   I want to see if Parrotron would work for other languages.
   Salam [Arabic for Peace]
   I like Fadi.
   I'm hungry.

Before adaptation

Input Output


Section 3.3: Speech separation

In this section, we present examples of training Parrotron to perform a speech separation task. We train it to identify and extract the loudest speaker from a mixture of 8 overlapping speakers. Examples are from the VCTK corpus, licensed under the Open Data Commons Attribution License (ODC-By) v1.0. Waveforms are synthesized with a Griffin-Lim vocoder.
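Training pairs for this setup can be sketched as follows: overlap several single-speaker signals at different gains, and take the loudest one (by RMS after gain) as the target the model must extract. The gain range, signal lengths, and random signals below are hypothetical stand-ins for the actual VCTK-based data pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_mixture(utterances, gains):
    """Mix single-speaker signals; the target is the loudest one after gain."""
    n = max(len(u) for u in utterances)
    mix = np.zeros(n)
    scaled = []
    for u, g in zip(utterances, gains):
        s = np.zeros(n)
        s[:len(u)] = g * u          # zero-pad shorter utterances to a common length
        scaled.append(s)
        mix += s
    rms = [np.sqrt(np.mean(s ** 2)) for s in scaled]
    target = scaled[int(np.argmax(rms))]
    return mix, target

# Eight synthetic "utterances" stand in for VCTK speech.
utts = [rng.normal(size=rng.integers(8000, 16000)) for _ in range(8)]
gains = rng.uniform(0.2, 1.0, size=8)
mix, target = make_mixture(utts, gains)
```

The model would then be trained spectrogram-to-spectrogram on (mix, target) pairs, using the same architecture as the normalization task.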

Input Output


Acknowledgments

We thank Françoise Beaufays, Michael Brenner, Diamantino Caseiro, Zhifeng Chen, Mohamed Elfeky, Patrick Nguyen, Bhuvana Ramabhadran, Andrew Rosenberg, Jason Pelecanos, Johan Schalkwyk, Yonghui Wu, and Zelin Wu for useful feedback.