Authors: Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanevsky, Ye Jia
Abstract:
We describe Parrotron, an end-to-end-trained speech-to-speech
conversion model that maps an input spectrogram directly to
another spectrogram, without utilizing any intermediate discrete
representation. The network is composed of an encoder, spectrogram and phoneme decoders,
followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model
can be trained to normalize speech from any speaker regardless
of accent, prosody, and background noise, into the voice of a
single canonical target speaker with a fixed accent and consistent
articulation and prosody. We further show that this normalization
model can be adapted to normalize highly atypical speech from
a deaf speaker, resulting in significant improvements in intelligibility and naturalness, measured via a speech recognizer and
listening tests. Finally, demonstrating the utility of this model
on other speech tasks, we show that the same model architecture
can be trained to perform a speech separation task.
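To make the architecture described in the abstract concrete, below is a shape-level Python sketch of the data flow: an encoder consumes an input spectrogram, a spectrogram decoder emits output frames that are passed to a vocoder, and an auxiliary phoneme decoder supervises the encoder during training. All dimensions, the uniform-attention stand-in, and the phoneme inventory size are illustrative assumptions, not the paper's configuration.

```python
# Shape-level sketch of the Parrotron data flow (illustrative, not the paper's model).
import numpy as np

rng = np.random.default_rng(0)

def encoder(mel):                      # (T_in, 80) -> (T_in, 512) hidden states
    W = rng.standard_normal((mel.shape[1], 512))
    return np.tanh(mel @ W)

def spectrogram_decoder(h, t_out=120): # attend over h, emit output spectrogram frames
    attn = np.full((t_out, h.shape[0]), 1.0 / h.shape[0])  # uniform attention stand-in
    W = rng.standard_normal((h.shape[1], 1025))
    return (attn @ h) @ W              # (T_out, 1025) spectrogram for the vocoder

def phoneme_decoder(h, t_out=40):      # auxiliary decoder predicting phoneme ids
    attn = np.full((t_out, h.shape[0]), 1.0 / h.shape[0])
    W = rng.standard_normal((h.shape[1], 45))  # assumed phoneme inventory size
    return ((attn @ h) @ W).argmax(axis=-1)    # (T_out,) phoneme ids

mel = rng.standard_normal((200, 80))   # input: 80-channel log-mel spectrogram
h = encoder(mel)
spec = spectrogram_decoder(h)          # passed to a vocoder (e.g. Griffin-Lim)
phones = phoneme_decoder(h)            # used as an auxiliary training target
print(spec.shape, phones.shape)
```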
In this section, we present examples of running Parrotron to convert atypical speech from a deaf speaker to fluent speech. The model is the same as the normalization model above, but trained with a male target speaker's voice. We include examples before and after adapting the model on 13.5 hours of speech from a deaf speaker (a sketch of the adaptation step follows the transcripts below).
After adaptation
[Audio samples omitted: each row of the original table pairs the input audio from the deaf speaker with the model's output audio.]
Reference transcripts:
Here is some information about Oklahoma
When do the girls get to the party?
You can use your regular name outside the game.
Here are listings for Clarks Village near Ann Arbor.
Maybe something happened to them.
Amber India restaurant is open until two A M tomorrow.
Here are your directions.
OK, three hours 45 minutes.
What is the weather tomorrow in Mountain View?
I want to see if Parrotron would work for other languages.
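The adaptation above amounts to continuing training of the pretrained normalization model on paired utterances from the new speaker. The following is a heavily simplified, runnable sketch of that recipe, using a single linear layer and synthetic data in place of the real network and corpus; every name and dimension here is an illustrative assumption, not the paper's implementation.

```python
# Hedged sketch of adaptation: fine-tune pretrained weights on new paired data.
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "pretrained" normalization model: a single linear map from input
# spectrogram frames to target spectrogram frames (illustration only).
W = rng.standard_normal((80, 80)) * 0.1       # assumed pretrained weights

# The adaptation data, faked here as random (input, target) frame pairs:
# frames of the deaf speaker's speech and of the canonical target voice.
X = rng.standard_normal((1000, 80))
Y = X @ (rng.standard_normal((80, 80)) * 0.1)

lr = 1e-3
for step in range(200):                       # continue training from pretrained weights
    pred = X @ W
    grad = X.T @ (pred - Y) / len(X)          # gradient of the L2 spectrogram loss
    W -= lr * grad

print(float(np.mean((X @ W - Y) ** 2)))       # adaptation loss after fine-tuning
```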
In this section, we present examples of training Parrotron to perform a speech separation task. We train it to identify and extract the loudest speaker in a mixture of 8 overlapping speakers.
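One natural way to build training examples for this task is to overlap several clean utterances at random gains and treat the loudest source as the target. A minimal sketch of such a mixing step follows; the sample counts, gain range, and loudness criterion are our assumptions, as the exact data pipeline is not specified here.

```python
# Build one synthetic separation example: mix 8 sources, target the loudest.
import numpy as np

rng = np.random.default_rng(2)

n_speakers, n_samples = 8, 16000
sources = rng.standard_normal((n_speakers, n_samples))  # stand-ins for clean utterances
gains = rng.uniform(0.1, 1.0, size=n_speakers)          # random per-speaker levels

mixture = (gains[:, None] * sources).sum(axis=0)        # model input: overlapped speech
loudest = int(np.argmax(gains * sources.std(axis=1)))   # loudest source by energy
target = gains[loudest] * sources[loudest]              # model target: that source alone
print(mixture.shape, loudest)
```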
Examples are from the VCTK corpus, licensed under the Open Data Commons Attribution License (ODC-By) v1.0. Output audio is synthesized with a Griffin-Lim vocoder.
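Griffin-Lim reconstructs a waveform from a magnitude-only spectrogram by iteratively estimating a consistent phase. Below is a minimal example using librosa's implementation; the spectrogram here comes from a sample clip rather than from Parrotron's decoder.

```python
# Waveform synthesis from a magnitude spectrogram via Griffin-Lim (librosa).
import numpy as np
import librosa

# Any magnitude spectrogram can stand in for the model's output here.
y, sr = librosa.load(librosa.example('trumpet'))
S = np.abs(librosa.stft(y, n_fft=2048))   # discard phase, keep magnitudes
y_hat = librosa.griffinlim(S, n_iter=32)  # iterative phase estimation
print(len(y), len(y_hat))                 # original vs. reconstructed sample counts
```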
We thank Françoise Beaufays, Michael Brenner, Diamantino Caseiro, Zhifeng Chen, Mohamed Elfeky, Patrick Nguyen, Bhuvana Ramabhadran, Andrew Rosenberg, Jason Pelecanos, Johan Schalkwyk, Yonghui Wu, and Zelin Wu for useful feedback.