Audio samples from "Real Time Spectrogram Inversion on Mobile Phone"

Paper: arXiv

Authors: Oleg Rybakov, Marco Tagliasacchi, Yunpeng Li, Liyang Jiang, Xia Zhang, Fadi Biadsy

Abstract: With the growth of computing power on mobile phones and privacy concerns over user’s data, on-device real-time speech processing has become an important research topic. In this paper, we focus on methods for real-time spectrogram inversion, where an algorithm receives a portion of the input signal (e.g., one frame) and processes it incrementally, i.e., operating in streaming mode. We present a real-time Griffin-Lim (GL) algorithm using a sliding-window approach in the STFT domain. The proposed algorithm is 2.4x faster than real time on the ARM CPU of a Pixel 4. In addition, we explore a neural vocoder operating in streaming mode and demonstrate the impact of lookahead on perceptual quality: as little as one hop size (12.5 ms) of lookahead is able to significantly improve perceptual quality in comparison to a causal model. We compare GL with the neural vocoder and show different trade-offs in terms of perceptual quality, on-device latency, algorithmic delay, memory footprint, and noise sensitivity. For a fair quality assessment of the GL approach, we use the input log-magnitude spectrogram without mel transformation. We evaluate the presented real-time spectrogram inversion approaches on clean, noisy, and atypical speech.
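The GL baselines below are distinguished by their number of phase-refinement iterations. As background, here is a minimal non-streaming Griffin-Lim sketch in NumPy/SciPy; it is an illustration of the classic algorithm, not the paper's implementation, and the FFT size, hop, and iteration count are assumed defaults.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=1024, hop=256, n_iter=70, seed=0):
    """Recover a waveform from a magnitude spectrogram `mag`
    (n_fft // 2 + 1 bins x T frames) by alternating projections:
    synthesize with the current phase, re-analyze, keep only the phase."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        # Synthesize with the current phase estimate ...
        _, x = istft(mag * phase, nperseg=n_fft, noverlap=n_fft - hop)
        # ... then re-analyze and discard the (inconsistent) magnitude.
        _, _, spec = stft(x, nperseg=n_fft, noverlap=n_fft - hop)
        t = min(spec.shape[1], mag.shape[1])  # padding may shift frame counts
        new_phase = np.zeros(mag.shape)
        new_phase[:, :t] = np.angle(spec[:, :t])
        phase = np.exp(1j * new_phase)
    _, x = istft(mag * phase, nperseg=n_fft, noverlap=n_fft - hop)
    return x
```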

Click here for more from the Tacotron team.

Note: To obtain the best quality, we strongly recommend listening to the audio samples with headphones.


Section 1: Vocoder evaluation on VCTK data

In this section, we compare the ground-truth audio with: non-streaming GL with 70 iterations (NonStreamGLi70), non-streaming GL with 3 iterations (NonStreamGLi3), streaming GL (StreamGLs4i4c2), the streaming neural vocoder with one hop of lookahead (StreamMelGANlookahead1), the non-streaming neural vocoder (NonStreamMelGANlookahead), and the causal vocoder (StreamMelGANcausal0).
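StreamGLs4i4c2 denotes the paper's streaming GL configuration; the sketch below is our own hedged illustration of the general sliding-window idea, where each incoming magnitude frame triggers a few phase-refinement iterations over a short buffer of recent frames and one hop of audio is emitted per frame. The class name, the 4-frame window, and the 4 iterations are illustrative assumptions (they are not read off the label), and proper overlap-add normalization across emitted hops is omitted for brevity.

```python
import numpy as np

class StreamingGL:
    """Hedged sketch of sliding-window Griffin-Lim (not the paper's exact
    algorithm): keep a short buffer of recent magnitude frames, refine the
    phase of the whole buffer for a few iterations per new frame, then emit
    the oldest hop of samples via overlap-add."""

    def __init__(self, n_fft=512, hop=128, win_frames=4, n_iter=4):
        self.n_fft, self.hop = n_fft, hop
        self.win_frames, self.n_iter = win_frames, n_iter
        self.window = np.hanning(n_fft)
        self.mags, self.phases = [], []

    def _ola(self):
        """Overlap-add the buffered frames into a short time segment."""
        n = (len(self.mags) - 1) * self.hop + self.n_fft
        x = np.zeros(n)
        for i, (m, p) in enumerate(zip(self.mags, self.phases)):
            frame = np.fft.irfft(m * np.exp(1j * p), self.n_fft)
            x[i * self.hop : i * self.hop + self.n_fft] += self.window * frame
        return x

    def push(self, mag_frame):
        """Accept one magnitude frame (n_fft // 2 + 1 bins); return `hop`
        new output samples."""
        self.mags.append(np.asarray(mag_frame, dtype=float))
        # Warm-start the new frame's phase from its neighbor.
        self.phases.append(self.phases[-1].copy() if self.phases
                           else np.zeros_like(self.mags[-1]))
        if len(self.mags) > self.win_frames:
            self.mags.pop(0)
            self.phases.pop(0)
        for _ in range(self.n_iter):
            x = self._ola()
            # Re-analyze each buffered frame; keep only the new phase.
            for i in range(len(self.mags)):
                seg = self.window * x[i * self.hop : i * self.hop + self.n_fft]
                self.phases[i] = np.angle(np.fft.rfft(seg, self.n_fft))
        return self._ola()[: self.hop]
```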
 
Examples are from the VCTK corpus licensed under the Open Data Commons Attribution License (ODC-By) v1.0.

[Audio sample table: Input | NonStreamGLi3 | NonStreamGLi70 | StreamGLs4i4c2 | StreamMelGANlookahead1 | NonStreamMelGANlookahead | StreamMelGANcausal0]


Section 2: Parrotron evaluation on atypical speech (Deaf) with different vocoders

Parrotron is an end-to-end speech conversion model trained to convert atypical speech to fluent speech in a canonical synthesized voice. The model takes a log-mel spectrogram as input and directly produces a linear spectrogram. It requires a vocoder to synthesize a time-domain waveform in real time, so a lightweight streaming vocoder is a crucial requirement for such an application.
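As a rough picture of how such a pipeline could be wired, the loop below pushes spectrogram frames one at a time into a streaming vocoder such as the StreamingGL sketch from Section 1. stream_synthesis and the random frame source are hypothetical stand-ins; Parrotron's actual interface is not described on this page.

```python
import numpy as np

def stream_synthesis(frame_source, vocoder):
    """Consume magnitude-spectrogram frames incrementally and yield audio
    hops. `vocoder` is any object with a push(frame) -> samples method,
    e.g. the StreamingGL sketch above (hypothetical glue, not Parrotron's API)."""
    for frame in frame_source:
        yield vocoder.push(frame)

# Illustrative usage: random frames stand in for the conversion model's output.
vocoder = StreamingGL(n_fft=512, hop=128)           # from the Section 1 sketch
frames = (np.random.rand(257) for _ in range(100))  # 257 = 512 // 2 + 1 bins
audio = np.concatenate(list(stream_synthesis(frames, vocoder)))
```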
 
In this section, we present examples of running Parrotron with different vocoders to convert atypical speech from a deaf speaker to fluent speech. We compare the ground-truth audio with: non-streaming GL with 70 iterations (NonStreamGLi70), streaming GL (StreamGLs4i4c2), and the streaming neural vocoder with one hop of lookahead (StreamMelGANlookahead1).

[Audio sample table: Input | NonStreamGLi70 | StreamGLs4i4c2 | StreamMelGANlookahead1]


Acknowledgments

The authors would like to thank the Parrotron and Catalyst teams for valuable discussions and suggestions.