Abstract:
We learn audio representations by solving a novel self-supervised
learning task, which consists of predicting the phase of the short-time
Fourier transform from its magnitude. A convolutional encoder is used
to map the magnitude spectrum of the input waveform to a
lower-dimensional embedding. A convolutional decoder is then used to
predict the instantaneous frequency (i.e., the temporal rate of change
of the phase) from this embedding. To evaluate the quality of the
learned representations, we measure how they transfer to a wide variety of
downstream audio tasks. Our experiments reveal that the phase
prediction task leads to representations that generalize across
different tasks, partially bridging the gap with fully-supervised
models. In addition, we show that the predicted phase can be used to
initialize the Griffin-Lim algorithm, thus reducing the number
of iterations needed to reconstruct the waveform in the time domain.
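
To make the two quantities above concrete, here is a minimal
NumPy/librosa sketch (not the authors' implementation; the STFT
parameters and the toy Griffin-Lim loop are assumptions):

    # A minimal sketch: STFT settings (n_fft, hop) are assumed values.
    import numpy as np
    import librosa

    y, sr = librosa.load(librosa.ex('trumpet'))   # any example waveform
    n_fft, hop = 1024, 256

    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    magnitude = np.abs(S)                          # encoder input
    phase = np.angle(S)

    # Instantaneous frequency: temporal rate of change of the unwrapped
    # phase. This is the regression target for the decoder.
    inst_freq = np.diff(np.unwrap(phase, axis=1), axis=1)

    def griffin_lim(magnitude, init_phase, n_iter=10):
        # Griffin-Lim started from a given phase estimate rather than
        # from random phase.
        angles = np.exp(1j * init_phase)
        for _ in range(n_iter):
            y_rec = librosa.istft(magnitude * angles, hop_length=hop)
            rebuilt = librosa.stft(y_rec, n_fft=n_fft, hop_length=hop)
            angles = np.exp(1j * np.angle(rebuilt))
        return librosa.istft(magnitude * angles, hop_length=hop)

Since the instantaneous frequency is the temporal derivative of the
phase, a predicted instantaneous-frequency matrix can be integrated
back into a phase estimate by a cumulative sum along the time axis
(e.g., np.cumsum(predicted_if, axis=1), where predicted_if stands for a
hypothetical decoder output) and passed as init_phase above, so that
fewer iterations are needed than with a random start.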
Figure: Each column displays a different phase initialization before
running a couple of Griffin-Lim steps. For each audio snippet, a
corresponding "rainbowgram" is shown, where the color corresponds to
the instantaneous frequency, and the intensity is proportional to the
logarithm of the magnitude.
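
For completeness, a hedged sketch of how such a rainbowgram can be
rendered; it reuses magnitude and inst_freq from the sketch above, and
the HSV mapping is an assumption, since the exact colormap used for
these figures is not specified:

    # Hypothetical rainbowgram rendering: hue encodes instantaneous
    # frequency, brightness encodes log-magnitude.
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import hsv_to_rgb

    def rainbowgram(magnitude, inst_freq):
        # Drop the first magnitude frame so shapes match np.diff output.
        log_mag = np.log1p(magnitude[:, 1:])
        hue = (inst_freq - inst_freq.min()) / (np.ptp(inst_freq) + 1e-8)
        value = log_mag / (log_mag.max() + 1e-8)
        hsv = np.stack([hue, np.ones_like(hue), value], axis=-1)
        plt.imshow(hsv_to_rgb(hsv), origin='lower', aspect='auto')
        plt.xlabel('frame')
        plt.ylabel('frequency bin')
        plt.show()

    rainbowgram(magnitude, inst_freq)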
The audio examples on this page have been randomly selected and were
not seen during training.