Source Separation by Flow Matching

By Robin Scheibler, John R. Hershey, Arnaud Doucet, and Henry Li (Google DeepMind)

Abstract

We consider the problem of single-channel audio source separation, with the goal of reconstructing K sources from their mixture. We address this ill-posed problem with FLOSS (FLOw matching for Source Separation), a constrained generation method based on flow matching that ensures strict mixture consistency. Flow matching is a general methodology that, given samples from two probability distributions defined on the same space, learns an ordinary differential equation that maps a sample from one distribution to a sample from the other. In our context, we have access to samples from the joint distribution of the K sources, and therefore to the corresponding samples from the lower-dimensional distribution of their mixture. To apply flow matching, we augment these mixture samples with artificial noise components so that the resulting "augmented" distribution matches the dimensionality of the K-source distribution. Additionally, because any permutation of the sources yields the same mixture, we adopt an equivariant formulation of flow matching, which relies on a custom-designed neural network architecture. We demonstrate the performance of the method on the separation of overlapping speech.
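To make the idea of mixture-consistent generation concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the mixture is the plain sum of the K sources, and it shows one simple way to keep every intermediate sample consistent with that mixture: start from the mixture spread evenly over the sources plus zero-sum noise, then integrate an ODE with Euler steps whose velocity is projected to sum to zero across sources. The names `project_zero_sum`, `initial_sample`, `toy_velocity`, and `separate` are hypothetical, and `toy_velocity` is only a stand-in for a trained, permutation-equivariant network.

```python
import numpy as np


def project_zero_sum(v):
    """Remove the part of v that would change the sum over sources (axis 0)."""
    return v - v.mean(axis=0, keepdims=True)


def initial_sample(mixture, num_sources, rng, noise_scale=1.0):
    """Spread the mixture evenly over the sources, then add zero-sum noise.

    The noise supplies the extra dimensions that the mixture alone lacks,
    while the sum over sources stays equal to the mixture.
    """
    noise = rng.standard_normal((num_sources, mixture.shape[-1]))
    return mixture[None, :] / num_sources + noise_scale * project_zero_sum(noise)


def toy_velocity(x, t):
    """Hypothetical stand-in for a trained velocity network.

    Permuting the rows of x permutes the rows of the output in the same way,
    because the second term is identical for every source.
    """
    return -x + np.tanh(x).mean(axis=0, keepdims=True)


def separate(mixture, num_sources=2, num_steps=5, seed=0, velocity=toy_velocity):
    """Euler integration from t=0 to t=1 with a mixture-preserving velocity."""
    rng = np.random.default_rng(seed)
    x = initial_sample(mixture, num_sources, rng)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        # Projecting the velocity keeps sum_k x_k equal to the mixture.
        x = x + dt * project_zero_sum(velocity(x, t))
    return x


mixture = np.sin(np.linspace(0.0, 1.0, 16))
estimates = separate(mixture, num_sources=2, num_steps=5)
# Largest deviation between the summed estimates and the mixture.
print(np.abs(estimates.sum(axis=0) - mixture).max())
```

With the toy velocity the outputs are not meaningful separations; the sketch only illustrates that the sum constraint holds at every step, for any number of Euler steps.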

Audio Samples

The following table presents audio samples used in the study.

Columns: Sample, Mixture, Targets, Conv-TasNet, Mel-band split Locoformer, Diffsep, EDSep, FLOSS (25 steps), FLOSS (1 step), FLOSS (5 steps).

[Table of audio spectrograms for samples 1 to 10. Each sample has one mixture entry and two entries, one per source, under each of the other columns.]