Academic Paper Presentation

Abstract

We consider the problem of single-channel audio source separation, where the goal is to reconstruct K sources from their mixture. We address this ill-posed problem with FLOSS (FLOw matching for Source Separation), a constrained generation method based on flow matching that ensures strict mixture consistency. Flow matching is a general methodology that, given samples from two probability distributions defined on the same space, learns an ordinary differential equation that outputs a sample from one of the distributions when provided with a sample from the other. In our context, we have access to samples from the joint distribution of the K sources, and therefore to the corresponding samples from the lower-dimensional distribution of their mixture. To apply flow matching, we augment these mixture samples with artificial noise components so that the resulting "augmented" distribution matches the dimensionality of the K-source distribution. Additionally, since any permutation of the sources yields the same mixture, we adopt an equivariant formulation of flow matching that relies on a custom-designed neural network architecture. We demonstrate the performance of the method on the separation of overlapping speech.
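To make the mixture-consistency idea concrete, here is a minimal NumPy sketch. It is not the FLOSS implementation, whose parameterization the abstract does not specify. It assumes the mixture is the plain sum of the K sources, replaces the learned network with a hypothetical `toy_velocity` function, and uses a simple Euler integrator. The helper names `project_zero_sum`, `initial_state`, and `euler_sample` are likewise illustrative. The initial state combines the mixture with zero-sum noise, and every velocity is projected to sum to zero across sources, so the estimates add up to the mixture at every step.

```python
import numpy as np


def project_zero_sum(v):
    """Subtract the across-source mean so the K rows sum to zero."""
    return v - v.mean(axis=0, keepdims=True)


def toy_velocity(x, t):
    """Hypothetical stand-in for a learned velocity field (assumption).

    Each source gets the same elementwise map plus a term shared by all
    sources, so permuting the rows of x permutes the output rows the same
    way (permutation equivariance).
    """
    return (1.0 - t) * np.tanh(x) + x.mean(axis=0, keepdims=True)


def initial_state(mixture, num_sources, rng):
    """Augment the mixture with zero-sum noise to reach K x T dimensions."""
    noise = rng.standard_normal((num_sources, mixture.shape[0]))
    return mixture / num_sources + project_zero_sum(noise)


def euler_sample(mixture, num_sources, num_steps, rng, velocity=toy_velocity):
    """Integrate dx/dt = P v(x, t) from t = 0 to t = 1 with Euler steps.

    P is the zero-sum projection, so sum_k x_k stays equal to the mixture.
    """
    x = initial_state(mixture, num_sources, rng)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * project_zero_sum(velocity(x, t))
    return x


rng = np.random.default_rng(0)
mixture = rng.standard_normal(16)  # toy "waveform" with 16 samples
estimates = euler_sample(mixture, num_sources=2, num_steps=5, rng=rng)

# Largest deviation between the summed estimates and the mixture.
residual = np.abs(estimates.sum(axis=0) - mixture).max()
print(f"max mixture-consistency error: {residual:.2e}")
```

Because the constraint is enforced by construction rather than learned, it holds for any number of integration steps in this sketch, including a single step.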

[Audio samples table: samples 1 to 10, two rows per sample. Columns: Sample, Mixture, Targets, Conv-TasNet, Mel-band split Locoformer, Diffsep, EDSep, FLOSS (25 steps), FLOSS (1 step), FLOSS (5 steps).]

Source Separation by Flow Matching
