SURF: Separation via Unsupervised Remixing Flow

Henry Li1,✏️,*, Robin Scheibler2,*, Efthymios Tzinis1, Matt Shannon2, Arnaud Doucet2,†, John Hershey2,†
1Google      2Google DeepMind
*Equal contribution      †Equal senior contribution      ✏️Work partly done as an intern at Google DeepMind

Abstract

The goal of single-channel source separation is to reconstruct K sources given their mixture. In supervised settings where vast amounts of clean source data are available, this challenging, ill-posed problem has been addressed successfully by generative diffusion and flow-based prior models. However, access to such clean source samples is often limited, and even when available, supervised models are vulnerable to domain shifts. To bridge this gap, we present Separation via Unsupervised Remixing Flow (SURF), an unsupervised flow matching approach for source separation that learns directly from observed mixtures. This method relies on a novel combination of state-of-the-art supervised flow matching and regression-based self-supervised techniques. At a high level, starting from a teacher model, we utilize a “remixing” step to bootstrap the learning of a student flow model from the teacher’s estimates. We provide insights into the objectives optimized by this approach and draw a novel connection to the Wake-Sleep algorithm. Empirical evaluations on image and audio benchmarks demonstrate that SURF establishes a new state-of-the-art, significantly outperforming existing unsupervised methods.

Method Overview

Figure 1: Illustration of SURF. Given initial mixtures, a teacher model first produces source estimates. These are shuffled, then used as self-supervised examples to a student flow matching model. The student is trained to predict the estimated sources (ReMixIT) or original mixtures (Self-Remixing).
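The shuffle-and-remix step in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `remix`, the array layout `(batch, num_sources, dim)`, and the choice to form new mixtures by summing sources are all assumptions made for the example, and the teacher estimates are stood in for by random arrays.

```python
import numpy as np


def remix(estimates, rng):
    """Shuffle teacher source estimates across the batch and remix them.

    estimates: array of shape (batch, num_sources, dim), assumed to be the
        teacher's source estimates for a batch of mixtures (hypothetical layout).
    Returns (new_mixtures, shuffled_sources, perms).
    """
    batch, num_sources, _ = estimates.shape
    # Draw one independent permutation of the batch axis per source slot.
    perms = np.stack(
        [rng.permutation(batch) for _ in range(num_sources)], axis=1
    )  # shape (batch, num_sources)
    # shuffled[b, k] = estimates[perms[b, k], k]
    shuffled = estimates[perms, np.arange(num_sources)[None, :]]
    # Assumption: a mixture is the sum of its sources.
    new_mixtures = shuffled.sum(axis=1)
    return new_mixtures, shuffled, perms


rng = np.random.default_rng(0)
teacher_estimates = rng.normal(size=(4, 2, 8))  # stand-in for teacher outputs
new_mixtures, shuffled_sources, perms = remix(teacher_estimates, rng)
```

Under these assumptions, the pair `(new_mixtures, shuffled_sources)` corresponds to the ReMixIT-style setup in the caption, where the student predicts the estimated sources. For the Self-Remixing variant, `perms` would be kept so that the student's outputs can be un-shuffled and compared against the original mixtures.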

Unsupervised Image Separation

We evaluate image separation performance using the MNIST and CIFAR-10 datasets. Training and evaluation mixtures are constructed by averaging pairs of randomly selected images. Below we compare the qualitative separation results against various baselines.
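The mixture construction described above can be sketched as below. This is an illustrative sketch under stated assumptions: the helper name `make_mixtures` is hypothetical, pairs are sampled uniformly with replacement (so the two indices in a pair may coincide), and images are assumed to be float arrays scaled to [0, 1]. Random arrays stand in for MNIST or CIFAR-10 images.

```python
import numpy as np


def make_mixtures(images, num_mixtures, rng):
    """Build mixtures by averaging pairs of randomly selected images.

    images: float array of shape (num_images, ...) with values in [0, 1].
    Returns (mixtures, sources), where sources has shape
    (num_mixtures, 2, ...) and mixtures is the mean over the pair axis.
    """
    # Assumption: indices are drawn independently, with replacement.
    idx_a = rng.integers(0, len(images), size=num_mixtures)
    idx_b = rng.integers(0, len(images), size=num_mixtures)
    sources = np.stack([images[idx_a], images[idx_b]], axis=1)
    mixtures = sources.mean(axis=1)  # average of the two images
    return mixtures, sources


rng = np.random.default_rng(0)
fake_images = rng.random((100, 28, 28))  # stand-in for a real image dataset
mixtures, sources = make_mixtures(fake_images, num_mixtures=16, rng=rng)
```

Because each mixture is an average rather than a sum, it stays in the same [0, 1] range as the source images.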

Figure 2: Qualitative examples for image separation on the MNIST (above) and CIFAR-10 (below) datasets. We compare Supervised Regression, Supervised Flow, BASIS, MixIT, and SURF algorithms. SURF delivers high-quality source separation and perceptually realistic reconstructions.

Libri2Mix: Speech Source Separation

Separation of mixtures of clean speech sources. The model is trained on the Libri2Mix train-360-clean mixtures only.

Audio samples 1 to 10, two rows per sample. Columns: Sample | Mixture | Targets | Conv-TasNet MixIT | ReMixIT Regression | Self-Remixing Regression | ReMixIT Flow | Self-Remixing Flow | Supervised Flow. Each cell holds an audio spectrogram.

LibriSpeech+FUSS

Separation of mixtures of clean speech from LibriSpeech and noise from FUSS. The model is trained on AudioSet.

Audio samples 1 to 10, four rows per sample. Columns: Sample | Mixture | Targets | MixIT Conv-TasNet | ReMixIT Regression | Self-Remixing Regression | ReMixIT Flow | Self-Remixing Flow. Each cell holds an audio spectrogram.

FUSS: Universal Source Separation

Separation of mixtures of general sounds from FUSS. The model is trained on AudioSet.

Audio samples 1 to 10, four rows per sample. Columns: Sample | Mixture | Targets | MixIT Conv-TasNet | ReMixIT Regression | Self-Remixing Regression | ReMixIT Flow | Self-Remixing Flow. Each cell holds an audio spectrogram.