Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration

Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, Michiel Bacchiani

Table of Contents

  1. Abstract
  2. LibriTTS (Miipher-1 vs Miipher-2)
  3. Multilingual LibriSpeech (known languages)
  4. FLEURS (unknown languages)
  5. Acknowledgement
  6. References

preprint: https://arxiv.org/abs/2505.04457

Abstract

Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour-scale data, aimed at cleaning the training data of large-scale generative models such as large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory usage, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multilingual, studio-quality recordings with augmented degradations, while the USM parameters remained fixed. Experimental results demonstrate that Miipher-2 achieves performance superior or comparable to conventional SR models in word error rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using 100 lite accelerators.
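As a sanity check, the throughput figures in the abstract combine as in the short back-of-the-envelope calculation below (the dataset size, real-time factor, and accelerator count are taken directly from the abstract):

    # Back-of-the-envelope check of the throughput claim in the abstract.
    dataset_hours = 1_000_000   # million-hour dataset
    rtf = 0.0078                # real-time factor: compute seconds per second of audio
    num_accelerators = 100

    total_compute_hours = dataset_hours * rtf              # 7,800 accelerator-hours
    wall_clock_hours = total_compute_hours / num_accelerators
    print(f"{wall_clock_hours:.0f} hours = {wall_clock_hours / 24:.2f} days")
    # -> 78 hours = 3.25 days, i.e. "approximately three days"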

Miipher-2 Model Architecture

Figure 1: Miipher-2 Architecture
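Since only the figure is shown above, here is a minimal sketch of the inference flow it depicts, following the description in the abstract. All class names and method signatures below are hypothetical stand-ins, not the released implementation:

    # Minimal sketch of the Miipher-2 inference flow (hypothetical interfaces).
    import numpy as np

    class FrozenUSM:
        """Pre-trained Universal Speech Model; its parameters stay fixed."""
        def extract_features(self, waveform: np.ndarray) -> np.ndarray:
            ...  # -> (frames, dims) array of self-supervised speech features

    class ParallelAdapter:
        """Lightweight adapters trained to map noisy USM features to clean ones."""
        def __call__(self, noisy_features: np.ndarray) -> np.ndarray:
            ...

    class WaveFit:
        """Neural vocoder that synthesizes a waveform from (clean) USM features."""
        def synthesize(self, features: np.ndarray) -> np.ndarray:
            ...

    def restore(waveform: np.ndarray, usm: FrozenUSM,
                adapter: ParallelAdapter, vocoder: WaveFit) -> np.ndarray:
        feats = usm.extract_features(waveform)  # conditioning-free: no text or speaker ID
        clean_feats = adapter(feats)            # only adapters and vocoder are trained for SR
        return vocoder.synthesize(clean_feats)  # waveform synthesis with WaveFit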


LibriTTS (Miipher-1 vs Miipher-2)

This section presents samples from the objective experiments. We evaluated the effectiveness of Miipher-2 using USM and other methods on LibriTTS [1]. Each example includes the raw LibriTTS input, the URGENT 2025 [2] baseline TF-GridNet output, the Miipher-1 [3] output (monolingual, with text/speaker conditioning) from LibriTTS-R [4], and the output of our proposed Miipher-2 (multilingual, without text/speaker conditioning).
Compared to TF-GridNet, both Miipher-1 and our Miipher-2 work well in very noisy and reverberant environments. In addition, Miipher-2 preserves speaker characteristics and time-frequency structure better than Miipher-1.
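This robustness comes from the augmented degradations applied during training (see the abstract). As a rough illustration, a (noisy, clean) training pair might be simulated as sketched below; the exact recipe is not given on this page, so the reverberation, SNR mixing, and band-limiting steps are all assumptions:

    # Hypothetical degradation augmentation for building (noisy, clean) pairs.
    # The actual Miipher-2 training pipeline is not specified here.
    import numpy as np
    from scipy.signal import fftconvolve, resample_poly

    def degrade(clean: np.ndarray, noise: np.ndarray, rir: np.ndarray,
                snr_db: float = 10.0, sr: int = 24000, low_sr: int = 8000) -> np.ndarray:
        # 1) Convolve with a room impulse response to simulate reverberation.
        reverberant = fftconvolve(clean, rir)[: len(clean)]
        # 2) Mix in background noise at the requested SNR.
        noise = np.resize(noise, len(reverberant))
        snr = 10.0 ** (snr_db / 10.0)
        gain = np.sqrt(np.mean(reverberant**2) / (snr * np.mean(noise**2) + 1e-12))
        noisy = reverberant + gain * noise
        # 3) Band-limit by downsampling and resampling back to the original rate.
        return resample_poly(resample_poly(noisy, low_sr, sr), sr, low_sr)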

Example 3005_163391_000039_000000: "How are you on the deef and dumb, Bilgewater?"
NOTE: Miipher-2 can restore the noisy input with varied intonation thanks to its multilingual capability, while Miipher-1 could not preserve the original English intonation in this case.
[Audio and spectrograms: LibriTTS input, TF-GridNet, Miipher-1, Miipher-2 (ours)]
Example 4852_28311_000006_000000: "So-" Mike swallowed.
NOTE: Miipher-2 restores the word "So-" faithfully, while Miipher-1 altered it substantially and TF-GridNet stuttered on it.
[Audio and spectrograms: LibriTTS input, TF-GridNet, Miipher-1, Miipher-2 (ours)]
Example 8280_266249_000082_000000: "Where's that Dutch villain?" Ward was screaming, following up his question with a volley of oaths.
NOTE: Miipher-2 can restore emotional speech, while Miipher-1 completely changed the speaker and style.
[Audio and spectrograms: LibriTTS input, TF-GridNet, Miipher-1, Miipher-2 (ours)]
Example 6128_63244_000002_000000: "I can't talk to those people, I can't!" said Olive Chancellor, with a face which seemed to plead for a remission of responsibility.
NOTE: The Miipher-1 output sometimes sounds robotic (vocoder-like) when the input is very breathy.
[Audio and spectrograms: LibriTTS input, TF-GridNet, Miipher-1, Miipher-2 (ours)]

Multilingual LibriSpeech (known languages)

This section presents samples from the subjective experiments on Multilingual LibriSpeech (MLS) [5], whose languages overlap with our fine-tuning dataset.

Example de_de 1844_931_000007: poste gefasst und wie trabanten mit langen stangen bewaffnet lassen sie deren inschriften embleme über den köpfen ihrer männer und söhne hin und her wehen diese inschriften lauten der marquis von blandford ist gegen das billige brot
NOTE: This output demonstrates that Miipher-2 can recover a band-limited recording with a noisy background.
[Audio and spectrograms: MLS input, Miipher-2 (ours)]
Example es_es 10667_6706_000019: porque si no caía en la boca del tigre y entonces gritó rica papa atención más cerca aún rugió el tigre agachándose para saltar rico té con leche cuidado va a saltar
NOTE: It is difficult to find noisy es_es samples because es_es showed the highest DNSMOS and SQuID quality scores in MLS.
[Audio and spectrograms: MLS input, Miipher-2 (ours)]
Example fr_fr 2223_1745_000094: race stupide et idiote tu te repentiras de te conduire ainsi c'est moi qui te le dis tu t'en repentiras va tu t'en repentiras
[Audio and spectrograms: MLS input, Miipher-2 (ours)]
Example it_it 280_529_000090: e chinando la mano a la sua faccia rispuosi siete voi qui ser brunetto e quelli oh figliuol mio non ti dispiaccia se brunetto latino un poco teco ritorna indietro e lascia andar la traccia
[Audio and spectrograms: MLS input, Miipher-2 (ours)]
Example nl_nl 4429_3991_000055: god zegene uw lief kalm gelaat riep zij zenuwachtig snikkend het doet mij goed u te zien o wat heb ik heden een dag vol angst doorgemaakt
[Audio and spectrograms: MLS input, Miipher-2 (ours)]
Example pl_pl 9098_8338_000085: uwiązał się on w przyzwoitej odległości na nitce która mu służy za linkę bezpieczeństwa na wypadek brutalnego odepchnięcia zrzucony na przykład z siatki na niej zawisa w powietrzu aby nie zlecieć na ziemię
[Audio and spectrograms: MLS input, Miipher-2 (ours)]
Example pt_pt 5739_4739_000087: não sei mas seja ou não impossível não é a conversão que eu peço basta-me que seja menos indiferente e mais compassivo mas que pretendes fazer perguntou adelaide sentindo que as lágrimas também lhe rebentavam dos olhos houve alguns instantes de silêncio mas o que tu não sabes continuou emília é que ele não é para mim um simples
NOTE: This pt_pt example contains multiple speakers.
[Audio and spectrograms: MLS input, Miipher-2 (ours)]

FLEURS (unknown languages)

This section presents samples from the objective experiments on FLEURS [6], which includes low-resource languages that are absent from our fine-tuning dataset. We selected one locale per FLEURS language category, except for the CJK locales, which were fully covered by our fine-tuning data. Compared to MLS, the FLEURS samples are much noisier and more reverberant, yet Miipher-2 can still restore clean speech with only subtle residual noise.

Example ca_es 001633443218971: Això sembla raonable perquè no tenim pas la sensació que la Terra s'estigui movent o sí
[Audio and spectrograms: FLEURS input, Miipher-2 (ours)]
Example ru_ru 001633540429756: Их термические характеристики не такие стабильные как у больших пещер на Земле которые часто поддерживают почти постоянную температуру но они соответствуют тому что является глубокими ямами в грунте — сообщил Глен Кушинг из отдела планетной геологии Геологической службы США и Университета Северной Аризоны расположенного во Флагстаффе Аризона
[Audio and spectrograms: FLEURS input, Miipher-2 (ours)]
Example ur_pk 001633351247178: سندربن دنیا کا سب سے بڑا ساحلی مینگروو بیلٹ ہے جو ساحلی علاقے سے بنگلادیشی اور ہندوستان کے دور افتادہ پسماندہ علاقے میں 80 کیلو میٹر 50 میل تک پھیلا ہوا ہے
[Audio and spectrograms: FLEURS input, Miipher-2 (ours)]
Example sw_ke 001634489347424: Unaweza kutumia boda-boda teksi ya pikipiki kuzunguka Goma Nauli ya kawaida kwa wenyeji ni Franki 500 za Kongo kwa safari fupi
[Audio and spectrograms: FLEURS input, Miipher-2 (ours)]
Example mi_nz 001635418752917: Nā Hong Kong Island te ingoa o te rohe o Hong Kong ā koirā te wāhi ka whakaarohia e ngā tini tūruhi hei aronga matua
[Audio and spectrograms: FLEURS input, Miipher-2 (ours)]

Acknowledgement

We appreciate the valuable feedback and support from Keisuke Kinoshita, Bhuvana Ramabhadran, Richard William Sproat, Yotaro Kubo, and Wataru Nakata.

References

[1] H. Zen et al., "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech," Proc. Interspeech 2019, doi: 10.21437/Interspeech.2019-2441 http://www.openslr.org/60/

[2] K. Saijo et al., "TF-GridNet baseline", Interspeech URGENT 2025 challenge https://huggingface.co/kohei0209/tfgridnet_urgent25

[3] Y. Koizumi et al., "Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations," 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), doi: 10.1109/WASPAA58266.2023.10248089. https://arxiv.org/abs/2303.01664

[4] Y. Koizumi et al., "LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus," Proc. Interspeech 2023, doi: 10.21437/Interspeech.2023-1584 https://www.openslr.org/141/

[5] V. Pratap et al., "MLS: A Large-Scale Multilingual Dataset for Speech Research," Proc. Interspeech 2020, doi: 10.21437/Interspeech.2020-2826 https://www.openslr.org/94/

[6] A. Conneau et al., "FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech," 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798-805, doi: 10.1109/SLT54892.2023.10023141. https://huggingface.co/datasets/google/fleurs