LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus

Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani

Abstract: This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the LibriTTS-R ground-truth samples have significantly better sound quality than those in LibriTTS. In addition, neural end-to-end TTS trained with LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples.

The corpus is freely available for download from http://www.openslr.org/141/

For more information, refer to the dataset paper: Y. Koizumi, et al., "LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus", Interspeech 2023. If you use the LibriTTS-R corpus in your work, please cite the dataset paper where it was introduced.

Postscript (Sept. 4th, 2023): We have published lists of file paths where speech restoration may have failed. Speech restoration is not always perfect, so some phonemes may be lost or changed during the restoration process. We ran automatic speech recognition (ASR) on all LibriTTS-R samples and created these lists from the samples whose word error rate (WER) was above a certain threshold. The experiments in the LibriTTS-R paper were conducted with these possibly failed files included. However, the files in these lists are likely to have transcripts that do not match their waveforms, so we recommend excluding them during model training. The lists can also be downloaded from the OpenSLR dataset page.
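
The exclusion step recommended above could look roughly like the following. This is a minimal sketch, not the authors' tooling: it assumes each published list is a plain-text file with one file path per line, and that utterance file names are unique across the corpus so that matching on the file name alone is safe. The function names and the example paths are hypothetical and are not entries from the real lists.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def load_excluded_names(lines: Iterable[str]) -> Set[str]:
    """Collect file names from an exclusion list (assumed: one path per line)."""
    return {PurePosixPath(line.strip()).name for line in lines if line.strip()}


def filter_training_files(wav_paths: Iterable[str], excluded_names: Set[str]) -> List[str]:
    """Keep only the paths whose file name is not in the exclusion set."""
    return [p for p in wav_paths if PurePosixPath(p).name not in excluded_names]


# Hypothetical data for illustration only.
example_list = ["./train-clean-100/103/1241/103_1241_000000_000001.wav", ""]
example_corpus = [
    "LibriTTS_R/train-clean-100/103/1241/103_1241_000000_000001.wav",
    "LibriTTS_R/train-clean-100/103/1241/103_1241_000002_000000.wav",
]

kept = filter_training_files(example_corpus, load_excluded_names(example_list))
print(kept)
```

Matching on the file name rather than the full path avoids depending on where the corpus was unpacked. If the real lists use a different layout, only load_excluded_names would need to change.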

UPDATED Aug 17 2023: We added TTS examples using OSS toolkits; see this new page.


Ground-truth example comparison:

Example 1: And then Brynhild fell a-weeping till her heart broke.
LibriTTS
LibriTTS-R

Example 2: I want to be near to them--to help them.
LibriTTS
LibriTTS-R

Example 3: Guess I'll have to stick to selling meals, mostly--for a while, at least.
LibriTTS
LibriTTS-R

Example 4: I should so like one to hang in my morning-room at Jocelyn's Rock.
LibriTTS
LibriTTS-R

Example 5: She works too hard, and she---- But there, I don't know that I ought to say any more.
LibriTTS
LibriTTS-R


TTS generated speech comparison:

This section shows output examples of multi-speaker TTS models trained on either LibriTTS or LibriTTS-R. The TTS model consists of a Non-Attentive Tacotron (NAT) acoustic model with unsupervised duration modeling [1] and a WaveRNN neural vocoder [2]. All models were trained with the same model size, hyper-parameters, and number of training steps. The TTS model was trained on two types of training splits: Train-460 and Train-960. Train-460 consists of the "train-clean-100" and "train-clean-360" subsets, and Train-960 adds "train-other-500" to Train-460. For more details, please refer to our paper.
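
For readers reproducing the two training splits, their composition as described above can be written as a small mapping. This is only an illustrative sketch: the subset names come from the text, while the dictionary and the helper function are hypothetical names introduced here.

```python
from typing import Dict, List

# Split composition as described in the text above.
TRAIN_SPLITS: Dict[str, List[str]] = {
    "Train-460": ["train-clean-100", "train-clean-360"],
    "Train-960": ["train-clean-100", "train-clean-360", "train-other-500"],
}


def subsets_for(split: str) -> List[str]:
    """Return a copy of the subset names that make up a training split."""
    return list(TRAIN_SPLITS[split])


# Train-960 differs from Train-460 only by the extra "train-other-500" subset.
extra = sorted(set(subsets_for("Train-960")) - set(subsets_for("Train-460")))
print(extra)
```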

Example 1: The Edison construction department took entire charge of the installation of the plant, and the formal opening was attended on October 1, 1883, by Mr. Edison, who then remained a week in ceaseless study and consultation over the conditions developed by this initial three-wire underground plant.
Speaker ID LibriTTS Train460 LibriTTS Train960 LibriTTS-R Train460 LibriTTS-R Train960
103
1841
1121
5717

Example 2: Her sea going qualities were excellent, and would have amply sufficed for a circumnavigation of the globe.
Speaker ID LibriTTS Train460 LibriTTS Train960 LibriTTS-R Train460 LibriTTS-R Train960
103
1841
1121
5717

Example 3: Therefore her Majesty paid no attention to anyone and no one paid any attention to her.
Speaker ID LibriTTS Train460 LibriTTS Train960 LibriTTS-R Train460 LibriTTS-R Train960
103
1841
1121
5717

Example 4: The Free State Hotel served as barracks.
Speaker ID LibriTTS Train460 LibriTTS Train960 LibriTTS-R Train460 LibriTTS-R Train960
103
1841
1121
5717

Example 5: The military force, partly rabble, partly organized, had meanwhile moved into the town.
Speaker ID LibriTTS Train460 LibriTTS Train960 LibriTTS-R Train460 LibriTTS-R Train960
103
1841
1121
5717


Acknowledgement:

We appreciate valuable feedback and support from Daniel S. Park, Hakan Erdogan, Haruko Ishikawa, Hynek Hermansky, Johan Schalkwyk, John R. Hershey, Keisuke Kinoshita, Llion Jones, Neil Zeghidour, Quan Wang, Richard William Sproat, Ron Weiss, Shiori Yamashita, Yotaro Kubo, and Victor Ungureanu.

References:

[1] J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen and Y. Wu, "Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling," arXiv:2010.04301, 2020.
[2] N. Kalchbrenner, W. Elsen, K. Simonyan, S. Noury, N. Casagrande, W. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman and K. Kavukcuoglu, "Efficient Neural Audio Synthesis," in Proc. ICML, 2018.