LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus

Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani

Abstract: This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the ground-truth samples of LibriTTS-R have significantly better sound quality than those of LibriTTS. In addition, neural end-to-end TTS trained with LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples.

The corpus is freely available for download from http://www.openslr.org/141/

For more information, refer to the dataset paper: Y. Koizumi, et al., "LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus", Interspeech 2023. If you use the LibriTTS-R corpus in your work, please cite this paper.


Ground-truth and official TTS example comparison:

For results shown in our paper, see this page.


Examples using OSS TTS toolkits

This section presents additional results from multi-speaker TTS models built with OSS toolkits and trained on either LibriTTS or LibriTTS-R. Each TTS model consists of a Transformer phoneme-to-mel acoustic model [1] and a HiFi-GAN mel-to-waveform neural vocoder [2]. All models were trained with the same model size, hyper-parameters, and number of training steps. For details, see the all-in-one scripts in these open-source toolkits:

The pretrained models in those toolkits are COMING SOON!
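The two-stage pipeline described above (acoustic model, then vocoder) can be sketched as follows. This is only an illustrative skeleton: the two models are random stand-ins for the actual Transformer acoustic model [1] and HiFi-GAN vocoder [2], and the mel-bin count, hop length, and duration expansion are assumed values, not the toolkits' real configurations.

```python
import numpy as np

# Assumed front-end settings, typical for 24 kHz TTS but not taken
# from the actual training scripts.
N_MELS = 80       # mel spectrogram bins
HOP_LENGTH = 256  # waveform samples generated per mel frame

rng = np.random.default_rng(0)

def acoustic_model(phoneme_ids):
    """Stand-in for the Transformer acoustic model [1]: maps a
    phoneme-ID sequence to a mel spectrogram (frames x mel bins).
    Here a fixed 5-frames-per-phoneme expansion replaces learned
    duration modeling, and the output is random noise."""
    n_frames = len(phoneme_ids) * 5
    return rng.standard_normal((n_frames, N_MELS))

def vocoder(mel):
    """Stand-in for the HiFi-GAN vocoder [2]: upsamples mel frames
    to a waveform, HOP_LENGTH samples per frame."""
    return rng.standard_normal(mel.shape[0] * HOP_LENGTH)

phonemes = [12, 7, 33, 4]   # dummy phoneme IDs
mel = acoustic_model(phonemes)
wav = vocoder(mel)
print(mel.shape, wav.shape)  # (20, 80) (5120,)
```

In the real recipes both stages are trained networks; the point here is only the interface between them: the acoustic model fixes the frame count, and the vocoder's output length is that frame count times the hop length.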

Example 1: The Edison construction department took entire charge of the installation of the plant, and the formal opening was attended on October 1, 1883, by Mr. Edison, who then remained a week in ceaseless study and consultation over the conditions developed by this initial three-wire underground plant.
[Audio samples, LibriTTS vs. LibriTTS-R, for speaker IDs 103, 1841, 1121, 5717]

Example 2: Her sea going qualities were excellent, and would have amply sufficed for a circumnavigation of the globe.
[Audio samples, LibriTTS vs. LibriTTS-R, for speaker IDs 103, 1841, 1121, 5717]

Example 3: Therefore her Majesty paid no attention to anyone and no one paid any attention to her.
[Audio samples, LibriTTS vs. LibriTTS-R, for speaker IDs 103, 1841, 1121, 5717]

Example 4: The Free State Hotel served as barracks.
[Audio samples, LibriTTS vs. LibriTTS-R, for speaker IDs 103, 1841, 1121, 5717]

Example 5: The military force, partly rabble, partly organized, had meanwhile moved into the town.
[Audio samples, LibriTTS vs. LibriTTS-R, for speaker IDs 103, 1841, 1121, 5717]


Acknowledgement:

We appreciate the valuable feedback and support from Tomoki Hayashi and Shinji Watanabe in setting up the OSS toolkits used in these results, and from Daniel S. Park, Hakan Erdogan, Haruko Ishikawa, Hynek Hermansky, Johan Schalkwyk, John R. Hershey, Keisuke Kinoshita, Llion Jones, Neil Zeghidour, Quan Wang, Richard William Sproat, Ron Weiss, Shiori Yamashita, Yotaro Kubo, and Victor Ungureanu for research activities at Google.

References:

[1] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. 2019. Neural speech synthesis with transformer network. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (AAAI'19/IAAI'19/EAAI'19). AAAI Press, Article 823, 6706–6713. https://doi.org/10.1609/aaai.v33i01.33016706
[2] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS'20). Curran Associates Inc., Red Hook, NY, USA, Article 1428, 17022–17033. https://arxiv.org/abs/2010.05646