The MOSEL corpus is a multilingual dataset collection comprising up to 950K hours of open-source speech recordings covering the 24 official languages of the European Union. We collect data by surveying labeled and unlabeled speech corpora released under open-source compliant licenses. In particular, MOSEL includes automatic transcripts of 441K hours of unlabeled speech from VoxPopuli and LibriLight, generated with Whisper large v3. Whisper is released under the open-source Apache 2.0 License, which allows the generated content to be released under any license. Unlike VoxPopuli, LibriLight contains segments longer than Whisper’s maximum input duration of 30 seconds, so we split those segments into chunks of up to 30 seconds.
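The 30-second chunking of long LibriLight segments can be illustrated with a minimal sketch. It assumes a naive fixed-length split; the description above does not say whether the actual pipeline cut at fixed offsets, at silences, or in some other way. The function name `chunk_boundaries` and the constant `MAX_CHUNK_SEC` are hypothetical and used only for illustration.

```python
from typing import List, Tuple

# Whisper's maximum input duration in seconds, as stated in the description.
MAX_CHUNK_SEC = 30.0


def chunk_boundaries(
    duration_sec: float, max_len: float = MAX_CHUNK_SEC
) -> List[Tuple[float, float]]:
    """Return (start, end) times that cover a recording in chunks of at most max_len seconds.

    Assumption: a plain fixed-length split, not necessarily the method
    used to build MOSEL.
    """
    if duration_sec < 0 or max_len <= 0:
        raise ValueError("duration_sec must be >= 0 and max_len must be > 0")

    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_sec:
        # The last chunk may be shorter than max_len.
        end = min(start + max_len, duration_sec)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    # A 75-second segment yields two full chunks and one 15-second remainder.
    print(chunk_boundaries(75.0))
    # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

A segment that already fits within the limit, such as a typical VoxPopuli segment, comes back as a single chunk and is left unchanged.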
Dataset Resources
- Collection Repository: MOSEL GitHub
- Dataset Repository: MOSEL HuggingFace
- Paper: MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
License
- CC-BY-4.0
Citation
Release 1.0:
@inproceedings{mosel,
title = {{MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages}},
author = {Marco Gaido and Sara Papi and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, United States",
publisher = "Association for Computational Linguistics",
}