MOSEL

Oct 31, 2024 | Corpora

The MOSEL corpus is a multilingual collection of up to 950K hours of open-source speech recordings covering the 24 official languages of the European Union. We collected the data by surveying labeled and unlabeled speech corpora released under open-source compliant licenses. In particular, MOSEL includes automatic transcripts for 441K hours of unlabeled speech from VoxPopuli and LibriLight. The data was transcribed with Whisper large v3, which is released under the open-source Apache 2.0 license and therefore allows the generated content to be released under any license. Unlike VoxPopuli, LibriLight contains segments longer than Whisper’s maximum input duration of 30 seconds, so we split those segments into chunks of up to 30 seconds.
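The splitting of long LibriLight segments can be illustrated with a minimal sketch. The page does not say how chunk boundaries were chosen, so this example assumes plain consecutive fixed-length windows; the function name `chunk_boundaries` and the constant `MAX_CHUNK_SEC` are hypothetical and not taken from the MOSEL code.

```python
from typing import List, Tuple

# Whisper's maximum input duration in seconds, as stated in the text above.
MAX_CHUNK_SEC = 30.0


def chunk_boundaries(
    duration_sec: float, max_len: float = MAX_CHUNK_SEC
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs that cover [0, duration_sec].

    Assumption: consecutive, non-overlapping fixed-length windows.
    The actual MOSEL pipeline may pick boundaries differently.
    """
    if duration_sec <= 0:
        return []
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_sec:
        # The last chunk is shortened so it never runs past the segment end.
        end = min(start + max_len, duration_sec)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    # A 75-second segment yields two full chunks and one shorter remainder.
    for start, end in chunk_boundaries(75.0):
        print(f"{start:.1f}s to {end:.1f}s")
```

Each resulting chunk is at most 30 seconds long, so it could be passed to the transcription model on its own. Cutting at fixed offsets can split a word in half, so a real pipeline might prefer boundaries at pauses instead.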

Download MOSEL pseudolabels!

Dataset Resources

Collection Repository: MOSEL GitHub
Dataset Repository: MOSEL HuggingFace
Paper: MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

License

CC-BY-4.0

Citation

Release 1.0:

@inproceedings{mosel,
  title = {{MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages}},
  author = {Marco Gaido and Sara Papi and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
  booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2024",
  address = "Miami, United States",
  publisher = "Association for Computational Linguistics",
}
