MOSEL
The MOSEL corpus is a multilingual dataset collection including up to 950K hours of open-source speech recordings covering the 24 official languages of the European Union. We collect data by surveying labeled and unlabeled...
by Beomseok Lee | Aug 21, 2024 | Corpora | 0
Spoken Language Understanding (SLU) involves interpreting spoken input using Natural Language Processing (NLP). Voice assistants like Alexa and Siri are real-world examples of SLU applications. The core tasks in SLU include...
by Mauro Cettolo | Apr 30, 2024 | Corpora | 0
Ready-to-use version of the multilingual transcriptions of TED talks, prepared for MT research purposes
by Dennis Fucci | Oct 20, 2023 | Corpora | 0
Text corpora for Spanish, French, and Italian containing gendered words referring to the first-person speaker
by Beatrice Savoldi | Oct 19, 2023 | Corpora | 1
The INclusive Evaluation Suite (INES) is a test set designed to assess MT systems' ability to produce gender-inclusive translations for the German→English language pair. By design, each German source sentence in INES includes an...
by Beatrice Savoldi | Oct 9, 2023 | Corpora | 0
GeNTE (Gender-Neutral Translation Evaluation) is a natural, bilingual corpus designed to benchmark the ability of machine translation systems to generate gender-neutral translations. Built from European Parliament speeches,...
by Marco Gaido | Jul 7, 2023 | Corpora | 0
EC Short Clips is a test set dedicated to evaluating automatic subtitling systems.
by Marco Gaido | Jul 7, 2023 | Corpora | 0
EuroParl Interviews is a test set dedicated to evaluating automatic subtitling systems.
by Matteo Negri | Jun 1, 2023 | Corpora | 0
Multilingual benchmark built from European Parliament speeches and annotated with Named Entities and Terminology
by Mauro Cettolo | May 30, 2023 | Corpora | 0
Annotation of dubbing segments based on the Heroes corpus
by Beatrice Savoldi | May 30, 2023 | Corpora | 0
This multilingual dataset was created within the TOSCA-MP project as ground truth data for the evaluation of automatic transcription and spoken language translation technologies.
by Marco Gaido | May 30, 2023 | Corpora | 0
The largest freely available synthetic corpus for Automatic Post-Editing