Corpora

MCIF

by Sara Papi | Apr 17, 2026 | Corpora | 0

MCIF (Multimodal Crosslingual Instruction Following) is a multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short-...

mGeNTE

by Beatrice Savoldi | Jan 13, 2025 | Corpora | 0

mGeNTE (Multilingual Gender-Neutral Translation Evaluation) is a natural, multilingual corpus designed to benchmark gender-neutral language and automatic translation. mGeNTE is built upon European Parliament speech data extracted...

MOSEL

by Sara Papi | Oct 31, 2024 | Corpora | 0

The MOSEL corpus is a multilingual dataset collection including up to 950K hours of open-source speech recordings covering the 24 official languages of the European Union. We collect data by surveying labeled and unlabeled...

Speech-MASSIVE

by Beomseok Lee | Aug 21, 2024 | Corpora | 0

Spoken Language Understanding (SLU) involves interpreting spoken input using Natural Language Processing (NLP). Voice assistants like Alexa and Siri are real-world examples of SLU applications. The core tasks in SLU include...

WAGS

by Mauro Cettolo | Apr 30, 2024 | Corpora | 0

Ready-to-use version of the multilingual transcriptions of TED talks, prepared for MT research purposes

GenderCrawl

by Dennis Fucci | Oct 20, 2023 | Corpora | 0

Text corpora for Spanish, French, and Italian containing gendered words referring to the first-person speaker

INES

by Beatrice Savoldi | Oct 19, 2023 | Corpora | 0

The INclusive Evaluation Suite (INES) is a test set designed to assess MT systems' ability to produce gender-inclusive translations for the German→English language pair. By design, each German source sentence in INES includes an...

GeNTE

by Beatrice Savoldi | Oct 9, 2023 | Corpora | 0

GeNTE (Gender-Neutral Translation Evaluation) is a natural, bilingual corpus designed to benchmark the ability of machine translation systems to generate gender-neutral translations. Built from European Parliament speeches,...

EC Short Clips

by Marco Gaido | Jul 7, 2023 | Corpora | 0

EC Short Clips is a test set dedicated to evaluating automatic subtitling systems.

EuroParl Interviews

by Marco Gaido | Jul 7, 2023 | Corpora | 0

EuroParl Interviews is a test set dedicated to evaluating automatic subtitling systems.

NEuRoparl-ST

by Matteo Negri | Jun 1, 2023 | Corpora | 0

Multilingual benchmark built from European Parliament speeches and annotated with Named Entities and Terminology

Heroes-ON-OFF

by Mauro Cettolo | May 30, 2023 | Corpora | 0

Annotation of dubbing segments based on the Heroes corpus
