Corpora

MCIF

MCIF (Multimodal Crosslingual Instruction Following) is a multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short-...

mGeNTE

mGeNTE (Multilingual Gender-Neutral Translation Evaluation) is a natural, multilingual corpus designed to benchmark gender-neutral language and automatic translation. mGeNTE is built upon European Parliament speech data extracted...

MOSEL

The MOSEL corpus is a multilingual dataset collection including up to 950K hours of open-source speech recordings covering the 24 official languages of the European Union. We collect data by surveying labeled and unlabeled...

Speech-MASSIVE

Spoken Language Understanding (SLU) involves interpreting spoken input using Natural Language Processing (NLP). Voice assistants like Alexa and Siri are real-world examples of SLU applications. The core tasks in SLU include...

INES

The INclusive Evaluation Suite (INES) is a test set designed to assess MT systems' ability to produce gender-inclusive translations for the German→English language pair. By design, each German source sentence in INES includes an...

GeNTE

GeNTE (Gender-Neutral Translation Evaluation) is a natural, bilingual corpus designed to benchmark the ability of machine translation systems to generate gender-neutral translations. Built from European Parliament speeches,...

🏝️ Yesterday at #LREC2026, Palma de Mallorca!
@lina_conti presented "Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation" at the poster session.
📄Paper:
💻Code: https://github.com/lina-conti/voice-bias-coreference
#SpeechTranslation #NLProc

How does the granularity of speech-text pairs impact SpeechLLM performance, and what is the optimal way to interleave tokens? Furthermore, what are the best practices for generating synthetic data to boost training?🧐

🎙️ Our paper on connecting Speech Foundation Models with LLMs is featured in the SpeechLMM Training Journal on Weights & Biases.

Read it 👉 https://bit.ly/4svG7ll

SpeechLMM 2.0 coming this summer. 👀

#Meetween #SpeechLMM #AI #NLP
