mGeNTE

Jan 13, 2025 | Corpora

mGeNTE (Multilingual Gender-Neutral Translation Evaluation) is a natural, multilingual corpus designed to benchmark gender-neutral language and automatic translation.

mGeNTE is built upon European Parliament speech data extracted from the Europarl corpus, and represents a multilingual expansion of the bilingual GeNTE dataset.

For each language pair, mGeNTE comprises 1500 parallel sentences, which are enriched with manual annotations and feature a balanced distribution of translation phenomena that either entail i) a gender-neutral translation (set-N), or ii) a gendered translation in the target language (set-G).

For full details about the dataset and how to access it, see below.

How to obtain mGeNTE

The mGeNTE corpus is released under a Creative Commons Attribution 4.0 International license (CC BY 4.0).

mGeNTE contains text data extracted from the Europarl Corpus (common test set 2), and all rights to the data belong to the European Union and/or the respective copyright holders. Please refer to the Europarl “Terms of Use” for details.

Click here to download mGeNTE
