GeNTE (Gender-Neutral Translation Evaluation) is a natural, bilingual corpus designed to benchmark the ability of machine translation systems to generate gender-neutral translations.

Built from European Parliament speeches, GeNTE is a subset of the English-Italian portion of the Europarl corpus. It comprises 1,500 parallel sentences, which are enriched with manual annotations and feature a balanced distribution of translation phenomena that entail either i) a gender-neutral translation or ii) a gendered translation in the target language.

For full details about the dataset, see the reference paper below.

How to obtain GeNTE

The GeNTE corpus is released under a Creative Commons Attribution 4.0 International license (CC BY 4.0).

GeNTE contains text data extracted from the Europarl Corpus (common test set 2); all rights to the data belong to the European Union and/or the respective copyright holders. Please refer to the Europarl “Terms of Use” for details.

Reference paper

  • If you use GeNTE in your work, please cite the following paper:

Andrea Piergentili*, Beatrice Savoldi*, Dennis Fucci, Matteo Negri, Luisa Bentivogli.
“Hi Guys or Hi Folks? Benchmarking Gender-Neutral Machine Translation with the GeNTE Corpus”.
To appear in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), 6th–10th December 2023, Singapore.

(*) equal contribution