Annotation of MuST-C talks with speakers’ gender information

MuST-Speakers is a resource designed to i) foster research around gender bias in speech translation (ST) and machine translation (MT), and ii) facilitate the development of gender-enhanced translation models.

MuST-Speakers comprises the annotation of speakers’ gender information for the English talks contained in MuST-C V1.2, a TED-based multilingual speech translation corpus. Given the language coverage of MuST-C, the MuST-Speakers annotations allow research on gender translation for 14 different language directions.

MuST-Speakers Data Statements are available HERE.

Annotation methodology

All 2,545 TED talks included in MuST-C V1.2 (training/dev/tst-common) have been manually labeled with speakers’ gender information, based on the personal pronouns found in their publicly available personal TED section.

Accordingly, this manual assignment reflects the linguistic gender forms by which the speakers accept being referred to in English, and to which they most likely want their translations to conform.

We stress that pronoun usage does not directly map to speakers’ self-determined gender identity. As such, by relying on personal pronouns in our annotation we do not make any assumption about speakers’ gender identity.
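As an illustration of how such per-talk labels might be consumed, the following is a minimal sketch that reads a tab-separated annotation excerpt and counts the pronoun labels. The column names (`talk_id`, `pronoun`), the label values, and the sample rows are hypothetical placeholders, not the actual layout of the released file, and should be adapted to the downloaded data.

```python
import csv
import io
from collections import Counter

# Hypothetical annotation excerpt: the real file's column names and
# label values may differ, so adjust them to the released data.
SAMPLE_TSV = "talk_id\tpronoun\n1\tShe\n2\tHe\n3\tHe\n4\tThey\n"


def load_pronouns(handle):
    """Map each talk id to the pronoun label recorded for its speaker."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["talk_id"]: row["pronoun"] for row in reader}


pronouns = load_pronouns(io.StringIO(SAMPLE_TSV))

# Count how many talks carry each pronoun label.
counts = Counter(pronouns.values())
print(counts)
```

Keeping the label as the pronoun string itself, rather than mapping it to a gender category, mirrors the point made above: the annotation records linguistic forms and makes no assumption about gender identity.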

How to obtain MuST-Speakers

TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license.

MuST-Speakers is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0 International) license and is freely downloadable.

Reference paper

If you use MuST-Speakers annotation in your work, please cite the following paper:

Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri and Marco Turchi.
Breeding Gender-Aware Direct Speech Translation Systems
In Proceedings of the 28th International Conference on Computational Linguistics (COLING’2020), December 8-13 2020, Online, pp. 3951-3964.

Bibtex

@inproceedings{gaido-etal-2020-breeding,
title = "Breeding Gender-aware Direct Speech Translation Systems",
author = "Gaido, Marco and Savoldi, Beatrice and Bentivogli, Luisa and Negri, Matteo and Turchi, Marco",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.350",
pages = "3951--3964"}

Related resources for research on gender translation

  • MuST-C Gender-balanced Validation Set: a new MuST-C validation set specifically designed to train ST systems for experiments on gender translation.
  • MuST-SHE: a benchmark derived from MuST-C which allows for a fine-grained analysis of gender bias in Machine Translation and Speech Translation.
  • Code to generate the ST systems presented in the COLING’2020 paper.