GenderCrawl provides monolingual text corpora for Spanish, French, and Italian. The corpora are derived from ParaCrawl: we automatically selected sentences containing speaker-dependent words whose form reveals the speaker’s gender (e.g., Spanish: Soy nueva en esta zona, "I am new to this area", where the feminine form nueva marks a female speaker). For each language there are two gender-specific corpora, one with feminine forms and one with masculine forms.
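
To make the selection idea concrete, here is a minimal Python sketch of routing sentences into feminine and masculine buckets. It is an illustration only, not the pipeline used to build GenderCrawl: the cue patterns, the tiny word lists, and the function name `split_by_speaker_gender` are all hypothetical, and the actual selection criteria are described in the paper.

```python
import re

# Hypothetical cues: a first-person copula followed by a gender-marked adjective.
# These short lists are assumptions for illustration, not the GenderCrawl word lists.
FEMININE_CUES = re.compile(r"\b(?:soy|estoy)\s+(?:nueva|cansada|lista)\b", re.IGNORECASE)
MASCULINE_CUES = re.compile(r"\b(?:soy|estoy)\s+(?:nuevo|cansado|listo)\b", re.IGNORECASE)


def split_by_speaker_gender(sentences):
    """Return (feminine, masculine) lists of sentences with unambiguous cues."""
    feminine, masculine = [], []
    for sentence in sentences:
        has_f = bool(FEMININE_CUES.search(sentence))
        has_m = bool(MASCULINE_CUES.search(sentence))
        # Keep a sentence only if it carries cues for exactly one gender.
        if has_f and not has_m:
            feminine.append(sentence)
        elif has_m and not has_f:
            masculine.append(sentence)
    return feminine, masculine


sample = [
    "Soy nueva en esta zona.",
    "Estoy cansado de esperar.",
    "El tren llega a las ocho.",
]
feminine, masculine = split_by_speaker_gender(sample)
```

In this sketch, sentences without any cue, or with cues for both genders, are dropped rather than guessed at, so each bucket stays gender-specific.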

For corpus statistics and further details, see the paper cited below.

License

These datasets are released under the Creative Commons Attribution 4.0 International license (CC BY 4.0), which permits use, sharing, and adaptation of the data provided that appropriate attribution is given. See the full license text for the exact terms.

Citing

@inproceedings{fucci-etal-2023-integrating,
title = "Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection",
author = "Fucci, Dennis and
Gaido, Marco and
Papi, Sara and
Cettolo, Mauro and
Negri, Matteo and
Bentivogli, Luisa},
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
}