MAGMATic (Multi-domain Academic Gold Standard with Manual Annotation of Terminology) is a novel Italian–English benchmark for evaluating machine translation (MT) with a focus on terminology translation.
The data set comprises 2,055 parallel sentences extracted from institutional academic texts, namely course unit and degree program descriptions. This text type is particularly interesting because it contains terminology from multiple domains, such as education and the various academic disciplines described in the texts. All terms on the English target side of the data set were manually identified and annotated with a domain label, for a total of 7,517 annotated terms.
MAGMATic terms include both single-word and multi-word terms. Each term is classified as Sure or Possible, depending on whether its terminological status and degree of specialisation are certain or not.
The identified terms are annotated with one of the following domain categories:
- Education: terms related to activities carried out inside an educational institution or to people who belong to an educational institution (e.g. course, module, lecturer).
- Education-equipment: terms referring to educational equipment that could also be used elsewhere (e.g. overhead projector, desk, lab).
- Disciplinary: terms related to the discipline taught in the course. Disciplinary terms are additionally annotated with a more specific domain label – e.g. biology, chemistry, medicine – based on the name of the course each sentence was extracted from, for a total of 20 different disciplinary domains.
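The annotation scheme above can be pictured as a small data structure. The following Python code is a minimal sketch, not the release format of MAGMATic: the class names, field names, example terms and example MT output are all hypothetical. The `term_hit_rate` function is a simple case-insensitive whole-word match assumed here for illustration, not the evaluation procedure used in the paper.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Certainty(Enum):
    SURE = "Sure"
    POSSIBLE = "Possible"


class Domain(Enum):
    EDUCATION = "Education"
    EDUCATION_EQUIPMENT = "Education-equipment"
    DISCIPLINARY = "Disciplinary"


@dataclass(frozen=True)
class Term:
    """One annotated target-side term (hypothetical representation)."""
    text: str
    domain: Domain
    certainty: Certainty
    # Only Disciplinary terms carry a more specific label, e.g. "biology".
    discipline: Optional[str] = None

    def __post_init__(self) -> None:
        is_disciplinary = self.domain is Domain.DISCIPLINARY
        if is_disciplinary != (self.discipline is not None):
            raise ValueError(
                "discipline must be set for Disciplinary terms and only for them"
            )


def term_hit_rate(terms: List[Term], mt_output: str) -> float:
    """Fraction of annotated terms found verbatim in an MT output.

    Assumption: a case-insensitive whole-word match counts as a hit.
    """
    if not terms:
        return 0.0
    hits = sum(
        1
        for term in terms
        if re.search(r"\b" + re.escape(term.text) + r"\b", mt_output, re.IGNORECASE)
    )
    return hits / len(terms)


# Invented example data, not taken from the data set.
example_terms = [
    Term("course", Domain.EDUCATION, Certainty.SURE),
    Term("lab", Domain.EDUCATION_EQUIPMENT, Certainty.SURE),
    Term("cell membrane", Domain.DISCIPLINARY, Certainty.POSSIBLE, "biology"),
]
example_mt_output = "The course covers the structure of the cell membrane."
example_rate = term_hit_rate(example_terms, example_mt_output)
```

In this invented example, two of the three terms appear in the MT output and "lab" does not. Filtering `example_terms` by `domain` or `certainty` before calling `term_hit_rate` would give a per-category score under the same matching assumption.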
How to obtain MAGMATic
MAGMATic is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).
If you use MAGMATic in your work, please cite the following paper:
Randy Scansani, Luisa Bentivogli, Silvia Bernardini and Adriano Ferraresi.
“MAGMATic: A Multi-domain Academic Gold Standard with Manual Annotation of Terminology for Machine Translation Evaluation”
In Proceedings of MT Summit XVII, Volume 1: Research Track, Dublin, Ireland, 19-23 August 2019, pages 78-86.