NEuRoparl-ST is a multilingual benchmark built from European Parliament speeches and annotated with Named Entities (NEs) and terminology.
The dataset comprises a subset of the Europarl-ST corpus, namely the English/French, English/Italian, and English/Spanish test sets, which are composed of (audio, transcript, translation) triplets. NEuRoparl-ST enriches the textual portions of these test sets (transcripts, translations) with NE and terminology annotation. Since the three Europarl-ST test sets are mainly derived from the same original speeches, the result is a multilingual benchmark featuring very high content overlap, thus enabling cross-lingual comparisons.
Besides being the first benchmark of this type for speech-to-text translation (ST), NEuRoparl-ST can also be used for the evaluation of NEs and terminology transcription (ASR) and translation (MT).
For full details about the dataset, see the reference paper below.
How to obtain NEuRoparl-ST
The NEuRoparl-ST annotated corpus is released under the same license as Europarl-ST, namely the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
All rights to the data belong to the European Union and the respective copyright holders. Please refer to the Europarl-ST copyright notice for details.
The NEuRoparl-ST-extension consists of a manual annotation layer which enriches the multilingual corpus NEuRoparl-ST v1.0.
This annotation allows for fine-grained analyses of the main factors that can influence the ability of an ST system to transcribe or translate a person name: i) the nationality of the referent, as different languages may involve different phoneme-to-grapheme mappings and may contain different sounds, and ii) the nationality of the speaker, as non-native speakers typically have different accents and hence pronounce the same name differently.
To this end, each person name marked in NEuRoparl-ST is further annotated in this extension with the nationality of its referent and the nationality of the speaker uttering the sentence. For instance, if a German person says “Macron is the French president”, the speaker nationality is German, while the referent nationality is French.
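As a rough illustration of the kind of analysis this annotation supports, the sketch below groups person-name accuracy by speaker or referent nationality. The record layout (`PersonNameAnnotation`), the helper `accuracy_by`, the substring-based matching, and the system outputs are all hypothetical; they do not reflect the actual file format of the release or the evaluation procedure of the reference papers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonNameAnnotation:
    # Hypothetical record layout; the released files may be organised differently.
    name: str
    referent_nationality: str
    speaker_nationality: str


def accuracy_by(annotations, hypotheses, key):
    """Share of names found in the paired system output, grouped by `key`.

    Matching is a crude case-insensitive substring check, used here only
    to keep the example short.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ann, hyp in zip(annotations, hypotheses):
        group = getattr(ann, key)
        totals[group] += 1
        if ann.name.lower() in hyp.lower():
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# The example sentence from the text, uttered by a German and by a French speaker.
annotations = [
    PersonNameAnnotation("Macron", "French", "German"),
    PersonNameAnnotation("Macron", "French", "French"),
]
# Invented system outputs, for illustration only.
hypotheses = [
    "Makron is the French president",
    "Macron is the French president",
]

by_speaker = accuracy_by(annotations, hypotheses, "speaker_nationality")
by_referent = accuracy_by(annotations, hypotheses, "referent_nationality")
print(by_speaker)
print(by_referent)
```

Grouping by `speaker_nationality` isolates the effect of accent, while grouping by `referent_nationality` isolates the effect of the name's language of origin.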
NEuRoparl-ST-extension is released under the same license as NEuRoparl-ST.
- If you use NEuRoparl-ST in your work, please cite the following paper:
Marco Gaido, Susana Rodríguez, Matteo Negri, Luisa Bentivogli, Marco Turchi.
“Is “moby dick” a Whale or a Bird? Named Entities and Terminology in Speech Translation”.
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), 7th–11th November 2021, Online and Punta Cana, Dominican Republic.
- If you use NEuRoparl-ST-extension, please cite the following paper:
Marco Gaido, Matteo Negri, Marco Turchi.
“Who Are We Talking About? Handling Person Names in Speech Translation”.
In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), 26th–27th May 2022, Online and Dublin, Ireland.