eSCAPE is the largest freely available Synthetic Corpus for Automatic Post-Editing. It consists of millions of training triplets (source, MT output, post-edit) in which the MT element has been obtained by translating the source side of publicly available parallel corpora, and the target side is used as an artificial human post-edit. Translations are obtained with both phrase-based and neural models.
For each MT paradigm, eSCAPE contains 7.2 million triplets for English–German and 3.3 million for English–Italian, resulting in a total of 14.4 and 6.6 million instances respectively. In addition, version 2 also contains an English–Russian section with 7.7 million triplets.
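As a rough illustration of the triplet structure and the counts described above, the following is a minimal Python sketch. It is not the pipeline used to build eSCAPE: the `Triplet` type, the `build_triplets` helper and the `toy_translate` stand-in for an MT system are all hypothetical names introduced here, and the distributed file format is not assumed.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    """One synthetic APE training instance (hypothetical layout)."""
    src: str  # source sentence from the parallel corpus
    mt: str   # machine translation of the source sentence
    pe: str   # target side of the parallel corpus, used as an artificial post-edit


def build_triplets(
    parallel: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[Triplet]:
    """Pair each (source, target) sentence with an MT output for the source."""
    return [Triplet(src, translate(src), tgt) for src, tgt in parallel]


def toy_translate(sentence: str) -> str:
    """Placeholder for a real phrase-based or neural MT system (assumption)."""
    return sentence.upper()


# Tiny made-up parallel corpus, for illustration only.
parallel = [("a house", "ein Haus"), ("a cat", "eine Katze")]
triplets = build_triplets(parallel, toy_translate)

# Each language pair is translated under two MT paradigms (phrase-based and
# neural), so the total number of instances is twice the per-paradigm count.
per_paradigm_millions = {"en-de": 7.2, "en-it": 3.3}
totals_millions = {pair: round(2 * n, 1) for pair, n in per_paradigm_millions.items()}
```

Swapping `toy_translate` for a call to an actual MT system would give one set of triplets per paradigm, which is where the doubled totals come from.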
If you use the corpus, please cite the following paper:
@inproceedings{negri-etal-2018-escape,
title = "{ESCAPE}: a Large-scale Synthetic Corpus for Automatic Post-Editing",
author = "Negri, Matteo and Turchi, Marco and Chatterjee, Rajen and Bertoldi, Nicola",
booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
month = may,
year = "2018",
address = "Miyazaki, Japan",
publisher = "European Language Resources Association (ELRA)",
url = "https://www.aclweb.org/anthology/L18-1004",
}
How to obtain eSCAPE:
Contact us: negri@fbk.eu