MuST-Cinema-PE is a corpus containing post-editing data of automatically-generated subtitles. It contains automatically-generated subtitles for 9 TED talk videos, their post-edited versions as well as process data (process logs, keystrokes) from three professional subtitlers in two language pairs (English into German/Italian).

The 9 TED talks come from the test set of the MuST-Cinema corpus. They were split into 12 tasks of equal video duration. For en->it, 1,199 subtitles were collected for subtitler1 and 1,208 subtitles for subtitler2, while for en->de, 1,198 subtitles were collected. These correspond to 545 sentences for Italian and 542 sentences for German.

How to obtain MuST-Cinema-PE

TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.

MuST-Cinema-PE is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.

Reference Paper

If you use MuST-Cinema-PE in your work, please cite the following paper:

Alina Karakanta, Mauro Cettolo, Matteo Negri, Luisa Bentivogli. 2024.
“Evaluating Automatic Subtitling: Correlating Post-editing Effort and Automatic Metrics”
In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, May 20-25, 2024.

Credits

This work was partially funded by the EAMT grant “2021 Sponsorship of Activities – Students’ edition” with the project title “Towards a methodology for evaluating automatic subtitling”.