The MuST-C-PE-SET is an extension of the Common Test Sets of MuST-C, a publicly released
multilingual Speech Translation (ST) corpus based on English TED Talks.

Additional reference translations were collected for 3 of the 14 language directions covered by MuST-C (English-German, English-Italian and English-Spanish). They consist of professional post-edits of the output of two state-of-the-art systems representing the two main current ST approaches: a cascade ASR+MT system and a direct ST system.

For each language direction, the PE-SET includes 550 segments, corresponding to around 10,000 English source words.

For each segment we release the audio file, the manual reference transcription and translation from MuST-C, the outputs of the cascade and direct ST systems, and the post-edits of both outputs.

The segments largely overlap across the three covered language directions to allow cross-lingual comparative evaluation.
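
As an illustration of the per-segment contents listed above, the following minimal Python sketch models one segment as a record and finds the segments shared by all language directions. It is not part of the release: the class name, field names, direction keys and the helper function are hypothetical, and it assumes that segment identifiers are comparable across language directions, which the description above does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PESegment:
    """One PE-SET segment. All field names are hypothetical."""
    segment_id: str         # assumed to be comparable across directions
    audio_path: str         # path to the segment's audio file
    transcript: str         # manual reference transcription from MuST-C
    reference: str          # manual reference translation from MuST-C
    cascade_output: str     # output of the cascade ASR+MT system
    direct_output: str      # output of the direct ST system
    cascade_post_edit: str  # post-edit of the cascade output
    direct_post_edit: str   # post-edit of the direct output


def shared_segment_ids(by_direction):
    """Return the segment IDs present in every language direction.

    by_direction maps a direction label (e.g. "en-de") to a list of
    PESegment objects for that direction.
    """
    id_sets = [{seg.segment_id for seg in segs} for segs in by_direction.values()]
    return set.intersection(*id_sets) if id_sets else set()
```

Restricting a comparison to the IDs returned by shared_segment_ids would keep a cross-lingual evaluation on the same source segments for all three directions.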

How to obtain the MuST-C-PE-SET

TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.

The MuST-C-PE-SET is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.

Credits

The creation of the post-edits was funded by the European Association for Machine Translation (EAMT) through its 2020 sponsorship of activities programme.

Reference paper

If you use these post-edits in your work, please cite the following paper:

L. Bentivogli, M. Cettolo, M. Gaido, A. Karakanta, A. Martinelli, M. Negri, M. Turchi
Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference?
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021), Online, August 2021.

Source code

The ASR and direct ST models are available HERE.

The MT component of the cascade models can be found HERE.