MuST-C is a multilingual speech translation corpus whose size and quality facilitate the training of end-to-end systems for speech translation from English into several languages. For each target language, MuST-C comprises several hundred hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations.

The latest releases of the corpus are:

  • release v3.0:
    1 language direction:
    (includes data for IWSLT-2023 Offline Speech Translation task)
    Documentation for the release is provided in the README file.
    ** more language directions coming soon **
  • release v2.0:
    3 language directions:
    English-to-{German, Chinese, Japanese}
    (includes data for the IWSLT-2022 and IWSLT-2021 Offline Speech Translation tasks)
    Documentation for the release is provided in the README file.
  • release v1.2:
    14 language directions:
    English-to-{Arabic, Chinese, Czech, Dutch, French, German, Italian, Persian, Portuguese, Romanian, Russian, Spanish, Turkish, Vietnamese}
    (includes the 8 language directions of release v1.0)
  • release v1.1 (special release for IWSLT-2019):
    English-to-Czech language direction (TXT-only)
  • release v1.0:
    8 language directions:

MuST-C is continuously growing in size and language coverage. Stay tuned for updates!

How to obtain MuST-C

TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license.

MuST-C is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.

Reference paper

If you use MuST-C in your work, please cite the following paper:

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, Marco Turchi. 2020.
“MuST-C: A multilingual corpus for end-to-end speech translation”.
In Computer Speech & Language Journal.