Corpora



NOTICE: The distribution of the following TED-based resources:

  • MuST-C
  • MuST-C Post-Edited Test Set
  • MuST-Cinema
  • MuST-Cinema Post-Edits
  • MuST-SHE
  • MuST-Speakers
  • MuST-C Gender-balanced Validation Set
  • WIT3

is temporarily suspended pending clarification of the new policy adopted by TED for the use of its proprietary data.


 

  • BinQE – Machine Translation Dataset Annotated with Binary Quality Judgements
  • BitterCorpus – English-Italian corpus with annotated bilingual terms in IT domain
  • CLTE Benchmark – Cross-Lingual Textual Entailment Dataset
  • EC Short Clips – Automatic subtitling benchmark for English-German/Spanish made of European Commission clips
  • EuroParl Interviews – Automatic subtitling benchmark for English-German/Spanish made of European Parliament Interviews
  • eSCAPE – Large-scale Synthetic Corpus for Automatic Post-Editing
  • GenderCrawl – Text corpora for Spanish, French, and Italian containing gendered words referring to the first-person speaker
  • GeNTE – Bilingual English-Italian benchmark for the evaluation of gender-neutral MT
  • Heroes-ON-OFF – Annotation of dubbing segments based on the Heroes corpus
  • INES – Bilingual German-English Test Suite for inclusive machine translation
  • MAGMATic – Italian-English multi-domain academic gold standard with manual annotation of terminology
  • MOSEL – Dataset collection of 950K hours of open-source speech covering the 24 official languages of the European Union
  • Neo-GATE – Bilingual English-Italian benchmark for the evaluation of gender-inclusive MT with neomorphemes
  • NEuRoparl-ST – Multilingual benchmark built from European Parliament speeches and annotated with Named Entities and Terminology
  • RTE3-derived CLTE dataset – Cross-lingual entailment corpus, obtained by translating the RTE-3 dataset
  • SPEECH-MASSIVE – Multilingual dataset spanning 12 languages for SLU and beyond
  • TOSCA-MP Speech Ground Truth – Multilingual dataset of news and talk show transcriptions and translations
  • WAGS – English-Italian Word Alignment Gold Standard

Software

Actively Maintained
  • FBK fairseq – Code and models for Speech Translation based on the fairseq Python package
  • pangolinn – A Python library for neural network developers that provides test suites for finding bugs in newly created models
  • SubSONAR – A Python library that evaluates the quality of SRT files using the multilingual multimodal SONAR model
Past Contributions
  • Moses – A statistical machine translation system
  • IRSTLM – A toolkit featuring algorithms and data structures to store and access very large n-gram language models
  • Online MGIZA++ – An extension of MGIZA++ that aligns sentence pairs in online mode
  • AQET – Adaptive Quality Estimation tool for Machine Translation
  • ModernMT – A neural adaptive machine translation system that adapts to context and learns from corrections