  • BinQE – Machine Translation Dataset Annotated with Binary Quality Judgements
  • BitterCorpus – English-Italian corpus with annotated bilingual terms in IT domain
  • CLTE Benchmark – Cross-Lingual Textual Entailment Dataset
  • EC Short Clips – Automatic subtitling benchmark for English-German/Spanish made of European Commission clips
  • EuroParl Interviews – Automatic subtitling benchmark for English-German/Spanish made of European Parliament Interviews
  • eSCAPE – Large-scale Synthetic Corpus for Automatic Post-Editing
  • GenderCrawl – Text corpora for Spanish, French, and Italian containing gendered words referring to the first-person speaker
  • GeNTE – Bilingual English-Italian benchmark for the evaluation of gender-neutral MT
  • Heroes-ON-OFF – Annotation of dubbing segments based on the Heroes corpus
  • INES – Bilingual German-English Test Suite for inclusive machine translation
  • MAGMATic – Italian-English multi-domain academic gold standard with manual annotation of terminology
  • MuST-C – Multilingual Speech Translation Corpus
  • MuST-C Common Post-Edited Test Set – Additional reference translations for English-German/Italian/Spanish
  • MuST-Cinema – Speech-to-Subtitles corpus
  • MuST-Cinema Post-Edits – Post-edits of the En-De and En-It portions of the MuST-Cinema corpus
  • MuST-SHE – Multilingual benchmark for the evaluation of gender bias in Machine Translation and Speech Translation
  • MuST-Speakers – Annotation of MuST-C talks with speakers’ gender information
  • MuST-C Gender-balanced Validation Set – New MuST-C validation set balanced with respect to speakers’ gender
  • Neo-GATE – Bilingual English-Italian benchmark for the evaluation of gender-inclusive MT with neomorphemes
  • NEuRoparl-ST – Multilingual benchmark built from European Parliament speeches and annotated with Named Entities and Terminology
  • RTE3-derived CLTE dataset – Cross-lingual entailment corpus, obtained by translating the RTE-3 dataset
  • TOSCA-MP Speech Ground Truth – Multilingual dataset of news and talk show transcriptions and translations
  • WAGS – English-Italian Word Alignment Gold Standard
  • WIT3 – Ready-to-use version of the multilingual transcriptions of TED talks for MT research


Actively Maintained
  • FBK fairseq – Code and models for Speech Translation based on the fairseq python package
  • pangolinn – A Python library for neural network developers, with test suites aimed at finding bugs in newly created models
  • SubSONAR – A Python library that evaluates the quality of SRT files using the multilingual multimodal SONAR model
Past Contributions
  • Moses – A statistical machine translation system
  • IRSTLM – A toolkit featuring algorithms and data structures to store and access very large n-gram language models
  • MGIZA++ – An extension of MGIZA++ that allows sentence pairs to be aligned in an online mode
  • AQET – Adaptive Quality Estimation tool for Machine Translation
  • ModernMT – A neural adaptive machine translation system that adapts to context and learns from corrections