Resources | Machine Translation Unit

Corpora

BinQE – Machine Translation Dataset Annotated with Binary Quality Judgements
BitterCorpus – English-Italian corpus with annotated bilingual terms in IT domain
CLTE Benchmark – Cross-Lingual Textual Entailment Dataset
EC Short Clips – Automatic subtitling benchmark for English-German/Spanish made of European Commission clips
EuroParl Interviews – Automatic subtitling benchmark for English-German/Spanish made of European Parliament Interviews
eSCAPE – Large-scale Synthetic Corpus for Automatic Post-Editing
GenderCrawl – Text corpora for Spanish, French, and Italian containing gendered words referring to the first-person speaker
GeNTE – Bilingual English-Italian benchmark for the evaluation of gender-neutral MT
Heroes-ON-OFF – Annotation of dubbing segments based on the Heroes corpus
INES – Bilingual German-English Test Suite for inclusive machine translation
MAGMATic – Italian-English multi-domain academic gold standard with manual annotation of terminology
MCIF – Multimodal Crosslingual Instruction Following benchmark
mGente – Multilingual English-* benchmark for the evaluation of gender-neutral modeling and MT
MOSEL – Dataset collection of 950K hours of open-source speech covering the 24 official languages of the European Union
Neo-GATE – Bilingual English-Italian benchmark for the evaluation of gender-inclusive MT with neomorphemes
NEuRoparl-ST – Multilingual benchmark built from European Parliament speeches and annotated with Named Entities and Terminology
RTE3-derived CLTE dataset – Cross-lingual entailment corpus, obtained by translating the RTE-3 dataset
SPEECH-MASSIVE – Multilingual dataset spanning 12 languages for SLU and beyond
TOSCA-MP Speech Ground Truth – Multilingual dataset of news and talk show transcriptions and translations
WAGS – English-Italian Word Alignment Gold Standard

Software

Actively Maintained

FBK fairseq – Code and models for Speech Translation based on the fairseq Python package
simulstream – A Python library for simultaneous/streaming speech recognition and translation
pangolinn – A Python library for neural network developers, with test suites aimed at finding bugs in newly created models
SubSONAR – a Python library that evaluates the quality of SRT files using the multilingual multimodal SONAR model

Past Contributions

Moses – A statistical machine translation system
IRSTLM – A toolkit featuring algorithms and data structures to store and access very large n-gram language models
online MGIZA++ – An extension of MGIZA++ that aligns sentence pairs in online mode
AQET – Adaptive Quality Estimation tool for Machine Translation
ModernMT – A neural adaptive machine translation system that adapts to context and learns from corrections