Corpora
NOTICE: The distribution of the following TED-based resources:
- MuST-C
- MuST-C Post-Edited Test Set
- MuST-Cinema
- MuST-Cinema Post-Edits
- MuST-SHE
- MuST-Speakers
- MuST-C Gender-balanced Validation Set
- WIT3
is temporarily suspended pending clarification of the new policy adopted by TED for the use of its proprietary data.
- BinQE – Machine Translation Dataset Annotated with Binary Quality Judgements
- BitterCorpus – English-Italian corpus with annotated bilingual terms in IT domain
- CLTE Benchmark – Cross-Lingual Textual Entailment Dataset
- EC Short Clips – Automatic subtitling benchmark for English-German/Spanish made of European Commission clips
- EuroParl Interviews – Automatic subtitling benchmark for English-German/Spanish made of European Parliament Interviews
- eSCAPE – Large-scale Synthetic Corpus for Automatic Post-Editing
- GenderCrawl – Text corpora for Spanish, French, and Italian containing gendered words referring to the first-person speaker
- GeNTE – Bilingual English-Italian benchmark for the evaluation of gender-neutral MT
- Heroes-ON-OFF – Annotation of dubbing segments based on the Heroes corpus
- INES – Bilingual German-English Test Suite for inclusive machine translation
- MAGMATic – Italian-English multi-domain academic gold standard with manual annotation of terminology
- MOSEL – Dataset collection of 950K hours of open-source speech covering the 24 official languages of the European Union
- Neo-GATE – Bilingual English-Italian benchmark for the evaluation of gender-inclusive MT with neomorphemes
- NEuRoparl-ST – Multilingual benchmark built from European Parliament speeches and annotated with Named Entities and Terminology
- RTE3-derived CLTE dataset – Cross-lingual entailment corpus, obtained by translating the RTE-3 dataset
- SPEECH-MASSIVE – Multilingual dataset spanning 12 languages for SLU and beyond
- TOSCA-MP Speech Ground Truth – Multilingual dataset of news and talk show transcriptions and translations
- WAGS – English-Italian Word Alignment Gold Standard
Software
Actively Maintained
- FBK fairseq – Code and models for Speech Translation based on the fairseq Python package
- pangolinn – A Python library for neural network developers that contains test suites aimed at finding bugs in newly created models
- SubSONAR – A Python library that evaluates the quality of SRT files using the multilingual multimodal SONAR model
Past Contributions
- Moses – A statistical machine translation system
- IRSTLM – A toolkit featuring algorithms and data structures to store and access very large n-gram language models
- online MGIZA++ – An extension of MGIZA++ that allows sentence pairs to be aligned in online mode
- AQET – Adaptive Quality Estimation tool for Machine Translation
- ModernMT – A neural adaptive machine translation system that adapts to context and learns from corrections