CLTE Benchmark | Machine Translation Unit

Cross-Lingual Textual Entailment (CLTE) is the task of identifying multi-directional entailment relations between two sentences, T1 and T2, written in different languages.

Each T1/T2 pair in the dataset is annotated (XML format) with one of the following entailment relations:

Bidirectional (T1 -> T2 & T1 <- T2): the two fragments entail each other (semantic equivalence)
Forward (T1 -> T2 & T1 !<- T2): unidirectional entailment from T1 to T2
Backward (T1 !-> T2 & T1 <- T2): unidirectional entailment from T2 to T1
No Entailment (T1 !-> T2 & T1 !<- T2): there is no entailment between T1 and T2
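
The four relations above are simply the combinations of two independent directional judgments (does T1 entail T2, and does T2 entail T1). A minimal Python sketch of that mapping follows; the function name and the lowercase label strings are illustrative assumptions, not identifiers taken from the dataset.

```python
def clte_relation(t1_entails_t2: bool, t2_entails_t1: bool) -> str:
    """Combine two directional entailment judgments into one CLTE relation.

    The label strings returned here are assumed for illustration only.
    """
    if t1_entails_t2 and t2_entails_t1:
        return "bidirectional"   # T1 -> T2 & T1 <- T2
    if t1_entails_t2:
        return "forward"         # T1 -> T2 & T1 !<- T2
    if t2_entails_t1:
        return "backward"        # T1 !-> T2 & T1 <- T2
    return "no_entailment"       # T1 !-> T2 & T1 !<- T2
```

Framing the task this way means a system can be built from two one-directional entailment classifiers, although a single four-way classifier is equally possible.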

Both T1 and T2 are assumed to be TRUE statements; hence in the dataset there are no contradictory pairs.
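
Since the pairs are distributed as annotated XML, reading them could look roughly like the sketch below. The element and attribute names (entailment-corpus, pair, entailment, t1, t2) and the two sample pairs are assumptions made up for illustration; check them against the actual files before relying on this.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: the layout and the sentences are invented for this sketch.
SAMPLE = """<entailment-corpus languages="spa-eng">
  <pair id="1" entailment="bidirectional">
    <t1>El gato duerme en el sofá.</t1>
    <t2>The cat is sleeping on the sofa.</t2>
  </pair>
  <pair id="2" entailment="forward">
    <t1>El gato negro duerme en el sofá.</t1>
    <t2>The cat is sleeping.</t2>
  </pair>
</entailment-corpus>"""


def read_pairs(xml_text):
    """Return one dict per <pair> element (assumed schema)."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": pair.get("id"),
            "relation": pair.get("entailment"),
            "t1": pair.findtext("t1"),
            "t2": pair.findtext("t2"),
        }
        for pair in root.iter("pair")
    ]


pairs = read_pairs(SAMPLE)
# Count how many pairs carry each relation label.
label_counts = Counter(p["relation"] for p in pairs)
```

Counting the labels this way is a quick check of the relation distribution in a file before training or evaluation.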

The CLTE datasets have been created within the EU-funded project Cosyne (Multilingual Content Synchronization with Wikis).

Various CLTE datasets covering different language pairs are available.

CLTE-Semeval Benchmark

The following data was created for the Cross-lingual Textual Entailment (CLTE) for Content Synchronization task, which was offered at SemEval-2012 and SemEval-2013.

Four language combinations are available, each containing 1,500 CLTE pairs:

Spanish/English
German/English
Italian/English
French/English

Additionally, a monolingual English dataset is available as a by-product of the data collection methodology (1,500 pairs).

The CLTE-SemEval dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publications or presentations containing results obtained through the use of CLTE-SemEval should cite the following reference:

Matteo Negri, Luisa Bentivogli, Yashar Mehdad, Danilo Giampiccolo, and Alessandro Marchetti. 2011.
Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora.
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011).

Download CLTE

CLTE-Cosyne Benchmark

The CLTE-Cosyne Benchmark consists of:

1,518 pairs for the language combinations English/Italian and English/German
800 pairs for the language combination Italian/German

In addition, two monolingual datasets are available: English (1,518 pairs) and Italian (800 pairs).

To obtain the CLTE-Cosyne benchmark, please contact Matteo Negri (negri[at]fbk.eu).

Other references:

Matteo Negri, Alessandro Marchetti, Yashar Mehdad, Luisa Bentivogli, and Danilo Giampiccolo. 2012. Semeval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization. In Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval-2012).
Matteo Negri, Alessandro Marchetti, Yashar Mehdad, Luisa Bentivogli, and Danilo Giampiccolo. 2013. Semeval-2013 Task 8: Cross-lingual Textual Entailment for Content Synchronization. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval-2013).
Matteo Negri and Yashar Mehdad. 2010. Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-Day Rush. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk.