Cross-Lingual Textual Entailment (CLTE) is the task of identifying multi-directional entailment relations between two sentences, T1 and T2, written in different languages.
Each T1/T2 pair in the dataset is annotated (XML format) with one of the following entailment relations:
- Bidirectional (T1 -> T2 & T1 <- T2): the two fragments entail each other (semantic equivalence)
- Forward (T1 -> T2 & T1 !<- T2): unidirectional entailment from T1 to T2
- Backward (T1 !-> T2 & T1 <- T2): unidirectional entailment from T2 to T1
- No Entailment (T1 !-> T2 & T1 !<- T2): there is no entailment between T1 and T2
Both T1 and T2 are assumed to be TRUE statements; hence in the dataset there are no contradictory pairs.
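The four relations above are the four combinations of two independent directional judgments (does T1 entail T2, and does T2 entail T1). The following Python sketch illustrates that mapping and shows one way the annotated pairs could be read. The XML layout used here (a "pair" element with an "entailment" attribute and "t1"/"t2" children), the lowercase label spellings, and the sample sentences are assumptions made for illustration only; check them against the actual dataset files before relying on them.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def relation(t1_entails_t2: bool, t2_entails_t1: bool) -> str:
    """Combine two directional judgments into one CLTE relation.

    The label spellings are assumed, not taken from the dataset.
    """
    if t1_entails_t2 and t2_entails_t1:
        return "bidirectional"
    if t1_entails_t2:
        return "forward"
    if t2_entails_t1:
        return "backward"
    return "no_entailment"


# Hypothetical sample: element names, attribute names, and sentences
# are illustrative assumptions, not copied from the dataset.
SAMPLE = """<entailment-corpus languages="spa-eng">
  <pair id="1" entailment="forward">
    <t1>Texto de ejemplo con un detalle adicional.</t1>
    <t2>Example text.</t2>
  </pair>
  <pair id="2" entailment="bidirectional">
    <t1>Texto de ejemplo.</t1>
    <t2>Example text.</t2>
  </pair>
</entailment-corpus>"""


def load_pairs(xml_text: str) -> list[dict]:
    """Parse the assumed XML layout into a list of pair records."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": pair.get("id"),
            "relation": pair.get("entailment"),
            "t1": pair.findtext("t1"),
            "t2": pair.findtext("t2"),
        }
        for pair in root.iter("pair")
    ]


pairs = load_pairs(SAMPLE)
# Count how many pairs carry each relation label.
label_counts = Counter(p["relation"] for p in pairs)
```

Because no pair is contradictory, two booleans are enough to cover every label; a system that predicts each direction separately can combine its outputs with relation() to produce the final annotation.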
The CLTE datasets have been created within the EU-funded project Cosyne (Multilingual Content Synchronization with Wikis).
Various CLTE datasets covering different language pairs are available.
CLTE-Semeval Benchmark
The following data was created for the Cross-lingual Textual Entailment (CLTE) for Content Synchronization task, which was offered at SemEval-2012 and SemEval-2013.
Four language combinations are available, each containing 1,500 CLTE pairs:
- Spanish/English
- German/English
- Italian/English
- French/English
Additionally, a monolingual English dataset is available as a by-product of the data collection methodology (1,500 pairs).
The CLTE-SemEval dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Publications or presentations containing results obtained through the use of CLTE-SemEval should cite the following reference:
Matteo Negri, Luisa Bentivogli, Yashar Mehdad, Danilo Giampiccolo, and Alessandro Marchetti. 2011.
Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora.
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011).
CLTE-Cosyne Benchmark
The CLTE-Cosyne Benchmark consists of:
- 1518 pairs for the language combinations English/Italian and English/German
- 800 pairs for the language combination Italian/German
In addition, two monolingual datasets are available: English (1518 pairs) and Italian (800 pairs).
To obtain the CLTE-Cosyne benchmark, please contact Matteo Negri (negri[at]fbk.eu).
Other references:
- Matteo Negri, Alessandro Marchetti, Yashar Mehdad, Luisa Bentivogli, and Danilo Giampiccolo. 2012. Semeval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization. In Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval-2012).
- Matteo Negri, Alessandro Marchetti, Yashar Mehdad, Luisa Bentivogli, and Danilo Giampiccolo. 2013. Semeval-2013 Task 8: Cross-lingual Textual Entailment for Content Synchronization. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval-2013).
- Matteo Negri and Yashar Mehdad. 2010. Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-Day Rush. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk.