
MCIF (Multimodal Crosslingual Instruction Following) is a multilingual, human-annotated benchmark built from scientific talks, designed to evaluate instruction following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities — speech, vision, and text — and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of the ability of multimodal large language models (MLLMs) to interpret instructions across languages and combine them with multimodal contextual information.
## Dataset Resources
- Collection Repository: MCIF GitHub
- Dataset Repository: MCIF HuggingFace
- Paper: MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
## License
- CC-BY-4.0
## Citation
```bibtex
@inproceedings{papi2026mcif,
  title={{MCIF}: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks},
  author={Sara Papi and Maike Z{\"u}fle and Marco Gaido and Beatrice Savoldi and Danni Liu and Ioannis Douros and Luisa Bentivogli and Jan Niehues},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=PtPYZYfa0h}
}
```