MCIF (Multimodal Crosslingual Instruction Following) is a multilingual, human-annotated benchmark built from scientific talks and designed to evaluate instruction following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of how well multimodal large language models (MLLMs) interpret instructions across languages and combine them with multimodal contextual information.

Dataset Resources

License

  • CC-BY-4.0

Citation

@inproceedings{papi2026mcif,
  title={{MCIF}: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks},
  author={Sara Papi and Maike Z{\"u}fle and Marco Gaido and Beatrice Savoldi and Danni Liu and Ioannis Douros and Luisa Bentivogli and Jan Niehues},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=PtPYZYfa0h}
}