Our researcher Sara Papi will give a talk at the Cohere Labs Multimodal group on January 23rd at 4:00 PM CET. More information is available at: https://cohere.com/events/cohere-labs-sara-papi-2026
Title: Towards Crosslingual Evaluation of General-Purpose Instruction-Following Models Across Text, Speech, and Vision
Abstract: Users expect modern multimodal LLMs (MLLMs) to operate over spoken and video inputs in multiple languages, flexibly addressing diverse task requests such as transcription, translation, summarization, and question answering. Despite the central role of this instruction-following ability, existing evaluations often remain limited to text-only, monolingual settings, failing to reflect the complexity of real-world scenarios. To address this gap, this talk presents a crosslingual evaluation of general-purpose instruction-following models (including SpeechLLMs, VideoLLMs, and MLLMs) through MCIF, a novel benchmark built from scientific talks. Developed entirely through human annotation, MCIF is designed to assess how instruction-following abilities vary across languages, modalities, and tasks. Through extensive benchmarking of 23 state-of-the-art models, MCIF reveals critical performance gaps that monolingual, single-modality benchmarks cannot capture, such as significant limitations in joint speech-video integration, long-form processing, and summarization, establishing a foundation for advancing truly multimodal, crosslingual instruction-following systems.
