Spoken Language Understanding (SLU) involves interpreting spoken input using Natural Language Processing (NLP). Voice assistants like Alexa and Siri are real-world examples of SLU applications. The core tasks in SLU include intent classification, which determines the goal or command behind an utterance, and slot-filling, which extracts specific details such as dates or music genres from the utterance. However, gathering SLU data presents challenges due to the complexities of recording, validation, and associated costs. Additionally, most existing datasets are primarily English-focused, limiting language diversity and cross-linguistic applications.
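
As an illustration of the two core tasks, the sketch below represents one utterance with its intent label and its slots. It assumes the bracketed "[slot : value]" annotation style used for annotated utterances in MASSIVE. The SLUExample class, the parse_annotated helper, and the sample utterance and labels are hypothetical names chosen for this example and are not part of the dataset's tooling.

```python
import re
from dataclasses import dataclass, field

# Matches one annotated span such as "[time : nine am]".
# Assumption: slots are written as "[slot_name : slot value]".
SLOT_PATTERN = re.compile(r"\[\s*([^:\]]+?)\s*:\s*([^\]]+?)\s*\]")


@dataclass
class SLUExample:
    """One utterance with its intent and extracted slots (illustrative only)."""
    text: str                                   # plain utterance, annotations removed
    intent: str                                 # output of intent classification
    slots: list = field(default_factory=list)   # (slot_name, value) pairs from slot-filling


def parse_annotated(annotated: str, intent: str) -> SLUExample:
    """Split an annotated utterance into plain text and (slot, value) pairs."""
    slots = SLOT_PATTERN.findall(annotated)
    # Replace each "[name : value]" span with just the value.
    plain = SLOT_PATTERN.sub(lambda m: m.group(2), annotated)
    return SLUExample(text=plain, intent=intent, slots=slots)


example = parse_annotated(
    "wake me up at [time : nine am] on [date : friday]",
    intent="alarm_set",
)
print(example.intent)  # the goal behind the utterance
print(example.slots)   # the details extracted from it
```

Here the intent answers "what does the speaker want?", while the slots carry the details needed to act on that request. In an SLU setting the same labels apply, but the input is the audio recording rather than the text.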

To address these gaps, we introduce Speech-MASSIVE, a multilingual SLU dataset that builds on the translations and annotations from the MASSIVE corpus. The dataset was curated through crowd-sourcing with strict quality controls and spans 12 languages, covering 8 language families and 4 different scripts, with a total of over 83,000 utterances. We also established baseline models under various system and resource configurations to facilitate broad comparisons. Speech-MASSIVE aims to advance multilingual SLU capabilities and to encourage the development of future models that exceed our baseline results. The dataset is freely accessible on HuggingFace under the CC BY-NC-SA 4.0 license. The Speech-MASSIVE paper was accepted at INTERSPEECH 2024 (Kos, Greece) and nominated for best student paper.

USEFUL LINKS

LIMITATIONS

Because Speech-MASSIVE is built on the MASSIVE dataset, it inherits some grammatical errors present in the original MASSIVE text. Correcting these errors was outside the scope of our project. However, the is_transcripted_reported attribute provided in Speech-MASSIVE makes dataset users aware of these errors.
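
A minimal sketch of how this attribute might be used to separate flagged examples from the rest. It assumes the attribute is a boolean stored under the name given above and that a true value means the transcript was reported as containing an error. Neither assumption is confirmed here, so check the actual column name and values in the dataset before relying on it. The rows below are toy data, not real dataset entries.

```python
def split_by_report_flag(rows, flag="is_transcripted_reported"):
    """Split rows into (unflagged, flagged) lists based on a boolean attribute.

    Assumption: a truthy flag marks an utterance whose transcript was reported
    as erroneous. Rows without the attribute are treated as unflagged.
    """
    unflagged = [row for row in rows if not row.get(flag, False)]
    flagged = [row for row in rows if row.get(flag, False)]
    return unflagged, flagged


# Toy rows for illustration only.
rows = [
    {"utt": "set an alarm for nine am", "is_transcripted_reported": False},
    {"utt": "play me some jazz musics", "is_transcripted_reported": True},
    {"utt": "what is the weather today", "is_transcripted_reported": False},
]

unflagged, flagged = split_by_report_flag(rows)
print(len(unflagged), "unflagged,", len(flagged), "flagged")
```

Keeping the flagged rows in a separate list, rather than discarding them, makes it possible to evaluate on both subsets and see how much the inherited errors affect a model.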

LICENSE

All datasets are released under the CC BY-NC-SA 4.0 license.

CITATION

Please cite the paper when referencing the Speech-MASSIVE corpus as:

@inproceedings{lee2024speechmassivemultilingualspeechdataset,
      title={{Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond}}, 
      author={Beomseok Lee and Ioan Calapodescu and Marco Gaido and Matteo Negri and Laurent Besacier},
      year={2024},
      booktitle={Proc. Interspeech 2024}, 
}