In the DireSTI (Direct Speech Translation for Interpreters) project, approved by CINECA in 2021 (CALL 24B – 10/2021), we focused on applying speech translation (ST) technology to the interpreting sector. The goal was to enhance ST systems by making them able to: a) automatically identify semantically relevant information (i.e. named entities – NEs, terms, and numbers) that is difficult for an interpreter to remember during a session, b) generate accurate translations of the identified information, and c) perform the task in a simultaneous setting. The task was motivated by the urgent need for ST solutions capable of promptly showing interpreters the key elements uttered by a speaker, together with the corresponding translation, so as to let them focus only on rendering content into a different language, without the additional effort (and high cognitive load) of remembering rare terms and NEs, possibly unknown to them, and mapping them across languages.
The project focused on direct ST (i.e. neural architectures whose parameters are trained end-to-end on the ST task), a rapidly evolving technology in which FBK is at the forefront of worldwide research. The choice was motivated by the intrinsic advantages of direct solutions (lower latency, reduced error propagation, easier maintenance, and better exploitation of audio information), which make them a more promising direction than the traditional cascaded approach based on separate automatic speech recognition (ASR) and machine translation (MT) components.
Despite the high computational demands of direct ST model training, the support from CINECA allowed the proponents to carry out cutting-edge research and perform extensive experiments with different architectures and on multiple language pairs. The results, in line with the initial research roadmap, include:
- The improvement of direct ST models toward a better handling of NEs and terminology. A preliminary analysis of the behaviour of state-of-the-art ST systems (both cascaded and direct) in translating NEs and terminology revealed the higher effectiveness of direct models in handling these key elements when they are present in the input speech. Furthermore, on a benchmark specifically created for our analysis (NEuRoparl-ST), we identified person names as one of the most critical NE categories, thus defining a major direction for subsequent improvements. Starting from this finding, published at EMNLP 2021 [1] (a top-tier conference in the natural language processing field), in a follow-up work we analyzed the causes of errors in the translation of person names and explored ad-hoc architectural solutions. Along this direction, we proposed and evaluated multilingual models capable of jointly generating transcripts and translations, prioritising the former over the latter in favour of a more accurate translation of NEs in general and person names in particular. The approach resulted in a 47.8% relative accuracy improvement, on average, in the rendering of person names across three language directions (English->French/Italian/Spanish). This work [2] was published at IWSLT 2022, a major venue for ST research, where it received the Best Paper award.
- A step forward, towards “augmented ST”. The above results subsequently inspired the formulation of a more general problem: enriching the ST output with additional information, rather than only translating relevant concepts. This requires, as an additional step on top of translation, the capability of properly labelling each NE present in the output. Toward this “augmented ST” scenario (reminiscent of augmented reality), we proposed multitask models (named Inline and Parallel; see the attached images) able to jointly perform ST and NER (i.e. to generate NE-annotated translations of the speech) without introducing computational overhead with respect to an ST-only model; the Inline output format is illustrated in the first sketch after this list. Positive results on English->French/Italian/Spanish, which further demonstrate the potential of direct ST models, are reported in a paper under review at ICASSP 2023 [3], a top conference in signal processing.
- The integration of the proposed solutions in the simultaneous scenario. A key requirement posed by the interpreting scenario is speed or, in other words, simultaneous processing. Toward this objective, our research focused on different aspects of the problem. One was the study of model training regimes, targeting a reduction of computational costs. The main outcome was the demonstration that a single model trained offline can effectively serve not only the offline but also the simultaneous task, without the need for any additional training or fine-tuning (the corresponding inference loop is sketched after this list). Through experiments on English->German/Spanish, we demonstrated that, aside from facilitating the adoption of well-established offline techniques and architectures without affecting latency, our offline solution achieves similar or better translation quality compared to the same model trained in simultaneous mode, while also being competitive with the state of the art in simultaneous ST. This study [4] was published in the Findings of EMNLP 2022. From the evaluation standpoint, in a paper [5] published at the AutoSimTrans 2022 Workshop, we proposed a correction of the widely used Average Lagging (AL) metric, which we proved to underestimate the latency of systems that generate predictions longer than the corresponding references. Through experiments on English->Spanish, we demonstrated that the alternative latency metric we proposed (LAAL – Length-Adaptive Average Lagging) can effectively handle both under- and over-generation at the sentence level, leading to a more reliable evaluation of simultaneous ST systems; both metrics are restated after this list. The adoption of LAAL among the official evaluation metrics of the forthcoming IWSLT 2023 shared task on simultaneous speech translation further attests to its reliability. At the architectural level, our participation [6] in the IWSLT 2022 simultaneous task focused on reducing the computational costs of both inference and training of ST systems without sacrificing final translation quality. To avoid costly techniques based on ASR pre-training, our solutions exploit input compression driven by a Connectionist Temporal Classification (CTC) loss (also sketched after this list), as well as simple yet effective data-filtering techniques. Integrated into a Conformer-based system, our methods allowed us to obtain state-of-the-art performance when using only MuST-C as a training corpus, positioning our systems among the top ones in terms of computation-aware AL in the final ranking. Finally, from the application standpoint, in our latest submission to ICASSP 2023 [3] we also tested our multitask ST+NER models in simultaneous conditions. With coherent results on English->French/Italian/Spanish test data, we demonstrated that our best system is able to outperform a cascaded approach while having the same computational cost (hence latency) as a base direct ST model.
- The integration of external information. Another requirement posed by the interpreting scenario concerns the possibility of showing the user not only the translation of the NEs present in the speech but also their transcription in the original language. To meet this requirement, we explored the possibility of identifying and aligning NEs starting from a joint generation of the full transcript and the translation of the input audio. For text generation, the developed solution exploits the so-called “triangle” approach, based on an architecture consisting of one encoder and two decoders, respectively in charge of generating i) the transcript, using the information coming from the encoder, and ii) the translation, leveraging information from both the encoder and the decoder used to produce the transcript. For NE alignment across the source and target language (a task known as entity linking), we exploited the integration of external information provided by a knowledge graph (KG). The KG is used to: i) identify all possible NEs present in the translation, ii) assign to them a unique identifier, and iii) associate them with all possible translations in the language of the transcript. At the end of the process, all the collected NE associations are filtered by means of a cross-lingual entity-matching algorithm designed to retain the most reliable ones (a simplified sketch of this step closes this report). This approach was validated through field tests involving students of translation studies operating on three language directions (English<->Spanish and French->Spanish), with overall positive results to be presented at the 3rd HKBU International Conference on Interpreting. The short demonstration video uploaded together with this report shows the functionalities of a CAI (Computer-Assisted Interpreting) tool co-developed by FBK within the SmarTERP international project, in which the above solutions were successfully integrated.
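The sketches below complement the results listed above. First, to make the “augmented ST” output concrete, here is how the output of an Inline model can be post-processed: the decoder emits NE tags as special tokens interleaved with the translation, and a trivial parser recovers both the plain text and the annotations. The tag set and the example sentence are ours, for illustration only; they are not taken verbatim from [3].

```python
import re

# Illustrative output of the Inline multitask model: NE tags are emitted
# as special tokens interleaved with the translated text.
hyp = "<PER> Ursula von der Leyen </PER> a pris la parole à <LOC> Strasbourg </LOC> ."

def extract_entities(tagged: str):
    """Return the plain translation and the (mention, NE type) pairs."""
    entities = [(m.group(2).strip(), m.group(1))
                for m in re.finditer(r"<(\w+)>(.*?)</\1>", tagged)]
    plain = re.sub(r"\s+", " ", re.sub(r"</?\w+>", "", tagged)).strip()
    return plain, entities

translation, ents = extract_entities(hyp)
# translation -> "Ursula von der Leyen a pris la parole à Strasbourg ."
# ents        -> [("Ursula von der Leyen", "PER"), ("Strasbourg", "LOC")]
```

The Parallel variant instead produces the NE labels separately from the translated text (see the attached images), so no tag-stripping of this kind is needed there.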
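Second, the finding of [4] – that an offline-trained model can serve the simultaneous task as-is – amounts to changing only the inference loop. The toy code below illustrates the idea with a wait-k-style decision policy (one possible choice); OfflineSTModel is a hypothetical stand-in, not a real API, and the model itself is never retrained.

```python
class OfflineSTModel:
    """Hypothetical stand-in for an offline-trained encoder-decoder ST model."""
    def continue_hypothesis(self, audio_prefix, target_prefix, max_new_tokens):
        # A real model would re-encode the available audio and keep decoding
        # from target_prefix; here we just fabricate placeholder tokens.
        return [f"tok{len(target_prefix) + i}" for i in range(max_new_tokens)]

def wait_k_decode(model, audio_chunks, k=3):
    """Incremental inference with an unchanged offline model: wait for k
    audio chunks, then emit one token per newly received chunk."""
    emitted, seen = [], []
    for chunk in audio_chunks:
        seen.append(chunk)
        if len(seen) > k:
            emitted += model.continue_hypothesis(seen, emitted, max_new_tokens=1)
    # Source exhausted: complete the hypothesis as in offline decoding.
    emitted += model.continue_hypothesis(seen, emitted, max_new_tokens=4)
    return emitted

print(wait_k_decode(OfflineSTModel(), [b"chunk"] * 6, k=3))
# -> ['tok0', 'tok1', 'tok2', 'tok3', 'tok4', 'tok5', 'tok6']
```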
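Third, the AL correction proposed in [5] can be stated compactly. With d_i the amount of source audio already received when the i-th target token is emitted, |X| the source duration, |Y| and |Y*| the lengths of the hypothesis and of the reference, and τ the index of the first token emitted after the whole source has been consumed, the two metrics read (notation slightly adapted from the paper):

```latex
AL   = \frac{1}{\tau}\sum_{i=1}^{\tau}\left(d_i - (i-1)\,\frac{|X|}{|Y^{*}|}\right)
\qquad
LAAL = \frac{1}{\tau}\sum_{i=1}^{\tau}\left(d_i - (i-1)\,\frac{|X|}{\max(|Y|,|Y^{*}|)}\right)
```

When a system over-generates (|Y| > |Y*|), the subtracted ideal-delay term in AL grows faster than warranted and the score shrinks (it can even become negative) although the user waits no less; the max(·) in LAAL removes this reward for over-generation while leaving the under-generation case unchanged.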
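Fourth, the CTC-driven input compression used in [6] can be sketched as follows: consecutive encoder states that receive the same greedy CTC prediction are collapsed into a single (here, averaged) vector, shortening the sequence processed by the upper layers. The function below is an illustrative re-implementation under that assumption, not the project code.

```python
import torch

def ctc_compress(states: torch.Tensor, ctc_logits: torch.Tensor) -> torch.Tensor:
    """Collapse consecutive frames sharing the same CTC prediction.

    states:     (T, D) intermediate encoder states
    ctc_logits: (T, V) frame-level CTC logits computed from those states
    returns:    (T', D) averaged states, with T' <= T
    """
    preds = ctc_logits.argmax(dim=-1)  # greedy CTC label for each frame
    merged, start = [], 0
    for t in range(1, states.size(0) + 1):
        # Close the current segment when the label changes (or the input ends).
        if t == states.size(0) or preds[t] != preds[start]:
            merged.append(states[start:t].mean(dim=0))
            start = t
    return torch.stack(merged)

# Example: 8 frames with 16-dim states and a 5-symbol vocabulary (incl. blank)
compressed = ctc_compress(torch.randn(8, 16), torch.randn(8, 5))
```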
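Finally, the cross-lingual entity matching of the last point can be approximated by a simple alias lookup: for each entity found in the translation, the KG provides its candidate surface forms in the transcript language, and the generated transcript is used to retain the form actually uttered. Identifiers and data below are invented for illustration; the deployed algorithm is more elaborate.

```python
def link_entities(transcript, translated_entities, kg_aliases):
    """Associate each NE found in the translation with its source-language form.

    transcript:          transcript produced by the first decoder
    translated_entities: (kg_id, mention) pairs found in the translation
    kg_aliases:          kg_id -> candidate source-language aliases from the KG
    """
    links = {}
    for kg_id, mention in translated_entities:
        for alias in kg_aliases.get(kg_id, []):
            if alias.lower() in transcript.lower():  # keep aliases actually uttered
                links[mention] = alias
                break
    return links

# Invented example: one entity, two candidate aliases, one match.
print(link_entities(
    "the president ursula von der leyen spoke first",
    [("ent:001", "Ursula von der Leyen")],
    {"ent:001": ["Ursula von der Leyen", "President von der Leyen"]},
))
# -> {'Ursula von der Leyen': 'Ursula von der Leyen'}
```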