BitterCorpus

May 30, 2023 | Corpora

The BitterCorpus is a collection of parallel English-Italian (en-ita) documents in the IT domain in which domain-specific terms have been manually marked and aligned. The documents are extracted from the GNOME and KDE data collections and contain 874 domain-specific bilingual terms in total.

GNOME Corpus:

It contains 55 parallel documents extracted from the GNOME manual documentation (IT domain). Three annotators, fluent in English and Italian, were selected to annotate the documents with domain-specific terms. In total, they annotated 313 Italian and 282 English terms, and 237 bilingual domain-specific terms.

KDE Corpus:

It contains one parallel document, consisting of 100 lines of text, extracted from the KDE manual documentation (IT domain). Three annotators, fluent in English and Italian, were selected to annotate the document with domain-specific terms. In total, they annotated 628 Italian and 628 English terms, and 637 bilingual domain-specific terms.

BitterCorpus is freely available for research purposes and is distributed under a Creative Commons Attribution-NonCommercial-ShareAlike license.

The data were used for the SMT evaluation presented in:

Mihael Arcan, Marco Turchi, Sara Tonelli and Paul Buitelaar. “Enhancing Statistical Machine Translation with Bilingual Terminology in a CAT Environment”. In Proceedings of AMTA 2014.

If you use the corpus, please cite the above paper.

Download BitterCorpus
