PhD Defense: Data Selection for Statistical Machine Translation

1 December 2021

Abstract: Machine Translation (MT) is an active research topic in the Computational Linguistics (CL) community. An MT model trained on one domain and applied to another does not reach the expected performance because of the syntactic and semantic differences between the two domains. Domain adaptation is therefore necessary. Data selection, the topic of this thesis, is a corpus-driven domain adaptation method. Given a general-domain corpus and an in-domain corpus, each sentence from the general-domain corpus is scored according to its similarity to the in-domain corpus. The sentences most similar to the in-domain corpus are selected as pseudo in-domain data and later used to train domain-focused MT systems.

Two challenges arise with data selection: which method to use to identify the general-domain sentences most similar to a given in-domain corpus, and how many general-domain sentences to select as pseudo in-domain data. This thesis presents data selection methods that address both challenges. I developed several scoring methods and compared them with a method I developed that automatically determines the ratio of sentences to select.
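
As an illustration of the general score-then-select idea only, the following is a minimal Python sketch. It is not one of the scoring methods developed in the thesis: it assumes a simple bag-of-words cosine similarity as the score, and the selection ratio is passed in by hand rather than determined automatically. The names `tokenize`, `cosine`, and `select_pseudo_in_domain` are hypothetical and chosen for this example.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Naive whitespace tokenizer (an assumption; real systems use proper tokenization).
    return sentence.lower().split()


def cosine(a, b):
    # Cosine similarity between two token-count vectors stored as Counters.
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_pseudo_in_domain(general, in_domain, ratio):
    # Build one token-count profile for the whole in-domain corpus.
    profile = Counter(tok for sent in in_domain for tok in tokenize(sent))
    # Rank general-domain sentences by similarity to that profile, best first.
    ranked = sorted(
        general,
        key=lambda sent: cosine(Counter(tokenize(sent)), profile),
        reverse=True,
    )
    # Keep the top fraction; the ratio is a manual parameter in this sketch.
    keep = max(1, round(len(ranked) * ratio)) if ranked else 0
    return ranked[:keep]
```

Calling `select_pseudo_in_domain(general, in_domain, 0.5)` returns the half of `general` that scores highest against `in_domain`; choosing that ratio well is exactly the second challenge described above.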

Data selection is crucial for MT systems that aim to translate domain-specific texts. The SMT models built with data selection in this thesis trained faster than models trained on the full general-domain data, were smaller, and performed on a par with or better than the models trained on the full training data.

To participate in the PhD defense, please contact Stefania via email. The Zoom link will then be mailed to you shortly before the meeting.