Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics
Subjects: Computer Science >> Natural Language Understanding and Machine Translation
Submitted: 2024-06-05
Abstract:
Objective: This paper quantifies sentence alignment in low-resource parallel corpora so that high-quality sentence pairs can be selected and machine translation performance improved.
Methods: We propose NeuroAlign, an unsupervised, neural network-based sentence embedding method for scoring the alignment of bilingual parallel sentences. Both sentences of each pair are embedded in a shared vector space, and an alignment score is computed for every candidate sentence pair in the corpus. Low-scoring pairs are then filtered out, yielding a high-quality bilingual parallel corpus for low-resource languages.
Results: On the BUCC2018 parallel text mining task, the F1 score improves by 0.5-0.8. On the CCMT2021 low-resource neural machine translation task, the BLEU score improves by 0.1-10.9. The sentence alignment scores approach human evaluation.
Limitations: Because bilingual parallel corpora for low-resource languages are scarce, the study covers only the Tibetan-Chinese, Uyghur-Chinese, and Mongolian-Chinese language pairs.
Conclusions: The method can be applied effectively to score sentence alignment in parallel corpora for low-resource machine translation, improving the quality of the data source and thereby translation performance.
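The score-then-filter step described in the Methods can be illustrated with a minimal sketch. The abstract does not specify NeuroAlign's scoring function or encoder, so the cosine similarity, the `filter_pairs` helper, the `toy_embed` lookup, and the threshold value below are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, used here as an assumed stand-in for the alignment score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as unaligned
    return dot / (norm_u * norm_v)


def filter_pairs(
    pairs: List[Tuple[str, str]],
    embed_src: Callable[[str], Vector],
    embed_tgt: Callable[[str], Vector],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Score each candidate pair and keep those at or above the threshold."""
    kept = []
    for src, tgt in pairs:
        score = cosine(embed_src(src), embed_tgt(tgt))
        if score >= threshold:
            kept.append((src, tgt, score))
    return kept


# Hypothetical embeddings in a shared 2-d space; a real system would use
# a trained multilingual sentence encoder instead of a lookup table.
TOY_VECTORS = {
    "src_a": [1.0, 0.0],
    "tgt_a": [0.9, 0.1],  # close to src_a: plausible translation
    "src_b": [1.0, 0.0],
    "tgt_b": [0.0, 1.0],  # orthogonal to src_b: likely misaligned
}


def toy_embed(sentence: str) -> Vector:
    return TOY_VECTORS[sentence]


candidates = [("src_a", "tgt_a"), ("src_b", "tgt_b")]
kept = filter_pairs(candidates, toy_embed, toy_embed, threshold=0.5)
for src, tgt, score in kept:
    print(f"{src}\t{tgt}\t{score:.3f}")
```

In this toy run only the first pair survives the filter. In practice the threshold would need to be tuned per language pair, for example against a small human-annotated sample.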