ChinaXiv.org (Chinese Academy of Sciences preprint platform for scientific papers)

1. ChinaXiv:202406.00020

Parallel Corpus Sentence Alignment Scoring for Low-Resource Language Machine Translation

Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics; Computer Science >> Natural Language Understanding and Machine Translation
Submitted: 2024-06-05

Li Linxia, Chen Bo, Zhou Maoke, Zhao Xiaobing

Abstract: Objective: This paper aims to score sentence alignment in low-resource parallel corpora so that high-quality parallel data can be selected and machine translation performance improved. Methods: We propose NeuroAlign, a neural network-based unsupervised sentence embedding method for scoring bilingual parallel sentence alignment. Parallel sentence pairs are embedded into the same vector space, and an alignment score is calculated for each candidate sentence pair in the parallel corpus. Low-scoring sentence pairs are then filtered out, yielding high-quality bilingual parallel corpora for low-resource languages. Results: On the BUCC2018 parallel text mining task, the F1 score can be improved by 0.5-0.8. On the CCMT2021 low-resource language neural machine translation task, the BLEU score can be improved by 0.1-10.9. The sentence alignment scores can approach human evaluation. Limitations: Because low-resource bilingual parallel corpora are scarce, language pairs other than Tibetan-Chinese, Uyghur-Chinese, and Mongolian-Chinese have not been studied. Conclusions: The method can be effectively applied to sentence alignment scoring for parallel corpora used in low-resource language machine translation, improving the quality of the data source and thereby enhancing machine translation performance.
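
The score-then-filter step described in the abstract can be illustrated with a minimal sketch. This is not NeuroAlign itself: the abstract does not specify the scoring function, so cosine similarity between sentence vectors is used here as a stand-in, and the names `cosine`, `filter_pairs`, `toy_embed`, the toy vectors, and the threshold of 0.5 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, embed_src, embed_tgt, threshold=0.5):
    """Score each (source, target) pair and keep those at or above the threshold.

    embed_src / embed_tgt are hypothetical callables that map a sentence to a
    vector in a shared space; a real system would plug in a trained encoder.
    """
    kept = []
    for src, tgt in pairs:
        score = cosine(embed_src(src), embed_tgt(tgt))
        if score >= threshold:
            kept.append((src, tgt, score))
    return kept


# Toy stand-in for a sentence encoder: a fixed lookup table (assumption).
TOY_VECTORS = {
    "src-1": [1.0, 0.0],
    "tgt-1": [0.9, 0.1],  # close to src-1, so the pair should be kept
    "src-2": [1.0, 0.0],
    "tgt-2": [0.0, 1.0],  # orthogonal to src-2, so the pair should be dropped
}


def toy_embed(sentence):
    return TOY_VECTORS[sentence]


candidates = [("src-1", "tgt-1"), ("src-2", "tgt-2")]
kept = filter_pairs(candidates, toy_embed, toy_embed, threshold=0.5)
```

In practice the threshold would be tuned on held-out data, for example against human alignment judgments, rather than fixed in advance.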
