Publications
Overcoming Vocabulary Sparsity in MT Using Lattices
Abstract
Source languages with complex word-formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweight technique for dealing with this fragmentation, based on bilingual data, and we also present a combination of linguistic and statistical techniques for dealing with rare and unknown words. Taking these techniques together, we demonstrate +1.3 and +1.6 BLEU increases on top of strong baselines for Arabic-English machine translation.
- Date: January 10, 2026
- Authors: Steve DeNeefe, Ulf Hermjakob, Kevin Knight
- Conference: Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers
- Pages: 89-96