The Prague Bulletin of Mathematical Linguistics

The Journal of Charles University

2 Issues per year

Open Access

Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English

Duygu Ataman / Matteo Negri / Marco Turchi / Marcello Federico
Published Online: 2017-06-06 | DOI: https://doi.org/10.1515/pralin-2017-0031


State-of-the-art neural machine translation (NMT) systems must use a fixed-size word vocabulary to control model complexity, and this requirement is an important bottleneck on performance, especially for morphologically rich languages. Conventional methods that aim to overcome this problem with sub-word or character-level representations rely solely on statistics and disregard the linguistic properties of words, which disrupts word structure and causes semantic and syntactic losses. In this paper, we propose a new vocabulary reduction method for NMT that can reduce the vocabulary of a given input corpus at any rate while also taking the morphological properties of the language into account. Our method is based on unsupervised morphology learning and can, in principle, be used to pre-process any language pair. We also present an alternative word segmentation method based on supervised morphological analysis, which helps us measure the accuracy of our model. We evaluate our method on a Turkish-to-English NMT task, where the input language is morphologically rich and agglutinative. We analyze different representation methods in terms of translation accuracy as well as the semantic and syntactic properties of the generated output. Our method obtains a significant improvement of 2.3 BLEU points over the conventional vocabulary reduction technique, showing that it can provide better accuracy in open vocabulary translation of morphologically rich languages.



About the article

Published in Print: 2017-06-01

Citation Information: The Prague Bulletin of Mathematical Linguistics, Volume 108, Issue 1, Pages 331–342, ISSN (Online) 1804-0462, DOI: https://doi.org/10.1515/pralin-2017-0031.

© 2017 Duygu Ataman et al., published by De Gruyter Open. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License (CC BY-NC-ND 3.0).
