Published by De Gruyter, December 1, 2020

Bridging the “gApp”: improving neural machine translation systems for multiword expression detection

  • Carlos Manuel Hidalgo-Ternero and Gloria Corpas Pastor
From the journal Yearbook of Phraseology


This study introduces gApp, a Python-based text preprocessing system that automatically identifies discontinuous multiword expressions (MWEs) and converts them into their continuous form in order to enhance neural machine translation (NMT). To this end, an experiment with semi-fixed verb–noun idiomatic combinations (VNICs) is carried out to evaluate to what extent gApp can optimise the performance of the two main freely available NMT systems, Google Translate and DeepL, when faced with MWE discontinuity in the Spanish-to-English translation direction. In light of the promising results, the study concludes with suggestions on how to further optimise MWE-aware NMT systems.
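To make the preprocessing idea concrete, the following is a minimal sketch of converting a discontinuous verb–noun idiom into its continuous form before the sentence is sent to an NMT system. It is not gApp's actual implementation, which the abstract does not describe beyond being Python-based. The toy lexicon `TOY_VNICS`, the gap limit `MAX_GAP`, and the function name `make_continuous` are all hypothetical, and the naive surface-pattern matching stands in for whatever linguistic analysis the real tool performs.

```python
import re

# Hypothetical toy lexicon: verb-form pattern -> fixed nominal part of the idiom.
# A real system would rely on lemmatisation and parsing, not surface patterns.
TOY_VNICS = {
    r"tom\w+": "el pelo",  # tomar el pelo ("to pull someone's leg")
    r"met\w+": "la pata",  # meter la pata ("to put one's foot in it")
}

MAX_GAP = 3  # assumed upper bound on the number of intervening tokens


def make_continuous(sentence: str) -> str:
    """Move tokens that split a known verb-noun idiom to just after the idiom."""
    for verb_pattern, noun_part in TOY_VNICS.items():
        pattern = re.compile(
            rf"\b(?P<verb>{verb_pattern})\s+"
            rf"(?P<gap>(?:\w+\s+){{1,{MAX_GAP}}}?)"
            rf"(?P<noun>{re.escape(noun_part)})\b",
            flags=re.IGNORECASE,
        )
        # Reorder "verb gap noun" as "verb noun gap"; other text is untouched.
        sentence = pattern.sub(
            lambda m: f"{m['verb']} {m['noun']} {m['gap'].strip()}", sentence
        )
    return sentence


if __name__ == "__main__":
    print(make_continuous("Le tomaron continuamente el pelo ayer."))
    print(make_continuous("Metió otra vez la pata."))
```

Under these assumptions, a sentence whose idiom is already continuous passes through unchanged, so the step can be applied to every input before translation.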


This research was carried out within the framework of several research projects on language technologies applied to translation and interpretation (refs. FFI2016-75831-P, UMA18-FEDERJA-067, CEI-RIS3 and EUIN2017-87746). It was also funded by the Spanish Ministry of Education (FPU16/02032).



Published Online: 2020-12-01
Published in Print: 2020-11-25

©2020 Walter de Gruyter GmbH, Berlin/Boston
