Abstract
This study introduces gApp, a Python-based text preprocessing system that automatically identifies discontinuous multiword expressions (MWEs) and converts them into their continuous form in order to enhance neural machine translation (NMT). To this end, an experiment with semi-fixed verb–noun idiomatic combinations (VNICs) is carried out to evaluate to what extent gApp can optimise the performance of the two main freely available NMT systems, Google Translate and DeepL, under the challenge of MWE discontinuity in the Spanish-to-English direction. In light of the promising results, the study concludes with suggestions on how to further optimise MWE-aware NMT systems.
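To make the idea of converting a discontinuous VNIC into its continuous form concrete, the following is a minimal sketch in Python. It is not gApp's implementation, which is not described in this abstract. The one-entry lexicon, the list of inflected verb forms, the MAX_GAP window and the function names are all hypothetical and chosen only for illustration. The sketch assumes that the intervening words can simply be moved to the position right after the noun phrase.

```python
import re

# Hypothetical mini-lexicon (an assumption, not gApp's data): each entry lists
# some inflected verb forms and the fixed noun phrase of one Spanish VNIC.
# A real system would more likely rely on lemmatisation and parsing.
VNIC_LEXICON = [
    {
        "verb_forms": {"meter", "meto", "mete", "meten", "metió", "metieron"},
        "noun_phrase": ["la", "pata"],  # "meter la pata" = "to put one's foot in it"
    },
]

MAX_GAP = 4  # assumed upper bound on the number of intervening tokens


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def detokenize(tokens):
    """Join tokens with spaces and re-attach common punctuation marks."""
    return re.sub(r"\s+([.,;:!?])", r"\1", " ".join(tokens))


def make_continuous(sentence):
    """Move tokens found between verb and noun phrase to after the noun phrase."""
    tokens = tokenize(sentence)
    lowered = [t.lower() for t in tokens]
    for entry in VNIC_LEXICON:
        np = entry["noun_phrase"]
        for i, tok in enumerate(lowered):
            if tok not in entry["verb_forms"]:
                continue
            # Look for the noun phrase after a gap of 1..MAX_GAP tokens.
            for gap in range(1, MAX_GAP + 1):
                start = i + 1 + gap
                if lowered[start:start + len(np)] == np:
                    gap_tokens = tokens[i + 1:start]
                    np_tokens = tokens[start:start + len(np)]
                    rest = tokens[start + len(np):]
                    reordered = tokens[:i + 1] + np_tokens + gap_tokens + rest
                    return detokenize(reordered)
    # No discontinuous VNIC found: return the input unchanged.
    return sentence


print(make_continuous("Juan metió ayer otra vez la pata."))
# → Juan metió la pata ayer otra vez.
```

Sentences in which the expression is already continuous, or in which no lexicon entry occurs, are returned unchanged. The rewritten sentence would then be passed to the NMT system in place of the original.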
Acknowledgements
This research has been carried out within the framework of various research projects on language technologies applied to translation and interpretation (ref. FFI2016-75831-P, UMA18-FEDERJA-067, CEI-RIS3 and EUIN2017-87746). It has also been funded by the Spanish Ministry of Education (FPU16/02032).
©2020 Walter de Gruyter GmbH, Berlin/Boston