Published by De Gruyter Mouton, August 17, 2019

Part of speech tagging for Polish

Katarzyna Krasnowska-Kieraś and Łukasz Kobyliński

Abstract

In this paper we discuss the current state of the art in part-of-speech tagging for Polish. We introduce the problem of POS tagging and point out the key issues in tagging inflected languages, which make this task more difficult for Polish than for, e.g., English. We also discuss the most important language resources connected with POS tagging, as well as the task of morphological analysis, which is commonly used as a preliminary step in tagging. We describe the methods that have been applied to POS tagging for Polish to date and discuss the most recent, neural-network-based methods in more detail. Finally, we conclude with a general view of the field in the context of Polish and discuss possible future research directions.


Łukasz Kobyliński Institute of Computer Science Polish Academy of Sciences Jana Kazimierza 5 01-248 Warszawa Poland

Published Online: 2019-08-17
Published in Print: 2019-06-26

© 2019 Faculty of English, Adam Mickiewicz University, Poznań, Poland