
Poznan Studies in Contemporary Linguistics

Editor-in-Chief: Dziubalska-Kolaczyk, Katarzyna


IMPACT FACTOR 2018: 0.347

CiteScore 2018: 0.56

SCImago Journal Rank (SJR) 2018: 0.252
Source Normalized Impact per Paper (SNIP) 2018: 0.520

ISSN (Online): 1897-7499
Volume 55, Issue 2


Part of speech tagging for Polish

Katarzyna Krasnowska-Kieraś / Łukasz Kobyliński
Published Online: 2019-08-17 | DOI: https://doi.org/10.1515/psicl-2019-0009

Abstract

In this paper we discuss the current state of the art in part-of-speech (POS) tagging for Polish. We introduce the problem of POS tagging and point out the key issues in tagging inflected languages, which make this task more difficult for Polish than for, e.g., English. We also discuss the most important language resources connected with POS tagging, as well as the task of morphological analysis, which is commonly used as a preliminary step in tagging. We describe the methods that have been applied to POS tagging for Polish to date and discuss the most recent, neural-network-based methods in more detail. Finally, we conclude with a general view of this field in the context of Polish and discuss possible directions for future research.

Keywords: Part-of-speech tagging; morphological analysis; taggers; deep learning

About the article

Łukasz Kobyliński, Institute of Computer Science, Polish Academy of Sciences, Jana Kazimierza 5, 01-248 Warszawa, Poland


Published Online: 2019-08-17

Published in Print: 2019-06-26


Citation Information: Poznan Studies in Contemporary Linguistics, Volume 55, Issue 2, Pages 211–237, ISSN (Online) 1897-7499, ISSN (Print) 0137-2459, DOI: https://doi.org/10.1515/psicl-2019-0009.

© 2019 Faculty of English, Adam Mickiewicz University, Poznań, Poland.
