Abstract
In this article we discuss the current state of the art in named entity recognition for Polish. We present publicly available resources and open-source tools for named entity recognition. The overview covers several kinds of resources: guidelines, annotated corpora (NKJP, KPWr, CEN, PST) and lexicons (NELexiconS, PNET, Gazetteer). We present the major NER tools for Polish (SProUT, NERF, Liner2, Parallel LSTM-CRFs and PolDeepNer) and discuss their performance on the reference datasets. The article covers the identification of named entity mentions in running text, local and global entity categorization, fine- and coarse-grained categorization, and the lemmatization of proper names.
© 2019 Faculty of English, Adam Mickiewicz University, Poznań, Poland