Abstract
This pilot study examines the usability of the Context Specificity measure for stylometric purposes. The Word2vec word-embedding approach, which measures lexical context similarity between lemmas, is applied to texts that belong to different styles. Three types of Czech texts are investigated: fiction, non-fiction, and journalism. Forty lemmas were observed (10 lemmas each for verbs, nouns, adjectives, and adverbs). The aim of the study is to introduce the concept of Context Specificity and to test whether this measure is sensitive to different styles. The results show that the proposed method, Closest Context Specificity (CCS), is independent of corpus size and shows promise for analyzing different styles.
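The abstract does not spell out how a closeness-based score is computed, so the following is only a minimal sketch under stated assumptions: it treats the score of a lemma as the mean cosine similarity between its embedding and its k most similar lemmas. The function name `closest_context_score`, the parameter `k`, and the two-dimensional toy vectors are hypothetical illustrations, not the authors' definition of CCS or their data. In practice the vectors would come from a Word2vec model trained separately on each style's subcorpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_context_score(target, vectors, k=2):
    """Mean cosine similarity of `target` to its k nearest lemmas.

    ASSUMPTION: this averaging scheme is an illustrative stand-in,
    not the definition used in the paper.
    """
    sims = sorted(
        (cosine(vectors[target], vec)
         for lemma, vec in vectors.items() if lemma != target),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top)

# Hypothetical toy embeddings (real Word2vec vectors have hundreds of dimensions).
toy_vectors = {
    "pes": [1.0, 0.0],
    "kočka": [0.8, 0.6],
    "auto": [0.0, 1.0],
    "běžet": [0.6, 0.8],
}

score = closest_context_score("pes", toy_vectors, k=2)
print(round(score, 3))
```

Comparing such a score for the same lemma across models trained on fiction, non-fiction, and journalism would be one way to check whether the measure separates styles, which is the kind of sensitivity test the abstract describes.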
Funding statement: This work was supported by the Social Science Fund of Shaanxi State (Grant Number: 2015K001), Univerzita Karlova v Praze (10.13039/100007397), Progress 4, and Ostravská Univerzita v Ostravě (10.13039/501100006704, Grant Number: SGS02/UVAFM/2017).
About the authors
Miroslav Kubát (born 1984, Ph.D. Palacký University 2015) is an assistant professor in Czech Language at the University of Ostrava (Czech Republic). His research interests focus on quantitative linguistics and stylometry. He specializes in quantitative indices of text analysis, such as vocabulary richness, activity and context specificity.
Jan Hůla (born 1985, MgA. Tomas Bata University, 2011) is a PhD student at the Institute for Research and Applications of Fuzzy Modeling, Faculty of Science, University of Ostrava. He specializes in neural networks and natural language processing; he is also interested in applied category theory and its applications in linguistics.
Xinying Chen (born 1984, Ph.D. Communication University of China, 2012) is a post-doctoral research fellow at the University of Ostrava in the Czech Republic and an associate professor in Linguistics at the Xi’an Jiaotong University in China. Her research interests focus on the empirical syntactical analysis of linguistic units in spoken and written communication. She is also interested in applying interdisciplinary methods, such as social network analysis or statistical clustering algorithms, to quantitative analysis of synchronic and diachronic texts.
Radek Čech (born 1974, Ph.D. Palacký University 2005) is an associate professor in Czech Language at the University of Ostrava (Czech Republic). His research interests focus on quantitative text analysis and quantitative syntax (valency, syntactic complex networks). He is also interested in the application of quantitative methods to historical linguistics (word ordering of enclitics, stylometry).
Jiří Milička (born 1986, Ph.D. Charles University, 2016) is a research associate at the Department of Comparative Linguistics and the Institute of the Czech National Corpus, Faculty of Arts, Charles University. He specializes in quantitative and corpus linguistics and the Arabic language; he also develops applications for linguistic research.
© 2018 Walter de Gruyter GmbH, Berlin/Boston