
Corpus Linguistics and Linguistic Theory

Founded by Gries, Stefan Th. / Stefanowitsch, Anatol

Ed. by Wulff, Stefanie

IMPACT FACTOR 2018: 0.960
5-year IMPACT FACTOR: 1.052

CiteScore 2018: 0.84

SCImago Journal Rank (SJR) 2018: 0.388
Source Normalized Impact per Paper (SNIP) 2018: 1.245


The lexical context in a style analysis: A word embeddings approach

Miroslav Kubát (ORCID iD: http://orcid.org/0000-0002-3398-3125) / Jan Hůla / Xinying Chen / Radek Čech / Jiří Milička (ORCID iD: http://orcid.org/0000-0001-8605-1199)
Published Online: 2018-11-16 | DOI: https://doi.org/10.1515/cllt-2018-0003


This pilot study examines the usability of the Context Specificity measure for stylometric purposes. The Word2vec word embedding approach, which measures the similarity of lexical contexts between lemmas, is applied to texts belonging to different styles. Three types of Czech texts are investigated: fiction, non-fiction, and journalism. Forty lemmas were observed (10 each for verbs, nouns, adjectives, and adverbs). The aim of the study is to introduce the concept of Context Specificity and to test whether this measure is sensitive to different styles. The results show that the proposed method, Closest Context Specificity (CCS), is independent of corpus size and shows promise for analyzing different styles.

Keywords: neural networks; word embedding; word2vec; stylometry; style
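
The abstract does not give the formula behind Closest Context Specificity, so the following is only a minimal sketch of the general idea of measuring lexical context similarity between lemmas with embedding vectors. It assumes, purely for illustration, that a lemma's score is the mean cosine similarity to its k nearest neighbours in the embedding space. The function name `closest_context_score`, the parameter `k`, and the toy two-dimensional vectors are hypothetical and are not taken from the article; real Word2vec vectors would be trained on a corpus and have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_context_score(target, embeddings, k=2):
    """Mean cosine similarity between `target` and its k most similar lemmas.

    Assumption: this is an illustrative stand-in, not the article's
    definition of CCS.
    """
    sims = sorted(
        (
            cosine(embeddings[target], vec)
            for lemma, vec in embeddings.items()
            if lemma != target
        ),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top)


# Toy vectors invented for illustration; not data from the study.
toy_embeddings = {
    "pes": [1.0, 0.0],
    "kočka": [0.8, 0.6],
    "běžet": [0.0, 1.0],
    "rychle": [0.6, 0.8],
}

score = closest_context_score("pes", toy_embeddings, k=2)
print(round(score, 3))
```

Under this reading, a higher score means the lemma's nearest neighbours share very similar lexical contexts. Comparing the scores of the same lemma in embeddings trained separately on fiction, non-fiction, and journalism would be one way to probe sensitivity to style.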



About the article

Miroslav Kubát

Miroslav Kubát (born 1984, Ph.D. Palacký University 2015) is an assistant professor in Czech Language at the University of Ostrava (Czech Republic). His research interests focus on quantitative linguistics and stylometry. He specializes in quantitative indices of text analysis, such as vocabulary richness, activity and context specificity.

Jan Hůla

Jan Hůla (born 1985, MgA. Tomas Bata University, 2011) is a PhD student at the Institute for Research and Applications of Fuzzy Modeling, Faculty of Science, University of Ostrava. He specializes in neural networks and natural language processing; he is also interested in applied category theory and its applications in linguistics.

Xinying Chen

Xinying Chen (born 1984, Ph.D. Communication University of China, 2012) is a post-doctoral research fellow at the University of Ostrava in the Czech Republic and an associate professor in Linguistics at the Xi’an Jiaotong University in China. Her research interests focus on the empirical syntactical analysis of linguistic units in spoken and written communication. She is also interested in applying interdisciplinary methods, such as social network analysis or statistical clustering algorithms, to quantitative analysis of synchronic and diachronic texts.

Radek Čech

Radek Čech (born 1974, Ph.D. Palacký University 2005) is an associate professor in Czech Language at the University of Ostrava (Czech Republic). His research interests focus on quantitative text analysis and quantitative syntax (valency, syntactic complex networks). He is also interested in the application of quantitative methods to historical linguistics (word ordering of enclitics, stylometry).

Jiří Milička

Jiří Milička (born 1986, Ph.D. Charles University, 2016) is a research associate at the Department of Comparative Linguistics and the Institute of the Czech National Corpus, Faculty of Arts, Charles University. He specializes in quantitative and corpus linguistics and the Arabic language; he also develops applications for linguistic research.


This work was supported by the Social Science Fund of Shaanxi State (Grant Number: 2015K001), Univerzita Karlova v Praze (10.13039/100007397), Progress 4, and Ostravská Univerzita v Ostravě (10.13039/501100006704, Grant Number: SGS02/UVAFM/2017).

Citation Information: Corpus Linguistics and Linguistic Theory, ISSN (Online) 1613-7035, ISSN (Print) 1613-7027, DOI: https://doi.org/10.1515/cllt-2018-0003.


© 2018 Walter de Gruyter GmbH, Berlin/Boston.
