Corpus Linguistics and Linguistic Theory

Founded by Gries, Stefan Th. / Stefanowitsch, Anatol

Ed. by Wulff, Stefanie

IMPACT FACTOR 2018: 0.960
5-year IMPACT FACTOR: 1.052

CiteScore 2018: 0.84

SCImago Journal Rank (SJR) 2018: 0.388
Source Normalized Impact per Paper (SNIP) 2018: 1.245

Log-likelihood and odds ratio: Keyness statistics for different purposes of keyword analysis

Punjaporn Pojanapunya (ORCID iD: http://orcid.org/0000-0003-0694-200X) / Richard Watson Todd
  • Department of Language Studies, School of Liberal Arts, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
Published Online: 2018-04-06 | DOI: https://doi.org/10.1515/cllt-2015-0030


Keyword analysis is used in a range of sub-disciplines of applied linguistics from genre analyses to critically-oriented studies for different purposes ranging from producing a general characterization of a genre to identifying text-specific ideological issues. This study compares the use of log-likelihood (LL), a probability statistic, and odds ratio (OR), an effect size statistic, for keyword identification and argues that the two methods produce different keywords applicable to research focusing on different purposes. Through two case studies, keyword analyses of advance fee scams against the British National Corpus and research articles in applied linguistics against research articles from other academic disciplines, we show that both the LL and OR keywords concern the aboutness of the corpus, but differ in their specificity and pervasiveness through the corpus. LL highlights words which are relatively common in general use serving genre purposes, whereas OR highlights more specialized words serving critically-oriented purposes. Methodological and practical contributions to keyword analysis are discussed.
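To make the contrast between the two statistics concrete, the following is a minimal sketch of how LL and OR might be computed for a single word from its frequencies in a target corpus and a reference corpus. It is an illustration under stated assumptions, not the procedure used in the study: the function names and the example frequencies are hypothetical, the LL shown is the simplified two-term form often used in corpus tools (the full 2x2 statistic also includes the non-occurrence cells), and the 0.5 added to each cell when a cell is zero is one common continuity correction chosen here for convenience.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Simplified two-term log-likelihood for one word across two corpora.

    Expected frequencies assume the word occurs at the same rate in both
    corpora (the pooled rate). Larger values mean stronger evidence that
    the rates differ; the value does not say how large the difference is.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    ll = 0.0
    for observed, expected in (
        (freq_target, expected_target),
        (freq_ref, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2.0 * ll


def odds_ratio(freq_target, size_target, freq_ref, size_ref, correction=0.5):
    """Odds ratio for one word: odds in the target over odds in the reference.

    Assumption: if any cell of the 2x2 table is zero, `correction` is added
    to every cell so that the ratio stays finite.
    """
    a = freq_target                # word in target corpus
    b = size_target - freq_target  # all other tokens in target corpus
    c = freq_ref                   # word in reference corpus
    d = size_ref - freq_ref        # all other tokens in reference corpus
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)


if __name__ == "__main__":
    # Hypothetical counts: a common word and a rare, specialised word.
    for label, counts in (
        ("common word", (500, 100_000, 30_000, 10_000_000)),
        ("specialised word", (20, 100_000, 10, 10_000_000)),
    ):
        print(label, log_likelihood(*counts), odds_ratio(*counts))
```

Because LL grows with the raw counts while OR depends only on the ratio of odds, the two measures can rank the same candidate words quite differently, which is the behaviour the abstract describes.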

Keywords: keyness; keyword; log-likelihood; odds ratio; keyword identification



About the article

Published Online: 2018-04-06

Published in Print: 2018-04-25

Citation Information: Corpus Linguistics and Linguistic Theory, Volume 14, Issue 1, Pages 133–167, ISSN (Online) 1613-7035, ISSN (Print) 1613-7027, DOI: https://doi.org/10.1515/cllt-2015-0030.

© 2018 Walter de Gruyter GmbH, Berlin/Boston.

