Published by De Gruyter Mouton April 6, 2018

Log-likelihood and odds ratio: Keyness statistics for different purposes of keyword analysis

  • Punjaporn Pojanapunya and Richard Watson Todd

Abstract

Keyword analysis is used in a range of sub-disciplines of applied linguistics from genre analyses to critically-oriented studies for different purposes ranging from producing a general characterization of a genre to identifying text-specific ideological issues. This study compares the use of log-likelihood (LL), a probability statistic, and odds ratio (OR), an effect size statistic, for keyword identification and argues that the two methods produce different keywords applicable to research focusing on different purposes. Through two case studies, keyword analyses of advance fee scams against the British National Corpus and research articles in applied linguistics against research articles from other academic disciplines, we show that both the LL and OR keywords concern the aboutness of the corpus, but differ in their specificity and pervasiveness through the corpus. LL highlights words which are relatively common in general use serving genre purposes, whereas OR highlights more specialized words serving critically-oriented purposes. Methodological and practical contributions to keyword analysis are discussed.
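To make the contrast between the two statistics concrete, the sketch below computes both for a single word from its frequency in a target corpus and in a reference corpus. It is a minimal illustration, not the authors' procedure or the output of any particular tool: the function names, the example counts, and the 0.5 zero-cell correction are assumptions introduced here, and the log-likelihood uses the common two-cell form based on expected frequencies.

```python
import math


def log_likelihood(a: int, n1: int, b: int, n2: int) -> float:
    """Two-cell log-likelihood for a word occurring a times in a target
    corpus of n1 tokens and b times in a reference corpus of n2 tokens."""
    total = a + b
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = n1 * total / (n1 + n2)
    e2 = n2 * total / (n1 + n2)

    def term(observed: float, expected: float) -> float:
        # A zero observed count contributes nothing to the sum.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    return 2.0 * (term(a, e1) + term(b, e2))


def odds_ratio(a: int, n1: int, b: int, n2: int) -> float:
    """Odds of the word in the target corpus divided by its odds in the
    reference corpus."""
    c, d = n1 - a, n2 - b  # tokens that are not the word
    if 0 in (a, b, c, d):
        # Assumption: add 0.5 to every cell so the ratio stays defined
        # when a cell is zero. Other corrections are possible.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a / c) / (b / d)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(log_likelihood(100, 10_000, 100, 100_000))
    print(odds_ratio(100, 10_000, 100, 100_000))
```

Log-likelihood grows with the amount of evidence, so frequent words can score highly even when the relative difference between corpora is modest. Odds ratio depends only on the relative difference, so rare but corpus-specific words can rank at the top.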

Published Online: 2018-4-6
Published in Print: 2018-4-25

© 2018 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 6.2.2023 from https://www.degruyter.com/document/doi/10.1515/cllt-2015-0030/html