Corpus Linguistics and Linguistic Theory

Founded by Gries, Stefan Th. / Stefanowitsch, Anatol

Ed. by Wulff, Stefanie

IMPACT FACTOR 2017: 1.200
5-year IMPACT FACTOR: 1.386

CiteScore 2017: 0.80

SCImago Journal Rank (SJR) 2017: 0.288
Source Normalized Impact per Paper (SNIP) 2017: 0.930

Against statistical significance testing in corpus linguistics

Alexander Koplenig
Published Online: 2017-06-03 | DOI: https://doi.org/10.1515/cllt-2016-0036


In the first volume of Corpus Linguistics and Linguistic Theory, Gries (2005: 285) asked whether corpus linguists should abandon null-hypothesis significance testing. In this paper, I revive this discussion by arguing that the assumptions that allow inferences about a population (here, the languages under study) from results observed in a sample (here, a collection of naturally occurring language data) are not fulfilled. As a consequence, corpus linguists should indeed abandon null-hypothesis significance testing.

Keywords: corpus linguistic methodology; statistical significance; quantitative approaches; representativeness; null-hypothesis testing


  • Angrist, Joshua D. & Jörn-Steffen Pischke. 2008. Mostly harmless econometrics: An empiricist’s companion. Princeton, NJ: Princeton University Press.

  • Arppe, Antti, Gaëtanelle Gilquin, Dylan Glynn, Martin Hilpert & Arne Zeschel. 2010. Cognitive corpus linguistics: Five points of debate on current theory and methodology. Corpora 5(1). 1–27.

  • Arppe, Antti & Juhani Järvikivi. 2007a. Take empiricism seriously! – In support of methodological diversity in linguistics [Commentary of Geoffrey Sampson 2007. Grammar without Grammaticality.]. Corpus Linguistics and Linguistic Theory 3(1). 99–109.

  • Arppe, Antti & Juhani Järvikivi. 2007b. Every method counts – Combining corpus-based and experimental evidence in the study of synonymy. Corpus Linguistics and Linguistic Theory 3(2). 131–159.

  • Baroni, Marco & Stefan Evert. 2009. Statistical methods for corpus exploitation. In Anke Lüdeling & Merja Kytö (eds.), Corpus linguistics: An international handbook, vol. 2, 777–802. Berlin: de Gruyter Mouton.

  • Berk, Richard A. & David A. Freedman. 2003. Statistical assumptions as empirical commitments. In Sheldon L. Messinger, Thomas G. Blomberg & Stanley Cohen (eds.), Law, punishment, and social control: Essays in honor of Sheldon Messinger, 2nd edn. New York: Aldine de Gruyter. http://www.stat.berkeley.edu/~census/berk2.pdf (accessed 15 June, 2015).

  • Biber, Douglas. 1993. Representativeness in Corpus Design. Literary and Linguistic Computing 8(4). 243–257. (accessed 30 March 2015).

  • Brezina, Vaclav & Miriam Meyerhoff. 2014. Significant or random? A critical review of sociolinguistic generalisations based on large corpora. International Journal of Corpus Linguistics 19(1). 1–28.

  • Burnard, Lou (ed.). 2007. [bnc] British National Corpus. http://www.natcorp.ox.ac.uk/docs/URG/ (accessed 21 October 2014).

  • Chomsky, Noam. 1986. Knowledge of language: Its nature, origin, and use (Convergence). New York: Praeger.

  • Cohen, Jacob. 1994. The earth is round (p < 0.05). American Psychologist 49(12). 997–1003.

  • Deppermann, Arnulf & Martin Hartung. 2011. Was gehört in ein nationales Gesprächskorpus? Kriterien, Probleme und Prioritäten der Stratifikation des „Forschungs- und Lehrkorpus Gesprochenes Deutsch“ (FOLK) am Institut für Deutsche Sprache (Mannheim). In Ekkehard Felder, Marcus Müller & Friedemann Vogel (eds.), Korpuspragmatik. Berlin & Boston: de Gruyter. http://www.degruyter.com/view/books/9783110269574/9783110269574.415/9783110269574.415.xml (accessed 10 June, 2015).

  • Diekmann, Andreas. 2002. Empirische Sozialforschung: Grundlagen, Methoden, Anwendungen. 8th edn. Reinbek: Rowohlt Taschenbuch Verlag.

  • Durrell, Martin. 2015. “Representativeness”, “Bad Data”, and legitimate expectations. What can an electronic historical corpus tell us that we didn’t actually know already (and how)? In Jost Gippert & Ralf Gehrke (eds.), Historical corpora: Challenges and perspectives (Korpuslinguistik und interdisziplinäre Perspektiven auf Sprache 5), 13–33. Tübingen: Narr.

  • Ellenberg, Jordan. 2014. The Summer’s Most Unread Book Is …. The Wall Street Journal. http://www.wsj.com/articles/the-summers-most-unread-book-is-1404417569 (accessed 11 June 2015).

  • Ellis, Nick C. 2012. What can we count in language, and what counts in language acquisition, cognition, and use? In Frequency effects in language learning and processing. Berlin & Boston: de Gruyter (accessed 19 May 2016).

  • Evert, Stefan. 2006. How random is a corpus? The library metaphor. Zeitschrift für Anglistik und Amerikanistik 54(2). 177–190. (accessed 19 March 2014).

  • Fillmore, Charles J. 1992. “Corpus linguistics” or “Computer-aided armchair linguistics.” In Jan Svartvik (ed.), Directions in Corpus Linguistics, 35–60. Berlin: de Gruyter.

  • Gilquin, Gaëtanelle. 2008. What you think ain’t what you get: Highly polysemous verbs in mind and language. In Guillaume Desagulier, Jean-Baptiste Guignard & Jean-Rémi Lapaire (eds.), Du fait grammatical au fait cognitif. From Gram to Mind, vol. 2. Pessac: Presses Universitaires de Bordeaux.

  • Gilquin, Gaëtanelle & Stefan Th. Gries. 2009. Corpora and experimental methods: A state-of-the-art review. Corpus Linguistics and Linguistic Theory 5(1). 1–26.

  • Gries, Stefan Th. 2005. Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory 1(2). doi:10.1515/cllt.2005.1.2.277. http://www.degruyter.com/view/j/cllt.2005.1.issue-2/cllt.2005.1.2.277/cllt.2005.1.2.277.xml (accessed 28 May 2015).

  • Gries, Stefan Th. 2015. The most under-used statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora 10(1). 95–125.

  • Gries, Stefan Th., Beate Hampe & Doris Schönefeld. 2005. Converging evidence: Bringing together experimental and corpus data on the association of verbs and constructions. Cognitive Linguistics 16(4). 635–676.

  • Gries, Stefan Th., Beate Hampe & Doris Schönefeld. 2010. Converging evidence II: More on the association of verbs and constructions. In John Newman & Sally Rice (eds.), Empirical and experimental methods in cognitive/functional research, 59–72. Stanford, CA: CSLI.

  • Baayen, R. Harald. 2010. Demythologizing the word frequency effect: A discriminative learning perspective. The Mental Lexicon 5. 436–461.

  • Hunston, Susan. 2010. Corpora in applied linguistics, 7th printing (The Cambridge Applied Linguistics Series). Cambridge: Cambridge University Press.

  • Jann, Ben. 2005. Einführung in die Statistik. München & Wien: Oldenbourg.

  • Kellehear, Allan. 1993. The unobtrusive researcher: A guide to methods. St. Leonards, NSW: Allen & Unwin Pty Ltd.

  • Kertész, András & Csilla Rákosi (eds.). 2008. New approaches to linguistic evidence: Pilot studies = Neue Ansätze zu linguistischer Evidenz: Pilotstudien (MetaLinguistica v. 22). Frankfurt & New York: Peter Lang.

  • Kilgarriff, Adam. 2005. Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory 1(2). http://www.degruyter.com/view/j/cllt.2005.1.issue-2/cllt.2005.1.2.263/cllt.2005.1.2.263.xml.

  • Köhler, Reinhard. 2005. Korpuslinguistik – zu wissenschaftstheoretischen Grundlagen und methodologischen Perspektiven. LDV Forum 20(2). 1–16.

  • Kohnen, Thomas. 2007. From Helsinki through the centuries: The design and development of English diachronic corpora. In Päivi Pahta, Irma Taavitsainen, Terttu Nevalainen & Jukka Tyrkkö (eds.), Towards multimedia in corpus studies (Studies in Language Variation, Contacts and Change in English 2). Helsinki: Research Unit for Variation, Contacts and Change in English. http://www.helsinki.fi/varieng/journal/volumes/02/kohnen (accessed 5 October 2014).

  • Leech, Geoffrey. 1991. The state of the art in corpus linguistics. In Jan Svartvik, Karin Aijmer & Bengt Altenberg (eds.), English corpus linguistics: Studies in honour of Jan Svartvik, 8–29. London & New York: Longman.

  • Leech, Geoffrey. 2007. New resources, or just better old ones? The Holy Grail of representativeness. In M. Hundt, N. Nesselhauf & C. Biewer (eds.), Corpus Linguistics and the Web, 133–149. Amsterdam: Rodopi.

  • Lijffijt, Jefrey, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki & Heikki Mannila. 2014. Significance testing of word frequencies in corpora. Digital Scholarship in the Humanities. http://dsh.oxfordjournals.org/cgi/doi/10.1093/llc/fqu064 (accessed 22 April 2015).

  • Lykken, David T. 1968. Statistical significance in psychological research. Psychological Bulletin 70(3, Pt.1). 151–159.

  • Mandera, Paweł, Emmanuel Keuleers & Marc Brysbaert. 2015. How useful are corpus-based methods for extrapolating psycholinguistic variables? The Quarterly Journal of Experimental Psychology 1–20. (accessed 22 April 2015).

  • McEnery, Tony & Andrew Wilson. 1996. Corpus linguistics (Edinburgh Textbooks in Empirical Linguistics). Edinburgh: Edinburgh University Press.

  • McEnery, Tony, Richard Xiao & Yukio Tono. 2006. Corpus-based language studies: An advanced resource book. London & New York: Routledge.

  • Meehl, Paul E. 1978. Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology 46. 806–834.

  • Nuzzo, Regina. 2014. Scientific method: Statistical errors. Nature 506(7487). 150–152.

  • Oakes, Michael P. 1998. Statistics for corpus linguistics (Edinburgh Textbooks in Empirical Linguistics). Edinburgh: Edinburgh University Press.

  • Rieger, Burghard. 1979. Repräsentativität: Von der Unangemessenheit eines Begriffs zur Kennzeichnung eines Problems linguistischer Korpusbildung. In Henning Bergenholtz & Burkhard Schaeder (eds.), Empirische Textwissenschaft. Aufbau und Auswertung von Text-Corpora (Monographien Linguistik Und Kommunikationswissenschaft 39), 52–70. Königstein im Taunus: Scriptor. http://www.uni-trier.de/fileadmin/fb2/LDV/Rieger/Publikationen/Aufsaetze/79/rub79.html.

  • Schmid, Hans-Jörg. 2010. Does frequency in text instantiate entrenchment in the cognitive system? In Dylan Glynn & Kerstin Fischer (eds.), Quantitative methods in cognitive semantics: Corpus-driven approaches, 101–133. Berlin & New York: de Gruyter.

  • Schneider, Jesper W. 2013. Caveats for using statistical significance tests in research assessments. CoRR abs/1112.2516.

  • Schönefeld, Doris. 2011. Introduction. On evidence and the convergence of evidence in linguistic research. In Doris Schönefeld (ed.), Converging evidence: Methodological and theoretical issues for linguistic research (Human Cognitive Processing v. 33), 1–31. Amsterdam & Philadelphia: John Benjamins Pub. Co.

  • Schütze, Carson T. 1996. The empirical base of linguistics. Chicago: The University of Chicago Press.

  • Trochim, William. 2006. Design. Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/design.php (accessed 14 September 2011).

  • Tukey, John W. 1991. The philosophy of multiple comparisons. Statistical Science 6(1). 100–116.

  • Váradi, Tamás. 2001. The linguistic relevance of corpus linguistics. In Paul Rayson, Andrew Wilson, Tony McEnery, Andrew Hardie & Shereen Khoja (eds.), Proceedings of the Corpus Linguistics 2001 conference, Lancaster University (UK), 29 March – 2 April 2001, 587–593. Lancaster: Lancaster University.

  • Wasow, Thomas & Jennifer Arnold. 2005. Intuitions in linguistic argumentation. Lingua 115(11). 1481–1496.

  • Wasserstein, Ronald L. & Nicole A. Lazar. 2016. The ASA’s statement on p-values: Context, process, and purpose. The American Statistician.

  • Wiechmann, Daniel. 2008. On the computation of collostruction strength. Corpus Linguistics and Linguistic Theory 4(2). 253–290.

About the article

Published Online: 2017-06-03

Citation Information: Corpus Linguistics and Linguistic Theory, ISSN (Online) 1613-7035, ISSN (Print) 1613-7027, DOI: https://doi.org/10.1515/cllt-2016-0036.

© 2017 Walter de Gruyter GmbH, Berlin/Boston.
