Null-hypothesis significance testing of word frequencies: a follow-up on Kilgarriff

Stefan Th. Gries

1. Introduction

In this issue of Corpus Linguistics and Linguistic Theory, Adam Kilgarriff discusses several issues concerning the role of probabilistic modelling and statistical hypothesis testing in corpus linguistics and computational linguistics. Given the overall importance of these issues to both fields, I felt that the topic merits further discussion and decided to add my own two cents in the hope that this discussion note triggers further commentary or even some lively discussion and criticism.

The points raised in Kilgarriff’s paper are various and important, and considerations of space do not allow me to address all of them in the detail they deserve. I will therefore concentrate on one aspect of the paper which I find ‒ given my own research history and subjective interests ‒ particularly important, namely the issue of statistical hypothesis testing. More precisely, I will address one of the central claims of Kilgarriff’s paper. Kilgarriff argues ‒ apparently taking up issues from methodological discussions in many other disciplines (cf. section 2) ‒ that the efficiency of statistical null-hypothesis testing is often doubtful because (i) “[g]iven enough data, H0 is almost always rejected however arbitrary the data” and (ii) “true randomness is not possible at all”. In information-retrieval parlance, null-hypothesis significance testing, when applied to large corpora, yields too many false hits.
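Claim (i) can be made concrete with a small sketch. The counts below are hypothetical and are not taken from Kilgarriff's paper or from any corpus: a word is assumed to occur 105 times in 100,000 tokens of one corpus and 100 times in 100,000 tokens of another. The sketch uses a plain Pearson chi-square test on a 2×2 table (without continuity correction), which is only one of several tests one might choose here. Because the relative frequencies are held constant while all counts are multiplied by a scaling factor, the chi-square value grows in proportion to corpus size, and the same small difference eventually crosses the conventional 0.05 threshold.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    # For df = 1, chi2 is the square of a standard normal deviate,
    # so P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts (an assumption for illustration only):
# (word in corpus A, other tokens in A, word in corpus B, other tokens in B)
base = (105, 99_895, 100, 99_900)

results = []
for scale in (1, 10, 100, 1000):
    table = tuple(scale * count for count in base)
    chi2 = chi_square_2x2(*table)
    p = p_value_df1(chi2)
    results.append((scale, chi2, p))
    print(f"corpus size x{scale:<5} chi2 = {chi2:10.3f}   p = {p:.4g}")
```

The ratio of the two relative frequencies is 1.05 at every scale; only the amount of data changes. With the smallest tables the difference is far from significant, whereas with the largest ones the null hypothesis is rejected, although nothing about the size of the difference itself has changed.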

In this short discussion note I would like to do two things. First, I would like to make a few suggestions as to what I think are the most natural methodological consequences of Kilgarriff’s statement and several other points of critique concerning null-hypothesis significance testing raised in other disciplines. Second, I would like to revisit one of the examples Kilgarriff discusses in his paper to exemplify aspects of these proposals and show how the results bear on corpus-linguistic issues.

