Publicly Available Published by De Gruyter Mouton December 18, 2014

The importance of robust corpora in providing more realistic descriptions of variation in English grammar

  • Mark Davies
From the journal Linguistics Vanguard


This paper provides many concrete examples of how English grammar varies in important ways, as a function of differences between genres, as a function of language change, and as a function of differences between dialects. We also show – in great detail – how several recent corpora – such as COCA (2008), COHA (2011), GloWbE (2013), and the BYU interface to the Google Book n-grams (2012) – allow us to accurately examine this full range of variation, in ways that are not available with smaller corpora. As a result, these new corpora allow us to provide a much more reliable and insightful view into English syntax than was possible even four or five years ago.

1 Introduction

Too many grammars of English make overly general statements about the grammaticality or acceptability of certain syntactic phenomena, without taking into account the fact that those judgments might apply to just one genre of one dialect at one particular point in time (e.g. journalistic prose from the US in the 1990s–2000s). The use of corpus data might help to remedy this situation, but all too often even corpus linguists base their conclusions on corpora that fail to adequately take into account a full range of variation in the language.

In this paper, I will consider how syntactic phenomena can vary as a function of language change, genre-based differences, and dialectal differences. Equally important, I will consider how several recent corpora allow us to examine these three types of variation in ways that were quite impossible even four or five years ago. The overall message, then, is that with the right type of corpora, we can account for variation in a much more reliable way, and thus provide much more insightful investigations into English grammar.

2 Researching genre-based variation with COCA

One of the most important types of syntactic variation is that which results from differences between genres. Biber et al. (1999) was a groundbreaking 1,100+ page book that provided hundreds of examples of significant differences in the frequency of syntactic constructions and features between spoken, fiction, newspaper, and academic texts. As important as Biber’s work has been, there are two limitations. First, the data came from a proprietary corpus, which is not publicly available for use by others. Second, the 40 million word Longman corpus that they used – while quite robust for the mid-1990s – would now be seen as being rather small.

The Corpus of Contemporary American English (COCA; see Davies 2009, 2011), which was released in 2008, solves these two problems. First, it is 450 million words in size (and continues to grow by 20 million words each year), which makes it more than ten times as large as the Longman corpus. Second, it is publicly accessible, which means that teachers and students can easily replicate the investigations in Biber et al. (1999), and also carry out new studies, which can be verified by others. In this section, I will provide examples of a handful of such phenomena.

To begin with, we might look at the relative frequency of all modals in all genres of COCA (spoken, fiction, popular magazines, newspapers, and academic). Researchers have often looked at modals in much smaller 2–4 million word corpora, since they are one of the few syntactic phenomena that can be studied successfully with corpora that size. Expanding this search, we can see the frequency of must and have to (a semi-modal) followed by a lexical verb (e.g. must recognize, has to know) in the five main genres of COCA. Notice how must is more common in the more “formal” genres (cf. Leech 2003), whereas have to is more common in the informal genres, such as spoken.

Another example of clear genre-based variation in COCA is related to the “be” and “get” passives (e.g. John was/got fired from his job; see Hundt 2001). Two simple searches in COCA show us that the be passive is much more common in the formal genres, especially in academic, where explicit agents (e.g. in describing a chemistry experiment) would sound strange. The get passive, on the other hand, is most frequent in the informal genres (such as spoken).

While the spoken transcripts in the Longman Corpus are from common, everyday conversation, the spoken transcripts in COCA come from unscripted conversation on national TV and radio broadcasts. As a result, some might think that this conversation in COCA would be too formal and stilted, but this is not the case. For example, the data for the discourse marker you know or the like quotative (and I’m like, “no way”) show that they are much more common in the spoken transcripts in COCA.

The other significant advantage of COCA over the Longman Corpus (in addition to being much more recent) is that it is much larger. For some low-frequency constructions, this is of crucial importance. For example, consider the construction that combines the passive, perfect, and progressive (e.g. he had been being watched by the FBI). We see clear effects of genre with the construction, in that it occurs much more in spoken than in the other genres. But note that there are only fifteen tokens in COCA, which contains 450 million words. Even in a 100 million word corpus like the British National Corpus, there are only two tokens, and probably even fewer in a much smaller 40 million word corpus like the Longman Corpus.
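As a rough back-of-the-envelope illustration of why size matters here, the sketch below scales the fifteen COCA tokens to the sizes of the other two corpora. It assumes, purely for illustration, that the construction occurs at the same per-word rate in every corpus; the function name is hypothetical, and real counts will differ with genre mix, dialect, and period (the BNC, for instance, has two attested tokens).

```python
# Illustrative sketch: assumes a constant per-word rate across corpora,
# which is a simplification and not a claim made in the study itself.

def expected_tokens(observed: int, observed_size: int, target_size: int) -> float:
    """Scale an observed token count to a corpus of a different size."""
    return observed * target_size / observed_size

COCA_WORDS = 450_000_000
OBSERVED_IN_COCA = 15  # tokens of the passive + perfect + progressive construction

for name, size in [("BNC", 100_000_000), ("Longman", 40_000_000)]:
    estimate = expected_tokens(OBSERVED_IN_COCA, COCA_WORDS, size)
    print(f"{name}: about {estimate:.1f} expected tokens")
```

Under this assumption, a 40 million word corpus would be expected to contain only one or two tokens, far too few to show any genre effect.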

In summary, then, we can use COCA to quickly and easily search for and document important genre-based variation in English syntax, to confirm the detailed genre-based data in Biber et al. (1999). And for certain low-frequency constructions, COCA is perhaps the only corpus that will show such genre-based differences.

In addition to genre-based variation, with the right kind of corpora we can also map out historical changes in syntax. In the following three sections, we will see how this can be done for very recent and ongoing changes in English with COCA (Section 3), over the past 200 years with the 400 million word Corpus of Historical American English [COHA] (in Section 4), and over the past 200 years with the 155 billion word Google Books (Advanced) n-grams databases (Section 5). Due to limitations of space, just a small handful of examples will be given in each section.

3 Researching recent and ongoing syntactic changes with COCA

Turning first to recent, ongoing changes in English, I have argued elsewhere (see Davies 2011) that COCA is perhaps the only large corpus of English that allows us to look at such changes. This is due to the fact that COCA is the only large corpus that (1) is large enough in size to look at a wide range of phenomena, (2) continues to be updated, and (3) has a genre composition that is essentially the same from year to year. All three are crucial to mapping recent syntactic shifts.

As far as recent syntactic shifts, let us consider the rise in two fairly salient grammatical constructions that have increased in frequency during the past two decades. The "quotative like" construction (e.g. “and he’s like, …”; cf. Barbieri 2009) has had a sustained increase since the early 1990s (see the right side of the chart). Consider also the "so not" construction (e.g. I’m so not interested in him). Although the tokens for this construction are relatively sparse, we can still see a clear increase in the construction over the past two decades.

We might also consider three other syntactic shifts in contemporary American English, where the changes are probably much less salient. First, there is a clear increase in the "end up V-ing" construction (e.g. he ended up paying too much) during the past two decades. Second, the get passive (Bill got hired last week) has steadily increased over the past two decades (and there has been a corresponding decrease in the be passive as well). Third, there has been a slow but consistent shift from [help to V] to [help V] (I’ll help Mary to clean the room > I’ll help Mary clean the room) (cf. Mair 2002), as is seen in Table 1.

Table 1

Frequency of [help to V / help V]

        Search string        1990–1994  1995–1999  2000–2004  2005–2009  2010–2012
+ to    [help] [p*] to [v*]        825        798        726        668        370
– to    [help] [p*] [v*]         5,494      6,453      7,144      7,502      4,237
% – to                           86.9%      89.0%      90.8%      91.8%      92.0%
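The percentages in the last row of Table 1 follow directly from the two rows of raw counts. The minimal sketch below recomputes them; the counts are copied from the table, while the variable and function names are illustrative only.

```python
# Counts copied from Table 1; names are illustrative.
periods = ["1990-1994", "1995-1999", "2000-2004", "2005-2009", "2010-2012"]
with_to = [825, 798, 726, 668, 370]             # [help] [p*] to [v*]
without_to = [5494, 6453, 7144, 7502, 4237]     # [help] [p*] [v*]

def share(variant: int, competitor: int) -> float:
    """Percentage of one variant out of both competing variants."""
    return 100 * variant / (variant + competitor)

shares = [round(share(zero, to), 1) for zero, to in zip(without_to, with_to)]

for period, pct in zip(periods, shares):
    print(f"{period}: {pct}% without to")
```

Reporting the share of one variant, rather than its raw frequency, controls for the fact that the final block (2010–2012) covers fewer years and therefore has lower raw counts.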

Finally, COCA can be used to look at recent changes with prescriptively charged phenomena as well. For example, consider the split infinitive (to [Verb] [Adv] > to [Adv] [Verb], e.g. to go boldly > to boldly go) (cf. Close 1987). This is measured by the percentage of –ly adverbs (e.g. boldly, quickly) either before or after the infinitive, following to. As can be seen in Table 2, there is an increase in each five year block during the past two decades, for an increase of nearly 50% during this time.

Table 2

Frequency of split infinitive (e.g. to go boldly > to boldly go)

         Search string      1990–1994  1995–1999  2000–2004  2005–2009  2010–2012
– split  to [vv*] *ly.[r*]     12,595     11,353     11,456     10,415      5,048
+ split  to *ly.[r*] [v*]       7,681      8,907      9,955     10,819      6,353
% split                         37.9%      44.0%      46.5%      51.0%      55.7%
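The "nearly 50%" figure is a relative increase in the share of split infinitives, not a rise of 50 percentage points. The sketch below makes the distinction explicit using the first and last columns of Table 2; the names are illustrative.

```python
# Counts copied from Table 2 (first and last period only); names are illustrative.
unsplit = {"1990-1994": 12595, "2010-2012": 5048}   # to [vv*] *ly.[r*]
split = {"1990-1994": 7681, "2010-2012": 6353}      # to *ly.[r*] [v*]

def split_share(period: str) -> float:
    """Proportion of split infinitives among both word orders."""
    return split[period] / (split[period] + unsplit[period])

first = split_share("1990-1994")
last = split_share("2010-2012")

point_change = 100 * (last - first)          # change in percentage points
relative_increase = (last - first) / first   # change relative to the starting share

print(f"{point_change:.1f} percentage points, {relative_increase:.0%} relative increase")
```

The share rises by roughly 18 percentage points, which corresponds to a relative increase of a little under 50% over the starting value.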

These are just a handful of changes, and many more are discussed in Davies (2011). The important point is that COCA is probably the only corpus of English whose size and composition allow us to look at ongoing syntactic changes such as these.

4 Researching longer range syntactic changes with COHA

The Corpus of Historical American English (COHA) was released in 2010. It contains more than 400 million words of text from a wide range of genres, and it maintains roughly the same genre balance from decade to decade. At 400 million words, it is about 100 times as large as any other genre-balanced historical corpus of English (see Davies 2012a, 2012b).

Carrying out research on diachronic syntax with COHA is both quick and easy. For example, we can easily see the increase in the need to V (I need to leave) or end up V-ing (I’ll end up getting there late) constructions. Notice the nice S-curve increase in both constructions in the last 40–50 years.

Even more complicated studies of diachronic syntax can be carried out quite easily with COHA. For example, Table 3 considers adverb placement with modals. [A] represents pre-verbal placement (never|always [vm*] [vv*]: he never would answer his mail) while [B] is post-verbal placement ([vm*] never|always [vv*]: he would never answer his mail). With these two quick and easy searches, we can clearly see in Table 3 and Figure 1 the sustained shift towards post-verbal placement: he would never answer his mail.

Figure 1

Frequency of post-modal negation

Table 3

Frequency of post-modal negation

      1860   1870   1880   1890   1900   1910   1920   1930   1940   1950   1960   1970   1980   1990   2000
A      490    522    437    384    434    423    401    281    280    241    157    147    121    135     82
B    2,298  2,544  2,758  2,587  2,856  3,126  3,151  3,051  2,922  3,140  2,815  2,750  3,129  3,660  3,862
% B   0.82   0.83   0.86   0.87   0.87   0.88   0.89   0.92   0.91   0.93   0.95   0.95   0.96   0.96   0.98

Consider now a syntactic search that would likely be quite complex with other corpora, but which can be done quite easily with COHA. This deals with the increase in null relative pronouns at the expense of overt relative pronouns. [A] in Table 4 (and Figure 2) represents overt relative pronouns with he as relative clause subject ([nn*] that|which|who|whom he [vv*]: the woman that he married) while [B] is the zero relative pronoun ([nn*] – he [vv*]: the woman he married).

Figure 2

Zero relative (the man – he saw)

Table 4

Zero relative (the man – he saw)

       1850    1860    1870    1880    1890    1900    1910    1920    1930    1940    1950    1960    1970    1980    1990    2000
A     1,666   1,680   1,756   2,025   1,887   2,044   1,992   2,032   1,734   1,462   1,513   1,389   1,278   1,122     904     968
B     4,929   5,117   6,126   7,778   8,453   8,921   9,681  10,931   9,937   9,088   9,079   9,069   8,239   8,664   7,696   7,617
% B    0.75    0.75    0.78    0.79    0.82    0.81    0.83    0.84    0.85    0.86    0.86    0.87    0.87    0.89    0.89    0.89

Of course we might want to change the relative clause subject, or experiment with different types of antecedents, or eliminate “false positives”, and so on. But the point is that with COHA, we can do even relatively complex searches such as this – resulting in clear and unambiguous data like that shown above – in just a minute or so.

Small 2–4 million word historical corpora (like ARCHER or the Brown family of corpora) are limited primarily to looking at high frequency constructions, such as modals and auxiliaries. But because of its large size (100 times the size of these other corpora), COHA allows us to gain insight into a much wider range of syntactic changes in English.

5 Researching long-range syntactic changes with Google Books-Advanced

While COHA is composed of 400 million words of text, the Google Books n-grams are based on 155 billion words of data from millions of books, and this is just the data from the American English dataset. Unfortunately, the "standard" Google Books interface (see Michel et al. 2011) is extremely limited and simplistic, as far as syntactic searches go. It is difficult or impossible to search by either lemma or part of speech. For example, to search for the construction “end up V-ing” (ended up paying, ends up looking, etc.), one would have to search – one by one – for the individual strings end up paying, ended up paying, ends up paying, and then start over with tens of thousands of other embedded verbs – all of which would take weeks or months.

With the BYU/Advanced Google Books interface that we released in 2012, however, researchers can search by lemma and part of speech, and they could do a search like “end up V-ing” (which yields more than 400,000 tokens) in just 1–2 seconds. In addition to seeing the overall frequency, researchers can also see the frequency of each matching string in each decade, and then click on any of these to see the book excerpts at

Another example of a syntactic search that is quite easy and fast in Google Books Advanced (but quite impossible in Google Books Standard) is the increase in the periphrastic future with going to VERB (e.g. going to leave). We can easily search for “going to [v*]”, and we see the overall increase, as well as all of the matching strings. A further example is the “get passive” construction (e.g. got returned, get fired), which is definitely increasing over time (overall / individual forms). Again, with Google Books Standard, we would have to perform thousands of separate searches and perhaps spend days to get this data.

To take a pair of somewhat more complex constructions, consider first the “way construction”, which has been the focus of a great deal of research in construction grammar (see e.g. Goldberg 1997 for an introduction). In Google Books Advanced, we can simply search for “[vvd*] [ap*] way [i*]” to find more than 1,083,000 tokens for 3000 unique strings like find their way into, make his way through, groping their way into, and so on. Or consider the “causative into” construction – talked him into going, coerced them into buying, terrify me into doing, etc. (see Rudanko 2006; Wulff et al. 2007). The one simple search “[vv*] [p*] into [vvg*]” yields 30,200 tokens for 234 different strings.
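To make the logic of such a query concrete, the following toy sketch matches a pattern like “[vv*] [p*] into [vvg*]” against a short part-of-speech-tagged sentence. It is not the corpus interface's implementation: it assumes, for illustration only, that bracketed items are tag patterns with `*` as a wildcard and that bare items are literal word forms, and the sample sentence is hand-tagged in a CLAWS-like style. Lemma searches and alternation with `|` are deliberately left out.

```python
from fnmatch import fnmatchcase

def parse_query(query):
    """Split a query into slots: ('tag', pattern) for [..] items, ('word', form) otherwise."""
    slots = []
    for item in query.split():
        if item.startswith("[") and item.endswith("]"):
            slots.append(("tag", item[1:-1].lower()))
        else:
            slots.append(("word", item.lower()))
    return slots

def find_matches(tokens, query):
    """Return every run of consecutive (word, tag) tokens that satisfies all query slots."""
    slots = parse_query(query)
    hits = []
    for start in range(len(tokens) - len(slots) + 1):
        window = tokens[start:start + len(slots)]
        ok = all(
            fnmatchcase(tag.lower(), pattern) if kind == "tag" else word.lower() == pattern
            for (word, tag), (kind, pattern) in zip(window, slots)
        )
        if ok:
            hits.append(" ".join(word for word, _ in window))
    return hits

# Hand-tagged toy sentence; the tags are assigned by hand for illustration.
sentence = [("she", "pphs1"), ("talked", "vvd"), ("him", "ppho1"),
            ("into", "ii"), ("going", "vvg"), ("home", "rl")]

print(find_matches(sentence, "[vv*] [p*] into [vvg*]"))
```

Because each slot is a class of words rather than a single string, one query of this kind stands in for the very large number of literal strings that would otherwise have to be searched one at a time.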

As we have seen above, we can also carry out searches on two competing constructions to find evidence for syntactic change. For example, Table 5 (and Figure 3) provide data for the use of the subjunctive and indicative in the context “if I/he/she/it was/were” (e.g. if I was/were; cf. Gonzalez-Alvarez 2003), based on 6,153,000 tokens from the 1810s–2000s. As can be seen in Table 5 and Figure 3, there is an increase in the use of the indicative since about the 1950s.

Figure 3

Percentage of if clauses with indicative (vs. subjunctive)

Table 5

Subjunctive vs indicative form with if

                 1900     1910     1920     1930     1940     1950     1960     1970     1980     1990     2000
Subjunctive   271,173  314,531  228,277  185,032  189,935  239,876  361,927  331,667  345,971  465,334  629,778
Indicative     98,417  116,839   82,125   69,514   68,948   89,690  142,617  147,712  178,489  287,666  510,332
% indicative     0.27     0.27     0.26     0.27     0.27     0.27     0.28     0.31     0.34     0.38     0.45

Another example of syntactic variation over time deals with verbal subcategorization; in this case whether or not to is used in complements of help (help him to do it vs help him do it). Two simple queries yield 3,812,000 tokens, which show a clear increase in the omission of the complementizer to (see Table 6 and Figure 4).

Figure 4

Percentage of help complements without to

Table 6

help NP (to) VERB

             1820       1840       1860       1880       1900       1920       1940       1960       1980       2000
to          1,481      5,058      8,242     17,563     40,179     46,904     42,336     89,654     90,702    209,335
zero          533      2,077      4,541     11,285     25,600     37,196     59,769    157,044    302,561  1,186,332
% zero       0.26       0.29       0.36       0.39       0.39       0.44       0.59       0.64       0.77       0.85

The contrast between Google Books Standard and Google Books Advanced – in terms of how they can be used to look at syntactic change – is quite striking. For example, in the case of the “causative into” construction discussed above (“V1 NP into V2-ing”), we would have to search for [thousands of V1] × [thousands of V2] × [all possible pronouns] (e.g. forced him into accepting, coax us into returning). There would be hundreds of thousands or even millions of unique strings, and it would take months or perhaps years to carry out this research in GB-S. In GB-Adv, on the other hand, we have all of the data in just 2–3 seconds.
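The size of this search space can be sketched with simple multiplication. All of the figures below are hypothetical placeholders (1,000 candidate verbs in each verb slot, seven object pronouns, ten seconds per manual query); they are chosen only to show the order of magnitude and are not taken from the data.

```python
from math import prod

# Hypothetical slot sizes for "V1 NP into V2-ing"; not figures from the corpus.
slot_sizes = {"V1": 1_000, "pronoun": 7, "V2": 1_000}

candidate_strings = prod(slot_sizes.values())

SECONDS_PER_QUERY = 10  # hypothetical time to type and run one literal search
days_of_searching = candidate_strings * SECONDS_PER_QUERY / 86_400

print(f"{candidate_strings:,} candidate strings")
print(f"about {days_of_searching:,.0f} days of non-stop manual searching")
```

Even with these modest assumptions, and before counting inflected forms of V1, the number of literal strings to try runs into the millions, which is why a single wildcard query makes such a difference.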

Finally, notice the incredibly large number of tokens for these constructions. For example, there are more than 6 million tokens of the “if I/he/she/it was/were” construction (Table 5, Figure 3), and the number of tokens for this one minor construction is more than three times as large as the total number of words in other often-used historical corpora such as ARCHER and the Brown family of corpora.

6 Researching dialectal variation in syntax in other World Englishes

Until recently, the largest corpus of English from several different dialects was the International Corpus of English (ICE), which is composed of one million words each from about thirteen English-speaking countries.

The corpus of Global Web-based English (GloWbE), which was released in 2013, is about 150 times as large as ICE, and it contains nearly two billion words of English from twenty different countries. The countries with the largest corpora are the US and the UK (about 385 million words each), but there are also at least 40 million words each from the other countries as well (and in many cases many more than that): Canada, Ireland, Australia, New Zealand, India, Sri Lanka, Pakistan, Bangladesh, Singapore, Malaysia, Philippines, Hong Kong, South Africa, Nigeria, Ghana, Kenya, Tanzania, and Jamaica.

Note that nearly 60% of all of the texts in GloWbE (more than a billion words of text) come from blogs, which represent informal language quite nicely. Nevertheless, these are probably not as informal as the conversations in ICE, which model informal, spontaneous English extremely well.

To see how large corpora such as GloWbE can be used to look at dialectal variation in English, consider again the "like construction", which has been discussed above. The 3,114 tokens in GloWbE show that the construction is the most frequent in the US and Canada, but that it is also relatively common in the other “core” countries of English as well, including Great Britain (GB), Ireland (IE), Australia (AU), and New Zealand (NZ) – in roughly descending order of frequency.

Another example deals with verbal subcategorization, with the “stop NP (from) V-ing” construction (e.g. stop them saying that, stop them from doing that). While the variant with from has roughly the same frequency in the different dialects, the variant without from is much less frequent in the United States and Canada than in the other “core” dialects of English (GB, IE, AU, NZ).

With GloWbE, it is also possible to see all of the matching strings for certain constructions, as well as the frequency of each of these strings in each of the twenty dialects. For example, we can search for the “way construction” (e.g. made his way through the crowd, fought her way through the pitfalls), or the “causative into” construction (e.g. he talked her into staying, they forced me into admitting it).

Finally, with GloWbE we can also easily look at phenomena that straddle syntax and discourse. For example, we find that “having said that…” is relatively less common in American and Canadian English than the other Inner Circle dialects, whereas “that said…” is the most common in American and Canadian English.

In all of the cases discussed in this section, we have robust data from GloWbE – typically hundreds or even thousands of tokens. If we were using the ICE corpus, on the other hand, we would have many fewer tokens. The International Corpus of English is only about 1/150th the size of GloWbE (~13 million words in ICE, but nearly two billion words in GloWbE). So where we might have 2,000 tokens in GloWbE, we might have just ten or fifteen total (for all countries) in ICE. Virtually none of the examples discussed in this section would be possible with any other corpus besides GloWbE.

7 Conclusion

In this paper, I have provided many different concrete examples of how English grammar varies in important ways, as a function of differences between genres, as a function of language change, and as a function of differences between dialects. All of this data shows that it is far too simplistic to say that “Structure X is acceptable or common in English”, when in fact it may be so in one historical period but not 50 or 100 years before, or in just American English but not British English, or in just academic English but not spoken English.

As we have seen, several recent corpora – such as COCA (2008), COHA (2011), GloWbE (2013), and the BYU interface to the Google Book n-grams (2012) – allow us to accurately examine this full range of variation. Corpus size is often a crucial factor. New 400 million word corpora like COHA can provide 100 times as much data as previous small corpora like the Brown family of corpora or ARCHER, and a corpus like GloWbE often provides 100–200 times as much data as the combined ICE corpora.

These new corpora are also very user friendly, especially in the sense that it is possible – via the corpus interface at – to seamlessly move between these different corpora (with just one click) to compare phenomena over time, between dialects, and between genres. The end result is that these new corpora allow us to provide a much more reliable and insightful view into English syntax than was possible even four or five years ago.


References

Barbieri, Federica. 2009. Quotative “be like” in American English: Ephemeral or here to stay? English World-Wide 30. 68–90. doi:10.1075/eww.30.1.05bar

Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad & Edward Finegan. 1999. Longman grammar of spoken and written English. London: Longman.

Close, R.A. 1987. Notes on the split infinitive. Journal of English Linguistics 20. 217–229. doi:10.1177/007542428702000206

Davies, Mark. 2009. The 385+ million word corpus of contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics 14. 159–190. doi:10.1075/ijcl.14.2.02dav

Davies, Mark. 2011. The corpus of contemporary American English as the first reliable monitor corpus of English. Literary and Linguistic Computing 25. 447–465. doi:10.1093/llc/fqq018

Davies, Mark. 2012a. Examining recent changes in English: Some methodological issues. In Terttu Nevalainen & Elizabeth Closs Traugott (eds.), Handbook on the history of English: Rethinking approaches to the history of English, 263–287. Oxford: Oxford University Press.

Davies, Mark. 2012b. Expanding horizons in historical linguistics with the 400 million word corpus of historical American English. Corpora 7. 121–157. doi:10.3366/cor.2012.0024

Goldberg, Adele. 1997. Making one’s way through the data. In Masayoshi Shibatani & Sandra A. Thompson (eds.), Grammatical constructions: Their form and meaning, 29–53. Oxford: Clarendon Press.

Gonzalez-Alvarez, Dolores. 2003. If he come vs. if he comes, if he shall come: Some remarks on the subjunctive in conditional protases in early and late modern English. Neuphilologische Mitteilungen 104(3). 303–313.

Hundt, Marianne. 2001. What corpora tell us about the grammaticalisation of voice in get-constructions. Studies in Language 25(1). 49–87. doi:10.1075/sl.25.1.03hun

Leech, Geoffrey. 2003. Modality on the move: The English modal auxiliaries 1961–1992. In Roberta Facchinetti, Manfred Krug & Frank R. Palmer (eds.), Modality in contemporary English, 224–240. Berlin: Mouton de Gruyter.

Mair, Christian. 2002. Three changing patterns of verb complementation in late modern English: A real-time study based on matching text corpora. English Language and Linguistics 6. 105–131. doi:10.1017/S1360674302001065

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak & Erez Lieberman Aiden. 2011. Quantitative analysis of culture using millions of digitized books. Science 331. 176–182. doi:10.1126/science.1199644

Rudanko, Juhani. 2006. Emergent alternation in complement selection: The spread of the transitive into -ing construction in British and American English. English Linguistics 34(4). 312–331. doi:10.1177/0075424206293600

Wulff, Stefanie, Anatol Stefanowitsch & Stefan Gries. 2007. Brutal Brits and persuasive Americans: Variety specific meaning construction in the into-causative. In G. Radden et al. (eds.), Aspects of meaning construction, 265–281. Amsterdam: John Benjamins. doi:10.1075/z.136.17wul

Published Online: 2014-12-18
Published in Print: 2015-12-1

©2015 by De Gruyter Mouton
