This corpus-based study of pluralized non-count nouns (informations, advices, etc.) uses collocation-derived measures (determiners vs. bare noun and mass quantifiers) to extract potential candidates of non-count nouns in a bottom-up approach from the British National Corpus (BNC), allowing the detection of grammatical categories from distributional features. We then use this token list to retrieve data on pluralization of non-counts from nine annotated components of the International Corpus of English (ICE). While the distinction between count and non-count nouns is gradient rather than categorical, it is still possible to distinguish between standard and non-standard pluralization of non-counts. Qualitative analyses of our data show that non-standard pluralization of non-count nouns is regularly attested in second-language varieties, including previously unrecorded types; however, it is also occasionally found in first-language varieties. We discuss implications of our corpus results for common explanations of pluralized non-count nouns, such as substrate influence, language learning effects and historical input. By combining a bottom-up corpus-based approach with fine-grained qualitative analyses we can provide a more nuanced view of pluralization of non-counts across ENL and ESL for the investigation of World Englishes.
Kachru’s (1985) model of WE groups countries into three concentric circles: the Inner Circle where English is the first or “native” language (ENL) of the majority of speakers (e.g. GB or Australia), the Outer Circle where English is an institutionalized second language (ESL; e.g. India or Nigeria), and the Expanding Circle where English is widely used as a foreign language (EFL; e.g. much of continental Europe). The taxonomic problems of this and related models have been discussed extensively (Schneider 2015). One of them is that the circles model assumes a categorical distinction between variety types. Our paper contributes to the critical empirical investigation of this assumption. The key question is whether there are truly diagnostic features, i.e. features exclusively found in ESL. One of the most frequently cited candidates is the extended use of –s pluralization with all noun types (see Section 2.2). Corpus-based research into this issue is still limited and, with the notable exception of Schmidtke and Kuperman (2017), studies have been limited to a predefined set of items. Our methodological goal is to collect an extensive list of pluralized non-count nouns in a bottom-up approach (rather than the few prototypical types usually cited in the literature). We use this list to assess their overall text frequency and use across different WE and discuss their relevance for common explanations such as substrate influence, language learning effects and historical input.
We first provide background information on the distinction between count and non-count nouns, previous corpus-based research on pluralized non-counts and extension of plural marking in WE (Section 2). We describe the corpora that we use as our source of evidence and the procedure to semi-automatically retrieve pluralized non-count nouns (Section 3). The results are reported and evaluated in Section 4, and discussed in the context of variety types in Section 5.
Count nouns refer to entities that speakers conceptualize as countable; for those that are not, grammars (e.g. Quirk et al. 1985: 247) distinguish between concrete (butter) and abstract (laziness) non-counts. On closer inspection, the dichotomy turns out to be too simplistic. Nouns refer to entities whose atomicity appears to range from clearly bounded/solid to amorphous/non-solid (e.g. Jackendoff 1991). In line with this underlying cognitive gradient, the grammatical encoding is gradient, too:
There are nouns that arguably can be treated as either mass or count (e.g. bread). Furthermore, nouns that seem to belong to one class may be coerced to the other by specific syntactic constructions. Mass nouns may occur as count nouns; for example three beers ‘three glasses of beer’, three oils ‘three kinds of oil.’ And count nouns may occur as mass nouns; for example apple in Put more apple into the salad! … The meaning of a noun occurrence, consequently, is a function of its lexical meaning and the syntactic context in which it appears. (Krifka 1999: 221) 
Krifka (ibid.) argues that, in terms of countability, non-count nouns include both “stuff nouns” like oil, gold, flour (prototypical mass nouns) and “collective nouns” such as furniture, cattle, staff. Grimm and Levin (2011, 2012) distinguish a sub-group of the latter, which they refer to as “functional collectives” (i.e. furniture, luggage, jewelry) that “straddle” the traditional mass-count distinction. While count nouns may be coerced into a mass-noun use and vice versa, Cruse (1999: 270) notes that “ … one usage is intuitively felt to be more basic than the other: … apple is basically a count noun and beer a mass noun”. Non-prototypical uses of stuff nouns, according to him, fall into two groups: 1) a “kind of” reading (e.g. Chinese green teas), and 2) a parcelling-out of mass into measurable quantities (e.g. three beers, two ice-creams). Moreover, some nouns are hybrid: These oats are not suitable for muesli (count) vs. How much oats have you got in that sack (mass) (Cruse 1999: 269). Furthermore, polysemous nouns like control have both an abstract, non-count reading and a concrete, count-noun sense as in the controls of an airplane.
The distinction between count and non-count nouns is further complicated by their morphology: non-count nouns like measles are only expected in their plural form, while count nouns like fish do not typically inflect for plural. Words like cattle are plural in meaning but singular in form, so overt countability (in standard ENL grammar) is only possible with the help of a classifier, i.e. twelve head of cattle, whereas words like scissors are singular in meaning but plural in form, so a classifier is needed as an additional site for plural marking, i.e. three pairs of scissors. Plural morphology on the noun itself is thus not stricto sensu a reliable indicator of the count-mass distinction. This poses a methodological challenge for our study (see Section 3).
Distributional information, i.e. statistical analyses of the context, is an important source for deriving word category and syntactic structure in theoretical linguistics (Harris 1954), computational linguistics (Klein and Manning 2001) and psycholinguistics (Tomasello 2000; Mintz et al. 2014). In this spirit, we focus on previous research that has used bottom-up approaches to the retrieval of non-counts in English. Baldwin and Bond (2003) use a rich set of 1284 features to predict (non-)countability. The features include frequency, several Bayesian probabilities each of number of head and modifier(s), number disagreement in noun conjunctions, absence of determiner, participation in of constructions and context-based features such as pronouns and verb number in the vicinity. They achieve an F-score (i.e. the harmonic mean of precision and recall) of up to 89% on assigning the class uncountable but do not provide an evaluation of the predictive power of individual features. Moreover, their data come from standard ENL corpus material only.
Schmidtke and Kuperman (2017) use data from the Global Web-based English (GloWbE) corpus and a combination of a bottom-up approach for data retrieval with a top-down approach for variety clustering to study pluralization of mass nouns across ENL and ESL varieties (on the basis of an a priori grouping of varieties). They use a frequency-based approach that is “blind to the count-mass distinction in the initial step” (2017: 141). Moreover, they do not provide a qualitative analysis of their data with respect to the semantics of the pluralized non-counts, i.e. they compare overall pluralization rates and do not distinguish between count-noun coercion and “proper” overextension. Thus, their “coarse-grained”, quantitative approach needs supplementing with qualitative analyses, as they (2017: 159) point out themselves.
The assessment for 76 English varieties in the electronic World Atlas of Varieties of English (eWAVE, Kortmann and Lunkenheimer 2011), which is based on linguists’ rating of vernacular features in WE, yields a clear pattern: pluralization of non-count nouns is particularly frequent in ESL varieties. An A (“pervasive or obligatory”) or B rating (“neither pervasive nor extremely rare”) is given for 15 out of 18 L2 Englishes (83.3%). Generalized –s pluralization has been noted to be particularly common in ESL of Africa and Asia such as Kenya, Cameroon, India, Sri Lanka or Hong Kong. According to Mesthrie and Bhatt (2008: 53) “[a]lmost every study of individual WE varieties in Africa and Asia reports frequent examples like furnitures, equipments, staffs, fruits, accommodations, and less common ones like offsprings, underwears, paraphernalias, etc”. The feature is absent in the majority of regional varieties of AmE and BrE (8 out of 10). Table 1 lists the ratings that are relevant in our context for the varieties available in the ICE corpora (see Section 3), including vernacular varieties in contact with the non-ENL varieties which we study.
|Variety type according to eWAVE||A (feature pervasive or obligatory)||B (feature neither pervasive nor extremely rare)||C (feature exists, but extremely rare)||D – (attested absence or other rating)|
|High-contact L1||Singapore E, Philippine E||American E, Irish E, New Zealand E|
|Indigenized L2 varieties||Hong Kong E, Indian E||Jamaican E|
|English-based Creoles||Jamaican Creole|
Mair (2017: 16) provides evidence from his Corpus of Cyber-Nigerian for pluralization of stuffs, showing that the nativized pattern is more frequent on web-pages in Nigeria than by expatriate Nigerians in the US and Great Britain (see also Alo and Mesthrie 2004: 821), indicating that the feature is susceptible to standardization in dialect contact situations. Mohr’s (2016) top-down study of 22 nouns in ICE and GloWbE shows that individual varieties in East Africa differ significantly with respect to the frequency of overgeneralized plural non-counts, thus qualifying the rater-based eWAVE description but supporting a general ENL-ESL divide. Schmidtke and Kuperman’s (2017) results indicate that pluralized non-count nouns are, indeed, more regularly attested in ESL data with respect to relative magnitude of pluralization. Moreover, ESL varieties cluster regionally, with e.g. South Asian varieties (Pakistani, Sri Lankan and Indian English) showing a similar propensity for pluralization. An important caveat with respect to Mohr’s (2016) and Schmidtke and Kuperman’s (2017) results is that they provide frequencies but no qualitative analyses of the nouns in context, i.e. no information on the proportion of regular and non-standard pluralization of non-counts.
Mesthrie and Bhatt (2008: 161) attribute the extension of plural marking in L2 Englishes to learning strategies and transfer. Sharma (2012: 524) adds historical source dialects as a third factor. With respect to historical explanations, it is important to note that pluralization was originally possible for many nouns but then lost from the ENL varieties that served as the original input (see also Denison 1998: 96–98). Examples would be per cent, which had a count-noun sense referring to stocks paying a specific interest rate (see OED, s.v. per cent, n.) and advice in the sense of “opinion”, which the OED (s.v. 2.b.) describes as “[n]ow chiefly Caribbean and S. Asian”. A cursory glance at historical data (e.g. from the court proceedings of the Old Bailey and the Corpus of Historical American English) shows that pluralized non-counts are, in fact, regularly attested in earlier stages of BrE ((1)–(4)) and AmE ((5)–(8)):
Regarding the very commonly observed possibility of substrate influence, we have to consider that, while languages universally have ways of expressing the distinction between singular and non-singular (including categories such as dual and plural), not all languages mark number or “numerosity” (Cruse 1999: 267) morphologically in the noun phrase. Wong (2012: 552–553) points out that
[t]he break-down of mass/count noun distinctions in [Hong Kong English] … can also be traced back to the syntax of the substrate. The overall structure of a noun phrase in Cantonese is similar to the English one, with the difference that a classifier (CL) is required in the former but not in the latter.
The use of a classifier means that the default class is unclear and can be overridden fully productively in Cantonese. 
This fact deserves special attention in the analysis of Englishes that are embedded into high-contact scenarios alongside typologically very different languages. Moreover, semantic and pragmatic aspects play a role when there is a lexical element in cross-linguistic variability: as Cruse (1999: 270) points out, even if languages mark number morphologically in the noun phrase and distinguish between count and non-count nouns (including both stuff and collective nouns), there may be variation in the conceptualization of individual nouns: “ … spaghetti is a singular mass noun in English, but plural in Italian and French, … ; fruit is basically a mass noun in English (Have some fruit), but a count noun in French … ”. A list of examples of nouns that are “non-count” in English but “count” in other languages (including typologically related languages like German, for instance) includes accommodation, advice, baggage, equipment, food, homework, information, hair, luggage, machinery, money, news, progress, and trouble.
In second language acquisition, transfer from the substrate language may play a role, but in addition, vernacular features may arise from general mechanisms of language acquisition in contact-induced processes of language shift, such as analogical extension or overgeneralisation, “economy of production” (leading to simplification) and “hyperclarity” (resulting in redundant marking, see Mesthrie 2017: 186–187). However, beyond the outer circle, Hall et al. (2013: 20) show that non-standard pluralization is very infrequent in ELF contexts, concluding that the feature is not helpful in distinguishing ENL from “non-native” varieties, generally. As inner circle and expanding circle are similar for this feature, it can add a new, unexpected pattern to the study of the gradience from ENL to ESL and EFL (Mukherjee and Hundt 2011; Deshors et al. 2016; Schneider and Gilquin 2016; Meriläinen and Paulasto 2017).
The aim of our paper is to test the hypothesis that the extension of pluralization to non-count nouns beyond standard instances of coercion as in two coffees is particularly frequent in, and limited to, ESL varieties. We do this using a two-pronged approach: a corpus-based bottom-up retrieval of a list of candidates of non-count nouns (instead of the widely used top-down approach); this list is then used to retrieve data on potential pluralized non-counts from corpora of WE. In a final step, we analyse our candidates for extended pluralization qualitatively, thus moving beyond Schmidtke and Kuperman’s (2017) purely quantitative approach.
We use the BNC and the ICE, which were automatically annotated using a syntactic dependency parser (Schneider 2008). ENL data come from ICE-GB (Great Britain), ICE-IRE (Ireland), ICE-CAN (Canada), ICE-NZ (New Zealand) and ESL data from ICE-SIN (Singapore), ICE-HK (Hong Kong), ICE-IND (India), ICE-PHI (Philippines) and ICE-JAM (Jamaica). Like Schmidtke and Kuperman (2017), we apply a bottom-up approach to extract potential pluralized non-counts. While their study relies exclusively on morphological marking and does not distinguish between count and non-count for the retrieval, our approach is more theory-informed in that it uses collocation statistics and morphosyntactic criteria typical of non-counts (see 3.1). The initial results are evaluated and fine-tuned in two steps (3.2 and 3.3). The list obtained from the BNC is used in a top-down approach to retrieve potential pluralized non-counts from the ICE components.
According to Krifka (1999: 221), mass nouns (both what he calls “stuff nouns” and “collective nouns”) are characterized by three properties:
They do not co-occur with the indefinite article a(n): *an oil but are typically used as bare NPs (and without overt number marking);
they typically do not combine with “number words” (*one cheese, *three golds) but can be used in “numerative constructions” (e.g. five gallons of gas);
they generally do not co-occur with certain quantificational determiners (*every, many, all oil/butter/chocolate) in their (default) mass interpretation, selecting a different set of quantifiers instead (much, little, some, a lot of, a huge amount of oil/butter/gold).
In the following, we will illustrate how we operationalized these properties to retrieve potential non-counts.
Property 1: The use of bare NPs (e.g. I like Ø milk) is difficult to measure with surface approaches. Our parsed corpora allow us to approximate these by retrieving singular nouns without a determiner and excluding NPs headed by a proper name (e.g. I like Peter). Simply sorting these data by frequency results in a list of generally frequent nouns rather than non-counts. Factoring in the probability of bare vs. non-bare NP is problematic as well because of data sparseness. Ranking by collocational force works considerably better. Typically, the significance-based T-score performs better than the information-theoretic Observed divided by Expected (O/E) or mutual information (MI) metric on this task. For an overview of collocation statistics, see Evert (2009). The T-score is defined as (O - E)/√O, where O is the observed co-occurrence count in the corpus and E is the co-occurrence frequency expected if words were randomly shuffled.
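The T-score definition can be illustrated with a minimal Python sketch. The function names and all counts are hypothetical, and the expected frequency is computed under the usual independence assumption (context frequency × noun frequency / corpus size), which is one reading of “randomly shuffled” rather than a formula given in the text.

```python
import math


def expected_cooccurrence(f_context: int, f_noun: int, corpus_size: int) -> float:
    """Expected co-occurrence frequency E if words were randomly shuffled.

    Assumption: independence of context and noun, i.e. E = f1 * f2 / N.
    """
    return f_context * f_noun / corpus_size


def t_score(observed: float, expected: float) -> float:
    """T-score as defined in the text: (O - E) / sqrt(O)."""
    if observed <= 0:
        raise ValueError("observed count must be positive")
    return (observed - expected) / math.sqrt(observed)


def observed_over_expected(observed: float, expected: float) -> float:
    """The O/E measure mentioned alongside the T-score."""
    return observed / expected


# Hypothetical counts (not taken from the BNC):
# 50,000 bare singular NP slots, a noun with 400 tokens, 1,000,000 words,
# and 100 observed occurrences of the noun as a bare singular NP.
E = expected_cooccurrence(50_000, 400, 1_000_000)  # 20.0
T = t_score(100, E)                                # (100 - 20) / 10 = 8.0
OE = observed_over_expected(100, E)                # 5.0
```

Because the T-score divides by √O, it favours well-attested pairs, whereas O/E can assign high values to rare pairs. This is consistent with the observation that the T-score ranks candidates better on this task.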
Table 2, which is sorted by descending T-score of zero-determiner + noun (column 2), lists the findings for the top 15 bare NP candidates; for the top 200 candidates combined, precision is 60%. Precision describes the fraction of retrieved nouns that are true positives. At rank 5, for example, four of the five nouns from the top of the list down to that rank are true positives, so precision is 4/5 = 80%.
|Noun||T.zero||f(noun.SING)||Manual Verdict||Rank||Precision T.zero|
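The rank-wise precision reported for the candidate lists can be computed as in the following Python sketch. The function name is illustrative, and the order of the manual verdicts is invented: the text only states that four of the first five candidates are true positives, not which one is the false positive.

```python
def cumulative_precision(verdicts):
    """Return precision at each rank for a ranked list of manual verdicts.

    verdicts: booleans ordered from rank 1 downwards (True = true positive).
    """
    hits = 0
    precisions = []
    for rank, is_true_positive in enumerate(verdicts, start=1):
        hits += bool(is_true_positive)
        precisions.append(hits / rank)
    return precisions


# Hypothetical verdicts for ranks 1-5 (position of the false positive is assumed).
top5 = [True, True, False, True, True]
precision_by_rank = cumulative_precision(top5)
precision_at_5 = precision_by_rank[-1]  # 4/5 = 0.8
```

Plotting such a list against rank gives a cumulative precision curve of the kind described for the candidate rankings.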
A further aspect of property 1 is that non-counts are usually unmarked for number. With a precision of 70% for the top 200 candidates, this morphological property of nouns is the best single feature for the retrieval of potential non-counts. Figure 1 shows the cumulative precision (vertical axis) by rank (horizontal axis). At the rightmost position (250), the precision of the candidates is still almost 70%, with 174 of the 250 top candidates being true positives, and the curve only falls slowly. The fact that the curve falls, i.e. precision is highest to the left, indicates that the feature (the T-score of zero determiner plus noun) has a strong positive correlation with non-countability. This is a further indication that we use a meaningful operationalization, and it can also be interpreted as a cognitive signal: absence of a determiner prepares listeners for a non-count noun.
Property 2: We approximate Krifka’s “numerative constructions” by exploiting the fact that non-counts like bread often occur inside an of-PP construction modifying an NP headed by a measurement noun, e.g. slice of bread. Manual post-editing of the initial results yields the corpus-derived inventory of quantifier nouns in Table 3.
|number, range, lack, lot, amount, piece, group, bit, edge, example, bottle, glass, quantity, model, unit, row, item, hand, acre, stretch, pint, page, mile, pile, period, degree, copy, share, plenty, quarter, half, charge, round, volume, moment, body, word, glass, drink, amount, supply, jug, drop, cup, bowl, tin, litre, carton, slice, loaf, plate, chunk, hunk, basket, pound, slab, ounce, mug, pot, tray, flask, sip, gulp|
Our approximation of property 2 performs much better once the collocation value obtained for noun1 of noun2 is multiplied by a boosting factor (empirically set to 4) whenever noun1 is in the quantifier noun inventory. The 20 most highly ranked candidates are given in Table 4.
|Rank||T.noun1 of noun2||noun2||Manual|
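The boosting step for property 2 can be sketched in Python as follows. The names are illustrative, the inventory is only a small subset of Table 3, and the T-scores are invented. How scores for the same noun2 are aggregated across different noun1 heads is not specified in the text, so the sketch stops at the level of individual noun1 of noun2 pairs.

```python
# Subset of the quantifier noun inventory in Table 3 (abbreviated for illustration).
QUANTIFIER_NOUNS = {"slice", "piece", "bottle", "glass", "pint", "loaf", "amount"}

# Boosting factor, empirically set to 4 according to the text.
BOOST = 4


def boosted_pair_score(t_score: float, noun1: str) -> float:
    """Multiply the collocation value of 'noun1 of noun2' if noun1 is a quantifier noun."""
    return t_score * BOOST if noun1 in QUANTIFIER_NOUNS else t_score


# Hypothetical (noun1, noun2, T-score) triples; the values are not corpus results.
pairs = [
    ("slice", "bread", 2.5),
    ("king", "England", 6.0),
    ("pint", "milk", 3.0),
]

ranked = sorted(
    ((n1, n2, boosted_pair_score(t, n1)) for n1, n2, t in pairs),
    key=lambda item: item[2],
    reverse=True,
)
# ranked[0] is ("pint", "milk", 12.0): the boost lifts it above the unboosted pair.
```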
Property 3: In an initial approach, we had also tried to use co-occurrence with some, but this proved significantly less successful than any of the other measures.
We noticed that rare words are hardly ever non-count and therefore introduced raw frequency of the noun as an additional feature. A linear combination of all our five features obtains a precision rate of 80% for the top 100 and 67% for the top 400 candidates extracted from the BNC. This list was then manually post-edited to remove all count nouns, which yielded a final list of 266 validated potential non-count nouns.
In a next step, we linearly combine our four successful T-score features (some+noun (T.some), zero article (T.zero), prequalifier (T.of-PP), and singular (T.sing)), expecting that this should yield more promising results than the use of individual features. The top 20 candidates are listed in Table 5.
|T.Combo 4 T-score||Noun||T.of-PP||T.some||OE.some||T.zero||T.sing||f(noun.SING)||Manual|
The precision for the combined retrieval approach is as follows: 90% for the top 50 candidates, 80% for the top 100, 77% for the top 200, 67% for the top 400, and 56.8% for the top 500. The fact that precision in the lower range of the list is still > 50%, only tailing off slowly, indicates that the list of non-count nouns is open. 
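A minimal Python sketch of the combined ranking is given below. It assumes an unweighted sum of the four T-score features, which is one reading of “linearly combine”; whether the features were rescaled before summing is not stated. All feature values are invented for illustration.

```python
# Feature names follow the column labels used for the combined ranking.
FEATURES = ("T.of-PP", "T.some", "T.zero", "T.sing")


def combo_score(row: dict) -> float:
    """Equal-weight linear combination of the four T-score features (assumed)."""
    return sum(row[name] for name in FEATURES)


# Hypothetical feature values, not taken from the BNC.
candidates = {
    "information": {"T.of-PP": 10.0, "T.some": 4.0, "T.zero": 30.0, "T.sing": 40.0},
    "water": {"T.of-PP": 20.0, "T.some": 5.0, "T.zero": 25.0, "T.sing": 30.0},
    "table": {"T.of-PP": 1.0, "T.some": 0.5, "T.zero": -3.0, "T.sing": 2.0},
}

# Rank nouns by descending combined score.
ranked_nouns = sorted(candidates, key=lambda n: combo_score(candidates[n]), reverse=True)
```

Under equal weighting, a feature with a large numeric range dominates the ranking. Weights estimated from the data would address this.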
In a next step, we tested the performance of features, both individually and in various combinations, using logistic regression to predict the count/non-count distinction of the first 500 items (see Table 6). Regression uses optimal feature weights instead of equal weight for each feature (as e.g. Naïve Bayes does). The weights are also learnt from the data.
|> mass_aov <- aov(Bin.Dec ~ T.sing + T.zero + T.some + T.of PP + N count,|
|data = mass_tscore, family = binomial)|
|> summary(mass_aov)|
|Df||Sum Sq||Mean Sq||F value||Pr(>F)|
|Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1|
|> mass_confmat = table(round(predict(mass_aov)), mass_tscore$Bin.Dec);|
|0 132 48|
|1 84 236|
|> precision = (mass_confmat[1,1] + mass_confmat[2,2]) / sum(mass_confmat);|
|> precision;|
With the exception of some+noun, all collocation measures proved significant, and the ranking order is zero-article > singular > quantifier > noun frequency. Due to the optimized weighting of features in combination (and due to noun frequency as a significant factor), precision goes up from 56.8% (linear model) to 73.6% (regression model). After removing insignificant some+noun, precision increases further to 73.8% and if all interactions are included, precision is 76.0%. 
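The figure of 73.6% can be retraced from the confusion matrix in Table 6. The short sketch below is written in Python rather than the R of Table 6, with illustrative variable names. It assumes that rows are the rounded model predictions and columns the manual verdicts, following the argument order of the table() call.

```python
# Confusion matrix from Table 6.
# Rows: rounded prediction (0, 1); columns: manual verdict Bin.Dec (0, 1).
confmat = [
    [132, 48],
    [84, 236],
]

total = sum(sum(row) for row in confmat)          # 500 items
correct = confmat[0][0] + confmat[1][1]           # 132 + 236 = 368
proportion_correct = correct / total              # 368 / 500 = 0.736

print(f"{correct} of {total} items on the diagonal: {proportion_correct:.1%}")
```

The value computed in Table 6 is thus the share of all 500 items that fall on the diagonal, i.e. the items for which prediction and manual verdict coincide.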
Compared to Baldwin and Bond (2003), our semi-automatic method is a compromise: it uses fewer, linguistically motivated features, pairs them with limited manual intervention to reach high precision, and relies on collocation statistics instead of Bayesian statistics. We are thus able to evaluate the importance of individual features.
We scrutinized the list of 266 lexemes derived in our bottom-up approach from the BNC before using it to retrieve pluralized non-counts from the ICE corpora. As pointed out in 2.1, some lexemes are polysemous; credits (short for credit points in university contexts or other institutional contexts), crickets, controls, redundancies, respects and supports were exclusively used in the grammaticalized count sense of the word and thus excluded from the list. Nouns that are always used in their plural form (such as mathematics and species) were also excluded as they are never instances of extended pluralization. We also excluded nouns that regularly pluralize, such as action, disagreement, scrap, space, similarity, time, truth as well as irregular life and leaf. Finally, because of frequent tagging errors (inflected verb form rather than pluralized noun), we removed works and helps from our list of lexemes. The resulting list consisted of 241 potential non-counts, of which 154 types were attested in our ICE data, including most of the nouns used in Mohr’s (2016) top-down approach (see also 4.2 for a more detailed comparison).
Figure 2 gives the normalized frequencies for the 154 potential pluralized non-count types found in ICE. While the ENL varieties all yield low frequencies of these nouns, there is no clear categorical distinction into ENL and ESL but rather a gradient. The differences across all varieties are highly significant (chi-square contingency table test using raw counts of Figure 2, i.e. 9 varieties x candidates vs. word count, p = 4.2E-30 at df = 8), which means that these noisy data already deliver a reliable trend without human intervention, but the pair-wise difference between the borderline varieties (NZE and SingE) is not (same chi-square contingency test, 2 varieties x candidates vs. word count, p = 0.10 at df = 2). Note, however, that Figure 2 still contains instances of coerced count or type-noun readings and regularly pluralized forms of polysemous nouns. Coercion (differentiating between classes) and polysemy (technical terms as plural forms) are very frequent, particularly in scientific genres. Breaking down the results by mode reveals that the candidates are twice as frequent in writing as in speech. Literary styles (monologue and printed) have higher frequencies than colloquial styles (dialogue and non-printed).
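For readers unfamiliar with the test, the following Python sketch shows how a chi-square statistic and its degrees of freedom are obtained from a varieties × 2 contingency table. The counts are invented toy values, not the raw counts behind Figure 2, and the second column is assumed to hold the remaining (non-candidate) tokens. The p-value lookup is omitted, since it requires the chi-square distribution function.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected cell count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Toy table: three hypothetical varieties, candidate tokens vs. other tokens.
toy_counts = [
    [10, 90],
    [20, 80],
    [30, 70],
]
stat, df = chi_square(toy_counts)  # stat = 12.5, df = 2
```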
In a next step we broke down the totals by lexemes (sorted by decreasing frequency) in order to single out the quantitatively most salient contributors. Tables 7 and 8 give an overview of the most frequent types in ENL and ESL varieties, respectively.
The highlighted plurals in Table 8 are lexemes that are missing from the most frequently attested ENL data. Interestingly, a number of these are shared across the ESL varieties. Importantly for our methodology, our retrieval missed very few data that are included in Mohr’s (2016) top-down approach, namely cattles (3 instances in ICE-IND) and offsprings (9 instances, 1 from ICE-IND, 2 from ICE-PHI and 3 each from ICE-JAM and ICE-SIN). Of the two potential mass nouns from her list that regularly pluralize and that were not included in our list (stones and strings), there are only four instances of stones in the ICE corpora (one each in ICE-HK and ICE-JAM and two in ICE-IND), which are not count nouns, such as (9):
Note that Table 8 shows the Zipfian distribution typical of lexical data. The top three types cover at least a third of all tokens in all varieties in Table 8. In addition, the top types reveal a certain regional bias likely to be due to local geographic and climatic conditions such as torrential rains and coastal waters. As rains and waters are given type-of readings in these contexts, qualitative analysis is necessary to distinguish between regular and non-standard extension of pluralization of non-counts.
We annotated the resulting concordance (2,312 tokens) for (a) false positives, (b) regular plurals (both polysemous and non-polysemous), (c) coerced type-noun and plural uses and (d) extended (non-standard) plurals. Examples of false positives are genitive today’s without the apostrophe or nouns in fixed phrases which are always pluralized, as in all intents and purposes (see (10)). Also excluded as false positives were tagging errors (e.g. (11)) and two instances of object language use of non-standard pluralized non-counts (see (12)). 
Examples of regular plurals are given in (13)–(16).  Coerced count and type-noun readings are illustrated in (17)–(19), and genuine extended non-standard uses in (20)–(22).
Table 9 reports the distribution of regular plurals, coerced and extended uses in our manually annotated ICE data. In distinguishing these different contexts for pluralisation of non-counts, we go significantly beyond previous research (notably the only previous bottom-up study by Schmidtke and Kuperman 2017). The figures reveal that the majority of the plurals turn out to be regular uses. This is due to the large number of polysemous nouns. The frequency of extended non-counts per 10,000 words is given in Figure 3. The pair-wise differences between each ENL and each ESL variety are now also highly significant. The proportion of extended uses (against coerced non-counts) across WE is shown in Figure 4.
|regular plural||coerced non-count||extended non-count||total|
The results confirm that extended uses of plural non-counts are more frequent in ESL/ESD than in ENL varieties. Interestingly, our analysis shows that HKE and SingE – the varieties with similar substrates – yield similar relative frequencies but not comparable proportions of extended pluralized non-counts, indicating that substrate influence is unlikely to be the sole explanatory factor for extension of plural morphology to non-counts. IndE, PhilE and SingE are similar in their use of extended plurals, as are the ENL varieties.
Extended uses come from a total of 58 types (i.e. more than twice as many as the number of types included in Mohr’s exclusively top-down study). Examples of extended plurals that have not previously been reported in research on ESL varieties are for instance attentions, appreciations, bloods, funs, fundings and nonsenses. Crucially, these extended uses are not limited to (spontaneous) spoken contexts (e.g. (25)) but are also regularly attested from formal written material, as in (23) and (27).
Interestingly, non-standard plurals in ESL have even made their way into idioms, here the light verb take care:
Contrary to popular belief, extended uses of non-count pluralization are also attested in ENL corpora:
Classification of individual instances may be problematic: the plural in (31), for instance, could be said to be an instance of a coerced count use licensed by the pronoun each in the immediate context rather than a pluralized non-count of fruit. Similarly, staffs in (32) could be argued to refer to the three bodies of employees from the different schools, i.e. also be an instance of a contextually motivated pluralization rather than a non-standard use. Distinguishing between extended and polysemous uses is also often difficult. Example (33) might, at first sight, appear to be a non-standard, extended use of informations in an ENL variety:
According to OED, the noun has a (“rare”) sense that is a count-noun, namely “a fact or circumstance of which a person is told; a piece of news or intelligence”. Interestingly, the last attested example (1959) in OED, like the ICE-GB occurrence, is also from a scientific context:
Against this evidence, it is difficult to argue that the use in (33) is really non-standard. Similarly, a noun that at first might appear as a good example of a prototypical non-count noun is actually one that has both non-count and count senses recorded in the OED (s.v. research, n. 2a. and 2b.), with the count uses attested regularly in the nineteenth century but probably less frequent in the twentieth (but see (35)). It is therefore difficult to confidently analyse instances of plural researches as extended uses. While (36) and (37) are plausible as a continuation of the older (lexicalized) count noun in that researches refer to individual studies, (38) is more likely to refer to the general need for research and was therefore analysed as an extended use of the non-count noun. The analysis is a matter of interpretation, however, and (39), which we took to refer to the author’s studies, could also have been interpreted as a non-standard plural.
In order to gauge the consistency of the annotation and effects of inter-annotator disagreement, we had a subset of 811 variable instances coded by a second annotator, a native speaker. The instances had been randomized so as to reduce the possible impact of priming, and information on the variety had been removed. It turned out that inter-annotator agreement was quite low, at 61%. The following examples may illustrate why this is the case: in each instance, only one annotator had given a (standard) type-of interpretation to fruits, either because the context was interpreted differently or there was not enough context to decide:
Similarly, a type-of reading and a general (count-noun) reading were given to medicines in the following example, resulting in divergent codings:
The larger context may help to decide whether an individual instance can be given a coerced type-noun interpretation or a non-standard extension reading. Out of context, (44) might appear to be a non-standard extension of a non-count, but a look at the larger context opens the possibility of a type-noun reading as the recipe for Busha Browne’s Hearty Red Pea Soup combines soup meat with bacon or a salted pig’s tail, i.e. different kinds of meat.
Elsewhere, the native speaker drew on orthography to decide whether a certain pluralized non-count was a standard use, claiming that “foodstuffs as one word is standard, food stuffs as two is not,” which makes the two-word spelling an extended non-count use. For (46), one annotator suggested coercion to types of fast food, whereas the other argued that coercion of an element in a list was less likely than analogy with the other nouns in the conjoined NP (resulting in a non-standard plural).
Interestingly, the number of plurals from ENL varieties that were coded as non-standard by the native speaker was considerably larger (40:13) than the number we had coded as non-standard in the same data set (including, among others, examples (42) and (51)). Overall, the native speaker annotator rated more plurals as non-standard, as Figure 5 shows. Annotator 1 is one of the authors, Annotator 2 a native speaker.
The fact that inter-annotator agreement is quite low is a result of the gradient nature of countability in English, but also of the role that context plays in the interpretation of examples. Some divergent ratings might well have to be attributed to the fact that we cross-validated all our ratings against OED dictionary entries and erred on the side of caution whenever a coerced reading might have been possible, while the native speaker was less conservative. For individual nouns, moreover, native speaker evaluations vary, as is the case with feedback(s). Importantly for our initial hypotheses, the two ratings confirm a general divide into ENL vs. ESL/ESD varieties as a tendency, with less of a difference between BrE and SingE in the native-speaker ratings than might be expected.
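The agreement figure discussed in this section is a raw percentage. As a minimal sketch of how such a figure, and a chance-corrected counterpart, can be computed from two parallel codings, the following Python snippet may be helpful. Cohen's kappa is not reported in this study and is added here purely for illustration; the function names and the toy labels are hypothetical and do not reproduce the study's data.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items that both annotators labelled identically."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected if both annotators assigned labels independently
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy codings (assumption: two categories only)
annotator_1 = ["coerced", "coerced", "extended", "coerced", "extended", "coerced"]
annotator_2 = ["coerced", "extended", "extended", "coerced", "coerced", "coerced"]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"raw agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

Raw agreement tends to look more favourable than kappa when one category dominates, as coerced readings do in data of this kind, which is one reason why the two measures are often reported side by side.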
Our bottom-up, theory-informed approach to deriving potential non-count nouns from corpora on distributional grounds works well: non-count nouns make up some 90% of the top 50 candidates and up to 76% of the top 500. Moreover, by combining a bottom-up corpus-based approach with fine-grained qualitative analyses we can provide a more nuanced view of pluralization of non-counts across ENL and ESL. On a cognitive and usage-based level, we have learnt that the count/non-count distinction can largely be learnt from word distributions (Klein and Manning 2001; Mintz et al. 2014). Importantly, our method misses very few potentially non-standard pluralized non-counts (Section 4.2). Instead, it allows us to report instances of less prototypical non-standard pluralization (such as attentions, appreciations, bloods, funs, fundings and nonsenses) that remained under the radar in previous top-down studies (Section 4.3). Despite noisy data, there are clear quantitative differences between ENL and ESL varieties, the latter having a higher tendency for non-count noun pluralization. The overall picture is more complicated, though, and borders between variety types are not clear-cut, as ESL varieties show quantitative differences and do not form a coherent group. It is in this context that the different approaches to normalization have real consequences for the interpretation of the data: it is only by looking at the proportion of extended pluralization (against coerced pluralization) that we see a difference between HKE and SingE emerging which goes beyond a simple explanation in terms of “substrate influence”. Qualitative analyses are thus crucial if one wants to move beyond simple frequency-based explanations.
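The point about normalization can be illustrated with a small sketch that contrasts two ways of normalizing the same counts: frequency per million words and the proportion of extended plurals among all pluralized non-count tokens. The class, the function names and all figures below are hypothetical and chosen only to show how the two measures can rank varieties differently; they are not the counts obtained for HKE, SingE or any other ICE component.

```python
from dataclasses import dataclass


@dataclass
class VarietyCounts:
    corpus_words: int  # total number of words in the corpus component
    coerced: int       # plural tokens coded as coerced (standard) uses
    extended: int      # plural tokens coded as extended (non-standard) uses


def per_million(count, corpus_words):
    """Frequency normalized to one million words of running text."""
    return count / corpus_words * 1_000_000


def extended_share(c):
    """Proportion of extended plurals among all pluralized non-count tokens."""
    total = c.coerced + c.extended
    return c.extended / total if total else 0.0


# Hypothetical counts for two invented varieties
counts = {
    "Variety A": VarietyCounts(1_000_000, coerced=300, extended=100),
    "Variety B": VarietyCounts(1_000_000, coerced=60, extended=60),
}

for name, c in counts.items():
    pmw = per_million(c.extended, c.corpus_words)
    share = extended_share(c)
    print(f"{name}: {pmw:.1f} extended plurals pmw, {share:.0%} of plural tokens")
```

With these toy figures, Variety A has the higher per-million rate of extended plurals, whereas Variety B has the higher share of extended plurals among its pluralized non-counts, so the choice of denominator determines which variety appears to use the feature more.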
Our subsequent qualitative analyses reveal that the majority of pluralized non-counts in both ENL and ESL varieties are coerced type-noun instances. In other words, simply retrieving non-counts from corpora is not enough to argue for an extended (non-standard) use of the category. The analyses of our ICE data further show that, contrary to the widespread assumption that non-standard pluralization of nouns like furniture or information is exclusively found in second-language varieties, it is also occasionally attested in native speaker varieties. We attribute this to the gradient rather than categorical distinction between count and non-count nouns. This, alongside the limited availability of contextual information, also proved a challenge for the qualitative analysis and became evident in the inter-annotator disagreement. While studies hinging on this criterion need to be assessed critically, low inter-annotator agreement crucially did not impact the observation of a general divide between ENL and ESL. Moreover, our findings confirm Denison’s (1998: 98) observation (based on a 1993 Linguist List discussion) that sporadic use of non-count nouns as count nouns (e.g. homeworks and surgeries) is possible in BrE and AmE. However, whether this is a recent trend led by AmE (among the ENL varieties) needs further empirical support.
Finally, the fact that we find non-standard extension of pluralization to non-counts in ENL varieties is of relevance for theories of origin as well. Most previous research explains extended pluralization of non-counts as arising from processes of second-language acquisition and contact-induced hypercorrection. Such a view rests on the assumption that the (British English) varieties that served as inputs throughout the currently English-speaking world did not have the feature, an assumption against which we provide counter-evidence. Instead, our findings suggest that the feature was present in the superstrate. In other words, it is not enough to resort to substrate influence or L2-acquisition processes as an explanation for apparently “nativized” pluralized non-counts such as researches. Data from the Old Bailey court proceedings, in particular, show that these “extended” uses of plural non-counts are also regularly attested in earlier stages of BrE and are thus likely to have been part of the input varieties that helped form the ESL varieties. Future research should ideally be able to draw on historical ESL corpora to verify the continuity of this feature across time. Including singular instances of non-count nouns in future research will allow us to model predictor variables for the use of regular and extended pluralized non-counts. It would also be useful to compare the ESL data to EFL corpus data to further confirm Hall et al.’s (2013) finding that non-institutionalized varieties pattern more closely with ENL in this area of grammar.
Corpora
BNC = British National Corpus, via Dependency Bank; see Lehmann and Schneider 2012.
Secondary sources
Allan, Keith. 1980. Nouns and countability. Language 56(3). 541–567.
Alo, Moses A. & Rajend Mesthrie. 2004. Nigerian English: Morphology and syntax. In Bernd Kortmann, Edgar Schneider, Kate Burridge, Rajend Mesthrie & Clive Upton (eds.), A handbook of varieties of English, vol. 2. 813–827. Berlin: de Gruyter.
Baldwin, Timothy & Francis Bond. 2003. Learning the countability of English nouns from corpus data. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics – Volume 1 (ACL ’03), 463–470. Stroudsburg, PA, USA: Association for Computational Linguistics.
Cruse, D. Alan. 1999. Number and number systems. In Keith Brown & Jim Miller (eds.), Concise encyclopedia of grammatical categories, 267–271. Oxford: Pergamon.
Davies, Mark & Robert Fuchs. 2014. Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-based English Corpus (GloWbE). English World-Wide 36(1). 1–28.
Denison, David. 1998. Syntax. In Suzanne Romaine (ed.), The Cambridge history of the English language, vol. IV: 1776–1997, 92–329. Cambridge: Cambridge University Press.
Deshors, Sandra, Sandra Götz & Samantha Laporte. 2016. Linguistic innovations in EFL and ESL: Rethinking the linguistic creativity of non-native English speakers. International Journal of Learner Corpus Research 2(2). 131–150.
Evert, Stefan. 2009. Corpora and collocations. In Anke Lüdeling & Merja Kytö (eds.), Corpus linguistics. An international handbook, 1212–1248. Berlin: de Gruyter.
Grimm, Scott & Beth Levin. 2011. Between count and mass: Furniture and other functional collectives. Stanford University. http://web.stanford.edu/~bclevin/lsa11talk.pdf (accessed 19 January 2015).
Grimm, Scott & Beth Levin. 2012. Who has more furniture? An exploration of the bases for comparison. Universitat Pompeu Fabra and Stanford University. http://web.stanford.edu/~bclevin/paris12mcslides.pdf (accessed 19 January 2015).
Hall, Christopher J., Daniel Schmidtke & Jamie Vickers. 2013. Countability in world Englishes. World Englishes 32(1). 1–22.
Harris, Zellig. 1954. Distributional structure. In J. A. Fodor & J. J. Katz (eds.), The structure of language, 33–49. Englewood Cliffs, N.J.: Prentice-Hall.
Hundt, Marianne. 2015. World Englishes. In Douglas Biber & Randi Reppen (eds.), The Cambridge handbook of English corpus linguistics, 381–400. Cambridge: Cambridge University Press.
Jackendoff, Ray. 1991. Parts and boundaries. Cognition 62(2). 169–200.
Joosten, Frank. 2003. Accounts of the count-mass distinction: A critical survey. Nordlyd 31(1). 216–229.
Kachru, Braj B. 1985. Standards, codification and sociolinguistic realism: The English language in the outer circle. In Randolph Quirk & H. G. Widdowson (eds.), English in the world: Teaching and learning the language and literatures, 11–30. Cambridge: Cambridge University Press.
Klein, Dan & Christopher Manning. 2001. Distributional phrase structure induction. Proceedings of the 2001 Workshop on Computational Natural Language Learning 7: 14:1–14:8.
Kortmann, Bernd & Kerstin Lunkenheimer (eds.). 2011. The electronic world atlas of varieties of English. Leipzig: Max Planck Institute for Evolutionary Anthropology. http://www.ewave-atlas.org/ (accessed 12 January 2018).
Kortmann, Bernd & Kerstin Lunkenheimer (eds.). 2012. The Mouton world atlas of variation in English. Berlin: de Gruyter.
Krifka, Manfred. 1999. Mass expressions. In Keith Brown & Jim Miller (eds.), Concise encyclopedia of grammatical categories, 221–223. Oxford: Pergamon.
Lehmann, Hans Martin & Gerold Schneider. 2012. BNC Dependency Bank 1.0. In Signe Oksefjell Ebeling, Jarle Ebeling & Hilde Hasselgård (eds.), Studies in variation, contacts and change in English, Volume 12: Aspects of corpus linguistics: compilation, annotation, analysis. Helsinki: Varieng. http://www.helsinki.fi/varieng/journal/volumes/12/ (accessed 01 January 2018).
Mair, Christian. 2017. Crisis of the outer circle? – Globalisation, the weak nation state, and the need for new taxonomies in World Englishes research. In Markku Filppula, Anna Mauranen, Juhani Klemola & Svetlana Vetchinnikova (eds.), Changing English: Global and local perspectives, 5–24. Berlin: Mouton de Gruyter.
Meriläinen, Lea & Heli Paulasto. 2017. Embedded inversion as an angloversal: Evidence from inner, outer and expanding circle Englishes. In Markku Filppula, Juhani Klemola & Devyani Sharma (eds.), The Oxford handbook of world Englishes, 676–696. Oxford and New York: Oxford University Press.
Mesthrie, Rajend. 2012. Black South African English. In Kortmann & Lunkenheimer (eds.), 493–500. Berlin: Mouton de Gruyter.
Mesthrie, Rajend. 2017. World Englishes, second language acquisition, and language contact. In Markku Filppula, Juhani Klemola & Devyani Sharma (eds.), The Oxford handbook of world Englishes, 175–193. Oxford: Oxford University Press.
Mesthrie, Rajend & Rakesh M. Bhatt. 2008. World Englishes: The study of new linguistic varieties. Cambridge: Cambridge University Press.
Mintz, Toben H., Felix Hao Wang & Vivian Jia Li. 2014. Word categorization from distributional information: Frames confer more than the sum of their (Bigram) parts. Cognitive Psychology 75C. 1–27.
Mohr, Susanne. 2016. From Accra to Nairobi – The use of pluralized mass nouns in East and West African postcolonial Englishes. In Daniel Schmidt-Brücken, Susanne Schuster & Marina Wienberg (eds.), Aspects of (Post)Colonial Linguistics, 157–187. Berlin: de Gruyter.
Mukherjee, Joybrato & Marianne Hundt (eds.). 2011. Exploring second-language varieties of English and learner Englishes: Bridging the paradigm gap. Amsterdam/Philadelphia: John Benjamins.
Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech & Jan Svartvik. 1985. A comprehensive grammar of the English language. London: Longman.
Sand, Andrea. 2012. Jamaican English. In Kortmann & Lunkenheimer (eds.), 210–221. Berlin: Mouton de Gruyter.
Schmidtke, Daniel & Victor Kuperman. 2017. Mass counts in World Englishes: A corpus-linguistic study of noun countability in non-native varieties of English. Corpus Linguistics and Linguistic Theory 13(1). 135–164.
Schmied, Josef. 2008. East African English (Kenya, Uganda, Tanzania): Morphology and syntax. In Rajend R. Mesthrie (ed.), Varieties of English. Africa, South and Southeast Asia, 451–471. Berlin: de Gruyter.
Schmied, Josef. 2012. Tanzanian English. In Kortmann & Lunkenheimer (eds.), 454–463. Berlin: Mouton de Gruyter.
Schneider, Edgar W. 2015. Models of English in the world. In Markku Filppula, Juhani Klemola & Devyani Sharma (eds.), The Oxford handbook of world Englishes, 35–57. Oxford: Oxford University Press.
Schneider, Gerold. 2008. Hybrid Long-Distance Functional Dependency Parsing. University of Zurich PhD Thesis.
Schneider, Gerold & Gaëtanelle Gilquin. 2016. Detecting innovations in a parsed corpus of learner English. International Journal of Learner Corpus Research 2(2). 177–204.
Sharma, Devyani. 2012. Indian English. In Kortmann & Lunkenheimer (eds.), 523–530. Berlin: Mouton de Gruyter.
Taiwo, Rotimi. 2012. Nigerian English. In Kortmann & Lunkenheimer (eds.), 410–416. Berlin: Mouton de Gruyter.
Tomasello, Michael. 2000. The item-based nature of children’s early syntactic development. Trends in Cognitive Sciences 4. 156–163.
Vine, Bernadette. 1999. Guide to the New Zealand component of the international corpus of English. School of Linguistics and Applied Language Studies, Victoria University of Wellington.
Wong, May L.-Y. 2012. Hong Kong English. In Kortmann & Lunkenheimer (eds.), 548–561. Berlin: Mouton de Gruyter.
Ziegeler, Debra. 2010. Count-mass coercion, and the perspective of time and variation. Constructions and Frames 2(1). 33–73.
The online version of this article offers supplementary material (https://doi.org/10.1515/cllt-2018-0068).
© 2020 Walter de Gruyter GmbH, Berlin/Boston