Douglas Biber, Jesse Egbert and Daniel Keller

Reconceptualizing register in a continuous situational space

De Gruyter | Published online: February 13, 2020

Abstract

Corpus-based methods for the quantitative linguistic description of registers are well established. In contrast, situational analyses of registers have been based on qualitative descriptions of categorical situational characteristics. In the present study, we address this inconsistency by describing the variation among texts and registers in a continuous (quantitative) situational space. We describe “registers” as categorical constructs – culturally recognized categories of texts – but propose that they should be described in continuous terms. Such descriptions allow quantitative comparisons of registers, as well as analysis of the extent to which a register is well-delimited in terms of its situational characteristics.

Applying this analytical framework, we also explore a deeper issue: the possibility that some texts are not instantiations of any culturally-recognized register category. Both issues are tackled through analysis of a corpus of web documents. We first identify quantitative situational dimensions of variation, employing the methods of multi-dimensional (MD) analysis. We then describe how the situational characteristics of texts and registers can be analyzed in a continuous MD space. And finally, we propose analysis of situational text types – categories that are statistically well-defined in their situational characteristics – as an approach to describing all texts, including texts that do not belong to a culturally recognized register category.

1 Introduction

Numerous studies over the last three decades have employed corpus-based analyses for text-linguistic descriptions of register variation. The text-linguistic approach uses quantitative methods to describe the linguistic characteristics of each text, as the basis for comparing the patterns of register variation across texts (see Biber et al. 2016; cf. Biber 2012, Biber 2019).

Register in the text-linguistic approach is studied from a quantitative, comparative perspective. There are three major characteristics of the text-linguistic register framework:

  1. the research goal of describing registers for both situational characteristics and lexico-grammatical characteristics;

  2. the claim that the situational characteristics of registers have a systematic functional relationship to their lexico-grammatical characteristics; and

  3. the claim that those lexico-grammatical characteristics can be described in a continuous quantitative space of variation.

To some extent, these defining characteristics were anticipated by earlier researchers in text linguistic and ethnographic research frameworks. Hymes, Halliday/Hasan, and De Beaugrande/Dressler all note the importance of describing text categories with respect to both situational and linguistic characteristics; for example:

“The linguistic features which are typically associated with a configuration of situational features […] constitute a REGISTER.” (Halliday and Hasan 1976: 22)

[Text types are] “classes of texts expected to have certain traits for certain purposes.” (De Beaugrande and Dressler 1981: 182)

These researchers also recognized the importance of communicative function as the underlying explanation of situational-linguistic correlations. In fact, it could be argued that Hymes was more interested in the study of communicative function than linguistic form; for example:

“analysis of use [is] prior to analysis of code [taking into account the] gamut of stylistic or social functions” (Hymes 1974: 79)

“for sociolinguistic research, then, what is essential is […] to take functional questions, questions of social meaning and role, as starting point” (Hymes 1972: 6)

None of these researchers employed quantitative methods to study patterns of textual variation. However, De Beaugrande and Dressler mention the possibility, in the context of emphasizing the need for functional interpretation:

“We might count the proportions of nouns, verbs, etc. [… but] we need to know how and why these traits evolve. Statistical linguistic analysis of this kind ignores the functions of texts in communication and the pursuit of human goals. Presumably, those factors must be correlated with the linguistic proportions […].” (De Beaugrande and Dressler 1981: 183)

In a sense, the text-linguistic approach to register variation could be regarded as a framework designed to meet this challenge (see Biber and Conrad 2009, especially Chapters 1–3). [1] The central theoretical foundation of this approach concerns the relationship among the three components of situation, function, and linguistic forms, illustrated in Figure 1 (see Biber and Conrad 2009: 6–10; cf. Egbert and Biber 2016).

Figure 1: Visual representation of the three-way relationship among situation, function, and linguistic form in the text-linguistic register framework.

Similar to speech event analysis, a text-linguistic register analysis begins with analysis of the situational characteristics of the register, including consideration of participant identities, relations among participants, channel, production circumstances, setting, and communicative purposes (see Biber and Conrad 2009: 39–46). In practice, the situational description of a register is based on four kinds of information: the analyst’s own personal experiences and observations, interviews with experts who regularly produce texts from the register, previous descriptions of the register, and direct consideration of individual texts from the register (see Biber and Conrad 2009: 37–39).

In both the ethnographic and the text-linguistic frameworks, registers (or speech events) are named, culturally recognized categories of texts. In many cases, there are overt external indicators in the context that signal the register category. For example, a lecture occurs in an auditorium with a podium, with the speaker standing at the front of the room. A sermon occurs in a church, with the speaker standing behind a pulpit, dressed in special clothes. The speaker of a political speech will normally be standing on a stage, with a microphone, against a backdrop of banners with political slogans. In addition, such spoken registers are usually overtly announced as a “lecture”, “sermon”, or “political rally”.

External indicators of register are probably even more prevalent for published written texts. For example, a magazine is printed on relatively large pieces of glossy paper, which are stapled or glued together, contains lots of photos, and usually has a glossy cover printed on thicker paper showing some kind of picture or illustration. Newspapers are printed on much larger newsprint paper, with no separate cover, and the pieces of paper simply folded together. Issues of academic journals are printed on smaller glossy paper, usually thicker than magazines, with a table of contents on the rear cover. In addition, many printed texts explicitly self-identify the register as a “newspaper”, “magazine”, “novel”, “biography”, etc. Distinctions like these have been used to organize libraries, and written corpora, for decades.

In addition to these external indicators, the ethnographic and the text-linguistic frameworks are based on the claim that the texts within a register are usually produced in similar situations, and therefore they will share other situational characteristics relating to interactiveness, personal involvement, production circumstances, and the relations among participants. Communicative purpose has also been treated as part of the situational context of a register. However, we show in our analyses below that there is often considerable variability in the communicative purposes of texts within a register.

The ethnographic and text-linguistic frameworks are also similar in that both regard the situational characteristics of a register to be more basic than its linguistic characteristics; for example:

“analysis of use [is] prior to analysis of code” (Hymes 1974: 79)

“the situational characteristics of registers are more basic than the linguistic features” (Biber and Conrad 2009: 33)

The corollary of this assertion is the claim that the linguistic characteristics of a register are for the most part derivative from the situational characteristics. That is, linguistic features and linguistic variation are treated as directly functional, meaning that their use is directly influenced by the situational context and communicative purposes of texts. For this reason, situational descriptions of a register precede linguistic descriptions, and the patterns of linguistic variation across registers are interpreted relative to the situational differences among the registers.

The methods for the linguistic description of registers are well established in the text-linguistic tradition, employing quantitative corpus-based analyses (see, e.g. the survey of register studies in Barbieri and Wizner 2019). That is, rates of occurrence for linguistic features are analyzed in each text. Registers can be compared for these rates of occurrence (the extent to which they use a linguistic feature), and they can also be analyzed for their internal variation (the extent to which there are linguistic differences among the texts within a register).

In contrast, the situational analysis of a register is usually based on qualitative descriptions, applying an analytical framework that consists of categorical situational characteristics (see, e.g. Biber and Conrad 2009, Table 2.1, p. 40). Some previous treatments have recognized the possibility that a register can combine multiple communicative purposes (e.g. being both narrative and persuasive; see, e.g. Biber and Conrad 2009: 45). However, only two previous studies have attempted to treat situational descriptions as continuous (quantitative) parameters: Biber (1994: 47–50) and Sharoff (2018). Those two studies had somewhat different motivations: to “allow an empirical investigation of the relations among linguistic and situational parameters” in the case of Biber (1994: 47), and to automatically classify the documents in large Web corpora into register categories in the case of Sharoff (2018).

In the present study, we further explore the possibility of describing variation among texts and registers in a continuous (quantitative) situational space, parallel to the ways in which texts and registers are analyzed in a continuous (quantitative) linguistic space. Specifically, we claim that “registers” are top-down categorical constructs – culturally recognized categories of texts – but that those categories should be described situationally and linguistically in continuous terms. Such quantitative descriptions would allow us to document the extent to which a register is clearly delimited in terms of its situational and linguistic characteristics (i.e. the extent to which there are differences among the texts within a register), as well as enabling quantitative comparisons of the extent to which any two registers are similar or different. Previous research has adopted this analytical perspective for the linguistic description of a register; the present study explores how this perspective can be applied to the situational description of a register.

In addition, the present study explores a deeper issue that has been largely disregarded in previous corpus-based research: the possibility that some texts are not instantiations of any culturally recognized category. Corpus research has been somewhat circular in this regard, identifying the register categories that are culturally recognized on an a priori basis, then building a corpus to represent those categories, and finally describing the patterns of register variation in the language based on analysis of that corpus. However, this approach overlooks the possibility of texts that do not belong to a recognized register category, and therefore disregards the patterns of linguistic and situational variation associated with such texts. This is the second major issue that we attempt to address in the present paper, based on analysis of situational text types: bottom-up text categories that are defined in terms of their shared situational characteristics, regardless of the top-down register category of the texts (see Section 6 below).

We tackle these issues through detailed analysis of a corpus of web documents. We begin by analyzing texts with respect to continuous (quantitative) situational parameters. We then explore additional methodological and theoretical issues that follow from the treatment of situational characteristics as continuous parameters. We first investigate the possibility of identifying underlying situational dimensions of variation, employing the methods of multi-dimensional (MD) analysis. We then describe how those dimensions can be used to describe the situational characteristics of texts and registers in a continuous MD space. Such an analysis allows the researcher to describe the extent to which a register is well-delimited situationally: that is, in some registers the texts are all quite similar in their MD situational characteristics, while the texts in other registers can vary widely in their MD situational characteristics.

Finally, those differences in the situational coherence of registers lead us to explore an alternative approach to dividing up a discourse domain, utilizing the quantitative situational dimensions to statistically identify groupings of texts that are well-defined in their situational characteristics. We refer to these bottom-up groupings of texts as situational text types. In our conclusion, we discuss the utility of continuous analyses of situational variation for descriptions of register variation in any discourse domain. In addition, we triangulate the register versus situational text type approaches to textual categorization, showing how the two complement one another. Thus, taken together, the two approaches permit situational descriptions of discourse that are more complete and nuanced than would be possible with either approach taken on its own.

2 Background: Identifying the register category of web documents

The magnitude of the problems associated with register identification and classification became apparent to the authors during a recent study of register variation on the web (see Egbert et al. 2015; Biber and Egbert 2018). [2] That study was based on a large near-random sample of documents from across the entire spectrum of the searchable web (CORE: the Corpus of Online Registers of English – see Section 3 below). The goal of the study was to compare the linguistic characteristics of web registers. However, it quickly became apparent that many web documents have few or no external indicators of register category, making it difficult to classify the texts into culturally recognized register categories. This was the case for both official/institutional documents as well as personal documents (see the extended discussion of this problem in Biber and Egbert 2018, especially Chapters 1–3 and Chapter 7).

In fact, it turns out that it is problematic to even come up with a taxonomy of possible register categories that exist on the web. Researchers are directly confronted with these problems when dealing with web documents, because web search engines are not organized in terms of pre-determined register categories. However, these same problems exist in the domain of some print-media texts (see fuller discussion in the conclusion).

In the web registers project, we addressed this problem by employing a hierarchical decision tree and recruiting end-users of the web to code the situational characteristics of each document (see Egbert et al. 2015). Raters coded each document for communicative purposes associated with general register categories (e.g. narration, opinion, informational description), leading eventually to identification of a specific sub-register category (e.g. news report or travel blog within the general narrative register; see Biber and Egbert 2018: 28–37).

Overall, coders were able to agree on the general register category of c. 70% of the documents in CORE (i.e. with at least three of the four coders agreeing; see Table 3.3, Biber and Egbert 2018). For more detailed analyses of sub-registers, we included all documents where at least 2 of the 4 coders agreed on a specific register category, which accounted for c. 80% of the documents in CORE. This approach allowed us to proceed with detailed linguistic descriptions of each sub-register (Biber and Egbert 2018, Chapters 5–8).
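The two agreement thresholds described above (at least three of four coders for general registers, at least two of four for specific sub-registers) can be illustrated with a minimal sketch. The function name, the document identifiers, and the register codes below are hypothetical and are not taken from the CORE coding data.

```python
from collections import Counter

def agreed_label(codes, min_agree):
    """Return the most frequent label if at least `min_agree` coders chose it, else None.

    Note: a 2-2 split returns whichever label was seen first; the source does
    not say how such ties were resolved, so this behaviour is an assumption.
    """
    label, count = Counter(codes).most_common(1)[0]
    return label if count >= min_agree else None

# Hypothetical register codes assigned by four coders to three documents.
docs = {
    "doc_a": ["narrative", "narrative", "narrative", "opinion"],
    "doc_b": ["narrative", "narrative", "opinion", "how-to"],
    "doc_c": ["narrative", "opinion", "how-to", "discussion"],
}

# Stricter threshold: at least 3 of 4 coders agree.
majority = {doc: agreed_label(codes, min_agree=3) for doc, codes in docs.items()}
# More lenient threshold: at least 2 of 4 coders agree.
lenient = {doc: agreed_label(codes, min_agree=2) for doc, codes in docs.items()}
```

In this toy example, doc_b fails the three-of-four criterion but is retained under the two-of-four criterion, while doc_c is excluded under both.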

However, even with our decision-tree approach and 10 rounds of pilot testing, there was considerable variability in the final register categorization of documents in CORE. Thus, at the general register level, coders were not able to achieve majority agreement (i.e. at least 3 of the 4 raters) for c. 30% of the documents in CORE (see Table 3.3, Biber and Egbert 2018). And there were additional uncertainties involved for identifying sub-register categories, even when raters agreed on the general register category. Thus, taken together, there was considerable uncertainty in the (sub-)register categorization of almost half (48%) of the documents in CORE (see Tables 3.3–3.7 in Biber and Egbert 2018). This finding is the primary factor leading to the present investigation.

In hindsight – based on the detailed situational and linguistic descriptions presented in our previous publications – it is clear that web documents are often hybrids, and that web registers are not homogeneous constructs. That is, it is clear that many web documents combine multiple communicative purposes to differing extents. Thus, we became interested in the possibility of describing texts and registers in a continuous situational space.

The approach that we explore in the present study follows exactly the opposite steps from traditional analyses. That is, in most previous corpus studies, register categories are identified at the outset on an a priori basis. Texts that do not belong to one of those predetermined register categories do not end up in the corpus, and thus are not included in the subsequent analyses. In contrast, the approach developed here begins with a comprehensive near-random sample of texts from across the entire discourse domain of web documents (Section 3). Each of those documents was then coded by hand for a set of continuous/quantitative situational variables (Section 4). Next, we applied the techniques of MD analysis to identify underlying situational dimensions of variation (Section 5). Those dimensions were used to compare the situational characteristics of pre-determined register categories, and to analyze the extent of situational variability among the texts within each register. Those quantitative dimensions can also be used to cluster texts into groups of documents that share the same situational characteristics; those groupings can be interpreted as “situational text types” (Section 6). The text type perspective makes it possible to return to the characterizations of individual texts, based on the cluster that the text is grouped into, and the extent to which the text exhibits the defining situational characteristics of that cluster. Finally, it is possible to compare these descriptions to those from the traditional register perspective, resulting in a much more thorough description of this discourse domain than that achieved through either approach on its own (Sections 7 and 8).

3 Corpora used for the study

The analysis here is based on a sub-corpus of web documents taken from the general component of the Corpus of Global Web-based English (GloWbE; see http://corpus2.byu.edu/glowbe/). GloWbE (containing c. 1.9 billion words in 1.8 million web documents) was designed to represent the full range of English documents encountered in typical web searches. The web documents included in GloWbE were selected from the results of Google searches of highly frequent English 3-grams (i.e. the most common 3-grams occurring in COCA; e.g. is not the, and from the). N-grams were used in order to minimize the bias from the preferences built into Google searches. For each n-gram query, 800–1000 links were saved (i.e. 80–100 Google results pages).

GloWbE is not designed for the study of register variation, because the register category of most documents in the corpus is not identified. Thus, we randomly extracted a sub-corpus of documents from GloWbE that could be studied in much more detail for their register characteristics. That sub-corpus, referred to as CORE (Corpus of Online Registers of English), consists of 48,571 web documents and nearly 54 million words. Each document in CORE was coded for its situational characteristics and register category by four independent raters. That version of CORE, including the coded register characteristics, has been integrated into the suite of Brigham Young University (BYU) corpus tools made available online by Mark Davies (http://corpus.byu.edu/core/).

As described in Section 2 above, the results of that project challenged our preconceptions about the construct of “register”, leading us to consider complementary perspectives. The present investigation is an initial attempt to explore one of those perspectives, investigating the possibility that registers can be defined in terms of continuous situational parameters. However, such an analysis required a much more detailed situational analysis of each document, and so the present study is based on Baby-CORE, a random sample of 902 texts taken from the CORE corpus combined with a sample of 100 Wikipedia articles. (We deliberately included a large number of Wikipedia articles in this sample to allow us to measure the amount of variability across ratings of a register that we deem to be extremely well-defined situationally when compared with the other top-down registers in the corpus.) Our goal in creating this sub-corpus was to represent the range of register diversity found on the web, while at the same time compiling a corpus that was small enough to permit detailed situational analysis of each individual document.

The composition of Baby-CORE, broken down by the register categories used in our coding of CORE, is shown in Table 1. At least two of the four coders for the web-registers project agreed on the register category of these documents. However, at the same time, there is considerable uncertainty concerning the register category of many documents. That is, as discussed in Section 2 above, coders were unable to achieve majority agreement on the sub-register category for almost 50% of the documents in CORE. The analyses presented in the following sections help explain the underlying reasons for this high level of uncertainty in the classification of particular documents.

Table 1: Composition of Baby-CORE, broken down by pre-determined register category.

Register                           Number of texts
News                               227
Opinion blog                       121
Encyclopedia articles              101
Other information                  62
Interactive discussion             57
Sports report                      54
Personal blogs                     45
Reviews                            44
How-to                             29
Informational blog                 28
Description-with-intent-to-sell    23
Advice                             17
Religious blog/sermon              17
Descriptions of a person           12
Song lyrics                        11
Research articles                  10
Travel blogs                       9
Historical articles                8
Interviews                         7
FAQs                               6
Short stories                      6
TV transcripts/other spoken        3
Recipes                            2
Unclassified                       103
TOTAL                              1002

4 Analyzing the situational characteristics of individual texts

Each document in Baby-CORE was coded by two independent raters (Ph.D. students in applied linguistics) for 23 different situational parameters, listed in Table 2. Raters viewed the full web page associated with each document, and they coded register characteristics based on whatever they noticed in the web page (in addition to consideration of the text itself).

Table 2: Situational parameters coded in the project.

The text is:
  a spoken transcript.
  lyrical or artistic.
  pre-planned and edited.
  interactive.

The author/speaker:
  is an expert.
  focuses on himself/herself.
  assumes technical background knowledge.
  assumes cultural/social knowledge.
  assumes personal knowledge about himself/herself.

The purpose of this text is to:
  narrate past events.
  explain information.
  describe a person, place, thing or idea.
  persuade the reader.
  entertain the reader.
  sell a product or service.
  give advice or recommendations.
  provide “how-to” instructions.
  express opinion.

The basis of information is:
  common knowledge.
  direct quotes.
  factual/scientific evidence.
  opinion.
  personal experience.

    Note: All parameters were coded on 1–6 scales: “disagree completely” to “agree completely”.

    Survey link: https://drive.google.com/open?id=1Jbe8JsOewnOECq596OAWtxFKW67Q3iehzxr93P7Cr7k.

A coding scheme was developed to represent the range of situational characteristics identified in previous frameworks, such as Biber (1994) and Biber and Conrad (2009, Chapter 2). However, those earlier frameworks were modified to achieve two important methodological innovations:

  1. Each value of a situational parameter found in a traditional framework is represented here as a separate parameter. This innovation is most obvious for the coding of communicative purpose. That is, rather than choosing between narrative/explanatory/descriptive/persuasive/opinionated purposes, the coder here is asked to evaluate the extent to which the text accomplishes each (and potentially all) of those different purposes.

  2. Each situational parameter is coded as a quantitative (ordinal) variable on a 1–6 scale, to permit register descriptions in a continuous space of situational variation.

In addition, our choice of parameters was informed by the analyses of web registers carried out previously for the web registers project. However, we make no claim that this list includes only situational parameters that are clearly distinguished from one another, or that the framework is organized in the best possible way. Our goal here was to try to capture the full set of situational considerations, and we believed that redundancy was a lesser problem than omitting a potentially important consideration. In addition, we did not develop a coding rubric (with operational definitions of the parameters) and we provided no training or socialization on the coding process. Rather, raters were simply instructed to code each document based on their own perceptions of the attributes described in the parameters. Even without a tightly controlled coding process, raters achieved fair-to-good reliability rates for most parameters, with a mean Cohen’s kappa of .46 and a mean correlation coefficient of .52.
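Agreement statistics of the kind reported above can be illustrated with a minimal sketch of unweighted Cohen's kappa for two raters. The source does not state whether a weighted variant was used for the 1–6 ordinal scales, so the unweighted form, the function name, and the ratings below are assumptions for illustration only.

```python
from collections import Counter

def cohens_kappa(rater_1, rater_2):
    """Unweighted Cohen's kappa for two equally long sequences of ratings."""
    n = len(rater_1)
    # Observed agreement: proportion of texts given the identical score.
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over every score value.
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    # Undefined (division by zero) if both raters always use one identical score.
    return (observed - expected) / (1 - expected)

# Hypothetical scores from two raters for one parameter across four texts.
scores_rater_1 = [1, 1, 2, 2]
scores_rater_2 = [1, 1, 2, 1]
kappa = cohens_kappa(scores_rater_1, scores_rater_2)
```

Here observed agreement is .75 and chance agreement is .50, so kappa works out to .50.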

For the subsequent statistical analyses, we computed a single score for each situational parameter coded for each text, based on the average of the ratings from the two coders. For example, if Rater 1 coded the parameter “the purpose of this text is to narrate past events” with a score of 4, and Rater 2 gave that text a score of 5 for the narrative parameter, we used a score of 4.5 for the narrative parameter in subsequent analyses. Thus, following this step, we had a matrix with 1,002 rows (i.e. the texts) and 23 columns (an average score for each of the 23 situational parameters). Data of this type is very similar to the kind of input that has been previously analyzed using MD analysis, except that traditional MD analyses are based on linguistic variables. In the following section, we briefly introduce that approach and discuss its application to the present study.
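The averaging step can be sketched as follows. The text identifiers and ratings are hypothetical, and only three of the 23 parameters are shown for brevity; the first value for text_1 reproduces the worked example above (ratings of 4 and 5 giving 4.5).

```python
def average_ratings(rater_1, rater_2):
    """Average two raters' scores, parameter by parameter, for each text."""
    return {
        text_id: [(a + b) / 2 for a, b in zip(rater_1[text_id], rater_2[text_id])]
        for text_id in rater_1
    }

# Hypothetical ratings on the 1-6 scale: two texts, three parameters each.
rater_1 = {"text_1": [4, 2, 6], "text_2": [1, 5, 3]}
rater_2 = {"text_1": [5, 2, 5], "text_2": [2, 5, 4]}

# One row per text, one averaged score per parameter.
matrix = average_ratings(rater_1, rater_2)
```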

5 Analysis of underlying situational dimensions of variation

The MD analytical approach was developed in the 1980s to analyze and compare the linguistic characteristics of spoken and written registers, with respect to multiple linguistic parameters: the “dimensions”. The first step in the analysis is to compute rates of occurrence for linguistic features in each text. Then, factor analysis is used to analyze the correlational patterns among linguistic features, identifying the linguistic “dimensions”. Thus, each dimension is defined by a set of co-occurring lexico-grammatical features; these dimensions are identified empirically employing corpus-linguistic and multivariate statistical analytical techniques (see Biber 1988). Importantly for the discussion here, the analysis results in a taxonomic framework defined by a continuous space of linguistic variation. First of all, each dimension is a continuous parameter of variation (rather than a dichotomy), so individual registers are compared for the extent to which they employ the co-occurring linguistic features that define the dimension. Second, registers are compared with respect to multiple linguistic dimensions, so that two registers can be quite similar with respect to one dimension but dramatically different with respect to a second dimension. And finally, any given register can be more-or-less well defined with respect to the dimensions: for some registers, there is little variation among texts within the category, but there can be extensive linguistic variation among texts within some other registers. With these characteristics, MD analyses have provided empirically grounded linguistic taxonomies of register variation for multiple discourse domains in English and for numerous languages (see, e.g. the survey of MD studies in Biber 2014).

In the present study, we modify the methods of MD analysis to identify underlying dimensions of situational variation. While the overall goals are similar to those of previous MD studies, the specific methods are somewhat different, focusing on the co-occurrence patterns among situational variables rather than linguistic features.

The principal axis factoring method was used to statistically analyze the data in R (fa function in the psych package) (Revelle 2018). The 2-factor solution was chosen as the most informative, based on consideration of the scree plot and the interpretability of the factors. Taken together, the two factors account for 36% of the shared variance in the data. The factors were then rotated using a Promax rotation.
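Factor analysis operates on the correlations among the variables, so a useful way to picture its input is the matrix of Pearson correlations among the averaged situational ratings. The sketch below computes only that matrix; it does not perform the factoring or the Promax rotation, which the study carried out in R. The function names and the ratings are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def correlation_matrix(rows):
    """rows: one list of parameter scores per text. Returns a square matrix."""
    columns = list(zip(*rows))  # transpose: one tuple of scores per parameter
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

# Hypothetical averaged ratings: five texts by three situational parameters.
rows = [
    [5.5, 5.0, 1.5],
    [4.5, 4.0, 2.5],
    [2.0, 2.5, 5.0],
    [1.5, 1.0, 5.5],
    [3.5, 3.0, 3.0],
]
corr = correlation_matrix(rows)
```

In this toy data the first two parameters rise and fall together while the third runs in the opposite direction, which is the kind of co-occurrence pattern that a factor would summarize.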

Each factor was then interpreted as a situational dimension of variation by considering the communicative functions that are shared by the situational variables that have large loadings on the factor. Table 3 lists the important situational variables associated with each of the two dimensions (those with |loading| > .30). The positive pole of Dimension 1 identifies discourse focused on the author, including description of their personal experiences and opinions. The discourse is often intended to be persuasive, entertaining, and to some extent interactive. The opposing pole is discourse that is carefully planned and edited, presenting technical information supported by evidence. The author is usually not discussing their own experiences or feelings, but they do claim to have expertise in some other area of knowledge. The second dimension identifies an opposition between narrative communicative purposes (which assume shared cultural background knowledge and often entertain the reader) versus a range of other communicative purposes, including explaining information, offering advice, or providing how-to/procedural instructions.

Table 3:

Summary of the important situational characteristics loading on Dimensions 1 and 2.

Dimension 1: “Personal opinionated discourse versus technical information supported with evidence”
Situational characteristics with positive loadings
The text is: interactive (.39)
The author: focuses on self (.66), assumes personal knowledge about self (.46)
The purpose: persuade the reader (.51), entertain the reader (.46), give advice or recommendations (.50), express opinion (.81)
The basis of information: common knowledge (.37), opinion (.84), personal experience (.76)
Situational characteristics with negative loadings
The text is: pre-planned and edited (−.62)
The author: is an expert (−.61), assumes technical background knowledge (−.36)
The purpose: explain information (−.57)
The basis of information: factual scientific evidence (−.75)
Dimension 2: “Narrative, entertaining discourse versus other communicative purposes (explanatory, advice, or procedural discourse)”
Situational characteristics with positive loadings
The text is: a spoken transcript (.36), lyrical or artistic (.35)
The author: assumes cultural social knowledge (.47)
The purpose: narrate past events (.61), entertain the reader (.48)
The basis of information: direct quotes (.45)
Situational characteristics with negative loadings
The purpose: explain information (−.39), give advice or recommendations (−.69), provide how-to instructions (−.66)
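
The selection of salient variables summarized in Table 3 can be illustrated with a minimal Python sketch. It uses the Dimension 2 loadings reported above together with the > |.30| cutoff. The dictionary layout, the function name split_by_pole, and the single weak loading are illustrative assumptions; the original analysis was carried out in R and may have been organized differently.

```python
# Dimension 2 loadings as reported in Table 3.
DIM2_LOADINGS = {
    "spoken transcript": 0.36,
    "lyrical or artistic": 0.35,
    "assumes cultural social knowledge": 0.47,
    "narrate past events": 0.61,
    "entertain the reader": 0.48,
    "direct quotes": 0.45,
    "explain information": -0.39,
    "give advice or recommendations": -0.69,
    "provide how-to instructions": -0.66,
    # Hypothetical weak loading, added only to show the cutoff at work.
    "hypothetical weak variable": 0.12,
}


def split_by_pole(loadings, cutoff=0.30):
    """Return (positive, negative) variable lists with |loading| > cutoff.

    Each list is ordered from the strongest to the weakest loading.
    """
    salient = {k: v for k, v in loadings.items() if abs(v) > cutoff}
    positive = sorted((k for k, v in salient.items() if v > 0),
                      key=lambda k: -salient[k])
    negative = sorted((k for k, v in salient.items() if v < 0),
                      key=lambda k: salient[k])
    return positive, negative


positive_pole, negative_pole = split_by_pole(DIM2_LOADINGS)
```

Under these assumptions, the strongest positive variable is the narration of past events and the strongest negative variable is the giving of advice or recommendations, which matches the label assigned to Dimension 2.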

Similar to MD analyses based on linguistic variables, the next step in the analysis was to compute situational dimension scores for each text. For example, Text #1091 (a news article – see the excerpt in Text 1 below) had a Dimension 1 score of −.673, indicating that the text was carefully written and edited, that it presented information with supporting evidence, and that it was not written from a personal perspective and did not overtly present the opinions of the author. That same text had a Dimension 2 score of +.924, indicating that the author assumed social background knowledge and had narrative and entertaining communicative purposes supported by direct quotes.

Text 1: News Report

MR ZIP is the latest Britain’s Got Talent star to get a record deal from Simon Cowell.

And he’s roped in fellow eccentric reality star Martyn Crofts to help out in the music video for the quirky track, which got the nation singing along.

Martyn was also a hit on the series, making it to the semi-finals with his act – singing like a dalek with a saucepan on his head – which saw him dubbed “dalek-table”. After Mr Zip’s first audition, Amanda Holden said of his catchy ditty: “It’ll go straight to No1.”

Simon admitted: “At first I thought you were annoying, then you got more annoying, then I liked you.”

Since the final in May, memorably won by talented dancing dog Pudsey and his owner Ashleigh Butler with their “mission impaw-sible” antics, several of the singing contestants have been awarded recording contracts.

<http://www.thesun.co.uk/sol/homepage/showbiz/tv/4416717/Mr-Zip-is-the-latest-Britains-Got-Talent-star-to-get-a-record-deal-from-Simon-Cowell.html>

These dimension scores can be used to compare the situational characteristics of the registers identified in our earlier research. For example, Figure 2 plots the dimension scores for the documents in Baby-CORE that had been classified in the original categorical coding study as Encyclopedia articles, Lyrical texts, Personal blogs, Discussion forums, and How-to/Instructional documents. There are two major patterns to notice from this plot: the extent to which a register is distinguished from other registers, and the extent to which there is variation among the texts within a register. At one extreme, Encyclopedia articles are mostly quite similar to one another and sharply distinguished from these other registers, with large negative Dimension 1 scores (“Technical information supported with evidence”) combined with unmarked Dimension 2 scores. The other four registers show considerable internal variation but they are still relatively distinguishable from one another. For example, Lyrical texts have relatively large positive scores on Dimension 1 (“Personal opinionated discourse”) coupled with large positive scores on Dimension 2 (“Narrative, entertaining discourse”). Personal blogs overlap with Lyrical texts to some extent, but generally have somewhat larger positive scores on Dimension 1 (i.e. personal and opinionated to an even greater extent), and somewhat less extreme positive scores for Dimension 2 (i.e. less narrative and entertaining in purpose).

Figure 2: Registers that are relatively well-defined with respect to the situational dimensions.

Figure 3 plots the MD profiles of texts from four additional registers, illustrating quite different patterns: None of these four registers are sharply distinguished from one another, and all four registers show a large spread of internal situational variation. News reports and Opinion blogs are both extreme in these respects, each overlapping with all three of the other registers, and each having documents in all four quadrants of Figure 3.

Figure 3: Registers that are not well-defined with respect to the situational dimensions.

In summary, the present section has documented ways in which registers can be analyzed for the extent to which they are situationally well-defined, in addition to analysis of the extent to which they are distinguished from other registers in terms of their situational characteristics. In particular, we have shown that pre-determined register categories are often not sharply delimited and distinguishable in terms of their situational characteristics. Such findings may help to account for the difficulty that coders experienced trying to agree on the register categories of web documents (see Section 2 above). That is, Figures 2 and 3 plot the distribution of web documents in a continuous space of situational variation, showing that web documents occupy most of that space. Most registers can be described as having prototypical characteristics in that continuous space. For example, Figure 2 shows that personal blogs prototypically have opinionated purposes (Dimension 1) and narrative orientations (Dimension 2). Figure 3 shows that even a register like News Reports has prototypical characteristics (with an informational focus on Dimension 1 and narrative orientation on Dimension 2). The majority of documents in this register share those situational characteristics. At the same time, though, many other documents in these registers have situational characteristics that are strikingly different from those prototypical characteristics. The MD approach employed here allows us to not only note the existence of such variation within registers, but to also precisely describe the situational characteristics of individual texts in comparison to the register norms. In addition, this continuous perspective leads to the characterization of text categories like opinion blogs or interactive discussions as registers that have no clear prototypical norms at all (see Figure 3).
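
The notion of a register being more or less well-delimited in the continuous space can also be given a rough numerical form. The following Python sketch is one possible operationalization, not the procedure used in the present study: it measures the average Euclidean distance of a register's texts from the register centroid. The function names and both sets of scores are hypothetical and are not drawn from Baby-CORE.

```python
import math


def centroid(points):
    """Mean (Dimension 1, Dimension 2) position of a set of texts."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def mean_distance_to_centroid(points):
    """Average Euclidean distance of texts from their register centroid.

    Smaller values suggest a more tightly delimited register.
    """
    cx, cy = centroid(points)
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)


# Hypothetical dimension scores (NOT taken from Baby-CORE).
tight_register = [(-1.0, 0.0), (-1.1, 0.1), (-0.9, -0.1), (-1.0, 0.1)]
diffuse_register = [(-1.0, 1.0), (1.0, -1.0), (0.8, 0.9), (-0.7, -0.8)]

tight_spread = mean_distance_to_centroid(tight_register)
diffuse_spread = mean_distance_to_centroid(diffuse_register)
```

A register whose texts cluster tightly, as Encyclopedia articles do in Figure 2, would receive a small value on such a measure, while a register with documents in all four quadrants would receive a large one.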

Such findings highlight the importance of describing the continuous range of situational variation found among the texts within a culturally recognized register category. That is, although registers are text categories recognized by members of a culture, they are not necessarily coherent in their situational characteristics. Opinion blogs and discussion forums are two examples of this type from the domain of the web (see further discussion in Section 8 below).

At the same time, such findings lead to the possibility of alternative analytical approaches. In the following section, we turn to one of these approaches: situational text type analysis. As we show below, this approach complements traditional register analysis in two respects: (1) it identifies text categories that are relatively well-defined situationally, and (2) it provides a methodological approach that permits the identification and situational description of previously disregarded texts that do not belong to any culturally recognized register category.

6 Bottom-up identification of situational categories: Situational text types

Similar to the patterns described in the previous section, all previous MD studies have documented the existence of extensive linguistic variation among the texts within a register. For example, Chapter 8 in Biber (1988) takes up the topic of linguistic variation within register categories, showing how a register can be characterized for the extent to which it is linguistically well-delimited (or not), in addition to the comparison of typical linguistic characteristics across registers. Such patterns led to the development of an alternative analytical perspective, referred to as text type analysis.

In previous research, a text type was defined as a linguistically well-defined grouping of texts (see Biber 1989/2013). That is, the texts grouped into a text type are all similar linguistically, regardless of any pre-determined register categories. In the present section, we apply this same methodological approach to identify situationally well-defined groupings of texts: situational text types.

Cluster analysis is the statistical technique used to identify these text types. There are numerous different statistical approaches that can be used to inductively group observations into categories. For our purposes here, the choice of specific clustering approach is not important. Rather, our goal is to illustrate the utility of the “text type” perspective for the study of textual variation. Any of the specific clustering approaches would work well for that goal.

The analysis begins by treating each individual document as an observation, with no consideration of the pre-determined register categories of those documents. The cluster analysis uses the two situational dimension scores to group documents into new categories, such that the documents within a grouping are maximally similar in their situational characteristics, while the clusters themselves are maximally distinguished; we refer to these new groupings of documents as “situational text types”.

The specific statistical approach that we used is hierarchical clustering (using the agnes function (method = ward) in the cluster library). Figure 4 presents a hierarchical clustering tree (a “dendrogram”) showing the situational relations among all texts in Baby-CORE. Texts that are listed next to each other in the tree are maximally similar in their MD situational characteristics. The y-axis plots the situational distance between clusters (computed from scores on the two situational dimensions); that is, the y-axis position of the node connecting two branches measures the dissimilarity between the groupings of texts at that level, so that branches joined lower in the tree are more similar to one another.
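
The logic of agglomerative clustering with Ward's criterion can be sketched in a few lines of Python. This is a deliberately naive illustration for small data sets and is not the agnes implementation used in the study: at each step it merges the two clusters whose union produces the smallest increase in within-cluster sum of squares. The function name ward_clusters and the six toy score pairs are assumptions made for the example.

```python
def ward_clusters(points, k):
    """Naively merge clusters until k remain, using Ward's criterion.

    points: list of (Dimension 1, Dimension 2) scores, one per text.
    Returns a list of k clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def centroid(members):
        n = len(members)
        return (sum(points[i][0] for i in members) / n,
                sum(points[i][1] for i in members) / n)

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        (ax, ay), (bx, by) = centroid(a), centroid(b)
        sq_dist = (ax - bx) ** 2 + (ay - by) ** 2
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > k:
        # Find the pair of clusters whose merger is cheapest.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: merge_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters


# Hypothetical dimension scores for six texts (NOT from Baby-CORE).
toy_scores = [(1.2, 1.0), (1.0, 1.3), (-0.9, 0.8),
              (-1.1, 1.0), (0.1, -1.2), (-0.1, -0.9)]

three_types = ward_clusters(toy_scores, 3)
two_types = ward_clusters(toy_scores, 2)
```

Cutting the same hierarchy at different values of k corresponds to reading the dendrogram at different heights, which is how the 2-cluster through 5-cluster solutions discussed below relate to one another.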

Figure 4: Hierarchical tree structure of the cluster analysis.

Table 4 summarizes the number of texts grouped into each cluster as well as the 2-dimensional characterization of each cluster, for the 2-cluster, 3-cluster, 4-cluster, and 5-cluster solutions. Figure 5 provides a scatter plot for the 3-cluster solution in the two-dimensional space represented by the situational dimensions. Figure 6 provides the same information for the 5-cluster solution.

Figure 5: MD situational characteristics of text types in the 3-cluster solution.

Figure 6: MD situational characteristics of text types in the 5-cluster solution.

Table 4: Summary of the hierarchical cluster analysis, showing the number of texts and the situational characteristics for each cluster.

Cluster 1 (N = 640 texts): D1 + (“Personal opinion”); D2 (unmarked)
Cluster 2 (N = 362 texts): D1 – (“Technical information”); D2++ (“Narrative”)
Cluster 1.1 (N = 344): D1 (unmarked); D2 – (“Advice/procedural”)
Cluster 1.2 (N = 296): D1++ (“Personal opinion”); D2++ (“Narrative”)
Cluster 1.1.1 (N = 246): D1 (unmarked); D2 – (“Advice/procedural”)
Cluster 1.1.2 (N = 98): D1++ (“Personal opinion”); D2 – (“Advice/procedural”)
Cluster 2.1 (N = 142 texts): D1 – (“Technical information”); D2++ (“Narrative”)
Cluster 2.2 (N = 220 texts): D1 – (“Technical information”); D2 + (“Narrative”)

Although the text type perspective provides a way to categorize texts into groups that are better-defined situationally, it is important to note that these are not truly discrete categories, and there is no single “correct” grouping of texts. That is, Figures 5 and 6 show that web documents occupy the full space of situational variation. The groupings identified by the cluster analysis provide an attempt to impose order on that continuous distribution of texts, but they should not be regarded as the “true” categories. Interpretation of multiple levels of grouping in the hierarchical cluster analysis is one way of recognizing that the analysis is simply a heuristic.

At the top level (the 2-cluster solution), the cluster analysis splits all documents in the corpus into two major categories. (Note: In Figure 5, Cluster 1 in the 2-cluster solution includes all texts included in either Cluster 1.1 or Cluster 1.2.) Cluster 2 in the 2-cluster solution is the better delimited one: it includes 362 texts that generally have large negative Dimension 1 scores (“technical information supported by evidence”) and large positive Dimension 2 scores (“narrative”). Cluster 1 in this solution is larger (with 640 texts; see Table 4) but much more diverse in its situational characterization.

The 3-cluster solution splits the more diverse Cluster 1 into two sub-clusters: Clusters 1.1 and 1.2. Figure 5 plots the situational characteristics of all three clusters. Cluster 2 remains the same as described above (362 texts, with large negative Dimension 1 scores and large positive Dimension 2 scores). Cluster 1.1 is mostly composed of texts with large negative Dimension 2 scores (non-narrative communicative purposes), regardless of their Dimension 1 scores. The texts in Cluster 1.2 are similar to Cluster 2 in that they generally have narrative and entertaining communicative purposes (positive scores on Dimension 2). But Cluster 1.2 is the opposite of Cluster 2 in its Dimension 1 characterization, with most texts having large positive scores (“personal opinion”).

The 4-cluster solution splits Cluster 1.1 from the 3-cluster solution (see Table 4). These clusters are shown as Clusters 1.1.1 and 1.1.2 on Figure 6. Of these new clusters, Cluster 1.1.2 is better delimited, including only 98 texts with very distinctive situational characteristics: marked for the expression of personal opinion (large positive Dimension 1 scores) combined with advice or procedural purposes (large negative Dimension 2 scores). In contrast, Cluster 1.1.1 is more general, including 246 texts and marked only for a moderately negative score on Dimension 2.

Finally, the 5-cluster solution splits Cluster 2 into two relatively tight sub-clusters (see Clusters 2.1 and 2.2 in Figure 6). These sub-clusters differ in the extent to which they are marked for the two situational dimensions: Cluster 2.1 has a large positive Dimension 2 score (“narrative”) but only a moderately negative Dimension 1 score (“technical information”), while Cluster 2.2 has only a moderately positive Dimension 2 score (“narrative”) but a large negative Dimension 1 score (“technical information”).

As can be seen from Figure 6, the clusters in the 5-cluster solution vary in the extent to which they are well-defined with respect to their situational characteristics. Cluster 2.1 is the best delimited, while Clusters 2.2 and 1.1.2 are also relatively well delimited. In contrast, Cluster 1.2 includes texts having a wide range of different situational characteristics. And at the far extreme, Cluster 1.1.1 includes texts in all four quadrants of Figure 6, indicating that it is the least well-defined of the five clusters in terms of its situational characteristics.

A visual inspection of the data points on Figures 5 and 6 shows that some texts are central exemplars of their cluster, while other texts are more peripheral. Thus, one of the advantages of this approach is the ability to describe situational variation in a continuous space: on the one hand, each text is classified into a text type category, but at the same time, individual texts can be described for the extent to which they represent the situational characteristics of that category. For example, compare the situational characteristics of Texts 2, 3, and 4, which were all grouped in Cluster 1.2. Text 2 is a core exemplar of Cluster 1.2 (Dimension 1 score = .17; Dimension 2 score = .24), with moderately positive scores for both Dimensions 1 and 2. In this case, the characterization reflects a hybrid kind of text, with some paragraphs providing a straightforward reportage of past events, and other paragraphs switching to the expression of personal opinion. Text 3 is a peripheral exemplar of Cluster 1.2 (Dimension 1 score = 1.5; Dimension 2 score = −.18), with an extremely large positive score for Dimension 1 (personal opinion) but a small negative score for Dimension 2. Those scores reflect the consistent focus of this text on the expression of personal opinions, with little objective reportage of past events. Finally, Text 4 illustrates a different kind of peripheral exemplar for Cluster 1.2 (Dimension 1 score = .53; Dimension 2 score = 1.56), with the opposite skewing from Text 3: a small positive score for Dimension 1 but an extremely large positive score for Dimension 2, reflecting the primary focus on narration and entertainment, supported with direct quotes.
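
The contrast between the two kinds of peripheral exemplar can be made concrete with the scores reported above. The Python sketch below computes how far, and in which direction, Texts 3 and 4 lie from Text 2. Using Text 2 as the reference point is an assumption made for illustration only, because the centroid of Cluster 1.2 is not reported here; the function name displacement is likewise hypothetical.

```python
import math

# Dimension scores reported in the text for three members of Cluster 1.2.
TEXTS = {
    "Text 2": (0.17, 0.24),   # described as a core exemplar
    "Text 3": (1.50, -0.18),  # peripheral: extreme on Dimension 1
    "Text 4": (0.53, 1.56),   # peripheral: extreme on Dimension 2
}


def displacement(text, reference):
    """Return (d1_offset, d2_offset, euclidean_distance) from reference."""
    d1 = text[0] - reference[0]
    d2 = text[1] - reference[1]
    return d1, d2, math.hypot(d1, d2)


# Assumption: Text 2 stands in for the (unreported) cluster centre.
reference = TEXTS["Text 2"]
offsets = {name: displacement(scores, reference)
           for name, scores in TEXTS.items() if name != "Text 2"}
```

On this reckoning both peripheral texts lie roughly 1.4 units from Text 2, but Text 3 is displaced almost entirely along Dimension 1 and Text 4 almost entirely along Dimension 2, which is the opposite skewing described above.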

Text 2. Core exemplar of Cluster 1.2: Moderately positive characterizations for Dimension 1 (personal opinion) and Dimension 2 (narrative)

It’s not a guarantee that action on the bill is delayed, but HR 5781, the NASA authorization bill, does not appear on the House floor schedule for Friday as distributed by the office of the House Majority Leader. Several bills are up for consideration under suspension of the rules, some of which were postponed from yesterday, but the NASA bill is not among them. Schedules, as always, are subject to change.

Meanwhile, in a meeting with the editorial board of Florida Today, SpaceX CEO Elon Musk elaborated on his comments in the call-to-action email the company sent out yesterday morning. “It seems like just a basic rule of thumb – maybe you want to spend as much on the American team as you do on the Russians,” Musk told the paper, noting that the bill authorizes several times as much money for buying seats on Russian Soyuz spacecraft as it does for commercial crew development. “It just seems like a crazy time to be doing that sort of thing.”

There’s nothing ‘crazy’ about buying seats on a proven, reliable, dependable spacecraft with decades of operational experience, Master Musk. If you had one flying now you’d know that. Instead, all you do is talk. And it’s not very ‘commercial’ of you to chat up private enterprise being the ‘future of space exploration’ on television programmes and in the press as a way to get ‘ordinary people’ out into space then go soliciting government subsidies for commercial space ventures at the same time. Quite hypocritical. You went to Wharton, remember? Private sector capital markets are the place to tap for investment in these ventures and assume the high risks, absorb the losses or reap any reward. Nothing is stopping commercial space from soaring-except the very limited market their free enterprised, profit driven projects want to service. Which is why governments do it.

I agree with Elon Musk. Why hand the cash to Vladimir Putin and his fellow gangsters a big check? So they terrorize their neighbors? That said, Elon Musk is way down the list of those who can competently provide a manned capability. Build an American Soyuz – Ares I/Orion. That was the original plan and is still the best.

<http://www.spacepolitics.com/2010/07/30/hr-5781-schedule-and-supporting-the-home-team/>

Text 3. Peripheral exemplar of Cluster 1.2: Extreme positive characterization for Dimension 1 (personal opinion) coupled with small negative characterization for Dimension 2 (narrative)

There are some jazz fans who come over all queasy when jazz singers are mentioned. They seem to feel vocal jazz is somehow too emotional. To have someone singing to you rather than playing a piano or the drums or even a trumpet, is too personal, too... “oooh, I don’t know where to look”.

They might also feel that jazz singing is somehow not quite serious; that it is more a cabaret thing or easy listening or some such genre beneath their jazz seriousness.

Am I making these people up? I don’t think I am. Ask any jazz promoter. There are men – and they are mainly men, and probably of a certain age – who will come along regular as clockwork to hear an instrumental group, and the tougher the music the more they like it. But put on a jazz singer and they would rather have their pint of real ale down the pub.

Which is a pity, because they are missing out on so much great music and so many great musicians.

The thing I like a lot about jazz singers is that they aren’t bullshitters. They can’t be. There really is nowhere for them to hide. And they can’t be all head and no heart, either. When your instrument is your body you have to fully engage with the music. Any weakness of technique and you can’t blame your tools – there are no tools.

<http://thejazzbreakfast.com/2012/10/09/the-jazz-musicians-with-no-place-to-hide/>

Text 4. Peripheral exemplar of Cluster 1.2: Small positive characterization for Dimension 1 (personal opinion) coupled with extreme positive characterization for Dimension 2 (narrative; entertaining; quotes)

SING IT: Fierce clubman Dave Mundy (left) heads the Wagga City Rugby Mens Choir with John Ferguson in a performance at Kooringal High School last year. Mundy has penned an adaptation of a classic Banjo Patterson poem that has fuelled the rivalry between Wagga City and Waratahs before Saturday’s SIRU semi-final.

SUMMED up in a verse, the simmering tension between Waratahs and Wagga City has been stoked to full blaze before the Southern Inland Rugby Union semi-final on Saturday.

The most unlikely of culprits, a passion-fuelled adaptation of Banjo Patterson’s classic The Geebung Polo Club has put Tahs offside with one of their greatest SIRU rivals.

Written by former president Dave Mundy, the poem was trotted out at the traditional post-game presentations as the Boiled Lollies celebrated a second consecutive victory over Tahs in round 13.

“We were all sitting at pressos and Dave got up and said he had something to say, it took us all by surprise, actually,” Wagga City coach Mick Small told The Daily Advertiser yesterday.

“He just came out with this poem and caught us all on the hop, it’s something we hadn’t heard before and the boys loved it.”

Intended as a motivator for the Boiled Lollies, Mundy’s poem was a club secret for several years before its timely reveal at presentations four weeks ago.

“I think it’s always such a passionate butt of heads between the two clubs, it’s heart-on-your-sleeve sort of stuff,” Small said.

<http://www.dailyadvertiser.com.au/story/207197/poetic-justice-for-wagga-city/?src=rss>

The goal of the cluster analysis is to identify new bottom-up text-type categories that are better delimited than pre-determined register categories in terms of their situational characteristics. However, our ultimate goal is not to claim that these bottom-up groupings of texts represent the “true” categories. Rather, we consider the register perspective and the text-type perspective to be complementary approaches, which each provide important information about the textual and situational distinctions found in this discourse domain. In the following section, we directly compare the situational descriptions provided by both approaches.

7 Comparing top-down and bottom-up categorizations of situational varieties

Because registers are top-down culturally recognized constructs that can be defined at different levels of specificity, they are not necessarily well-delimited in terms of all situational characteristics. We show in Section 5 above that texts in registers like encyclopedia articles or song lyrics are sharply delimited in terms of their situational characteristics, while texts in registers like news reports or opinion blogs can vary widely in their situational characteristics. In contrast, situational text types are bottom-up categories, groupings of texts that are similar in their situational characteristics (regardless of their register categories). As a result of these differences between the register and situational text type perspectives, it is probably not surprising that there is an imperfect mapping of the one onto the other.

Figure 7 shows the register composition of each of the clusters in the 5-cluster solution. Although there are only five situational text types in the analysis (contrasted with the 25 registers in Baby-CORE – see Table 1), some of the text types are composed to a large extent of texts from a single register. For example, Cluster 2.1 consists mostly of reports of past events: 56% of the texts grouped into the cluster are news reports, and an additional 22% are sports reports. Cluster 2.2 is also relatively homogeneous, with 45% of the texts in the cluster being encyclopedia articles, and an additional 25% of the texts being news articles. In contrast, clusters 1.1.1 and 1.2 comprise texts from a wide range of different registers.

Figure 7: Composition of each situational text type, by register.

Figures 8 and 9 show the opposite perspective: the extent to which texts from a register are grouped into a single text type. Encyclopedia articles are especially noteworthy here, with 97% of the documents from that register being grouped into Cluster 2.2. Song lyrics and personal blogs are also noteworthy here: 91% of song lyrics, and 68% of personal blogs, are grouped into Cluster 1.2. These are both examples of registers that are relatively well-delimited in their situational characteristics, even though the text type that they belong to includes texts from a wide range of different registers.
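
The two perspectives, composition of each text type by register (Figure 7) and distribution of each register across text types (Figures 8 and 9), are the column and row percentages of a single cross-tabulation. The following Python sketch shows the computation under stated assumptions: the function name crosstab_percentages is hypothetical, and the ten document labels are invented for the example rather than taken from Baby-CORE.

```python
from collections import Counter


def crosstab_percentages(pairs):
    """pairs: iterable of (register, text_type) labels, one per document.

    Returns two dicts keyed by (register, text_type):
      by_type[r, t]     = % of text type t made up of register r
      by_register[r, t] = % of register r assigned to text type t
    """
    pairs = list(pairs)
    cell = Counter(pairs)
    register_total = Counter(r for r, _ in pairs)
    type_total = Counter(t for _, t in pairs)
    by_type = {(r, t): 100 * n / type_total[t]
               for (r, t), n in cell.items()}
    by_register = {(r, t): 100 * n / register_total[r]
                   for (r, t), n in cell.items()}
    return by_type, by_register


# Hypothetical labels for ten documents (NOT the Baby-CORE counts).
docs = (
    [("encyclopedia", "2.2")] * 4
    + [("news report", "2.2")] * 1
    + [("news report", "2.1")] * 3
    + [("news report", "1.2")] * 2
)
by_type, by_register = crosstab_percentages(docs)
```

In this toy data, every encyclopedia article falls into text type 2.2, yet that text type is not composed exclusively of encyclopedia articles. The same asymmetry underlies the imperfect mapping between registers and situational text types.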

Figure 8: Registers that are (mostly) grouped into a single situational text type.

Figure 9: Registers that are distributed across multiple situational text types.

Figure 9 shows that some other registers are not at all sharply delimited in their situational characteristics. For example, news reports include texts that are grouped into four of the five situational text types, while opinion blogs include texts grouped into all five situational text types. News reports illustrate the opposite pattern from a register like song lyrics: while there is one cluster composed mostly of news reports and sports reports (Cluster 2.1 – see Figure 7 above), the register of news reports overall includes texts that realize a wide range of different situational characteristics, and thus are grouped into many different situational text types (see Figure 9).

In summary, it is clear that these two analytical approaches to the categorization of texts provide quite different situational descriptions. However, both of those perspectives need to be interpreted relative to the quantitative characterizations of individual texts, and the fact that web documents occupy almost all of the space defined by the two situational dimensions of variation. Thus, in the following section, we attempt to integrate the three perspectives – describing the situational characteristics of register categories, describing the situational characteristics of situational text types, and describing texts in a continuous space of situational variation – and apply the triangulated results to two applied issues that have been largely disregarded in previous studies of register variation: (1) an overt recognition of the ways in which texts vary within a register category, and (2) the description of texts that do not fit tidily into any culturally recognized register category.

8 Conclusion: application to the analysis of texts and textual variation

There are two major innovations emerging from the analyses presented in the present paper. The first is the treatment of situational variation as a continuous MD space. And the second is the exploration of bottom-up versus top-down methods for categorizing texts into situational varieties. It turns out that both innovations provide alternative perspectives on analytical problems found in earlier studies.

The first innovation helps us account for important differences among texts within a register. From a linguistic perspective, such differences have long been noted. For example, Biber (1988) devotes an entire chapter (Chapter 8) to discussion of the linguistic variation found among texts within register categories. The present analysis shows similar patterns for situational variation among texts within register categories. By describing the characteristics of each text in a continuous MD space, we are able to isolate the particular texts within a register that require more detailed investigation.

Qualitative analysis of such texts in the full CORE corpus of web documents indicates that there are additional contextual factors that strongly influence register categorization, beyond the text-internal situational characteristics that were analyzed for the current project. In particular, external indicators associated with the contextual setting of a document seem to have had a major influence on the codings of documents in the web registers project: If external indicators associated with a document strongly suggest a particular register category, coders in the project generally agreed on that category, regardless of the other situational characteristics of the text itself.

For example, all four coders in the original web registers project agreed that the document associated with the web page shown in Screen shot 1 was a news report. However, the MD analysis presented in Section 5 above shows that this text is highly peripheral to the typical situational characteristics of news reports, with a Dimension 1 score of 1.14 (extremely opinionated) and a Dimension 2 score of .26 (moderately narrative). In fact, as Figure 10 shows, this text is more typical of opinion blogs than news reports in its MD situational characterization.

Screen shot 1: Webpage from the newspaper The Mirror.

Text 5: News report with situational characteristics that are atypical

Dimension 1 score = 1.14 (extremely opinionated)

Dimension 2 score = 0.26 (moderately narrative)

Should the pillar of the local community top the poll in Northamptonshire, the Home Secretary would order an immediate re-run.

Because Theresa May’s flawed law disqualifies Labour’s Barron, a respected local magistrate and former councillor, over a teenage £20 fine for a minor offence in 1990.

The trade unionist’s presence on the ballot paper, in a contest he’d lose if he wins, captures the ridiculousness of a Conservative plan likely to end in a self-inflicted bloody nose for David Cameron.

The Prime Plod excelled himself even by his own incompetent standards in accepting Nick Clegg’s timetable and holding the elections in November.

Polling booth staff should take a newspaper to read because fewer than nine hours of daylight isn’t going to have them queuing out of the door to vote.

The turnout, in a race nobody wanted except misguided Tory policy wonkers, threatens to be the lowest ever for national elections, below the 23% of the 1999 European elections.

Liberal Democrats can’t be bothered to raise the party’s standard in all 41 areas outside London.

Cameron’s dream of law and order zealots sweeping the board may be a nightmare if Labour successfully turns the elections into local referendums on police cuts, Miliband’s party gaining a toe-hold to kick Tory MPs out of Parliament.

And we can all take direct action by voting with our hard earned cash.

I’m boycotting Amazon and Starbucks until they tip up tax.

Google’s harder to avoid, but I’ll still be giving it grief.

<http://www.mirror.co.uk/news/uk-news/kevin-maguire-absurdity-of-the-police-and-crime-1433106>

Text excerpt 5 confirms the MD situational characterization of this text, showing that it is primarily focused on presenting the personal opinions of the author, rather than attempting to provide a more neutral account of information or events. Based on the situational characteristics of the text alone (divorced from its contextual setting), raters would likely have assigned this document to the category of personal opinion blog. However, we find instead that all four coders agreed on the categorization of news report, based on the clear signals in the contextual setting provided by the web page. For better or worse, we would argue that this categorization has cultural validity, reflecting the ways in which readers make decisions about register. However, our main point here is that the analytical approach developed in the present study – describing the situational characteristics of individual texts in a continuous MD space – permits much more insightful descriptions of such documents, allowing us to describe their peculiar nature in comparison to the register norms.

Figure 10: Text 4 characteristics (a news report) relative to other news reports and opinion blogs in the 2-dimensional situational space.
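
The comparison of a single text to register norms can be sketched as a distance calculation in the two-dimensional situational space. The sketch below is illustrative only: the text's scores (1.14, 0.26) are those reported above, but the register centroids are hypothetical placeholder values (not results from the study), and the use of Euclidean distance to a register mean is an assumption made for the example.

```python
import math

# Dimension 1 and Dimension 2 scores for the Mirror text, as reported above.
text_scores = (1.14, 0.26)

# HYPOTHETICAL register centroids (mean Dimension 1, mean Dimension 2).
# Placeholder values for illustration only; they are not taken from the study.
register_centroids = {
    "news report": (-0.30, 0.60),
    "opinion blog": (0.90, 0.10),
}


def distance(a, b):
    """Euclidean distance between two points in the 2-D situational space."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def nearest_register(scores, centroids):
    """Return (register name, distance) for the centroid closest to the text."""
    return min(
        ((name, distance(scores, centroid)) for name, centroid in centroids.items()),
        key=lambda pair: pair[1],
    )


# Distance from the text to each register centroid.
distances = {
    name: distance(text_scores, centroid)
    for name, centroid in register_centroids.items()
}
closest, closest_distance = nearest_register(text_scores, register_centroids)
```

With these placeholder centroids, the text lies closer to the opinion blog centroid than to the news report centroid, which mirrors the kind of pattern described for this document. An actual analysis would use the register means (and dispersion) estimated from the coded corpus.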


The second innovation that we explored in the present study was the utility of the situational text type perspective. The case study here applied hierarchical cluster analysis, using the quantitative situational dimensions as variables, to inductively identify textual categories that are well-delimited in their situational characteristics. This innovation is especially valuable for the description of texts from the discourse domain of the web, because the register category of many web documents is unclear (see Section 2 above). In most previous research, such texts would simply be disregarded (or not even included in the corpus). However, the text type perspective provides a framework for the identification and characterization of such texts.
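
The general logic of grouping texts by hierarchical cluster analysis on their dimension scores can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the six texts and their scores are hypothetical, and average linkage with Euclidean distance is only one of several possible clustering choices, not necessarily the one used in the study.

```python
import math
from itertools import combinations

# HYPOTHETICAL (Dimension 1, Dimension 2) scores for six texts.
texts = {
    "t1": (1.10, 0.30),
    "t2": (0.90, 0.20),
    "t3": (0.80, -1.40),
    "t4": (0.70, -1.20),
    "t5": (-1.00, 0.80),
    "t6": (-1.20, 0.90),
}


def average_linkage(c1, c2, points):
    """Mean pairwise Euclidean distance between the members of two clusters."""
    total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, n_clusters):
    """Start with one cluster per text; merge the two closest clusters
    repeatedly until only n_clusters remain."""
    clusters = [frozenset([name]) for name in points]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Each resulting cluster is a candidate "situational text type".
text_types = agglomerate(texts, n_clusters=3)
```

Texts with similar combinations of dimension scores end up in the same cluster regardless of the register label (if any) that coders assigned to them, which is what allows unclassified documents to be characterized.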

Screen shot 2: Webpage from the National Hockey League.


For example, coders were unable to agree on the register category of the document associated with the webpage from the National Hockey League shown in Screen shot 2.

This type of web document will be familiar to any reader who performs web searches. But it does not fit tidily into any culturally recognized named register category, and the page itself fails to provide any explicit external indicators of register category. Two of the four coders in the web registers project classified this document as a “sports report”, one coder classified it as an “informational description of a thing”, and the fourth simply classified it as “other information”. It turns out that such documents are extremely common on the web; in fact, coders were unable to agree on a specific register category for 54% of all documents classified as generally “informational” in the analysis of the CORE corpus (see Biber and Egbert 2018, Chapters 3 and 7).
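
The notion of coder agreement can be made concrete with a small sketch. The coder labels below are those reported in this section; the function name `agreed_register` and the threshold of three out of four coders are assumptions introduced for illustration, chosen only because two matching codes out of four are described above as a failure to agree.

```python
from collections import Counter


def agreed_register(labels, min_votes=3):
    """Return the label chosen by at least `min_votes` coders, else None.

    The default threshold (3 of 4) is an illustrative assumption.
    """
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


# Codes reported above for the National Hockey League document.
nhl_codes = [
    "sports report",
    "sports report",
    "informational description of a thing",
    "other information",
]

# All four coders agreed on the Mirror document.
mirror_codes = ["news report"] * 4

nhl_result = agreed_register(nhl_codes)        # no agreed category
mirror_result = agreed_register(mirror_codes)  # agreed category
```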

This particular document is additionally problematic because it serves hybrid purposes. In fact, most of the document is devoted to a “comments” section (as shown in Screen shot 3), where interested readers can submit questions, and the “coach” responds with answers.

Screen shot 3: Comment section of the webpage from the National Hockey League.


The situational text type perspective provides a framework for the characterization of such texts. Thus, the document associated with Screen shots 2 and 3 was analyzed as moderately opinionated and persuasive (Dimension 1 score = 0.84) and focused on explanatory/advice/procedural purposes (Dimension 2 score = –1.4). The cluster analysis grouped this text into Cluster 1.1.2, together with other documents that coders had classified as interactive discussions (in addition to how-to documents, opinion blogs, and personal blogs). Consideration of the text in these screenshots shows that this document does actually share many of the situational characteristics of discussion forums and other types of interactive discussions on the web: much of the document is interactive, with readers posing informational questions and the site hosts writing back with responses. However, the document also incorporates monologic, informational/advice communicative purposes, making it hybrid to a certain extent. The webpage itself fails to overtly identify a register category in the contextual setting, and thus coders had difficulty classifying the document as an instance of a particular register such as an interactive discussion forum. But the situational text type analysis accurately groups this document together with other texts that have similar combinations of situational features.

As this document illustrates, a text can have hybrid situational characteristics because it is composed of distinct discourse units that serve very different communicative purposes. In other cases, though, an individual text can truly integrate multiple communicative purposes, such as the news report in Text 2 above, which integrates narrative, personal opinion, and informational communicative purposes. The key generalization for our purposes here is that texts can be hybrids. Registers, in contrast, are culturally defined and recognized categories, and thus they cannot be “hybrids”, even though there is often an extensive range of situational variation among the texts within a register.

A topic for future research is the extension of these methods to other discourse domains, including print-media texts and spoken discourse. We anticipate that the two innovations of the current analysis will be equally useful in other discourse domains. For example, recognizing register categories like “newspaper article”, “magazine article”, or “conversation” is for the most part uncontroversial for members of a speech community. However, there is clearly an extensive range of variation in the communicative purposes and other situational characteristics of particular texts within these categories. By analyzing the characteristics of texts in a continuous MD space, relative to the population of other texts from the same register, researchers will be able to provide much more comprehensive descriptions of these complex everyday registers than previously possible. This approach is currently being applied to the domain of everyday face-to-face conversations, where each conversation is manually segmented into Conversation Units, and each unit is coded for the degree to which it is characterized by a set of nine communicative purposes (see Egbert et al., forthcoming).

Applications of the second innovation – the situational text type approach – are also likely to be useful in other discourse domains. There are many texts that do not fit tidily into culturally recognized register categories. For example, what do we call the political campaign documents that we receive in the mail? Or the informational health pamphlets that we get from our physician? Or the informational document about a study-abroad program that we see on the wall in our office building? These are types of texts that are usually disregarded by corpus building projects, probably in large part because they do not fit tidily into culturally recognized register categories. However, they are not rare. And we expect that the analytical framework developed here will provide tools for the description of such texts, similar to the description of unclassified texts from the web.

Thus, while the present study has focused on variation among web documents, we anticipate that the analytical approaches illustrated here will prove to be informative to the study of register variation in any discourse domain. Although the methods require manual coding, and are therefore expensive and time-consuming, we would argue that the benefits outweigh the costs. As a result, this type of analysis should at least be considered as a possibility for any new corpus-construction project.

Finally, because linguistic variation has a functional basis, we anticipate that a direct comparison of the quantitative-situational characteristics of texts with the quantitative-linguistic characteristics of texts should prove to be highly informative. Preliminary research is confirming that possibility (e.g. showing high correlations between the situational Dimension 1: “Personal opinionated discourse versus technical information supported with evidence” and the linguistic Dimension 1 from the 1988 study: “Involved versus Informational Production”). In research currently underway, we are exploring these continuous quantitative relationships between situational variation and linguistic variation in detail.
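
The kind of comparison described here can be sketched as a Pearson correlation between per-text scores on a situational dimension and per-text scores on a linguistic dimension. The scores below are hypothetical values invented for illustration, and the function name `pearson_r` is likewise an assumption; the sketch shows only the calculation, not any result from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# HYPOTHETICAL per-text scores: the same five texts scored on a situational
# dimension and on a linguistic dimension.
situational_dim1 = [1.2, 0.8, 0.1, -0.5, -1.1]
linguistic_dim1 = [11.0, 9.0, 1.5, -4.0, -12.0]

r = pearson_r(situational_dim1, linguistic_dim1)
```

A coefficient near 1 or -1 would indicate that texts ordered along the situational dimension are ordered in nearly the same (or the reverse) way along the linguistic dimension.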

References

Barbieri, Federica & Stacey Wizner. 2019. Appendix A: Annotation of major register and genre studies. In Douglas Biber & Susan Conrad (eds.), Register, genre, and style, 318–364. Cambridge: Cambridge University Press.

Biber, D. 1989. A typology of English texts. Linguistics 27. 3–43. (Reprinted in 2013, Linguistics 51; 50-year Jubilee Issue.)

Biber, D. 2013. A typology of English texts. Linguistics 51 (50-year Jubilee Issue). (Reprint of Biber, D. 1989. Linguistics 27. 3–43.)

Biber, D. 2019. Text-linguistic approaches to register variation. Register Studies 1. 42–75.

Biber, Douglas. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.

Biber, Douglas. 1994. An analytical framework for register studies. In D. Biber & E. Finegan (eds.), Sociolinguistic perspectives on register, 31–56. New York: Oxford University Press.

Biber, Douglas. 2012. Register as a predictor of linguistic variation. Corpus Linguistics and Linguistic Theory 8. 9–37.

Biber, Douglas. 2014. Using multi-dimensional analysis to explore cross-linguistic universals of register variation. Languages in Contrast 14(1). 7–34.

Biber, Douglas & Susan Conrad. 2009. Register, genre, and style. Cambridge: Cambridge University Press.

Biber, Douglas & Jesse Egbert. 2018. Register variation online. Cambridge: Cambridge University Press.

Biber, Douglas, Jesse Egbert, Bethany Gray, Rahel Oppliger & Benedikt Szmrecsanyi. 2016. Variationist versus text-linguistic approaches to grammatical change in English: Nominal modifiers of head nouns. In Merja Kytö & Päivi Pahta (eds.), Cambridge Handbook of English Historical Linguistics, 351–375. Cambridge: Cambridge University Press.

De Beaugrande, Robert & Wolfgang Dressler. 1981. Introduction to text linguistics. London: Longman.

Egbert, Jesse & Douglas Biber. 2016. Do all roads lead to Rome?: Modeling register variation with factor analysis and discriminant analysis. Corpus Linguistics and Linguistic Theory. doi:10.1515/cllt-2016-0016.

Egbert, Jesse, Douglas Biber & Mark Davies. 2015. Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology 66(9). 1817–1831.

Egbert, J., S. Wizner, D. Keller, D. Biber, P. Baker, and A. McEnery. Forthcoming. Identifying and describing functional conversation units in the BNC Spoken 2014. Text and Talk.

Halliday, M.A.K. & Ruqaiya Hasan. 1976. Cohesion in English. London: Longman.

Hymes, Dell. 1972. Editorial introduction to “Language in Society”. Language in Society 1. 1–14.

Hymes, Dell. 1974. Foundations in Sociolinguistics: An Ethnographic Approach. Philadelphia: University of Pennsylvania Press.

Matthiessen, Christian M.I.M. 2019. Register in systemic functional linguistics. Register Studies 1. 10–41.

Neumann, Stella. 2013. Contrastive register variation. Berlin: de Gruyter.

Revelle, W. 2018. psych: Procedures for personality and psychological research. Evanston, Illinois, USA: Northwestern University. https://CRAN.R-project.org/package=psych. Version = 1.8.10.

Sharoff, Serge. 2018. Functional text dimensions for the annotation of web corpora. Corpora 13(1). 65–95.

Published Online: 2020-02-13

© 2020 Walter de Gruyter GmbH, Berlin/Boston