Why we need a gradient approach to word order

This article argues for a gradient approach to word order, which treats word order preferences, both within and across languages, as a continuous variable. Word order variability should be regarded as a basic assumption, rather than as something exceptional. Although this approach follows naturally from the emergentist usage-based view of language, we argue that it can be beneficial for all frameworks and linguistic domains, including language acquisition, processing, typology, language contact, language evolution and change, and formal approaches. Gradient approaches have been very fruitful in some domains, such as language processing, but their potential is not fully realized yet. This may be due to practical reasons. We discuss the most pressing methodological challenges in corpus-based and experimental research of word order and propose some practical solutions.

1 Aims of this article

1.1 What do we mean by a gradient approach?
In this article we argue for a gradient approach to word order. By advocating for a gradient approach, we put forth two main theoretical stances. First, we argue for the presumption of variability in word order research, or for treating variability as the null hypothesis. Second, from a crosslinguistic perspective, we argue for a presumption of gradience; by default, we expect that languages should vary in degree but not kind when it comes to word order variability. From the perspective of description, a gradient approach means that word order patterns should be treated as a continuous variable. For example, in addition to, or instead of, labeling a language as SO (with the dominant Subject-Object order) or OS (with the dominant Object-Subject order), we can compute and report the proportion of SO and OS based on behavioral data (from sources such as corpora or experiments). Similarly, instead of or in addition to labeling languages as having fixed, flexible, or free word order, we can measure the degrees of this variability by using quantitative measures, such as entropy, in a crosslinguistically comparable manner. Along with increasing descriptive adequacy, this allows us to move beyond categorical claims which stipulate that rigid word order provides a cue for assigning grammatical functions, such as Subject and Object, to noun phrases; instead, we can measure the reliability and strength of word order as a cue for argument structure based on behavioral data. 1 A gradient approach to the description of a particular language can be illustrated by a simple analogy: Instead of using categorical color terms like "white", "blue" or "orange", we can encode colors numerically. For example, we can describe the color for which there is a word in English, "orange", in RGB terms as being 100% red, 64.7% green, and 0% blue. In CMYK color space, it is 0% cyan, 35.3% magenta, 100% yellow and 0% black, and it is assigned the hex code #ffa500.
However, an additional advantage of this continuous measure is that we can also describe a color for which there is no standard label, such as #BA55D3, which is 73% red, 33% green, and 83% blue, and is given the label "medium orchid" in the X11 color names system. Thus, using various continuous measures, we can not only account for the underlyingly gradient properties of color, but we can also describe more and less prototypical colors on equal footing. Likewise, we can use cross-linguistically comparable gradient measures to describe a particular language, to describe how languages vary, and as a basis for understanding how linguistic variables may interact to produce typological patterns and motivate change.
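The color analogy can be made concrete with a short sketch. The helper name below is ours, and it assumes standard 8-bit RGB channels:

```python
def hex_to_rgb_percent(hex_code):
    """Convert a hex color code to (red, green, blue) percentages."""
    h = hex_code.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return tuple(round(100 * v / 255, 1) for v in (r, g, b))

# "orange" (#ffa500): a color with a conventional English label
print(hex_to_rgb_percent("#ffa500"))  # (100.0, 64.7, 0.0)

# "medium orchid" (#BA55D3): no everyday label, but equally describable
print(hex_to_rgb_percent("#BA55D3"))  # (72.9, 33.3, 82.7)
```

Both labeled and unlabeled colors receive descriptions of exactly the same form, which is the point of the analogy.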
There are both theoretical and practical reasons why gradient approaches have not yet been adopted more widely. From a theoretical perspective, a widespread view has been that grammar in general and word order in particular should be described by discrete features, categories, rules, or parameters (e.g., Guardiano and Longobardi 2005; Jackendoff 1977; Lightfoot 1982; Vennemann and Harlow 1977), although probabilistic approaches to grammar are becoming increasingly influential (Bod et al. 2003; Bresnan et al. 2007; Grafmiller et al. 2018; also see Section 1.2.2). A significant practical barrier to adopting a gradient perspective is that it is still difficult to obtain the types of behavioral or corpus data in many languages which would be necessary for descriptions using gradient measures. However, these barriers are rapidly falling away; we have access to new data sources in the form of large corpora, as well as software for processing and analyzing them statistically. To take one example, the Universal Dependencies project (UD; Zeman et al. 2020) had 10 treebanks for 10 languages in 2015. Version 2.9, released in November 2021, contained 217 treebanks for 122 languages, with more in the works. Computational algorithms, such as multilingual word alignment applied to massively parallel corpora, can be used to scale up gradient approaches, providing word order information for almost a thousand languages (Östling 2015). The proliferation of quantitative approaches to different areas of linguistics (Janda 2013; Kortmann 2020), which has been possible thanks to the development of user-friendly statistical software and robust experimental methods, has made us better equipped than ever for investigating gradient phenomena, though many practical issues are far from being fully solved, as we show in Section 4.
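As a toy illustration of how treebank data of this kind can feed gradient measures, the sketch below counts Subject-Object versus Object-Subject clauses in CoNLL-U input. It is a deliberate simplification of a real extraction pipeline: it considers only bare `nsubj` and `obj` dependents sharing the same head, and skips multiword and empty tokens:

```python
def so_vs_os(conllu_text):
    """Count Subject-Object vs. Object-Subject clauses in CoNLL-U data.
    Simplified: only bare nsubj/obj dependents of the same head."""
    counts = {"SO": 0, "OS": 0}
    for block in conllu_text.strip().split("\n\n"):
        args = {}  # head id -> {deprel: dependent position}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if not cols[0].isdigit():  # skip multiword/empty tokens
                continue
            if cols[7] in ("nsubj", "obj"):
                args.setdefault(cols[6], {})[cols[7]] = int(cols[0])
        for d in args.values():
            if "nsubj" in d and "obj" in d:
                counts["SO" if d["nsubj"] < d["obj"] else "OS"] += 1
    return counts

# Toy object-initial clause ("Čajnik ja postavila"), hand-annotated
toy = "\n".join([
    "1\tČajnik\t_\tNOUN\t_\t_\t3\tobj\t_\t_",
    "2\tja\t_\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "3\tpostavila\t_\tVERB\t_\t_\t0\troot\t_\t_",
])
print(so_vs_os(toy))  # {'SO': 0, 'OS': 1}
```

Run over a whole treebank, such counts yield the proportions that a gradient description reports in place of a single categorical label.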

Gradience across generative approaches
Even in approaches where, historically, gradience has not been a central concern, various strategies have emerged to account for variability in word and constituent order. With the recent diversification of methodologies in the generative paradigm, researchers are recognizing the need to account for gradience and variability in theory building. 3 This section first discusses so-called 'mainstream' generative approaches (cf. Culicover and Jackendoff 2006) and accounts therein of gradience and variability in order. Then, we look to other major generative approaches, namely Lexical-Functional Grammar and Head-Driven Phrase Structure Grammar, in which accounting for variation in word/constituent order has played a central role in theory construction. We review these here as part of demonstrating why gradience might not have been as central to previous accounts of word order flexibility; the theoretical moves we advocate for in this article are certainly compatible with a range of approaches to syntax, though the details of implementation and, perhaps most crucially, how interested practitioners might be in the associated questions, will of course vary.
Historically, gradience in production or speaker judgments was explained as a difference among grammars: in short, crosslinguistic variation related to proposed parametric differences (e.g., Chomsky 1986). Although the Minimalist Program's feature-checking model generates canonical word orders, subject position, as well as derived word orders in a step-wise, phase-based derivation (though cf. Fanselow 2009), optionality and gradience are problematic and often not assumed to exist. 4 There is work that aims to account for insights related to frequency and innovation in intergenerational language transmission and acquisition (e.g., Biberauer 2019; Gravely 2021; Yang 2002) or from variationist sociolinguistics (e.g., Smith 2005, 2010) by combining the assumptions and theoretical machinery of the Minimalist Program with probabilistic analyses of variable word order. In this type of work, variability can be accounted for by gradient or stochastic probabilistic constraints on lexical/feature selection, motivated by a variety of factors, including functional ones. However, because the organizing principle of this framework is to reduce the specificity of syntactic operations, the types of processes which are relevant for accounting for different orders within languages or varieties of languages, for example, are often explicitly constructed as extra- or post-syntactic operations, a theoretical conundrum, given that such processes are reflected in the syntax itself. As such, analyses which include variation or gradience are necessarily about interfaces (as this is a modular framework) with other levels of linguistic analysis.
Syntactic variation related to discourse-information structure is one particularly relevant example, and it presents one of the biggest challenges for the Y-model of language (e.g., Irurtzun 2009). Although no crosslinguistic model accounting for variability due to information structure currently exists in this framework, numerous attempts have centered on particular phenomena or languages. 5 Examples are Rizzi's (1997) Cartographic Program, Frascarelli and Hinterhölzl's (2007) isomorphism of syntax, information structure, and intonation, and Zubizarreta (1998), who elegantly accounts for syntax, prosody, and information structure in Germanic and Romance without the split-CP architecture. However, they are not unproblematic. In particular, recent experimental tests of Spanish varieties find variation not predicted by Zubizarreta's (1998) account, especially for subject information focus (i.e., rheme). 6 In fact, Feldhausen and Vanrell (2015), which builds upon Zubizarreta's account, does so using a non-categorical approach, namely Stochastic Optimality Theory (Boersma and Hayes 2001).
Indeed, Optimality Theoretic approaches have been a productive space for generative syntacticians interested in making variation (be it gradient or stochastic) more central to syntactic analyses. Ortega-Santos (2016) is a compelling application of an Optimality Theoretic approach to focus-related word order variation in Spanish. Müller (2019) uses a Gradient Harmonic Grammar (Smolensky and Goldrick 2016) approach to shed light on extraposed infinitive constructions in German, showing how this approach can explain apparent variable strength in the CP realm. Since Syntactic Structures (Chomsky 1957), the vast majority of generative approaches have sought descriptive as well as explanatory adequacy; therefore, not assuming or accounting for gradience means not capturing an important portion of speaker competence when it comes to word order. The result is a theoretical model that is, at best, incomplete, and, at worst, incorrect.
The question of how to deal with languages with flexible word order has played an important role in how non-hegemonic generative approaches to syntax diverged from other generative approaches. For example, Lexical-Functional Grammar (LFG) is characterized by a division between constituent structure and functional structure, and this has been used specifically to deal with languages with highly flexible constituent order (Austin and Bresnan 1996; Dalrymple et al. 2019). In LFG, languages are not a priori required to be specified for particular orders, and languages such as Plains Cree or Malayalam are argued to be non-configurational (Dahlstrom 1986; Mohanan 1983; see also Nordlinger 1998 and Simpson 2012 for discussion of nonconfigurational Australian languages). Accounting for flexible constituent order languages was not as central to the development of Head-Driven Phrase Structure Grammar (HPSG), in which linearization of constituents is similarly separate from constituent structure (Wechsler and Asudeh 2021), but, as HPSG has been implemented in a variety of languages with flexible constituent order, the framework is flexible enough to account for the gradience seen in natural language (e.g., Fujinami 1996 for Japanese; Mahmud and Khan 2007 for Bangla; Müller 1999 for German; Simov et al. 2004 for Bulgarian). While gradience is not central to these approaches, the existence of flexibility in constituent order, and, relatedly, whether and where grammatical relations are specified in formalisms and/or speakers' knowledge, 7 has not only been addressed in these frameworks, but dealing with word order flexibility has been the source of direct cross-framework comparison. What remains underdeveloped in these accounts is the ability to explain and measure degree of flexibility.

Gradience in usage-based linguistics
Though all theoretical approaches must deal with it in some way, word order gradience follows naturally from approaches that assume a dynamic usage-based view of grammar (e.g., Bybee 2010; Diessel 2019). This can be illustrated by grammaticalization processes. For example, English is mostly a prepositional language. Since many adpositions result from the grammaticalization of verbs, and English has predominantly VO order, the normal outcome of this process is the development of prepositions rather than postpositions. However, English also has a few postpositions, such as ago and notwithstanding. The nouns that are now used with these postpositions were originally the subjects of the source verbs (Dryer 2019). The bottom-up emergence of grammar from language use is thus very likely to result in variability and construction-specific idiosyncrasies, such that a categorical approach of applying labels to languages as a whole is less accurate than a gradient approach, which can quantify degrees of divergence from some canonical extreme.
The usage-based approach assumes that the user's knowledge of grammar is probabilistic (Bod et al. 2003; Grafmiller et al. 2018) because it is derived and updated based on exemplars of usage events stored in memory (Bybee 2010), from which more abstract generalizations can be formed (Goldberg 2006). Individual language users implicitly learn statistical variability, or "soft constraints" (Bresnan et al. 2001), from the input (MacDonald 2013). This view is supported by the fact that language users are able to predict the likelihood of a variant in a specific context, closely matching predictions based on corpora (Bresnan and Ford 2010; Klavan and Divjak 2016). This probabilistic variation is captured by complex statistical models of language users' behavior (e.g., Bresnan et al. 2007; Gries 2003; Szmrecsanyi et al. 2016, to name just a few). Gradience thus forms part of mental grammar, and categories are emergent or epiphenomenal.
Both word order variability and rigidity in such a framework result from competition between factors involved in language processing and learning, which can have different weights within one language and across languages (see Section 2.1). In this sense, the probabilistic view has much in common with Optimality Theory (see Lee 2003; Keller 2006, among others, for work which deals directly with word order). 8 The difference is that probabilistic grammars do not assume a fixed set of innate constraints (Grafmiller et al. 2018). This variability is instead represented in the individual speaker's grammar in the form of sequential associations of different strengths, depending on the degree of entrenchment of a sequence in an individual mind (Diessel 2019), which both depends on and determines the degree of conventionalization of this sequence in the community (Schmid 2020).

Moving from categories to gradience in language description
In this article, we argue that a gradient approach is more descriptively adequate than approaches which rely on categorical labels like "VO" and "OV", or "rigid/fixed", "flexible" and "dominant". Under these labels, rigid order means that some orders are "either ungrammatical or used relatively infrequently and only in special pragmatic contexts" (Dryer 2013b). If different possible word orders are grammatical, languages have flexible order. Many flexible-order languages are claimed to have a dominant word order, i.e., the most frequent one (e.g., the order that is used at least twice as frequently as any other order is considered dominant); in the absence of frequency data, or in languages where the dominant order might differ based on contexts such as register or genre (see Payne 1992), this could also be the order labeled as "basic", "pragmatically neutral" or "canonical" in language descriptions.
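The "at least twice as frequent" criterion for dominance can be made explicit in a short sketch. The function name and frequencies are ours, invented for illustration:

```python
def dominant_order(counts):
    """Return the dominant order under the 'at least twice as frequent
    as any other order' criterion, or None if no order qualifies."""
    best = max(counts, key=counts.get)
    others = [v for k, v in counts.items() if k != best]
    if all(counts[best] >= 2 * v for v in others):
        return best
    return None

# Invented frequencies for illustration
print(dominant_order({"SOV": 700, "OSV": 200, "SVO": 100}))  # SOV
print(dominant_order({"SOV": 450, "OSV": 350, "SVO": 200}))  # None
```

Writing the criterion out makes its arbitrariness visible: a language at 449 versus 350 gets no dominant order, while one at 700 versus 350 does, even though both sit on the same continuum.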
However, as has been noted (e.g., Hale 1983; Salzmann 2004), a clear-cut distinction between rigid-order languages and flexible-order languages with a dominant order is problematic. First of all, pragmatic neutrality is a slippery notion that depends on the specific situation. In spontaneous informal conversations in Russian, for example, there is nothing pragmatically marked about putting a newsworthy object first. For example, (1a), which comes from a transcribed spoken text, has OSV order. Compare it with (1b), taken from an online news report, which has SVO, or given-new, order, which is pragmatically neutral here. See more on register and modality effects in Section 4.1.3.
(1) Russian
a. Spontaneous conversation
Čajnik ja postavi-l-a.
kettle.ACC 1SG.NOM set.PFV-PST-SG.F
'I set the kettle.' (Zemskaja and Kapanadze 1978: "A day in a family")
b. Online news
Ona postavi-l-a čajnik na plit-u…
3SG.F.NOM set.PFV-PST-SG.F kettle.ACC on stove-ACC
'She set the kettle on the stove… (and forgot about it)' (mir24.tv)

The second criterion, text frequency, is also problematic. Converting a continuous measure to a categorical one means loss of data, and there is not always a clear cut-off point. Consider an illustration. Using corpora of online news in the Leipzig Corpora Collection (Goldhahn et al. 2012) in 31 languages, annotated with the Universal Dependencies (Zeman et al. 2020), we computed the proportions of Subject-Object order in the total number of clauses in which both arguments were expressed by common nouns. The sample sizes were very large, with a median number of about 138,000 clauses. The data are available online as Dataset1.txt in an OSF directory. 9 The result is shown in Figure 1. In all 31 languages, Subject usually comes before Object. The distribution of the scores represents a continuum from flexible languages (Lithuanian, Hungarian, Tamil, Latvian, and Czech) to well-known rigid languages (e.g., Indonesian, French, English, Norwegian, Danish, and Swedish). This plot demonstrates that there is no clear boundary between the two types, so it is difficult to choose the most appropriate cut-off point. Approaches such as Dryer (2013a), which includes languages without a dominant order, resulting in a simple three-way scale (e.g., Adjective-Noun, no dominant order, Noun-Adjective), and Siewierska (1998), which determines relative flexibility based on the number of grammatical orders in a language, can be seen as precursors of gradient approaches.
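Continuous measures such as entropy avoid the cut-off problem altogether. A minimal sketch, with invented counts for a near-rigid and a flexible language:

```python
import math

def order_entropy(counts):
    """Shannon entropy (in bits) of a word order distribution:
    0 for a fully rigid order, 1 for two equally frequent orders."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Invented counts for illustration
rigid = {"SO": 980, "OS": 20}     # near-categorical preference
flexible = {"SO": 600, "OS": 400}  # substantial variability
print(round(order_entropy(rigid), 3))     # 0.141
print(round(order_entropy(flexible), 3))  # 0.971
```

Every language receives a position on the same scale, so no boundary between "rigid" and "flexible" types needs to be stipulated.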
Here, we advocate for taking the next logical step and moving to more fully continuous variables, as these are not only descriptively more adequate, they also allow us to ask and answer more questions about word order, as discussed in more detail in Section 3. A gradient approach may not always lead to dramatically different results from a categorical one. For example, word order correlations, e.g., the correlation between the order of verb and object, and the order of adpositions and nouns, would still be correct under the gradient approach because both orders display low variability in many languages (Levshina 2019; see also Section 3.5). Yet, we can investigate more linguistic phenomena and languages if we characterize word order with the help of continuous measures.
Note that we do not argue for completely banishing descriptive labels like "canonically SVO language." Such labels can still be useful as a shortcut, especially in the absence of more precise measures. However, under a gradient approach, we make explicit that these labels reflect convenient simplifications rather than representing inherent properties of particular languages. Reiterating our two main arguments, presuming within-language variability and cross-linguistic gradience when it comes to typological categorization, we posit that gradience itself might be a good candidate for an inherent property of language.
In some cases, conventional categories can be misleading. For example, the plot in Figure 2 displays the distribution of proportions of head-final phrases in 123 corpora annotated with Surface-syntactic Universal Dependencies (Osborne and Gerdes 2019). 10 The dataset is available as Dataset2.txt in the OSF directory, which also contains the Python script that was used for the data extraction. The plot shows that the corpora do not follow a bimodal distribution with peaks at each end, confirming the results obtained by Liu (2010) based on a small sample of languages. In other words, most corpora are neither strongly head-initial nor strongly head-final (cf. Polinsky 2012). In fact, the main bulk of the corpora contains between 25% and 50% head-final phrases. The conventional labels mask the asymmetric and gradient shape of the distribution, which is best represented by continuous measures.
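A proportion of head-final dependencies of the kind plotted in Figure 2 can be computed directly from dependency annotations. A minimal sketch, assuming sentences are encoded as (position, head position) pairs with 1-based indices and 0 for the root, as in CoNLL-U:

```python
def head_final_proportion(sentences):
    """Proportion of dependencies in which the head follows its
    dependent, over sentences given as lists of (position, head)
    pairs (1-based positions; head 0 marks the root)."""
    head_final = total = 0
    for sent in sentences:
        for pos, head in sent:
            if head == 0:   # skip the root relation
                continue
            total += 1
            if head > pos:  # head comes after its dependent
                head_final += 1
    return head_final / total

# Toy data: one fully head-final clause, one head-initial dependency
toy = [
    [(1, 2), (2, 3), (3, 0)],
    [(1, 0), (2, 1)],
]
print(round(head_final_proportion(toy), 2))  # 0.67
```

Applied corpus by corpus, this yields exactly the kind of continuous score whose distribution the conventional head-initial/head-final labels obscure.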
Studying gradience requires interdisciplinarity. In order to understand why word order exhibits more or less variability, we need to consider diverse factors related to language acquisition, processing, language contact, language change, prosody, and many others. These are discussed in Section 2. In Section 3, we discuss the research questions in different linguistic domains that cannot be asked and answered without taking a gradient approach. In order to study word order as a gradient phenomenon, we need quantitative measures of variability, such as probabilities or entropy, and empirical data from experiments or corpora. We suggest some methodological solutions and discuss the main challenges in Section 4. Finally, Section 5 provides conclusions and poses pertinent questions for future research across different linguistic disciplines.
2 What conditions word order gradience?
The aim of this section is to identify the main factors that contribute to the emergence of gradient word order patterns, and which make it possible for those patterns to remain in a language community. We begin with individual cognitive processes leading to gradient patterns, reviewing a large body of experimental and corpus-based work on language processing and acquisition (Sections 2.1 and 2.2). After that, we move on to word order gradience at the level of communities, focusing on language change (Section 2.3) and the processes of language contact, which interact with language variation in vernaculars (Section 2.4).

Language processing
There is a large body of psycholinguistic work investigating what conditions word order variability at the level of an individual language user and learner. An important role is played by the accessibility of information expressed by constituents, depending on their semantics, information status, and formal weight. It is advantageous for the speaker/signer to place more accessible information first because it helps to save time for planning the less accessible parts. Specifically, there is a general preference to place human (and, more broadly, animate) referents in early-appearing and/or prominent sentence positions (i.e., to realize them as grammatical subjects) (Branigan et al. 2008; see also Meir et al. 2017, who found a "me first" principle across three groups of sign languages and in elicited pantomime). The explanation for these effects on word order concerns conceptual accessibility: humans find it easier to access concepts denoting animate entities, along with their linguistic labels, from memory (Bock and Warren 1985). Accordingly, Tanaka et al. (2011) reported that Japanese-speaking adults were more likely to recall OSV sentences as having SOV word order when this resulted in an animate entity appearing before an inanimate entity. Thus, sentence (2a) was often recalled as (2b).
(2) Japanese
a. minato de, booto-o ryoshi-ga hakonda
harbor in boat-ACC fisherman-NOM carried
'In the harbor, the fisherman carried the boat.' (Tanaka et al. 2011: 322)
b. minato de, ryoshi-ga booto-o hakonda
harbor in fisherman-NOM boat-ACC carried
'In the harbor, the fisherman carried the boat.' (Tanaka et al. 2011: 322)

This effect may ultimately have its origin in preferred modes of event construal. Studies of scene perception and sentence planning using eye-tracking show that speakers can rapidly extract information about participant roles in events, with speakers fixating on characters and entities that are informative about the scene as a whole (Konopka 2019).
Constituent order is also influenced by the discourse status of referents. The earliest proposal for this principle was perhaps by Behaghel (1932: 4): "es stehen die alten Begriffe vor den neuen" [old concepts come before new ones]. This generalization was later captured as "given before new" by Gundel (1988). The effects of discourse information status have been empirically attested in various contexts. For example, the production experiments of Bock and Irwin (1980) showed that, for English speakers, given information tends to be produced earlier. Similar results were replicated by Ferreira and Yoshita (2003) for Japanese. Arnold et al. (2000) demonstrated that in both heavy NP shift and the dative alternation in English, speakers prefer to produce the relatively newer constituent later. The influence of information structure would in turn affect ordering flexibility, with a more fixed preference for given information to appear first (for example, Israeli Sign Language has been said to have a Topic-Comment order [Rosenstein 2001]; see also McIntire [1982] on American Sign Language). Although most languages discussed in this regard have predominantly given-before-new order, some languages prefer to put new and/or newsworthy information first, e.g., Biblical Hebrew, Cayuga (Iroquoian, the USA), Ngandi (Gunwinyguan, Australia), and Ute (Uto-Aztecan, the USA) (Mithun 1992). This can be explained by a competing principle: "More important or urgent information tends to be placed first in the string" (Givón 1991: 972). But manifestations of this principle can also be found in languages with given-first order, e.g., the clause-initial placement of contrastive topic and focus, but also of full nominal phrases, as in (1a). Crucially, these principles interact, and they are often strong tendencies as opposed to absolute principles; competition between factors such as "given first" and "important first" can lead to gradience effects within and across languages.
Intonation plays an important role in these processes. Across languages, we see that less frequent orders are associated with particular intonational patterns (e.g., Downing et al. 2004; Patil et al. 2008; Vainio and Järvikivi 2006; see also Büring 2013), but there also seems to be a relationship between word order flexibility and the degree to which information structure is encoded prosodically, via word order, or both. Swerts et al. (2002) compared prominence and information status in Italian and Dutch. Italian has relatively flexible word order within noun phrases as compared to Dutch, and Swerts et al. found that, within noun phrases, Dutch speakers encoded information status prosodically in production, and took advantage of this information in perception. However, the connection between information status and prosodic prominence was much less straightforward in Italian production, and it was unclear whether Italian listeners were attending to prosodic prominence as a cue for information status. This work suggests a trade-off between word order flexibility and prosodic encoding of information structure.
A related factor influencing word order is "heaviness" (i.e., the length, usually in words, of a constituent, especially relative to another in the same utterance). In the study mentioned above, Arnold et al. (2000) also showed that heaviness is an independent factor that determines constituent order in English, such that speakers prefer to produce short and given phrases before long and new ones (e.g., I gave the Prime Minister a signed copy of my latest magnum opus is preferred to I gave a signed copy of my latest magnum opus to the Prime Minister). In contrast, Yamashita and Chang (2001) showed that Japanese speakers prefer to order long phrases before short ones, suggesting that typological differences between the languages lead speakers to weight formal and conceptual cues in production differently (for a connectionist model, see Chang 2009).
These heaviness effects and their cross-linguistic variation have been explained by the pressure to minimize dependency lengths and similar processing principles, including Early Immediate Constituents (EIC) (Hawkins 1994), Minimize Domains (MiD) (Hawkins 2004), Dependency Locality Theory (DLT) (Gibson 1998) and Dependency Length Minimization (DLM) (Ferrer-i-Cancho 2004; Temperley and Gildea 2018). 11 These principles share a general prediction: words or phrases that stand in syntactic and/or semantic dependency relations with each other tend to occur close to each other. For example, dependents of the verb move to a position close to the verb. Empirical support and motivation for these principles mainly come from language comprehension studies (Gibson 2000), to the effect that shorter dependencies are preferred because they ease processing and support efficient communication (Gibson et al. 2019; Hawkins 1994, 2004; Jaeger and Tily 2011). These principles explain the above-mentioned preference for "light before heavy" in VO languages like English, and "heavy before light" in OV languages like Japanese.
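The shared prediction can be illustrated by summing dependency lengths under alternative orders. The toy dependency structures below, loosely modeled on a ditransitive clause with a short and a long object phrase, are invented for illustration; they show why "short before long" wins in a head-initial clause:

```python
def total_dependency_length(heads):
    """Sum of linear distances between each word and its head.
    `heads` maps 1-based word positions to head positions (0 = root)."""
    return sum(abs(pos - head) for pos, head in heads.items() if head != 0)

# Invented structure for "gave the minister a copy of my opus"
# (positions: 1 gave, 2 the, 3 minister, 4 a, 5 copy, 6 of, 7 my, 8 opus)
short_first = {1: 0, 2: 3, 3: 1, 4: 5, 5: 1, 6: 8, 7: 8, 8: 5}
# Same words, long object first: "gave a copy of my opus the minister"
# (positions: 1 gave, 2 a, 3 copy, 4 of, 5 my, 6 opus, 7 the, 8 minister)
long_first = {1: 0, 2: 3, 3: 1, 4: 6, 5: 6, 6: 3, 7: 8, 8: 1}
print(total_dependency_length(short_first))  # 14
print(total_dependency_length(long_first))   # 17
```

With the verb on the left, placing the short phrase first keeps the long phrase's head closer to the verb and lowers the total; mirroring the structure predicts the reverse preference for verb-final languages like Japanese.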
As revealed by corpus-based investigations, the effect of these principles varies cross-linguistically. Previous work looking at syntactic dependencies has provided evidence that languages minimize the overall or average dependency length to different extents (Futrell et al. 2015a, 2020; Gildea and Temperley 2010; Liu 2008; Liu 2020). The effects also depend on specific constructions. A study by Futrell et al. (2020) showed further that the average dependency length given fixed sentence length is longer in head-final contexts (see also Rajkumar et al. 2016). Other experiments have investigated DLM in syntactic constructions with flexible constituent orders (Gulordava et al. 2015; Liu 2019; Wasow and Arnold 2003; Yamashita 2002); with an examination of the double PP construction across 34 languages, Liu (2020) demonstrated a typological tendency for constituents of shorter length to appear closer to their syntactic heads. The results also indicated that the extent of DLM is weaker in preverbal (head-final) than postverbal (head-initial) domains, a contrast that is less constrained by traditionally defined language types (Hawkins 1994); this contrast also seems to hold when the effects of additional factors such as lexical frequency and contextual predictability are controlled for (Liu 2022). Note that the patterns in the preverbal context are in opposition to previous findings with transitive constructions in Japanese (Yamashita and Chang 2001) and Korean (Choi 2007).
Word order flexibility can be advantageous for language users by helping to maximize fluency, permitting users to articulate easily retrievable elements early and leaving time to plan the more cognitively demanding elements. At the same time, language users are inclined to re-use more global phrasal and sentence plans. This is observed in the recycling of highly accessible abstract schemas known as syntactic priming (Ferreira and Bock 2006). Rigid word order, which allows language users to recycle the same routinized articulation plan, can also facilitate language production (MacDonald 2013). The re-use of highly entrenched production plans in conceptually similar constituents can also account for analogy effects in word order (for instance, the order of genitive and noun matches that of adjective and noun, cf. Diessel 2019). This kind of recycling is also important for frequently co-occurring lexemes (e.g., Sasano and Okumura 2016). Since the sequential associations between linguistic units can be of different strengths (Diessel 2019), we observe gradience in the individual speaker/signer's grammatical knowledge and in their interaction with other users. Highly frequent sequences have strong sequential associations due to automatization and chunking (Bybee 2010; Diessel 2019). Frequency-driven grammaticalization is usually accompanied by a loss of syntagmatic freedom (Lehmann 1995 [1982]). We can therefore speak of a trade-off between the ease of production of individual words and their sequences. Different languages and varieties have different points of balance, leading to cross-linguistic gradient patterns. The differences between languages can be explained by particular features of different grammars: cross-linguistic studies reveal that grammatical features of individual languages (e.g., verb agreement or aspectual marking) constrain language production (e.g., Kimmelman 2012; Norcliffe et al. 2015; Nordlinger et al. 2022; Proske 2022; Sauppe et al. 2013), even at the earliest stages of sentence planning. Taking a gradient approach to how these factors interact is crucial for capturing both the distribution of cross-linguistic variation and the dynamics of within-language variation.
Although research on comprehension and production regularly uses different methods and tests mode-specific hypotheses, it is likely that each makes use of the same underlying representational vocabulary and processes (Momma and Phillips 2018). Thus, all things being equal, we should expect many of the constraints on word order gradience in production to determine expectations about word order variability in comprehension (see MacDonald 2013, and commentaries, for discussion of one proposal along these lines), although exactly what those constraints are, and how they are implemented in parsing and production architectures, is still an open question.

Language acquisition
The outcomes of learning vary at different points across the lifespan. By and large, children faithfully replicate the stable distributional properties of their input, thus maintaining the variability observed cross-linguistically in word order patterns. 12 That is, an English-acquiring child will acquire the variability exhibited in English, and a Latvian-acquiring child will acquire the variability exhibited in Latvian. Word order flexibility is in principle not problematic for acquisition, since children acquire so-called free word order languages such as Warlpiri without difficulty (e.g., see Bavin and Shopen 1989). All this means that the processes of language development allow for the learning of gradient patterns and support the maintenance of word order variability if it exists in the ambient language.
As for adult L2 learners, experimental evidence shows that they have a strong bias towards regularity during language diffusion (Smith and Wonnacott 2010). In these experiments, input languages exhibiting free variation become increasingly regular (Wonnacott and Newport 2005). 13 Also, it seems that adults are good at learning abstract patterns that can be captured by a few simple rules, while children are better at memorizing strings without any underlying rules (Nowak and Baggio 2017). Based on this, we can hypothesize that a high number of L2 adult learners may result in lower variability in word order in the target language, although this needs to be further investigated. Taking a gradient approach to word order flexibility would allow for a more nuanced picture which could potentially incorporate more complex contexts of language contact (see Section 2.4). 14 12 Although in the case of variable or incomplete input, word order preferences may emerge spontaneously, such as in children learning homesign (Goldin-Meadow and Mylander 1983). 13 At the same time, adults tend to probability-match free variation in an input language under certain conditions more than children do (Hudson Kam and Newport 2005), but this effect is restricted by different factors. In particular, it is observed when adults reproduce already familiar input and when the free variation is between only two alternatives. 14 Cf. related work by Bentz and Winter (2013) which shows a statistical tendency for reduced case marking in languages with more L2 users.

Language change
While word order change may be caused by language contact (see Section 2.4, and also Bradshaw 1982; Doğruöz and Backus 2007; Heine 2008, etc.), including substrate influence, this section focuses on internal change, which is influenced by the cognitive factors discussed above. 15 These changes are transferred and/or amplified via language acquisition, which leads to language- or lineage-specific pathways in word order change (cf. Dunn et al. 2011). Hawkins (1983: 213) states that word order change takes place through "doubling": a new order comes in, co-exists with the old one, and replaces the old one through increase in frequency of occurrence and grammaticalization. We can represent this scenario informally as follows: WO_A > WO_A & WO_B > WO_B. This may not be true for all word order change, but this process is a basic benchmark. Hence, change in word order is inherently tied up with variation. For example, Bauer (2009) describes word order change in Latin as part of a general tendency or "drift" (Sapir 1921) in Indo-European, with change away from rigid left-branching (OV) order in Proto-Indo-European (Bauer 2009: Section 2.1), towards flexible right-branching (VO) order in Latin, to more rigid right-branching order in Romance.
Importantly, this change happens first at the level of specific constructions, not sweepingly across all constituents at once. 16 The doubling and word order variability are a result of these local changes. Bauer (2009: Section 3) describes in detail the consecutive order in which constructions changed from left-branching to right-branching, with verb-complement order changing after right-branching relative clauses, prepositions, and noun-adjective order had emerged. Variation arose through a long-term development away from left-branching structure: postpositions became archaic as prepositions arose, and the two remained in non-free variation (i.e., some lemmas were postpositional, others prepositional) for an extended period of time. In addition, stylistic and pragmatic word-order variation arose, such as fronting of the subject and the verb. Several of these pragmatic word orders have been grammaticalized in the Romance languages, most importantly the use of the cleft construction for emphasis. All of these changes are interdependent and dependent on the previous state of the language; i.e., Bauer (2009: 306) emphasizes that variation is not arbitrary, but "in fact [is] rooted in the basic characteristics of the language and [is] connected to many other linguistic features".
15 Substrate influence versus internal change have been suggested for change to VSO in Celtic; Pokorny (1964) argues for a substrate influence account, Watkins (1963) and Eska (1994) for a language-internal account. 16 Though cf. Kroch (1989) and Taylor and Pintzuk (2012) who argue that English OV to VO change happened in a categorical manner.
However, variation in word order does not always imply (ongoing) language change. Word order variability can be stable for centuries. Examples are adjective-noun and noun-adjective order in Romance and many other languages, such as in Greater Philippine (Austronesian), where four of the six languages in Dryer (2013a) show variable order and at least seven more languages have both orders (Santo Domingo Bikol in Fincke [2002: 93-94]; Mandaya in Estrera [2020: 26-27]; Mansaka in Svelmoe and Svelmoe [1974: 51]; Kalagan in Collins [1970: 66-67]; Cebuano in Tanangkingsing [2009: 134]; Romblomanon in Law [1997: 19-20]; Inonhan in van den Heuvel [1997: 39]). Other examples are verb-object order in main/subordinate clauses in Germanic and many other languages, such as Western Nilotic (for Dinka and Nuer see Salaberri [2017]; other relevant Western Nilotic languages are Jumjum in Fadul et al. [2016: 39] and Reel in Cien et al. [2016: 88]), and the co-existence of prepositions and postpositions in Gbe and Kwa languages (Aboh 2010; Ameka 2003; see in general Hagège 2010: 110-114). There are usually different diachronic sources for each part of the doublet (Hagège 2010; Hawkins 1983), but this does not necessarily imply that one word order will outcompete the other over the course of generations. While word order change implies variability, variability does not necessarily imply change.
On the other hand, language change can involve rigidification and loss of flexibility of frequently used constructions. Croft (2003: 257-258), citing Lehmann (1995 [1982]) and Heine and Reh (1984), calls the grammaticalization of word order rigidification, "the fixing of the position of an element which formerly was free" (see also Lehmann 1992). See Hawkins (2019) for the relation between rigidification and Sapir's (1921) notion of drift, and Harris and Campbell (1995: Ch. 8) for a summary of word order change.

Language contact and vernaculars
While borrowing of orders wholesale from one language to another is one potential outcome, language contact can also have a non-obvious effect on word order variability. For example, Heine (2008) discusses change to flexibility as one potential outcome of language contact. Contact may lead to increased or decreased variability. This has been documented in a number of Australian Aboriginal languages, which have famously flexible word order, after they came under pressure from the relatively more rigid English or Kriol (e.g., Lee 1987; Meakins 2014; Meakins and O'Shannessy 2010; Richards 2001; Schmidt 1985a, 1985b). In this situation, the frequency of SVO orders may increase, and the degree of flexibility (entropy) may also decrease. This is the main conclusion of Namboodiripad et al. (2018), who found that English-dominant Korean speakers showed a greater relative preference for the canonical SOV order than Korean-dominant speakers; that is, they rated non-canonical orders relatively low as compared to the canonical SOV order. Thus, we see a significantly lower constituent order flexibility correlating with more English dominance, but not overt borrowing of the dominant English order into Korean. A greater preference for canonical order corresponding to increased language contact was also found within a community of Malayalam speakers by Namboodiripad (2017).
Depending on the social context, language contact within a community can often correspond to intergenerational variation. An example can be found in Pitjantjatjara, a Pama-Nyungan language of Central Australia, which has traditionally been described as having a dominant SOV order, with a great degree of flexibility. Bowe (1990) finds SOV order to be approximately twice as frequent as SVO order (although clauses with at least one argument omitted are much more frequent in her corpus). Langlois (2004), in her study of teenage girls' Pitjantjatjara, finds that this is reversed, and SVO order is 50% more frequent than SOV in her sample. An experimental study of word order in Pitjantjatjara (Nordlinger et al. 2020;Wilmoth et al. Forthcoming) also substantiated this general trend; the younger generations had a higher proportion of SVO order, presumably as a result of more intense language contact and more English-medium education. However, while the distribution of different orders appears to be changing, the overall degree of flexibility as measured by entropy (see Section 4.1.1) remained approximately stable among all generations of speakers. There was some effect of older female participants, many of whom had worked as teachers or translators, being more rigidly SOV. This may be a result of their prescriptive attitudes interacting with the experimental setting.
The presence or absence of pressure from prescriptive norms in a particular language variety is an important and relatively under-investigated factor in the dynamics of word order. Specifically, vernacular varieties are relatively free from the pressure of prescriptive norms and are more inclined to word order variation compared to "standard" varieties. An example is reported in the study by Naccarato et al. (2020) on genitive noun phrases in several spoken varieties of Russian. In Standard Russian, the neutral and most frequent word order in genitive noun phrases is Noun followed by Genitive modifier, while two types of vernacular Russian, dialects and contact-influenced varieties, often show the alternative order Genitive modifier followed by Noun. Interestingly, kinship semantics, which was the strongest factor affecting the choice of a specific word order, turns out to be the same for both dialectal and contact-influenced Russian (irrespective of the area and the indigenous languages spoken there). This situation resembles instances of so-called "vernacular universals", that is, features that are common to different spoken vernaculars (Filppula et al. 2009; see also Röthlisberger and Szmrecsanyi 2020), which are sometimes interpreted as "natural tendencies" of language in the sense of Chambers (2004). If this is the case, we could argue that word order rigidity is imposed through standardization, and variation might be the result of the natural development of language systems that, due to geographical and historical reasons, are less tightly connected to a particular standard, or which have resisted or rejected standardization.

What we (would) miss out without gradient approaches
The previous section demonstrated that word order variability is supported by diverse cognitive and social factors. How all these factors interact with each other is not yet fully understood. The case studies presented above have also shown that investigating gradient patterns and their causes is a fruitful endeavor. In this section, we strengthen this argument by presenting a series of research questions that could not be asked or answered without applying a gradient approach systematically.

Language processing
A focus on word order gradience is very much the bread and butter of psycho- and neurolinguistic studies of grammar, where substantial effort has been invested in explaining ordering constraints on production and the implications of word order variability (and its correlation with grammatical structure) for comprehension. As an illustration, consider corpus-based research on dependency length minimization (DLM), which was discussed in Section 2.1. Some studies suggest that the crosslinguistic variation in the effect of dependency length is due to the fact that certain languages have more word order freedom (Futrell et al. 2015a; Futrell et al. 2020; Gildea and Temperley 2010; Yamashita and Chang 2001). The argument goes that with more ordering variability, the ordering preferences might be less subject to DLM and possibly abide more by other constraints such as information structure (Christianson and Ferreira 2005). Nevertheless, recent findings from Liu (2021) showed that this is not necessarily the case. Using syntactic constructions in which the head verb has one direct object noun phrase dependent and exactly one adpositional phrase dependent adjacent to each other on the same side (e.g., Kobe praised his oldest daughter from the stands), the results indicated that there is no consistent relationship between overall flexibility and DLM. On the other hand, when looking at specific ordering domains (e.g., preverbal vs. postverbal), there is on average a very weak correlation between DLM and word order variability at a constructional level in preverbal contexts (e.g., preverbal orders in Czech), while no correlation, either positive or negative, seems to exist between the two in postverbal constructions (e.g., postverbal orders in Hebrew). Another illustration is the connection between prosody and syntax, which is an active area of investigation (Eckstein and Friederici 2006; Franz et al.
2020; Kreiner and Eviatar 2014; Luchkina and Cole 2021; Nicodemus 2009; Vaissière and Michaud 2006). The work that has investigated the connection between prosody and word order from a phonetic perspective does take phonetic gradience into account, and adding a gradient approach to word order further enriches this area of research. For example, Šimík and Wierzba (2017) combine gradient constraints in prosody and syntax to model flexible word order across three Slavic languages. They combine this with a gradient measure of flexibility in the form of acceptability judgments and find complex interactions between prosody and information structure (see also Gupton 2021). Asking and answering these questions about language production and comprehension would be impossible without taking a gradient approach.

Language acquisition
Conceptualizing word order in terms of gradience naturally aligns with the large focus on language acquisition as a cue-based process, which is most notable in probabilistic functionalist approaches such as the Competition Model (Bates and MacWhinney 1982) or constraint-satisfaction (Seidenberg and MacDonald 1999) but is also a feature of formal approaches such as Optimality Theory (Prince and Smolensky 2004). The challenge for acquisition researchers is to determine cue-based hierarchies across languages and how they interact; in addition, they must determine whether children bring pre-existing biases to this problem, such as the oft-cited preference to produce agents before lower-ranking thematic roles (Jackendoff and Wittenberg 2014), and to interpret early appearing nouns as agents (e.g., Bornkessel-Schlesewsky and Schlesewsky 2009).
For instance, while both German and Turkish allow for flexibility in the order of the main constituents, with thematic roles marked via case, the transparency of the Turkish case system relative to German means that Turkish-acquiring children find variability in word order less problematic than their German-acquiring counterparts, who tend to prefer to attend to word order over case until they are much older (Dittmar et al. 2008; Slobin and Bever 1982; Özge et al. 2019). In the case of German, it appears that children settle on word order as a "good enough" cue to interpretation while they acquire the nuances of the case system (for a similar effect in Tagalog, see Garcia et al. 2020), although they are also sensitive to deviations in word order marked by cues such as animacy (Kidd et al. 2007).
A focus on cue-weightings (or constraints) has the potential to both inform the creation of more dynamic models of acquisition (thereby linking them to theories of adult language processing and production), and also to force questions common to acquisition concerning representation. Namely, are the cues coded in the structure, as might be argued in construction/usage-based approaches (e.g., Ambridge and Lieven 2015), or do they guide, but are independent of the sequencing choices of the processing system? For instance, do children acquiring European languages like English or German learn non-canonical structures such as object relative clauses as constructions by encoding common properties such as [-animate Head Noun] as part of a prototype structure (e.g., Diessel 2009), or do these distributional properties provide cues to a more abstract structure building mechanism? Understanding what causes word order gradience in the target languages and how children identify these variables is key to answering these questions.

Language change
Word order has typically been measured in a categorical fashion in typology, making gradient phenomena somewhat marginal (see Section 3.5). We can observe the same in historical linguistics: Harris and Campbell (1995: 198) point out that word order reconstructions have ignored frequently attested alternative word orders. Barðdal et al. (2020: 6) hold that under a constructional view this should not be allowed, since "variation in word order represents different constructions with different information-structural properties, as such constituting form-meaning pairings of their own, which are by definition the comparanda of the Comparative Method". Modeling sentential word order change in terms of categorical bins such as SVO, VSO, etc. does not allow for the inclusion of doubling, the intermediary step of an alternative word order arising; the process of rigidification is glossed over completely (see Section 2.3). Excluding variation from reconstruction makes reconstructions incomplete at best, and wrong at worst. See Section 4.4 for a discussion of how to incorporate gradient measures in a phylogenetic analysis, and which caveats can arise when doing so.

Language contact
A gradient approach to word order also proves useful in studies of language contact. Although syntactic changes usually start at an intense stage of contact, word order is one of the linguistic features that are more easily borrowed (cf. Thomason 2001: 70-71; Thomason and Kaufman 1988: 74-76). Investigations of contact-induced word order variation in the world's languages include studies focusing on word order at the sentence level (cf. numerous examples listed in Heine 2008: 34-35 and Thomason and Kaufman 1988: 55), and studies devoted to variation within the noun phrase, e.g., focusing on the order of the head noun and its genitive modifier. See, for instance, Leisiö (2000) on Finnish Russian, some observations in Sussex (1993) on émigré Polish, and Friedman (2003) on Macedonian Turkish. Most of these studies take a more categorical approach, focusing on the addition or loss of particular orders.
However, word order variability has been (indirectly) implicated as a potential source of the facility of contact-induced change in this domain, via convergence (e.g., Aikhenvald 2003): More flexible languages are more likely to have more frequent alternative orders, and thus are also more likely to have convergent structures with their contact languages. As such, contact could boost the frequency or change the status of a previously less frequent or non-canonical but still attested order (e.g., Manhardt et al. 2023). So "borrowing" of an order in this case could simply be a change in status of a previously derived or less frequent order to becoming a canonical or more frequent order, as in the case of Pitjantjatjara discussed in Section 2.4. Measuring the degree of flexibility, and accounting for the relative status of all grammatical orders in a language, allows us to see these types of contact effects, which might not be otherwise detectable.
An illustration of the usefulness of gradient approaches is a study of English-dominant and Korean-dominant Korean speakers by Namboodiripad et al. (2018), which was mentioned in Section 2.4. They found that English-dominant Korean speakers rated the canonical SOV order the highest, followed by OSV, then the verb-medial orders, with verb-initial orders rated the lowest of all. This is the same pattern shown by age-matched Korean speakers who grew up and live in Korea. However, there was a significant quantitative difference based on the degree of contact with English, as discussed above: more precisely, the English-dominant Korean speakers showed a greater relative preference for the canonical SOV order than Korean-dominant speakers. In taking a gradient approach, we are able to see contact effects that might not otherwise be visible. That is, even when we do not see an outright change to the basic constituent order in a language, or even if we do not see the addition of a constituent order due to contact, we might see a decrease in the number of possible orders, or a decrease in the relative frequency or acceptability of some orders (see Section 4.2 for more details on this method).
Moving to another example, what at first glance looks like the result of L1 calquing is actually the by-product of a more complex interaction of different factors. In previous research it has frequently been pointed out that contact in fact reinforces some existing language-internal tendencies (cf. Poplack 2020; Poplack and Levey 2010; Thomason and Kaufman 1988: 58-59). A similar conclusion is drawn in a recent study by Naccarato et al. (2021) devoted to word order variation in noun phrases with a genitive modifier in the variety of Russian spoken in Daghestan. In this variety of L2 Russian, the Genitive-Noun order in noun phrases is unexpectedly frequent, whereas in Standard Russian the Noun-Genitive order is the neutral and by far the most frequent option (see also Section 2.4). A quantitative analysis of the Daghestanian data and the comparison with monolinguals' varieties of spoken Russian suggest that contact is not the only factor involved in word order variation in Daghestanian Russian. Rather, L1 influence in bilinguals' speech (here, Nakh-Daghestanian or Turkic languages) seems to interact with some language-internal tendencies in Russian that, to a certain extent, are observed in non-bilinguals' varieties too. Without a gradient approach, it would be impossible to test this interaction.

Linguistic typology
Due to theoretical and practical reasons (e.g., the over-emphasis on basic word order in syntax, and the level of granularity available in grammar descriptions), typologists have mostly coded word order using categorical variables. The famous Greenbergian "correlations" (Greenberg 1963), for instance, represent associations between categorical variables, e.g., VO/OV and pre- or postpositions. This practice is responsible for two types of data reduction in typology (cf. Wälchli 2009): first, languages with low word order entropy are more likely to be described accurately and systematically than languages with high entropy; second, word order variables studied by typologists are biased towards those that represent bimodal distributions. This can have consequences for characterizing, documenting or revitalizing a language.
Using a gradient approach can help us deal with these limitations, exploring the points between the extremes. Moreover, it allows us to include a wider range of constructions. For example, oblique nominal phrases and adverbials are highly variable positionally with regard to their heads, which is why they are less frequently considered in word order correlations than the "usual suspects", such as the order of Verb and Object (Levshina 2019). A gradient approach also allows us to formulate more precise predictions for typological correlations and implications, minimizing the space of possible languages and making the universals more complete (see Gerdes et al. 2021; Levshina 2022; Naranjo and Becker 2018). For example, instead of including only predominantly VO or OV languages, we can also make predictions for more flexible languages which may not fit neatly into one category or another.
Consider an illustration. It is well known that there is a negative correlation between word order rigidity and case marking (Sapir 1921; Sinnemäki 2008). This categorical information can be represented by continuous variables, as shown in Figure 3, which is based on data from 30 online news corpora. Word order variability is measured as the entropy of Subject and Object order (see Section 4.1.1), while case marking is represented as the Mutual Information between case markers and the syntactic roles of Subject and Object. It shows how much information about the syntactic role of a noun we have if we know the case form, or, in other words, how systematically cases can actually help one guess whether the noun is Subject or Object in the corpus (see Levshina 2021b for more details). The dataset is available in the OSF directory as Dataset3.txt. The plot shows clearly that the correlation is observed across the entire range of values of both variables. In particular, languages with differential marking tend to occupy the mid-range of word order variability, while languages with very little overlap between Subject and Object case forms, like Lithuanian and Hungarian, allow for maximal freedom (Levshina 2021a). This supports the functional-adaptive view of language, showing that case and word order are not inherent built-in properties, but efficiently used communicative cues (cf. Koplenig
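Mutual Information of this kind can be estimated directly from co-occurrence counts of case forms and syntactic roles. The following is a minimal sketch in Python; the function and the toy counts are our own illustration, not the data behind Figure 3:

```python
import math

def mutual_information(joint_counts):
    """I(X;Y) in bits, from a dict mapping (case, role) pairs to counts."""
    total = sum(joint_counts.values())
    case_totals, role_totals = {}, {}
    for (case, role), c in joint_counts.items():
        case_totals[case] = case_totals.get(case, 0) + c
        role_totals[role] = role_totals.get(role, 0) + c
    mi = 0.0
    for (case, role), c in joint_counts.items():
        p_joint = c / total
        p_indep = (case_totals[case] / total) * (role_totals[role] / total)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# Toy counts: nominative mostly marks Subjects, accusative mostly Objects
counts = {("NOM", "Subj"): 900, ("NOM", "Obj"): 50,
          ("ACC", "Subj"): 10, ("ACC", "Obj"): 540}
print(round(mutual_information(counts), 3))
```

If case identifies the role perfectly, MI equals the entropy of the role distribution; if case and role are statistically independent, MI is 0. This mirrors the cue-reliability interpretation in the text: the higher the MI, the more systematically case forms disambiguate Subject from Object.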

Summary
As demonstrated in this section, gradient approaches have already allowed researchers to discover many important facts about word order across different linguistic domains, from language contact to prosody. Without a gradient perspective, these phenomena could not be investigated. Using a gradient approach also allows us to include more constructions and languages when testing theoretical hypotheses, avoiding data reduction. At the same time, the potential of gradient approaches has not been fully tapped. An important reason is the numerous methodological challenges and caveats. In the next section, we discuss them and propose some solutions.

How to investigate gradience? Challenges and some solutions
In this section, we delve into the how of implementing our gradient approach. As we have discussed, methodological limitations have been one barrier to more widespread adoption of gradient approaches to word order. Here, we discuss these challenges along with current solutions, covering corpora (Section 4.1), experimental methods (Section 4.2), fieldwork practices (Section 4.3), and phylogenetic comparative methods (Section 4.4). Along the way, we present some novel analyses of corpora and experimental results using these approaches.

Gradient measures and sample size
Recent work on word order variability tends to use a gradient approach to provide a more accurate picture of word order patterns in language (Ferrer-i-Cancho 2017; Futrell et al. 2015b; Hammarström 2016; Levshina 2019; Östling 2015). Most of these studies are based on large-scale annotated corpora such as the Universal Dependencies (Zeman et al. 2020; also see Croft et al. [2017] for a discussion of using UD corpora to address questions in linguistic typology). Two common gradient measures are the proportion of a particular word order and the Shannon entropy of a particular word order. As an example, let us imagine that we are interested in the order of the object and the verb in a language. One straightforward way to measure the stability of this order in a language is to consider the proportions of OV and VO orders. Another frequently used measure is Shannon's entropy (Shannon 1948), which represents the uncertainty of the data. The higher the entropy (which ranges between 0 and 1 in the case of two possible outcomes), the more variable the word order. For instance, a distribution of 500 sentences with OV and 500 sentences with VO would result in an entropy of 1. 17 If the word order is always OV or always VO, the entropy will be 0. If OV is observed 95% of the time, the entropy will be equal to −[(0.05 * log2(0.05)) + (0.95 * log2(0.95))], which is 0.2864. 18 Generally speaking, the choice between entropy and simple proportions depends on what is more important for the researcher: the reliability of word order as a cue for interpreting the speaker's message, or the preference for a specific word order. That said, entropy is particularly useful when considering word order variability in the case of more than two possibilities, e.g., the order of S, V, and O. While these measures are relatively established, the effect of corpus size is frequently mentioned as a potential source of error.
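The computation can be sketched in a few lines of Python (a minimal illustration with our own function name; the counts are hypothetical):

```python
import math

def order_entropy(counts):
    """Shannon entropy (in bits) of a distribution of word order counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 500 OV vs. 500 VO clauses: maximal uncertainty for two outcomes
print(order_entropy([500, 500]))          # 1.0

# 95% OV vs. 5% VO: the worked example from the text
print(round(order_entropy([95, 5]), 4))   # 0.2864
```

With three or more attested orders (e.g., SOV, SVO, VSO), the same function applies, and the maximum entropy becomes log2 of the number of orders rather than 1.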
As an example, "the corpora available vary hugely in their sample size, from 1,017 sentences of Irish to 82,451 sentences of Czech. An entropy difference between one language and another might be the result of sample size differences, rather than a real linguistic difference" (Futrell et al. 2015b: 93-94). The question of how large a corpus should be to provide robust measures of word order is very difficult because the answer depends on the frequency of the construction we investigate, which can also vary substantially across languages and genres. Here, we will focus on how many instances of a construction are necessary to measure entropy for the relative ordering of Subject and Verb (S&V) and Verb and Object (V&O). We aim to show what the appropriate sample size might be.
Let us consider the order of S&V and V&O within a sample of 30 languages extracted from the Universal Dependencies: Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Danish, Dutch, English, Estonian, Finnish, French, German, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, and Ukrainian. This sample is obviously biased toward Indo-European languages, but it serves here solely to illustrate the methodology. To visualize the effect of corpus size, we randomly sample clauses containing a verb with a dependent nominal subject/object from the existing corpora, in increasing steps of 20 clauses. That is to say, we first sample 20 clauses, then 40, 60, and so on until we reach a size of 2,000 clauses. For each sample, we measure the proportion of the S&V and V&O orders. The data are available as Dataset4.txt in the OSF directory. Next, we calculate the entropy as described above and also the difference as compared to the entropy measured with the corpus size set at 2,000 clauses. This difference is called here the "gap of entropy". The results averaged across all languages are plotted in Figure 4. 17 It is worth pointing out that different studies may use variants to calculate the entropy, and these variants may be affected in different ways by corpus size. For instance, Futrell et al. (2015b) have a more complex definition of word order entropy (as opposed to the coarse-grained entropy we use in our examples), which is conditional on several factors and may be more unstable in small corpora. 18 See www.shannonentropy.netmark.pl for a live calculator of the entropy.
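The sampling procedure can be sketched as follows. This is a simplified stand-in for the actual analysis: we draw from a toy population with a fixed OV/VO split rather than from the UD corpora, and all names and proportions are our own:

```python
import math
import random

def entropy(labels):
    """Shannon entropy (bits) of the distribution of labels in a sample."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_gaps(population, sizes, seed=42):
    """For each sample size, the absolute difference ('gap of entropy')
    from the entropy measured at the largest sample size."""
    rng = random.Random(seed)
    reference = entropy(rng.sample(population, max(sizes)))
    return {n: abs(entropy(rng.sample(population, n)) - reference)
            for n in sizes}

# Toy population of clauses: 70% OV, 30% VO (hypothetical proportions)
population = ["OV"] * 7000 + ["VO"] * 3000
gaps = entropy_gaps(population, sizes=range(20, 2001, 20))
print(gaps[20], gaps[2000])
```

Plotting the gaps against sample size reproduces the qualitative pattern described below: estimates are unstable for very small samples and level off as the sample grows.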
The results indicate that very small samples yield unstable estimates, as the entropy is likely to be biased by the particular sentences sampled, but sample sizes of more than 300 clauses produce entropy gaps of less than 0.1. If we zoom in on a few individual languages, such as Basque, English, French, and German (see Figure 5), and plot the entropy measures, we observe that the entropy varies considerably within approximately the first 500 clauses but stabilizes after that threshold. This means that a sample size of 500 clauses or larger is likely to yield similar results for the S&V and V&O orders.
Additional computational measures can be used to quantify where exactly the entropy converges for each language (see Janssen 2018: 79 and Josserand et al. 2021: 10 for examples of how to calculate where a curve stabilizes). Nevertheless, this example provides a quick-and-dirty demonstration of how sample size could be considered in studies related to word order. We recommend that such analyses be performed on different word order categories and language families to investigate to what extent the entropy measure depends on the sample size crosslinguistically and/or across different constructions.
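A deliberately crude stand-in for the convergence measures cited above can be sketched as follows: find the first sample size from which the entropy estimate stays within a tolerance of its final value. This heuristic is our own illustration, not the method of Janssen (2018) or Josserand et al. (2021).

```python
def stabilization_point(sizes, entropies, tol=0.1):
    """First sample size from which all subsequent entropy estimates
    stay within `tol` of the final (largest-sample) value.
    A simple heuristic for locating where the entropy curve stabilizes."""
    final = entropies[-1]
    point = None
    for n, h in zip(sizes, entropies):
        if abs(h - final) <= tol:
            if point is None:
                point = n          # candidate start of the stable region
        else:
            point = None           # curve left the tolerance band; reset
    return point
```

Applied to the per-language curves of the case study, this would return a value near the 300-500 clause threshold discussed in the text, though the exact number depends on the chosen tolerance.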

Potential biases in corpora: annotation methods
Parallel and comparable corpora of reasonable size usually rely on automatic annotation, thus introducing a potential bias depending on the quality of annotation. The quality is strongly influenced by the size and text types of the treebanks available for training the software performing the automatic annotation (the "parser").
If we take a look at the current version (at the time of writing: 2.7) of the Universal Dependencies treebanks, we see that just 21 languages out of 111 have more than 3 treebanks. These languages are mostly national languages of Asian and European countries, and among them, only Arabic, Czech, French, German, Japanese, Russian and Spanish have treebanks with a total of more than 1M tokens. In order to quantify the annotation quality bias both within and across languages, we ran a comparison between the automatic annotation of the parallel corpus of Indo-European Prose and More (CIEP+: Talamo and Verkerk 2022) and the corresponding UD treebanks used for training the parser. 19 Our comparison focuses on the Shannon entropy of the relative order of nominal heads and modifiers in eleven languages from five different genera of the Indo-European family. Furthermore, we limit our analyses to the following syntactic relations/UD relations: adpositions (case), determiners (det), 20 adjectival modification (amod) and relative clauses (acl:relcl). 21 The results are plotted in Figure 6: a comparison between the entropy of four nominal dependencies in eleven languages from two data sources, the automatically-parsed CIEP+ corpus and the UD treebanks that were employed for training the parser. The dataset is available as Dataset5.txt in the OSF directory, which also contains the code for the statistical analysis.

19 UD v.2.5, except for Welsh: v.2.6.

20 Italian, Polish and Portuguese have specific UD relations for possessive pronouns and/or quantifiers; for data consistency, we have added the entropy of these relations to the entropy of the det relation.

21 The UD German parser (model: GSD) has very limited support for acl:relcl; accordingly, we do not have data for CIEP+ relative clauses in German.

Assuming that the UD treebanks are manually annotated and/or corrected data sources, and thus serve as gold standards, the different, and in many cases higher, rates of entropy in the CIEP+ corpus are likely due to wrong annotations, or "noise". Yet, despite some outliers, such as amod and det in Polish and case in Dutch, the amount of noise in automatic annotations is overall low. A regression model of the absolute differences between entropies estimated from CIEP+ and UD, with languages and dependencies as random intercepts, shows that the expected absolute difference (represented by the intercept) is 0.05. It is statistically significant (t = 3.429). 22 Although the acceptability of this amount of noise depends on the level of precision required for a specific research question, this value can be seen as relatively low.
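The core of this comparison can be sketched with stdlib Python. For simplicity, we replace the mixed model with a plain one-sample t-statistic on the absolute entropy differences, so this is a hedged simplification of the analysis described in the text; the variable names and toy entropy values are illustrative, not the actual CIEP+/UD figures.

```python
import math

def abs_entropy_diffs(parsed, gold):
    """Absolute entropy differences for (language, dependency) pairs
    present in both the automatically parsed and the gold source."""
    return [abs(parsed[k] - gold[k]) for k in parsed if k in gold]

def t_statistic(diffs):
    """One-sample t-statistic testing whether the mean absolute
    entropy difference differs from zero (a simplification of the
    mixed model with random intercepts reported in the text)."""
    n = len(diffs)
    m = sum(diffs) / n
    sd = math.sqrt(sum((d - m) ** 2 for d in diffs) / (n - 1))
    return m / (sd / math.sqrt(n))

# Hypothetical entropies for (language, dependency) pairs
ciep = {('Dutch', 'case'): 0.21, ('Polish', 'amod'): 0.95, ('French', 'det'): 0.04}
ud   = {('Dutch', 'case'): 0.05, ('Polish', 'amod'): 0.70, ('French', 'det'): 0.03}
t = t_statistic(abs_entropy_diffs(ciep, ud))
```

In the real analysis, the random intercepts absorb language- and dependency-level variation that this flat t-test ignores, which is why the published intercept and t-value cannot be reproduced from a sketch like this.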
Note that source languages may influence target languages in translation (Johansson and Hofland 1994; Levshina 2017), such that the distributions of words or grammatical constructions may differ in translated versus spontaneously-generated texts. However, our results suggest that this is not the case in most languages and dependencies. Needless to say, both sources of bias (i.e., annotation and translationese) require further investigation, especially for low-resource languages, but we hope that our case study gives some reasons for cautious optimism with regard to the practical feasibility of the gradient approach.
From a more theoretical perspective, we should add that the UD do not reflect categorial universalism in the sense that they do not presuppose any innate set of categories (cf. Haspelmath 2010). The UD were developed based on several desiderata, which are often in conflict: they should be suitable for both language description and language comparison, and for both rapid, consistent annotation by a human annotator and high-quality computational annotation. They should also be accessible to non-linguists, which explains why traditional Eurocentric terms are used. 23 According to Croft et al. (2017), most of these desiderata match the goals of a typologist. Importantly, the UD represent an inventory of universal constructions that serve some communicative functions. For example, the dependency 'case' used for adpositions reflects the fact that they are analogous to morphological case marking. This makes it possible (together with the annotation of morphological case in a separate annotation layer of the UD) to extract all instances of the universal construction whose function is to relate an argument dependent to its head (Croft et al. 2017: 65-66).
If annotators believe that it is important to reflect idiosyncrasies of a particular language, they can add extensions of the annotation tag to make it more fine-grained. For example, for some Slavic languages, the UD include 'det:nummod', which marks a special type of quantifier that does not agree with the quantified noun in case and requires the counted noun to be in the genitive. Another option is to introduce a special tag, e.g., for classifiers in Chinese. Importantly, these differences should be well documented. A linguist planning to use the UD corpora and tools should first check how the categories of interest are encoded in the corpora. But such a check is necessary, in principle, for all kinds of cross-linguistic data.
We would like to acknowledge that in some cases, the UD framework is not yet directly applicable. For instance, the UD approach is based on the annotation of words, which is not meaningful for the analysis of polysynthetic languages. At the same time, the UD annotation guidelines allow flexible modifications and extensions. For example, we could treat each individual morpheme within a polysynthetic word as the equivalent of 'one word' in an English sentence, and then design additional de-

The influence of register, text type, and modality
In addition to the problems of corpus size, annotation quality and translationese, another potential source of bias is different linguistic registers and text types, which can have different word order distributions (Batova 2020; Brunato and Dell'Orletta 2017; Panhuis 1981). The extent of this variability is an open question: some evidence suggests, in particular, that register and text type do not have a significant impact on overall patterns of dependency direction (Liu 2008, 2010). For many languages and constructions, however, we lack sufficient evidence. The most obvious problem is the bias towards written texts. In particular, the UD corpora are almost universally written, rather than spoken or signed. In fact, out of the 111 varieties represented in UD 2.7, only 22 include spoken or signed data, and these data are not always clearly separated from written data within a given corpus (Zeman et al. 2020). For many languages, available corpora are restricted to news and web-crawled texts (e.g., the Leipzig Corpora Collection, Goldhahn et al. 2012). Collections of texts containing the same content in different languages, or parallel corpora (Cysouw and Wälchli 2007), may help to mitigate this problem by providing new text types (e.g., fiction or film subtitles).
The importance of modality for the estimation of word order is obvious from the following case study. We took spontaneous conversations, fiction and online news in Russian (from the 1980s to the present day), focusing on the order of finite verbs and nominal objects in declarative sentences. Each text type was represented by 100 observations. The data are available as Dataset6.txt in the OSF directory. There was an obvious difference between the conversations and the written texts: the conversations showed a preference for OV, whereas the written texts favored VO. In order to control for potential accessibility effects associated with the length and animacy of the nominal objects, a multiple logistic regression was fitted, with the word order as the response variable (OV or VO), and text type, animacy, and log-transformed length of the object as the predictors. As was shown in Section 2.1, it is advantageous for language users to produce more accessible (short and animate) constituents first. This is why we could not exclude that the preference for OV in the conversations could be explained by a preference for short and animate objects. The modeling results indicate, however, that modality has a significant effect on the use of word order. Both of the written text types had higher chances of VO in comparison with the conversations (Fiction: b = 1.86, p < 0.001; News: b = 1.42, p < 0.001). The predicted probabilities of VO are shown in Figure 7. In addition, we found that longer objects had higher chances of being placed after the verb, as we would expect (b = 0.71, p = 0.004). Animacy did not have a significant effect (p = 0.443), probably due to the low frequency of animate objects in the data (only 13%). In addition to differences in word order variability, modalities and text types can also differ with regard to other features, such as argument drop (e.g., Ueno and Polinsky 2009; Zdorenko 2010).
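The modeling step can be sketched as follows. Since the original analysis was presumably run in standard statistical software, this stdlib gradient-descent fit is only a stand-in for a multiple logistic regression, and the toy data are invented; the dummy coding of text type (conversation as the reference level) is our assumption.

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Logistic regression via stochastic gradient descent.
    X: feature vectors, e.g. [is_fiction, is_news, log_object_length];
    y: 1 for VO, 0 for OV. Returns [intercept, b1, b2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p
            w[0] += lr * err
            for j, xj in enumerate(xi):
                w[j + 1] += lr * err * xj
    return w

def prob_vo(w, features):
    """Predicted probability of VO order for a feature vector."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], features))
    return 1.0 / (1.0 + math.exp(-z))
```

With dummy-coded text type, positive fitted coefficients for the fiction and news dummies would correspond to the higher odds of VO in written text types reported above.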
These differences can have implications for other popular continuous measures related to word order, such as dependency lengths (see Section 2.1). A study by Kramer (2021) using dependency-parsed YouTube captions to generate naturalistic spoken corpora in seven languages (English, Russian, Japanese, Korean, French, Italian, and Turkish) found that while dependency length minimization in speech patterned similarly, at a broad level, to dependency length minimization in written UD corpora, the difference in minimization across speech and writing varied in a consistent way. In particular, "flexible", argument-dropping languages that were more head-final (Japanese, Korean, Turkish) showed more minimization in speech than in writing, whereas languages with similar properties that were less head-final (Russian, Italian, French) showed less minimization. Finally, English did not differ significantly across modalities (Kramer 2021).
Based on these case studies, we can make some preliminary conclusions. First, we do not need very large samples to obtain relatively stable estimates of word order entropy. Second, we have not found strong indications against using automatically annotated corpora and translated texts. Third, we need to be careful about using comparable genres and modalities (cf. Levshina 2022). Although more research is needed to test these preliminary conclusions and formulate best practices, it seems that computing gradient measures of word order from corpora is a realistic task if sufficient care is taken (see also Schnell and Schiborr [2022] for a thorough review of practical and conceptual concerns associated with using corpora for studying typology).
Note that the same limitations apply in principle to the traditional sources of typological information, such as reference grammars. The 'dominant word order' of a given language could be determined after seeing a couple of sentences from a specific modality, a specific topic and/or a small hand-collected corpus. One of the advantages of using corpus data is that we are more aware of these limitations.

Capturing gradience in experiments: methods and challenges
Psycholinguistic studies of syntactic comprehension typically manipulate properties of word order in order to test models of structure building, measuring those processes by indexing processing difficulty via reaction times, reading and looking times, pupil dilations, neurophysiological data (e.g., event-related potentials), or offline accuracy. These measures are well suited for capturing gradient effects. Production studies reveal those word orders that are preferred given the properties of the message that need to be sequenced into an utterance (for examples from a language with several word order options, see Christianson and Ferreira 2005; Norcliffe et al. 2015), which allows us to investigate the weights of relevant factors (see Section 2.1). This section focuses on acceptability judgment experiments, which can be used to capture gradient patterns of variability. While the measure itself is reflective of potentially many factors (familiarity, conventionality, language ideology, processing effort), there is a robust literature on analyzing and interpreting acceptability ratings, as well as comparing these ratings to other psycholinguistic measures (see, among others, Dillon and Wagers 2021; Goodall 2021; Langsford et al. 2019; and, notably, Francis 2021 on how theoretical assumptions affect the interpretation of gradient results in acceptability judgment research). Because this method potentially yields gradient results (given that a gradient response scale is used, such as a 1-7 Likert scale; see Langsford et al. 2019 for discussion of various scales), it is suited to approaches which take gradience as a default assumption. Here, we discuss one particular way that acceptability judgment experiments have been used as a measure of constituent order flexibility (as seen in Namboodiripad 2017 and Namboodiripad et al. 2018; see also another example in Section 3.4).
In this method, participants hear all six logical orderings of subject, object, and verb, pseudo-randomly distributed amongst many other sentences of varying structure. Audio stimuli are used, in order to ensure that the proper intonation associated with each order is held constant, and to ensure that participants do not posit their own silent prosody which might affect how the sentences are parsed or interpreted (see Sedarous and Namboodiripad 2020). While the particular experiments referenced here do not include discourse context, instead holding "neutral context" constant across items, participants can be asked to rate sentences given particular discourse contexts.
For languages with canonical orders, flexibility is operationalized as the difference in acceptability between canonical and non-canonical orders; a bigger difference in acceptability between canonical and non-canonical order is interpreted as an increased preference for canonical order, and therefore decreased flexibility. As such, this measure both captures intuitions about constituent order flexibility (constituent orders feel more interchangeable or similar in flexible languages), and it is analogous to the entropy-based measures discussed elsewhere in this article (e.g., Levshina 2019). These "flexibility scores" can be calculated for individuals, or, when averaged across individuals, for particular varieties or languages. This allows for cross-variety or cross-linguistic comparison, as well as individual differences analyses (e.g., Dąbrowska 2013). Figure 8 shows flexibility scores for several languages using this method. The data are available as Dataset7.txt in the OSF directory. Of these, Avar, Malayalam, Hindi, and Korean would all be labeled as "SOV flexible" languages, were we to take a categorical approach. However, taking a gradient approach allows us to see meaningful heterogeneity within that group, as well as similarity across languages which might be categorized differently; we see continuity between the "SOV flexible" languages and English, for example. Of course, more languages are required to make stronger cross-linguistic claims, but this nonetheless shows that asking gradient questions can yield patterns which we might not otherwise have seen.
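The flexibility score computation just described can be sketched as follows. This is a schematic re-implementation based on the prose description (z-scoring within participant, then canonical minus non-canonical means), not the authors' analysis code, and the example ratings are invented.

```python
from statistics import mean, stdev

def flexibility_score(canonical, noncanonical):
    """One participant's flexibility score: mean z-scored rating of the
    canonical order minus the mean z-scored rating of non-canonical
    orders. Higher scores indicate a stronger canonical preference,
    i.e., lower flexibility. Ratings are z-scored within participant
    to normalize individual differences in scale use."""
    ratings = canonical + noncanonical
    m, s = mean(ratings), stdev(ratings)
    z = [(r - m) / s for r in ratings]
    k = len(canonical)
    return mean(z[:k]) - mean(z[k:])

# Hypothetical participants: one with a strong canonical-order
# preference, one who rates all orders similarly
rigid = flexibility_score([7, 6, 7], [2, 3, 2, 1, 2])
flexible = flexibility_score([6, 7, 6], [6, 5, 6, 7, 5])
```

Averaging such scores across participants gives the per-language means plotted in Figure 8, and the spread of individual dots supports individual-differences analyses.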
These experiments are quite portable, robust to environmental noise (especially compared to other psycholinguistic measures), and relatively straightforward to explain. Therefore, this method is suited to a wide range of contexts (various fieldwork contexts, online, and in the lab), which is ideal for cross-linguistic comparison. However, careful experimental design is required; we refer the reader to other sources for more information on experimental design (e.g., Gibson et al. 2011; Sprouse and Almeida 2017), but considerations include counterbalancing, creation of lexicalization sets, and controlling items for plausibility, familiarity, cultural appropriateness, length, and other relevant factors for the language. Non-canonical orders often have special intonational contours; sentences should be recorded to have as natural-as-possible intonation (see Sedarous and Namboodiripad 2020 for tips). As such, a crucial aspect of conducting these experiments is familiarity with the language; either working on languages one uses, in close collaboration with the language community, or both, is key.

Figure 8: Greater scores represent lower flexibility. Each light gray dot represents the mean rating for canonical order minus the mean rating for non-canonical orders for one participant; 1-7 ratings have been transformed into z-scores to account for individual differences in how the scale is used and to increase cross-experiment comparability. The languages are in order of mean flexibility, which is the mean flexibility score across participants, represented by the black dots. The bars represent two standard deviations from the mean.
Acceptability judgment experiments on their own provide partial information; they can measure individuals' relative preference for various orders in a highly controlled context. This method shares a common problem with more traditional psycholinguistic methods: it tests hypotheses about language in a restricted and somewhat artificial manner (Liu et al. 2017). While experiments are indispensable because they allow causal inference, the criticism is not without merit. It will be important both to test more naturalistic materials in ecologically valid contexts (see Kandylaki and Bornkessel-Schlesewsky 2019) and to move beyond studying psycholinguistic processes primarily via human-computer interaction towards studying conversational interaction, since it is in these contexts that language both evolved and is acquired (Levinson 2016). It is also fruitful to combine corpus-based and experimental approaches, as was done, for example, in Arnold et al. (2000).

Gradience and fieldwork practices
A gradient approach to word order faces some natural restrictions when we deal with field data, especially the less accessible data of minority, under-investigated and endangered languages. Word order variation is hard to investigate by means of the simplest elicitation tasks alone (translating sentences, grammaticality judgments), which are often used by field linguists. At the same time, experiments with a complex design involving a large number of speakers are not always possible in the field (though see Christianson and Ferreira 2005; Norcliffe et al. 2015; Nordlinger et al. 2022). In this case, textual collections remain the main source of data. They are usually much more restricted than those for better-studied languages. However, unlike many corpora of major world languages, small field corpora consist mostly of oral spontaneous texts, which are a better source of data for studies of word order variation than normalized written corpora. Data from spontaneous texts such as these can then be used to inform targeted elicitation to test the grammaticality of particular structures and combinations.
The current state of affairs can be illustrated by data from the Turkic, Tungusic and Uralic indigenous languages of Russia. Outside the gradient approach, they are usually described as OV languages. However, many of them feature VO word order as a more marginal strategy or show an ongoing change from OV to VO, which needs a separate explanation. VO word order might be motivated both by language-internal factors such as information structure (see Section 2.1) and by contact with Russian, which has predominantly verb-medial order (approximately 90% in online news corpora, according to Levshina 2021b). This puzzle is being discussed actively for different languages and on the basis of different types of field data. The few experimental studies available concern major languages of the area: for instance, Grenoble et al. (2019) investigated word order variation in Sakha (Turkic) based on data from several picture description experiments. Most recent studies focus on textual data, especially data from small field corpora, cf., e.g., Rudnitskaya (2018) on Evenki (Tungusic) and Däbritz (2020) on Enets, Nganasan (Uralic, Samoyedic), and Dolgan (Turkic). A hybrid technique is the use of semi-spontaneous texts, such as Pear stories or Frog stories, which provide more comparable data than spontaneous ones. Stapert (2013: 241-265) shows for spontaneous narratives and Pear stories in Dolgan and Sakha (Turkic) that the frequency distribution of competing word orders is largely the same in both types of texts. Finally, some studies combine the analysis of textual data with data from questionnaires, e.g., Asztalos (2018) on word order variation in Udmurt (Uralic, Permic). Such a hybrid, pluralistic approach represents a viable strategy for research on word order variability in lesser-studied languages and non-standard varieties.

Gradience and phylogenetic comparative methods
Using a gradient measure of word order allows us to model cross-linguistic variation and word order change more accurately because of the continuous nature of the resulting variables. From a statistical perspective, this is not problematic at all; rather than using the chi-squared test, logistic regression, or discrete phylogenetic comparative methods, we can use (generalized) linear mixed-effect models and continuous phylogenetic comparative methods.
The following presents a case study on the phylogenetic signal of the entropies of 24 different word orders in Indo-European, taken from Levshina (2019), who calculated entropy using Universal Dependencies treebanks (Zeman et al. 2020). The dataset is available as Dataset8.txt in the associated OSF directory, where the code can be found as well. Phylogenetic signal is the notion that closely related languages are more similar to each other than languages that are less closely related. Here we use the measure lambda (λ, see Pagel 1999), which ranges from 0 (no evidence for historical signal) to 1 or slightly above (tree-like evidence for historical signal; the maximum value of lambda depends on the phylogenetic tree). Figure 9 shows a high phylogenetic signal for some word order entropies (top) but not for all (bottom), showing that variation or non-variation in word order can be related to genealogical inheritance of patterns for many but not all word orders.
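The core of Pagel's lambda can be illustrated with a small sketch: lambda rescales the off-diagonal (shared-history) entries of the phylogenetic variance-covariance matrix, so that lambda = 0 corresponds to a star phylogeny (no signal) and lambda = 1 to the original tree. This is our own illustration of the transformation itself, not the estimation procedure used in the case study.

```python
def lambda_transform(vcv, lam):
    """Scale the off-diagonal entries of a phylogenetic
    variance-covariance matrix by lambda; diagonal entries
    (each language's own branch-length variance) are untouched."""
    n = len(vcv)
    return [[vcv[i][j] if i == j else lam * vcv[i][j] for j in range(n)]
            for i in range(n)]

# Two sister languages sharing half their history with a third
vcv = [[1.0, 0.5, 0.0],
       [0.5, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
star = lambda_transform(vcv, 0.0)   # no phylogenetic signal
tree = lambda_transform(vcv, 1.0)   # full tree-like signal
```

In practice, lambda is estimated by maximizing the likelihood of the observed entropies under a Brownian-motion model with the transformed matrix, as implemented in standard phylogenetic comparative packages.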
At the top, we find oblNOUN_Pred, the variable dealing with the position of oblique noun phrases with regard to the predicate, as in Jane looks PRED [at the stars] OBL (Levshina 2019: 538), which has a very high phylogenetic signal. The reason is that 11 of the 14 Balto-Slavic languages have an entropy above 0.96 (high variation) for oblique noun phrase-predicate order, while three out of four Indo-Aryan languages have an entropy of less than 0.15 (low variation). Hence, attested variation in the order of oblique noun phrase and predicate is to a large part determined by genealogy: closely related languages have similar entropies, resulting in the high phylogenetic signal. Most word orders tested by Levshina (2019) have a high phylogenetic signal in Indo-European, but for others the effect is less strong. Consider the difference between auxiliary-verb order (lambda around 0.2) and copula-verb order (around 1.0): for auxiliary-verb order, about half of the entropies are low (less than 0.14), and while Romance languages tend to have low entropies and Slavic languages high entropies, the pattern is not clear-cut at all. For copula-verb order, the same pattern exists but is much stronger: six out of eight Romance languages have entropies between 0.27 and 0.39; all Balto-Slavic languages (with the exception of Old Church Slavonic) have entropies between 0.47 and 0.84; all Indo-Aryan languages have entropies below 0.27; and so on. Lastly, adjective-noun order, which has a bimodal distribution for phylogenetic signal, has high entropies for Romance and low entropies for Balto-Slavic and Germanic; the pattern is not very pronounced but is picked up in part of the phylogenetic tree sample (one part of the tree sample returns a lambda of approximately zero; another part returns a lambda around 0.8). Phylogenetic signal measures of word order entropy will vary between families and between the word orders being studied.
This case study demonstrates that investigating gradience from an explicit historical perspective is not a moot point. 24 We can detect some meaningful patterns in the data. All noun-predicate dependencies (order of nominal subject, object, and oblique with the predicate) have a higher phylogenetic signal than their pronominal counterparts, which may be explained by larger cross-linguistic differences in pronoun placement as opposed to noun placement, or by differences in word order in transitive and intransitive clauses that may exist even in closely related languages. Clausal word orders (ccomp_Main, advcl_Main, acl_Noun) tend to have a lower phylogenetic signal. Even closely related languages may have quite different entropies for clausal word order. A relevant factor for ccomp_Main could also be noisy annotation of direct speech across the language corpora, however. The lack of signal for the order of determiner and noun (det_noun) can be explained by the fact that determiners are a problematic category, which is lacking in some grammars (e.g., in Czech and Polish, see Talamo and Verkerk 2022). This is why there are many inconsistencies in the UD annotation. For example, possessive pronouns are annotated as determiner dependencies in some languages and as nominal modifiers in others. 25 Another issue of concern is the different frequencies of individual constructions in individual corpora. A safer approach could be to use parallel corpora with more similarity between the texts in different languages.

24 We already know, of course, that categorical analysis of word order carries phylogenetic signal. The stability of word order in the four families investigated by Dunn et al. (2011) is clear; so is the analysis by Hammarström (2015), who states that the major "cause" of any language A having a given sentential word order is that the immediate ancestor of language A had that order.

25 See https://universaldependencies.org/u/dep/det.html.
Despite these caveats, the current results clearly show that closely related languages tend to have similar amounts of word order variation for at least half of the word orders investigated here. This case study demonstrates for the first time that continuous word order data, which reflect gradience, can be used for capturing phylogenetic signal.

Summary
In this section we discussed some methods and data for applying a gradient approach to word order in the lab and in the field. We have addressed some pressing issues, such as ecological validity, sample size, stylistic variation, and reliability of annotation, and suggested some recommendations. With the development of new corpora and cross-linguistic annotation tools, such as the Universal Dependencies pipeline (see Levshina 2022, for an overview), we can hope that gradient approaches will become increasingly accessible to researchers.

Conclusions and perspectives
In this article, we advocated for a gradient approach to word order which (a) assumes word order variation within languages, and (b) assumes that all languages vary in degree, not kind, when it comes to word order flexibility. Section 2 summarized the main factors that are responsible for the emergence and maintenance of gradience. In Section 3, we highlighted how gradient approaches have been and could be relevant for a variety of fields of inquiry within linguistics. Finally, Section 4 dealt with some practical methodological issues in implementing gradient approaches. While we advocate for starting from the assumption of within-language variability and crosslinguistic continuity, we want to emphasize that categorical approaches could be fruitful for some languages, empirical domains, or more coarse-grained research questions. However, by taking non-categorical assumptions as the null hypothesis, we allow for the possibility for categoricity to emerge; the reverse cannot be true. Gradient approaches follow naturally from the emergentist usage-based view of languages and allow us to obtain a deeper understanding of the underlying cognitive and social processes that determine language structure and use. At the same time, we have argued that many theoretical frameworks and empirical domains can benefit from taking word order variability into account more systematically. One useful "side-effect" of gradient approaches is greater transparency about the source of data (register, modality) and annotation decisions, which is crucial for crosslinguistic comparison.
At this point we can formulate some pertinent questions, challenges and perspectives for investigating word order gradience across linguistic disciplines:

- Language processing: Psycholinguistic research has been conducted on a very small set of languages, less than 1% of the world's languages at last count (Anand et al. 2011; Blasi et al. 2022; Jaeger and Norcliffe 2009; a similar problem occurs in language acquisition, Kidd and Garcia 2022). Perhaps more worrying is that this sample is biased towards spoken European languages, and as such the influence that typologically diverse languages have on word order variability within individual speakers, be that as learners or mature language users, constitutes a notable gap in the literature (although see Garrido Rodriguez et al. [2023], Norcliffe et al. [2015], Nordlinger et al. [2022], Sauppe [2017] and Sauppe et al. [2013] for recent work which addresses this gap). Also, explaining how the human language processing system produces word order variability and negotiates variability in comprehension requires explicit models of how these variables (and others, see Section 2.1) are weighted and implemented. Along with developing more robust experimental paradigms, an important method that will help achieve this goal is computational modeling, since it requires the postulation of explicit assumptions about how gradience is instantiated representationally.
- Language acquisition: We know that gradience in word order is not in principle a problem for the language learner, but how children deal with gradience as they develop towards the adult target requires further attention. Concerted efforts to study this across a wide range of languages representing a continuum of word order variability, from relatively low entropy languages like English to highly flexible languages like Warlpiri, would be a welcome addition to the literature.
- Language evolution and change: The prevailing norm in this area seems to be to treat word order variability as an unstable and transient state. This view seems to be motivated by how constituent order change processes are dealt with in historical linguistics (Croft 2003; Kroch et al. 2000; see also Section 2.3), but it potentially presumes a less flexible end-state, which is not the case for many highly flexible languages. Further, Newmeyer (2000) and Givón (1979) have posited SOV as the potential constituent order of protolanguage, and subsequent empirical work on gestural communication has supported a bias for SOV, at least in some cases (Goldin-Meadow et al. 2008; Hall et al. 2013). However, these approaches do not engage directly with variability or variation in order, although they sometimes compare the communicative functions of word order variants (Gibson et al. 2013; Schouwstra and de Swart 2014). In short, adopting the assumptions of the gradient approach presented in this article could lead to stronger models of language evolution; in particular, the idea that rigidity is associated with stability should be viewed critically.
- Language contact and emergence: Studies of emergent languages (pidgins and creoles) can benefit from a gradient approach as well. Though the Atlas of Pidgin and Creole Language Structures (Michaelis et al. 2013) does an excellent job of highlighting non-canonical orders and flexibility in word order, it is still a common misconception that creoles have rigid word order. This misconception is often part of a host of misguided and problematic claims about "simpler" grammars in creoles (DeGraff 2005; Mufwene 2000). In discussions of different definitions of simplicity and complexity as they relate to creoles (e.g., Good 2012, on morphological complexity), the role of gradience in word order must also be taken into account.
- Linguistic typology: The main challenge and task for the future is finding more comparable crosslinguistic data.
The main bottleneck remains the coverage of less-studied languages, including vernacular varieties and signed languages (Coons 2022). However, there are many ongoing projects whose goal is to provide annotated corpora for diverse samples of languages, including less well-described ones. New tools for online experiments should also help us gather more data from language users in different parts of the world. We expect a gradient approach to be particularly fruitful and pertinent in the study of languages which have highly flexible constituent order and multiple canonical orders (e.g., Sign Language of the Netherlands [Klomp 2021]). This is where new crosslinguistic correlations and implications are likely to be found.
- Language policy: Language policy can be expected to play an important role in increasing or decreasing gradience. For example, a prescriptivist grammar may characterize a language as having a rigid or dominant word order, trying to fit the language into the Procrustean bed of popular word order types, while the actual behavior of speakers can be different. More research on a wider variety of languages as they are actually used is necessary.
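To make the quantitative perspective concrete, the entropy measure mentioned above can be sketched in a few lines of Python. The function below computes Shannon entropy (in bits) over word order variants from corpus frequencies; the counts are hypothetical and for illustration only, not drawn from any real corpus.

```python
import math

def word_order_entropy(counts):
    """Shannon entropy (in bits) of a distribution over word order variants.

    counts: dict mapping order labels (e.g. 'SO', 'OS') to corpus frequencies.
    Returns 0.0 for a fully rigid order and log2(k) for k equally
    frequent orders, so values are comparable across languages.
    """
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical counts for two imaginary languages:
rigid = {"SO": 980, "OS": 20}       # near-rigid SO order: entropy close to 0
flexible = {"SO": 530, "OS": 470}   # highly flexible order: entropy close to 1 bit

print(word_order_entropy(rigid))
print(word_order_entropy(flexible))
```

Because entropy depends on the inventory of variants and on annotation decisions (e.g., what counts as a clause), such values should always be reported together with the data source and coding scheme, in line with the transparency argued for above.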
The co-authors of this article are researchers working across disciplines and across the world, all of whom have found a gradient approach to be crucial for the particular questions they are asking about word order. By bringing together diverse examples of gradient approaches, we have demonstrated a growing consensus for this type of theoretical move, and we have shown that enacting it is both possible and necessary. We hope that the theoretical ideas and practical advice in this article will encourage readers to embark on their own investigations using gradient approaches, increasing the number and types of languages included in such research.