This study investigated how predictability and prosodic phrasing interact in accounting for the variability of syllable duration in Taiwan Southern Min. Speech data were extracted from 8 hours of spontaneous speech. Three predictability measurements were examined: bigram surprisal, bigram informativity, and lexical frequency. Results showed that higher informativity and surprisal led to longer syllables. As for the interaction with prosodic positions, there was a general weakening of predictability effects for syllables closer to the boundary, especially in the pre-boundary position, where pre-boundary lengthening was the strongest. However, the effect of word informativity appeared to be least modulated by this effect of boundary marking. These findings are consistent with a hypothesis that prosodic structure modulates the predictability effects on phonetic variability. The robustness of informativity in predicting syllable duration also suggests a possibility of stored phonetic variants associated with a word's usual contextual predictability.
This study examined how predictability measurements, including informativity and surprisal, interact with prosodic boundary effects in accounting for the variability of syllable duration in Taiwan Southern Min. The main question is whether the presence of pre-boundary lengthening weakens or completely neutralizes the direct relationship between predictability and phonetic variability. The findings have implications for an information-theoretic view of the balance of signal redundancy in human speech, especially regarding the predictions made by the Smooth Signal Redundancy Hypothesis (Aylett and Turk 2004, 2006; Turk 2010) on how prosodic boundary effects modulate information redundancy in speech production. The findings also have implications for the processing and representation of phonetic variants.
1.1 Predictability and phonetic variation
Decades of research in linguistics and speech sciences have observed shortening or segmental reduction of words and syllables with higher lexical frequencies (Aylett and Turk 2004; Gahl 2008; Hashimoto 2021; Jurafsky et al. 2001; Losiewicz 1992; Pluymaekers et al. 2005b; Schweitzer and Möbius 2004; Seyfarth 2014; Tang and Bennett 2018; Tang and Shaw 2021; Van Son and Van Santen 2005; Whiteside and Varley 1998). The same observations are also found for words and syllables in contexts where they are more likely to occur (Andreeva et al. 2020; Aylett and Turk 2004; Bell et al. 2009; Bolinger 1963; Hashimoto 2021; Jurafsky et al. 2001; Lieberman 1963; Malisz et al. 2018; Pluymaekers et al. 2005a; Seyfarth 2014; Sharp 1960; Shaw and Kawahara 2019; Tang and Shaw 2021; Van Son and Pols 2003; Van Son and Van Santen 2005). A consistent finding is that higher lexical frequency and contextual likelihood correlate with shortening or segmental reduction in the acoustic signal. Diachronic change in lexical frequency and contextual likelihood has also been found to similarly correlate with diachronic variability in word duration (Sóskuthy and Hay 2017). Similar correlations are found in other aspects of phonetic realizations, where higher frequencies and contextual likelihood correlate with smaller formant transitions (Benner et al. 2007; Brandt et al. 2018, 2021; Croot and Rastle 2004; Malisz et al. 2018; Whiteside and Varley 1998), lower consonantal center of gravity (Malisz et al. 2018; Van Son and Van Santen 2005), and more segmental deletion (Bybee and Scheibman 1999; Jurafsky et al. 2001; Cohen Priva 2015; Seyfarth 2014; Whang 2019).
We can frame these findings as a “predictability effect” on phonetic realizations: Whether a unit is more frequent or more probable in a particular context affects how easy it is to predict its occurrence. From an Information-Theoretic (Pierce 1972; Shannon 1948) point of view, the correlation between higher predictability and acoustic weakening shows that a more predictable unit carries less information and thus does not require strong acoustic cues. On the other hand, if a linguistic unit is infrequent itself or occurs in a surprising context, its acoustic signal needs to be strengthened to increase its chance of being correctly recognized.
This information-theoretic account has parallels with proposals on speech production and processing such as the Hyper- & Hypoarticulation (H&H) theory (Lindblom 1990), which posits that balance between ease of articulation and effectiveness in perception together shape speech production. Another relevant account is the Smooth Signal Redundancy Hypothesis (Aylett and Turk 2004, 2006), which states that the amount of information is distributed evenly across each element of an utterance so that “linguistic redundancy” (i.e., lexical, syntactic, semantic, and pragmatic information) and “acoustic redundancy” (i.e., acoustic cues) have an inverse relationship. The smooth distribution of information is also an idea founded in Information Theory, where it is viewed as essential in ensuring robust communication in noisy environments. A schematic representation of the relationship is shown in Figure 1 (Figure 5 in Turk and Shattuck-Hufnagel 2014).
1.2 Predictability and prosodic structure
Predictability is not the only factor that affects phonetic realizations. Especially for durational variability in acoustic cues, prosodic factors such as phrasing and prominence also play an important role. For example, it has been commonly found that lengthening occurs at the right edge of prosodic units (Fougeron and Keating 1997; Oller 1973, a.o.). The interaction between this boundary-lengthening effect and predictability has been discussed in the literature. Aylett and Turk (2004) hypothesize that prosodic prominence encodes and thus mediates the relationship between information redundancy and acoustic cues. Under this view, the relationship between predictability and phonetic cues is not direct. Instead, the placement of prominence and the organization of phrasing reflect the predictability profile of an utterance (e.g., unpredictable units are more likely to be focused or have a particular phrasing pattern). The seemingly direct correlation between phonetics and predictability follows from the phonetic correlates of focus and phrasing (see Ladd 2008 for a review of this view). Turk (2010) further extends the hypothesis to cover the role that prosodic constituency and boundary signaling play in modulating signal redundancy.
There are empirical works that support this hypothesis. Cutler and Carter (1987) found that lexical stress falls on the initial syllable in 70% of content words in English spontaneous speech. Since content words are less predictable than function words, especially in the initial syllable, this finding suggests that the metrical structure that dictates stress placement already plays a big role in accounting for the relationship between phonetic realization and predictability. This relationship between the prosodic organization and predictability also has parallels at the sentential level: Less predictable units tend to receive sentential accents, and more predictable units may be unaccented. Ladd (2008, Chapter 6) summarizes relevant findings and discusses several types of units that are likely to be unaccented or deaccented unless in narrow or contrastive focus, and these units can all be interpreted as being more predictable (having less “semantic weight” per the original discussion), such as indefinite pronouns (e.g., someone, something) and semantically empty content words (e.g., man, guy, place, as opposed to policeman, friend, barn; citing Bolinger 1972).
Aylett and Turk directly tested their hypothesis by comparing the R² of regression models with different combinations of language redundancy (syllabic trigram probability, log word frequency, and givenness) and prosodic variables (prominence, phrasing). They found that both redundancy and prosodic structure account for syllable duration, i.e., higher levels of information redundancy lead to shorter duration, and stronger prominence and higher-level boundaries lead to longer duration. They also found that most of the contributions by language redundancy are shared by prosodic variables. In addition, prosodic variables made a larger proportion of unique contributions to predicting syllable duration. On the other hand, the unique contribution of language redundancy is highlighted only when controlling for prosodic phrasing. These findings were viewed as supporting the Smooth Signal Redundancy Hypothesis, as they show a relationship between language redundancy and durational variability. More importantly, they also show that prosodic structure predicts a large portion of durational variabilities that language redundancy could have directly explained.
More recent studies have explored the interaction between predictability and prosodic structure in affecting phonetic realizations. Malisz et al. (2018) address the interaction between surprisal and prosodic structure in accounting for phonetic variability in American English, Czech, Finnish, French, German, and Polish. After examining multiple phonetic cues (syllable duration, vowel dispersion, vocalic spectral emphasis, consonant center of gravity), they conclude that while prosodic structure accounts for phonetic variability to a certain extent, informativity also directly affects phonetic realizations. They view these findings as consistent with a weaker version of the Smooth Signal Redundancy Hypothesis. Andreeva et al. (2020) directly compare the effect of predictability on syllables at strong and weak boundaries with or without a phrasal accent. They find that predictability, estimated with surprisal from trigram language models, has a stronger effect on duration at stronger boundaries, while accent placement does not interact with surprisal.
1.3 Implications on the nature of lexical representation
Another aspect of durational variability’s relationship with predictability is whether such a relationship reflects online processing or the storage of variants in lexical representations. The online processing explanation aligns better with the Hyper- & Hypoarticulation (H&H) theory (Lindblom 1990) and the Smooth Signal Redundancy Hypothesis (Aylett and Turk 2004, 2006). The implication would be that the speaker is making constant adjustments to produce acoustic signals during speech production given the upcoming linguistic information. This constant adjustment aims to reach a balance between the need for ease of articulation and the salience of perception (H&H) or between the amount of information redundancy and acoustic salience.
By explaining reduction and shortening based on lexical and contextual probabilities as an online process, the implication is that these variants are likely coming from canonical representations, which would be in line with studies that show the representational advantage of canonical forms over variants. The phonetic variants may only be associated with contexts that they often occur in instead of being part of the lexical representation. For example, Ernestus et al. (2002) show that less-reduced forms are more successfully recognized in isolation and a limited context (i.e., only flanked by a single vowel plus a consonant), even though in full contexts (i.e., embedded in a full sentence) the difference is drastically reduced. Similar results are reported in Kemps et al. (2004) with the phoneme-monitoring task where a listener has to identify a phoneme from reduced and non-reduced forms. In addition, Ranbom and Connine (2007) show that word forms with the canonical but less frequently produced [nt] sequence nonetheless triggered a shorter reaction time and were rated to be more well-formed than variants with the frequently-produced nasal-flap sequence.
On the other hand, there are studies showing that reduced forms of a word are stored as part of its lexical representation. Lavoie (2002) reports that homophones such as “for” and “four” have different sets of phonetic variants, suggesting that phonetic forms may be word-specific. Similar results are reported in Johnson (2007) and Gahl (2008), the latter of which shows that homophones with a higher lemma frequency have shorter phonetic realizations. If homophones have different sets of variants, it is possible that these variants are not produced via a unified online process. Instead, these variants may be stored as part of the representation.
A larger set of studies show the processing advantage of phonetic variants that are more frequent, some of which are non-canonical (Bürki et al. 2010; Connine 2004; Connine et al. 2008; Deelman and Connine 2001; Ernestus 2014; Pitt 2009; Pitt et al. 2011; Racine and Grosjean 2005; Ranbom and Connine 2007). For example, Connine (2004) shows that word-initial consonants in words having a word-medial flapping context (e.g., pretty) were identified more successfully when the flap occurs than when the canonical [t] occurs. Bürki et al. (2010) show that the speakers’ report of relative frequencies of variants (schwa or non-schwa, as in [fənɛtr] vs. [fnɛtr] for the French word “fenêtre”) for each word correlates with processing time in naming tasks and symbol-word association tasks, which was taken as evidence that at least in the production lexicon, both variants exist. The Ranbom and Connine (2007) study that found a processing advantage of canonical [nt] forms also found that words with a higher proportion of the nasal-flap variant show a processing advantage of this particular variant. As suggested in the literature (e.g., Ernestus 2014; Pitt 2009), this type of finding shows that if a reduced variant is a frequent and typical realized form of a word, the variant may be stored in representation to the extent that it has processing advantages.
Seyfarth (2014) discusses some possible implementations of the lexical storage of variants. The variants may still be phonologically abstract but include both reduced and unreduced forms (e.g., a /fənɛtr/ representation and a /fnɛtr/ representation). It is also possible that all episodes of production, including reduced and unreduced forms, are stored and collectively affect processing (i.e., an exemplar-based view of phonological representation, Johnson 2007; Pierrehumbert 2001). Reductions may also be stored as lexically specified changes to articulatory timing relationships (Lavoie 2002) in the framework of Articulatory Phonology (Aylett and Turk 2004; Browman and Goldstein 1990; Byrd 1996).
Another proposal that specifically aims to account for the representation of surface duration is to separate phonetic planning from phonological planning and the implementation of motor control (Turk and Shattuck-Hufnagel 2020a, b). In the proposed three-component speech production model, the specification of surface duration is handled by the phonetic planning component so that the observed surface durational variability is not simply a result of emergent timing mechanisms.
Overall, the implication concerning the relationship between phonetic variation, predictability, and lexical representation is that if there are lexically specific phonetic variants, we should be able to find word-specific reduction effects either on top of or instead of local context-specific effects, which has been shown in Mandarin (Tang and Shaw 2021). In other words, a word that is often reduced may be pronounced with a reduced form regardless of whether it occurs in a context where it often occurs, since the reduced form is represented in the lexicon.
Informativity (Cohen Priva 2008, 2012, 2015; Seyfarth 2014) is a measurement of predictability that has implications on whether phonetic variants follow word-specific or context-specific statistics. It is essentially a unit’s average contextual predictability across all the contexts where it occurs. The informativity of a unit u is given in (1), where c refers to a context, P(u|c) refers to a unit’s probability in a particular context, and P(c|u) is the weighting variable that refers to how often u occurs in that context. The probability is log-transformed with its sign flipped, as it is a measurement of surprisal.
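Written out from the definitions just given, the measure takes the form below. The summation over all contexts c in which u occurs, and the unspecified base of the logarithm, are assumptions made here to match the verbal description.

```latex
\mathrm{Informativity}(u) = -\sum_{c} P(c \mid u)\,\log P(u \mid c)
```

Because the weights P(c|u) sum to one over the contexts of u, the measure can be read as the expected surprisal of u across the contexts where it occurs.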
In other words, each unit type has one informativity measurement. This is different from contextual predictability measured in n-gram probabilities or surprisal, which is variable between tokens of a word in different contexts. Previous studies on informativity have often calculated this measure based on both previous and following contexts. In terms of word informativity, “current” is an example of a low-informativity word mentioned in Seyfarth (2014), since it is usually predictable (i.e., occurring with words such as “events”) and its occurrence does not add much information. In contrast, “nowadays” is an unpredictable and thus highly informative word as there are no particular contexts where it often occurs. Their contextual probabilities are plotted in Figure 2 (Figure 1, p. 142 in Seyfarth 2014).
Informativity allows for testing the hypothesis that units with a certain level of predictability will have a particular type of phonetic variant even in a context where they are not as predictable. For example, for a low-informativity word such as “current,” the hypothesis is that it occurs in a reduced form even in contexts where it does not often occur, which would be an argument that the reduced form, being more frequent overall (as the word is generally not very informative), is stored as part of its lexical representation. Previous studies have found that low-informativity units exhibit reduction, and the effect is either comparable to or stronger than the effect of local measurements of surprisal, at both the word level (Seyfarth 2014) and the segmental level (Cohen Priva 2008, 2012, 2015, 2017). Overall, these findings lend some support for the hypothesis that there is some consistency between a word’s general informativity and its phonetic realization that goes beyond specific local contexts.
1.5 Current study
Building on these previous works, this study aims to directly compare the effects of surprisal and informativity in different prosodic conditions. The research target is syllable duration in the spontaneous speech of Taiwan Southern Min. There are three main research questions.
First, is the effect of predictability on syllable duration consistent across different positions before a prosodic boundary? Given the known effect of pre-boundary lengthening, this question essentially asks whether the effect of predictability is neutralized by boundary effects, as predicted by the Smooth Signal Redundancy Hypothesis (SSRH) as extended by Turk (2010) to cover the relationship between boundary signaling and information redundancy.
Following similar formulations in the recent literature (Malisz et al. 2018; Andreeva et al. 2020), we take the strong form of SSRH to be predicting that once the prosodic factors are controlled for, no effects of predictability would be found. This is because the relationship between information redundancy and phonetic variability is moderated through the phonologized and conventionalized relationship between information redundancy and the prosodic structure. In the context of this study, the hypothesis posits that boundary marking already accounts for the need to strengthen acoustic signals for units toward a boundary that are often high in information content. In targeting the effect of pre-boundary lengthening as a prosodic factor, the control is enforced by examining the predictability effects at each specific prosodic position before a boundary, which is a method hinted at by Turk (2010) in testing the modulating effect of boundary marking and signal redundancy. A strong form of the hypothesis is consistent with an outcome where predictability effects at a syllable position are completely erased.
A weaker version of the hypothesis, on the other hand, would be supported by a finding showing that the presence of pre-boundary lengthening in a certain prosodic position weakens predictability effects on durational variability in that very position. The third scenario is that predictability effects remain unaffected regardless of whether a particular prosodic position exhibits prosodic lengthening.
Potential results and the corresponding implications regarding this hypothesis are summarized in (2).
Second, this study examines whether the effects of informativity (if successfully replicated) and surprisal would behave differently in their interaction with prosodic phrasing. If the effect of informativity reflects the storage of variants for specific words regardless of information contexts, we would expect its effect to be less modulated by the prosodic structure as well. In other words, a more consistent effect of informativity would lend stronger support for this lexical hypothesis on phonetic variants. On the other hand, since surprisal reflects local information contexts, it is more likely to be sensitive to prosodic context as well, unless a strong view of the Smooth Signal Redundancy Hypothesis is adopted. These hypotheses are summarized in (3).
Finally, this study asks whether the effects of various predictability measurements, especially informativity, a global summary of a word’s local contextual predictability, can be replicated in Taiwan Southern Min, as research on probabilistic reduction and predictability effects has mainly been conducted in English and related Indo-European languages, with a few recent exceptions (Japanese in Hashimoto 2021; Shaw and Kawahara 2019; Mandarin in Tang and Shaw 2021, and Kaqchikel Mayan in Tang and Bennett 2018).
Taiwan Southern Min has similarities and differences with the aforementioned languages, making it an interesting target for extending the breadth and depth of the coverage of language types in research on predictability and probabilistic phonetic reduction. It is a language with lexical tones and a more syllable-timed rhythm, which marks a contrast with the pitch-accent system and mora-timed rhythm in Japanese and the lexical stress system and stress-timed rhythm in English. While Taiwan Southern Min is similar to Mandarin in these prosodic and rhythmic aspects, which extends the recent research on tone languages and predictability, it also has some differences that may contribute to differences in durational patterning. For example, it has a more complicated tonal system (seven tones, as shown in Table 1) than Mandarin. It also has a syllable structure that features plosive codas, which potentially introduces more durational variability, as tones with plosive codas have been described as ‘short tones’.
| Label | Base tone | Surface form with tone sandhi |
|---|---|---|
| T1 | High-level (55) | Low-level (22) |
| T2 | High-falling (53) | High-level (55) |
| T3 | Low-falling (21) | High-falling (53) |
| T4 | Low, short (2) | High, short (4) |
| T5 | Low-rising (24) | Low-level (22) |
| T7 | Low-level (22) | Low-falling (21) |
| T8 | High, short (4) | Low, short (2) |
In terms of the phonetics of prosodic boundaries, previous research has shown that Taiwanese has a consistent disyllabic domain of pre-boundary lengthening (Wang and Fon 2012, 2015). The difference between prosodic realizations at intonational phrase boundaries and intermediate phrase boundaries lies mainly in the strength of lengthening at the position right before the boundary (Wang 2013). A disyllabic domain of pre-boundary lengthening makes Taiwan Southern Min similar to Taiwan Mandarin but different from Japanese and English (Fon 2002; Fon et al. 2011). Fon et al. (2011), in discussing the Mandarin results, raise a possibility that a syllable-timed rhythm and a lack of stress contrasts may contribute to Mandarin’s reliance on a larger domain to signal the presence of a boundary. By examining predictability effects on phonetic realizations in Taiwan Southern Min, this study may identify another dimension of differences between languages with different prosodic and rhythmic structures.
Tone sandhi is another element in Taiwan Southern Min’s prosodic structure that may play a role in durational variability, which does not have an apparent counterpart in Mandarin. In connected speech, all syllables except for the one near the right edge of a “tone sandhi group” undergo tonal alternation, as shown in (4). The inventory of lexical tones and tonal alternations is shown in Table 1. The tone sandhi group has been described as a level in the prosodic hierarchy of Taiwan Southern Min (Peng and Beckman 2003) and has been assumed to be heavily governed by morphosyntactic organization (Lin 1994), even though there may be more factors contributing to its variations (Pan et al. 2019; Pan and Huang 2020). It has been shown that syllables at the right edge of tone sandhi groups have a longer duration, a wider range of F0, and creakier voice quality (Kuo 2013). While Mandarin also has tone sandhi, it is limited to the alternation of a single tone (i.e., a contour tone changes its tonal value when followed by another contour tone in domain-medial positions) so that the phrasing behavior is only observable in specific contexts (see Shih (2017) for example).
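The alternation described above can be illustrated with a minimal sketch that maps the base tones in Table 1 to their surface values within one tone sandhi group. The function name `surface_tones` is hypothetical, and the sketch assumes that only the final syllable of a group keeps its base tone; it ignores any further conditioning factors.

```python
# Base and sandhi tone values (Chao numerals) copied from Table 1.
BASE = {"T1": "55", "T2": "53", "T3": "21", "T4": "2",
        "T5": "24", "T7": "22", "T8": "4"}
SANDHI = {"T1": "22", "T2": "55", "T3": "53", "T4": "4",
          "T5": "22", "T7": "21", "T8": "2"}


def surface_tones(group):
    """Return surface tone values for the syllables of one tone sandhi group.

    `group` is a list of tone labels such as ["T1", "T2", "T5"].
    Assumption: every syllable except the last one takes its sandhi value.
    """
    if not group:
        return []
    non_final = [SANDHI[label] for label in group[:-1]]
    final = BASE[group[-1]]
    return non_final + [final]


example = surface_tones(["T1", "T2", "T5"])
```

In this toy group, the first two syllables surface with their sandhi values and the last syllable keeps its base tone.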
In short, Taiwan Southern Min is an interesting case for studying the interaction between predictability and prosodic lengthening because it differs in prosodic and rhythmic structure from heavily studied Indo-European languages such as English and from the recently studied Japanese. Corresponding differences between these types of languages have also been found in the phonetic realization of duration at prosodic boundaries. Furthermore, Taiwan Southern Min differs systematically from the prosodically and rhythmically similar Mandarin, adding depth and variability to the inquiry on predictability and phonetic realization in languages with lexical tones and a syllable-timed rhythm.
Results show a consistent effect of informativity on top of the effects of surprisal and lexical frequency. Higher informativity and surprisal are generally associated with longer syllables. As for the interaction with prosodic positions, there is a general weakening of predictability effects as a syllable is closer to the boundary, especially for the pre-boundary syllable, which has the most substantial final lengthening. However, the effect of word informativity appears to be least modulated by this effect of boundary marking. These findings suggest that word-specific duration variants may exist for Taiwan Southern Min, with their durational profiles predicted from informativity. Overall, the findings are consistent with a weak version of the Smooth Signal Redundancy Hypothesis: prosodic marking affects but does not completely eradicate predictability effects. Concerning lexical representation, the findings are more in line with the hypothesis that phonetic variants are lexically stored, as informativity is shown to be a competitive predictor of syllable duration when compared with surprisal and lexical frequencies.
2.1 Speech corpus
Durational data were extracted from a corpus of 8 hours of Taiwan Southern Min spontaneous speech (Wang and Fon 2013). Sixteen speakers each contributed around 30 min of recordings. The speakers were evenly split in gender and between two age groups. The “older” and “younger” groups are divided by whether the speakers were born before or after 1975. All speakers were from the same dialectal region (Taichung). Transcription follows the Peh-oe-ji Romanization system. The alignment between the transcriptions and the speech recording was done at the syllabic level. The corpus also contains annotations of discourse and prosodic boundaries. Discourse annotation was conducted independently of the sound files. The transcriptions were segmented into clausal units, and the relationship between neighboring units was labeled according to the degree of separation of their topics. Prosodic annotations identify two levels of prosodic boundaries (intonational and intermediate phrases).
For this study, additional transcriptions were added following the writing system released by Taiwan’s Ministry of Education (MOE). We also added word segmentation by using forward maximal matching based on the Taiwan Southern Min Dictionary published by the MOE. To investigate the effect of tone sandhi grouping, an additional tier of annotation was done on whether a syllable was pronounced with the base tone or the sandhi tone.
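As an illustration of forward maximal matching, the sketch below greedily takes the longest dictionary entry starting at the current syllable and falls back to a single syllable when nothing matches. The function name, the `max_len` window, and the toy lexicon are hypothetical and stand in for the MOE dictionary; the actual segmentation procedure may differ in detail.

```python
def forward_max_match(syllables, lexicon, max_len=4):
    """Segment a list of syllables into words by greedy longest match from the left.

    `lexicon` is a set of tuples of syllables (a stand-in for a dictionary).
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            # A single syllable is always accepted as a fallback word.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical toy lexicon and input, for illustration only.
toy_lexicon = {("tai", "uan"), ("tai", "uan", "lang"), ("lang",)}
segmented = forward_max_match(["gua", "si", "tai", "uan", "lang"], toy_lexicon)
```

Here the three-syllable entry wins over the two-syllable one because longer candidates are tried first.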
To have a more stringent control of boundary type, this study focuses on the right edges of discourse units that are also intonational phrases. Among the 9981 clausal boundaries marked in this corpus, 6567 (65.79%) correspond to an intonational phrase boundary, and 1421 (14.23%) correspond to boundaries of smaller prosodic phrases. The total proportion of discourse boundaries that match with prosodic breaks (80.03%) is similar to observations in French and Mandarin reported in Prévot et al. (2015).
Since the targets were final, penultimate, ante-penultimate, and initial/medial positions before a boundary, this study only analyzed discourse units that were longer than three syllables, which amounted to 5,699 discourse units (57.1% of all boundaries in the data set). The selected data were further filtered to exclude disfluencies and code-switching, and additional prosodic breaks within the discourse unit. The final data set going into the analyses contained 40,443 syllables (31,600 words).
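The selection and labeling step can be sketched as follows. The function name `label_positions` and the label strings are hypothetical; the sketch only covers the length criterion and the position labels, not the exclusion of disfluencies, code-switching, or unit-internal prosodic breaks.

```python
def label_positions(unit_syllables):
    """Label each syllable of one discourse unit by its position before the boundary.

    Units of three syllables or fewer are skipped (an empty list is returned),
    mirroring the requirement that analyzed units be longer than three syllables.
    """
    n = len(unit_syllables)
    if n <= 3:
        return []
    labeled = []
    for idx, syllable in enumerate(unit_syllables):
        distance = n - 1 - idx  # 0 means the syllable right before the boundary
        if distance == 0:
            position = "final"
        elif distance == 1:
            position = "penultimate"
        elif distance == 2:
            position = "ante-penultimate"
        else:
            position = "initial/medial"
        labeled.append((syllable, position))
    return labeled


labels = [pos for _, pos in label_positions(["s1", "s2", "s3", "s4", "s5"])]
```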
The focus on syllable duration is motivated by the goal of testing how predictability effects interact with pre-boundary lengthening. Research on pre-boundary lengthening in Taiwan Southern Min and comparisons with other languages have always been made at the syllabic level instead of the word level, and typological differences have been observed in terms of the lengthening domain counted in syllable (Fon 2002; Fon et al. 2011; Wang and Fon 2012, 2015). Therefore, targeting syllable duration allows for directly comparing predictability effects in different syllable positions.
2.2 Language modeling corpus
The corpus for language modeling was a collection of Taiwan Southern Min written texts curated by Iunn (2005) originally for a study on lexical frequencies. The corpus contains texts from various genres, including literary works (fiction, prose, poems, biographies, play scripts, etc.), journalistic reports, conversation transcripts, commentaries, and academic writings. Data were available in two different writing conventions: The “Hanlo” version, which contains a mixture of Chinese characters and romanizations, was used in this study. The other version was the fully romanized Peh-oe-ji system.
Additional preprocessing was performed on the “Hanlo” texts to transform the writing to be consistent with the writing convention advocated by the MOE in Taiwan. In addition, word segmentation was done following the MOE dictionary with the maximal length matching method, which is similar to the original segmentation method reported by Iunn. Given these preprocessing measures, the final word count for this corpus is 4.68M (from 5.96M syllables/characters).
Trigram language models were trained with modified Kneser-Ney smoothing (Chen and Goodman 1999) at the word level using the SRILM toolkit (Stolcke 2002; Stolcke et al. 2011). In addition to training models to obtain probabilities given the previous context, models that took sentences from the backward direction were also trained to obtain surprisal and informativity given the following context. This was motivated by Seyfarth’s (2014) similar method and the finding that informativity and surprisal given the following context account for more variance in word duration than informativity and surprisal given the previous context.
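The idea of obtaining surprisal from both directions can be sketched with a toy bigram model. This is not the SRILM setup described above: add-one smoothing is swapped in for modified Kneser-Ney purely to keep the example short, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences, reverse=False):
    """Count bigrams over sentences; with reverse=True each sentence is read backward.

    In the reversed model the "context" of a word is the word that follows it
    in the original order.
    """
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = list(reversed(sentence)) if reverse else list(sentence)
        tokens = ["<s>"] + tokens + ["</s>"]
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    vocab = {w for s in sentences for w in s} | {"<s>", "</s>"}
    return context_counts, bigram_counts, len(vocab)


def surprisal(word, context, model):
    """Bigram surprisal in bits, using add-one smoothing (an assumption of this sketch)."""
    context_counts, bigram_counts, vocab_size = model
    p = (bigram_counts[(context, word)] + 1) / (context_counts[context] + vocab_size)
    return -math.log2(p)


toy_corpus = [["a", "b", "c"], ["a", "b", "d"]]
forward = train_bigram(toy_corpus)
backward = train_bigram(toy_corpus, reverse=True)

s_forward = surprisal("b", "a", forward)    # "b" given the previous word "a"
s_backward = surprisal("b", "c", backward)  # "b" given the following word "c"
```

Reversing the sentences lets the same counting code serve both directions, which is the property the backward models rely on.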
By using estimates of word-level predictability in modeling syllable duration, we are making the assumption that word-level predictability may percolate to syllable/phoneme levels and affect syllable duration, which is a divergence from other studies that directly model word duration with word-level predictability (e.g., Seyfarth 2014; Sóskuthy and Hay 2017; Tang and Shaw 2021). To entertain the alternative view that syllable/phoneme-level predictability is more relevant in modeling syllable duration, syllable/phoneme-level language models were also trained to obtain corresponding predictability measurements, and the results with syllable/phoneme-level predictability measurements will be presented in an additional analysis (Section 3.3.2).
It should be noted that even though trigram surprisals are also used in the literature (e.g., Malisz et al. 2018; Andreeva et al. 2020), bigram surprisals were used in this study for three reasons. First, it has been shown that using trigrams instead of bigrams only produces negligible improvement in predicting reductions in speech (Jurafsky et al. 2002). Second, using bigram surprisal allows direct comparisons between surprisal and informativity, the latter of which is based on bigrams according to the most relevant literature. Finally, by using bigram probabilities (as estimates of lexical frequency, to be discussed later) from trigram models, we obtain bigram probabilities that are adjusted given the distribution of bigrams in trigrams and avoid over-counting of bigrams that simply occur frequently in a few trigrams. As discussed in the next section, the same reason motivates our use of unigram surprisal from the language models to represent lexical frequency.
2.3 Variables in the regression analyses
Each syllable in the corpus was annotated with predictability variables (bigram informativity, bigram surprisal, unigram surprisal, and neighborhood density) and control variables on phrasing (position in a prosodic phrase, length of the prosodic phrase, sandhi boundary), word-level information (word length), and syllable-level information (baseline duration, surface tone).
2.3.1 Informativity given previous and following word
Bigram informativity of each word was calculated based on the formula in (1). The smoothed bigram probabilities were taken from the language models reported in Section 2.2. Two informativity measures, where the context refers to the previous and the following word, respectively, were calculated for each word. Informativity based on contexts in different directions was used as different predicting variables. In cases where a multisyllabic word did not occur in the training corpus, the word was broken up into monosyllabic words, the informativity of which was used instead. These predictors were log-transformed (base 10) and converted to z scores for normalization.
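As a rough illustration of the quantity involved, the sketch below computes informativity as the context-weighted average of a word's negative log bigram probability, i.e., the sum over contexts c of P(c | w) × −log P(w | c), following the definition used by Seyfarth (2014). It is a simplification under stated assumptions: it uses unsmoothed relative frequencies rather than the smoothed probabilities from the language models, log base 2 is an arbitrary choice, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def bigram_informativity(sentences):
    """Informativity of each word given the previous word (unsmoothed estimates)."""
    bigrams = Counter()
    for sent in sentences:
        for prev, word in zip(sent, sent[1:]):
            bigrams[(prev, word)] += 1
    context_totals = Counter()  # how often each context starts a bigram
    word_totals = Counter()     # how often each word ends a bigram
    for (prev, word), n in bigrams.items():
        context_totals[prev] += n
        word_totals[word] += n
    info = defaultdict(float)
    for (prev, word), n in bigrams.items():
        p_context_given_word = n / word_totals[word]
        p_word_given_context = n / context_totals[prev]
        # Weight each context-specific surprisal by how typical the context is for the word.
        info[word] += p_context_given_word * -math.log2(p_word_given_context)
    return dict(info)

# Hypothetical toy corpus of three two-word "sentences".
toy = [["a", "b"], ["a", "c"], ["d", "b"]]
info = bigram_informativity(toy)
```

Informativity given the following word can be obtained from the same function by reversing each sentence before counting, which mirrors the use of backward language models.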
2.3.2 Surprisal given previous and following word
Bigram surprisals were calculated from language models trained on the corpus described in Section 2.2 with the SRILM toolkit with modified Kneser-Ney smoothing. As with informativity, for each word token, surprisal given the previous word token and surprisal given the following word token were calculated and used as separate variables. These predictors were log-transformed (base 10) and converted to z scores for normalization.
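The two token-level surprisal measures can be sketched as follows. This is a simplified illustration, not the SRILM computation: probabilities are unsmoothed relative frequencies, sentence-boundary symbols are ignored, log base 2 is an assumption, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def bigram_surprisals(sentences):
    """Return (word, surprisal given previous word, surprisal given next word) per token."""
    pair_counts, first_counts, second_counts = Counter(), Counter(), Counter()
    for sent in sentences:
        for a, b in zip(sent, sent[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1    # a observed as the left member of a bigram
            second_counts[b] += 1   # b observed as the right member of a bigram
    rows = []
    for sent in sentences:
        for i, w in enumerate(sent):
            given_prev = given_next = None  # undefined at sentence edges in this sketch
            if i > 0:
                prev = sent[i - 1]
                given_prev = -math.log2(pair_counts[(prev, w)] / first_counts[prev])
            if i < len(sent) - 1:
                nxt = sent[i + 1]
                given_next = -math.log2(pair_counts[(w, nxt)] / second_counts[nxt])
            rows.append((w, given_prev, given_next))
    return rows

# Hypothetical toy corpus.
rows = bigram_surprisals([["a", "b"], ["a", "c"], ["d", "b"]])
```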
2.3.3 Unigram surprisal
Unigram surprisals from the language models were also used as a predictor. They can be conceptualized as a representation of lexical frequency, as they are negative log-transformed unigram probabilities, which in turn are lexical frequencies divided by corpus size. Since we took unigram surprisals directly from the language models, these numbers were smoothed by adjusting for how often the lexical unit occurred in a unique context, as part of the modified Kneser-Ney smoothing method (Chen and Goodman 1999). As with bigram surprisals and informativity, this predictor was log-transformed (base 10) and z-normalized.
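A minimal sketch of this predictor and of the normalization applied to all predictability measures is given below. It assumes raw relative frequencies in place of the smoothed model probabilities, log base 2 for the surprisal itself, and the sample standard deviation for the z scores; the function names and toy tokens are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, stdev

def unigram_surprisal(tokens):
    """Negative log probability of each word type, from raw relative frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: -math.log2(c / total) for w, c in counts.items()}

def log10_z(values):
    """Log-transform (base 10), then convert to z scores; values must be positive."""
    logged = [math.log10(v) for v in values]
    mu, sd = mean(logged), stdev(logged)
    return [(x - mu) / sd for x in logged]

# Hypothetical toy token list: "a" is twice as frequent as "b" and "c".
surprisal = unigram_surprisal(["a", "a", "b", "c"])
z = log10_z([surprisal[w] for w in ["a", "b", "c"]])
```

The more frequent word receives the lower surprisal and therefore the lowest normalized score.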
2.3.4 Neighborhood density
Neighborhood density refers to the number of phonological neighbors that a syllable has in the MOE dictionary. A phonological neighbor is defined as a syllable different from the target syllable by the deletion, addition, or substitution of a segment or tone. For example, /pak4/’s neighbors would include /bak4/, /pat4/, /pak2/, and /ak4/.
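The neighbor definition can be made concrete with a short sketch. It assumes each syllable is represented as a tuple of segments plus a tone label and that exactly one edit (one segment deleted, added, or substituted, or the tone changed) separates neighbors; the representation, the function names, and the toy inventory are illustrative assumptions, not the MOE dictionary format.

```python
def is_neighbor(a, b):
    """True if syllables a and b differ by exactly one segment edit or one tone change."""
    (segs_a, tone_a), (segs_b, tone_b) = a, b
    if segs_a == segs_b:
        return tone_a != tone_b  # same segments: neighbor only if the tone differs
    if tone_a != tone_b:
        return False             # a segment edit plus a tone change is two edits
    if len(segs_a) == len(segs_b):
        # Exactly one substituted segment.
        return sum(x != y for x, y in zip(segs_a, segs_b)) == 1
    short, long_ = sorted((segs_a, segs_b), key=len)
    if len(long_) - len(short) != 1:
        return False
    # One deletion or addition: dropping one segment of the longer form gives the shorter.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))

def neighborhood_density(target, inventory):
    """Count the syllables in the inventory that are neighbors of the target."""
    return sum(is_neighbor(target, other) for other in inventory)

# Toy inventory built around the /pak4/ example, plus two non-neighbors.
pak4 = (("p", "a", "k"), "4")
inventory = [
    pak4,
    (("b", "a", "k"), "4"),  # onset substitution
    (("p", "a", "t"), "4"),  # coda substitution
    (("p", "a", "k"), "2"),  # tone substitution
    (("a", "k"), "4"),       # onset deletion
    (("p", "a", "t"), "2"),  # two edits: not a neighbor
    (("k", "i", "m"), "1"),  # unrelated syllable
]
density = neighborhood_density(pak4, inventory)
```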
Previous studies have shown that words with more phonological neighbors tend to exhibit greater phonetic salience (Baese-Berk and Goldrick 2009; Munson and Solomon 2004; Munson 2007; Scarborough 2010, 2012, 2013; Vitevitch 2002). For example, Munson and Solomon (2004) show that vowels in real words with higher neighborhood density are pronounced with a more dispersed vowel space. Scarborough (2013) also shows that high neighborhood density correlates with increased nasal coarticulation and hyper-articulation in elicited American English speech. However, findings in the opposite direction have also been reported: Gahl et al. (2012) and a later study (Gahl 2015) show that words with higher neighborhood density are more often reduced. Their finding is viewed as evidence supporting a production-based view of phonetic variation: words with higher neighborhood density are easier to access and thus easier to produce. By including neighborhood density in the analysis, we aimed to control for such potential effects and contribute to relevant discussions.
2.3.5 Syllable position
This predictor refers to a syllable’s relative position to the nearest discourse and prosodic boundary that follows it. It has four categories: “Final” (i.e., the pre-boundary position), “Penultimate”, “Ante-penultimate”, and the fourth category, “Initial/Medial”, which includes all syllables more than three syllables away from the right edge of the phrase.
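The four-way coding can be sketched as a function of a syllable's distance from the right edge of its phrase. The function name and the zero-based indexing are assumptions made for illustration, as is the treatment of short phrases, where every syllable simply receives its right-edge label.

```python
def position_category(index, phrase_length):
    """Label a syllable by its distance from the right edge of its prosodic phrase."""
    from_end = phrase_length - 1 - index  # 0 for the phrase-final syllable
    labels = {0: "Final", 1: "Penultimate", 2: "Ante-penultimate"}
    return labels.get(from_end, "Initial/Medial")

# Labels for a hypothetical five-syllable phrase.
five = [position_category(i, 5) for i in range(5)]
```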
2.3.6 Sandhi boundary
This binary predictor refers to whether the syllable carries a base tone, i.e., whether it is at the right edge of a tone sandhi group. This predictor was included to model the possible effect of this level of prosodic phrasing and morphosyntactic organization.
2.3.7 Speech rate
This predictor refers to the number of syllables per second in each prosodic unit. A larger number indicates a faster speech rate.
2.3.8 Word length
Word length describes the length of the word in which a syllable occurred. It is a four-level predictor since the longest words in the corpus were four syllables long. The predictor aims to model a potential inverse relationship between the length of a word and the length of its components (e.g., Edwards and Beckman 1988; Lindblom 1968).
2.3.9 Baseline duration
Similar to Tang and Bennett (2018), a linear regression model was trained on the corpus to predict the duration of each syllable from its onset, onglide, vowel, offglide, coda, and the length of the syllable measured in the number of segments. Predictions made by this model were used as a predictor of a syllable’s expected duration in the analyses. This predictor was log-transformed (base 10) and converted to z scores.
2.3.10 Surface tone
This predictor refers to a syllable’s surface tonal category instead of the “base” tones before the tone sandhi process. In addition to the seven lexical tonal categories, there are also the “de-stressed” and the “particle” categories. The “de-stressed” category refers to cases where a syllable carries neither the base nor the surface tone and is often accompanied by phonetic reduction, e.g., the last two syllables in khoaŋ21-khi-lai (the surface form following tone sandhi would have been khoaŋ53-khi55-lai24). The “particle” category was used to label the surface tone of particles such as /a/ and /la/, which do not have inherent tonal targets.
2.4 Modeling procedure
We fit linear mixed-effects models with the lmer() function in the lme4 package (Bates et al. 2015), using the fixed effects described in the previous section. We also added random intercepts for Morpheme (monosyllabic), Word, and Speaker. After a full model was fit, each predictor’s contribution was tested with a log-likelihood test comparing the full model to a reduced model with that predictor removed. If removing the predictor did not significantly change the fit (alpha = 0.15, following Seyfarth 2014), the predictor was removed. Step-wise removal started from the least contributing predictor and continued until further removal resulted in a significant difference between the full and final models.
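The model-comparison step can be sketched numerically. The example below assumes the comparison is a likelihood-ratio test on the two models' log-likelihoods and covers only the case where the reduced model drops a single coefficient (1 degree of freedom); multi-level factors such as syllable position would need more degrees of freedom. The function names and the log-likelihood values are hypothetical, and the models themselves would be fit elsewhere (e.g., with lmer()).

```python
import math

def lrt_p_value(loglik_full, loglik_reduced):
    """Likelihood-ratio statistic and p-value for one dropped parameter (1 df)."""
    statistic = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    # For 1 df, the chi-square survival function equals erfc(sqrt(x / 2)).
    return statistic, math.erfc(math.sqrt(statistic / 2.0))

def keep_predictor(loglik_full, loglik_reduced, alpha=0.15):
    """Keep the predictor only if removing it significantly worsens the fit."""
    _, p = lrt_p_value(loglik_full, loglik_reduced)
    return p < alpha

# Hypothetical log-likelihoods of a full model and two reduced models.
stat, p = lrt_p_value(-1000.0, -1001.92)
kept = keep_predictor(-1000.0, -1001.92)      # clear loss of fit: keep
dropped = keep_predictor(-1000.0, -1000.5)    # negligible loss of fit: remove
```

With the lenient alpha of 0.15, a predictor is removed only when its absence leaves the fit essentially unchanged.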
We performed additional analyses where a model contained only one of the five predictability variables along with other control variables and random intercepts. These analyses were motivated by the discussion and recommendation of Wurm and Fisicaro (2014): in cases where collinearity is a concern, estimating the effect size of each variable in such simultaneous regression analyses is preferable to estimating effect sizes directly from multivariate regressions, regardless of whether residualized variables are used.
These procedures were performed with the entire dataset and in each of the four categories of boundary position (i.e., initial/medial, ante-penultimate, penultimate, final) to examine the effectiveness of these predictors in each position.
3.1 Overall results
All variables listed in Section 2.3 significantly contributed to the fit of the entire set of data points except for informativity given the previous word (β = −0.0046, p = 0.72) and unigram surprisal (β = 0.0133, p = 0.33). Significant fixed effects are summarized in Table 2. The summary of random effects in Table 3 shows that the random intercept of Monosyllabic morpheme accounts for more residual variance than the intercepts of Word and Speaker.
Table 2: Significant fixed effects in the final model for the full dataset.

| Predictor | β | SE | t | p | |
|---|---|---|---|---|---|
| Baseline duration | 0.2652 | 0.0102 | 26.05 | p < 0.0001 | |
| Speech rate | −0.1485 | 0.0041 | −36.38 | p < 0.0001 | |
| Neighborhood density | −0.0384 | 0.0117 | −3.29 | p < 0.01 | |
| Informativity given next | 0.0643 | 0.0113 | 5.70 | p < 0.0001 | |
| Surprisal given previous | 0.0373 | 0.0042 | 8.86 | p < 0.0001 | |
| Surprisal given next | 0.0487 | 0.0039 | 12.53 | p < 0.0001 | |
| Position: ante-penultimate | 0.1400 | 0.0100 | 14.06 | p < 0.0001 | (p < 0.0001) |
| Position: penultimate | 0.3848 | 0.0105 | 36.74 | p < 0.0001 | |
| Position: final | 1.0220 | 0.0125 | 82.04 | p < 0.0001 | |
| Base tone | 0.1740 | 0.0117 | 14.94 | p < 0.0001 | |
| Word length | −0.1662 | 0.0145 | −11.49 | p < 0.0001 | |
| Surface tone: de-stressed | 0.0347 | 0.0482 | 0.72 | p = 0.47 | (p < 0.0001) |
| Surface tone: T2 | 0.0135 | 0.0177 | 0.76 | p = 0.45 | |
| Surface tone: T3 | −0.0868 | 0.0207 | −4.19 | p < 0.0001 | |
| Surface tone: T4 | −0.0745 | 0.0321 | −2.33 | p < 0.05 | |
| Surface tone: T5 | 0.0392 | 0.0304 | 1.29 | p = 0.20 | |
| Surface tone: T7 | 0.0222 | 0.0179 | 1.24 | p = 0.21 | |
| Surface tone: T8 | −0.0622 | 0.0308 | −2.02 | p < 0.05 | |
| Surface tone: particle | −0.6616 | 0.0524 | −12.63 | p < 0.0001 | |
Among the predictability variables, higher informativity given the next word and surprisal given previous and next words are positively correlated with syllable duration. Informativity and surprisal are also similar in that they are better predictors of syllable duration when conditioned on the next word, as opposed to conditioned on the previous word.
Other variables in Table 2 show the expected effects. There is a positive relationship between baseline duration from a syllable’s segmental content and its actual duration. When the speech rate is faster, syllable duration is shorter. Syllable duration is longer when closer to a prosodic boundary, as shown in Figure 3. Post-hoc analysis (with the lsmeans package; Lenth 2016) of all pairwise comparisons between neighboring positions shows that syllables closer to the boundary are significantly longer (p < 0.0001 for all comparisons). In other words, there is a trisyllabic domain of pre-boundary lengthening. Syllables in the base tone, i.e., next to a tone sandhi group boundary, are also longer. Syllables in longer words tend to be shorter (main effect of word length). Neighborhood density shows a negative correlation with syllable duration, i.e., syllables with more phonological neighbors tend to be shorter. This is similar to the findings in Gahl et al. (2012) and Gahl and Strand (2016). Finally, the effects of surface tone show that particles and syllables with de-stressed tones are shorter.
Next, we present the effect sizes of all five predictability variables in five simultaneous regression models. Each model contains only one of the predictability variables along with the control variables. As discussed by Wurm and Fisicaro (2014), this method allows for a more direct interpretation of the effect sizes. The results are shown in Table 4. Informativity conditioned in both directions has larger effect sizes than other variables, with informativity conditioned on the following word having the larger effect. Unigram surprisal, which did not significantly improve model fit when the other five predictability measurements were present, becomes a significant effect in this setting, with an effect size that is larger than those of the bigram surprisals. The discrepancy may be attributed to unigram surprisal’s high correlation with informativity and bigram surprisal, especially the former (as shown in Figure 4). As a result, informativity and surprisal explain away a large share of the variance that unigram surprisal could have accounted for. This is a scenario that has been pointed out by Cohen Priva and Jaeger (2018) in a computational study on the relationship between frequency, informativity, and contextual predictability: they show that when informativity is not controlled for, there is an increased likelihood of observing spurious effects of frequency and predictability in the statistical analysis.
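Table 4: Effect sizes of the predictability variables in simultaneous regression models (full dataset).

| Predictor | β | SE | t | p |
|---|---|---|---|---|
| Informativity given previous | 0.0808 | 0.0098 | 8.27 | p < 0.0001 |
| Informativity given following | 0.1128 | 0.0109 | 10.31 | p < 0.0001 |
| Surprisal given previous | 0.0570 | 0.0040 | 14.30 | p < 0.0001 |
| Surprisal given following | 0.0620 | 0.0037 | 16.64 | p < 0.0001 |
| Unigram surprisal | 0.0892 | 0.0087 | 10.26 | p < 0.0001 |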
To summarize, informativity conditioned on the next word contributes to explaining durational variability on top of surprisal. According to multivariate mixed-effect modeling, it is a better predictor of syllable duration than lexical frequency (unigram surprisal), even though in simpler settings with only one predictability measurement, unigram surprisal has a comparable effect size.
3.2 Results in each prosodic position
This section reports the role of predictability in accounting for syllable durations in different prosodic positions. The analyses are again reported in two steps. We first present whether these measurements are significant fixed effects in the final model for each prosodic position following the step-wise elimination method. Then we report the effect sizes of the predictability variables in simultaneous regression analyses for a more straightforward interpretation.
Table 5 summarizes whether informativity and surprisal are significant predictors of syllable duration in the final model for data in each category of syllable position in an utterance. As the summary shows, there are indeed variations on whether a variable shows up as a significant predictor in each syllable position. There is no single predictor that is consistent across different positions.
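Table 5: Significant effects of informativity and surprisal in the final model for each syllable position.

| Predictor | | Initial/Medial | Ante-penultimate | Penultimate | Final |
|---|---|---|---|---|---|
| Informativity given previous | β | | 0.0732 | 0.0441 | 0.0708 |
| | p | | p < 0.001 | p < 0.05 | p < 0.001 |
| Informativity given next | β | 0.0786 | | | |
| | p | p < 0.0001 | | | |
| Surprisal given previous | β | 0.0455 | | | |
| | p | p < 0.0001 | | | |
| Surprisal given next | β | 0.0538 | | 0.0460 | −0.0619 |
| | p | p < 0.0001 | | p < 0.0001 | p < 0.01 |
| Unigram surprisal | p | | p < 0.0001 | p < 0.0001 | |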
Informativity conditioned on different directions exhibits a complementary relationship in this set of analyses: Informativity given the previous word is a predictor of syllable duration across ante-penultimate, penultimate, and final positions, while informativity given the next word is only a significant predictor in the initial/medial positions. This suggests that the significant effect of informativity given the following word in the main analysis may be carried by the initial and medial positions.
Surprisals conditioned in both directions are significant predictors in the initial/medial positions. In addition, surprisal given the following word is also a significant predictor in the penultimate and final positions but has a negative estimate in the final position. In other words, the positive effect of surprisal on syllable duration is mostly limited to syllables away from a discourse boundary where pre-boundary lengthening is not in effect.
Unigram surprisal, which was not a significant fixed effect in the overall analysis, is a significant predictor in penultimate and ante-penultimate positions in this analysis. Finally, neighborhood density is a consistent predictor except for the boundary position.
Next, we present the effect sizes of these predictability variables in simultaneous models in Figure 5. The major trend is that all predictors converge towards a small effect size in the final position, suggesting a nearly absolute neutralizing effect of a prosodic boundary. We also see that informativity is the most resistant to this neutralizing effect, as its effect sizes are larger than other measurements in the positive direction. These observations are confirmed by the statistical tests, summarized in Table 6.
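Table 6: Effect sizes of the predictability variables in simultaneous regression models for each syllable position.

| Predictor | | Initial/Medial | Ante-penultimate | Penultimate | Final |
|---|---|---|---|---|---|
| Informativity given previous | β | 0.0873 | 0.1353 | 0.1092 | 0.0466 |
| | p | p < 0.0001 | p < 0.0001 | p < 0.0001 | p < 0.05 |
| Informativity given following | β | 0.1323 | 0.1543 | 0.1376 | 0.0401 |
| | p | p < 0.0001 | p < 0.0001 | p < 0.0001 | p = 0.07 |
| Surprisal given previous | β | 0.0713 | 0.0472 | 0.0345 | 0.0071 |
| | p | p < 0.0001 | p < 0.001 | p < 0.001 | p = 0.49 |
| Surprisal given following | β | 0.0686 | 0.0425 | 0.0738 | −0.0325 |
| | p | p < 0.0001 | p < 0.0001 | p < 0.0001 | p = 0.09 |
| Unigram surprisal | p | p < 0.0001 | p < 0.0001 | p < 0.0001 | p = 0.33 |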
To sum up, there is evidence that predictability effects are neutralized towards a discourse and prosodic boundary, especially in the final position before the boundary, where the pre-boundary lengthening effect is the strongest. Informativity, especially when estimated given the previous word, is the most resistant to such a neutralizing effect.
3.3 Additional and exploratory analyses
3.3.1 Relationship between informativity and surprisal
This section discusses the relationship between informativity and surprisal. Since informativity is a weighted average of bigram surprisal for each word, it is important to understand how informativity and surprisal’s predictions of syllable duration diverge from each other. For this purpose, regression analyses were run by including informativity and surprisal conditioned in the same direction and the interaction between them.
Informativity and surprisal given the previous word have a nearly significant interaction in the penultimate position (β = −0.0137, SE = 0.0078, p = 0.08). The negative estimate suggests that when both informativity and surprisal were higher, the individual effect of each variable became smaller. This interaction can be observed by the less steep slope for lighter regression lines in the “penultimate” panels in Figures 6 and 7.
These figures also show that this neutralizing trend can be observed in other prosodic positions. It is also worth noting that in Figure 7, the relatively consistent pattern of lighter regression lines above darker ones, especially towards the left side of the graphs, suggests that even in low-surprisal contexts, syllables in high-informativity words were still longer than syllables in low-informativity words. The same separation is less obvious for surprisal given different levels of informativity in Figure 6.
On the other hand, informativity and surprisal conditioned on the following word show more interactions with each other, in the penultimate position (β = −0.0247, SE = 0.0091, p < 0.01) and the ante-penultimate position (β = −0.023, SE = 0.0091, p < 0.01). Similar to informativity and surprisal given the previous word, the interaction was in the opposite direction of the main effects: when informativity and surprisal given the next word are both high, the lengthening effect is capped. Figures 8 and 9 visualize the interactions. In Figure 9, we again observe the trend that the separation between words with different levels of informativity was consistent across different levels of surprisal (except for the final position), which is not true for the separation between different surprisal levels across different degrees of informativity, as shown in Figure 8.
To sum up, informativity and surprisal mostly interact in the direction where highly informative words are not as lengthened in high-surprisal contexts and vice versa, which may be interpreted as a ceiling effect for predictability.
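The interaction terms reported above can be read through the marginal effects they imply. As a sketch, with notation introduced here only for illustration ($\hat{y}$ for the modeled duration measure, $I$ and $S$ for z-scored informativity and surprisal conditioned in the same direction, and all other predictors omitted):

```latex
\hat{y} = \beta_0 + \beta_I I + \beta_S S + \beta_{IS}\, I S
\quad\Longrightarrow\quad
\frac{\partial \hat{y}}{\partial S} = \beta_S + \beta_{IS}\, I ,
\qquad
\frac{\partial \hat{y}}{\partial I} = \beta_I + \beta_{IS}\, S .
```

With a negative $\beta_{IS}$, such as the estimate of −0.0247 for the following-word measures in the penultimate position, each one-standard-deviation increase in informativity lowers the slope of surprisal by 0.0247, and vice versa, which corresponds to the ceiling pattern described here.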
3.3.2 Syllable/morpheme-level modeling
We explored an alternative analysis using predictability measurements at the level of monosyllabic morphemes. Trigram language models were trained without word segmentation, i.e., monosyllabic morphemes were treated as separate units. Similar to the main analysis, language models were also trained in both directions. Bigram informativity and surprisal conditioned on previous and following contexts were obtained from the trained models.
The same set of fixed effects from the main analysis was used, except for word length, since word-level units were not taken into account in this analysis. Similarly, Word was not included as a random intercept, leaving the model with random intercepts of Speaker and Morpheme.
Fixed effects from regression analysis with step-wise elimination are shown in Table 7, with the random effects shown in Table 8. All variables except for unigram surprisal contributed significantly to model fit. This is slightly different from the main analysis reported in Section 3.1, where unigram surprisal and informativity given the previous word were not significant contributors to model fit.
Table 7: Fixed effects in the final model of the syllable/morpheme-level analysis.

| Predictor | β | SE | t | p | |
|---|---|---|---|---|---|
| Baseline duration | 0.2470 | 0.0095 | 25.98 | p < 0.0001 | |
| Speech rate | −0.1584 | 0.0041 | −38.39 | p < 0.0001 | |
| Neighborhood density | −0.0583 | 0.0111 | −5.26 | p < 0.0001 | |
| Informativity given previous | 0.0304 | 0.0132 | 2.30 | p < 0.05 | |
| Informativity given following | 0.0290 | 0.0130 | 2.23 | p < 0.05 | |
| Surprisal given previous | 0.0875 | 0.0042 | 20.91 | p < 0.0001 | |
| Surprisal given following | 0.0593 | 0.0044 | 13.57 | p < 0.0001 | |
| Position: ante-penultimate | 0.1591 | 0.0100 | 15.90 | p < 0.0001 | (p < 0.0001) |
| Position: penultimate | 0.3931 | 0.0105 | 37.54 | p < 0.0001 | |
| Position: final | 1.0290 | 0.0125 | 82.15 | p < 0.0001 | |
| Base tone | 0.2011 | 0.0117 | 17.20 | p < 0.0001 | |
| Surface tone: de-stressed | −0.0848 | 0.0430 | −1.97 | p = 0.05 | (p < 0.0001) |
| Surface tone: T2 | 0.0333 | 0.0171 | 1.95 | p < 0.05 | |
| Surface tone: T3 | −0.0808 | 0.0198 | −4.07 | p < 0.001 | |
| Surface tone: T4 | −0.0617 | 0.0302 | −2.04 | p < 0.05 | |
| Surface tone: T5 | 0.0823 | 0.0293 | 2.81 | p < 0.01 | |
| Surface tone: T7 | 0.0379 | 0.0171 | 2.21 | p < 0.05 | |
| Surface tone: T8 | −0.0657 | 0.0293 | −2.24 | p < 0.05 | |
| Surface tone: particle | −0.6445 | 0.0462 | −13.96 | p < 0.0001 | |
Compared with word-level informativity and surprisal, there is a smaller degree of correlation between morpheme-level informativity and surprisal. On the other hand, the correlation between unigram surprisal and informativity is still strong, and so is the correlation between informativity conditioned in two directions, as shown in Figure 10. For this reason, we again report each predictability measurement’s effect sizes in simultaneous regression models that only included one of the predictability variables along with other control variables.
Results are shown in Table 9. Surprisal given the previous syllable/morpheme has the largest effect size, followed by informativity conditioned in both directions, which is in turn followed by surprisal given the following syllable/morpheme and unigram surprisal. These results differ from what we saw in the analysis with word-level informativity and surprisal, where unigram surprisal and informativity had larger effect sizes than surprisal. In other words, while a word’s overall frequency and weighted average of contextual predictability could predict syllable duration better than the predictability of specific word tokens, the predictability of specific syllable/morpheme tokens was a better predictor of syllable duration than a syllable/morpheme’s overall frequency or average predictability.
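Table 9: Effect sizes of the syllable/morpheme-level predictability variables in simultaneous regression models.

| Predictor | β | SE | t | p |
|---|---|---|---|---|
| Informativity given previous | 0.0781 | 0.0111 | 7.02 | p < 0.0001 |
| Informativity given following | 0.0791 | 0.0109 | 7.26 | p < 0.0001 |
| Surprisal given previous | 0.0869 | 0.0042 | 20.91 | p < 0.0001 |
| Surprisal given following | 0.0579 | 0.0044 | 13.27 | p < 0.0001 |
| Unigram surprisal | 0.0556 | 0.0097 | 5.74 | p < 0.0001 |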
The positioning of word boundaries, which is not a variable in this set of morpheme-based analyses, is a potential factor that explains bigram surprisal’s success in this setting. Given syllables x and y, the bigram surprisal of x given y, i.e., −log p(x|y), is likely to be low if x and y are within the same word, and the duration of x is likely to be shortened as well. This covariation between bigram surprisal and syllable duration based on the location of word boundaries may contribute to the success of surprisal in predicting syllable duration.
To test whether this is the case, we ran another set of analyses that included two additional variables: “L/R-side #” (# = word boundary), i.e., for each syllable, whether there was a word boundary on the left or right when word segmentation was considered. We first ran two simple mixed-effect regression models that had L/R-side # as the independent variable and surprisal given the previous/next word as the dependent variable, with “monosyllabic morpheme” added as a random intercept. The results showed that surprisal was significantly lower when the bigram did not cross a word boundary (L-side # predicting surprisal given previous morpheme: β = −1.04, SE = 0.01, p < 0.0001; R-side # predicting surprisal given next morpheme: β = −1.11, SE = 0.01, p < 0.0001).
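The two boundary variables can be derived directly from a word-segmented utterance, as in the following sketch. It assumes each word is given as a tuple of syllables and that the utterance edges count as word boundaries; the function name and the toy input are hypothetical.

```python
def boundary_flags(words):
    """For each syllable, flag a word boundary on its left and on its right."""
    rows = []
    for word in words:
        last = len(word) - 1
        for i, syllable in enumerate(word):
            # Word-initial syllables have a boundary on the left,
            # word-final syllables have a boundary on the right.
            rows.append((syllable, i == 0, i == last))
    return rows

# Hypothetical segmented utterance: one disyllabic and one monosyllabic word.
flags = boundary_flags([("tai", "oan"), ("lang",)])
```

A monosyllabic word is flagged on both sides, while the second syllable of a disyllabic word has no boundary on its left, which is the configuration in which a low bigram surprisal given the previous morpheme is expected.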
Then we ran the main regression analysis while adding these two new variables. In the main model, L-side # had a significant effect on syllable duration: when the left side of a syllable was within the same word (β = −0.1545, SE = 0.01, p < 0.0001), the syllable was shorter. The same effect, however, is not found for R-side #.
Finally, we ran simultaneous regression analyses in each prosodic position for each predictability variable. The results are shown in Figure 11. There are two main differences from word-level analysis in Figure 5. First, informativity is no longer significant in the final position, i.e., unlike word informativity, the effect of morpheme informativity was neutralized in the final position. Second, the effect of surprisal given the previous morpheme was not neutralized in the final position. Again, this is likely a result of how the presence of word boundary before a syllable strongly predicts both syllable duration and bigram surprisal, leading to a robust effect of bigram surprisal in predicting duration.
To sum up, similar to the analysis with word-level information, higher morpheme-level informativity and surprisal were correlated with longer syllable duration. These predictability effects were also neutralized towards a discourse and prosodic boundary. In the pre-boundary position, the effect of morpheme-level informativity was not as strong as word-level informativity, while surprisal given the previous morpheme remained a significant predictor in that position.
The results show that informativity as a global measurement of a word’s predictability can account for durational variability on top of the effects of surprisal, a measurement of local contextual predictability. Additional analyses show that syllable/morpheme-level informativity also has the same effects. These results are consistent with what previous studies found on the correlation between informativity and phonetic variability (Cohen Priva 2008, 2012, 2015; Seyfarth 2014). This suggests that whether a word is predictable in general is a reliable predictor of its phonetic variants. In this specific case of durational variability, it shows that if a word is often predictable in the contexts in which it usually occurs, it is likely to be shortened regardless of the actual context of occurrence. This is consistent with the view that a word has its own set of specific variants (in this case, governed by its informativity), which is in turn the foundation of the hypothesis that phonetic variants are stored in the lexicon.
To address the interaction between informativity/surprisal and prosodic marking of boundaries, the effects of the main predictability variables were also examined as a function of syllable position relative to the discourse and prosodic boundary. There is a trend that syllables closer to the boundary have weaker predictability effects. While all predictability measurements show significant effects on syllable duration in phrase initial/medial, ante-penultimate, and penultimate positions, in the final syllables, the only significant positive correlation with syllable duration was found for informativity. This shows that predictability plays a lesser role in affecting durational variation for syllables closer to a boundary. The inverse relationship between neighborhood density and syllable duration was also neutralized in the final syllable. These results suggest that pre-boundary lengthening might have already encoded much of the relationship between language redundancy and acoustic salience, i.e., syllables towards the boundary generally carry more information content and need to be strengthened. As a result, there is only a small portion of phonetic variability that predictability directly explained. This finding is in line with the interpretation offered by Malisz et al. (2018), i.e., a weaker version of the Smooth Signal Redundancy Hypothesis. We leave it to future research to explore whether this modulating effect can be found for other types and levels of boundaries, e.g., syntactic boundaries that correspond to different levels of prosodic boundaries.
Given its significant effect on the final syllable, informativity may be considered the most consistent predictor of syllable duration among all the predictability measurements. This again suggests that word-specific variants (or, more conservatively, informativity-specific variants) likely exist so that the overall prosodic strengthening does not normalize its effect.
It should be noted that lower lexical frequency, represented by a higher smoothed unigram surprisal, is also found to be consistently correlated with longer syllable duration in this study. However, in multivariate regression analyses with informativity and bigram surprisal also included as predictors, unigram surprisal failed to be a predictor that significantly improved overall model fit, despite the apparent effect and large effect size when it is the only predictability variable in the model. This is likely due to lexical frequency’s interaction with other variables. In other words, the effects we show for informativity and surprisal may be partially attributed to lexical frequency, even though the regression analysis suggests that informativity and surprisal are better predictors in the statistical model. As mentioned earlier, this is a scenario described in Cohen Priva and Jaeger (2018) that suggests a likelihood of spurious frequency and predictability effects when informativity is not controlled for.
We also show a potential complementary relationship between surprisal and informativity. The positive correlation between bigram surprisal and duration is less strong for high-informativity words. The same ceiling effect is true in the other direction: When a word occurred in a high surprisal context, its informativity had a weaker effect on durational variability. A possible interpretation is that local and global predictability act together as a processing factor. It may also suggest that the effect of predictability is capped by the overall prosodic profile of the utterance so that it is not possible for all the parametric factors to have additive effects.
To conclude, this study presents evidence that informativity is a useful predictor of durational variability in Taiwan Southern Min, on top of the local predictability measurements that bigram surprisals represent. By controlling for prosodic boundary type and examining the effects in different utterance positions, we find that local surprisal measurements are more modulated by boundary marking, while the effect of word-specific informativity is less affected by the prosodic structure. This supports a weaker version of the hypothesis that prosodic structure mediates the relationship between language redundancy and acoustic salience. The strong effect of informativity also lends potential support to the view that word-specific phonetic variants may be part of the lexical representation, as they are less affected by local contexts.
I thank the editor-in-chief Catherine T. Best, the associate editor Sonia Frota, and two anonymous reviewers for their constructive feedback. I also would like to thank Shu-Chuan Tseng and the phonetics research lab at the Institute of Linguistics, Academia Sinica, for their support and the insightful discussions on earlier versions of this work.
Andreeva, Bistra, Bernd Möbius & James Whang. 2020. Effects of surprisal and boundary strength on phrase-final lengthening. In Proceedings of the 10th International Conference on Speech Prosody, 146–150. Tokyo, Japan: ISCA. https://doi.org/10.21437/SpeechProsody.2020-30.
Aylett, Matthew & Alice Turk. 2004. The smooth signal redundancy hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech 47(1). 31–56. https://doi.org/10.1177/00238309040470010201.
Aylett, Matthew & Alice Turk. 2006. Language redundancy predicts syllabic duration and the spectral characteristics of vocalic syllable nuclei. Journal of the Acoustical Society of America 119(5). 3048–3058. https://doi.org/10.1121/1.2188331.
Baese-Berk, Melissa & Matthew Goldrick. 2009. Mechanisms of interaction in speech production. Language & Cognitive Processes 24(4). 527–554. https://doi.org/10.1080/01690960802299378.
Bates, Douglas, Martin Mächler, Ben Bolker & Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67(1). 1–48. https://doi.org/10.18637/jss.v067.i01.
Bell, Alan, Jason M. Brenier, Michelle Gregory, Cynthia Girand & Dan Jurafsky. 2009. Predictability effects on durations of content and function words in conversational English. Journal of Memory and Language 60(1). 92–111. https://doi.org/10.1016/j.jml.2008.06.003.
Benner, Uta, Ines Flechsig, Grzegorz Dogil & Bernd Möbius. 2007. Coarticulatory resistance in a mental syllabary. In Proceedings of the 16th International Congress of Phonetic Sciences, 485–488. Dudweiler: Pirrot.
Brandt, Erika, Bernd Möbius & Bistra Andreeva. 2021. Dynamic formant trajectories in German read speech: Impact of predictability and prominence. Frontiers in Communication 6. 643528. https://doi.org/10.3389/fcomm.2021.643528.
Brandt, Erika, Frank Zimmerer, Bistra Andreeva & Bernd Möbius. 2018. Impact of prosodic structure and information density on dynamic formant trajectories in German. In Proceedings of the 9th International Conference on Speech Prosody, 119–123. Poznań, Poland: ISCA. https://doi.org/10.21437/SpeechProsody.2018-24.
Browman, Catherine P. & Louis Goldstein. 1990. Tiers in articulatory phonology, with some implications for casual speech. In John Kingston & Mary E. Beckman (eds.), Papers in laboratory phonology I: Between the grammar and physics of speech, 341–376. Cambridge University Press. https://doi.org/10.1017/CBO9780511627736.019.
Bürki, Audrey, Mirjam Ernestus & Ulrich H. Frauenfelder. 2010. Is there only one “fenêtre” in the production lexicon? On-line evidence on the nature of phonological representations of pronunciation variants for French schwa words. Journal of Memory and Language 62(4). 421–437. https://doi.org/10.1016/j.jml.2010.01.002.
Bybee, Joan & Joanne Scheibman. 1999. The effect of usage on degrees of constituency: The reduction of don’t in English. Linguistics 37(4). 575–596. https://doi.org/10.1515/ling.37.4.575.
Chen, Stanley F. & Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13(4). 359–394. https://doi.org/10.1006/csla.1999.0128.
Cieri, Christopher, David Graff, Owen Kimball, Dave Miller & Kevin Walker. 2005. Fisher English training part 2, transcripts LDC2005T19. Web Download, Philadelphia: Linguistic Data Consortium.
Cohen Priva, Uriel. 2008. Using information content to predict phone deletion. In Proceedings of the 27th West Coast Conference on Formal Linguistics, 90–98. Somerville, MA: Cascadilla Proceedings Project.
Cohen Priva, Uriel. 2012. Sign and signal: Deriving linguistic generalizations from information utility. Stanford University PhD thesis.
Cohen Priva, Uriel & T. Florian Jaeger. 2018. The interdependence of frequency, predictability, and informativity in the segmental domain. Linguistics Vanguard 4(s2). 20170028. https://doi.org/10.1515/lingvan-2017-0028.
Connine, Cynthia M. 2004. It’s not what you hear but how often you hear it: On the neglected role of phonological variant frequency in auditory word recognition. Psychonomic Bulletin & Review 11(6). 1084–1089. https://doi.org/10.3758/bf03196741.
Connine, Cynthia M., Larissa J. Ranbom & David J. Patterson. 2008. Processing variant forms in spoken word recognition: The role of variant frequency. Perception & Psychophysics 70(3). 403–411. https://doi.org/10.3758/pp.70.3.403.
Croot, Karen & Kathleen Rastle. 2004. Is there a syllabary containing stored articulatory plans for speech production in English? In Steve Cassidy, Felicity Cox, Robert Mannell & Sallyanne Palethorpe (eds.), Proceedings of the 10th Australian International Conference on Speech Science and Technology, 376–381. Canberra, Australia: Australian Speech Science and Technology Association.
Cutler, Anne & David M. Carter. 1987. The predominance of strong initial syllables in the English vocabulary. Computer Speech & Language 2(3–4). 133–142. https://doi.org/10.1016/0885-2308(87)90004-0.
Deelman, Thomas & Cynthia M. Connine. 2001. Missing information in spoken word recognition: Nonreleased stop consonants. Journal of Experimental Psychology: Human Perception and Performance 27(3). 656. https://doi.org/10.1037/0096-1523.27.3.656.
Fon, Janice, Keith Johnson & Sally Chen. 2011. Durational patterning at syntactic and discourse boundaries in Mandarin spontaneous speech. Language and Speech 54(1). 5–32. https://doi.org/10.1177/0023830910372492.
Fon, Yee-Jean Janice. 2002. A cross-linguistic study on syntactic and discourse boundary cues in spontaneous speech. The Ohio State University PhD thesis.
Fougeron, Cécile & Patricia A. Keating. 1997. Articulatory strengthening at edges of prosodic domains. Journal of the Acoustical Society of America 101(6). 3728–3740. https://doi.org/10.1121/1.418332.
Gahl, Susanne. 2008. Time and thyme are not homophones: The effect of lemma frequency on word durations in spontaneous speech. Language 84(3). 474–496. https://doi.org/10.1353/lan.0.0035.
Gahl, Susanne. 2015. Lexical competition in vowel articulation revisited: Vowel dispersion in the easy/hard database. Journal of Phonetics 49. 96–116. https://doi.org/10.1016/j.wocn.2014.12.002.
Gahl, Susanne & Julia F. Strand. 2016. Many neighborhoods: Phonological and perceptual neighborhood density in lexical production and perception. Journal of Memory and Language 89. 162–178. https://doi.org/10.1016/j.jml.2015.12.006.
Gahl, Susanne, Yao Yao & Keith Johnson. 2012. Why reduce? Phonological neighborhood density and phonetic reduction in spontaneous speech. Journal of Memory and Language 66(4). 789–806. https://doi.org/10.1016/j.jml.2011.11.006.
Hashimoto, Daiki. 2021. Probabilistic reduction and mental accumulation in Japanese: Frequency, contextual predictability, and average predictability. Journal of Phonetics 87. 101061. https://doi.org/10.1016/j.wocn.2021.101061.
Iunn, Un-Gian. 2005. Taiwanese corpus collection and corpus based syllable/word frequency counts for written Taiwanese [tâi-gún-bûn gú-liāu-khò so-chip kap gú-liāu-khò ûi pun tâi-gú su-bīn-gú im-chiat sû-pîn tóng-kè]. Final report for research project funded by the National Science Council (NSC 93-2213-E-122-001).
Johnson, Keith. 2007. Decisions and mechanisms in exemplar-based phonology. In Maria-Josep Sole, Patrice Speeter Beddor & Manjari Ohala (eds.), Experimental approaches to phonology, 25–40. Oxford, UK: Oxford University Press.
Jurafsky, Daniel, Alan Bell & Cynthia Girand. 2002. The role of the lemma in form variation. In Carlos Gussenhoven & Natasha Warner (eds.), Laboratory phonology 7, 3–34. De Gruyter, Inc.: Berlin/Boston.
Jurafsky, Daniel, Alan Bell, Michelle Gregory & William D. Raymond. 2001. Probabilistic relations between words: Evidence from reduction in lexical production. Typological Studies in Language 45. 229–254. https://doi.org/10.1075/tsl.45.13jur.
Kemps, Rachèl, Mirjam Ernestus, Robert Schreuder & Harald Baayen. 2004. Processing reduced word forms: The suffix restoration effect. Brain and Language 90(1–3). 117–127. https://doi.org/10.1016/s0093-934x(03)00425-5.
Kuo, Chen-Hsiu. 2013. Perception and acoustic correlates of the Taiwanese tone sandhi group. UCLA PhD thesis.
Lavoie, Lisa. 2002. Some influences on the realization of for and four in American English. Journal of the International Phonetic Association 32(2). 175–202. https://doi.org/10.1017/s0025100302001032.
Lieberman, Philip. 1963. Some effects of semantic and grammatical context on the production and perception of speech. Language and Speech 6(3). 172–187. https://doi.org/10.1177/002383096300600306.
Lindblom, Björn. 1968. Temporal organization of syllable production. Quarterly Progress and Status Report 9(2–3). 1–5.
Lindblom, Björn. 1990. Explaining phonetic variation: A sketch of the H&H theory. In William J. Hardcastle & Alan Marchal (eds.), Speech production and speech modelling. Dordrecht: Springer. https://doi.org/10.1007/978-94-009-2037-8_16.
Losiewicz, Beth L. 1992. The effect of frequency on linguistic morphology. The University of Texas at Austin PhD thesis.
Malisz, Zofia, Erika Brandt, Bernd Möbius, Yoon Mi Oh & Bistra Andreeva. 2018. Dimensions of segmental variability: Interaction of prosody and surprisal in six languages. Frontiers in Communication 3. 25. https://doi.org/10.3389/fcomm.2018.00025.
Munson, Benjamin. 2007. Lexical access, lexical representation, and vowel production. Laboratory Phonology 9. 201–228.
Munson, Benjamin & Nancy Pearl Solomon. 2004. The effect of phonological neighborhood density on vowel articulation. Journal of Speech, Language, and Hearing Research 47(5). 1048–1058. https://doi.org/10.1044/1092-4388(2004/078).
Oller, D. Kimbrough. 1973. The effect of position in utterance on speech segment duration in English. Journal of the Acoustical Society of America 54(5). 1235–1247. https://doi.org/10.1121/1.1914393.
Pan, Ho-hsien & Hsiao-tung Huang. 2020. Lexical propensity and Taiwanese Min tone sandhi rules. In Proceedings of the 10th International Conference on Speech Prosody, 518–522. ISCA. https://doi.org/10.21437/SpeechProsody.2020-106.
Pan, Ho-hsien, Hsiao-tung Huang & Lyu Shao-ren. 2019. The occurrence of Taiwanese Min juncture tones before prosodic boundaries and modification marker. In Proceedings of the 19th International Congress of Phonetic Sciences, 3423–3427. Canberra, Australia: Australasian Speech Science and Technology Association Inc.
Peng, Shu-hui & Mary E. Beckman. 2003. Annotation conventions and corpus design in the investigation of spontaneous speech prosody in Taiwanese. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition. ISCA.
Pierce, John Robinson. 1972. Symbols, signals, and noise: The nature and process of communication. Foundations of Language 9(1). 150–151.
Pierrehumbert, Janet. 2001. Exemplar dynamics: Word frequency, lenition and contrast. In Joan Bybee & Paul Hopper (eds.), Frequency and the emergence of linguistic structure, 137–157. Amsterdam: John Benjamins. https://doi.org/10.1075/tsl.45.08pie.
Pitt, Mark A. 2009. How are pronunciation variants of spoken words recognized? A test of generalization to newly learned words. Journal of Memory and Language 61(1). 19–36. https://doi.org/10.1016/j.jml.2009.02.005.
Pitt, Mark A., Laura Dilley & Michael Tat. 2011. Exploring the role of exposure frequency in recognizing pronunciation variants. Journal of Phonetics 39(3). 304–311. https://doi.org/10.1016/j.wocn.2010.07.004.
Pluymaekers, Mark, Mirjam Ernestus & R. Harald Baayen. 2005a. Articulatory planning is continuous and sensitive to informational redundancy. Phonetica 62(2–4). 146–159. https://doi.org/10.1159/000090095.
Pluymaekers, Mark, Mirjam Ernestus & R. Harald Baayen. 2005b. Lexical frequency and acoustic reduction in spoken Dutch. Journal of the Acoustical Society of America 118(4). 2561–2569. https://doi.org/10.1121/1.2011150.
Prévot, Laurent, Shu-Chuan Tseng, Klim Peshkov & Alvin Cheng-Hsien Chen. 2015. Processing units in conversation: A comparative study of French and Mandarin data. Language and Linguistics 16(1). 69–92. https://doi.org/10.1177/1606822x14556605.
Racine, Isabelle & François Grosjean. 2005. Le coût de l’effacement du schwa lors de la reconnaissance des mots en français. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale 59(4). 240. https://doi.org/10.1037/h0088052.
Ranbom, Larissa J. & Cynthia M. Connine. 2007. Lexical representation of phonological variation in spoken word recognition. Journal of Memory and Language 57(2). 273–298. https://doi.org/10.1016/j.jml.2007.04.001.
Scarborough, Rebecca. 2010. Lexical and contextual predictability: Confluent effects on the production of vowels. Laboratory Phonology 10. 557–586.
Scarborough, Rebecca. 2013. Neighborhood-conditioned patterns in phonetic detail: Relating coarticulation and hyperarticulation. Journal of Phonetics 41(6). 491–508. https://doi.org/10.1016/j.wocn.2013.09.004.
Schweitzer, Antje & Bernd Möbius. 2004. Exemplar-based production of prosody: Evidence from segment and syllable durations. In Proceedings of the 2nd International Conference on Speech Prosody. Nara, Japan: ISCA.
Seyfarth, Scott. 2014. Word informativity influences acoustic duration: Effects of contextual predictability on lexical representation. Cognition 133(1). 140–155. https://doi.org/10.1016/j.cognition.2014.06.013.
Shannon, Claude Elwood. 1948. A mathematical theory of communication. The Bell System Technical Journal 27(3). 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
Sharp, Alan E. 1960. The analysis of stress and juncture in English. Transactions of the Philological Society 59(1). 104–135. https://doi.org/10.1111/j.1467-968x.1960.tb00312.x.
Shaw, Jason A. & Shigeto Kawahara. 2019. Effects of surprisal and entropy on vowel duration in Japanese. Language and Speech 62(1). 80–114. https://doi.org/10.1177/0023830917737331.
Shih, Shu-hao. 2017. Binarity and focus in prosodic phrasing: New evidence from Taiwan Mandarin. In Proceedings of the Annual Meetings on Phonology, 4. LSA. https://doi.org/10.3765/amp.v4i0.3988.
Sóskuthy, Márton & Jennifer Hay. 2017. Changing word usage predicts changing word durations in New Zealand English. Cognition 166. 298–313. https://doi.org/10.1016/j.cognition.2017.05.032.
Stolcke, Andreas. 2002. SRILM-an extensible language modeling toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002). Denver, USA: ISCA. https://doi.org/10.21437/ICSLP.2002-303.
Stolcke, Andreas, Jing Zheng, Wen Wang & Victor Abrash. 2011. SRILM at sixteen: Update and outlook. In Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, 5. Waikoloa: ASRU.
Tang, Kevin & Ryan Bennett. 2018. Contextual predictability influences word and morpheme duration in a morphologically complex language (Kaqchikel Mayan). Journal of the Acoustical Society of America 144(2). 997–1017. https://doi.org/10.1121/1.5046095.
Turk, Alice. 2010. Does prosodic constituency signal relative predictability? A smooth signal redundancy hypothesis. Laboratory Phonology 1(2). 227–262. https://doi.org/10.1515/labphon.2010.012.
Turk, Alice & Stefanie Shattuck-Hufnagel. 2014. Timing in talking: What is it used for, and how is it controlled? Philosophical Transactions of the Royal Society B: Biological Sciences 369(1658). 20130395. https://doi.org/10.1098/rstb.2013.0395.
Turk, Alice & Stefanie Shattuck-Hufnagel. 2020a. Speech timing: Implications for theories of phonology, speech production, and speech motor control, vol. 5. USA: Oxford University Press. https://doi.org/10.1093/oso/9780198795421.001.0001.
Turk, Alice & Stefanie Shattuck-Hufnagel. 2020b. Timing evidence for symbolic phonological representations and phonology-extrinsic timing in speech production. Frontiers in Psychology 10. 2952. https://doi.org/10.3389/fpsyg.2019.02952.
Van Son, Rob J. J. H. & Louis C. W. Pols. 2003. How efficient is speech. In Proceedings of the Institute of Phonetic Sciences, 25, 171–184. University of Amsterdam.
Van Son, Rob J. J. H. & Jan P. H. Van Santen. 2005. Duration and spectral balance of intervocalic consonants: A case for efficient communication. Speech Communication 47(1–2). 100–123. https://doi.org/10.1016/j.specom.2005.06.005.
Vitevitch, Michael S. 2002. The influence of phonological similarity neighborhoods on speech production. Journal of Experimental Psychology: Learning, Memory, and Cognition 28(4). 735. https://doi.org/10.1037/0278-7393.28.4.735.
Wang, Sheng-Fu. 2013. Durational cues at discourse boundaries in Taiwan Southern Min spontaneous speech. National Taiwan University Master’s thesis.
Wang, Sheng-Fu & Janice Fon. 2012. Durational cues at discourse boundaries in Taiwan Southern Min. In Proceedings of the 6th International Conference on Speech Prosody. ISCA.
Wang, Sheng-Fu & Janice Fon. 2013. A Taiwan Southern Min spontaneous speech corpus for discourse prosody. In The Proceedings of Tools and Resources for the Analysis of Speech Prosody, 20–23. Aix-en-Provence, France: Laboratoire Parole et Langage.
Wang, Sheng-Fu & Janice Fon. 2015. Syllable duration and discourse organization at intonational phrase boundaries in Taiwan Southern Min. In Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow, UK.
Whang, James. 2019. Effects of phonotactic predictability on sensitivity to phonetic detail. Laboratory Phonology: Journal of the Association for Laboratory Phonology 10(1). 1–28. https://doi.org/10.5334/labphon.125.
Whiteside, Sandra P. & Rosemary A. Varley. 1998. Dual-route phonetic encoding: Some acoustic evidence. In Proceedings of the 5th International Conference on Spoken Language Processing. Sydney, Australia: Australian Speech Science and Technology Association. https://doi.org/10.21437/ICSLP.1998-812.
Wurm, Lee H. & Sebastiano A. Fisicaro. 2014. What residualizing predictors in regression analyses does (and what it does not do). Journal of Memory and Language 72. 37–48. https://doi.org/10.1016/j.jml.2013.12.003.
© 2022 Walter de Gruyter GmbH, Berlin/Boston