Natural speech is dynamic. It occurs in live communication among speakers under a variety of conditions and contexts, and para- or extra-linguistic factors such as the emotional intent of speakers and the cognitive constraints of speakers or listeners can impact the realization of phonology. In many languages, for example, fundamental frequency (F0) is raised when the speaker is angry and lowered when s/he is sad (e.g., Williams and Stevens 1972). Utterances spoken by individuals with autism reportedly have monotonous or atypical F0 contours (cf. McCann and Peppé 2003). Dynamic changes are observed in other aspects of prosody, such as voice quality and speech rate, as well as acoustic and distributional characteristics of segments in various speech contexts.
The theoretical frameworks for phonology, in contrast, have been largely based on the analysis of speech produced in idealized conditions, such as controlled laboratory speech. Although careful analysis of controlled speech is crucial for capturing static aspects of phonology, it works to eliminate the variability caused by the dynamic aspects of phonology. By “dynamic,” we refer to aspects of phonology that allow systematic flexibilities in the phonetic realization of various phonological phenomena to accommodate specific communicative needs of speakers and hearers under various contexts. In this paper, we will argue that the dynamic aspects are an integral part of a language’s phonology, not random variation, and as such, they are constrained by the system of phonology. To fully capture the nature of the phonology of human language, it will be important to expand the scope of research beyond idealized speech and include more speech data produced under various real-life conditions.
To this end, analyses of how various aspects of phonology are modified in specialized registers of speech, such as infant-directed speech (IDS), or read speech (RS), as opposed to adult-directed speech (ADS), could shed new light on the dynamic aspects of phonology. In the present paper, we will present results from two studies of Japanese infant-directed speech that explore dynamic aspects of intonation and segmental properties of human speech. In both studies, we found that IDS is not simply more variable or noisy than ADS, but instead differs in systematic, interpretable ways, indicating that the phonological system itself is dynamic; that is, it can shift the realization of rules to accommodate the specific paralinguistic factors of a given register. Such dynamic properties, which must be one of the ways the phonological structure of a language is implemented phonetically, can be observed only by comparing different registers.
1.1 Infant-directed speech (IDS)
When adults talk to infants, they use a specific speech register known as IDS. An extensive body of research has examined in what ways IDS differs from adult-directed speech (ADS) among the world’s languages (see Soderstrom 2007 for a review). For example, IDS is characterized by overall higher pitch (Ferguson 1964; Fernald and Simon 1984), a greater pitch range (Fernald and Simon 1984), larger pitch peaks on focused words (Fernald and Mazzie 1991), slower speech rate (Bryant and Barrett 2007), and more distinct phonetic categories (Malsheen 1980; Masataka 1992; Andruski and Kuhl 1996; Kuhl et al. 1997; Burnham et al. 2002; Liu et al. 2007; Cristià 2010) when compared to ADS. In the literature, these characteristics of IDS have been argued to play important roles in communication between infants and caregivers, such as capturing the infants’ attention, communicating affect, and facilitating the infants’ language development (e.g., Fernald 1989; Kitamura et al. 2002). The prosodic and segmental characteristics of IDS can, therefore, provide us with an ideal window through which the dynamic aspects of phonology can be investigated.
At the same time, research on IDS to date has largely been based on physical measurements of the speech without reference to the phonological structure of the language. For example, ‘exaggeration of intonation’ in IDS has been measured simply by subtracting the minimum F0 value from the maximum of each utterance. In the following sections, we will demonstrate that phonologically informed analyses of IDS can give rise to important new insights on the nature of IDS that had been overlooked, or not fully examined in previous studies. We will use two published studies of Japanese IDS as examples, one involving pitch exaggeration and the other vowel devoicing.
1.2 Data: RIKEN Japanese Mother-Infant Conversation Corpus
Both studies discussed in this paper use a corpus of Japanese IDS (RIKEN Japanese Mother-Infant Conversation Corpus, R-JMICC; Mazuka et al. 2006). The corpus consists of recordings from 22 mothers (age 25–43, average age 33.8), all native speakers from the Tokyo area, and their infants (18–24 months, average 20.4, 9 girls). Each mother was recorded in two situations: playing with her infant using a variety of toys and picture books, and speaking with the experimenter. Speech recorded during the first situation, in which the mother was alone with her infant, will be considered infant-directed, while speech recorded while speaking with the experimenter will be considered adult-directed. Approximately 40 minutes of speech in all were recorded for each mother, resulting in a total corpus length of 14 hours.
The R-JMICC contains various linguistic annotations including segmental, morphological, and intonational labels. Segmental labels, representing types and duration of vowels and consonants, were time-locked to the speech signals. Intonational labeling was based on the X-JToBI scheme (Maekawa et al. 2002), which provides, among other things, information on two levels of prosodic phrasing (Accentual Phrase, AP, and Intonational Phrase, IP), lexical pitch accents, and Boundary Pitch Movements (BPMs).
In addition, recordings of careful, read speech (RS) were collected from 20 of the original 22 mothers, who returned for additional recordings a few years later. In the RS recording, all participants read the same text, consisting of a list of 115 sentences (whose order was randomized by speaker), which were constructed so as to contain phonemes in approximately the same frequencies that they occur in typical adult-directed speech (Sagisaka et al. 1990). Speakers were alone in a sound-attenuated booth during the recording, with a phonetician monitoring the recording from a separate control room.
2 Exaggerated intonation in IDS (Igarashi et al. 2013)
The first study concerns the language-specific ways in which intonation is exaggerated in IDS. Among various properties of IDS, modification of intonation is arguably the best-known. Specifically, caregivers are reported to use ‘exaggerated’ or ‘sing-songy’ intonation in IDS across many languages, and this intonation is often argued to be one of the universal properties of IDS (Ferguson 1977; Grieser and Kuhl 1988; Fernald et al. 1989; Kitamura et al. 2002).
Interestingly, however, significant cross-linguistic differences are also known to exist in the characteristics of IDS (Bernstein Ratner and Pye 1984; Grieser and Kuhl 1988; Fernald et al. 1989; Papoušek et al. 1991; Kitamura et al. 2002). Fernald et al. (1989) compared intonational modifications in six languages/varieties (French, Italian, German, British English, American English, and Japanese), and found that all of them except Japanese showed pitch-range expansion in IDS. Fernald (1993) also reported that although English-learning infants responded appropriately to positive and negative emotional prosody in IDS in English, German, and Italian, they failed to do so with Japanese IDS.
In the previous studies, the cross-linguistic differences in IDS pitch expansion have been interpreted as reflecting mothers in different linguistic communities having stronger or weaker tendencies to exaggerate intonation. This interpretation is based on the commonly held assumption that paralinguistic pitch-range modifications (which should include intonational exaggeration in IDS) can occur globally irrespective of what tones are present in the utterance (cf. Ladd 1996: Ch. 7), using the F0 range of the overall utterance as the measurement (e.g., Ferguson 1977; Grieser and Kuhl 1988; Fernald et al. 1989; Papoušek et al. 1991; Kitamura et al. 2002). If, however, the mechanism of register-induced modification of intonation is constrained by language-specificity in intonational phonology (Ladd 1996: 334; Gussenhoven 2004: 355; Jun 2005: 462), the conventional measurements may not have been sufficient to capture the nature of intonational exaggeration for each language. If paralinguistic modifications of pitch range can occur only at certain tones, cross-linguistic differences in pitch-range expansion in IDS may be observed at different locations within an utterance, not as the presence or absence (or a larger or smaller degree) of pitch-range expansion globally. Japanese is an ideal language to test this prediction.
To test this prediction, Igarashi et al. (2013) analyzed speech from 21 of 22 mothers in R-JMICC. One mother’s speech was not included due to the creakiness of her voice, which made it difficult to extract F0 values. They began by analyzing the R-JMICC data in the same way as Fernald et al. (1989), measuring the F0 maximum, mean, minimum, and range of each utterance as a whole. Results revealed that although IDS showed a significantly higher mean F0, maximum F0, and minimum F0 for overall utterances, the F0 range did not differ significantly between the two registers. These results replicate the findings in Fernald et al. (1989); that is, mothers used a higher pitched voice when talking to infants than to adults, but did not alter their overall pitch range. Moreover, all averages were comparable to those reported in Fernald et al. (1989), demonstrating that the lack of pitch range expansion in Fernald et al. (1989) was not due to some idiosyncratic characteristics of their Japanese sample. This comparability also shows that the R-JMICC has F0 characteristics similar to those of Fernald’s sample.
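The utterance-level measures described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function names and the two F0 contours are hypothetical, and expressing the range in semitones (in addition to Hz) is an assumption made here so that registers with different overall pitch levels can be compared.

```python
import math


def semitones(f_high, f_low):
    """Interval between two frequencies (Hz), expressed in semitones."""
    return 12.0 * math.log2(f_high / f_low)


def utterance_f0_summary(f0_hz):
    """Global F0 measures for one utterance.

    f0_hz: F0 samples in Hz for voiced frames only.
    """
    f0_max = max(f0_hz)
    f0_min = min(f0_hz)
    return {
        "mean_hz": sum(f0_hz) / len(f0_hz),
        "max_hz": f0_max,
        "min_hz": f0_min,
        "range_hz": f0_max - f0_min,
        "range_st": semitones(f0_max, f0_min),
    }


# Hypothetical contours (not corpus data): the "IDS" contour is the
# "ADS" contour scaled up by a factor of 1.5.
ads_contour = [180.0, 220.0, 200.0, 160.0]
ids_contour = [270.0, 330.0, 300.0, 240.0]

ads_summary = utterance_f0_summary(ads_contour)
ids_summary = utterance_f0_summary(ids_contour)
```

In this toy case the second contour has a higher mean, maximum, and minimum, yet its range in semitones is identical to that of the first, which is the kind of pattern the whole-utterance analysis reports for Japanese.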
2.1.1 Pitch range for Boundary Pitch Movement (BPM)
When the same data were analyzed with reference to the intonational structure of Japanese, a very different picture emerged. Igarashi et al. (2013) focused on two aspects of Japanese intonation that are relevant: lexical pitch accent and Boundary Pitch Movement (BPM). Words in Japanese can be either ‘accented’ or ‘unaccented’. The former exhibit pitch contours with a steep F0 fall, while the latter show contours with no such fall. Prosodic phrasing occurs at the Accentual Phrase (AP) level and the Intonation Phrase (IP) level. An AP is defined as (1) having a delimitative rise to high around the second mora and a subsequent gradual fall to low at the end of the phrase, and (2) having at most one lexical pitch accent. IP is defined as the prosodic domain within which pitch range is specified. Pitch-range specification of IPs is closely connected with a phonological process called ‘downstep’ by which the pitch peak of each accentual phrase (AP) is lowered when that AP follows an accented AP. Finally, BPMs are tones that can occur at the end of an AP and contribute to the pragmatic interpretation of the phrase, indicating features such as questioning, emphasis, and continuation (Venditti et al. 2008).
Figure 1(a) shows Pierrehumbert’s (1980) finite-state grammar for English, which can generate all the intonational contours in that language. In this language, all of the tonal categories (pitch accent, phrase accent, and boundary tone) involve a choice of tones. For each type of structure, speakers choose one of these alternatives in order to convey different types of pragmatic information (Ward and Hirschberg 1985; Pierrehumbert and Hirschberg 1990), and thus these tones are pragmatically chosen ones.
The structure of Japanese intonation (Pierrehumbert and Beckman 1988: 282; Maekawa et al. 2002; Venditti 2005) is summarized in the finite-state grammar in Figure 1(b), which shows that except at the end of the phrase, the F0 contour does not allow much variability. The only tonal options available in this part of the utterance involve the presence or absence of H* + L (a lexical pitch accent). This is, however, an intrinsic part of the lexical representation of a word and does not vary according to the speaker’s pragmatic intent. The contour at the end of the phrase, in contrast, can be much more variable. This is the location allocated to various types of BPM. Unlike lexical pitch accent, the choice of BPMs depends on pragmatic factors. When speakers wish to express some pragmatic information, they have a choice as to whether they assign a BPM to a given phrase and as to what type of BPM they use.
Igarashi et al. hypothesized that intonational exaggeration in IDS should emerge at the pragmatically chosen tones. If true, this exaggeration should appear anywhere in the contour in English; it should be observed not only in stressed syllables where pragmatic pitch accents occur but also at the edges of phrases where phrasal accents and boundary tones appear. In Japanese, by contrast, exaggeration should be confined to a more restricted part of the contour, namely in the BPM at the end of the AP, which is the only location where pragmatically chosen tones are realized in this language.
To test this prediction, they separated mothers’ utterances into the section that bears BPM and the rest of the utterance (called BODY), and analyzed these parts separately. For each BPM type (H%, LH%, & HL%), separate ANOVAs were carried out for F0 mean (Hz), minimum (Hz), maximum (Hz), and range (st). The results revealed that these measurements were significantly higher in IDS than ADS (except for minimum F0 for HL%). Importantly, they found that IDS pitch range was significantly larger in all BPM types than that of ADS. Figure 2 shows an example of pitch range expansion in a BPM in IDS.
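The separation of an utterance into BODY and BPM can be sketched as below. This is an illustrative assumption about the procedure rather than the published implementation: the function names, the sample contour, and the BPM onset time are hypothetical, and in practice the onset would be taken from the X-JToBI labels in the corpus.

```python
import math


def range_st(f0_hz):
    """Pitch range of a stretch of F0 samples (Hz), in semitones."""
    return 12.0 * math.log2(max(f0_hz) / min(f0_hz))


def split_body_bpm(times_s, f0_hz, bpm_onset_s):
    """Split an F0 contour at the onset time of the BPM.

    Samples before the onset belong to BODY; samples at or after it
    belong to the BPM.
    """
    body = [f for t, f in zip(times_s, f0_hz) if t < bpm_onset_s]
    bpm = [f for t, f in zip(times_s, f0_hz) if t >= bpm_onset_s]
    return body, bpm


# Hypothetical utterance ending in a rising BPM (values are invented).
times_s = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
f0_hz = [250.0, 300.0, 280.0, 260.0, 240.0, 300.0, 380.0]

body, bpm = split_body_bpm(times_s, f0_hz, bpm_onset_s=0.4)
body_range = range_st(body)
bpm_range = range_st(bpm)
```

Measuring the two parts separately keeps a large excursion on the final BPM from being averaged together with the lexically determined contour of the BODY.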
2.1.2 Pitch range for the BODY of an utterance
Given that pitch ranges for IDS and ADS are not significantly different from each other when they are analyzed for an utterance as a whole, a large pitch range for BPM in IDS means that the pitch range for BODY in ADS should be larger than that in IDS, which Igarashi et al. (2013) confirmed in their analyses. They demonstrated that this can be accounted for by the longer utterance length in ADS. As described above, the F0 contour of BODY in Japanese is largely determined by the lexical specifications of the words in the phrase. Because downstep occurs every time an accented word is encountered in an utterance, both the F0 peak and F0 valley are lowered. In association with downstep, anticipatory raising occurs in which the F0 peak of the initial AP becomes higher. These two processes predict that pitch ranges for longer utterances should be larger than those for shorter ones regardless of the speech register. In fact, this is what they found. When they analyzed F0 mean, maximum, minimum, and pitch ranges of the BODY of IDS and ADS utterances, matched for utterance length (measured by the number of APs), they found that although IDS was generally higher than ADS in F0 mean, maximum, and minimum for utterances, the pitch ranges for IDS and ADS were not different from each other when they were compared separately at each length.
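The logic of the length-matched comparison can be illustrated with a small sketch. All numbers below are invented for illustration and are not corpus values; the function names are hypothetical. The point is only that a pooled difference between registers can arise entirely from a difference in how utterance lengths are distributed.

```python
from collections import defaultdict


def pooled_mean(utterances, register):
    """Mean BODY pitch range for one register, ignoring utterance length."""
    values = [rng for reg, _, rng in utterances if reg == register]
    return sum(values) / len(values)


def mean_range_by_length(utterances):
    """Mean BODY pitch range per register at each utterance length.

    utterances: iterable of (register, n_aps, range_st) tuples.
    Returns {n_aps: {register: mean range in semitones}}.
    """
    cells = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for register, n_aps, rng in utterances:
        cell = cells[n_aps][register]
        cell[0] += rng  # running sum
        cell[1] += 1    # running count
    return {
        n_aps: {reg: total / count for reg, (total, count) in regs.items()}
        for n_aps, regs in sorted(cells.items())
    }


# Invented data: IDS utterances are mostly short, ADS mostly long,
# but range depends only on length (4 st for 1 AP, 8 st for 3 APs).
toy = [
    ("IDS", 1, 4.0), ("IDS", 1, 4.0), ("IDS", 1, 4.0), ("IDS", 3, 8.0),
    ("ADS", 1, 4.0), ("ADS", 3, 8.0), ("ADS", 3, 8.0), ("ADS", 3, 8.0),
]

pooled_ids = pooled_mean(toy, "IDS")    # 5.0
pooled_ads = pooled_mean(toy, "ADS")    # 7.0
by_length = mean_range_by_length(toy)   # equal across registers at each length
```

Here the pooled ADS range exceeds the pooled IDS range, while the two registers are identical once utterances of the same length are compared.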
Taken together, Igarashi et al.’s (2013) results demonstrate that Japanese IDS does show register-induced pitch-range expansion, counter to previous studies that reported otherwise. First, they found robust pitch-range expansion at the locations of BPMs, which occur more frequently in IDS than ADS. BPMs are tones that are associated with pragmatic interpretations (Venditti et al. 2008), and IDS utterances containing BPMs often involve mothers’ attempt to engage infants by addressing questions to them or seeking agreement by using the sentence-final particle ne (cf. Fernald and Morikawa 1993). Crucially, when BPMs occur in IDS, each of the tones is produced with an expanded pitch range. This type of pitch-range expansion is likely to be heard as ‘exaggerated’ by listeners and may account for the results of previous studies showing that the intonation of Japanese IDS is perceived to be exaggerated by Japanese adults and infants.
Second, the length-induced pitch-range expansion in the ADS BODY is not an ‘exaggeration’ of the intonation. The pitch range of an Intonation Phrase (IP) is determined primarily by the number of accents the IP contains – the longer the IP (and thus the more accented words it contains), the larger the pitch range. This is independent of register-induced pitch-range modification and occurs regardless of the difference between ADS and IDS. When a speaker produces a long utterance with a pitch range that is normal for its length, she/he is under no pressure to ‘exaggerate’ the intonation, nor is it likely to be heard as ‘exaggerated’ by a listener. In fact, when ADS and IDS utterances of equal length were compared, there was no difference in pitch ranges between the two registers.
Igarashi et al.’s findings are novel in that they successfully demonstrate that not all phonological tones are subject to the paralinguistic modification characteristic of a specialized speech register. Specifically, their analysis suggests that pitch-range expansions in IDS are not realized in the same way in every language, but are instead implemented within a language-specific system of intonation. When there is a desire or pressure to exaggerate the intonation, speakers seem to do so by expanding the pitch range at the location where flexibility in varying contours is most tolerated. In phonological terms, this is the location where pragmatically chosen tones are realized. In the case of Japanese, these are BPMs at the boundaries of prosodic phrases, while in the case of English, they are not only phrase accents and boundary tones at the phrasal boundaries, but also pitch accents at the locations of stressed syllables. It has been commonly assumed that paralinguistic pitch-range modifications (which should include intonational exaggeration in IDS) can occur globally irrespective of what tones are present in the utterance (cf. Ladd 1996: Ch. 7). The results of Igarashi et al. (2013), however, show that only certain tones can undergo paralinguistic modifications. Their study, therefore, promises to shed light on the phonetics of pitch range variation.
3 Vowel devoicing (Martin et al. 2014)
We now turn to the second study, which examined the phonological process of vowel devoicing in Japanese. When high vowels /i/ and /u/ (phonetically [ɯ]) occur between two voiceless consonants in Japanese, they are frequently pronounced without the vocal fold vibrations typically employed in vowel production (Han 1962; Vance 1987). The spectrograms in Figure 3 illustrate the phonetic differences between voiced and devoiced versions of the same vowel phoneme. Both spectrograms represent the pronunciation of the word kita ‘came’ by different mothers in the R-JMICC. In (a), the vowel /i/ in the first syllable is voiced, as evidenced by the regular voicing pulses and formant bands seen in the spectrogram. In contrast, the /i/ in (b) is devoiced, characterized by a lack of voicing pulses, and irregular high-frequency white noise.
Although the high vowels /i/ and /u/ devoice with the greatest frequency, the other vowels of Japanese are also occasionally devoiced in this environment (Vance 1987). Crucially, devoicing of high or non-high vowels does not occur 100% of the time, and the probability of a given vowel being devoiced or not changes dynamically according to linguistic and para-linguistic factors. Comparing how various factors impact the likelihood of devoicing in different speech registers could shed light on the phonological processes involved in vowel devoicing.
3.1 Devoicing in IDS
Research on IDS has shown that phonetic categories in IDS tend to be more distinct (Malsheen 1980; Masataka 1992; Andruski and Kuhl 1996; Kuhl et al. 1997; Burnham et al. 2002; Liu et al. 2007; Cristià 2010) when compared to ADS. It has been argued that the distinct phonetic categories in IDS serve to provide the infant with input that is optimized for learning certain aspects of the linguistic system (Kuhl et al. 1997; de Boer and Kuhl 2003; Kirchhoff and Schimmel 2005). These hypotheses have in common a vision of IDS as an essentially listener-oriented speech style, in which the speaker expends additional effort in order to maximize the information content in the signal, making the infant listener’s task easier. Fernald (2000), exemplifying this view, argues that IDS represents a form of hyperspeech, a reference to Lindblom’s (1990) “H and H theory,” which posits a continuum from effortful but easy to perceive “hyperspeech” on one end to less effortful but harder to perceive “hypospeech” on the other.
Infant-directed speech is only one variety of hyperspeech. Foreigner-directed speech (Scarborough et al. 2007; Uther et al. 2007), speech produced in noisy conditions (Payton et al. 1994), and read speech (Nakamura et al. 2008) have all been described as hyperarticulated speech styles, in which the information content in the signal is increased for the benefit of the listener. If the H and H theory is correct, then these various speech styles should pattern similarly – those acoustic parameters associated with clarity and intelligibility should be altered in the same direction when these varieties of hyperspeech are compared to a non-hyperspeech baseline. Martin et al. (2014) tested this hypothesis by comparing three different speech styles, all produced by the same set of speakers: spontaneous infant-directed speech, spontaneous adult-directed speech, and careful, read speech (RS).
Devoiced vowels are more difficult to perceive than their voiced counterparts (Gordon 1998), meaning that devoicing results in a loss of information regarding the vowel’s identity. Beckman and Shoji (1984) show that native Japanese listeners’ ability to identify whether a vowel is /i/ or /u/ drops substantially when the vowel is devoiced. Vowel devoicing also impedes the recognition of nonwords in running speech (Cutler et al. 2009). Devoiced vowels thus make the infant’s task of learning how to identify vowels and words that much more difficult. If IDS is a form of hyperspeech, we expect Japanese speakers to devoice less often when speaking to infants, in order to maximize the infants’ chances of correctly perceiving vowels and the words that contain them.
The first part of Martin et al.’s (2014) analyses was based on IDS and ADS samples from the R-JMICC. A total of 7,838 vowels which occurred between two voiceless consonants were identified and coded for voicing by a trained phonetician who is a native speaker of Japanese (second author). In order to check the accuracy of these codings, 10% of the vowel tokens were selected at random and coded separately by the third and the fourth authors of the present paper (both trained phoneticians, one of whom is a native speaker of Japanese, and neither of whom was involved in the original coding of voicing). Inter-coder reliability was 93.5%. In the second part of the analyses, careful, read speech (RS) from 20 of the original 22 mothers was compared to ADS. As with the IDS and ADS samples, all vowels which occur between voiceless consonants were coded for voicing.
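A reliability figure of this kind can be computed as simple percent agreement between two sets of labels, sketched below. The text does not state which agreement statistic was used, so percent agreement is an assumption here; the function name and the labels are hypothetical.

```python
def percent_agreement(coder_a, coder_b):
    """Percentage of tokens on which two coders assigned the same label."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must label the same set of tokens.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)


# Invented labels for ten vowel tokens; the coders disagree on one token.
original = ["devoiced", "voiced", "devoiced", "devoiced", "voiced",
            "devoiced", "devoiced", "voiced", "devoiced", "devoiced"]
recheck = ["devoiced", "voiced", "devoiced", "voiced", "voiced",
           "devoiced", "devoiced", "voiced", "devoiced", "devoiced"]

agreement = percent_agreement(original, recheck)  # 90.0
```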
3.2.1 High vowels: IDS vs. ADS
Overall, out of all high vowels in potential devoicing environments, the mothers in the corpus devoiced an average of 76.8% of the vowels in infant-directed speech, compared to 90.2% in adult-directed speech, a significant difference according to a paired-samples t-test (t(21) = 5.57, p < 0.0001). However, from these numbers alone we cannot conclude that reduced devoicing is a part of a strategy, whether implicit or explicit, used by mothers when speaking to infants. It could be that some other property differs between the two speech styles, such as the use of different vocabulary in IDS and ADS, speech rate, and breathiness of the voice, each of which in turn could impact devoicing rate.
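For readers less familiar with the test, a paired-samples t statistic is computed from each speaker's difference between the two conditions, with degrees of freedom equal to the number of speakers minus one (hence 21 for 22 mothers). The sketch below uses invented per-speaker devoicing rates, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Paired-samples t statistic and degrees of freedom.

    x, y: per-speaker values in two conditions, in the same speaker order.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # stdev uses n - 1
    return mean(diffs) / standard_error, n - 1


# Invented devoicing rates for five speakers (not corpus values).
ads_rates = [0.91, 0.88, 0.93, 0.89, 0.90]
ids_rates = [0.78, 0.75, 0.80, 0.74, 0.79]

t_stat, df = paired_t(ads_rates, ids_rates)
```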
In order to test these possibilities, Martin et al. (2014) constructed two nested generalized linear mixed effects models (McCullagh and Nelder 1989). Both models contain the above-mentioned factors as predictors, and differ only in whether speech style is included among the predictors. If the difference in devoicing rates between IDS and ADS is entirely the result of differences in factors such as context and speech rate, adding speech style as a predictor should not result in a significant increase in predictive power.
Both models use Voicing as the dependent variable (i.e., voiced or voiceless). The first model contains the fixed factors Vowel (/i/ or /u/), Preceding Context (the phoneme occurring before the vowel), Following Context (the phoneme occurring after the vowel), an interaction term between preceding and following context, Accent (accented or unaccented), Speech Rate (the rate in moras per second of the utterance in which the vowel occurs, where an utterance is defined as a section of continuous speech surrounded by pauses of 2.0 seconds or longer), and Breathiness (H1–H2 averaged over all voiced vowels in an utterance). Speaker and Word were also included as random factors to ensure that the model generalizes to other speakers and vocabulary items.
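The Speech Rate predictor depends on how utterances are delimited. One way to derive it is sketched below, under stated assumptions: stretches of speech separated by a pause shorter than 2.0 seconds are merged into one utterance, and the rate is the mora count divided by the utterance's total duration, including any short internal pauses. Whether the original analysis included such pauses in the duration is not stated, and the function name and input data are hypothetical.

```python
PAUSE_THRESHOLD_S = 2.0  # pauses this long or longer separate utterances


def utterance_rates(intervals, pause_threshold_s=PAUSE_THRESHOLD_S):
    """Speech rate (moras per second) for each utterance.

    intervals: time-ordered (start_s, end_s, n_moras) stretches of speech.
    """
    utterances = []
    for start, end, n_moras in intervals:
        if utterances and start - utterances[-1][1] < pause_threshold_s:
            # Short pause: extend the current utterance.
            utterances[-1][1] = end
            utterances[-1][2] += n_moras
        else:
            # Long pause (or first interval): start a new utterance.
            utterances.append([start, end, n_moras])
    return [moras / (end - start) for start, end, moras in utterances]


# Invented timings: a 0.5 s pause (merged) followed by a 2.5 s pause (split).
speech = [(0.0, 1.0, 6), (1.5, 2.5, 7), (5.0, 6.0, 8)]
rates = utterance_rates(speech)
```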
The first model, which did not include Speech Style among the predictors, confirmed that, indeed, the identity of the vowel, the voice quality, the speech rate, the consonantal context, and the syllable’s accent status all influence whether a vowel is devoiced. Replicating previous research, a faster speech rate results in greater devoicing. The vowel /u/ is devoiced less often than /i/, and vowels in accented syllables are devoiced less often than those in unaccented syllables. Also consistent with earlier research are the effects of context – being preceded by a fricative increases a vowel’s chance of being devoiced, while being followed by a fricative decreases the probability of devoicing (Fujimoto 2004; Maekawa and Kikuchi 2005). Although the model seems to indicate that greater breathiness corresponds to less devoicing, the Breathiness factor is likely acting as a proxy for the Speech Style factor, which is missing from this model.
The second linear model was identical to the first, with the addition of Speech Style (IDS or ADS) as a fixed factor, and was designed to determine whether the speech style affects vowel devoicing above and beyond the contributions made by the factors included in the first model. A likelihood ratio test confirms that this second model, including the Speech Style factor, provides a significantly improved fit to the data over the first model (χ²(1) = 7.50, p < 0.01). This indicates that, while devoicing is affected by vowel identity, speech rate, pitch accent, and context, there is also a reliable additional effect of speech style – ceteris paribus, high vowels will be devoiced less often in infant-directed speech. Additionally, in the model including the Speech Style factor, the effects of voice quality are no longer significant. This suggests that in the smaller model, the only reason the Breathiness factor was a significant predictor was its correlation with speech style. When speech style is added as a predictor, there is no longer any evidence that breathier speech corresponds to less devoicing.
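The likelihood ratio test for two nested models that differ by a single parameter can be sketched as follows. The test statistic is twice the difference in log-likelihood, and for one degree of freedom the chi-square tail probability has a closed form in terms of the complementary error function. The log-likelihood values below are hypothetical, chosen only so that the statistic equals the reported 7.50; the function names are likewise illustrative.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    return statistic, chi2_sf_1df(statistic)


# Hypothetical log-likelihoods (not taken from the study).
statistic, p_value = likelihood_ratio_test_1df(-1000.0, -996.25)
```

Applied to a statistic of 7.50, this gives a p-value between 0.001 and 0.01, in line with the reported p < 0.01.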
3.2.2 Non-high vowels: IDS vs. ADS
Although vowel devoicing in Japanese is typically described as being limited to the high vowels /i/ and /u/, the other vowels are also sometimes devoiced, albeit at much lower frequencies (Maekawa 1988; Maekawa and Kikuchi 2005). The mothers in the R-JMICC devoiced /e/, /o/, and /a/ an average of 2.3% of the time in adult-directed speech. In infant-directed speech, however, the devoicing rate rises to 11.4%. In other words, the pattern in non-high vowels is the opposite of that found in high vowels – mothers tend to devoice non-high vowels more when speaking to infants.
As with the high vowels, Martin et al. constructed a pair of nested linear models in the same way, using the same factors, as described above. For the non-high vowels, the consonantal context factors did not contribute significantly to the model; in addition, standard errors for these factors were extremely high, indicating that the algorithm was having difficulty estimating parameters for consonantal context. For these reasons, the preceding and following context predictors were removed from the final model.
A likelihood ratio test shows that for non-high vowels, just as for high vowels, including Speech Style as a predictor results in a significant improvement in model fit (χ²(1) = 20.76, p < 0.001). However, the direction of the change in the probability of devoicing is opposite from that seen in high vowels: devoicing is more likely in IDS than ADS.
It is thus clear that speakers alter their speech when speaking to infants, but in opposite directions, decreasing devoicing in high vowels and increasing it in non-high vowels.
3.2.3 Read speech
Careful, read speech (RS) has been described as a style in which, like infant-directed speech, speakers take pains to emphasize phonetic distinctions, presumably in order to make the listener’s task easier. This view would predict that read speech should differ from spontaneous ADS in similar ways as IDS. To test this prediction, speech from 20 mothers who participated in RS recordings was compared to their ADS. Using the same procedure as the above comparisons of ADS and IDS, for each of the vowel types, high and non-high, Martin et al. constructed two nested linear models. Both models contained the same factors as those in the ADS-IDS comparisons discussed above – they differed only in that one included Speech Style (ADS or RS) as an additional fixed factor.
For high vowels, the coefficient for Speech Style (ADS) was significant, but it was negative, indicating that compared to RS, high vowels in ADS are less likely to be devoiced. Furthermore, a likelihood ratio test shows that the addition of Speech Style as a predictor results in a significantly improved fit to the data (χ²(1) = 22.16, p < 0.001). For non-high vowels, the Speech Style (ADS) factor has a positive coefficient, meaning that non-high vowels in spontaneous adult-directed speech are more likely to be devoiced than those in careful speech. In this case as well, the larger model with Speech Style had significantly more predictive power (χ²(1) = 11.09, p < 0.001) than the smaller model.
There are thus two opposite trends apparent in the data: as one moves from infant-directed to adult-directed to read speech, devoicing in the high vowels /i/ and /u/ becomes more probable, while devoicing in the non-high vowels /a/, /e/, and /o/ becomes less probable. These results are summarized by Figure 4, which plots the probabilities of devoicing predicted for each speech style by a linear model containing data from all three styles, for high vowels (gray bars) and non-high vowels (white bars). The plotted values were derived by setting all of the model parameters except for the intercept and the Speech Style coefficient to zero, thus producing an overall predicted value in logit units for each speech style and each vowel type. These logit values were then back-transformed into probabilities. Error bars represent the standard errors of the estimates of the Speech Style coefficient. Note that the y-axis is scaled differently for each vowel type. These findings demonstrate that what is acoustically a single process, vowel devoicing, behaves differently depending on the type of vowel involved.
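The back-transformation from logit units to probabilities described above is the inverse logit function. The sketch below shows the computation with all other predictors set to zero, so that only the intercept and the Speech Style coefficient remain; the coefficient values are hypothetical and not taken from the fitted models, and the function names are illustrative.

```python
import math


def inv_logit(x):
    """Convert a value in logit units to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_probability(intercept, style_coefficient):
    """Predicted probability for one speech style, other predictors at zero."""
    return inv_logit(intercept + style_coefficient)


# Hypothetical coefficients: the reference style has a coefficient of 0.
p_reference = predicted_probability(intercept=2.0, style_coefficient=0.0)
p_other = predicted_probability(intercept=2.0, style_coefficient=-1.0)
```

Because the transformation is nonlinear, equal steps in logit units correspond to unequal steps in probability, which is one reason the two vowel types are plotted on differently scaled axes.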
Martin et al. (2014) were able to capture the dynamic nature of an articulatory process, vowel devoicing, by comparing factors that influence devoicing in infant-directed, adult-directed, and careful, read speech. They found that in infant-directed speech, compared to adult-directed speech, the process was implemented less often in one class of sounds (high vowels) and more often in another (non-high vowels). They also found that in careful, read speech, the high vowel devoicing rule is implemented more consistently than in spontaneous adult-directed speech, while non-high vowels are more frequently produced with voicing.
The way other factors interacted with speech register also offers useful insights. Among many of the factors that have been previously reported to impact vowel devoicing, linguistic/phonological factors such as identity of the vowel, the consonantal context, and the syllable’s accent status, were found to impact only the devoicing of high-vowels but not non-high vowels. Factors that can be considered para-linguistic, such as speech rate and breathiness of the voice, on the other hand, were found to impact non-high vowel devoicing as well. These differences are also consistent with the two mechanisms view. The fact that the two types of devoicing (high vowels vs. non-high vowels) showed an opposite pattern with regard to the three registers, when considered alongside the way other factors interacted with speech register, seems to suggest a picture in which two distinct mechanisms underlie these processes: phonological devoicing, in which speakers intentionally produce the target vowel without voicing, and phonetic devoicing, which occurs as an unintentional consequence of gestural overlap.
Martin et al. (2014) attempted to offer an interpretation for why devoicing rates change across speech styles in the directions they do. For high vowels, the major puzzle is the fact that adult-directed speech has devoicing rates intermediate between those of infant-directed and read speech. If one takes a view that IDS and RS are both varieties of hyperspeech – speakers are trying to speak clearly to increase the intelligibility of their speech – one would expect them to diverge from ADS in the same direction. Martin et al. suggest that the answer to this question lies in the differing audiences for each speech style, and in the effects those audiences’ lexical knowledge, or lack thereof, has on the perceptibility of devoiced vowels. More specifically, the fact that infants do not yet know many words forces them to process speech bottom-up, relying primarily on the properties of individual segments – this means that they would benefit most from vowels which are pronounced as clearly as possible. Adults, on the other hand, are equipped with a vast store of top-down knowledge, meaning that their recognition is optimized when words are pronounced in their expected, canonical forms.
There is experimental evidence supporting both of these claims. As mentioned earlier, Beckman and Shoji (1984) found that when adult listeners were presented with words in isolation containing devoiced vowels, the rates at which they correctly identified the vowels (i.e., whether they were /i/ or /u/) were lower than when the vowels were voiced. Conversely, Ogasawara and Warner (2009) found that in a lexical decision task, adult listeners responded faster to the presence of a target high vowel in a devoicing environment when it was devoiced rather than voiced, presumably because their knowledge of the word led them to expect a devoiced vowel. These divergent findings are likely the result of the different tasks involved. In the Beckman and Shoji study, subjects did not need to access the lexicon, since the options they were asked to choose between in each trial were both actual words (e.g., /sikaN/ ‘military officer’ versus /syukaN/ ‘subjectivity’). To complete the task, subjects had to focus on the crucial vowel. The Ogasawara and Warner lexical decision task, however, involved searching one’s lexicon for a match to the entire stimulus (e.g., /yakisoba/). In other words, devoicing appears to impair perception of individual vowels but aid lexical access, presumably because lexical entries for words with frequently devoiced vowels are encoded with devoiced versions of those vowels.
Taken together, these experimental results suggest that for a listener with little or no lexical knowledge, such as an infant, whose primary task involves identifying individual segments, devoicing would pose an obstacle to be overcome. For an adult listener who is matching speech strings to an already developed lexicon, on the other hand, hearing words in their most typical form (i.e., with high vowels devoiced in devoicing environments) would actually help understanding.
The opposing trends in devoicing rates that we see in high vowels can thus be interpreted as speakers fine-tuning their speech to the processing needs of the intended listener. Infant-directed speech is characterized by the least amount of devoicing, accommodating infants’ need for maximally informative vowels. Adult-directed speech contains more devoicing, conforming to the adult listener’s expectations. Read speech, in which speakers are able to expend more time and energy on pronunciation, represents an even more extreme version of this adult-directed style. Speech rate, in contrast, fails to show these listener-specific effects, because slower speech is advantageous both for identifying individual vowels and for lexical lookup.
The above argument should not apply to non-high vowels, if we are correct in our speculation that devoicing in these vowels is due to articulatory factors which are beyond the control of the speaker. What, then, is the cause of the different devoicing rates across speech styles in non-high vowels? The most plausible candidate is voice quality – speech becomes breathier as we move from RS to ADS to IDS. As mentioned above, Martin et al. (2014) found a significant effect of breathiness in the non-high vowels, with greater breathiness being associated with more devoicing. But precisely because this devoicing is inadvertent, speakers are unable to reduce it even though doing so would benefit the infant listener. The increased breathiness in IDS itself serves a function – breathier speech is perceived as softer and more intimate (Laver 1980; Ishi et al. 2008), and is potentially used to communicate positive affect and maintain an emotional bond with the infant. But these advantages also exact a price, in the form of increased devoicing in the non-high vowels.
In sum, Martin et al. found that mothers alter their vowel devoicing behavior according to speech style, and furthermore do so in different ways depending on the height of the vowel. They have argued that, although infant-directed speech and careful, read speech modify devoicing in opposite directions, both can be considered forms of hyperspeech, in that they maximize intelligibility for their intended audiences. In addition, they have shown that infant-directed speech can contain properties – namely, increased devoicing in non-high vowels – which are in fact disadvantageous to infant listeners, but can be seen to be the result of other factors – in this case, increased breathiness – which themselves are beneficial to infant communication.
The picture of speech styles that emerges from this study is thus more complex than a unidimensional continuum from hypospeech at one end to hyperspeech at the other. A single acoustic parameter – the presence or absence of voicing pulses during a vowel – represents a battleground on which an array of forces compete to be expressed. Exploring the complex interplay of factors impinging on speech behavior promises to give us a richer, more nuanced picture of the role of infant-directed speech.
In the present paper, we showed how phonologically informed analyses of specialized speech registers, such as infant-directed speech and read speech, can provide new insights into the dynamic aspects of phonology, using as examples two of our own studies, one on exaggerated intonation and one on vowel devoicing in Japanese IDS. The first study demonstrated that the same para-linguistic pressure, mothers’ desire to exaggerate intonation, manifests itself differently in Japanese and English. It showed that in each language, where the exaggeration is allowed to emerge is constrained by the intonational structure of the language. These findings challenge the commonly held assumption that paralinguistic pitch-range modifications (which should include intonational exaggeration in IDS) can occur globally, irrespective of what tones are present in the utterance (cf. Ladd 1996: Ch. 7); instead, only certain tones can undergo paralinguistic modifications.
The second study showed that a single acoustic parameter of vowel voicing may be produced by two distinct mechanisms depending on the vowel height and that the likelihood of occurrence changes dynamically according to the pressures imposed by specific speech registers. Furthermore, how a single speech register impacts devoicing differs distinctly depending on whether high or non-high vowels are the target of devoicing. This discovery was made possible only because the authors compared spontaneous adult speech to two registers (read speech and IDS) that on the surface would appear to be subject to the same para-linguistic pressures, but which in fact impact the process of devoicing in opposite directions.
This study also demonstrates that phonologically informed analyses of IDS are beneficial for IDS research. They can provide a deeper understanding of how and why mothers modify their speech when talking to young infants, which is not possible on the basis of the small number of acoustic parameters typically used in cross-linguistic studies of speech register.
One caveat that should be kept in mind when considering this type of research is that a superficial application of tools which have proven successful in previous research can sometimes be misleading. In experimental phonological research, it is not uncommon to compare spontaneous adult-adult speech to careful, read speech. Many insights have been derived from such comparisons, the hyperarticulation hypothesis being one of them. The problem occurs when the observations derived from such comparisons are unquestioningly applied to other speech registers, such as infant-directed speech, which may violate assumptions of adult phonology that are not always made explicit.
In careful speech, adults are usually under pressure to produce clear speech, and under such circumstances they tend to hyperarticulate. When mothers speak to infants, they are also under pressure to speak clearly. The superficial reasoning would then run as follows: since careful speech and IDS are subject to the same pressure – speak clearly – mothers should also produce hyperarticulated speech. In Martin et al.’s example, however, the unspoken assumption about adult listeners, i.e., that they have extensive lexical knowledge of the language, was violated in the case of an infant audience, and the failure to take this into account led to inaccurate predictions.
Nonetheless, the results of the two studies amply demonstrate the usefulness of expanding research beyond conventional boundaries, both in phonology and infant language development. We believe that further research along these lines could bring novel discoveries in both fields.
Andruski, Jean E. & Patricia K. Kuhl. 1996. The acoustic structure of vowels in mothers’ speech to infants and adults. Proceedings of the Fourth International Conference on Spoken Language 3. 1545–1548.
Bates, Douglas M. 2005. Fitting linear mixed models in R. R News 5. 27–30.
Beckman, Mary E. & Atsuko Shoji. 1984. Spectral and perceptual evidence for CV coarticulation in devoiced /si/ and /syu/ in Japanese. Phonetica 41(2). 61–71.
Bernstein Ratner, Nan & Clifton Pye. 1984. Higher pitch in BT is not universal: Acoustic evidence from Quiche Mayan. Journal of Child Language 11(3). 515–522.
Bryant, Gregory A. & H. Clark Barrett. 2007. Recognizing intentions in infant-directed speech. Psychological Science 18(8). 746–751.
Burnham, Denis, Christine Kitamura & Ute Vollmer-Conna. 2002. What’s new, pussycat? On talking to babies and animals. Science 296(5572). 1435.
Cristià, Alejandrina. 2010. Phonetic enhancement of sibilants in infant-directed speech. The Journal of the Acoustical Society of America 128. 424–434.
Cutler, Anne, Takeshi Otake & James M. McQueen. 2009. Vowel devoicing and the perception of spoken Japanese words. The Journal of the Acoustical Society of America 125. 1693.
de Boer, Bart & Patricia K. Kuhl. 2003. Investigating the role of infant-directed speech with a computer model. Acoustics Research Letters Online 4(4). 129–134.
Ferguson, Charles A. 1964. Baby talk in six languages. American Anthropologist 66(6). 103–114.
Ferguson, Charles A. 1977. Baby talk as a simplified register. In Catherine E. Snow & Charles A. Ferguson (eds.), Talking to children: Language input and acquisition, 209–235. London: Cambridge University Press.
Fernald, Anne. 1989. Intonation and communicative intent in mothers’ speech to infants: Is the melody the message? Child Development 60. 1497–1510.
Fernald, Anne. 1993. Approval and disapproval: Infant responsiveness to vocal affect in familiar and unfamiliar languages. Child Development 64(3). 657–674.
Fernald, Anne. 2000. Speech to infants as hyperspeech: Knowledge-driven processes in early word recognition. Phonetica 57. 242–254.
Fernald, Anne & Claudia Mazzie. 1991. Prosody and focus in speech to infants and adults. Developmental Psychology 27(2). 209–221.
Fernald, Anne & Hiromi Morikawa. 1993. Common themes and cultural variations in Japanese and American mothers’ speech to infants. Child Development 64. 637–656.
Fernald, Anne & Thomas Simon. 1984. Expanded intonation contours in mothers’ speech to newborns. Developmental Psychology 20(1). 104–113.
Fernald, Anne, Traute Taeschner, Judy Dunn, Mechtild Papoušek, Bénédicte de Boysson-Bardies & Ikuko Fukui. 1989. A cross-language study of prosodic modifications in mothers’ and fathers’ speech to preverbal infants. Journal of Child Language 16(03). 477–501.
Fujimoto, Masako. 2004. Effects of consonant type and syllable position within a word on vowel devoicing in Japanese. Speech Prosody 2004. 625–628.
Gordon, Matthew. 1998. The phonetics and phonology of non-modal vowels: a cross-linguistic perspective. Berkeley Linguistics Society 24. 93–105.
Grieser, DiAnne L. & Patricia K. Kuhl. 1988. Maternal speech to infants in a tonal language: Support for universal prosodic features in motherese. Developmental Psychology 24(1). 14–20.
Gussenhoven, Carlos. 2004. The phonology of tone and intonation. Cambridge: Cambridge University Press.
Han, Mieko S. 1962. Unvoicing of vowels in Japanese. Onsei no Kenkyū 10. 81–100.
Igarashi, Yosuke, Ken’ya Nishikawa, Kuniyoshi Tanaka & Reiko Mazuka. 2013. Phonological theory informs the analysis of intonational exaggeration in Japanese infant-directed speech. The Journal of the Acoustical Society of America 134(2). 1283–1294.
Ishi, Carlos Toshinori, Hiroshi Ishiguro & Norihiro Hagita. 2008. The acoustics and the functions of breathy/whispery voice qualities in speech communication. IEICE Technical Report SP2008-42. 127–132.
Jun, Sun-Ah. 2005. Prosodic typology: The phonology of intonation and phrasing. New York: Oxford University Press.
Kirchhoff, Katrin & Steven Schimmel. 2005. Statistical properties of infant-directed versus adult-directed speech: Insights from speech recognition. The Journal of the Acoustical Society of America 117. 2238.
Kitamura, Christine, Thanavishuth Chayada, Denis Burnham & Sudaporn Luksaneeyanawin. 2002. Universality and specificity in infant-directed speech: Pitch modifications as a function of infant age and sex in a tonal and non-tonal language. Infant Behavior and Development 24. 372–392.
Kuhl, Patricia K., Jean Andruski, Inna Chistovich & Ludmilla Chistovich. 1997. Cross-language analysis of phonetic units in language addressed to infants. Science 277. 684–686.
Ladd, Robert D. 1996. Intonational phonology. Cambridge: Cambridge University Press.
Laver, John. 1980. The phonetic description of voice quality. Cambridge: Cambridge University Press.
Lindblom, Björn. 1990. Explaining phonetic variation: a sketch of the H and H theory. In William J. Hardcastle & Alain Marchal (eds.), Speech production and speech modelling, 403–439. Dordrecht: Kluwer.
Liu, Huei-Mei, Feng-Ming Tsao & Patricia K. Kuhl. 2007. Acoustic analysis of lexical tone in Mandarin infant-directed speech. Developmental Psychology 43(4). 912–917.
Maekawa, Kikuo. 1988. Boin no museika [Vowel devoicing]. Kōza Nihongo to Nihongo Kyōiku 2. 135–153.
Maekawa, Kikuo & Hideaki Kikuchi. 2005. Corpus-based analysis of vowel devoicing in spontaneous Japanese: an interim report. In Jeroen van de Weijer, Kensuke Nanjo & Tetsuo Nishihara (eds.), Voicing in Japanese, 205–228. Berlin: Mouton de Gruyter.
Maekawa, Kikuo, Hideaki Kikuchi, Yosuke Igarashi & Jennifer Venditti. 2002. X-JToBI: An extended J_ToBI for spontaneous speech. In Proceedings of the 7th International Conference on Spoken Language Processing, Denver, CO, 1545–1548.
Malsheen, B. J. 1980. Two hypotheses for phonetic clarification in the speech of mothers to children. In Grace H. Yeni-Komshian, James F. Kavanaugh & Charles A. Ferguson (eds.), Child phonology: Volume 2. Perception, 173–184. New York: Academic Press.
Martin, Andrew, Akira Utsugi & Reiko Mazuka. 2014. The multidimensional nature of hyperspeech: Evidence from Japanese vowel devoicing. Cognition 132(2). 216–228.
Masataka, Nobuo. 1992. Motherese in a signed language. Infant Behavior and Development 15(4). 453–460.
Mazuka, Reiko, Yosuke Igarashi & Ken’ya Nishikawa. 2006. Input for learning Japanese: RIKEN mother-infant conversation corpus. IEICE Technical Report 106(165). 11–15.
McCann, Joanne & Sue Peppé. 2003. Prosody in autism spectrum disorders: a critical review. International Journal of Language and Communication Disorders 38. 325–350.
McCullagh, Paul & John A. Nelder. 1989. Generalized linear models. London: Chapman & Hall.
Nakamura, Masanobu, Koji Iwano & Sadaoki Furui. 2008. Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance. Computer Speech & Language 22(2). 171–184.
Ogasawara, Naomi & Natasha Warner. 2009. Processing missing vowels: Allophonic processing in Japanese. Language and Cognitive Processes 24(3). 376–411.
Papoušek, Mechtild, Hanuš Papoušek & David Symmes. 1991. The meanings of melodies in motherese in tone and stress languages. Infant Behavior and Development 14(4). 415–440.
Payton, Karen L., Rosalie M. Uchanski & Louis D. Braida. 1994. Intelligibility of conversational and clear speech in noise and reverberation for listeners with normal and impaired hearing. The Journal of the Acoustical Society of America 95. 1581.
Pierrehumbert, Janet B. 1980. The phonology and phonetics of English intonation. Cambridge, MA: Massachusetts Institute of Technology Ph.D. dissertation.
Pierrehumbert, Janet B. & Mary E. Beckman. 1988. Japanese tone structure. Cambridge, MA: MIT Press.
Pierrehumbert, Janet B. & Julia Hirschberg. 1990. The meaning of intonational contours in the interpretation of discourse. In Philip R. Cohen, Jerry Morgan & Martha E. Pollack (eds.), Intentions in communication, 271–311. Cambridge, MA: MIT Press.
R Development Core Team. 2010. R: A language and environment for statistical computing. Computer programme. http://www.R-project.org/.
Sagisaka, Yoshinori, Kazuya Takeda, Masanobu Abe, Shigeru Katagiri, Tetsuo Umeda & Hisao Kuwabara. 1990. A large-scale Japanese speech database. Proceedings from First International Conference on Spoken Language Processing, 1089–1092.
Scarborough, Rebecca, Jason Brenier, Yuan Zhao, Lauren Hall-Lew & Olga Dmitrieva. 2007. An acoustic study of real and imagined foreigner-directed speech. ICPHS Proceedings 16. 2165–2168.
Soderstrom, Melanie. 2007. Beyond babytalk: Re-evaluating the nature and content of speech input to preverbal infants. Developmental Review 27. 501–532.
Uther, Maria, Monja A. Knoll & Denis Burnham. 2007. Do you speak E-NG-L-I-SH? A comparison of foreigner- and infant-directed speech. Speech Communication 49(1). 2–7.
Vance, Timothy J. 1987. An introduction to Japanese phonology. Albany, NY: State University of New York Press.
Venditti, Jennifer. 2005. The J_ToBI model of Japanese intonation. In Sun-Ah Jun (ed.), Prosodic typology: The phonology of intonation and phrasing, 172–200. New York: Oxford University Press.
Venditti, Jennifer, Kikuo Maekawa & Mary E. Beckman. 2008. Prominence marking in the Japanese intonation system. In Shigeru Miyagawa & Mamoru Saito (eds.), Handbook of Japanese Linguistics, 456–512. New York: Oxford University Press.
Ward, Gregory & Julia Hirschberg. 1985. Implicating uncertainty: the pragmatics of fall-rise. Language 61. 747–776.
Williams, Carl E. & Kenneth N. Stevens. 1972. Emotions and speech: some acoustical correlates. The Journal of the Acoustical Society of America 52. 1238–1250.
See Igarashi et al. (2013) for the details of the statistical analyses.
In order to reduce collinearity among the predictors, the scalar factors Speech Rate and Breathiness were centered by subtracting the mean from each value, and the categorical factors were each sum coded (for example, vowels in an accented syllable were coded as 1, and those in an unaccented syllable as –1). Parameters for the model were estimated using Laplace approximation, implemented in the lmer function included in the lme4 package (Bates 2005; Bates and Sarkar 2005) for the R statistical programming environment (version 2.12.0; R Development Core Team 2010). See Martin et al. (2014) for the detailed results of the statistical analyses.
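The predictor coding described in this note can be sketched in a few lines. The sketch below is illustrative only: the original analysis was carried out in R with lme4, the helper names `center` and `sum_code` are hypothetical, and the sample values are invented. It shows mean-centering of a scalar predictor and sum coding of a two-level categorical predictor as 1 versus –1.

```python
from statistics import mean


def center(values):
    """Center a scalar predictor by subtracting its mean from each value."""
    m = mean(values)
    return [v - m for v in values]


def sum_code(labels, positive_level):
    """Sum code a two-level factor: positive_level -> 1, the other level -> -1."""
    if len(set(labels)) > 2:
        raise ValueError("this sketch handles two-level factors only")
    return [1 if label == positive_level else -1 for label in labels]


# Invented example data (not taken from the corpus).
speech_rate = [4.0, 6.0, 8.0]
accent = ["accented", "unaccented", "accented"]

print(center(speech_rate))            # centered Speech Rate values
print(sum_code(accent, "accented"))   # sum-coded accent status
```

With this coding, a value of zero on every predictor represents the mean of each scalar factor and the midpoint between the two levels of each categorical factor, which is what makes the intercept interpretable as an overall average in logit units.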