Using social media as a source of analysable material in phonetics and phonology – lenition in Spanish

Abstract: The COVID-19 pandemic has shown that alternative methods of data collection are necessary to continue working in certain fields of linguistics. This is a challenge for (socio)phoneticians and phonologists who have to rely on good quality sound but cannot do fieldwork or gather recordings in a traditional manner. In this paper, I show that audio recordings made via social media can help alleviate this problem. To this end, I compared samples from five speakers of dialectal Spanish recorded in a laboratory setting and via a social media application (WhatsApp). The analysis of temporal and spectral characteristics of consonants in postvocalic position shows that recordings made via social media can be successfully used for at least some types of sociophonetic analysis. They also provide some additional advantages for researchers: ease of data collection, potentially large speech corpora, and access to authentic, naturalistic speech which is uninhibited by laboratory conditions or the presence of a researcher and a professional recording device.


Introduction
The outbreak of COVID-19 had a negative influence not only on the everyday life of societies across the globe, but also on the way academic output is generated by thousands of researchers. This is of utmost importance for experimental linguists, and phoneticians in particular, as the type of data they are dealing with has to be carefully collected, in accordance with international standards of data quality. Typically, linguists dealing with spoken language rely on hands-on recordings made either in the laboratory or in naturalistic settings, that is, in the place of residence of the analysed speakers. The pandemic has made this task practically impossible over the past three years. On many occasions it was impossible to travel or meet with study participants, and new methods had to be employed to gather data and allow research to go forward. In this context, the use of modern technologies comes to mind as a possible solution. This includes not only online or remote studies, for instance, but also the use of social media. In this paper, I present an instance of the latter.

Social media as a source of data for phonetic analysis
Social media provides an enormous source of language data produced in real-life situations and without the oversight of a researcher. On the one hand, this leads to uncontrolled samples where the produced sentences are not designed to contain the objects of interest, which makes it more difficult to get enough data to investigate the desired phenomena; larger corpora may be needed to elicit what the researcher is looking for. On the other hand, we get the language as it is used on an everyday basis, in real-life situations, and with real interlocutors, without hypercorrection or inhibition produced by a laboratory setting or the presence of a recording device.
Unconstrained language productions have to date been the object of interest mostly of text linguists (see e.g. Sun et al. 2021 for a review; also Baldwin et al. 2013; Rüdiger and Dayter 2020). However, it is also possible to use speech recordings produced via social media either for sociolinguistic or, more specifically, for (socio)phonetic analysis. A major concern in the latter case is recording quality, as only high quality recordings are suitable for a more in-depth acoustic analysis. This topic has been examined fairly recently in the context of remote data collection (see especially Freeman and De Decker 2021; Zhang et al. 2020, 2021). Zhang et al. (2021), for instance, compare different devices and applications used either on a computer or on a smartphone, including video conferencing, for remote data collection, pointing to the advantages and disadvantages of this method. While such data collection proved useful, several data quality issues, such as formant and F0 distortions, need to be taken into account. However, not all types of phonetic analysis require the absolute reliability of these metrics, which is of consequence for the present study. Analysing relative duration and intensity is (one hopes) a lot simpler and less sensitive to data quality issues. Of relevance here is that these are the two most important variables in studying lenition. At the same time, there is an important distinction between remote but controlled data collection and naturalistic recordings made over social media. While remote data collection is a conscious process which involves a task (usually reading or repeating sounds or sentences), social media recordings give access to more authentic productions because they are not made for the purpose of phonetic analysis. Rather, the data are later repurposed for speech analysis (with the speaker's consent). This is a major advantage in studying lenition, where we are dealing with optionality and variation, and we often want to get insight into the rates of sound changes not only inside, but also across, words.
In this context, I would like to show that sufficient quality can be obtained from social media recordings at least for some types of phonetic/phonological analysis. Thus, in the subsequent sections of the paper, I present the results of a comparative analysis of recordings obtained via a popular social media application, WhatsApp (https://www.whatsapp.com), and recordings obtained in an experimental setting. As will be argued in Section 5, naturalistic recordings give us insight into the actual productions of speakers in the course of peer-to-peer communication and into the extent of lenition in this context, which is much greater than when measured in a controlled study.

The process under analysis
The process taken into account in the comparative analysis is postvocalic stop voicing, a lenition phenomenon typical of certain dialects of Spanish (Quilis 1993; Machuca Ayuso 1997; Martínez Celdrán 2009; Torreblanca 1976; Torreira and Ernestus 2011), especially in the Canary Islands (Herrera Santana 1989; Marrero 1986; Oftedal 1985; Trujillo 1980). Some examples of the process are given in (1).
According to the literature, the inhabitants of the Canary Islands produce voiced variants of /p t k/ in spontaneous speech regardless of age, sex, or social status, although the process is more common among uneducated speakers of the islands and residents of rural areas (Trujillo 1980). However, contrary to expectations, a recent study by Broś and Lipowska (2019) showed that full stop voicing occurs only 29.5% of the time, and partial voicing only 15.4% on average. The question is to what extent these results are due to the experimental setting of that study, rather than sociolinguistic or diachronic factors. Interestingly, the 2019 study also showed that full or partial voicing is not necessarily the only result of /p t k/ lenition in the dialect. Voiceless stops can also be produced as voiced approximants (e.g. fonética [fo.ˈne.di.ɣ̞a]). The typical phonetic manifestation of this change is higher intensity with respect to the flanking vowels, and the resulting approximants have a visible formant structure. At the same time, /p t k/ can undergo voicing or remain unvoiced but be produced without a burst. Thus, a complex picture of lenition ensues in which sound changes should be treated as a continuum, from voiceless stops to approximants, with various features changing. All of these factors were explored by Broś and Lipowska and a comparison of their results with spontaneous data from this study can provide new insight on the matter. Most importantly, it can inform us about the true nature and degree of lenition in the dialect given that controlled studies can underrepresent the actual rate of phonetic processes.

Data and methodology
The data used for the analysis were taken from two sources: a controlled production study performed in 2016, and recordings taken from WhatsApp conversations from 2020 made by the same speakers as those who participated in the production study.[2] The focus of the paper is on the lenition of /p t k/ in dialectal Spanish. Thus, I present a comparison of the way in which Spanish stops are lenited in both types of recording, that is, in both types of setting (laboratory vs. social media), by the same speakers, which gives us direct insight into the "measurement bias" or inaccuracy produced either by the presence of a researcher or recording device or by the "lab setting" typical of a controlled study in which a speaker has to repeat or read out a fixed set of sentences. Since the comparison was made to provide information on the reliability of social media-based data, the analysis should be treated as a pilot study. As will be shown in subsequent sections, however, the method can be extended to other speakers and more robust data in the future provided that an appropriate sampling and data selection procedure is applied.

Data under analysis
The first corpus of data consists of 39 sentences each containing one instance of postvocalic /p t k/ and produced twice by five native speakers of Gran Canarian Spanish (one female, four males) aged 23-24. Each sentence consisted of the phrase He comprado cinco 'I have bought five' followed by a noun phrase consisting of a noun denoting a container or an object (the target word) and a prepositional phrase, such as cubos de basura 'trash bins'. The preceding sound was [o], whereas the target segment was [p], [t], or [k], in which postvocalic lenition was expected. There were 13 p-initial, 13 t-initial, and 13 k-initial words (see Broś and Lipowska 2019 for details).[3] The recordings were made using a Zoom H4N digital recorder and a Shure SM10a headworn microphone.
The second corpus consists of spontaneous recordings made by the speakers themselves in the course of their interactions with each other via WhatsApp. The preliminary corpus consisted of 25 short voice messages. While some of these recordings can be useful for sociolinguistic or impressionistic phonological analysis, they were deemed unsuitable for acoustic analysis due to insufficient sound quality. Only a subset of messages were recorded in a silent setting. Those recorded outdoors are too noisy to allow for a reliable analysis of voice distinctions or the intensity profile of the sounds of interest. Nevertheless, the remaining 16 recordings were of sufficient quality to allow for acoustic measurements (see Table 1).

Acoustic measurements
To determine the amount of voicing, I followed the procedure used by Broś and Lipowska (2019), that is, I deemed a given sound partially voiced if more than 50% of the sound duration showed voicing on the spectrogram and fully voiced if pulses were present throughout the sound following a visual inspection taking into consideration the voicing bar and the voice report from Praat (Boersma and Weenink 2021).
[2] The results of the controlled study were published by Broś and Lipowska in 2019 and only a part of that data is analysed here. The WhatsApp recordings were made by the speakers themselves in a context unrelated to the study in which they had earlier participated. The recordings were then provided by the speakers, and consent was given for them to be analysed as a part of this paper.
[3] The experiment also included words starting with a palatal /tʃ/. They were excluded from the present analysis given that /tʃ/ is not used so often in spontaneous speech and the rate of voicing in this case differs substantially from the voicing of /p t k/.
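The voicing criterion described above can be written as a small decision rule. The sketch below is only illustrative and is not the procedure actually run for the study, which relied on visual inspection together with the Praat voice report. The function name, the input `voiced_fraction` (the share of the consonant's duration showing voicing, which one might approximate from the proportion of voiced frames in a voice report), and the treatment of everything at or below 50% as voiceless are assumptions introduced here.

```python
def classify_voicing(voiced_fraction: float) -> str:
    """Label a consonant by the share of its duration that shows voicing.

    voiced_fraction is a hypothetical input between 0 and 1; the paper's
    own labels came from visual inspection plus the Praat voice report.
    """
    if not 0.0 <= voiced_fraction <= 1.0:
        raise ValueError("voiced_fraction must lie between 0 and 1")
    if voiced_fraction >= 1.0:
        # Pulses present throughout the sound.
        return "fully voiced"
    if voiced_fraction > 0.5:
        # More than 50% of the sound duration shows voicing.
        return "partially voiced"
    # Assumption: anything at or below 50% counts as voiceless.
    return "voiceless"


for share in (1.0, 0.7, 0.5, 0.2):
    print(share, classify_voicing(share))
```

In practice a strict threshold of 1.0 for full voicing may be too demanding for automatically extracted values, which is one reason a visual check of the voicing bar remains useful.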
To assess the degree of aperture of the sound in question, that is, relative degree of lenition, I measured the minimum intensity of the consonant and the maximum intensity of the following vowel to calculate the intensity difference in decibels. It is a general rule in lenition literature that the more lenited the consonant, the higher its minimum intensity and hence the smaller the intensity difference between the consonant and the following vowel (Hualde and Nadeu 2011;Parrell 2010). This principle has been taken further to assume that lenition in intervocalic position leads to the greater continuity of the speech signal by making the consonant more similar to the flanking vowels (Katz 2016;Kingston 2008).
According to the literature, the duration of the sound is correlated with the degree of weakening (see e.g. Cohen Priva and Gleason 2020), thus I also measured relative sound duration, following Dalcher (2008) and Broś and Lipowska (2019), as a ratio of total consonant duration to total VCV sequence duration.
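The two numerical measures just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for the study: the `VCVToken` structure, its field names, and the example values are hypothetical, and the boundaries and intensity values are assumed to have been extracted beforehand (for instance from annotated Praat intervals).

```python
from dataclasses import dataclass


@dataclass
class VCVToken:
    """One vowel-consonant-vowel sequence (hypothetical structure)."""
    v1_start: float             # start of the preceding vowel, in seconds
    c_start: float              # start of the target consonant
    c_end: float                # end of the target consonant
    v2_end: float               # end of the following vowel
    c_min_intensity_db: float   # minimum intensity of the consonant
    v2_max_intensity_db: float  # maximum intensity of the following vowel


def intensity_difference(token: VCVToken) -> float:
    """Following-vowel maximum minus consonant minimum, in dB."""
    return token.v2_max_intensity_db - token.c_min_intensity_db


def relative_duration(token: VCVToken) -> float:
    """Consonant duration as a share of the whole VCV sequence."""
    vcv_duration = token.v2_end - token.v1_start
    if vcv_duration <= 0:
        raise ValueError("VCV sequence must have positive duration")
    return (token.c_end - token.c_start) / vcv_duration


# Made-up values for illustration only.
token = VCVToken(0.00, 0.10, 0.19, 0.25, 55.0, 70.0)
print(intensity_difference(token))
print(round(relative_duration(token), 2))
```

A smaller intensity difference and a smaller relative duration would both point to a more lenited consonant, in line with the literature cited above.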
Other variables taken into account in the analysis were: the presence of burst and the presence of formants. Given that bursts are produced only in maximally constricted consonants in which complete closure must be maintained for at least 20-30 ms (Shadle 1997), lenition can lead to burst-less stops, either voiced or voiceless (e.g. Dalcher 2008). Also, weakened consonants are less constricted in terms of muscle tenseness and proximity of the articulators (Vennemann 1988), which results in changes in manner (i.e. approximantization). As mentioned above, voiceless stops in the dialect have a tendency to be weakened in precisely this way, which results in the presence of formants on the spectrogram.

Statistical methods
The analyses presented below were all conducted in R 4.1.2 (R Core Team 2020) with the package lme4 (Bates et al. 2018). Two linear mixed-effects regression models were estimated using the function lmer(), each run for one of the two numerical variables (intensity difference and relative consonant duration) as the dependent variable. Additionally, mixed-effects binary logistic regression models were built for the following binary variables: burst, voicing, and formants. This was done using the glmer() function with binomial family and the logit link function. In each model, setting (lab or social media) was used as a fixed factor and there were random intercepts for speaker, word used, and, additionally, place of articulation (labial, coronal, dorsal). A final set of models was then built with intensity difference as a dependent variable and all the other variables as fixed factors (see Section 4.2), from which an optimal fit was chosen. Full random structure was not employed due to problems with model convergence, thus no random slopes were included. Note that the total number of sounds in the laboratory setting was 78 per speaker (39 sentences × 2). However, instances in which a pause was made between the left-hand context (cinco 'five') and the target word were excluded, resulting in slightly different numbers of sounds per speaker.
Before getting to the statistical analysis of the data, it is worth looking at the phonetic outputs of underlying /p t k/ from the two corpora. Figure 1 presents a comparison of voicing in the sound /p/, which also allows for the evaluation of data quality in the two samples. It is worth noting that the intensity profiles of the two sound samples are similar, although intensity ranges differ, which may be due to the different recording conditions, proximity of the microphone, or differences in sound file type (and the need for sound conversion in the case of the social media; see Section 6). As for the other characteristics of the spectra, a voicing bar is perfectly visible in both cases and the vowels' formant structures can be easily appreciated. In addition, the sound waves corresponding to these spectrograms do not show any distortions or unexpected patterns. Vowels have correct periodic cycles and the voiced [b] shows signs of vocal cord vibration during closure. Thus, the two types of recording seem to be well suited for a comparative analysis.

Descriptive statistics
A total of 670 observations from five speakers were analysed. The mean number of observations per speaker was 134 (SD = 27.62). The number of target consonants produced in each setting was similar (377 in the lab setting vs. 293 in the social media app), which makes the samples comparable. All speakers produced 68-78 target sounds in the lab setting and somewhat fewer, 59 on average, in the social media files, although it must be noted that the latter mean is raised considerably by one speaker (see Table 1).
Turning to the sounds that were classified as voiced in the data, they were a minority of productions in the lab setting (43.8%) but a majority in the social media data (76%). This change is dramatic and the general reversal of voicing versus no voicing holds for each speaker (see Figure 2).
Quite interestingly, we can see in Figure 2 that while there are substantial inter-individual differences between speakers in the lab setting (for instance, Speaker 3 has very little voicing in general), all speakers seem to be quite uniform in the percentage of voicing in a naturalistic setting with only Speaker 1, a heavy voicer in the lab setting as well, seeming to differ somewhat. This suggests that speakers in the same age range speak in a similar fashion, with similar rates of lenition, but their strategies pertaining to supervised speech differ. Some of them perhaps became more nervous or have a greater tendency for hypercorrection, while others may not have been as intimidated by the way in which they were being recorded. This issue should be pursued further in a separate study, however, due to the small number of speakers considered here.
As for the intensity difference of the produced sounds, this differed depending on the setting (see Figure 3). In the lab, the speakers tended to produce stops with a greater intensity difference (M = 19.8, SD = 7.99), that is, with less lenition compared to the social media setting (M = 14.9, SD = 6.83). The same tendency was observed in the relative sound duration. As shown in Figure 4, sounds were generally longer (M = 0.36, SD = 0.08) in the lab setting compared to the social media setting (M = 0.32, SD = 0.09), which also indicates less lenition.
As for other markers of lenition, the number of sounds produced without a burst, regardless of voicing, was greater in the social media setting (60.4% vs. 28.6% in the lab); the number of sounds produced with formants visible on the spectrogram was also greater (31.4% vs. 24% in the lab); see Figure 5.
It is worth observing that the overall intensity differences between sounds classified as approximants, voiced stops, and voiceless stops are very similar in the two settings and in line with previous literature (e.g. Broś et al. 2021), according to which underlying /p t k/ produced as approximants have the smallest intensity difference (around 5-12 dB), those produced as voiced stops are in the mid-range (around 15-20 dB), and those which surface as voiceless stops have values closer to 22-24 dB. Although the ranges of values are slightly different in the case of social media compared to the lab setting, the proportional changes between them are comparable. Since lenition and gestural overlap are greater in the social media context, this may translate into overall lower values of intensity for all surface pronunciations of underlying /p t k/ and hence lower than expected values in voiced stops, for instance (see Figure 6). Taking all of the above into consideration, however, there is reason to believe that the data coming from social media are reliable in terms of capturing the intensity of the produced sounds.

Statistical models
The effect of setting was significant in all the models. Intensity difference was much smaller in the social media recordings compared to the lab setting (t(128.05) = −8.3, p < 0.001). Sound duration was shorter in the social media context (t(158.81) = −5.55, p < 0.001). Also, there was an increased probability of voicing (z = 8.04, p < 0.001), a lower probability of producing underlying /p t k/ with a burst (z = −7.48, p < 0.001), and a higher probability of producing a sound with visible formants (z = 2.34, p = 0.019) in the social media recordings compared to the lab setting.

The additional model built for the data using intensity difference as a dependent variable (see Section 3.3) included all the remaining variables as fixed factors and first-degree interactions between setting and each of burst, formants, and relative consonant duration, and between voicing and burst. The reason for running this model was that intensity changes are perhaps the most common marker of lenition according to previous studies (Cohen Priva and Gleason 2020; Katz 2016). Thus, I wanted to see how the degree of lenition is moderated by the remaining phonetic factors, and by the setting in which a given sound is produced. I used the lmer() function with a maximal model and then compared it with simpler models using anova(). As a result, a final best fit was selected with fixed factors as above and the following interactions: burst*setting, voicing*setting, relative duration*setting. Setting (F = 9.87, p < 0.01), relative duration (F = 31.7, p < 0.001), and formants (F = 39.76, p < 0.001) proved significant, while burst was at the statistical tendency level (F = 3.69, p = 0.055). All of the interactions were significant (see Table 2 for the summary of the results). The effect plots (Figure 7) show that sounds produced with a burst have a larger intensity difference with respect to the following vowel in the social media setting compared to the sounds produced without a burst. As for voicing, when it was produced, it was correlated with a smaller intensity difference compared to voiceless sounds in both settings in a quite similar way. Finally, greater relative sound duration was correlated with a smaller intensity difference in the social media setting compared to the lab and a positive correlation between the two variables only ensues in the lab setting (see Table 2 for the estimates).

Discussion
The analysis presented in this paper compared lab recordings with voice messages recorded in a naturalistic setting by the speakers themselves and without any reservations related to the fact of being recorded for scientific or professional purposes. The latter are a good sample of natural speech which is not hypercorrect, inhibited, or otherwise altered due to the presence of a third person. These recordings were also made for personal purposes, in voice message conversations with close friends, hence they represent speech produced similarly to face-to-face communication, which is a great advantage for phoneticians seeking reliable samples of natural, spontaneous speech and wanting to capture the real dimension of sound change.
Note on the effect plots (Figure 7): the error bars and the shaded areas correspond to the 95% confidence intervals for the fitted values. When these overlap (as on the first plot), it signifies greater data variability and suggests that the plotted difference may not be significant.

The comparison presented in Section 4 showed that there is a significant difference between the sounds produced in the two settings, and this difference obtains in the expected direction and in all the tested lenition markers. Thus, the data give us insight into the degree of inhibition of lenition in an experimental setting as opposed to natural, everyday speech, which constitutes a substantial methodological contribution to the fields of phonetics and laboratory phonology. In addition, the data clearly show that inter-speaker variability is substantially greater in the lab setting. By contrast, in the social media recordings all speakers tend to speak in a similar way, at least in terms of weakening /p t k/, which can be appreciated in Figures 2, 3, and 4. As suggested by a reviewer, this is especially relevant in the context of Labov's (1966, 1969, 1970) and Tarone's (1979, 1983) insights that the style used when minimal attention is paid to speech is the most systematic and that as we pay more attention to our speech (e.g. when reading sentences out loud or being recorded by third persons) we introduce inconsistencies in our productions. Thus, social media recordings provide a way to get to the most stabilized speech styles and the actual grammatical competence of native speakers.
Furthermore, the analysis provides information on the viability of social media recordings in sound change research. While the data quality was not ideal, and some of the recordings had to be rejected, social media has a substantial advantage: we are able to obtain great amounts of data to choose from, provided that speakers are willing to share their personal content. As a consequence, we can potentially get a greater amount of naturalistic data and hence more accurate statistical information on sound change than by using other methods (such as remote data collection, elaborate methods of elicitation, and less "controlled" study protocols). Needless to say, however, such data should always be treated with caution given lossy file formats, poor signal-to-noise ratios, and possible sound distortions. In every case, both pros and cons should be carefully considered in line with the particular research question and sensitivity of a given acoustic measure to data conversion and storage in formats other than lossless WAV or similar files.

Limitations of the study
The study presented in this paper has certain limitations, which are mostly due to the type of data analysed. Apart from the small sample and limited number of target sounds in the second corpus, which can be explained by the fact that the purpose of the paper was to provide a methodological alternative for traditional recording sessions in linguistics, data quality is a major concern. Most importantly, the data were recorded in the social media app's native format, OPUS, which is a lossy but high-quality compressed audio format. In order to analyse such files in Praat, the researcher has to convert them into one of the supported file types, preferably WAV. Thus, when using speakers' own recordings, it would be better to ask them to record themselves using an application supporting WAV (see e.g. Zhang et al. 2021 or De Decker and Nycz 2011 for recommendations). This would result in better quality recordings that can be passed directly for analysis, but it would compromise the authenticity of everyday communication.
Depending on the type of analysis, the file type issue should be carefully considered by the researcher. For instance, in the case of stop lenition, the acoustic measurements used were mostly relative values, that is, they were measured relative to the adjacent parts of the recording. Absolute intensity, which can be distorted or can differ depending on the speaker or distance from the smartphone, was not used. Instead, I looked at the intensity parameter relative to the intensity of the flanking sounds. The distortions produced in the course of the recording or data conversion should be the same across the sounds being considered. As for duration, OPUS files are known to produce a lag of 10-20 ms relative to original time, but again I used relative duration so this should not be a problem. In the case of voicing, measurements will be highly dependent on the quality of the recording and the amount of noise. For this reason I chose the cleanest files possible. It may be, however, that voicing parameters are not that reliable following conversion, so the results should be interpreted with caution. The upside of the analysis is perhaps the fact that I annotated sounds both based on the automatically generated voice report in Praat and on a careful auditory and visual inspection of the sounds. The results of the two types of sound categorization for voicing were very similar.
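The reasoning behind the preference for relative measures can be illustrated with a short sketch. It assumes, as the paragraph above does, that any distortion is constant within a token: a uniform level offset in dB and a uniform time lag applied to every boundary. The function names, the token values, and the offset and lag sizes are hypothetical and are not measurements from the corpus.

```python
import math


def intensity_difference(c_min_db: float, v_max_db: float) -> float:
    """Following-vowel maximum minus consonant minimum, in dB."""
    return v_max_db - c_min_db


def relative_duration(v1_start: float, c_start: float,
                      c_end: float, v2_end: float) -> float:
    """Consonant duration as a share of the whole VCV sequence."""
    return (c_end - c_start) / (v2_end - v1_start)


# Hypothetical token: intensity values in dB, boundaries in seconds.
c_min_db, v_max_db = 55.0, 70.0
bounds = (0.00, 0.10, 0.19, 0.25)

# Assumed distortions, constant across the whole token.
gain_db = -6.0   # uniform level offset (e.g. greater distance from the phone)
lag_s = 0.015    # uniform time lag applied to every boundary

diff_clean = intensity_difference(c_min_db, v_max_db)
diff_shifted = intensity_difference(c_min_db + gain_db, v_max_db + gain_db)

dur_clean = relative_duration(*bounds)
dur_lagged = relative_duration(*(t + lag_s for t in bounds))

# Both comparisons hold: the constant offset and lag cancel out.
print(math.isclose(diff_clean, diff_shifted))
print(math.isclose(dur_clean, dur_lagged))
```

The cancellation depends entirely on the distortion being uniform. A distortion that varied within a token, or between the consonant and the following vowel, would not cancel in this way, which is why the results should still be interpreted with caution.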
Another important issue, related to the previous one, concerns the conditions in which recordings are made. Recording quality will differ substantially depending on background noise, other people speaking, sound reverberation, and so on. Thus, a lot more data has to be gathered than would initially appear to be necessary, due to data loss. To obtain an optimal sample for statistical analysis, one has to get numerous recordings and sift through them, possibly dismissing as much as half of the data before starting acoustic analysis. Also, with this type of recording, one cannot control the sounds or words produced, nor the contexts in which they appear, which can be a disadvantage for some studies. Conversely, it can also be an advantage, for instance when studying lenition or sandhi phenomena. In such cases, truly spontaneous data are desirable because we are looking for processes that happen in connected speech and may be inhibited during reading. Needless to say, similar advantages and disadvantages apply in the case of fieldwork recordings, which often suffer from bad quality due to multiple speakers, wind, or background noise despite the best of recording devices, as well as problems with controlling phonetic environments.