Individual differences in attention control and the processing of phonological contrasts in a second language

This study investigated attention control in L2 phonological processing from a cognitive individual differences perspective, to determine its role in predicting phonological acquisition in adult L2 learning. Participants were 21 L1-Spanish learners of English, and 19 L1-English learners of Spanish. Attention control was measured through a novel speech-based attention-switching task. Phonological processing was assessed through a speeded ABX categorization task (perception) and a delayed sentence repetition task (production). Correlational analyses indicated that learners with more efficient attention switching skill and faster speed in correctly identifying the target phonetic features in the speech dimension under focus could perceptually discriminate L2 vowels at higher processing speed, but not at higher accuracy rates. Thus, attentional flexibility provided a processing advantage for difficult L2 contrasts but did not predict the extent to which precise representations for the target L2 vowels had been established. However, attention control was related to L2 learners’ ability to distinguish the contrasting L2 vowels in production. In addition, L2 learners’ accuracy in perceptually distinguishing between two contrasting vowels was significantly related to how much of a quality distinction between them they could make in production.


Introduction
Speaking and understanding a second language (L2) is a complex, cognitively demanding task. L2 users at all levels of competence normally need to invest more effort when using their L2 than when using the language they grew up speaking at home (first language or L1) in some or all of the linguistic domains (morphology, syntax, vocabulary, pronunciation) needed to function fluently in everyday communication. This is because in the L1, many of the processes involved in using language (e.g., grammatical and phonological encoding and decoding; lexical activation, selection and retrieval; articulation) are characterized by automaticity and processing efficiency and occur fluently and effortlessly. By contrast, in the L2 such processes are less automatic, require effortful processing and usually result in dysfluent language use (Segalowitz 2010). This is particularly noticeable in the case of instructed adult L2 learners in classroom environments with very limited exposure to authentic L2 input and few opportunities for meaningful language use beyond a few hours of instruction per week (Muñoz 2014). The limitations of this kind of language learning experience are especially striking in the domain of L2 phonology for most learners, because sustained L2 input is necessary for learners to improve L2 phonological processing as well as to establish precise phonetic representations for L2 sounds, which would allow them to acquire the segmental contrasts of the L2 and develop an L2 phonological system (Tyler 2019).
Although exceptional outcomes in phonology have been reported for some learners, arguably due to a combination of learning styles and cognitive (aptitude and talent), psychological (motivation and strong sense of L2 self) and experiential factors (e.g., length of residence) (Moyer 1999, 2014), most adult L2 learners struggle with L2 pronunciation and, without pronunciation instruction, tend to see only modest improvements in comprehensibility or accentedness over time. In addition, previous studies assessing L2 phonological acquisition in naturalistic, classroom and lab training contexts have shown large inter-learner variation in performance (Bradlow et al. 1999; Derwing and Munro 2013; Golestani and Zatorre 2009; MacKay et al. 2001). The sources of this variability have been widely investigated and attributed to a myriad of linguistic, contextual and learner variables: the extent to which the L1 and the L2 differ phonetically, age-related factors, amount and quality of L2 input, frequency and amount of L1 and L2 use, motivation, or language learning aptitude (see Munro and Bohn 2007; Piske et al. 2001 for a review). Among these factors, age of onset of L2 learning, input quality and quantity and amount of L2 use have been shown to explain a substantial amount of variance in L2 phonological acquisition in immersion settings (Flege 2008), whereas aptitude-related factors remain under-researched in both naturalistic and classroom learning contexts.
Executive control functions (e.g., memory, attention, inhibition) constitute one source of aptitude-related inter-learner variation in L2 phonological acquisition (e.g., Darcy et al. 2016; Ghaffarvand Mokari and Werner 2019), as they underlie the effective functioning of speech processing mechanisms both in L1 and L2. Recent research (Saito et al. 2019, 2020, 2021) also suggests that general auditory processing skills may explain L2 speech development. In the present study we investigate the cognitive mechanism of attention control, and more specifically attentional flexibility (i.e., attentional switching skill) as a source of individual differences in L2 speech perception and production in instructed L2 learners.

Attention control and language processing
The executive network of the human brain, also known as executive control or executive function, is responsible for a set of cognitive control mechanisms (executive functions) that allow an individual to function efficiently in terms of self-control, problem solving, task shifting, action planning and goal implementation (Petersen and Posner 2012). Such mechanisms include the updating and mental manipulation of information in working memory (updating), selectively attending to information under focus while inhibiting irrelevant information (inhibiting), and efficiently shifting attention between tasks or representations (shifting) (Miyake and Friedman 2012), all of which are implicated in speech processing and language comprehension and production, and consequently in second language acquisition (SLA). Phonological short-term memory, the subcomponent of working memory responsible for temporarily holding auditory verbal information in working memory (updating), is implicated in L2 vocabulary acquisition (Speciale et al. 2004), L2 grammar learning (French and O'Brien 2008; Kormos and Sáfár 2008) and L2 speech perception (Darcy et al. 2015; MacKay et al. 2001). Inhibitory control is responsible for a bilingual's control of language interference by inhibiting the language not in use (Green 1998). It is thus related to L2 phonological development, leading to lower levels of L2 phonological influence on the L1 in long-term immersion Peperkamp 2013, 2014) as well as less interference from the L1 in L2 phonological processing in instructed SLA (Darcy et al. 2016). However, much less research has investigated attention control as a source of inter-learner variability in L2 phonological acquisition despite its potentially central role (Ellis 2006; Robinson 1995). Since attention control determines the extent to which linguistic form can be attended to while meaning is being processed (Van Patten 2004), it could play a mediating role between input and acquisition.
Attention control (shifting) has been shown to explain a significant amount of variance in L2 learners' proficiency (Segalowitz and Frenkiel-Fishmann 2005) and speaking fluency (Taube-Schiff and . Attention also relates to general processing mechanisms involved in the perception and production of speech. For example, it guides auditory processes during speech perception by focussing processing resources on the relevant information, and by allowing listeners to select the acoustic information that is critical for appropriately interpreting auditory events during oral communication (Baese-Berk et al. 2015; Mattys and Wiget 2011). Attention shifting has also been shown to facilitate perceptual learning, predicting listeners' skill in understanding an unfamiliar accent (Janse and Adank 2012) and seems connected to processing speed in native speakers' tonal discrimination (Ou et al. 2015; Ou and Law 2017). Taken together, attention shifting appears to contribute to experience-related quality differences in the nature of phonological representations in long-term memory (Heald and Nusbaum 2014). In speech production, attention skills are important in word planning processes (Sikora et al. 2016) and in resolving the selection of cross-linguistically co-activated linguistic representations in bilinguals (Kroll et al. 2008).
Given that individuals vary in attentional capacity (Petersen and Posner 2012) and in the use they make of their attentional resources (Wager et al. 2006), inter-learner differences in phonological attainment may be partly due to individual differences in attention control. Learners must be able to shift their attentional focus flexibly between various phonetic cues and phonological dimensions (e.g., segmental duration and quality, distributional constraints, pitch changes) during phonological processing as spoken messages unfold in time in a way that is specific to the L2 and may differ from the L1. For example, in English segmental duration needs to be attended to as a primary cue to the voicing of word-final devoiced obstruents (longer /eɪ/ in plays than place) but is normally negligible as a cue in the identification of vowels (longer /iː/ in beat than bit), where vowel quality is the primary cue. Thus, learners of English need to develop attentional flexibility in the L2-specific use of segmental duration as a phonological cue in the phonology of English.
Although indirect evidence of the implication of attention in L2 speech learning may be found in the effectiveness of directing learners' attention towards specific phonetic dimensions during phonetic training (Guion and Pederson 2007) or acoustic cue manipulations (Iverson et al. 2005), evidence of a direct relationship between attention control skills and L2 speech learning is still inconclusive. For example, some phonetic training studies have found an association between auditory selective attention and accuracy gains in perceiving target phonological contrasts (Mora and Mora-Plaza 2019; Oliveira 2020), but others have not (Ghaffarvand Mokari and Werner 2019). One study examined whether differences in attention control predicted accuracy of L2 phono-lexical encoding but did not find a relationship (Daidone and Darcy 2021). The present study extends this line of research by focussing on the relationship between attention switching skill (measured through a novel speech-based attention switching task) and L2 phonological processing in the perception and production of L2 sound contrasts.

The present study
The goal of the present study is to explore the relationship between attention control and L2 phonological processing from an individual differences perspective. We hypothesized that a more efficient attention control may enhance the processing of acoustic-phonetic information in the input by bringing relevant (L2-specific) acoustic information to the foreground during speech processing while keeping irrelevant information in the background, which would lead to more accurate processing of L2 phonological categories in perception and production. We examine the relationship between performance on a domain-specific (speech) measure of attention control and measures of L2 phonological perception and production for two groups of late-onset L2 learners: a group of L1-English learners of Spanish and a group of L1-Spanish learners of English. Following previous research in the domain of grammar (Segalowitz and Frenkiel-Fishmann 2005), we chose to test attention switching skill through a domain-specific task (speech) that aimed at capturing L2 learners' individual differences in attention control during the processing of two types of phonetic information required in L2 speech learning: a specific language-independent segmental aspect of speech sounds (nasal vs. non-nasal) and a set of language-specific phonetic differences characterising a sound sequence (Spanish-like phonetics vs. English-like phonetics). As the phonetic and phonological dimensions that need to be attended to for successful phonological development are complex and mostly language-specific, we think of efficient attention control as a built-in cue enhancement device by means of which the appropriate relevant phonetic cues are brought to the perceptual foreground in L2 speech processing. We assume that learners need not only to learn to attend to the relevant phonetic cues or dimensions when processing L2 speech sound contrasts (i.e., they need to learn the specific phonetic cue-weighting of linguistically relevant phonetic dimensions such as voicing, duration, or spectral information in the L2), but also to learn to bring a specific phonetic dimension to the attentional foreground in one context and to the attentional background in another. L2 learners' ability to efficiently switch their attention between a specific language-independent segmental aspect of speech sounds (e.g., presence vs. absence of nasal resonance for nasal consonants) and a set of language-specific phonetic differences characterising a sound sequence (differential segmental phonetic properties of Spanish-like vs. English-like speech, e.g., VOT in oral stops or unstressed vowel reduction) is a way to obtain a measure of attention switching skill in a speech processing context that would closely resemble L2 learners' use of their attentional skills in processing L2 phonetic information. None of the currently available methods of assessing attention control have exclusively targeted phonological dimensions (but see Darcy et al. 2015; Safronova 2016), which we deemed crucial in establishing a link between attention control and phonological acquisition during L2 phonological processing. In this study, we have developed a novel, fully phonologically oriented version of an attention control task with the aim of observing L2 learners' efficiency in the use of their attentional flexibility resources during the phonological processing of L2 auditory stimuli.

Methods
We obtained measures of attention control with our attention shifting task and examined how these related to measures of learners' L2 speech perception (ABX categorical discrimination task) and production (delayed sentence repetition). We also obtained demographic background information from learners and, as vocabulary size may partly determine L2 phonological competence (Bundgaard-Nielsen et al. 2011), we also estimated their receptive vocabulary size as a phonologically related measure of overall proficiency (Uchihara and Clenton 2020) that we could control for. Participants included in the study had passed a pure-tone audiometry test (Reilly et al. 2007). A retrieval-induced inhibition task and a working memory (serial non-word recognition) task were also administered but are not reported here.
We tested L2 speech perception bi-directionally (Darcy et al. 2016), i.e., both learner groups were tested on English and Spanish stimuli, so that the English and Spanish stimuli served as control stimuli for the L1-English and L1-Spanish learners, respectively, and L1-English learners served as controls for the Spanish-learners' performance on the English stimuli, and vice versa. This design enhances generalizability because of the language-independent nature of any potential effects and correlations. For L2 speech production we used baseline measures from two groups of L1-English (n = 7) and L1-Spanish (n = 6) speakers recruited in the US and Spain, respectively, and who grew up speaking their L1 (English or Spanish) at home from birth. These speakers reported having undergone a primarily monolingual language learning experience. They had studied foreign languages at school but reported only using their L1 (English or Spanish) on a daily basis and stated they could not speak languages other than their L1 (English or Spanish) in a fluent manner. A limitation of this approach is that learners' performance was being compared to L1-English or L1-Spanish baselines that do not necessarily correspond to the input L2 learners are exposed to when acquiring the L2 through formal instruction or exposure to media.
The testing procedures were similar for both learner groups. They did the production task first, followed by the attention control task, the perception task, the vocabulary size test and the demographic and language background questionnaires. The testing session lasted 90 min approximately, including breaks between tasks. L2-Spanish learners were tested in a psycholinguistics laboratory at Indiana University in Bloomington (USA), whereas L2-English learners were tested (2-4) in the phonetics laboratory at the University of Seville (Spain). Participants were compensated for their participation either through a small payment or a USB-memory drive.

Participants
Participants were 21 Spanish learners of English (L2-English) and 19 English learners of Spanish (L2-Spanish) (see Table 1 for demographics). The learners' language background was determined in the call for participation, which stated that we were looking for native speakers of Spanish (in Spain) and of English (in the US). In addition, the language background questionnaire that participants were asked to fill in included several questions to determine whether they had been raised in either Spanish- or English-speaking homes in a Spanish- or English-speaking environment, with Spanish and English, respectively, as their only language of exposure. Current L2 use was estimated by asking participants to choose from 5 L2-use intensity levels (0 = 0 %, 1 = 1-25 %, 2 = 26-50 %, 3 = 51-75 %, and 4 = 76-100 %) in 9 L2-use situations (e.g., conversations with friends, in internet chats, while shopping), which would produce an L2-use maximum score of 36, corresponding to an estimated L2 use of 76-100 %. Participants also self-evaluated their L2 proficiency on a 5-point scale (1 = very poor, 5 = very well) in speaking, understanding, reading and writing the L2, and we used the 4 ratings to compute an average score by participant. Motivation to learn or use the L2 was assessed through 9 statements (e.g., "I enjoy learning new words and new ways of saying things in English/Spanish") participants reacted to by selecting a level on a Likert-type agreement scale (1 = strongly agree, 9 = strongly disagree). A mean motivation score was obtained by averaging the scale score chosen for each statement. Vocabulary size was estimated through the Spanish and the English versions of a 120-item yes/no vocabulary size test (X_Lex, Meara and Milton 2003), which yields an estimate of receptive vocabulary size up to 5,000 words.
Overall, compared to the L2-English learners, L2-Spanish learners were younger, spoke the L2 less, had studied their L2 for less time and were slightly less motivated to learn their L2. However, both groups were comparable in terms of the age of onset of L2 learning, the age at which they started to use their L2, how long they had resided abroad and their self-reported level of L2 knowledge. Despite these similarities, L2-English learners had a significantly larger L2 vocabulary size than the L2-Spanish learners (t(38) = −3.143, p = 0.003), which might reflect a group difference in overall proficiency.

Attention control task
We developed a novel speeded set-switching task to measure attention control. In this task, test trials comprised 10 nasal-initial nonwords and 10 non-nasal-initial nonwords. All nonwords were disyllabic with a ˈCVCV structure. We chose to mainly use shared phonological categories for English and Spanish so as not to disadvantage one group over another by having to process too many unfamiliar sounds. Therefore, we created nonwords such as "saso", which could be pronounced distinctly in Spanish ([ˈsaso]) and English ([ˈsaesəʊ]). These 20 nonwords were recorded with English and Spanish phonetics by two female balanced early bilinguals who spoke Mexican Spanish and American English, so that voice identity could not be used to determine the stimulus language. These speakers reported having been raised speaking both Spanish and English in Spanish-English families; they had lived in both Spanish- and English-speaking environments (Mexico, Spain and the US) for extended periods of time, and they did not have a perceptually detectable Spanish accent when speaking English or an English accent when speaking Spanish.
The two phonological dimensions used, nasality and L1 phonetics, can be considered comparable in difficulty across both participant L1s. Nasality probes whether a stimulus-initial sound is a nasal sound (/n/ or /m/) or not, whereas L1 phonetics probes whether a stimulus is produced with L1 or L2 phonetics. We chose these two dimensions because they trigger a fast, automatic decision (phoneme or accent detection, respectively) which can only be based on the phonetic properties of the stimuli, since they were all phonotactically legal non-words in the participants' L1 and L2. For example, participants had to decide on the presence of a nasal resonance at the beginning of the word, such as [ˈnole] as opposed to [ˈsaso], or on the English-like diphthongal realization of a vowel, such as [ˈdoʊfeɪ] as opposed to Spanish-like monophthongal [ˈdofe] (Table 2).
Participants were asked to answer one of two possible questions: "Nasal?" versus "English?" (or "Nasal?" vs. "Spanish?" for L1-Spanish speakers) with respect to an auditory stimulus by pressing one of two assigned computer keys (yes or no). An experimental trial consisted of a fixation cross displayed for 500 ms, followed by the question (e.g., "Nasal?") displayed for 500 ms, followed by an auditory stimulus (e.g., [ˈnofe], spoken with Spanish phonetics). The task was administered with the software DMDX (Forster and Forster 2003). After a warm-up phase of 16 trials, and 8 practice trials on which feedback for accuracy (correct! or wrong) and speed (e.g., too slow! or 1800 ms) was provided, participants completed 82 trials.
Switch (S) trials, those showing a different question from the previous trial, alternated predictably with repeat (R) trials, those showing the same question as the previous trial, in SRSR sequences (Monsell 2003). Switch trials required participants to refocus their attention onto a different dimension and were expected to induce a switching cost, whereas repeat trials provided a baseline reaction time. The audio files were randomly ordered to match a SRSR sequence, resulting in two lists, one for each L1 with the only restriction that two "similar" tokens (e.g., /dofe/ spoken in Spanish and English) could not follow each other. Tokens from either voice were randomly assigned to a roughly equal number of items in each list.
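The predictable SRSR alternation can be illustrated with a short sketch. It assumes that each question is asked on two consecutive trials, so that every second trial changes the question; the function names and the six-trial example are illustrative only and do not reproduce the study's actual trial lists.

```python
def build_questions(n_trials, first="Nasal?", second="English?"):
    """Question order in which trials after the first alternate switch, repeat, ..."""
    pair = (first, second)
    # (i + 1) // 2 advances on every odd trial index, giving Q1, Q2, Q2, Q1, Q1, ...
    return [pair[((i + 1) // 2) % 2] for i in range(n_trials)]


def label_trials(questions):
    """Label each trial 'switch' or 'repeat' relative to the preceding trial."""
    if not questions:
        return []
    labels = ["first"]  # the opening trial has no predecessor to compare with
    for prev, cur in zip(questions, questions[1:]):
        labels.append("switch" if cur != prev else "repeat")
    return labels
```

Labelling trials from the question sequence itself, rather than hard-coding the labels, keeps the switch/repeat coding consistent with whatever list is actually presented.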

L2 perception: speeded ABX discrimination task
In the speeded ABX categorization task (e.g., Gottfried 1984), participants heard a sequence of three stimuli and had to identify the last stimulus (X) as either the same as A or B. The stimuli consisted of trisyllabic non-words in both Spanish and English with the structure CV.ˈCV.CV(C) (e.g., [faˈneða]). Physically different tokens produced by the two female early balanced bilinguals (Mexican Spanish and American English) were used in each trial: one voice for stimuli A and B and the other for X. Thus, learners had to correctly identify whether X contained the same vowel as item A or item B by comparing realizations of the same nonword produced by two different voices. If learners are able to correctly identify the target contrasting vowels or consonants as being the same in two items spoken by two different voices, we interpret this to indicate that learners have developed distinct phonetic category representations for the target L2 vowels or consonants at a pre-lexical phonological level. Therefore, a significant correlation between attention control and L2 learners' performance in the ABX task would indicate that learners with stronger attention control skills are more likely to have developed distinct phonetic categories for the difficult L2 sounds we targeted. All participants heard all Spanish and English stimuli, in two separate blocks. The L2 contrasts for L1-Spanish learners were L1 contrasts for the L1-English learners, and vice versa (Table 3). All of the L2 contrasts were deemed to pose learning difficulties for L2-Spanish (see Díaz and Simonet 2015, for /e/-/ei̯/; Rose 2010, for /d/-/ɾ/) and L2-English learners (see Morrison 2009, for /iː/-/ɪ/; Anrrich 2007, for /ʃ/-/ʧ/). In total, four nonword pairs per condition were tested; each pair was repeated in four combinations (ABA, ABB, BAA, and BAB), yielding a total of 128 trials, 64 for each stimulus language.
Trials were assigned to two blocks according to stimulus language (English-Spanish or vice-versa), and block order was counterbalanced across participants. Within each block, trials were randomized. If a participant made no response within 2500 ms, the next trial was initiated. The task was administered on a PC through headphones using the presentation software DMDX (Forster and Forster 2003), and took about 15 min to complete. We computed percent-correct accuracy scores to gauge L2 learners' ability to qualitatively distinguish the L2 sound contrasts in perception and response time (RT) scores to gauge L2 learners' efficiency at perceptually processing the L2 sound contrasts, which reflected the robustness of their phonological encoding. We expected group performance to be more accurate on L1 than L2 test contrasts and to be at ceiling for control contrasts, which would allow us to attribute differences in performance on the test condition to the L1 or L2 of the contrasts (rather than the stimulus language).
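The trial counts for the ABX task follow from crossing nonword pairs with the four presentation orders. The sketch below expands pairs into trials under that reading; the dictionary layout, the placeholder item labels and the "first"/"second" answer coding are assumptions made for illustration, and the figure of four conditions per stimulus language is inferred from the arithmetic (64 = 4 conditions × 4 pairs × 4 orders) rather than stated in the text.

```python
from itertools import product

ORDERS = ("ABA", "ABB", "BAA", "BAB")  # the four combinations named in the text


def abx_trials(pairs_by_condition):
    """Expand {condition: [(a, b), ...]} into one trial per pair and order."""
    trials = []
    for condition, pairs in pairs_by_condition.items():
        for (a, b), order in product(pairs, ORDERS):
            items = {"A": a, "B": b}
            stim1, stim2, x = (items[slot] for slot in order)
            trials.append({
                "condition": condition,
                "stim1": stim1,
                "stim2": stim2,
                "x": x,
                # X matches either the first or the second stimulus heard
                "answer": "first" if x == stim1 else "second",
            })
    return trials
```

Because two of the four orders have X matching the first stimulus and two the second, the expected answers are balanced within every pair.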

L2 production: delayed sentence-repetition task
In order to assess participants' production of the L2 contrasts tested perceptually in the ABX task we administered a delayed sentence repetition task (Trofimovich and Baker 2006) (see Appendix A). Participants performed 16 trials in the L2 in a recording booth. A trial consisted of the auditory and orthographic presentation of a question (prompt) and a following answer (response) 250 ms later, after which the prompt was presented again auditorily only (with a 500 ms delay) for the participant to repeat the previously heard response. The aim of this production task was to elicit the production of the same target sounds we tested in perception in a way that sound productions would reflect the nature of the phonetic categories learners have developed. We assumed that eliciting the target sounds in L2 words embedded in meaningful sentences presented in mini-dialogue format (prompt 1 → response 1; prompt 2 → response 2) where prompt 1 and response 1 are produced by different voices and prompt 2 intervenes between response 1 and the participants' response 2 (repeating response 1) would avoid participants focussing on the target word, which would facilitate eliciting the target vowels and consonants in a context that would closely resemble an L2 communicative context. We deemed this elicitation procedure would enhance the production of the target sounds in such a way that they would reflect the stage of development of the learners' L2. Although we cannot completely discard the possibility that participants would mimic the segmental content of the utterance, both the delay between the response and its repetition, and the intervening prompt between the response to be repeated and its repetition from memory, would minimize the possibility of direct mimicry as well as enhance attention to meaning rather than to segmental form (Trofimovich and Baker 2006). 
The stimuli in both languages were recorded by the two female balanced early bilinguals of Mexican Spanish and American English and were normalized for amplitude. In half of the prompt-response sets, one voice was used for the prompt token, and the other was used for the response tokens, and the reverse was done for the remaining sets.
We elicited four pairs of words for each of the two contrasts embedded in 16 response sentences in L2-Spanish (/e/-/ei̯/: maceta-aceite, pena-peina, reno-reino, vente-veinte; and /d/-/ɾ/: cada-cara, moda-moras, oda-oras, todos-toros) and the same in L2-English (/iː/-/ɪ/: cheap-chips, feet-fit, seat-sit, sheep-ship; and /ʃ/-/ʧ/: shake-cheque, sheep-cheap, shows-chose, shops-chops). The task took 5-7 min to complete. Vowel production accuracy measures were based on the size of the Euclidean distance between the contrastive vowels and were contrast-specific. A larger Euclidean distance between two contrastive vowels represented a larger qualitative distinction between them in production, which was interpreted as an indication of higher production accuracy in contrastiveness (Melnik-Leroy et al. 2022). For the L2-Spanish monophthong-diphthong contrast /e/-/ei̯/, three measurement points (MP) were placed 20 %, 50 % and 80 % into the vowels, and the mean values for F1, F2, and f0 were extracted from a 10 ms window centred at the three MPs. These frequency measures were first converted to Bark (B), and then a Bark-distance metric was computed by subtracting B0 from B1 (B1-B0) for tongue height and B1 from B2 (B2-B1) for degree of tongue fronting (Bohn and Flege 1990). We measured the amount of formant movement in the vowel by computing the Euclidean distance between the 20 % and the 50 % MPs and between the 50 % and the 80 % MPs. Then we added up the two Euclidean distances and used this spectral distance score as a measure of formant movement, as represented on the Bark-normalized vowel space. Higher formant movement indicates a diphthongized vowel (English-like), whereas lower movement corresponds to a more Spanish-like monophthong.
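The formant-movement score described above can be sketched in a few lines. The text does not say which Hz-to-Bark conversion was used, so the Traunmüller (1990) approximation below is an assumption, as are the function names and the example frequency values.

```python
import math


def hz_to_bark(f_hz):
    """Hz to Bark via the Traunmüller (1990) approximation (an assumed choice)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def vowel_point(f0, f1, f2):
    """Map (f0, F1, F2) in Hz to (B1-B0, B2-B1): tongue height and fronting."""
    b0, b1, b2 = (hz_to_bark(f) for f in (f0, f1, f2))
    return (b1 - b0, b2 - b1)


def formant_movement(mp20, mp50, mp80):
    """Sum of Euclidean distances 20 % -> 50 % and 50 % -> 80 %, in Bark."""
    p20, p50, p80 = (vowel_point(*mp) for mp in (mp20, mp50, mp80))
    return math.dist(p20, p50) + math.dist(p50, p80)


# Illustrative (f0, F1, F2) triples in Hz, not measured data.
steady = formant_movement((200, 500, 1900), (200, 500, 1900), (200, 500, 1900))
gliding = formant_movement((200, 500, 1900), (200, 450, 2100), (200, 350, 2300))
```

A vowel whose formants stay constant across the three measurement points yields a score of zero, while a gliding vowel yields a positive score that grows with the size of the glide.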
We also assessed whether the duration of the monophthong /e/ and the diphthong /ei̯/ were comparable across speaker groups by computing a duration difference score (in ms) and a duration difference ratio (e.g., /ei̯/ was 1.4 times longer than /e/) between /e/ and /ei̯/ that would index how well learners could distinguish the monophthong from the diphthong in production.
For the L2-English /iː/ versus /ɪ/ contrast, F1, F2 and f0 were extracted from a 15 ms window centred at the midpoint of the steady-state portion of the second formant of the vowel. The Euclidean distance between the contrasting vowels on a Bark-normalized vowel space was used as a measure of accuracy in qualitatively differentiating the two vowels, so that a larger distance was interpreted as a more English-like distinction between the vowels. Because Spanish learners of English have been shown to also rely on duration cues in distinguishing this tense-lax vowel contrast, unlike L1 English speakers who rely primarily on spectral cues (Escudero and Boersma 2004), a duration difference score (in ms) and a duration difference ratio (e.g., /iː/ was 1.1 times longer than /ɪ/) between /iː/ and /ɪ/ were computed as a measure of accuracy in quantitatively differentiating the two vowels.
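The spectral and durational measures for a vowel pair can be summarized as follows. This is a minimal sketch assuming that each vowel has already been reduced to a mean (B1-B0, B2-B1) point in Bark and a mean duration in milliseconds; the function names and numeric values are illustrative only.

```python
import math


def spectral_distance(point_a, point_b):
    """Euclidean distance between two vowels given as (B1-B0, B2-B1) points."""
    return math.dist(point_a, point_b)


def duration_metrics(dur_long_ms, dur_short_ms):
    """Return (difference in ms, ratio) between two mean vowel durations."""
    return dur_long_ms - dur_short_ms, dur_long_ms / dur_short_ms
```

For instance, mean durations of 110 ms and 100 ms give a difference of 10 ms and a ratio of 1.1, matching the form of the example in the text.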
For all consonant contrasts, production accuracy was measured categorically (score 0-8) by visually and auditorily inspecting the spectrograms. For the L2-Spanish /d/-/ɾ/ contrast an accurate realization of Spanish intervocalic /d/ was identified as a spirantized [ð], whereas accurate realizations of intervocalic /ɾ/ had to consist of a single-closure tap with very short constriction duration. For the L2-English /ʃ/-/ʧ/ contrast realizations had to be palato-alveolar and show presence (/ʧ/) or absence (/ʃ/) of a closure phase in the spectrogram.

Attention control
As the descriptives in Table 4 below show, the attention switching task worked as expected in that both groups responded faster to repeat than to switch trials, regardless of question type. This indicates that switching dimensions (L1 or nasality) had a response time cost in milliseconds based on which a switching cost score (the difference between switch and repeat RTs) can be obtained as a measure of attention control.
A mixed-effects model was fitted to the response speed data for correct responses (in SPSS 25) with the factors L2-group (Spanish, English), trial type (switch, repeat), dimension (L1, nasal) and stimulus language (L1, L2) and their interactions as fixed effects. The random effects structure that did not lead to a convergence error and provided a better fit of the data to the mixed-effects model (i.e. the lowest Akaike's information criterion AIC) included random intercepts for subject and item, and a random slope for dimension by subject. The significance threshold was set at p = 0.05 and in pairwise contrasts adjusted via sequential Bonferroni for all analyses. The visual analysis of residuals confirmed that the model was a satisfactory fit for the data structure. The parameter estimates are presented in Appendix B-1. These analyses revealed significant main effects of trial type (F(1, 2816) = 27.4, p < 0.001) and dimension (F(1, 2816) = 8.42, p = 0.004). The main effect of L2-group did not reach significance (F(1, 2816) = 2.16, p = 0.141), suggesting that both groups did not differ significantly from one another in overall response speed. The L2-group × trial type interaction reached significance (F(1, 2816) = 9.07, p = 0.003) because L2-Spanish learners were overall faster than L2-English learners on both repeat and switch trials (a difference that according to pairwise contrasts approached significance for repeat trials, but not for switch trials: t(2816) = −1.93, p = 0.054 and t(2816) = −0.976, p = 0.329, respectively). Crucially, both groups were significantly slower on switch than repeat trials (L2-Spanish: t(2816) = 6.00, p < 0.001; L2-English: t(2816) = 2.84, p = 0.005) and no other interaction involving trial type reached significance, suggesting that participants were slower on switch than repeat trials irrespective of dimension and stimulus language. 
The L2-group × stimulus language interaction reached significance because, whereas L2-English learners responded slightly more slowly to L2 stimuli than L1 stimuli (t(2816) = −4.38, p < 0.001), L2-Spanish learners did not (t(2816) = 0.974, p = 0.330). All other interactions turned out to be non-significant (all Fs < 1.3, all ps > 0.24), suggesting that both groups of learners responded faster on repeat than on switch trials for both dimensions and responded faster to the nasality dimension than to the L1 dimension regardless of trial type (see Table 4).
The dimensions L1 and Nasal are therefore not equivalent in terms of RT. While this could be due to the position of the elements in the stimuli permitting the decision (initial for Nasal but anywhere for L1, yielding a faster RT in the first case), it could also be due to the dimensions differing in the complexity of the acoustic cues listeners had to process. Nasality cues are clear and well-defined, whereas cues indicating L1 versus L2 are multiple and therefore likely more complex to process. An analysis of response accuracy allows teasing apart this question. If the RT difference is due to a difference in cue complexity, accuracy scores might also be lower for the L1 dimension (more complex) than for the nasality dimension (less complex), indicating that both dimensions would not be fully comparable in terms of processing difficulty and cognitive complexity.
A mixed-effects model (binary logistic regression with a binomial distribution) was fitted to the response accuracy data with the factors L2-group (Spanish, English), trial type (switch, repeat), dimension (L1, nasal) and stimulus language (L1, L2) and their two-way interactions as fixed effects, with random intercepts for subject and item (see parameter estimates in Appendix B-2). The outcome of these analyses revealed a significant main effect of L2-group (F(1, 3228) = 4.43, p = 0.036) because overall L2-Spanish learners were significantly more accurate than L2-English learners were (0.92 vs. 0.87), in addition to being faster (see above). The main effects of trial type (F(1, 3228) = 14.4, p < 0.001) and dimension (F(1, 3228) = 36.8, p < 0.001) were significant, while neither the effect of stimulus language (F(1, 3228) = 1.56, p = 0.211) nor any of the interactions reached significance (all Fs < 2.0, all ps > 0.16). Thus, the accuracy data follow the pattern of results for response speed; participants were significantly more accurate on repeat than switch trials for both questions (L1 and Nasality) and were significantly more accurate when responding to the nasality dimension than to the L1 dimension on both switch and repeat trial types (see Table 4).
These results suggest that switch costs were of a smaller magnitude in the nasal dimension than in the L1 dimension because nasality was an easier-to-process acoustic cue than L1/L2 phonetics. Therefore, the significant RT difference between the two dimensions was due not only to the position of the cues, but also to a difference in complexity between dimensions. Consequently, because averaging RTs from the two dimensions might hide potential effects, we opted for computing two separate switch cost scores for each participant, one for each dimension, defined as the RT difference between switch and repeat trials, separately for the nasality and L1 conditions. We also used averaged RTs on switch trials only (separately for the nasality and L1 dimensions) as a general individual differences measure of how quickly L2 learners could re-focus their attention on a different dimension.
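The two per-participant attention scores (switch cost and mean switch-trial RT, each computed separately per dimension) can be sketched as follows. This is an illustrative sketch only: the trial record layout is hypothetical, and restricting the computation to correct responses is an assumption carried over from the RT model described earlier.

```python
from collections import defaultdict
from statistics import mean

def attention_scores(trials):
    """Per participant and dimension: mean switch RT and switch cost (ms).

    Each trial is a dict with the hypothetical keys
    participant, dimension, trial_type, rt_ms and correct.
    """
    cells = defaultdict(list)
    for t in trials:
        if t["correct"]:  # assumption: only correct responses contribute
            key = (t["participant"], t["dimension"], t["trial_type"])
            cells[key].append(t["rt_ms"])

    scores = {}
    for pid, dim in {(p, d) for (p, d, _) in cells}:
        switch = cells.get((pid, dim, "switch"))
        repeat = cells.get((pid, dim, "repeat"))
        if switch and repeat:
            scores[(pid, dim)] = {
                "switch_rt": mean(switch),
                "switch_cost": mean(switch) - mean(repeat),
            }
    return scores

# Hypothetical trials for one participant in the nasality dimension.
demo_trials = [
    {"participant": "p1", "dimension": "nasal", "trial_type": "switch", "rt_ms": 900, "correct": True},
    {"participant": "p1", "dimension": "nasal", "trial_type": "switch", "rt_ms": 1000, "correct": True},
    {"participant": "p1", "dimension": "nasal", "trial_type": "switch", "rt_ms": 5000, "correct": False},
    {"participant": "p1", "dimension": "nasal", "trial_type": "repeat", "rt_ms": 800, "correct": True},
]
demo_scores = attention_scores(demo_trials)
print(demo_scores)
```

A smaller switch cost indicates more efficient switching; the mean switch RT indexes how quickly attention is re-focused on the new dimension.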

L2 perception: ABX discrimination
The results of the discrimination task (as shown in Figure 1) indicate that, as expected, participants performed at ceiling on L1 and L2 control contrasts and on L1 test contrasts, whereas they showed more difficulty in correctly discriminating L2 test contrasts (/iː/-/ɪ/ and /ʃ/-/ʧ/ for L2-English learners, /e/-/ei̯ / and /d/-/ɾ/ for L2-Spanish learners; see Figure 1 and Table 6). An exception to this is the high accuracy rate L2-English learners obtained for /ʃ/-/ʧ/ (M = 92, CI = 88-95). We therefore decided to use the L2 vowel test scores as an index of individual performance in discrimination.
Mixed-effects models were fitted to the accuracy (binary logistic regression with a binomial distribution) and RT data with L2-group (Spanish, English), stimulus language (English, Spanish), and condition (control vs. test) and their interactions as fixed effects, with random intercepts for subject and item. The parameter estimates for each model (accuracy and RT) are presented in Appendices B-3 and B-4.
L2-Group effects, as predicted, emerged for the test condition because both groups, overall, were less accurate when listening to L2 stimuli than L1 stimuli. L2-Spanish learners were more accurate with English test contrasts (M = 0.967), and less so with Spanish test contrasts (M = 0.803), a significant (t(5112) = 5.79, p < 0.001) mean difference of 0.164 (CI = 0.108-0.220). Conversely, L2-English learners were more accurate when listening to Spanish test stimuli (M = 0.906) compared to English test stimuli (M = 0.830), a significant mean difference of 0.076 (CI = 0.027-0.125; t(5112) = 3.02, p = 0.003). For the control condition, both L2 groups performed equally well on Spanish stimuli (Mean difference: 0.019; t(5112) = 1.69, p = 0.092), but on the English stimuli the L2-English learners were slightly less accurate (0.956) than the L2-Spanish learners (0.983; t(5112) = 2.44, p = 0.015). To sum up, none of the control contrasts can be said to have posed perceptual difficulties to learners, all accuracy rates being M > 0.956, whereas when performing on L2 test contrasts participants did show perception difficulties (see Table 5).
For RTs, tests of fixed effects revealed a significant main effect of stimulus language (F(1, 4667) = 14.8, p < 0.001), and significant L2-group × stimulus language (F(1, 4667) = 152.4, p < 0.001) and L2-Group × condition × stimulus language (F(1, 4667) = 18.2, p < 0.001) interactions, but neither the overall effect of L2-group (F(1, 4667) = 2.16, p = 0.142) nor that of condition (F(1, 4667) = 0.05, p = 0.822) reached significance. These effects overall partly parallel the accuracy data. When splitting the data set by condition we found that although the main effects of L2-group (F(1, 2204) = 2.00, p = 0.157) and stimulus language (F(1, 2204) = 0.129, p = 0.719) did not reach significance in the test condition, their interaction did (F(1, 2556) = 118.5, p < 0.001). As expected this interaction arose because both the L2-Spanish learners (t(2204) = −118.4, p < 0.001) and the L2-English learners (t(2204) = 130.1, p < 0.001) were more efficient (i.e., they obtained faster RTs) when processing L1 than L2 stimuli on the test condition. However, whereas in the control condition the main effect of L2-group did not reach significance (F(1, 2463) = 2.29, p = 0.130), both the main effect of stimulus language (F(1, 2463) = 28.1, p < 0.001) and the L2-group × stimulus language interaction did (F(1, 2463) = 38.2, p < 0.001). This is because whereas L2-Spanish learners were equally efficient on English and Spanish control stimuli (t(2463) = 1.04, p = 0.299), L2-English learners' RTs were slower on English than on Spanish stimuli (t(2463) = 7.91, p < 0.001), in accordance with their slightly lower accuracy on this condition. This might be attributed to the large RT variability of the L2-English group on the control condition. We next examined accuracy rates for the 8 phonological contrasts separately. 
A mixed-effects model (binary logistic regression with a binomial distribution) was fitted to the accuracy data (Table 5) with the factors L2-group (Spanish, English) and contrast and their interaction as fixed effects, random intercepts for subject and item, and a random slope for contrast by subject (see Appendix B-5 for parameter estimates).
Except for the /ʃ/-/ʧ/ contrast, the results of the ABX task indicate specific perception difficulties with the L2 test contrasts for both learner groups. The amount of variability in L2 learners' ability to discriminate L2 contrasts indicated by the 95 % CIs in Table 5 (L2-Spanish: /e/-/ei̯ / = 0.746-0.894, /d/-/ɾ/ = 0.681-0.856; L2-English: /i:/-/ɪ/ = 0.610-0.803) suggests that their performance on the L2 vowel contrasts (/e/-/ei̯ / for L2-Spanish learners and /i:/-/ɪ/ for L2-English learners) can be used as a valid index of individual differences in L2 perception. We consequently opted for using the vowel accuracy scores as a measure of performance accuracy in L2 speech perception for the individual differences analyses.

L2 production: delayed sentence repetition
In general, the production data show the expected pattern of results, with L2 learners obtaining lower accuracy and greater variability in scores than the L1 speaker controls did (Table 6).
Because we computed individual vowel duration and vowel quality scores per speaker based on four productions of each of the contrasting vowels per language (L1 and L2), and because the vowel contrasts were quantitatively and qualitatively different for L2-English (/i:/-/ɪ/) and L2-Spanish learners (/e/-/ei̯ /), they were not directly comparable. Therefore, we assessed vowel production accuracy separately for each of the two L2 learner groups.
In a second set of analyses, we assessed whether L2 learners could qualitatively and quantitatively distinguish the vowels in the target vowel contrasts (/e/-/ei̯/ for L2-Spanish learners; /iː/-/ɪ/ for L2-English learners) to the extent that L1 speakers did. For L2-Spanish learners, we computed the difference in amount of tongue movement between /ei̯/ and /e/ (Euclidean distances in Bark; see Table 6) and submitted it to an independent samples t-test, which showed that L1-Spanish speakers produced a significantly larger difference in formant movement between /e/ and /ei̯/ than L2-Spanish learners did (t(23) = −6.44, p < 0.001). We did the same for the duration ratio measure, and found L1-Spanish speakers to produce a significantly larger duration ratio for /e/-/ei̯/ than L2-Spanish learners did (t(23) = −2.71, p = 0.012). For L2-English learners, we computed the Euclidean distance between /iː/ and /ɪ/ in Bark (see Table 6). An independent samples t-test showed that L1-English speakers produced a significantly larger distinction in quality (i.e. a larger Euclidean distance) between /iː/ and /ɪ/ than L2-English learners did (t(26) = −14.6, p < 0.001), but L2 learners did not differ from L1 speakers on the duration ratio measure for /iː/-/ɪ/ (t(26) = −1.54, p = 0.135).

Relationship between attention and L2 perception and production
The main goal of the current study is to explore the relationship between individual differences in attention control and L2 phonological processing for the perception (categorical discrimination) and production (delayed sentence repetition) of difficult L2 phonological contrasts in two groups of L2 learners: Spanish learners of English (/iː/-/ɪ/, /ʧ/-/ʃ/) and English learners of Spanish (/e/-/ei̯/, /d/-/ɾ/). However, Spanish and English learners' performance on the consonant contrasts was not comparable because L2-English learners had no difficulty with the English /ʧ/-/ʃ/ contrast in either perception or production. This is likely due to the presence of a [ʃ] variant of the Spanish phoneme /ʧ/ in Andalusian Spanish (Regan 2020) coexisting with standard [ʧ] in the location where data was collected (Seville). Therefore, we gauged L2 phonological processing in perception and production through the vowel data only (/iː/-/ɪ/ for L2-English learners and /e/-/ei̯/ for L2-Spanish learners). In perception and in production, these two contrasts revealed substantial performance variation across learners. In perception we used the ABX discrimination accuracy scores as a measure of L2 learners' ability to perceptually distinguish between /iː/ and /ɪ/ (L2-English learners) or between /e/ and /ei̯/ (L2-Spanish learners) and the ABX discrimination RT scores as a measure of processing speed of the quality difference between the target vowels. Faster RTs were deemed to reflect a more robust encoding of the target phonological contrast. In production, we used a unified measure of Bark-normalized spectral distances between L2 vowels to estimate the accuracy of the vowel quality contrast, and we used duration ratios between L2 vowels to estimate the accuracy of the vowel quantity contrast.
That is, we assess L2 English learners' ability to qualitatively and quantitatively distinguish /iː/ from /ɪ/ and L2 Spanish learners' ability to qualitatively and quantitatively distinguish /e/ from /ei̯ / (see Table 7). However, spectral distances and duration ratios between L2-English /i:/ and /ɪ/ and L2-Spanish /e/ and /ei̯ / may not be directly comparable. In fact, the magnitude of the spectral distance for the L2-English monophthongal vowel contrast (/iː/-/ɪ/, M = 1.18 Bark) was much larger than that of the L2-Spanish monophthong-diphthong contrast (/e/-/ei̯ /, M = 0.61 Bark). Therefore, to make production accuracy measures comparable for all L2 learners, we computed individual z-scores of vowel production accuracy (spectral distances and duration ratios) based on the L1 speakers' means and standard deviations of the learners' corresponding L2. The relationship between attention control and phonological processing measures was explored for all L2 learners by using their attention control switch trial and switch cost RT scores (separately by the nasality and L1 phonetics dimensions) and the phonological processing scores of ABX accuracy and speed for perception and normalized spectral distance and duration ratio z-scores for production (see Table 8).
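The normalization step described above, in which learner scores are expressed relative to the L1 speakers of the corresponding language, can be sketched as below. The function name and the data are hypothetical, and the use of the sample standard deviation (n − 1) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev

def z_relative_to_l1(learner_scores, l1_scores):
    """z-score each learner value against the L1 reference group's mean and SD."""
    mu = mean(l1_scores)
    sd = stdev(l1_scores)  # assumption: sample SD (n - 1 denominator)
    return [(x - mu) / sd for x in learner_scores]

# Hypothetical spectral distances in Bark.
l1_reference = [1.0, 2.0, 3.0]
learners = [1.0, 2.0]

z_scores = z_relative_to_l1(learners, l1_reference)
print(z_scores)
```

Because each learner group is scaled by its own target-language reference, the resulting z-scores can be pooled across the two groups even though the raw contrasts differ in magnitude.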
As the receptive vocabulary size of the L2-English learners was significantly larger than that of the L2-Spanish learners and this might be indicative of a between-groups difference in L2 proficiency (Uchihara and Clenton 2020), which might have affected the L2 phonological processing measures, we first examined whether vocabulary size was associated with any of the attention and L2 phonological processing measures. Shapiro-Wilk tests of normality (and visual inspection of histograms and Q-Q plots) indicated that proficiency, attention and L2 phonological processing scores were normally distributed (all p > 0.05, Ws > 0.95), except for the spectral distance score (W(40) = 0.943, p = 0.045). We therefore used Spearman's rho for correlational analyses involving this variable, and Pearson's r correlation coefficients for all other variables. Vocabulary size was not associated with L2 vowel perception (ABX), production (spectral distances), or our attention measures, so we did not include it as a covariate in the correlations.
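To make the difference between the two coefficients concrete, the sketch below implements Pearson's r and Spearman's rho (Pearson's r computed on average ranks) in plain Python. It is an illustration with hypothetical helper names and data, not the software used in the study; in practice a statistics package would also supply the normality test and the p-values.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as Pearson's r on the rank-transformed data."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but non-linear relationship.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(pearson_r(xs, ys), spearman_rho(xs, ys))
```

Because rho depends only on rank order, it is less sensitive to departures from normality, which motivates its use for the non-normally distributed spectral distance score.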
Interestingly, ABX accuracy was significantly related to spectral distances (r_s = 0.421, p = 0.007), indicating an association between L2 learners' ability to distinguish between the contrasting vowels perceptually and their ability to produce a quality distinction between them in production. Although our perception task (categorical ABX discrimination of nonword items) taps into a pre-lexical phonological level of processing and our production task taps into a lexical semantic level of processing (elicitation of L2 words embedded in meaningful sentences), and are therefore not equivalent, we interpret this association to suggest that L2 learners who had developed more robust phonetic representations for the contrasting L2 sounds could also make a larger quality distinction between them in production. Previous research has found perception to be more closely related to production within rather than across pre-lexical and lexical processing levels (Melnik-Leroy et al. 2022), but even within a pre-lexical processing level employing equivalent tasks in perception and production, a relationship between the two is not always attested (e.g., Kartushina et al. 2022; see Kartushina et al. 2022; Kato and Baese-Berk 2020; Melnik-Leroy et al. 2022; Nagle and Baese-Berk 2022, for discussion on the relationship between perception and production modalities). Significant medium-strength correlations were found between L2 learners' speed in adjusting to a new dimension in the attention switching task and ABX discrimination speed, suggesting that individual differences in speed of processing underlie performance in both tasks. That is, deciding on the presence of a nasal resonance or L1 phonetics in a context of switching dimensions may require the same underlying processing skills as deciding on the identity of the vowel in an ABX trial where the target vowel could randomly appear in position A or B of the triad.
This was corroborated by the fact that RTs on the repeat trials in the attention switching task also correlated significantly with ABX speed (L1: r = 0.547, p < 0.001; nasality: r = 0.554, p < 0.001). However, none of the attention measures were significantly associated with ABX accuracy, suggesting that attention control did not explain variance in how accurately L2 learners perceived the target vowel contrasts. A significant, though weak, correlation emerged between the switch cost measure (in the nasality condition), a measure of attentional flexibility, and the spectral distance score, indicating a weak tendency for L2 learners with stronger attention control to be better able to qualitatively distinguish the target L2 vowels in production. Significance tests for each measure were adjusted for multiple comparisons using Benjamini and Hochberg's (1995) False Discovery Rate procedure, at the 0.05 level for 5 simultaneous comparisons (p-values). For ABX speed, both significant correlations remain so after correction (the new significance threshold being 0.01 after FDR correction). For spectral distance, the correction places the p-value of the correlation between switch cost (nasality) and z-score above the significance threshold of 0.01.
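The Benjamini-Hochberg step-up procedure referred to above can be sketched in a few lines. The p-values below are hypothetical and serve only to show how the rank-dependent thresholds work; with five comparisons at the 0.05 level, the smallest p-value is compared against 0.05 × 1/5 = 0.01.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k such that p_(k) <= (k / m) * q,
    then reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for five simultaneous correlations.
demo_p = [0.001, 0.008, 0.039, 0.041, 0.6]
print(benjamini_hochberg(demo_p))
```

In this invented example only the two smallest p-values survive, because the third (0.039) exceeds its rank-specific threshold of 0.03 and no larger p-value meets its own threshold.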

Discussion and conclusions
We set out to explore the connection between individual differences in attention control (attention switching skill) and L2 phonological processing in perception and production. We conceptualized attention switching in terms of a cognitive skill functioning as a built-in "cue enhancement device" during L2 phonological processing that allows learners to efficiently extract the relevant language-specific segmental phonetic features of L2 sounds while bringing others to the perceptual background. We assessed individual differences in attention control through a novel speech-based task (adapted from the task switching paradigm) that required learners to switch their focus of attention between segmental speech dimensions: nasality (nasal vs. non-nasal) and language-specific phonetics (L1 vs. L2); we also measured learners' phonological processing in perception and production. We expected L2 learners with stronger attention switching skill to have developed more accurate L2 phonological representations during L2 learning (irrespective of learning history or target L2) based on their enhanced ability to attend to and extract the relevant segmental phonetic properties of the L2 sound system. Increased accuracy of L2 sound representation (L2 vowels) was expected to result in increased perceptual discrimination ability for difficult L2 sound contrasts (higher ABX discrimination scores) and increased ability to qualitatively distinguish contrasting sounds in production (larger spectral distance scores between L2 vowels). Attention switching scores were related to processing speed in the perceptual discrimination task, and to the spectral distance between vowels in production. This suggests that attention switching skill plays a role in L2 phonological processing, albeit a modest one.
Our findings clearly indicate that those learners who were more efficient (i.e., faster) at focussing their attention on a given speech dimension in the attention task were also faster at discriminating the target vowel contrasts. Contrary to our expectations, however, we did not find an association between this attention switching measure and discrimination accuracy. The association between response speed in the attention and discrimination tasks might simply be due to the potential consistency of individual differences in working memory or phonological memory capacity (which we did not test in the current study) across the two tasks, or to the similar speech processing requirements of both tasks (deciding on the quality of a phonetic feature in a context of switching speech dimensions is similar to deciding on the identity of a vowel with respect to contrasting vowels that randomly switch positions in a triad), or both. The lack of relationship between attention switching skill and discrimination accuracy suggests that individual differences in attention switching, even when assessed through a speech-based task, do not predict L2 sound discrimination skills at intermediate proficiency levels. Further research needs to confirm whether this is indeed the case. For example, one study using speech-based attention control tasks found auditory selective attention (but not attention switching skill or auditory inhibition) to be related to L2 learners' gains in ABX discrimination accuracy after high-variability phonetic training (Mora and Mora-Plaza 2019). That study is not directly comparable to ours, though, since our task does not measure learning gains, but rather the state of phonological knowledge after it has been learnt. It is also plausible that attention skills, together with other sources of individual differences in L2 phonological processing, such as inhibition (Darcy et al. 2016) or general auditory processing skills (Saito et al. 2019, 2020, 2021), contribute to L2 phonological development to a different extent at different stages of acquisition, playing a larger role at initial stages (or during initial learning) than at the intermediate proficiency level we targeted in the current study.
Interestingly, discrimination accuracy (which was unrelated to attention switching) was significantly related to spectral distances between the same contrasting vowels in production. These spectral distance scores were in turn related to attentional flexibility (albeit weakly), suggesting that attention switching skill may be more directly implicated in speech production than in speech perception. One way to explain these seemingly diverging findings is to consider the nature of the tasks we used to measure L2 perception and production. The perception task used nonwords, which likely enhanced a phonetic processing mode and made the processing of acoustic differences between the contrasting A and B items in the ABX trials easier than if the target vowels had been embedded in confusable lexical minimal-pair words (Ortega et al. 2021; Thomson and Derwing 2016). Because the task does not involve meaning and is lower in cognitive complexity, it may not allow differences in attention skill to influence performance because it reflects underlying phonological knowledge which has already been established in the past. While differences in attention may impact the way in which phonological representations are established at the time of learning (e.g. speed of learning or initial precision), they do not interact with the outcome during testing, which measures underlying phonological knowledge (suggesting that tasks such as the ABX are indeed good tasks to measure underlying phonological knowledge independently of attentional abilities). By comparison, the production task made use of L2 lexical words embedded in meaningful sentences that had to be repeated after a delay and intervening speech material (a prompt) that forced learners to repeat them from memory after having processed their meaning.
The meaning-focused nature and the higher cognitive complexity of this task might have allowed individual differences in attention control to play a role, due to this task's increased attentional demands compared to ABX discrimination. As a result, L2 learners with efficient attention skills might have been better able to pay attention to the quality of the target sounds in the words that made up the sentences during the production task. Given the nature of this task and its primary focus on meaning during repetition from memory, it is uncertain whether learners could in fact focus on the phonetic features of the target difficult L2 sounds.
Finally, it is important to acknowledge that our vowel production accuracy measures are mainly based on a measure of distinctiveness, rather than one of "nativelikeness" or how close their vowel production was to that of L1 speakers. Thus, although our L2 learners' ability to produce a larger acoustic distance between two contrastive vowels was interpreted as an indication of a larger qualitative distinction between them in production (as in recent research, e.g. Melnik-Leroy et al. 2022) and, by hypothesis, an indication of higher production accuracy, a more "accurate" production does not necessarily imply a more target-like quality in production. The potential difference between a measure of contrastiveness and one based on distance from L1 speakers' productions in indexing L2 learners' development of the target L2 vowel representations was partly overcome by using L1 speakers' means and standard deviations of the learners' corresponding L2 to compute the z-scores of vowel production accuracy. However, it is uncertain to what extent a measure of accuracy computed as the Euclidean distance between L2 learners' productions and those of L1 speakers might have yielded comparable results as regards the role of attention control in predicting L2 vowel production accuracy.
In this study, we examined the extent to which attention switching skill is associated with L2 phonological processing in perception and production. Future research investigating the role of attention control in L2 phonological acquisition would benefit from exploring additional attentional skills such as auditory selective attention and from doing so with learners at different proficiency levels and within longitudinal research designs, both through lab- and classroom-based studies.