The associations between working memory and the effects of multimedia input on L2 vocabulary learning

The efficient use of working memory (WM) increases the potential of a learner’s cognitive abilities in learning through multimedia. The present study aims to explore the role of working memory in vocabulary learning through multimedia input. In particular, we explore the possible associations between two components of WM – executive WM and phonological short-term memory (PSTM) – and the effects of three types of input conditions (Definition + Word information + Video, Definition + Word information, and Definition) on second language (L2) vocabulary learning. A total of 95 students completed learning under the three conditions and took two WM tests: a reading span test, which measures complex executive WM, and a non-word span test, which gauges PSTM. We administered a vocabulary knowledge test, which included receptive and productive vocabulary knowledge, immediately and after two weeks. Our findings, based on repeated-measures analysis of covariance (ANCOVA), support the pronounced effects of the Definition + Word information + Video condition in vocabulary learning and retention, as well as the significant role of complex and phonological WM in vocabulary learning and retention under the three conditions. Theoretical and pedagogical implications concerning the role of WM in vocabulary learning through multimedia input are discussed.


Introduction
Vocabulary knowledge is of great importance in foreign language (FL) and second language (L2) teaching and learning. Vocabulary knowledge is multifaceted in nature (Milton 2013). Nation (1990) and Schmitt (2014) categorized vocabulary knowledge into two dichotomous aspects: receptive and productive vocabulary knowledge. The former requires learners to recognize word form and understand word meaning. The latter refers to learners' ability to correctly express word meaning and appropriately use it in certain contexts (Laufer 1998; Laufer and Paribakht 1998). Given the multifaceted nature of vocabulary knowledge, language teachers have faced challenges in making vocabulary instruction effective. Multimedia programs in the form of text, pictures, videos, and sound may help teachers deal with this challenge.
According to Paivio's (1990) dual-coding theory, visual and verbal aids, which come in the form of multimedia presentations, initiate, stimulate, and reinforce learning sensors. Owing to the recent boom in multimedia technologies, researchers have attempted to make vocabulary instruction more effective by integrating online visual and verbal aids into teaching and learning (Boers et al. 2017; Ramezanali and Faez 2019; Yanguas 2009; Yoshii and Flaitz 2002). Although visual and verbal aids can be part of traditional vocabulary instruction, the provision of online resources or technological aids may help learners acquire new word knowledge and develop strategies to enable them to take control of their learning and thus increase the depth of that knowledge (Teng 2018). In addition, understanding a word involves much more than knowing its definition, and simply memorizing a definition does not guarantee the ability to use a word in reading or writing. The use of multimedia technologies may benefit word knowledge acquisition (Teng 2021).
Based on the cognitive theory of multimedia learning (Mayer 2001), learners store incoming information in sensory memory and then select and transfer relevant visual and auditory information to modality-specific subsystems of working memory (WM), where it can be maintained and processed. Each of these modality-specific WM stores is limited in capacity. Learners need to construct incoming information in WM before integrating the information with their prior knowledge. Learning achievement takes place when such integration has occurred. According to Mayer (1997, 2001), WM, which involves conscious awareness, is vital for holding and manipulating multimedia input and contributing to knowledge acquisition. WM overload during processing may influence learners' perceptions or interpretations of multimedia input, thus affecting their vocabulary learning outcomes through multimedia input. It is therefore worthwhile to explore how WM affects multimedia learning (Schüler et al. 2011), particularly in the context of L2 vocabulary learning. WM, a cognitive device for online information retrieval and processing, is necessarily implicated in a process of coordinating cognitive and linguistic resources for multimedia learning (Mayer 2001). Such a process may impose cognitive burden and pressure on learners' WM resources when processing information for vocabulary learning from multimedia input. Related to this, it is essential to explore the role of WM in vocabulary learning through multimedia input. However, this issue has been neglected in the context of English education in China. Despite China's educational reforms (which have been underway for a number of years) that are designed to facilitate efforts to develop learners' communicative competence, English teaching in China is still dominated by grammar-translation teaching approaches, a situation which arises in large part from the exam-orientated nature of the Chinese education system (Rao 2013).
Under exam pressure, teachers and learners may depend on grammar-translation teaching approaches for intentional learning of word form and meaning. Multimedia technologies, which provide learners with different vocabulary learning resources, including websites, apps, and online learning platforms, have received less attention in the Chinese context. It is thus meaningful to use multimedia resources, which involve texts, audio recordings, pictures and videos, to foster learners' acquisition of different aspects of vocabulary knowledge (Ramezanali and Faez 2019).
Therefore, vocabulary learning from multimedia input requires the coordination of cognitive and linguistic resources. Learners with different WM capacities could be expected to execute and orchestrate these processes with different degrees of efficacy, and thus vary in how they benefit from multimedia input. Despite the acknowledged role of WM in vocabulary learning, previous studies have not explored the effects of multimedia input on vocabulary learning and retention from the perspective of learners' WM in the Chinese context. We have attempted to fill this gap by examining whether two different components of WM – phonological short-term memory (PSTM) and complex WM – are associated with the impact of three different input conditions (Definition + Word information + Video, Definition + Word information, and Definition) on vocabulary learning. The findings have implications for both theoretical understanding and pedagogical practice.

Literature review

Multimedia input in L2 vocabulary learning: cognitive theory of multimedia learning
It is necessary to examine relevant multimedia learning theories for an understanding of L2 vocabulary learning through multimedia inputs. According to Paivio's (1972, 1986, 1990) dual-coding theory, humans, who possess various sensory modalities, can process information through two channels. One channel is responsible for verbal inputs from speech or writing, while the other channel is responsible for processing non-verbal information from images (Sadoski and Paivio 2001). From a cognitive perspective, multimedia learning features two information processing subsystems: verbal and visual information (Paivio 1972, 1986, 1990). Information received through the two channels may deepen learners' understanding of different types of information for L2 vocabulary learning. On this basis, Mayer (1997, 2001) proposed the well-known cognitive theory of multimedia learning (Figure 1), which contains three important constructs that may influence learning. The first refers to "dual channels," reflecting similar ideas of dual-coding theory (Mayer 2001; Paivio 1972, 1986, 1990). The second, called "limited capacity," indicates that learners can only process a limited amount of either visual or verbal information in WM. The third highlights that learners are active in constructing knowledge, including (1) selecting relevant information, (2) organizing information, and (3) integrating information with prior knowledge (Mayer 2001). From the explanations of the three key constructs, we know that information processing is enhanced by multimedia input. As Figure 1 shows, learners first notice multimedia information before processing verbal and visual information via their ears and eyes in sensory memory. After that, the selected information comes into learners' WM.
The information is then organized and becomes coherent verbal/pictorial models. Finally, the information, together with learners' previous knowledge, is connected, integrated, and stored in their long-term memory.
In Figure 2, the colored boxes illustrate how input is cognitively processed only through textual input. When learners are provided with textual input such as word definitions and example sentences, they use their eyes to receive the information and then bring that information into their sensory memory. After that, they select input and send it to their WM, formed as verbal models. Based upon learners' prior knowledge, information is actively organized and integrated (Mayer 1997, 2001).
As Figure 3 shows, the processing of multimedia (i.e., textual, pictorial, and auditory) input is more complex. For example, when learners are given both verbal and visual input (e.g., text and video), both their ears and eyes should be used to receive the information before delivering it to their sensory memory. Then, the two types of information are selected and transferred to learners' WM. The sounds and images in WM are not separate. The information mutually interacts and is converted, and then stored in verbal and pictorial models. The final step is to integrate learners' prior knowledge with the two types of information.

Research on multimedia input in L2 vocabulary learning
A number of previous studies have focused on the influence of different types of multimedia input (textual, auditory, and visual) on learners' vocabulary knowledge development (e.g., Akbulut 2007; Chun and Plass 1996; Ramezanali and Faez 2019; Yanguas 2009; Yoshii 2006; Yoshii and Flaitz 2002). Many of these studies have found that the combination of textual and visual input is more beneficial than one type of input. For example, as Plass et al. (1998) summarized, various kinds of input (auditory, visual, and pictorial) can make connections with (1) the target L2 word, (2) the image that represents the concept of the word, and (3) the first language (L1) equivalent. According to the questionnaire and interview data of Ramezanali and Faez (2019), learners indicated a more positive attitude towards the dual glossing mode of L2 definition and video animation.
More specifically, some studies have attempted to measure and compare the effects of different types of input. For instance, in Al-Seghayer's (2001) study, 30 English as a second language (ESL) learners were allocated to different input groups with the provision of (1) only printed text, (2) a printed text definition plus still pictures, and (3) a printed text definition plus video clips. The group with textual input and video clips performed better than the other two groups. In Yanguas' (2009) study, although the findings revealed significant differences between the experimental groups (the picture group and the text plus picture group) and the control group (the text-only group) regarding learners' receptive vocabulary knowledge development, no difference was found in terms of productive vocabulary knowledge. Likewise, Çakmak and Erçetin (2018) assigned 88 students who had a low English proficiency level to four groups according to gloss (i.e., the explanations of words that accompany a text): no gloss, textual gloss, pictorial gloss, and textual plus pictorial gloss. This study showed that the type of gloss had no significant effect on learners' receptive or productive vocabulary acquisition.
In a meta-analysis that included the examination of gloss mode in vocabulary learning (Yanagisawa et al. 2020), glossed reading mode led to significantly greater learning gains than the non-glossed reading condition. The greatest effect was yielded by multiple-choice glosses, followed by marginal glosses, then hyperlinked glosses. Similarly, Ramezanali et al. (2021) conducted a meta-analysis of multimodal input and L2 vocabulary learning. They found many possible factors that may shape the effects of multimedia input on L2 learners' vocabulary development. These factors include learners' L2 proficiency, the language of instruction, and research design. As a result of these outcomes, it appears that further research is needed to explore the possibility of L2 vocabulary learning from multimedia input. In addition, individual differences in WM capacity may impact multimedia learning outcomes, as in one study, students with high WM capacity were better able to recall and could transfer more information during multimedia learning than students with low WM capacity (Anmarkrud et al. 2019). It is thus essential to explore the role of WM in L2 vocabulary learning through multimedia input.

Working memory (WM) and L2 vocabulary learning
WM refers to the cognitive system for storing, processing, and manipulating information for the temporary maintenance of task-relevant details in the face of other distracting information (Baddeley 1998, 2003). WM is operationalized as learners' constrained cognitive capacity, which allows them to simultaneously store and process information to gain awareness for completing mental tasks (Baddeley 2003). With regard to WM, there are two major research traditions: one is British, and the other is North American. The British tradition advocates the simple and storage dimensions of WM, such as the non-word repetition span task (Gathercole et al. 1994). The North American tradition suggests the implementation of complex memory span tasks to tap into the dual functions of WM. However, Williams (2012) claimed that the distinctions between the British and North American traditions are not always clear; hence, the definition of WM should be based on storage and processing functions.
Individuals vary a great deal in their cognitive skills and this influences their vocabulary learning outcomes (Teng and Zhang 2021). In relation to learners' cognitive skills, WM capacity is one of the most extensively investigated factors in individual differences in cognition. A perusal of the literature on WM reveals that most researchers refer to Baddeley's (1998, 2003) model of WM as the most influential framework for understanding WM. Based on this model, WM comprises the central executive, the phonological loop, and the visuospatial sketchpad. The central executive function is concerned with the control of information required to carry out complex tasks (Baddeley and Hitch 1974). The phonological loop and the visuospatial sketchpad, which are assigned for short-term memory, play important roles in retaining information. In particular, the phonological loop stores phonological information (e.g., remembering a phone number), while the visuospatial sketchpad maintains visual and spatial information (e.g., memorizing chess configurations) (Baddeley and Hitch 1974). Information in the two systems is assumed to decay rapidly. Engle et al. (1999) challenged Baddeley and Hitch's (1974) original conception of WM and highlighted that WM should be more connected to complex cognition in general. The central executive is crucial to maintaining task-relevant information.
As noted above, different notions of WM abound in the literature. However, these are differences in emphasis, rather than overall conception. WM is a multicomponent system that consists of domain-specific storage systems and domain-general executive components. Previous models on WM center on two areas: Baddeley and his colleagues focused on the storage components, while Engle and his colleagues emphasized executive functions. These different perspectives account for individual differences in the efficiency of L1 and L2 processing.

Working memory and vocabulary learning from multimedia
Researchers have paid attention to L2 vocabulary learning from the perspective of WM. Cheung (1996) attempted to study the correlation between phonological memory and natural vocabulary development among young learners by adopting non-word span to measure phonological memory. The participants in Cheung's study comprised a group of 84 seventh-grade high school students in Hong Kong. The results showed that phonological memory underlies L2 vocabulary acquisition. Researchers have also separately measured the predictive role of PSTM and executive WM in L2 vocabulary learning. Although some findings have suggested that young learners might not be able to rehearse transforming novel verbal materials into long-term memory (Gathercole et al. 1994), Cheung (1996) argued that WM and the rehearsal process are important determinants for young learners to pass on information to register it in long-term memory. Martin and Ellis (2012) looked at PSTM based on non-word repetition, non-word recognition, and listening span. Their study's participants were 50 native English speakers who learned single vocabulary words and sentences in a foreign language. Their results implied that PSTM is correlated with learners' vocabulary learning performance (r = 0.33-0.45). Their regression analyses suggested that PSTM made independent and significant contributions to their participants' vocabulary learning gains (β = 0.39). However, the actual mechanisms underlying PSTM and WM were not fully understood in their study. As a result, the findings may not be extended to L2 or FL learning. In a recent study (Karousou and Nerantzaki 2020), the focus was on assessing the effectiveness of a phonological memory training educational intervention on the vocabulary development of young L2 learners. A total of 97 learners were divided into two groups: an experimental group and a control group. The phonological working memory test was an English-sounding non-word repetition test. 
Vocabulary learning was evaluated through receptive and productive knowledge tests. The training included 33 sessions, which lasted for 12 weeks. The results supported the significant relationship between phonological working memory and L2 vocabulary size. Although phonological working memory did not significantly affect L2 receptive vocabulary knowledge, it significantly predicted productive vocabulary gains. Engel and Gathercole (2012) explored the relationship between WM and L1, L2 and third language (L3) vocabulary, grammar, and literacy learning proficiency with 119 Luxembourgish-speaking children from 34 primary classes in 16 state schools. The learners completed complex span and verbal short-term storage WM tasks, as well as a series of vocabulary, grammar, and literacy tests. The results showed that PSTM was correlated with L1 and L2 vocabulary, grammar, and literacy learning outcomes. However, the study suggests that executive WM is a weak predictor of L2 vocabulary learning. The findings, based on controlling phonological awareness, indicated a non-significant relationship between PSTM and L3-French vocabulary acquisition. Such outcomes contradict those of Cheung (1996). One reason may be the different use of WM tasks. Specifically, while Cheung (1996) used non-word repetition, Engel and Gathercole (2012) employed digit span. Another reason might be that the acquisition of unfamiliar phonology for the learners in the Luxembourgish-speaking context was influenced by their capacity to discern the sound system of the target language. Yang et al. (2017) explored vocabulary learning outcomes under different involvement load conditions. They also examined the role of WM in vocabulary learning outcomes. Data were collected from 85 first-year English major university students in China. WM was based on the dual tasks of sentence-final word recall and semantic similarity judgment.
They found that WM was a significant predictor of the vocabulary learning scores of the comprehension-only group and the gap-fill group, but not the sentence-writing group. In addition, WM did not influence the delayed vocabulary test scores. However, the scoring of the WM test was based on the composite score of the reading span test. Such a scoring system ignores the different roles of PSTM and executive WM (Engle et al. 1999).
In sum, the above studies suggest that PSTM, rather than executive WM, significantly predicts vocabulary learning outcomes. However, in Linck et al.'s (2014) meta-analysis, the WM executive control component was significantly correlated with L2 proficiency outcomes, including receptive and productive vocabulary knowledge. Yet despite the insight generated by these studies, gaps in our understanding remain. For example, the role of WM in the vocabulary learning rate through multimedia input, to the best of our knowledge, has not been examined. In a recent study involving 63 L2 learners of French (Montero Perez 2020), individual differences in complex WM (measured by a backward digit span and an Ospan task) were used to predict learners' performance in picking up new words from watching videos. The findings on the captioned videos, which require comprehension of bimodal inputs (Teng 2019), can be extended to multimedia input. However, the validity of the claims concerning the role of WM in vocabulary learning through multimedia input is open to question because such claims are based on limited evidence. As a result, more research is warranted to fill the gaps, such as investigating the associations between WM and vocabulary learning from multimedia input.

Rationale of the current study
Despite the sufficient attention given to the effects of multimedia input on vocabulary learning (Teng 2021), few studies have looked into the role of WM in the context of multimedia input-guided vocabulary learning (Schüler et al. 2011). Adding WM as a variable may enhance the theoretical and practical understanding of vocabulary learning outcomes through multimedia input. In addition, there has been a call to examine the interface between cognitive variables and treatment type in vocabulary acquisition (Alzahrani 2017). Hence, we aimed to determine whether WM is associated with the effects of multimedia input on vocabulary learning. We have attempted to fill a research gap in vocabulary research by investigating how WM (e.g., complex WM and PSTM) impacts the effects of three types of input conditions (Definition + Word information + Video, Definition + Word information, and Definition) on vocabulary learning. The three learning conditions have not been explored in previous studies. The findings provide insights into WM and L2 vocabulary learning from multimedia input. To achieve our goals, we developed two research questions: (1) To what extent do the three multimedia conditions differ from each other in learners' L2 vocabulary acquisition? (2) Do complex WM and PSTM predict vocabulary learning outcomes under different input conditions?

Participants
We recruited participants from the Department of English at a Chinese university, and conducted this study in three intact classes. We invited a total of 105 first-year students from three classes to participate. They ranged in age from 18 to 20 years old. Their native language was Chinese, and they were learning English as an FL. All participants reported that they had no language learning experience other than Chinese and English. We randomly assigned each class to one of the three types of input conditions described above. We excluded 10 participants who failed to complete the post-test. Therefore, the final dataset included 95 students, with 32 assigned to the Definition condition, 33 to the Definition + Word information condition, and 30 to the Definition + Word information + Video condition. The participants completed the Vocabulary Levels Test (VLT) developed by Schmitt et al. (2001). Learners under the condition of Definition achieved a mean score of 26.21 (SD = 1.35) for the 2,000 word level, 16.34 (SD = 1.02) for the 3,000 word level, and 4.52 (SD = 0.78) for the 5,000 word level. Learners under the condition of Definition + Word information achieved a mean score of 25.35 (SD = 1.31) for the 2,000 word level, 14.31 (SD = 0.93) for the 3,000 word level, and 3.22 (SD = 0.59) for the 5,000 word level. Learners under the condition of Definition + Word information + Video achieved a mean score of 26.89 (SD = 1.39) for the 2,000 word level, 15.11 (SD = 1.08) for the 3,000 word level, and 3.05 (SD = 0.51) for the 5,000 word level. None of the participants had any knowledge of words at the 10,000 level.

Target words
We selected all 24 target words from the Word of the Day section of the Merriam-Webster Online Dictionary (available at https://www.merriam-webster.com/word-of-the-day). Merriam-Webster is one of the most well-known and trusted dictionaries across the world and is supported by a professional dictionary editing and writing team. In the Word of the Day section in the online dictionary, readers can enjoy one carefully chosen word every day. The section not only includes the word's definition, but also provides extra information (e.g., background stories, synonyms, antonyms, example sentences, and etymology). Unlike typical dictionaries, this section offers learners multimedia inputs with a video explaining word pronunciation, meaning and use, part of speech (PoS), and example sentences.
To ease the difficulty of understanding and acquiring a word, we used the VocabProfile section (http://www.lextutor.ca/vp/comp) of Compleat Lexical Tutor (www.lextutor.ca) to check word frequency. The lexical profiles of the definitions of the target words were as follows: 2,000 word level (65.37%), 3,000 word level (77.38%), 4,000 word level (83.24%), and 5,000 word level (91.5%). The definitions were deemed appropriate for learners' comprehension. The 24 target words (Table 1) were all beyond the K-10 level, and can therefore be categorized as low-frequency words. As all the participants had reached a certain vocabulary level, but none knew any words at the 10,000-word level (see the section above on the participants), the target words were likely to be unfamiliar to them. Indeed, as the pre-test results showed, none of the participants had any previous knowledge of the 24 target words.

Treatment
We divided all the participants into three treatment groups. Specifically, we only offered students in Group 1 word definitions, which are explanations of word meanings. In Group 2, students had further access to the background information, which usually contains a story that introduces the target word's origin. In Group 3, students received multimedia inputs (e.g., videos) chosen from the Word of the Day

Vocabulary tests
The vocabulary test used in this study was designed based on the Vocabulary Knowledge Scale (VKS) adapted from Wesche and Paribakht (1996). The VKS is regarded as a widely accepted and commonly used framework to measure learners' vocabulary knowledge and to report learners' acquisition from complete unfamiliarity to a status where they are able to correctly and appropriately use a word in a particular context (Paribakht and Wesche 1997). Because the original VKS can be applied to evaluate whether learners have seen a word before, we modified this option to better fit our study. Figure 4 contains an example of the VKS for the target word parvenu.
The participants did not necessarily report every option from A to F. For Option A, they were required to state whether they knew the word. If so, they had to give a general explanation of the word (Option B) and/or a definition (Option C). To assess productive knowledge, participants were asked to report whether they could use the word in a sentence (Option D) and, if so, to write down the sentence in English (Option E). Finally, they were asked to translate the sentence into Chinese (Option F). Learners who provided a negative answer for Option A did not proceed with the following options. Likewise, learners who provided a negative answer for Option D did not proceed with the options after that.
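The branching just described can be written as a short sketch. It only illustrates which options a participant would reach given their answers to Options A and D; the function and parameter names are hypothetical and are not taken from the test materials.

```python
def options_reached(knows_word: bool, can_use_in_sentence: bool) -> list[str]:
    """Return the VKS options a participant works through (illustrative only)."""
    reached = ["A"]                  # everyone answers Option A
    if not knows_word:
        return reached               # a negative answer to A ends the item
    reached += ["B", "C", "D"]       # explanation and/or definition, then Option D
    if can_use_in_sentence:
        reached += ["E", "F"]        # write the sentence, then translate it
    return reached


# A learner who knows the word but cannot yet use it stops after Option D.
path = options_reached(knows_word=True, can_use_in_sentence=False)
```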

Scoring system for VKS
In terms of the VKS, we designed Options A, B, and C to evaluate participants' receptive knowledge (i.e., word meaning), while we designed Options D, E, and F to assess their productive knowledge (i.e., word use). As Table 3 shows, the scores of each option differed depending on the quality of the participants' answers. The maximum score for each item is 5. Here, each item refers to receptive or productive knowledge, and each aspect has five points. With 24 target words, the total mark for either receptive or productive knowledge is thus 120 points (24 × 5).

Table 3 (partially recovered; point values are missing from the source)
Receptive knowledge
  An incorrect answer
  No answer
Productive knowledge
  D: √
  E: A grammatically and semantically correct sentence
  (1) A sentence that demonstrates a very good knowledge of the target word but makes minor grammatical and syntactic error(s); or (2) A sentence that demonstrates satisfactory knowledge of the target word but uses correct grammar
  (1) A sentence that demonstrates very good/satisfactory knowledge of the target word but makes major grammatical and syntactic error(s); or (2) A sentence that makes no/minor grammatical and syntactic error(s) but demonstrates little knowledge of the target word

With regard to scoring, two experienced teachers who were not teaching the participants independently scored the participants' answers to minimize the risk of scoring bias. A third teacher was available when the two teachers gave different scores. The interrater reliability for the VKS was 91.4%. Disagreements were resolved based on majority opinion.
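As a rough illustration of the double-scoring procedure, the sketch below computes simple percent agreement between two raters and resolves a disagreement by majority once a third score is available. Two assumptions are ours: the text does not state which reliability index produced the 91.4% figure (percent agreement is only one possibility), and it does not say what happened when all three raters differed, so that case is left unresolved here. All scores shown are invented.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of answers (in %) on which two raters gave the same score."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100 * matches / len(rater_a)


def resolve(score_a, score_b, score_c=None):
    """Return the agreed score, or the majority score once a third rater is added.

    Returns None when no two raters agree (assumption: such cases need discussion).
    """
    if score_a == score_b:
        return score_a
    votes = Counter([score_a, score_b, score_c])
    score, count = votes.most_common(1)[0]
    return score if count >= 2 else None


# Hypothetical scores for five answers (not data from the study)
rater_a = [5, 3, 4, 0, 2]
rater_b = [5, 3, 3, 0, 2]
agreement = percent_agreement(rater_a, rater_b)   # raters differ on the third answer
final_third_answer = resolve(4, 3, score_c=3)     # third rater sides with rater B
```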

Working memory
The measure of WM includes complex WM and PSTM. We examined complex WM through a reading span test, which was a complex memory span task adapted from Daneman and Carpenter (1980). The purpose of complex memory span tasks is to measure learners' executive WM (Wen 2015). The test we chose focuses on the processing and storage components of short-term memory. The reading span test required the learners to judge whether each sentence was plausible, while at the same time remembering the final word in the sentence. Learners then recalled the sentence-final words in the order in which the sentences were presented after the entire set of sentences. This test included 60 sentences in Chinese. Although Mackey et al. (2010) argued that WM is an individual cognitive variable that is not related to language, we decided to use sentences in the participants' native language (rather than English) to minimize the influence of some learners' lower English proficiency on judging the English sentences and memorizing the final word. Among the 60 sentences, 10 were practice items and 50 were target sentences. We divided the 50 target sentences into 12 sets consisting of two, three, four, and five sentences. Each set (also called a span) was repeated three times. All sentences were in an affirmative and active form and contained 12-16 words. Half the sentences were semantically plausible, while the other half were not plausible.
We investigated PSTM through a non-word repetition task that we adapted from Gathercole et al. (2001). This test only focuses on the storage component of short-term memory. The learners were required to listen to 22 sequences of non-words. Each sequence included 4-7 one-syllable non-words. The learners were required to repeat all the non-words after each sequence. In total, the 22 sequences included 120 items. We created all the non-words based on the phonotactic rules of English. We used non-words rather than real words because learners' L2 vocabulary knowledge had the potential to impact their test performance, which would lead to an inaccurate evaluation of PSTM (Gathercole et al. 2001). A native speaker was invited to read the non-words and record the stimuli. The two WM tests were administered through E-Prime, software that assesses learners' psychological behavior. Each participant was tested individually in a lab. The first WM test was conducted in the morning, while the second test took place in the afternoon to ease potential pressure related to memory overload placed on the participants.

Scoring for working memory
We adopted different scoring systems for the two WM tests to fit the nature of each test. For the reading span test, we focused on three components: (a) the number of correctly recalled sentence-final words, (b) the number of correctly judged sentences, and (c) the mean reaction time for correctly judged sentences. We first transformed the raw scores for the three components into z-scores. Because higher reaction times represent slower responses, we multiplied the z-scores for reaction time by −1, which ensured that a higher score would reflect better performance on all three components. We then summed the three z-scores and divided the sum by three to obtain a composite score. Overall, the composite score represents the processing and storage components of short-term memory. Li and Roshan (2019) pointed out that a score for recalling sentence-final words only, rather than a composite score for the three components, might be inaccurate due to a possible trade-off between the storage and processing components of WM. For example, a learner may sacrifice accuracy of sentence judgment to achieve a better recall of the stimulus.
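The composite scoring described above can be illustrated with a minimal Python sketch. It is not the scoring script used in the study: the function names are hypothetical, and the use of the sample standard deviation (n − 1) for the z-transformation is an assumption, since the text does not specify it.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize raw scores (assumption: sample SD with n - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def reading_span_composite(recall, judgment, reaction_time):
    """Return one composite score per learner.

    recall        -- number of correctly recalled sentence-final words
    judgment      -- number of correctly judged sentences
    reaction_time -- mean reaction time for correctly judged sentences
    """
    z_recall = z_scores(recall)
    z_judgment = z_scores(judgment)
    # Reverse the sign so that faster responses yield higher scores.
    z_speed = [-z for z in z_scores(reaction_time)]
    # Average the three standardized components.
    return [(a + b + c) / 3 for a, b, c in zip(z_recall, z_judgment, z_speed)]
```

For example, `reading_span_composite([10, 20, 30], [40, 45, 50], [900, 800, 700])` ranks the third (hypothetical) learner highest, because that learner recalls and judges the most items and also responds fastest.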
For the non-word repetition test, the participants had to remember and repeat the 22 sequences, which together contained the 120 target non-words. The learners earned one point for each correctly recalled item. The total score was the number of non-words correctly recalled across the 22 sequences.
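The one-point-per-item rule can be sketched as follows. The text does not state whether an item had to be recalled in its original serial position, so this sketch assumes position-sensitive scoring; the function names and the non-words in the usage note are hypothetical.

```python
def score_sequence(target, response):
    """Count items recalled correctly in the same serial position (assumption)."""
    return sum(1 for t, r in zip(target, response) if t == r)


def pstm_score(targets, responses):
    """Sum the points over all sequences to obtain a learner's total score."""
    return sum(score_sequence(t, r) for t, r in zip(targets, responses))
```

With the two hypothetical target sequences `["bim", "tas", "pov", "dek"]` and `["lum", "gid", "fep", "noz", "rab"]`, a learner who repeats the first sequence perfectly and answers `["lum", "gid", "nep", "noz"]` for the second earns 4 + 3 points.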

Data collection
We conducted the entire study in a computer classroom. Three teachers were responsible for the three conditions. After the teachers attended a training session on the procedures of the treatment session, we randomly assigned each of them to one condition. The teachers instructed the students on how to complete the experiment. The participants completed the training and the respective requirements under each condition in the computer classroom.
This study included a pre-test, a treatment, a post-test, and a delayed post-test (Table 4). The participants completed the pre-test four weeks before the treatment. Given that four weeks is a relatively long period of time, we assumed that learners would not deliberately commit the words to memory. The treatment session was completed in Week 5. We conducted the first post-test immediately after the treatment session. The participants completed the WM tests (i.e., the complex WM test and the PSTM test) in Week 6. We administered the WM tests one week after the treatment session, rather than immediately after it, to reduce the possibility that the cognitive load imposed on learners by the treatment session would influence their WM performance. The participants then finished the delayed post-test in Week 7. The VKS served as the pre-test, post-test, and delayed post-test. The three administrations differed in that we added different sets of non-target words to each test and changed the order of the test items. The purpose was to minimize learners' tendency to deliberately memorize the target words, because we expected their vocabulary learning outcomes to come mainly from the training sessions. The participants had to finish the online VKS within one hour. During the treatment session, each video was played only once. Learners needed to complete each question and could not move back to previous ones.
We obtained ethical permission for this study from the internal research committee of the university where the study was conducted. Participants signed a consent form to indicate their agreement to participate. They were assured anonymity and confidentiality, and were allowed to withdraw at any time. Participants received a coupon for their effort and time.

Data analysis
We analyzed all the data using SPSS Version 26. We ran a two-way analysis of covariance (ANCOVA), followed by Bonferroni post hoc comparisons, to test the influence of the independent variable on the dependent variables at the immediate and delayed tests while controlling for the covariates. The dependent variables were the two dimensions of the vocabulary test (receptive and productive vocabulary knowledge), which we administered twice. The independent variable was the input condition, with three groups. The covariates were the two components of WM. The test results allowed us to examine the possible impact of (1) the different types of input conditions and (2) WM on vocabulary test scores. We tested the same two-way ANCOVA model twice, first on the immediate post-test data and then on the delayed post-test data. Time was the within-group variable and treatment was the between-group variable. We entered complex and phonological WM in the model as covariates; we excluded the pre-test because the participants did not exhibit any prior knowledge on it. Before carrying out the ANCOVA, we examined the assumption of homogeneity of regression slopes via the interaction between treatment and WM when predicting each dependent variable. The p values were larger than 0.05 for the immediate and delayed post-tests of receptive and productive knowledge, indicating that the assumption of homogeneity of regression slopes was met for each dependent variable. In addition, the dependent variables were normally distributed.
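The homogeneity-of-regression-slopes check can be illustrated outside SPSS with a small model comparison. The following NumPy sketch is not the SPSS procedure used in the study; it is an illustrative reconstruction with hypothetical function names that handles a single covariate. It fits a model with group and covariate, then a model that adds the Group × Covariate interaction, and returns the F statistic for the interaction terms. A non-significant F is consistent with parallel slopes.

```python
import numpy as np


def design(groups, covariate, with_interaction):
    """Design matrix: intercept, group dummies, covariate, optional interaction."""
    groups = np.asarray(groups)
    x = np.asarray(covariate, dtype=float)
    levels = np.unique(groups)
    # Dummy-code every group except the first (reference) level.
    dummies = np.column_stack([(groups == g).astype(float) for g in levels[1:]])
    cols = [np.ones_like(x), dummies, x]
    if with_interaction:
        cols.append(dummies * x[:, None])  # group x covariate terms
    return np.column_stack(cols)


def rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def slope_homogeneity_f(groups, covariate, y):
    """F test of the interaction: returns (F, df_numerator, df_denominator)."""
    y = np.asarray(y, dtype=float)
    X_reduced = design(groups, covariate, with_interaction=False)
    X_full = design(groups, covariate, with_interaction=True)
    rss_reduced, rss_full = rss(X_reduced, y), rss(X_full, y)
    df_num = X_full.shape[1] - X_reduced.shape[1]
    df_den = len(y) - X_full.shape[1]
    f_value = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_value, df_num, df_den
```

With three groups, the interaction contributes two numerator degrees of freedom. The p value would then be read from the F distribution with the returned degrees of freedom, for instance with a statistics package.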

Results
There was a main effect of treatment condition on the immediate post-test scores after controlling for WM scores, for both receptive knowledge, F(2, 95) = 354.494, p < 0.001, ηp² = 0.887, and productive knowledge, F(2, 95) = 345.294, p < 0.001, ηp² = 0.885. There was also a main effect of treatment condition on the delayed post-test scores after controlling for WM scores, for both receptive knowledge, F(2, 95) = 263.317, p < 0.001, ηp² = 0.853, and productive knowledge, F(2, 95) = 252.131, p < 0.001, ηp² = 0.849. The results did not reveal a significant time effect from the immediate to the delayed post-test after controlling for WM scores in receptive knowledge, F(1, 90) = 0.172, p = 0.691, ηp² = 0.003, or productive knowledge, F(1, 90) = 0.174, p = 0.691, ηp² = 0.003. We did not detect a significant Time × Treatment interaction effect on the delayed post-test score after controlling for WM scores in receptive knowledge, F(1, 90) = 0.172, p = 0.691, ηp² = 0.003, or productive knowledge, F(1, 90) = 0.174, p = 0.691, ηp² = 0.003. A Bonferroni post hoc test of the immediate post-test showed that the Definition + Word information + Video group had significantly higher scores than the Definition + Word information group in receptive knowledge (p < 0.001) and productive knowledge (p < 0.001). The Definition + Word information group had significantly higher scores than the Definition group in receptive knowledge (p < 0.001) and productive knowledge (p < 0.001). In terms of the delayed post-test, the Definition + Word information + Video group demonstrated significantly higher scores than the Definition + Word information group in receptive knowledge (p < 0.001) and productive knowledge (p < 0.001). The Definition + Word information group scored significantly higher than the Definition group in receptive knowledge (p < 0.001) and productive knowledge (p < 0.001).
The findings address the first research question concerning the different impact of the three learning conditions on L2 vocabulary learning.
The next step was to examine the role of WM in vocabulary learning outcomes. Overall, using Pillai's trace, we found a significant effect of complex WM on vocabulary learning, V = 0.207, F(4, 87) = 5.694, p < 0.001. Again, using Pillai's trace, we detected a significant effect of phonological WM on vocabulary learning, V = 0.210, F(4, 87) = 5.772, p < 0.001. Table 6 presents the effects of WM as a covariate on the different components of vocabulary knowledge at the two administered times. Table 6 shows that complex WM, as the covariate, significantly predicted the immediate test scores of productive knowledge, F = 22.664, p < 0.001, ηp² = 0.201, and receptive knowledge, F = 16.795, p < 0.001, ηp² = 0.167, as well as delayed test scores of productive knowledge, F = 21.387, p < 0.001, ηp² = 0.192, and receptive knowledge, F = 10.858, p < 0.05, ηp² = 0.108. Phonological WM, as the covariate, also significantly predicted immediate test scores of productive knowledge, F = 15.376, p < 0.001, ηp² = 0.146, and receptive knowledge, F = 18.987, p < 0.001, ηp² = 0.174, as well as delayed test scores of productive knowledge, F = 14.967, p < 0.001, ηp² = 0.143, and receptive knowledge, F = 21.478, p < 0.001, ηp² = 0.193. The findings address the second research question concerning how complex WM and PSTM predict L2 vocabulary learning outcomes under different input conditions.
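For readers who wish to relate the reported F values to the effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom as ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below is illustrative only. The degrees of freedom for the covariate tests are not reported in the text, so the example assumes one numerator degree of freedom per covariate and borrows the error degrees of freedom of 90 reported for the time effect.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Assumed df (1, 90): complex WM on immediate productive knowledge, F = 22.664.
effect_size = partial_eta_squared(22.664, 1, 90)
print(round(effect_size, 3))  # 0.201
```

Under these assumed degrees of freedom, the result agrees with the value of 0.201 reported for complex WM on immediate productive knowledge.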

Discussion
We examined the extent to which complex WM and PSTM were associated with the effects of three types of input conditions on vocabulary learning and retention. We found that (1) the learning and retention of vocabulary knowledge was more pronounced in the Definition + Word information + Video condition and (2) complex WM and PSTM influenced vocabulary learning and retention under the different input conditions. In the following section, we seek to explain and discuss the findings with reference to the nature of the treatment by comparing the results with previous findings and in relation to relevant theories. Theoretically, the findings contribute to existing frameworks such as the cognitive theory of multimedia learning (Mayer 1997, 2001) by further verifying and extending its arguments. Pedagogically, the findings offer insights for developing instructed L2 vocabulary learning, thereby deepening understanding of the affordances of multimedia input on vocabulary learning in the digital era.

Vocabulary learning from multimedia input
Our first purpose was to explore how vocabulary is learned and retained from multimedia input. The findings support the idea that combining definitions and information of words with associated visuals more effectively facilitates receptive and productive vocabulary knowledge, compared to providing word definitions alone or definitions with word information. In line with earlier studies (e.g., Akbulut 2007; Chun and Plass 1996; Ramezanali and Faez 2019; Yanguas 2009; Yoshii 2006; Yoshii and Flaitz 2002), the availability of visual input, along with word meanings, could help L2 learners to perform better in acquiring vocabulary knowledge (versus a single type of textual input). One explanation for this is that the availability of multiple types of input for a word may encourage learners to actively look up the word's meaning, thereby reinforcing learning and retention. Another explanation relates to the so-called hypermnesia effect, which predicts better recall of visual input over time than textual input, as textual input tends to be forgotten. This psychological effect may account for the improved performance in vocabulary learning with Definition + Word information + Video, and the lack of pronounced improvement with text definitions alone. We speculate that visual input allows one to develop a mental model of the information (Teng 2019). A textual definition, on the other hand, may not be sensitive to learners' cognitive constraints, so learners might not be able to reflect and refresh their short-term memory. In all three groups, scores for the delayed vocabulary tests were lower than those for the tests administered immediately after the treatment. One possible explanation is that the learners may have demonstrated attrition for the words they memorized during treatment, as the findings also showed the influence of learners' WM on vocabulary learning and retention.
However, words for which Definition + Word information + Video were provided were recalled significantly better on the delayed tests, whereas words for which only a text definition or Definition + Word information were provided were recalled less well on the delayed tests. These findings are in line with Mayer's cognitive theory of multimedia learning (1997, 2001), particularly with dual-coding theory (Paivio 1986, 1990). In the present study, the findings highlight the potential of presenting an explanation in words and visually, rather than solely textually. Mayer (2001) claimed that the effects of multimedia learning can be evaluated through transfer and retention. Transfer refers to learners' ability to use the material in a multimedia input to solve new problems, and retention refers to learners' ability to remember important verbal information from multimedia input. In the present study, evidence for transfer was shown when learners demonstrated significantly better gains in receptive and productive vocabulary knowledge through processing information in the presented multimedia input. Evidence for retention was exhibited when learners demonstrated relatively modest gains on a delayed post-test. Reflecting the theory, the combination of definitions, word information, and videos may help learners to reinforce the referential connections between form and meaning, leading to better vocabulary knowledge learning and retention. The reinforcement of their vocabulary learning outcomes is probably due to the availability of multiple types of inputs for the target words. According to Mayer's cognitive theory of multimedia learning (1997, 2001), the presentation of verbal and visual information can attract learners' attention, helping them to build mental images that depict connections or provide gestalt.
In the present study, students who received word information and definitions with videos, which contained narration and animation, were able to better retain information because they received the same information at least twice, either verbally or through visual inputs. Consistent with Paivio's (1972, 1986, 1990) dual-coding theory, dual channels may be an aid rather than a hindrance. In the present study, we argue that dual channels may have a constant, fixed quality, allowing for the development of a more enduring mental model of the information, lessening the cognitive load of the EFL learners in processing information, and increasing their short-term recall of vocabulary knowledge.

WM and vocabulary learning from multimedia input
As mentioned above, a range of factors affect learners' L2 vocabulary learning (Ramezanali et al. 2021). In the present study, WM was shown to be one of those factors, which is in accordance with the work of Anmarkrud et al. (2019). The findings suggest that WM predicted the participants' learning and retention of vocabulary knowledge in the three groups. In the study, the multimedia input in the form of a word definition and video information required learners to process the received input, while at the same time retrieving information from long-term memory and encoding new information in it. Based on the results, we argue that executive WM, which involves the storage and manipulation of information in the service of cognition, is needed in processing information when learning words from multimedia input. Complex executive WM, which assumes both storage and processing functions, is correlated with vocabulary learning performance (Cheung 1996; Yang et al. 2017). In line with Montero Perez (2020), complex WM predicted learners' performance in picking up new words from viewing videos. However, the findings were not consistent with those of Engel and Gathercole (2012), who did not find executive WM to be a significant predictor of vocabulary learning outcomes. The positive outcome in vocabulary learning in the present study can be explained from two angles: testing and learning. From a testing perspective, the low-WM learners may have retrieved word knowledge due to the repeated tests rather than from the treatment. From a learning perspective, learners with better WM were probably better at internalizing multimedia input for automatic use.
Consistent with previous studies (e.g., Engel and Gathercole 2012; Karousou and Nerantzaki 2020; Martin and Ellis 2012), the findings also support the role of phonological WM in vocabulary learning performance. One possible explanation is that PSTM, which supports the consolidation of stable phonological representations in long-term memory (Ellis 1996), may facilitate the maintenance of relevant information in a multimedia input, as well as the regulation of processing during complex operations, allowing learners to notice linguistic features and possibly integrate them into vocabulary learning and retention. Engel and Gathercole (2012) explored the differential effects of executive and phonological memory on vocabulary learning. Révész (2012) also distinguished PSTM and complex WM, and argued that PSTM helps learners to perform better on oral tests, while complex WM is essential for performance on written tests. We assert that vocabulary learning, which involves the sequential sound patterns of words and their arbitrary mapping to meaning, requires learners to harness their executive and phonological WM in processing input to acquire receptive and productive vocabulary knowledge. This argument partially aligns with Wen's (2015) proposal concerning (a) the role of executive WM in production and comprehension, and (b) the role of PSTM in affecting the final product of learning. However, our findings have to be interpreted cautiously, as we are still not sure how the different components of WM might influence the various aspects of vocabulary learning. Montero Perez (2020) offered insights into associating WM and multimedia input (e.g., videos), namely, that it is essential to explore the role of WM in multimedia input treatments based on a clear theoretical account of the mechanism through which it affects vocabulary learning. Clearly, the relationship between WM and vocabulary knowledge type in multimedia input needs to be further investigated.

Conclusions
The present study highlights the effectiveness of combining definition, word information, and videos for the learning and retention of vocabulary knowledge. In addition, individual differences in complex WM and phonological short-term memory predicted students' vocabulary learning and retention under different input conditions. Despite the value of this study's findings, there are some limitations which must be noted. First, the findings should be approached with caution because the target population consisted of English major students at a Chinese university. Given this, the study should be replicated in other language learning contexts to generalize the findings. In addition, we did not examine learners' pre-existing differences in English proficiency level. Future research can therefore study whether high-level learners could benefit more from multimedia inputs than low-level learners. Second, the duration of treatment was short. As such, the claims relating to vocabulary learning and retention can be better supported through a longitudinal study. Third, the assessment of vocabulary learning only focused on receptive and productive knowledge. Milton and Fitzpatrick (2013) asserted that vocabulary learning is an incremental, dynamic process that includes knowledge of form, meaning, use, word association, word parts, and collocations. Future studies can employ more vocabulary assessments to tap into different components of vocabulary learning. Fourth, during treatment, learners may have enhanced their vocabulary learning performance because of repeated exposure to the target words. Word exposure frequency, an important variable in vocabulary research (Teng 2020), should therefore be investigated. Finally, the target words included adjectives, nouns, and verbs. It would be interesting to see whether multimedia input has differential effects on the learning and retention of different types of words.
Despite the limitations, our study provides pedagogical and theoretical implications for teaching and learning vocabulary through multimedia input. The findings demonstrate that exposing learners to multiple modalities of presentation (i.e., definition, background information, and video) leads to effective vocabulary learning and retention. Future studies might consider comparing the effects of different gloss conditions, similar to the approach adopted and the insights generated by Yanagisawa et al. (2020). Pedagogically, it is essential to instruct students to use multimedia input for vocabulary learning. Textbook designers can include interesting and relevant visual materials to accommodate learners' cognitive constraints in vocabulary learning and retention. Videos are not just for entertainment, but can be used to strengthen the connection between word form and meaning (Teng 2021). In terms of theoretical implications, the study's findings support dual-coding theory (Paivio 1972, 1986, 1990) and highlight the effects of explaining words and providing corresponding videos for better vocabulary learning and retention. The findings also support the cognitive theory of multimedia learning (Mayer 1997, 2001) and emphasize the role of learners' memory resources in processing multimedia input. Multimedia input, in the form of combining visual and verbal input as retrieval cues, can stimulate and encourage learners' engagement in the cognitive processes required for meaningful vocabulary learning and retention.