From placement to diagnostic testing: Improving feedback to learners and other stakeholders in SELF (Système d'Evaluation en Langues à visée Formative)

Abstract: Since 2012, an interdisciplinary and culturally heterogeneous team of more than 30 people has been engaged in the complex process of conceiving, designing and validating an online placement test with formative orientation called SELF (Système d'Evaluation en Langues à visée Formative), developed and already deployed in six languages: Italian and English as pilots, followed by French, Mandarin, Japanese and Spanish. Its results are used to form groups and classes of similar ability, or to identify students' strengths and weaknesses in three macro skills (listening, reading, limited writing). In this report, we describe the steps the multilingual team is currently taking to transform SELF into a diagnostic test that will fulfill its original formative purpose and provide students and other stakeholders with more precise information about their performance. This can be done in two ways: by using the data automatically recorded by the online administration platform more thoroughly, and by enriching user feedback with clear and informative graphics. This will enhance the validity of our test and help close the gap between testing and learning.


Description of context
At Université Grenoble Alpes (France), courses in more than 20 foreign languages are on offer, in different modalities: face-to-face, blended or entirely through distance learning, and in semester-, year-long or intensive formats. In this context, and with thousands of students each year taking foreign languages as an elective, it is important to organize enrollments as efficiently as possible. Placement testing is an essential part of this process.
In 2012, a multilingual team started working on SELF, a placement test that would be available in several languages and open to students from partner universities in France and abroad. The idea was to develop and document a methodology and tools that colleagues teaching other languages could later reuse. The project was supported by a research grant and is now operational in Italian, English, Japanese, Spanish, Mandarin, and French as a Foreign Language. SELF uses an item bank in which each item is tagged with, among other attributes, its language, macro skill/language activity (listening, reading or writing), language focus (morphosyntax, lexis, pragmatics, etc.), discourse type, observed difficulty during pretesting, and CEFR level, the latter determined by a standard-setting procedure with a panel of experts. It is a semi-adaptive multi-stage test. The first stage (the initial testlet) is common to all test takers, but the items in the second stage depend on test takers' results in the first. Results in the second stage are used to refine the estimate of learners' level and arrive at placement results that are as reliable as possible.
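As an illustration only, the item tagging and two-stage routing described above could be sketched as follows in Python. The field names, the three testlet labels ("lower", "middle", "upper") and the cutoff values are assumptions made for this example; the report does not specify SELF's actual data model or routing rules.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Item:
    """One entry in a hypothetical item bank, tagged as described in the text."""
    item_id: str
    language: str
    skill: str            # "listening", "reading" or "writing"
    language_focus: str   # e.g. "morphosyntax", "lexis", "pragmatics"
    discourse_type: str   # e.g. "narrative", "informative", "argumentative"
    cefr_level: str       # e.g. "A2", "B1"


def select_second_stage(first_stage_correct: Sequence[bool],
                        low_cutoff: float = 0.4,
                        high_cutoff: float = 0.75) -> str:
    """Choose a second-stage testlet from the share of correct first-stage answers.

    The cutoffs are illustrative placeholders, not SELF's real thresholds.
    """
    if not first_stage_correct:
        raise ValueError("first stage has no responses")
    share = sum(first_stage_correct) / len(first_stage_correct)
    if share < low_cutoff:
        return "lower"
    if share < high_cutoff:
        return "middle"
    return "upper"


# Example: 3 correct answers out of 5 in the initial testlet.
example_item = Item("it-001", "Italian", "listening", "lexis", "informative", "A2")
routed_to = select_second_stage([True, True, True, False, False])
```

In this sketch every test taker answers the same initial testlet, and only the choice of the second testlet depends on performance, which is what makes the design semi-adaptive rather than adaptive item by item.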
SELF, however, is not just a placement tool. Its initials stand for "Système d'Evaluation en Langues à visée Formative", i.e., a foreign language assessment system with formative orientation. Its goal is to help students realize where their strengths and weaknesses lie by giving them more precise feedback than just the group they should be placed in, so that they can work, if they wish, on their weak points and ultimately improve their foreign language skills. This feedback can be called diagnostic since "diagnostic tests seek to identify those areas in which a student needs further help. These tests can be fairly general, and show, for example, whether a student needs particular help with one of the four main language skills; or they can be more specific, seeking perhaps to identify weaknesses in a student's use of grammar" (Alderson et al. 1995: 12). At present, the feedback is rather limited: it reports students' level in each skill targeted by the test (listening, reading, and basic writing skills), i.e., general diagnostic information in the terms of Alderson et al., but nothing more specific. Other stakeholders who could be given more information are the institutions (language centers at partner universities) that use SELF to place their students. The following sections give an account of the steps currently being taken to improve user feedback in these areas.

Account of activity: Development of diagnostic feedback
Currently, SELF users receive placement information about the course level they should enroll in (see Figure 1), ranging from A1 to C1/C2. The system does not distinguish between C levels (expert users), and lower levels are divided into sublevels (B1.1 and B1.2 for B1, for example).
Users are also provided with information about their level in the three skills targeted by SELF (see Figure 2), and can thus see whether they have a balanced learner profile with similar levels in all three skills, or whether they need to focus their efforts more heavily on one skill, for example, depending on what their ultimate goal is. The CEFR explicitly mentions "uneven profiles, partial competencies", and one of its achievements was to allow for this possibility and give instructors the tools to report it (Council of Europe 2001: 17).
This information can be downloaded and printed by students, but the interaction with the assessment system does not currently go any further. This is unfortunate, since there is potentially a lot more data stored by the system that could be used to provide more detailed information. The importance of detailed feedback has been emphasized by many researchers working on diagnosis: "for assessment information to be used effectively it needs to be detailed, innovative, relevant and diagnostic, and to address a variety of dimensions rather than being collapsed into one general score" (Shohamy 1992: 515).
For this reason, we are currently developing user dashboards that will make feedback much more interactive and provide a richer experience to users. A computerized system is ideal for providing this kind of experience, since data storage is automatic and storage capacity (for our purposes) is close to unlimited. What is needed is a way to convey the information to the user clearly and usefully. We propose to do this by converting the data to visual information. Data visualization is an extremely powerful tool (Larson-Hall 2017; Tufte 2001), and it will allow us to provide learners with more precise information about their scores, give them access to the items they got right or wrong, and show them how long they took to answer each question and which item types they found easy or hard (see Figure 3). Since information about each test administration will be recorded, this will also help students track their progress over time and adjust their goals accordingly. This might result in positive washback if the visual feedback motivates students to improve and lets them see for themselves the upward trend in their results (as in online games, where scoreboards fulfill the same purpose).
Let us now look more closely at Figure 3 (bottom section). As mentioned earlier, question items in SELF are identified not just by the skill they correspond to, but also by other characteristics such as their language focus and their discourse type. The language focus is the critical information that designers believe test takers need to understand in order to answer the question correctly. This might be an element of morphosyntax (for example, identifying a past tense marker to understand that the time reference is past, when no other elements provide this information), or lexis (understanding a key term), or pragmatic intention (the illocutionary force of an utterance, for example, a refusal disguised as a question). The discourse type is the prevalent genre of the text the item bears on (narrative, informative, argumentative, etc.). Each item is linked to a text which test takers need to process to answer the question correctly. We hypothesize that familiarity with different discourse types is likely to affect success (Cervini and Jouannaud 2015). Since each test item is defined by these characteristics, we can calculate the percentage of items answered correctly for each language focus and each discourse type, and display this information in a spider chart representing the test taker's strengths and weaknesses in each area. In our example (bottom of Figure 3), the learner is much better at working with informative discourse types than with narratives. A student majoring in science and destined to work with other genres may not feel this is a problem. However, given the centrality of narratives to human experience, they would probably be well advised to allocate some of their language learning efforts to this area. The final decision, however, rests with the language learner, who is ultimately responsible for their own learning.
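The calculation behind such a spider chart can be sketched in a few lines. The example below is a minimal Python illustration, not the platform's own code (the project itself uses R and JavaScript); the record layout and the sample responses are invented for the purpose of the example.

```python
from collections import defaultdict


def percent_correct_by(responses, key):
    """Return the percentage of correct answers for each value of `key`.

    `responses` is a list of dicts, one per attempted item, each holding the
    item's tags and a boolean "correct" flag (a hypothetical record layout).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for response in responses:
        category = response[key]
        totals[category] += 1
        correct[category] += int(response["correct"])
    return {category: 100.0 * correct[category] / totals[category]
            for category in totals}


# Invented sample data for one test taker.
responses = [
    {"language_focus": "lexis", "discourse_type": "informative", "correct": True},
    {"language_focus": "lexis", "discourse_type": "narrative", "correct": False},
    {"language_focus": "morphosyntax", "discourse_type": "informative", "correct": True},
    {"language_focus": "pragmatics", "discourse_type": "narrative", "correct": False},
]

by_discourse = percent_correct_by(responses, "discourse_type")
by_focus = percent_correct_by(responses, "language_focus")
```

Each resulting percentage would supply the value for one axis of the chart. With so few items per category the figures are unstable, so a real dashboard would probably also need to show how many items each percentage is based on.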
Dashboards will also improve practicality for other stakeholders, such as instructors or administrators. Practicality is one of the components of test usefulness according to Bachman and Palmer (1996). It is important for administrators in particular to understand what the test is about and how it should be used, as they ensure that correct decisions are taken following test administration. Current test session feedback for groups (for example, for all science students wanting to take a Japanese course in the first semester) is provided in spreadsheet format (see Figure 4). It summarizes information from individual administrations: for each student, we get the personal identification details entered when registering on the SELF assessment platform (first and last names, email address, degree prepared, major), as well as information about the test administration (date, time taken, and results in terms of both placement and subskill level).
However, spreadsheets are not always easy to read, and they do not summarize the information they contain. Administrators (and teachers) looking at a spreadsheet cannot tell at a glance how well the group did. Additional manipulation is required to obtain information about the central tendency or spread of each variable in the spreadsheet. We intend to automate the process and provide the results in visual format (see Figure 5). This is a prototype still under development, and it might need to be adjusted to make sure the interface is not too cluttered, as it has been shown that too much information is detrimental to the uptake of feedback (Goodman and Hambleton 2004). Once the prototype is developed, we will be able to pilot it with a sample of potential users. At present, we display for each group session the number of participants (top right) and, from top to bottom, a calendar showing the spread of administrations over time, leafplots (horizontal histograms) for group placement and skill results, and boxplots for the time taken to complete each part of the test. Access to individual results would still be possible, and would be made interactive: clicking on a line representing one student's results (bottom of Figure 5) would take the administrator to this student's dashboard.
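The kind of summary that a group dashboard would compute automatically can be illustrated with a short Python sketch. The column names ("placement", "minutes") and the sample rows are assumptions made for this example and do not reflect the actual export format of the SELF platform.

```python
from collections import Counter
from statistics import quantiles


def summarize_group(rows):
    """Summarize one group session: size, placement counts, time quartiles.

    `rows` is a list of dicts, one per student, with a "placement" level and
    the total time taken in "minutes" (hypothetical column names).
    """
    placement_counts = Counter(row["placement"] for row in rows)
    minutes = [row["minutes"] for row in rows]
    # Quartiles are the values a boxplot of completion times would draw.
    q1, q2, q3 = quantiles(minutes, n=4, method="inclusive")
    return {
        "n": len(rows),
        "placement_counts": dict(placement_counts),
        "time_median": q2,
        "time_iqr": (q1, q3),
    }


# Invented sample session with five students.
rows = [
    {"placement": "B1.1", "minutes": 30},
    {"placement": "B1.1", "minutes": 40},
    {"placement": "A2", "minutes": 50},
    {"placement": "B2", "minutes": 60},
    {"placement": "B1.1", "minutes": 70},
]
summary = summarize_group(rows)
```

The placement counts correspond to the bars of a horizontal histogram, and the median and quartiles to the box of a boxplot, so the same summary could feed both displays.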
The dashboards presented above are the first step toward transforming SELF, currently used mostly for placement purposes, into a diagnostic test that will provide much more information for each test taker to act upon. This kind of transformation is not new: other researchers have tried to exploit the data collected during the administration of large-scale proficiency tests, for example, to present diagnostic information to learners alongside general results (Buck and Tatsuoka 1998; Liu 2015). However, this procedure (known as cognitive diagnostic assessment or CDA) requires sophisticated statistical analyses, and it has not been taken up on a wide scale. Our method is much simpler: it proposes a way to visualize the raw data recorded by the administration platform using informative graphics (Larson-Hall 2017). We use free and open-source software for this purpose: R for uploading, cleaning, selecting and analyzing the data, and JavaScript (with the D3.js library) for the production of dynamic and interactive graphics.

Conclusion and future prospects
SELF is currently used by more than 25 French universities and language centers as a CEFR-based placement test. Since 2016, when it became fully operational, more than 90,000 students have taken the test in one (or more) of its six foreign languages. The results are used to form groups and classes of similar ability, or to identify students' strengths and weaknesses in three macro skills. In this report, we have described the steps the multilingual team is currently taking to transform SELF into a more thorough diagnostic test that will fulfill its original formative purpose and provide students and other stakeholders with more precise information about their performance. This can be done in two ways: by using the data automatically recorded by the online administration platform more thoroughly, and by enriching the feedback experience with it. Test administration data currently used for user feedback include total score and item characteristics in terms of level and targeted skill (so that a separate score is reported for each skill and scores are translated into corresponding CEFR levels). Additional data we intend to exploit include individual item scores, time spent on each item and on the whole test, and further item characteristics such as language focus and discourse type. This additional information will be used to improve the feedback provided by the platform in terms of quantity of information, usability, and comprehensibility. The use of clear and interactive graphics, as advocated by data analysts, will (we hope) help learners make sense of the results, motivate them to improve their skills, and lead them to make informed decisions based on the results displayed.
Once the prototypes have been developed and piloted with a sample of users, we will continue to explore other avenues for the provision of diagnostic feedback. One is the collection of more data through the development of more specific tests. Students with weak results in one area might want to take a further test targeting this area in more detail (for example, a phoneme discrimination test for students with weak results on items with a phonological language focus). A second avenue is the use of more sophisticated data analysis techniques. One we are considering is the unsupervised clustering of students based on their answers using Latent Block Modeling (Brault and Mariadassou 2015). This would enable us to try to define learner profiles, and perhaps offer learners feedback (semi-)automatically. A final idea is to strengthen the link between assessment and learning by giving students access to online remedial modules based on their current learner profile (Alderson 2007; Masperi and Quintin 2014).