The longitudinal development of self-assessment and academic writing: an advanced writing programme

Although several studies have investigated the self-assessment (SA) of writing skills, most research has adopted a cross-sectional research design. Consequently, our knowledge about the longitudinal development of SA is limited. This study investigated whether SA instruction leads to improvement in SA accuracy and in second language (L2) writing. A total of 33 English as a foreign language (EFL) students composed and self-assessed two argumentative essays, one at the beginning (Time 1) and one at the end (Time 2) of a semester-long advanced writing (AW) programme at a Hungarian university. About half of the participants received SA instruction (experimental group), while the other half did not (control group). The essays were scored by two teachers and analysed for linguistic complexity. The results showed improvement in SA accuracy in both groups. However, the SA-teacher assessment (TA) correlation for the total score was statistically significant only in the experimental group at Time 2 (post-instructional phase). Furthermore, the TA total scores and a few linguistic complexity indices showed improvements in L2 writing in both groups. The pedagogical implications of these findings, emphasising the importance of SA in EFL writing courses, are also discussed.


Introduction
Teachers of English as a foreign language (EFL) are required to regularly assess their students' knowledge and performance, according to the National Core Curriculum of Hungary. Therefore, being able to assess language skills is undoubtedly an important skill for teacher trainees (Hubai and Lázár 2018). However, at universities, language teacher trainees are rarely taught how to assess language skills, especially writing abilities (Csépes 2016). Therefore, the implementation of writing assessment, such as self-assessment (SA) of writing abilities, might be beneficial for future language teachers in the Hungarian educational context. The benefits of the implementation of writing assessment and self-assessment are twofold at the tertiary level. First, teacher trainees will be instructed how to assess writing (e.g., how to design a rubric). Second, they will be made aware of the benefits of self-assessment.
SA is an "internal" approach to measuring language proficiency (Oscarson 1989: 1) which has gained popularity thanks to the increasing interest in learner autonomy and the conceptual change from teacher- to learner-centred instruction (Butler and Lee 2010; Dann 2002). SA facilitates learners' decision-making regarding their language abilities and setting their own goals in language learning (Chapelle and Brindley 2010; Chen 2008). Additionally, as an instrument to explore and understand language performance, SA has also been adopted by the Common European Framework of Reference for Languages (CEFR; Council of Europe 2001), the European Language Portfolio and the Bergen "Can-Do" project (Hasselgreen 2000). National policies have also promoted the implementation of SA in classrooms in Japan and Korea (Butler 2018; Butler and Lee 2006). However, little is known about the implementation of SA in classroom settings in the Hungarian educational context.
The present study attempts to explore the SA of writing abilities of English majors at a university in Hungary. The findings might justify the rationale for the implementation of SA as a tool to promote second language (L2) writing development. The results of this study might contribute to the field by yielding empirical insights for investigating the characteristics of SA among English majors.
2 Theoretical and empirical background
2.1 Self-assessment
SA is defined as a process of formative assessment (Andrade 2019; Andrade and Du 2007), and SA practices involve learners reflecting on and evaluating the quality of their performance and their learning. Specifically, learners evaluate the extent to which their performance reflects explicitly stated objectives or criteria. SA practices also involve learners' identification of their own strengths and weaknesses in their performance, which learners revise accordingly (Andrade and Boulay 2003; Andrade and Du 2007; Goodrich 1996; Gregory et al. 2000; Hanrahan and Isaacs 2001; Paris and Paris 2001). In contrast to summative assessment, the focus in SA is on learning and improvement (Andrade 2019).
Previous research has shown that SA has numerous benefits. SA increases self-awareness of learning (Babaii et al. 2016; Oscarson 1989), fosters learner autonomy (Dann 2002; Oscarson 1989), promotes self-regulated learning (Butler 2016, 2018) and motivation (Birjandi and Tamjid 2012), and reduces anxiety (Bachman and Palmer 1996). In addition, a positive association has been found between SA and learner confidence and performance (Butler and Lee 2010; De Saint Léger 2009; Little 2009). SA has also been demonstrated to bridge the gap between learner perception and actual performance (Andrade and Valtcheva 2009) and reduce the disagreement between student and teacher assessment (Babaii et al. 2016; Chen 2008). Furthermore, SA has been found to expand the range of assessment; specifically, learners can gain more profound insight into their own learning as compared to an outsider (Oscarson 1989). SA also promotes a learner-centred curriculum (Little 2009). Finally, Kato (2009) found that students considered SA activities more helpful than goal-setting activities.
However, two of the biggest concerns about SA are its validity and reliability (Ashton 2014; Patri 2002). According to Butler (2018), these concerns can be addressed by investigating the relationships between SA and objective measures of language performance. Several empirical studies have investigated the relationship between SA and language performance measurements and found positive associations between them (e.g., Ashton 2014). In Li and Zhang's (2020) meta-analysis of SA and language performance, the overall correlation between SA and language performance was moderate (r = 0.466, p < 0.01), while in an earlier meta-analysis, Ross (1998) found that the correlations ranged from r = 0.52 to r = 0.65 across the four language skills. Li and Zhang's (2020) meta-analysis also revealed that listening had the strongest correlation (r = 0.486), followed by reading (r = 0.451) and speaking (r = 0.442). Writing skills showed the weakest correlation (r = 0.381), and this is in line with Ross's (1998) results. Li and Zhang (2020) attributed the relatively weak correlation between SA and writing abilities to the features of the criteria used. While the criteria employed for listening, reading, and speaking were predominantly adopted from well-established language proficiency scales (e.g., CEFR), the writing criteria were presented using vague dimensional descriptors (e.g., topic, content, and grammar). The broad and vague writing criteria may have led to greater confusion among learners on how to interpret the criteria, which might have resulted in a large variation in SA outcomes.

Self-assessment of writing abilities
There is a positive relationship between SA and teacher assessment (TA) of writing skills (Birjandi and Tamjid 2012; Liu and Brantmeier 2019; Matsuno 2009; Saito and Fujita 2004; Summers et al. 2019; Weigle 2010; Zheng et al. 2012) and between engagement in SA and L2 writing development (Wind 2021). However, the strength of these relationships ranges from weak to moderate. For example, Saito and Fujita (2004) found a weak correlation between SA and TA (r = 0.07), while Weigle (2010) detected moderate positive correlations between SA and TA (rater 1: r = 0.39, rater 2: r = 0.43). Investigating the writing abilities and SA accuracy of 106 Chinese learners of English, Liu and Brantmeier (2019) found that young learners are also able to accurately self-assess their writing. The researchers found a significant positive relationship between SA of writing and writing production (r = 0.30, p < 0.01), showing a small to medium effect size. In contrast, several studies found that SA might not be a reliable alternative to formal assessment. Matsuno (2009) found that peer-assessment can play a useful role in writing classes, whereas SA has "limited utility as a part of formal assessment" (2009: 75). Moreover, Summers et al. (2019) found weak correlations between SA and placement test results, which raises the question of whether SA can be used as a placement test.
Although there have been studies investigating the relationship between SA and writing, most investigations have adopted a cross-sectional research design. Therefore, little is known about the extent to which the accuracy of SA changes over time. To the best of our knowledge, there have been two studies (Birjandi and Tamjid 2012; Zheng et al. 2012) which investigated the development of SA accuracy in the EFL context. Birjandi and Tamjid's (2012) study explored the role of SA and peer-assessment (PA) in promoting language learners' writing performance. A total of 157 English as a foreign language teacher trainees were assigned to five groups (four experimental and one control group). The participants in Group 1 used journal writing as an SA technique, while Group 2 self-assessed their performance. The participants employed PA in Group 3, whereas the participants employed both SA and PA in Group 4. In addition, TA was employed in all experimental groups with the exception of Group 4. In the control group (Group 5), only TA was employed. The participants took a teacher-designed writing test at the beginning and another at the end of the investigation. The greatest improvements were observed in Group 2 and Group 3. Unfortunately, Birjandi and Tamjid (2012) did not give a detailed account of the writing tests used in their study. The authors stated that the participants were required to write a composition on "familiar topics" (Birjandi and Tamjid 2012: 520). However, it is not clear how many writing prompts were used and whether the same writing prompts were used at the beginning and at the end of the study.
In Zheng et al.'s (2012) study of students' SA in College English writing tests, 189 freshmen and sophomore students were instructed to assess their own writing over an eight-week period. It was found that students could self-assess their writing quite well. The researchers highlighted that the SA of writing developed as a result of instruction in the use of the scoring rubric. After receiving rater training, the participants showed significant (p < 0.05) improvement in their SA accuracy in writing. For example, the correlation increased from r = 0.39 to r = 0.55 in writing task 1, and from r = 0.46 to r = 0.69 in writing task 2. The changes were statistically significant at the p < 0.01 level. Therefore, their study, which focused on increasing SA accuracy through training, provides a solid baseline for further research. However, the order of the writing tasks in Zheng et al.'s (2012) study was not counterbalanced. Therefore, improvements in writing might be attributed to differences in difficulty between the three writing prompts used in their study.
In conclusion, positive correlations were found between student and teacher assessment in most recent studies (Birjandi and Tamjid 2012; Liu and Brantmeier 2019; Saito and Fujita 2004; Weigle 2010; Zheng et al. 2012). However, the pitfall is that the researchers could not gain insight into temporal changes in the development of SA practices owing to the cross-sectional research designs used. In addition, studies on the SA of writing abilities used teachers' scores or human raters only and did not consider more objective measures such as linguistic complexity indices calculated by computational tools. In the next section, we will discuss the most recent findings in the field of L2 writing development.

Linguistic complexity in second language writing development
L2 writing development has generally been investigated by measuring the constructs of complexity, accuracy, and fluency. Among the three constructs, complexity is the focus of this study. Linguistic complexity generally entails lexical and syntactic complexity. Both lexical and syntactic complexity are multidimensional constructs (Jarvis 2013; Norris and Ortega 2009). The results concerning the longitudinal developments of lexical and syntactic complexity in L2 writing are mixed. For example, Storch (2009) focused on changes in the academic writing of university students over one semester and found that participants' writing improved in structure and development of ideas but failed to improve in linguistic complexity. Likewise, Knoch et al. (2015) found that clause length increased while subordination decreased over a three-year period in the writing of 32 undergraduates. However, the changes were not statistically significant. Knoch et al. (2015) also found that word length decreased while lexical sophistication increased from the pre-instructional phase (Time 1) to the post-instructional phase (Time 2); nevertheless, these changes also did not reach statistical significance.
In contrast, Mazgutova and Kormos (2015) found statistically significant increases in lexical variability, lexical sophistication, and cohesion over one month in the writing of an intermediate-level group, while the upper-intermediate group's writing showed significant differences only in lexical sophistication. Statistically significant increases were also found in syntactic complexity in the intermediate group's writing, whereas significant differences were found in only one syntactic complexity index in the writing of the upper-intermediate group.
Although there has been an inconsistency in the definition and the operationalisation of linguistic complexity, as well as a huge variation in the duration of investigations in studies on L2 writing development, the general trend is that there are more changes at intermediate and upper-intermediate levels of proficiency (Mazgutova and Kormos 2015) than at higher levels of proficiency (Knoch et al. 2015; Storch 2009), and students appear to rely more on phrasal complexity than on subordination at advanced levels of proficiency (Halliday and Matthiessen 1999).

Research questions
Although several studies have investigated the SA of writing skills, there are few studies examining (1) whether SA instruction improves students' SA accuracy, (2) whether students' L2 writing develops as measured by TA scores and (3) whether students' L2 writing develops as measured by linguistic complexity indices. Therefore, we designed our study based on these three aims. To address this research gap, the present study attempts to answer the following research questions (RQs).
RQ 1: How does the relationship between self-assessment and teacher assessment total and sub-scores change over a semester-long advanced writing programme?
RQ 2: How does L2 writing change over a semester-long advanced writing programme as measured by the self-assessment and teacher assessment total scores?
RQ 3: How does L2 writing change over a semester-long advanced writing programme as measured by linguistic complexity indices?

Research design
This study employed an experimental design, with the experimental group receiving instruction on SA as opposed to the control group, which did not receive such instruction. Of the constructs of complexity, accuracy, and fluency, we focused on complexity alone for two main reasons. First, accuracy was not considered in our study as the students composed their second essay electronically at Time 2. Consequently, although students were directly asked not to use any external help, we cannot rule out the possibility that they used spell-check programmes or autocorrect functions. Second, fluency, usually measured by the total number of words produced within a specific time limit, was not considered since the word count was determined by the task.

Research context
At this university, English majors are required to pass two academic skills courses (Academic skills 1 and 2) focusing on paraphrasing, summarising, and synthesising skills (Tankó 2019). These two courses are completed in the first two terms of the Bachelor of Arts (BA) in the English programme. At the end of Academic skills 1, students are required to write a guided summary, while at the end of Academic skills 2, students are asked to write a synthesis. After completing the compulsory Academic skills courses, undergraduates are required to take the Advanced writing (AW) course aimed at improving their academic writing skills. However, in some cases there might be a year-long pause between the writing of the BA or the unified teacher training programme thesis and the completion of the AW courses. Consequently, being able to self-assess the quality of their writing might be a crucial skill for university students in Hungary.

The advanced writing course
The present research was conducted at a university in Budapest, Hungary, in AW courses during the spring term in 2020. The data for our study were collected from two AW courses taught by the authors. This, however, was not seen as an ethical issue because participation in this study was voluntary. The AW courses are usually held by different instructors, so there might be slight differences in the content, but the primary aim is to enhance students' academic writing skills mainly by practising argumentative essay writing. The course focuses on task-based approaches involving academic reading, academic writing, critical thinking, participation in academic discussions, debating skills, receiving feedback, peer-review, self-assessment, and oral presentations. After completing weekly assignments, the students received written feedback from the instructors on how to improve their academic writing skills. The following criteria were highlighted by the instructors: forming an effective thesis statement, cohesion and coherence, paraphrasing, APA formatting of references, grammatical range and accuracy, vocabulary, quality of argumentation, style, punctuation, and paragraphing. The AW course was an ideal setting for this research for two main reasons. First, it was an intensive course focusing on writing development through detailed feedback from the instructors. Second, the participants had some prior knowledge about essay writing since they had taken the Academic skills 1 and 2 courses as prerequisites for the AW course.

Participants
The participants in this study were 33 students who enrolled in two AW courses. All of the students in the AW programme agreed to take part in our study. The students were selected by convenience and criterion sampling (Dörnyei 2007). The students were assigned to the two AW courses (AW Course 1 and AW Course 2) based on their registration in the university's system. AW Course 1 was instructed by the researcher who had no experience with SA, while the researcher who taught AW Course 2 had more experience with SA. Students in AW Course 1 (control group) did not receive SA instruction, whereas students in AW Course 2 (experimental group) received regular SA instruction.
The students were English majors around 20-25 years of age. The L1 of the participants was predominantly Hungarian (n = 29). However, there were also four international students (Chinese, Romanian, and Spanish). Students are eligible to register on the AW course upon successful completion of the Academic skills 1 and 2 courses. As the participants are at least third year English majors, the assumed level of their overall language proficiency was around the IELTS score of 7 (i.e., C1 based on the CEFR; Council of Europe 2001). The reason for this assumption is that first-year students at this university have to pass a Language Proficiency Exam (LPE) assessing their command of English at B2, B2+ and C1 levels as defined by the CEFR standards. The requirements and task types are based on the contents of a language practice book written by Vince and Sunderland (2003), and the exam was developed by item writers. First-year students have to complete the LPE successfully to continue their studies. Table 1 is a summary of the participants' gender distribution and L1 background.

Instruments
The participants were asked to write two argumentative essays, one in the pre-instructional phase (Time 1) and one in the post-instructional phase (Time 2). The order of the tasks was counterbalanced; therefore, in February 2020, half of the participants in each group (control and experimental) were asked to complete Task A and the other half Task B. In May, after the experimental group had received regular SA instruction, the participants from both groups were asked to submit the second argumentative essay. We chose topics related to the field of language learning as the participants are most familiar with this area. The writing prompts, piloted in Wind (2018), were the following:
Task A: A native language teacher is always better than a non-native one. To what extent do you agree?
Task B: The older you get, the more difficult it is to learn a foreign language. To what extent do you agree?
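The counterbalancing procedure described above can be illustrated with a short sketch. It is not the authors' allocation procedure: the participant identifiers, the random shuffling, and the function name `counterbalance` are hypothetical and serve only to show how half of a group can be given Task A at Time 1 and Task B at Time 2, with the order reversed for the other half.

```python
import random


def counterbalance(participants, seed=0):
    """Assign task order so that half of a group starts with Task A
    and the other half with Task B (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed keeps the allocation reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2          # with an odd group size, one extra person starts with B
    orders = {}
    for index, participant in enumerate(shuffled):
        first, second = ("A", "B") if index < half else ("B", "A")
        orders[participant] = {"time1": first, "time2": second}
    return orders


# Hypothetical identifiers; the real group sizes are not reproduced here.
control_group = [f"C{i:02d}" for i in range(1, 9)]
allocation = counterbalance(control_group, seed=1)
for participant in sorted(allocation):
    print(participant, allocation[participant])
```

Applying the function to each group separately keeps the task order balanced within both the control and the experimental group.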
Immediately after composing the 200-word-long argumentative essay, the participants were asked to self-assess their essay using a rubric based on a 5-point scale (see Table 2). The CEFR and IELTS band score equivalence of each self-assessment rubric score is also displayed in Table 2. More details regarding the instruments can be found in the Appendix. The writing rubric included the following four criteria: (1) task response, (2) coherence and cohesion, (3) vocabulary, and (4) grammatical range and accuracy. The students were asked to rate their essays on a 5-point scale ranging from "bad" to "excellent" (1 = bad, 2 = poor, 3 = mediocre, 4 = good, 5 = excellent). At Time 1, the participants were informed about the CEFR level corresponding to each SA score. Time 1 was the pilot phase; the only issue, detected by one participant, concerned a spelling error in the instructions of one of the tasks. Other than that, no misunderstanding occurred in the completion of the tasks.
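For readers who wish to store such ratings in a structured form, a minimal sketch of the rubric follows. The four criteria and the scale labels come from the description above; the class name `RubricScore` is hypothetical, and the total score is assumed here to be the sum of the four sub-scores, which the text does not state explicitly.

```python
from dataclasses import dataclass

CRITERIA = ("task_response", "coherence_cohesion", "vocabulary", "grammar")
LABELS = {1: "bad", 2: "poor", 3: "mediocre", 4: "good", 5: "excellent"}


@dataclass(frozen=True)
class RubricScore:
    """One self- or teacher-assessment of a single essay (hypothetical container)."""
    task_response: int
    coherence_cohesion: int
    vocabulary: int
    grammar: int  # grammatical range and accuracy

    def __post_init__(self):
        # Reject anything outside the 5-point scale.
        for name in CRITERIA:
            value = getattr(self, name)
            if value not in LABELS:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")

    @property
    def total(self):
        # ASSUMPTION: total score = sum of the four sub-scores (range 4 to 20).
        return sum(getattr(self, name) for name in CRITERIA)


example = RubricScore(task_response=4, coherence_cohesion=3, vocabulary=5, grammar=4)
print(example.total, LABELS[example.vocabulary])
```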

Data collection procedures
Data collection took place twice during the term, at the beginning and at the end of the second semester of the academic year 2019/2020. The course was planned to include 90 min of instruction per week, but this plan was disrupted by school lockdowns due to COVID-19. From mid-March, instruction and feedback were only provided online. Thus, students completed the first half of the research project in class before the pandemic and the second half through distance learning. This was not seen as a substantial drawback, however, mainly because out-of-class SA is not expected to differ greatly from in-class SA. The students were given a writing prompt and asked to compose an argumentative essay of at least 200 words in 30 min. They were asked to rely on their own experience and knowledge and were not allowed to use dictionaries. However, a limitation is that the use of dictionaries could not be controlled by the course instructors at Time 2. Immediately after finishing their essay, the students completed a writing rubric evaluating their own work.

Data analyses
The final mini corpus consisted of 66 essays totalling 16,920 words. We used web-based computational tools including Coh-Metrix 3.0, the L2 Syntactic Complexity Analyzer (L2SCA), and the Word and Phrase software to measure cohesion, syntactic complexity, and the percentage of genre-specific lexical items, respectively. We used these programmes because coding the texts manually would have been time-consuming. The texts were checked for spelling mistakes and non-existent words beforehand in order to ensure that the programmes would be able to identify and analyse the lexical items. To find the appropriate equivalents of the self-assessment scores, IELTS descriptors were used along with the CEFR scale (Council of Europe 2001).

Statistical analyses
First, we ran normality tests. The results of the Kolmogorov-Smirnov (K-S) tests indicated that the total scores were normally distributed (p > 0.05), with skewness and kurtosis within the acceptable ±2 range. However, the assessment sub-scores showed non-normal distributions, with a significant K-S statistic (p < 0.05) and skewness and kurtosis values outside the acceptable range. Although the total scores were normally distributed, the researchers opted for non-parametric tests because of the relatively small sample size. An additional reason for using non-parametric tests is that previous studies in the field with similar research designs also used them to analyse small-scale data (e.g., Mazgutova and Kormos 2015). The Wilcoxon signed-rank test, the non-parametric equivalent of the paired samples t test, was applied to analyse the differences between Time 1 and Time 2 within each group. Cohen's d was calculated using Excel to check the effect size or standardised mean difference (Cohen 1988), as it may be of crucial practical importance for researchers (Lakens 2013). It must be noted that since the SA data at Time 2 were missing for three participants, the researchers decided to use the SA score from Time 1 in these three instances. This seemed to be the best option since the researchers aimed to avoid a type I error, that is, arriving at a false positive result by claiming that there were significant differences where there were none. Statistical analyses were conducted with SPSS version 22. To check inter-rater reliability, Cohen's kappa, which measures the strength of agreement between two raters or coders (Altman 1991), was calculated. As can be seen in Table 3, the inter-rater reliability in both phases reached a substantial level of agreement based on Landis and Koch (1977) and a good agreement according to Altman (1991).
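The paired comparison and the handling of missing Time 2 scores described above can be sketched in plain Python. This is an illustration rather than the SPSS procedure itself: it assumes the asymptotic (normal approximation) form of the Wilcoxon signed-rank test with a tie correction and no continuity correction, and the function names and example scores are hypothetical.

```python
from collections import Counter
from math import erfc, sqrt


def carry_forward(time1, time2):
    """Replace a missing Time 2 score (None) with the same participant's Time 1 score."""
    return [t1 if t2 is None else t2 for t1, t2 in zip(time1, time2)]


def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def wilcoxon_signed_rank(time1, time2):
    """Asymptotic Wilcoxon signed-rank test for paired scores (ASSUMED variant)."""
    diffs = [b - a for a, b in zip(time1, time2) if b != a]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    magnitudes = [abs(d) for d in diffs]
    ranks = average_ranks(magnitudes)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    tie_term = sum(t ** 3 - t for t in Counter(magnitudes).values()) / 48
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term
    z = (min(w_plus, w_minus) - n * (n + 1) / 4) / sqrt(variance)
    p_two_sided = erfc(abs(z) / sqrt(2))
    return {"n": n, "w_plus": w_plus, "w_minus": w_minus, "z": z, "p": p_two_sided}


# Hypothetical total scores for six participants; one Time 2 score is missing.
scores_t1 = [10, 12, 11, 14, 13, 15]
scores_t2 = carry_forward(scores_t1, [12, 15, None, 18, 19, 20])
print(wilcoxon_signed_rank(scores_t1, scores_t2))
```

Carrying the Time 1 score forward produces a zero difference for that participant, so the imputed cases cannot contribute to a significant change, which matches the stated aim of avoiding a type I error. With samples this small, an exact test may be preferable to the normal approximation.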

Linguistic complexity
In this study, linguistic complexity was harmonised with the SA rubric used to score the essays. Therefore, the constructs of (1) cohesion, (2) lexical and (3) syntactic complexity were considered. Cohesion was measured by the all connectives (CNCAll) index, using Coh-Metrix 3.0 (Graesser et al. 2004, 2011). Connectives (e.g., because, whereas, moreover) are important in creating cohesive connections between ideas and clauses, and they even give hints about the organisation of texts (Cain and Nash 2011; Crismore et al. 1993; Longo 1994; Sanders and Noordman 2000; van de Kopple 1985). In our study, it was expected that the incidence of all connectives might increase over time.
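To make the notion of connective incidence concrete, the sketch below counts connectives per 1,000 words, which is how Coh-Metrix expresses its incidence scores. It is only an approximation of the idea behind CNCAll: the word list is a tiny illustrative sample (the first three items are the examples given above, the rest are added here), multi-word connectives are ignored, and Coh-Metrix's own list and tokenisation are not reproduced.

```python
import re

# Illustrative sample only; NOT the Coh-Metrix connective list.
SAMPLE_CONNECTIVES = frozenset(
    {"because", "whereas", "moreover", "however", "therefore", "although"}
)


def connective_incidence(text, connectives=SAMPLE_CONNECTIVES):
    """Return the number of single-word connectives per 1,000 word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in connectives for token in tokens)
    return 1000 * hits / len(tokens)


essay = "I agree because age matters. However, motivation matters more."
print(round(connective_incidence(essay), 2))
```

Normalising by text length matters here because essays written to the same prompt still differ in word count, so raw counts would not be comparable across Time 1 and Time 2.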
Lexical complexity is a multidimensional construct composed of at least three main sub-constructs: (1) lexical density, (2) lexical sophistication, and (3) lexical variability (Jarvis 2013). However, in this study we focused on the development of students' academic vocabulary. Therefore, the percentage of academic words was measured in the texts by the academic vocabulary list (AVL) index, computed by the Word and Phrase software (Gardner and Davies 2014).
Although syntactic complexity is a multidimensional construct (Norris and Ortega 2009), a general index, the mean length of clause (MLC), was calculated by the L2 Syntactic Complexity Analyzer (L2SCA; Ai and Lu 2013; Lu 2010, 2011; Lu and Ai 2015). Both Verspoor et al. (2017) and Wind (2021) have claimed that the MLC index is a reliable indicator of general syntactic complexity.

Ethical considerations and quality control
All 33 students in the course participated voluntarily in the present research project and were preliminarily informed that they had the right to opt out of the study at any time and that their anonymity was protected throughout the study. In order to ensure intercoder reliability and retain the objectivity of the analysed texts as much as possible, Cohen's kappa was computed. The tasks piloted by Wind (2018) were piloted in the present study on the first occasion (Time 1) and proved to be understandable for the participants as no misunderstandings occurred.
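The agreement statistic mentioned above can be computed as in the following sketch, which implements unweighted Cohen's kappa for two raters and attaches the verbal labels of Landis and Koch (1977) and Altman (1991). The ratings are hypothetical, and it is an assumption that an unweighted rather than a weighted kappa was used in the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same essays."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of essays")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (observed - expected) / (1 - expected)


def agreement_labels(kappa):
    """Return (Landis and Koch 1977, Altman 1991) verbal labels for a kappa value."""
    bands = [
        (0.80, "almost perfect", "very good"),
        (0.60, "substantial", "good"),
        (0.40, "moderate", "moderate"),
        (0.20, "fair", "fair"),
    ]
    for lower_bound, landis_koch, altman in bands:
        if kappa > lower_bound:
            return landis_koch, altman
    return "slight or poor", "poor"


# Hypothetical scores given by two teachers to eight essays.
teacher_1 = [3, 4, 4, 5, 3, 4, 5, 2]
teacher_2 = [3, 4, 5, 5, 3, 4, 4, 2]
kappa = cohens_kappa(teacher_1, teacher_2)
print(round(kappa, 3), agreement_labels(kappa))
```

Because rubric scores are ordinal, a weighted kappa that gives partial credit for near agreement would be a reasonable alternative; the unweighted form is shown only because it is the simplest.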
4 Results and discussion
4.1 RQ 1: How does the relationship between self-assessment and teacher assessment total and sub-scores change over a semester-long advanced writing programme?
In order to answer RQ1, we first analysed the SA total scores, which were correlated with the corresponding TA total scores at the beginning (Time 1) and at the end of the AW courses (Time 2). Table 4 shows that there were positive associations between the SA and the TA total scores in both groups (control and experimental) at Time 1 and Time 2. In addition, the correlation coefficient between the SA and TA total scores was statistically significant at Time 2 in the experimental group, indicating a moderate positive relationship based on Muijs (2004) (r = 0.502, p < 0.05). Overall, SA accuracy improved in both groups over the semester-long AW programme; nevertheless, it must be noted that the improvement is statistically significant (p < 0.05) only in the experimental group, which indicates that students receiving SA instruction showed considerable improvement in their SA accuracy.
Following the analysis of the total SA scores, a correlational analysis was computed between SA and TA sub-scores at Time 1 and Time 2. Table 4 shows that there were predominantly positive relationships between the SA and TA sub-scores in both groups. There were weak negative correlations between the SA and TA scores on grammatical range and accuracy at Time 2 in both groups, and there was a weak negative association between SA and TA scores on task response at Time 1 in the experimental group. The SA-TA correlations were statistically significant (p < 0.05) for task response and vocabulary at Time 1 and for vocabulary at Time 2 in the control group, while the SA-TA correlation was statistically significant for task response at Time 2 in the experimental group.
The positive correlations found in our study between SA and writing performance are in line with the results of previous studies (Liu and Brantmeier 2019; Matsuno 2009; Saito and Fujita 2004; Summers et al. 2019; Weigle 2010). Nevertheless, compared to the correlation coefficient between SA and writing skills (r = 0.525) reported in Ross's (1998) meta-analysis, the correlations in the present study were weaker. In contrast, the SA-TA correlation coefficients (total scores) in the control group at Time 1 and Time 2 and in the experimental group at Time 2 in our study were stronger than the correlation coefficient between SA and writing (r = 0.381) reported in Li and Zhang's (2020) meta-analysis. According to Boud and Falchikov (1989), familiarity with SA might have an effect on the correlation between SA and language abilities. In our study, the relatively weak correlation coefficients might be attributed to the participants' unfamiliarity with SA practices. Thus, it can be concluded from both correlational analyses of the total and sub-scores of SA and TA that students receiving SA instruction tended to improve their SA accuracy, and this result is statistically significant.
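As an illustration of how an SA-TA correlation of this kind can be obtained, the sketch below computes Spearman's rank correlation in plain Python. The choice of Spearman's coefficient is an assumption made here for consistency with the non-parametric approach of the study; this section does not name the coefficient behind the reported r values, and the scores are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return covariance / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical SA and TA total scores for six students.
self_assessed = [14, 16, 12, 18, 15, 13]
teacher_assessed = [13, 17, 12, 16, 15, 14]
print(round(spearman(self_assessed, teacher_assessed), 3))
```

Computing the coefficient separately for each group and each time point allows the SA-TA relationship at Time 1 to be compared with that at Time 2.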
4.2 RQ 2: How does L2 writing change over a semester-long advanced writing programme as measured by the self-assessment and teacher assessment total scores?
To answer RQ2, Wilcoxon signed-rank tests were calculated to compare the SA scores at Time 1 and Time 2 and the TA scores at Time 1 and Time 2. The descriptive statistics for the SA and TA total scores are displayed in Table 5. Both the SA and the TA scores increased from Time 1 to Time 2 in both groups, indicating an improvement in SA accuracy.
However, the results of the Wilcoxon signed-rank tests, displayed in Table 5, show a statistically significant difference only for the TA scores in the control group (Z = −2.306, p = 0.021), with a large effect size (d = 0.88). Thus, the change from Time 1 to Time 2 in the TA total scores points not only to the statistical but also to the practical significance (Kirk 1996) of this result. This means that the result, besides not being due to chance, may also have notable importance for writing practices.
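The paired Time 1 versus Time 2 comparison can be sketched as follows. This is a simplified illustration of a Wilcoxon signed-rank test using the normal approximation, without the tie or continuity corrections that statistical packages typically apply, so its Z values would not necessarily match those in Table 5. The function names and the scores are hypothetical, and the effect size is not computed because the text does not specify how d was derived.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(time1, time2):
    """Return (W+, W-, z, two-sided p) for paired scores.

    Assumption: normal approximation with no tie or continuity correction.
    """
    # Pairs with no change carry no information about direction and are dropped.
    diffs = [b - a for a, b in zip(time1, time2) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Under the null hypothesis, the rank sum has this mean and standard deviation.
    mean_w = n * (n + 1) / 4
    sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean_w) / sd_w
    p = erfc(abs(z) / sqrt(2))  # two-sided p-value from the standard normal
    return w_plus, w_minus, z, p


# Hypothetical TA total scores for eight students; illustration only.
ta_time1 = [10, 12, 11, 14, 13, 15, 12, 16]
ta_time2 = [12, 13, 14, 15, 17, 16, 18, 15]

w_plus, w_minus, z, p = wilcoxon_signed_rank(ta_time1, ta_time2)
print(f"W+ = {w_plus}, W- = {w_minus}, z = {z:.3f}, p = {p:.3f}")
```

Because z is computed from the smaller rank sum, it is negative whenever the two rank sums differ, which is consistent with the sign convention of the Z value reported above.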
The results of our study are also consistent with the findings of studies which investigated the development of SA accuracy in the EFL context (Birjandi and Tamjid 2012; Chen 2008; Zheng et al. 2012). Birjandi and Tamjid (2012) found that SA accuracy improved over a semester, while Chen (2008) detected development in SA accuracy over 12 weeks; Zheng et al. (2012) reported improvements in SA accuracy over an eight-week period. However, Chen (2008) focused on oral performance with two weeks of training and 10 weeks of SA and TA.
4.3 RQ 3: How does L2 writing change over a semester-long advanced writing programme as measured by the linguistic complexity indices?
To answer RQ3, Wilcoxon signed-rank tests were performed to compare linguistic complexity indices at Time 1 and Time 2. The descriptive statistics of the linguistic complexity indices are displayed in Table 6. Interestingly, the cohesion and lexical complexity indices increased in the students' essays in the control group but decreased in the students' essays in the experimental group. These results suggest that students' essays in the control group tended to become more cohesive and contain more academic words. Table 6 also demonstrates that the syntactic complexity index decreased in both groups. This result indicates that the students tended to shorten their clauses in their essays. However, the results of the Wilcoxon signed-rank tests did not show statistically significant differences for the linguistic complexity indices. Limited changes in lexical and syntactic complexity are not infrequent in the literature on L2 writing development. For example, Knoch et al. (2015) also found no statistically significant changes in lexical and syntactic complexity measures except for fluency over a three-year degree study at a university in Australia. Likewise, Storch (2009) found no statistically significant changes in complexity and accuracy in a semester-long study at a university. The limited improvements in our study can be explained by two possible reasons. First, the duration of the investigation was relatively short compared to Knoch et al.'s (2015) three-year-long study. Second, the proficiency level of the participants was relatively high (around B2, B2+, C1 CEFR level). It can be presumed that at higher levels of language proficiency, EFL learners make improvements in fewer areas of linguistic complexity.
For example, Mazgutova and Kormos (2015) found that the lower-proficiency (intermediate) group in their study improved in more areas of linguistic complexity than the higher-proficiency (upper-intermediate) group. Another possible reason for the relative stagnation of L2 writing development is the limited functioning of self-regulatory processes, which are closely linked to SA. For example, Wind and Harding (2020) found that the limited use of self-regulatory processes contributed to the stagnation of the development of linguistic complexity in L2 writing.

Conclusions and pedagogical implications
The present results have important implications for teaching writing courses at universities as well as for language centres dedicated to improving students' writing skills. First, our study shows that SA instruction leads to improvement in SA accuracy over a semester-long AW programme. Our results are in line with Chen's (2008) conclusion that regular feedback and practice result in improvement in learners' ability to assess their own writing. Additionally, based on the TA total scores, the students in the control group significantly developed their writing skills over a semester-long period. However, this improvement was not clearly evidenced by the changes in the linguistic complexity indices, since none of the complexity indices showed significant increases over time. Second, self-perceived weaknesses in writing (e.g., the inability to produce clear, smoothly flowing, complex essays in terms of language as well as content) can inform instructors so that they can adjust their writing instruction accordingly. Such washback effects might facilitate the promotion of learner-centred pedagogy, which is particularly needed in Hungarian universities and during the COVID-19 pandemic. Teacher-centred instruction and teaching towards examinations might hinder learner autonomy and prevent students from independently setting goals, making decisions about their learning, or taking steps to reduce possible weaknesses. Along with Liu and Brantmeier (2019), we can conclude that employing SA might promote learner autonomy and university students' self-regulation.
Our study has some limitations, which should be followed up by further research. First, the number of participants (N = 33) was relatively low compared to other studies on the SA of writing skills (Birjandi and Tamjid 2012; Liu and Brantmeier 2019; Saito and Fujita 2004; Weigle 2010; Zheng et al. 2012).
However, in Mazgutova and Kormos's (2015) study, the Wilcoxon signed-rank test was performed on a smaller sample (n = 12) than in our study (n = 16). Due to the relatively low number of participants, individual differences might have obscured some features that might have emerged in a study with a larger sample size. Consequently, future studies should replicate this research with a larger sample.
Second, despite the fact that our findings tended to show positive correlations between SA and writing, these correlations may not entirely capture SA accuracy (Ashton 2014). The positive correlations only implied a possible trend that university students could accurately self-assess their writing performance. Additional studies on whether university students at different levels (and not only English majors) over- or under-estimate their writing skills are therefore necessary before any generalisations can be made.
Third, along with Liu and Brantmeier (2019), by only looking at the positive correlations detected in our study, we cannot verbalise how university students respond to SA items; therefore, this is yet to be examined to provide recommendations for important stakeholders (e.g., language teachers). Subsequently, further research would be indispensable for exploring the full process of SA and for understanding what leads to more accurate SA (Liu and Brantmeier 2019). Accordingly, Butler (2018) stressed that SA has a socially complex and cognitively demanding nature.
One possible future direction of research is to investigate the moderating effects of a number of variables that might play important roles in SA such as the type of criteria used in SA, the presence and form of SA criteria, SA training, the types of SA measurements, their reliability, and the number of items the SA measurement includes (Li and Zhang 2020). Furthermore, it would be worthwhile to investigate possible developments in writing and self-assessment with the same participants over at least two consecutive semesters to allow more time for improvement.
Research funding: This study was funded by the Scientific Foundations of Education Research Program of the Hungarian Academy of Sciences.

Writing task B
Name: Date:
You should spend about 30 min on this task. Write about the following topic:
The older you get, the more difficult it is to learn a foreign language. To what extent do you agree?
Give reasons for your answer and include any relevant examples from your own knowledge or experience. Write at least 200 words.

Self-assessment
After completing the writing task, please rate your essay based on the following criteria (5 = excellent, 4 = good, 3 = mediocre, 2 = poor, 1 = bad).

References