Does Early Educational Tracking Contribute to Gender Gaps in Test Achievement? A Cross-Country Assessment

On average, boys score higher on math achievement tests and girls score higher in reading; these gaps increase between primary and secondary school. Using PISA, PIRLS, and TIMSS data, we investigate the role of early educational tracking (sorting students into different types of secondary schools at an early age) in gender gaps in test achievement, using a cross-country difference-in-differences framework. We find strong evidence that early tracking increases gender differences in reading. For math test scores, we do not find consistent evidence that early tracking contributes to the gender gap.


Introduction
Results from large-scale international achievement tests such as PISA (Programme for International Student Assessment), TIMSS (Trends in International Mathematics and Science Study), and PIRLS (Progress in International Reading Literacy Study) reveal gender gaps in student test scores. On average, girls score higher than boys in reading, while boys outperform girls in math (Guiso et al. 2008;OECD 2009a;OECD 2016b;Mullis et al. 2015). At the mean, the reading gap is more pronounced than the mathematics gap, but high-scoring boys perform better than high-scoring girls in math (Baye and Monseur 2016;Bedard and Cho 2010). A particularly worrying pattern emerges from the data: the gaps in test scores that exist at the end of primary school become more pronounced as school children age. The gap at the end of lower secondary school is larger than the gap at the end of primary school (Baye and Monseur 2016;Fryer et al. 2010). Cultural differences across countries may help explain the widening gaps. For example, the gap in math is less pronounced in gender-egalitarian countries (Else-Quest et al. 2010;Stoet and Geary 2015). Educational institutions across countries may also differently affect girls' and boys' schooling performance. The level of standardization in school (Connor et al. 2013), the sex composition in classes (Lee and Bryk 1986), and variation in students' socio-economic background (Legewie and DiPrete 2012) have all been found to impact girls' and boys' performance differently.
In this paper we focus our analysis on the effect of one particular institutional school setting: the age at which students are first sorted into different types of secondary school. In early tracking countries, lower secondary schools sort students into tracks depending on their primary school performance; in late tracking countries, all students attend the same lower secondary school with similar educational trajectories. 1 Several studies have investigated the impact of early tracking on overall educational inequality, finding that it disadvantages the lowest-performing students (Ammermüller 2005;Hanushek and Wößmann 2006;Montt 2011). Other assessments of early tracking show that it has negative effects on political engagement (Van de Werfhorst 2017); leads to worse test performance for the children of immigrants who do not speak the language of the testing country at home (Ruhose and Guido 2016); and, in contrast to these other findings, has slightly equalizing effects on the health outcomes of people across educational classes (Delaruelle et al. 2019).
The contribution of the present paper is to study the effect of early tracking on the gender gap in math and reading test scores. The existing literature on early tracking and gender test score gaps has not reached a consensus. For reading, the findings in van Hek et al. (2019) suggest that early tracking gives an advantage to boys, while the analysis in Hermann and Kopasz (2019) suggests the opposite. For math, too, the results are mixed. Bedard and Cho (2010) find that early tracking is associated with a larger gender gap in achievement; Hermann and Kopasz (2019) find that early tracking corresponds with a smaller gender gap; and Bodovski et al. (2020) find no statistically significant relationship between early tracking and the gender gap in math scores. A main reason for the discrepancy in results is the inconsistency in the data and, especially, in the methods applied. None of these papers employs an empirical strategy that could support a causal interpretation of the results. Their insights are very important to the literature, but assessing the effect of early tracking on gender gaps in test performance requires several further steps. In particular, the few existing studies looking at this question either leave out fundamental variables that can help explain achievement gaps (such as per-pupil expenditures or class time spent working on a particular subject); do not control for existing gaps in test scores at the end of primary school; or do not use the information provided by the complex survey design of the data.
1 The age of first tracking usually does not differ across regions within a country. Notable exceptions include Germany, the effects of whose region-specific tracking systems have been studied widely (e.g. Schindler and Bittmann 2021; Traini et al. 2021). Other exceptions have provided the context for studies looking at the relationship between early versus late tracking and gendered educational inequality in Finland (Pekkarinen 2008) and Northern Ireland (Maurin and McNally 2007).
In this paper, we use a cross-country difference-in-differences approach, exploiting variation in countries' age of first tracking as well as information on test scores from before tracking (primary school, using TIMSS and PIRLS data) and after tracking (secondary school, using PISA data) to identify the effect of early tracking. Our empirical approach is very similar to that taken by Ruhose and Guido (2016), who study the effect of early tracking on the gap in test scores between children of immigrants and children of native-born parents.
Two theories help frame our understanding of why early tracking could contribute to a gender gap in test scores. The first is the "maturity hypothesis," which can explain girls' higher scores in reading. There are two streams to this hypothesis, both based on the large literature showing that boys mature later than girls (e.g. DiPrete and Jennings 2012; Tanner 1990). The streams differ in their assumptions about the age at which the maturity gap emerges. The first stream says that the maturity gap emerges at a young age, before even early tracking occurs, giving girls in early tracking countries better chances of being selected for a more rigorous academic track. Since all students perform better when they are grouped with higher achievers (Altermatt and Pomerantz 2005; Hanushek and Kain 2003; Huang 2009; Robertson and James 2003), girls placed onto higher tracks because of their greater maturity will benefit from early tracking more than less-mature boys (Bedard and Cho 2010; Jürges and Schneider 2011). Early sorting does indeed seem to be related to a large gap in reading scores: even controlling for test achievement at the end of primary school, the gap in reading in early tracking countries is larger than in late tracking countries (Jürges and Schneider 2011). The second stream of the maturity hypothesis says that the maturity gap emerges later. 2 In this case, it will instead be late tracking that benefits girls (Anderson et al. 2001; Keulers et al. 2010; Pekkarinen 2008). Depending on when the maturity gap emerges and how it aligns with a country's tracking age, girls can gain a larger or smaller advantage from tracking. The two streams of the maturity hypothesis help us understand the size and timing of girls' better performance in reading, but cannot explain why boys score higher in mathematics.
A second theory can help explain girls' lower average performance in mathematics: that of gender roles and socialization that prescribe math as a "boy" subject. Girls are less likely to participate in mathematics courses and boys are underrepresented in non-science subjects in secondary school (Langen et al. 2008; Pinxten et al. 2012), in part because students adapt their focus in schooling based on gender roles prescribed by parental, teacher, and peer beliefs about gender identity (Gallagher and Kaufman 2004; Hadjar et al. 2014; Ma 2001). We suggest that socialization can help explain why models of major choice based solely on student ability predict that more girls should be in science, mathematics, economics, and engineering majors, and more boys in humanities, psychology, and social science-oriented tracks, than is actually the case (Hyde and Mertz 2009; Turner and Bowen 1999). Indeed, ethnographers observe that high school students classify themselves into social categories and try to find the best match between their own and their peers' social identity (Eckert 1989). Moreover, teachers' prejudices regarding girls' and boys' abilities in certain areas might affect how they treat and grade boys and girls (Hörstermann et al. 2010; Li 1999; Ziegler et al. 1998). Early tracking can lead to disproportionate sorting into gender-typical fields because young children may be more inclined to follow the socially assigned path for their biological sex, which would increase gender gaps in both math (in favor of boys) and reading (in favor of girls).

Empirical Strategy
To study the link between early tracking and gender gaps in test scores, we use a difference-in-differences (DiD) approach, following the progression in the relevant literature from Hanushek and Wößmann (2006) to Ruhose and Guido (2016) and Hermann and Kopasz (2019). Hermann and Kopasz (ibid.) are alone in having applied cross-country DiD estimation methods to the analysis of early tracking and gender differences in test scores. They used multilevel modelling and found that early tracking raises girls' achievement test scores. The present study goes beyond the work of Hermann and Kopasz (ibid.) in four ways. First, we apply the appropriate weights and multiple imputation scheme in our analysis, which is necessary to make full and accurate use of the information in the data sets. 3 Second, we use more recent waves of the data (2015 for PISA and 2011 for TIMSS/PIRLS); in the data section below we discuss why we use these waves instead of the most recent available data (TIMSS 2019, PIRLS 2016, and PISA 2018). Third, we include a battery of robustness checks to verify and deepen the understanding of our results. Fourth, we include important country- and student-level information that justifies the use of DiD to identify causal effects.
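As an illustration of what the multiple imputation scheme involves, the sketch below pools a coefficient that has been estimated once per plausible value, using the standard combining rules for multiply imputed data (Rubin's rules). It is a minimal sketch, not the authors' code: the function name and the example numbers are hypothetical, and the sampling variances are assumed to have been computed beforehand with the survey's own variance estimation method.

```python
import math
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Pool one coefficient estimated separately on each plausible value."""
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need matching lists with at least two entries")
    pooled = mean(estimates)                  # pooled point estimate
    within = mean(sampling_variances)         # average sampling variance
    between = variance(estimates)             # variance across plausible values (divides by m - 1)
    total = within + (1 + 1 / m) * between    # total variance of the pooled estimate
    return pooled, math.sqrt(total)


# Hypothetical estimates of one coefficient from five plausible values
estimates = [-5.0, -6.0, -5.5, -6.5, -7.0]
sampling_variances = [4.0, 4.0, 4.0, 4.0, 4.0]

coef, se = combine_plausible_values(estimates, sampling_variances)
print(coef, se)
```

Using only one plausible value, or averaging the plausible values before estimation, would ignore the between-imputation component and understate the standard error.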
We follow the same empirical strategy as Ruhose and Guido (2016) to test for the effect of early tracking on test score gaps. We define countries as early tracking if they track students for the first time before the age of 15 (we test the sensitivity of this cut-off point in robustness checks below). We estimate equation (1), in which Y_ics is the test score of individual i in country c; the index s denotes the school level (primary or secondary) at which the test is taken. ET_c is an indicator equal to 1 if the country is an early tracking country. SEC_s is an indicator equal to 1 if the individual is in secondary school; this variable can be interpreted as a grade fixed effect. G_i is an indicator equal to 1 for the lower achieving sex, that is, it is equal to 1 for girls in the models predicting math scores and equal to 1 for boys in the models predicting reading scores. μ_c is a country fixed effect and captures all country-specific time-invariant factors. X is a vector of individual-level background variables, capturing information on student age; minutes spent in mathematics and reading in school; the number of books at home; and father's and mother's education. We interact all of these individual-level background variables with the indicator of whether a student is in secondary school via SEC_s × X. Z is a vector of four country-level time-variant control variables: GDP growth; the year-on-year change in educational expenditures; the share of female teachers; and an index measure of a country's gender equality. We interact these country-level measures with the gender indicator, via Z × G_i, to test whether these country-level characteristics affect male and female students differently.
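The displayed form of equation (1) is not reproduced in this text. Based solely on the variable definitions in this section, a specification of roughly the following form would be consistent with them. This is a reconstruction under assumptions, not the authors' exact equation: the numbering of the coefficients other than β_1 (the triple interaction), the coefficient vectors γ and δ, and the error term ε_ics are our own notation. The main effect of ET_c is omitted because it is absorbed by the country fixed effect μ_c.

```latex
\begin{equation}
\begin{aligned}
Y_{ics} ={}& \beta_1 \,(ET_c \times SEC_s \times G_i)
            + \beta_2 \,(ET_c \times SEC_s)
            + \beta_3 \,(ET_c \times G_i)
            + \beta_4 \,(SEC_s \times G_i) \\
           &+ \beta_5 \, SEC_s + \beta_6 \, G_i
            + \mathbf{X}'\gamma_1 + (SEC_s \times \mathbf{X})'\gamma_2
            + \mathbf{Z}'\delta_1 + (\mathbf{Z} \times G_i)'\delta_2
            + \mu_c + \varepsilon_{ics}
\end{aligned}
\end{equation}
```

Under this form, β_1 measures how much more the score of the lower achieving sex changes, relative to the higher achieving sex, between primary and secondary school in early tracking countries than in late tracking countries.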
For causal identification of β_1, one has to make the following three identifying assumptions. First, there are no unobservable factors correlated with both (a) the existence of early tracking and (b) gender differences in primary school test achievement at different points in the school year. The qualification "different points in the school year" is relevant because both the TIMSS and PIRLS exams are conducted during, but not exactly at the end of, the school year. The assumption implies that there should be no structural differences between early and late tracking countries that change the gender gap depending on when during the school year students are tested. This first assumption would only be violated if, for example, parents in early tracking countries push girls and boys differently at the end of primary school, which seems unlikely.
3 [...] incorporate the plausible values (that is, the plausible multiple imputation method) also recommended by the OECD (2009b), though not in a causal DiD setting.
Second, there can be no unobservable factors that differently affect the evolution of boys' and girls' test scores in early versus late tracking countries that are not already present at the end of primary school. There are several potential examples of such issues. One would arise if per-pupil expenditures between primary and secondary school increased or decreased to a different extent in early tracking countries than in late tracking countries and one gender benefited more from this change in expenditures. Other confounding time effects could arise if the share of female teachers changed at different rates in early versus late tracking countries and the resulting effects on students differ by gender (Diefenbach and Klein 2002), or if GDP grew differently in late versus early tracking countries and these changes in GDP growth between grade four and grade eight affected the performance of girls and boys differently. These potential scenarios are why we include controls for GDP growth, education expenditure change, female teacher share, and overall gender inequality in a country in our empirical models. Inclusion of these variables eliminates time-effect distortions, and is one of the most important methodological improvements over earlier studies. Only Bodovski et al. (2020), who study the relationship between school differentiation and the gender gap in math and science (but not reading), include similar variables (though in the context of a non-causal analysis that does not take differences at the end of primary school into account). 4
Third and finally, we need to assume that there are no structural differences between early and late tracking countries that could affect the gender gap depending on whether students at the age of 15 are still in lower secondary school. Even though this assumption is widely ignored in the literature, it is crucial when analyzing the determinants of gender gaps in test achievement. 5 We therefore add information on "minutes spent in mathematics" and "minutes spent in test language" to the models. As such, we compare students with the same time spent on these subjects; the variables serve as a proxy for the degree of gender segregation in mathematics and reading at the age of 15, and their inclusion arguably fulfills assumption three.
4 A potential related issue is migration. We would have confounding time effects if there are differences in the immigration rates between early and late tracking countries that are not already captured in β_2 and if these differences affect boys and girls differently. This is unlikely, but PIRLS 2006 and TIMSS 2007 contain information on parents' country of birth, and we use it in robustness checks below.
5 An example will illustrate: in Austria (an early tracking country), students usually choose between different tracks a second time at the age of 14. At the age of 15, when PISA is conducted, some students are already on tracks that are highly specialized in certain fields, such as technical or language-based tracks with completely different curricula. The tracks are highly gender-differentiated: technical schools comprise almost 75 % males, while social tracks comprise almost 85 % females (Austria 2017). In many late tracking countries, however, the first selection occurs only at age 16 (as shown in Table 1), which means that the schools have a lower degree of gender differentiation at the age of 15, when PISA is conducted. Early tracking might thus be correlated with gender differentiation at the age of 15.
School systems across countries differ in important ways, and in ways that may be related to gender gaps in achievement scores in early versus late tracking countries. By controlling for cross-country differences in GDP growth, changes in school expenditures, the share of female teachers, and an index measure of gender equality, as well as student-specific number of minutes spent in math and reading, we arguably deal with most factors that would question the validity of a causal interpretation of our results. In the next section we describe our data in more detail before turning to the results in Section 4.
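To make the role of the triple interaction concrete, the following sketch fits a fully interacted model with three binary indicators (early tracking, secondary school, lower achieving sex) to noiseless cell means and checks that the triple-interaction coefficient equals the difference-in-differences of the gender gaps. It is a minimal illustration only: the coefficient values are hypothetical, the early tracking main effect stands in for the country fixed effects, and no control variables, weights, or plausible values are involved.

```python
from itertools import product


def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the row with the largest entry into place
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def design_row(et, sec, g):
    # Columns: intercept, ET, SEC, G, ET*SEC, ET*G, SEC*G, ET*SEC*G
    return [1, et, sec, g, et * sec, et * g, sec * g, et * sec * g]


# Hypothetical "true" coefficients, in the column order of design_row
true_beta = [500.0, -5.0, 8.0, -20.0, -3.0, 6.0, -7.0, -5.0]

cells = list(product([0, 1], repeat=3))            # all (ET, SEC, G) combinations
X = [design_row(*cell) for cell in cells]
y = [sum(b * x for b, x in zip(true_beta, row)) for row in X]  # noiseless cell means

beta_hat = solve(X, y)
triple = beta_hat[-1]                              # coefficient on ET*SEC*G

cell_mean = dict(zip(cells, y))


def gap(et, sec):
    """Mean score of the lower achieving sex minus that of the other sex."""
    return cell_mean[(et, sec, 1)] - cell_mean[(et, sec, 0)]


did_of_gaps = (gap(1, 1) - gap(1, 0)) - (gap(0, 1) - gap(0, 0))
print(triple, did_of_gaps)
```

A negative value means that the gap widens more between primary and secondary school in early tracking countries, which is the sign convention used for the main parameter of interest.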

Data
We use the PIRLS 2011, TIMSS 2011, and PISA 2015 datasets for the main analysis. PISA data come from the OECD, while PIRLS (reading) and TIMSS (math) come from the International Association for the Evaluation of Educational Achievement (IEA). All three studies were first conducted in the late 1990s or early 2000s and were subsequently repeated every few years (PISA every three years, TIMSS every four years, and PIRLS every five years) (IEA 2019a; IEA 2019b; OECD 2019).
Data from previous waves are used in some robustness checks. Data from newer waves are available but are not appropriate for our empirical strategy. Our main models compare PIRLS and TIMSS data on primary school students in 2011 with PISA data on secondary school students in 2015. The timing of these surveys implies that the students we compare were born in the same year. Therefore, comparing the students in these datasets allows us to eliminate any differences in students related to birth cohort, such as social norms about gender while the child was very young. The timing of newer waves of the data does not offer the opportunity to control for birth cohort effects. The newest data available are on primary school students in 2016 (PIRLS) and 2019 (TIMSS) and secondary school students in 2018 (PISA). These data do not line up to control for birth cohort effects. The other main potential source of bias is time effects: much changed between 2011 and 2015 (e.g. the economy in most sampled countries was stronger in 2015 than in 2011) that may have affected the school performance of boys and girls differently. We deal with this by controlling for time-variant factors, such as per-pupil expenditures and GDP growth, and interacting these with gender. Newer waves of the data would also have time-specific effects (especially comparing PIRLS 2016 to PISA 2018), while also suffering from birth cohort effects. We thus use older data because they allow us to at least eliminate birth cohort effects.
5 (continued) Without controlling for this fact, β_1 might capture the effect of gender segregation at the age of 15 rather than the effect of early tracking.
The most important difference between the PISA and the IEA datasets is that the target population in PISA is 15-year-old students, whereas the TIMSS and PIRLS samples comprise grade four students, who are 10 or 11. All datasets contain a full set of responses from individual students, school principals, and parents to very similar questions. Additionally, TIMSS (which asks about math, but not reading) is the only one of the three surveys conducted at both grades four and eight. We exploit this fact to repeat the analysis using the two grades available in TIMSS in a robustness check below. For all analyses, we standardize the data such that the mean scores are 500 with a standard deviation of 100. We use senate weights throughout the analysis, which means that each country contributes equally to the results. This is important in our context because the policy intervention (early tracking) is made at the country level; for the overall analysis of the policy, the effects of the policy in each country should count the same. Otherwise the effects of the policy in larger countries, like Germany, would play a stronger role in the conclusions about the effect of the policy.
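The two data preparation steps described above can be sketched as follows. This is a minimal sketch under assumptions, not the authors' procedure: the function names and the toy data are hypothetical, the sampling weights are rescaled so that each country's weights sum to one, and the standardization uses the weighted mean and weighted standard deviation of a single survey.

```python
import math


def senate_weights(weights, countries, total=1.0):
    """Rescale sampling weights so that each country's weights sum to `total`."""
    sums = {}
    for w, c in zip(weights, countries):
        sums[c] = sums.get(c, 0.0) + w
    return [total * w / sums[c] for w, c in zip(weights, countries)]


def standardize(scores, weights, mean=500.0, sd=100.0):
    """Rescale scores to the given weighted mean and standard deviation."""
    wsum = sum(weights)
    mu = sum(w * s for w, s in zip(weights, scores)) / wsum
    var = sum(w * (s - mu) ** 2 for w, s in zip(weights, scores)) / wsum
    return [mean + sd * (s - mu) / math.sqrt(var) for s in scores]


# Hypothetical toy sample: a large country "A" and a small country "B"
countries = ["A", "A", "A", "B", "B"]
raw_weights = [10.0, 30.0, 60.0, 1.0, 3.0]
scores = [420.0, 510.0, 560.0, 480.0, 530.0]

w = senate_weights(raw_weights, countries)
z = standardize(scores, w)

weight_a = sum(wi for wi, c in zip(w, countries) if c == "A")
weight_b = sum(wi for wi, c in zip(w, countries) if c == "B")
print(weight_a, weight_b)
```

With the raw weights, country "A" would carry 25 times the weight of country "B"; after rescaling, both contribute equally, which matches the logic of evaluating a country-level policy.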
To limit the range of cultural differences across countries in the sample, the main analysis includes only European countries. Robustness checks based on additional OECD countries are shown below. Some European countries were dropped because important country-level data were missing or because they were not included in one of the three main datasets. If important individual-level data were missing, we impute them using country averages. The 20 countries yield more than 200,000 individual-level observations.
The multivariate regressions described in equation (1) include the following individual-level control variables. "Age" indicates student's age at the time of the test. "Books" shows students' response to the question "how many books do you have at home?"; the response categories were "none or few books," "one bookshelf," "one bookcase," "two bookcases," and "three or more bookcases." We align these answers into categories 1-5 in our data. "Books" together with the parents' education variables act as family background proxy variables. We control for dummy variables indicating the highest completed ISCED level of education of both the father and the mother, measured as primary, secondary, or tertiary education. In TIMSS and PIRLS, a variable on time spent in math and language is derived from teacher's response to the question, "In a typical week, how much time do you spend on language/mathematics instruction and/or activities with the students?" In PISA, students answer the question, "How many class periods per week are you typically required to attend for the subjects in test language/mathematics?" On the country level, we derive control variables from other sources. The OECD provides data on the age of first tracking (OECD 2013, Figure IV.2.4, p. 78;OECD 2016a, Figure II.5.8, p. 167). These data were double-checked with UNESCO's World Data on Education, which provide detailed information on different European education systems (UNESCO Institute for Statistics 2013). Following Hanushek and Wößmann (2006) and Ruhose and Guido (2016), we define a country as early tracking if the first tracking occurs before the age of 15. Countries that track students at 15 or later are categorized as late tracking countries. Based on this definition, there are 10 late tracking countries and 10 early tracking countries in the main model. Table 1 shows the age of first tracking for the baseline sample countries and additional countries used in robustness checks. 
Column 2 of Table 1 also reports the number of education programs available for students at age 15 (OECD 2013, Figure [...]).
We further control for four variables to reduce the threat that grade-variant variables (that is, variables that change over time, as students move from grade four to grade eight) pose to our identification strategy. First, the measures of a country's public education expenditure per pupil in primary and secondary school (as percent of GDP per capita) are gathered from the Education Policy and Data Center (EPDC 2019) and the Knoema database (Knoema 2019). We combine both datasets to calculate the average change in primary education expenditure between 2007 and 2011 and the average change in secondary school spending between 2011 and 2015. The underlying assumption is that students who participated in PIRLS/TIMSS in 2011 were exposed to the education expenditures in primary school between 2007 and 2011, whereas students participating in PISA were affected by secondary education expenditures between 2011 and 2015. Second, the definition of GDP growth follows the same logic. PIRLS/TIMSS 2011 students were exposed to GDP growth between 2007 and 2011 while in primary school, and PISA students were exposed to GDP growth between 2011 and 2015 while in secondary school. The variable was calculated based on GDP per capita (PPP, current international dollars) supplied by the World Bank International Comparison Program database (2013).
Third, we use data on the share of female teachers in primary school in 2011 and in secondary school in 2015 provided by the UNESCO Institute for Statistics (2019). 6 Finally, we control for differences in country-level gender inequality, using the Overall Global Gender Gap Index from 2011 to 2015 published by the World Economic Forum (Hausmann et al. 2011; Hausmann et al. 2015). 7
7 The index is based on 14 indicators that measure gender inequality in economic participation and opportunity, educational attainment, health and survival, and political empowerment. Examples of such indicators are wage equality for women and men for similar work, female enrollment rates over male enrollment rates at different educational stages, and the share of women in parliament. The higher the index, the smaller a country's gender-based gaps in access to resources and opportunities.
Tables 2 and 3 show summary statistics from the datasets on reading and math, respectively. The four country-level variables are the same in both tables. The female share of teachers is higher in primary school than in secondary school, and it is higher in early tracking countries. GDP growth at the time students were in secondary school was much stronger in late tracking countries. Changes in educational spending also differed greatly across early versus late tracking countries. Late tracking countries have higher levels of gender equality. These differences in major social and economic indicators highlight the importance of including these factors in our empirical model.

The reading data in Table 2 show that parents in late tracking countries are more highly educated than those in early tracking countries, and families in late tracking countries have more books in the home. Students spend more class time on reading in primary school than in secondary school; this is true in both early and late tracking countries, though the change is smaller in late tracking countries. Similar patterns can be seen in the math data in Table 3. Parental education, especially mother's education, is significantly higher in late tracking countries. Households in late tracking countries have more books. Secondary school classes in late tracking countries spend more time on math than comparable classes in early tracking countries, though as in reading, the difference between primary and secondary school is smaller in late tracking countries. These structural differences, even in individual-level background characteristics, across early versus late tracking countries show the importance of comparing students with similar characteristics to get at the effects of the early tracking policy on their performance.
Without controlling for personal- and country-level background characteristics, we can see changes in the average gender-specific achievement gap across grades in early versus late tracking countries in Tables 2 and 3, as well as in Figure 1. The average gender achievement gap in Figure 1 is defined as the average test score of the lower performing sex minus the average score of the higher performing sex. 8 Therefore, a negative slope represents an increasing gender gap between primary and secondary school, whereas a positive slope indicates a decreasing gender gap. The negative slope in reading is steeper in early tracking countries than in late tracking countries. The figure visualizes the descriptive statistics in Table 2. In the table we see that the gender gap in reading in late tracking countries is 21 points in primary school and rises to 28 points in secondary school. The gap in early tracking countries starts at a lower point in primary school (14 points) but increases more dramatically, to 26 points in secondary school. For math, late tracking countries achieve a narrowing of the gender gap between primary and secondary school (the slope is positive and the gap goes from eight to seven points, as shown in Table 3), while the gender gap in early tracking countries increases slightly across school levels, from 9 to 11 points. It is interesting to note that the gender gap in reading in primary school is larger in late tracking countries than it is in early tracking countries. We do not have an explanation for this fact but note that it highlights the importance of comparing gaps between primary school and secondary school to identify the effect of tracking policies, which occur after primary school; we need to account for any existing differences before the implementation of the policy.
Figure 1: Average gender achievement gap in early and late tracking countries. Note: The gender gap is calculated as the score of the weaker achieving gender (girls for math, boys for reading) minus the score of the stronger achieving gender, thus always giving a negative result.
8 Figure 1 was calculated based on the 20 countries included in the main models.

Main Results
Table 4 shows the DiD results for reading and mathematics. The coefficient corresponding to the triple interaction term "Secondary × Early Tracking × Male/Female" is the main parameter of interest (β_1 in equation (1)). A negative sign implies a larger increase in the gender gap between primary and secondary school in early versus late tracking countries. The models include country fixed effects, which explains the high R² values.
The most important finding can be read from the top line of Table 4: early educational tracking increases the gender gap in reading scores, while the effect on math scores is smaller and statistically insignificant. In other words, tracking students into more specialized secondary schools increases the gender gap in reading, but we find no statistically significant effect on the gender gap in math scores. For reading, early tracking leads to a 7.4 point larger gap in test scores, meaning that early tracking increases the gender gap in test achievement by 7.4 % of one standard deviation. This is a large impact, considering that the average gap is 25 points at the end of lower secondary school (as seen in Table 2); early tracking accounts for more than 25 % of the gap.
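The magnitudes quoted in the text can be checked with simple arithmetic. The sketch below first computes the raw, unadjusted difference-in-differences from the descriptive gender gaps reported for Tables 2 and 3 (rounded to whole points in the text), and then expresses the 7.4 point estimate relative to the standard deviation of 100 and to the 25 point average gap. The variable names are our own, and the raw figures are not expected to match the regression estimate, which adjusts for controls and fixed effects.

```python
# Gender gaps in points (lower achieving sex minus higher achieving sex),
# using the rounded values quoted in the descriptive discussion.
gaps = {
    "reading": {"late": {"primary": -21, "secondary": -28},
                "early": {"primary": -14, "secondary": -26}},
    "math": {"late": {"primary": -8, "secondary": -7},
             "early": {"primary": -9, "secondary": -11}},
}


def raw_did(subject):
    """Change in the gap in early tracking countries minus the change in late ones."""
    g = gaps[subject]
    change_early = g["early"]["secondary"] - g["early"]["primary"]
    change_late = g["late"]["secondary"] - g["late"]["primary"]
    return change_early - change_late


reading_did = raw_did("reading")   # (-26 + 14) - (-28 + 21)
math_did = raw_did("math")         # (-11 + 9) - (-7 + 8)

effect_points = 7.4                 # estimated effect on the reading gap
share_of_sd = effect_points / 100   # scores are standardized to SD = 100
share_of_gap = effect_points / 25   # relative to the 25 point average gap

print(reading_did, math_did, share_of_sd, share_of_gap)
```

The raw comparison already points in the same direction as the regression results: the reading gap widens by more in early tracking countries, while the difference for math is smaller.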
The DiD findings are consistent with the summary statistics given in Figure 1 and Tables 2 and 3. For reading, we can see in Table 2 that boys' scores improved from primary school to secondary school in late tracking countries (achieving 491 points in primary school and 499 in secondary school), but went down from primary school to secondary school in early tracking countries (485 to 481 points, respectively). Girls' scores increased from primary to secondary school in both early and late tracking countries, but more so in late tracking countries. Figure 1 also shows that the gender gap increased more dramatically in early tracking countries. Thus, the DiD results  showing that early tracking increases the gender gap in reading is not surprising. The results for math are also consistent with summary statistics in Figure 1 and Table 3, in that there were not large changes in math test scores for boys or girls between primary and secondary school in either early or late tracking countries. Thus, it is no wonder that the DiD estimate reveals no effect of early tracking on the gender gap in test scores. Aside from the main results, three sets of other results from Table 4 are worth highlighting. First, in the school setting, a greater share of female teachers corresponds to a greater gender gap (in favor of girls) for reading and a higher share of female teachers helps to mitigate the gender gap in math scores (as seen by the positive and statistically significant coefficient on the interaction between the female teacher share and being a female student). Next, surprisingly, spending more class time on reading and math seems to be less helpful (math) or even harmful (reading) to the scores of the weaker gender in the respective subjectsthough the coefficients are very small and thus perhaps not very meaningful. 
One explanation for the negative coefficient on the interaction term with gender in reading might be reverse causation: classes with larger gender gaps spend more time on that subject later on. For math, more class time on the subject helps boys but not girls.
Second, at home, having more books corresponds with higher test scores in both reading and math, an unsurprising finding. Having three or more bookcases at home (variables "Books 4" and "Books 5") is particularly helpful for secondary school students. Similarly, having more highly educated parents corresponds with higher test scores; the coefficients are larger for mother's education than for father's education. Interestingly, the impact of parental education on children's test scores is stronger in primary school than in secondary school, as can be seen by the negative coefficients on the interaction terms between parental education and secondary school.
Notes: Standard errors were clustered on school level; significance levels are * p<., ** p<., *** p<.; provided senate weights were used and modified such that each country carries a weight of one; for the dependent variable "Reading Score" and "Mathematics Score" plausible values were used and standardized at mean  and standard deviation  for each survey separately.
Third, the overall economic situation in the country is also related to test scores, and sometimes in gender-specific ways. Greater per-pupil expenditures are associated with slightly lower reading scores, which is surprising and concerning, though the coefficient is relatively small and again might be due to reverse causality (schools spend more in response to poor performance). Similarly, GDP growth is associated with lower test scores. One potential explanation for these findings is that the change in both variables was calculated between 2007 and 2015. During this period, the financial crisis led to major economic decline. Since the countries with the strongest educational performance in the dataset were most affected by the economic crisis, a temporary negative effect of growth in GDP and educational expenditures on educational achievement might be conceivable. Finally, we see that greater gender equality in the country is associated with higher test scores in both math and reading for both boys and girls. In both subjects, it benefits the test scores of girls more than those of boys.
There are two potentially surprising findings in Table 4. The first is the coefficient on "male" in the model predicting reading scores: it is positive, very large, and statistically significant. This is counter-intuitive because boys in fact have lower scores on reading tests, as seen in Figure 1 and Table 2. This coefficient can be explained by the interaction terms and the time-varying country-specific variables included in the model. Table A1 in the appendix shows how the coefficient on "male" changes as we add more variables to the model. In the first column with no control variables, we see that the coefficient on "male" is negative and statistically significant, as expected. As we add control variables and interaction terms between the control variables, this coefficient changes in sign and magnitude. In the third model in Table A1, we include the gender equality index variable and its interaction with the male dummy variable. With these inclusions, the coefficient on male alone becomes large and statistically significant. This result occurs because the value of the gender equality index is between 0 and 100, and as Tables 2 and 3 show, its average value lies between 70.6 and 77.8. Thus, the coefficient on "gender equality index × male" (−0.743) times the actual value of the gender equality index (a number between 70 and 80), added to the stand-alone male coefficient, gives a total male "effect" that is much smaller and much more in line with expectations. If we further consider the value of the coefficient on "female teacher share × male" (−0.536) times the actual value of the female teacher share (on average, a number between 67 and 90), then the overall male "effect" is indeed negative, as expected.
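The arithmetic behind this explanation can be sketched as follows. The two interaction coefficients (−0.743 and −0.536) are those quoted above. The stand-alone coefficient on "male" is not reported in this passage, so the value of 80 used here is purely hypothetical, as are the chosen index value (75) and teacher share (80), which were simply picked from within the ranges given in the text.

```python
def total_male_effect(b_male, b_gei_x_male, gei, b_fts_x_male, fts):
    """Stand-alone male coefficient plus each interaction coefficient
    multiplied by the actual value of the interacted variable."""
    return b_male + b_gei_x_male * gei + b_fts_x_male * fts


B_MALE = 80.0          # HYPOTHETICAL stand-alone coefficient, for illustration only
B_GEI_X_MALE = -0.743  # "gender equality index x male", quoted in the text
B_FTS_X_MALE = -0.536  # "female teacher share x male", quoted in the text
GEI = 75.0             # assumed index value within the reported 70-80 range
FTS = 80.0             # assumed teacher share within the reported 67-90 range

# Adding only the gender equality term shrinks the positive male coefficient.
after_gei = total_male_effect(B_MALE, B_GEI_X_MALE, GEI, 0.0, 0.0)

# Adding the female teacher term as well turns the overall male "effect" negative.
overall = total_male_effect(B_MALE, B_GEI_X_MALE, GEI, B_FTS_X_MALE, FTS)

print(round(after_gei, 3), round(overall, 3))
```

With any stand-alone coefficient, the same function shows why the sign of "male" alone is not informative once interactions with variables that never take the value zero are included.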
The second surprising result in Table 4 is the coefficient on secondary school, which is strongly negative and statistically significant in both the model predicting reading scores and the model predicting math scores. Again, Tables A1 and A2 show the development of the coefficient on this variable as we include more variables in the specification. In the first column of both tables, without any controls, we see the expected positive coefficient on secondary: test scores are, on average, higher in secondary school, and this is more true for reading (Table A1) than math (Table A2), results in line with the summary statistics in Tables 2 and 3. When we add students' background characteristics in the second column of Tables A1 and A2, the coefficient on secondary becomes strongly negative. Again, this result can be explained by the inclusion of other control variables in the model, and their interaction with secondary school status. The base level for age, for example, is zero, though all students in the data are over 10. Multiplying the actual value of this variable by the coefficient on "secondary × age" (which is between 12 and 14.1 in each model in Tables A1 and A2) gives a large positive number; combining this with the stand-alone coefficient on "secondary" gets us closer to the expected (positive) result. The same reasoning applies to all other variables interacted with secondary school status; the number of books at home is particularly powerful. Combining the coefficients on the variables interacted with secondary school with the actual values of the variables and adding these together shows a positive relationship between secondary school and test scores.
In sum, early tracking is found to increase gender gaps in reading (7.4 % of one standard deviation). In math, there is a statistically insignificant relationship between early tracking and the gender gap in test scores. These results are in contrast to some studies in the literature; given the differences in the empirical strategies employed, the divergence in findings is not particularly surprising. The findings in van Hek et al. (2019) suggest that early tracking narrows the gender gap in reading; our results are more in line with Hermann and Kopasz (2019), who find early tracking to increase the reading gap. For math test scores, our results are in line with Bodovski et al. (2020), who find no statistically significant relationship between early tracking and the gender gap.

Robustness Checks
Most of our robustness checks are similar to those performed by Hanushek and Wößmann (2006), Ruhose and Schwerdt (2016), and Hermann and Kopasz (2019). The tables in this section only report the parameter of interest (β1), but the estimates are based on the complete model used in the main analysis.

Definition of Early Tracking
Our first set of robustness analyses investigates whether the main results are driven by the definition of early tracking (thus far, the cut-off was tracking before the age of 15). In the first column of Tables 5 and 6, this definition is replaced by the number of education programs available for students at the age of 15. The number of available tracks varies between one educational program for very late tracking countries and up to seven educational programs for early tracking countries (see Table 1). The measure of the number of tracks available is an indication of the differentiation of a country's school system. A negative coefficient on the variable "number of tracks" indicates that greater differentiation before age 15 leads to higher gender differences in test achievement. Table 5 reveals that this is exactly the case for reading. In mathematics, the coefficient in Table 6 is close to zero; more differentiation does not affect the gender gap in math scores.
Next, instead of strictly separating early and late tracking countries with a particular cut-off age as in the main analysis, it is also possible to use the actual age of first tracking, measured by a continuous variable. As shown in Table 1, the tracking age varies between 10 in Austria and Germany and 16 in Northern European countries. The results are reported in Tables 5 and 6. A positive coefficient indicates that earlier tracking increases the gender gap in test scores. This is the case for reading scores: every one-year increase in the age of first tracking reduces the negative effect of tracking on the gender gap in test achievement. For math test scores, the effect of tracking at an earlier age defined in this way remains statistically insignificant.
In the third robustness check, we examine whether the main results are driven by countries with very early tracking. To do so, we remove the five countries whose first tracking occurs at age 10 or 11 from the sample. The results, given in the third column of Tables 5 and 6, are consistent with the main findings. Even without very early tracking countries in the sample, early tracking increases the gender gap in reading, and it has no statistically significant effect on the gender gap in math scores. Finally, we check the robustness of the results to the exact cut-off age used in the main analysis. Here we change the cut-off between early and late tracking from 15 to 13. Using this definition, there is still a strong effect of early tracking on the gender gap in reading. The coefficient for math remains statistically insignificant.
Taken together, the results of these analyses show that the main findings are not driven by the definition of "early" tracking or the exact age at which the tracking occurs.
Notes: Standard errors were clustered on school level; significance levels are * p<., ** p<., *** p<.; provided senate weights were used and modified such that each country carries a weight of one; for the dependent variable "Mathematics Score" plausible values were used and standardized at mean  and standard deviation  for each survey separately. Very early is defined as tracking before the age of .
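The alternative treatment definitions discussed in this subsection can be sketched as simple transformations of the age of first tracking. In the sketch below, only the tracking ages for Austria and Germany (10) come from the text; the remaining countries and ages are hypothetical placeholders, and the function and variable names are illustrative. The "number of tracks" measure is omitted because its values are only given in Table 1.

```python
# Tracking ages: Austria and Germany are quoted in the text;
# "Country X/Y/Z" and their ages are HYPOTHETICAL placeholders.
tracking_age = {
    "Austria": 10,
    "Germany": 10,
    "Country X": 12,
    "Country Y": 14,
    "Country Z": 16,
}


def early_dummy(age, cutoff=15):
    """Return 1 if first tracking occurs before the cut-off age, else 0."""
    return int(age < cutoff)


# Main analysis: early tracking means tracking before age 15.
main = {c: early_dummy(a) for c, a in tracking_age.items()}

# Stricter definition: cut-off lowered from 15 to 13.
strict = {c: early_dummy(a, cutoff=13) for c, a in tracking_age.items()}

# Continuous definition: the age of first tracking itself.
continuous = dict(tracking_age)

# Sample without very early tracking countries (first tracking at age 10 or 11).
no_very_early = {c: a for c, a in tracking_age.items() if a not in (10, 11)}

print(main)
print(strict)
print(sorted(no_very_early))
```

A country such as the hypothetical "Country Y" switches from early to late tracking when the cut-off moves from 15 to 13, which is the kind of reclassification these checks are designed to probe.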

Additional Countries
The main analysis uses data from 20 European countries. Using data from countries in other regions gives us the opportunity to check the main model's sensitivity to a broader range of institutional contexts. Similar results for this extended model would be an indication that the results for the 20 European countries in the main results have international external validity. The additional countries and their tracking ages can be seen in the lower portion of Table 1. Countries added for the reading and math models are shown in the columns "Reading Big" and "Math Big," respectively. The choice of countries was based solely on the availability of variables used in the main model. There are eight additional countries for the big sample for reading (seven of which are late tracking) and five additional countries for the big sample for math (all of which are late tracking). The results are presented in Table 7. Adding the new countries results in only minor changes in the parameter of interest for the reading results. However, for math, the coefficient becomes large and statistically significant. We thus have evidence that the effect of early tracking on the gender gap in math test scores depends on the countries included in the sample.
Notes: Standard errors were clustered on school level; significance levels are * p<., ** p<., *** p<.; senate weights were used such that each country carries a weight of one; for the dependent variable plausible values were used and standardized at mean  and standard deviation  for each survey separately. Due to missing data in the big mathematics model, parents' education was dropped.
A second test addresses the role of any one country in determining the results. Minor differences in the effect of early tracking depending on the sub-sample of countries used can be expected. It is, however, important to ensure that the main model result is not solely driven by certain unobservable characteristics of one particular country. Following Ruhose and Schwerdt (2016), to rule out a single country's relevance for the main results, we perform a piece-wise deletion of one country at a time and a re-estimation of the main model. Table 8 gives these results. The first column shows which country was excluded from the analysis. The table is read as follows: If Austria is excluded from the sample, i.e. the main model is estimated for nine early and 10 late tracking countries, the parameter of interest is −7.132 in reading and −1.125 in mathematics. If only the Czech Republic is excluded, the values change to −6.347 in reading and to −3.316 in mathematics.
Notes: Standard errors were clustered on school level; significance levels are * p<., ** p<., *** p<.; provided senate weights were used and modified such that each country carries a weight of one; for the dependent variables plausible values were used and standardized at mean  and standard deviation  for each survey separately; coefficients are based on the full models.
In reading, results are statistically significant no matter which country is excluded. It is interesting to note that when Northern European countries are excluded, the effect of early tracking on gender differences in reading increases (that is, the coefficient rises above the baseline level of 7.4 points in absolute terms). These results emerge because all Northern European countries are late tracking countries with relatively large gender differences in reading scores. In mathematics, regardless of which single European country is left out of the analysis, the coefficient remains statistically insignificant.
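The leave-one-country-out procedure can be illustrated with a deliberately simplified sketch. It replaces the full regression with a comparison of mean changes in the gender gap, gives each country a weight of one, and runs on made-up numbers: the six countries and all gap values below are hypothetical, and the function names are illustrative.

```python
from statistics import mean

# HYPOTHETICAL country-level gender gaps in reading (boys minus girls, in points).
countries = [
    {"name": "A", "early": True, "gap_primary": -10.0, "gap_secondary": -30.0},
    {"name": "B", "early": True, "gap_primary": -12.0, "gap_secondary": -28.0},
    {"name": "C", "early": True, "gap_primary": -8.0, "gap_secondary": -29.0},
    {"name": "D", "early": False, "gap_primary": -11.0, "gap_secondary": -22.0},
    {"name": "E", "early": False, "gap_primary": -9.0, "gap_secondary": -20.0},
    {"name": "F", "early": False, "gap_primary": -13.0, "gap_secondary": -26.0},
]


def gap_change(row):
    """Change in the gender gap between primary and secondary school."""
    return row["gap_secondary"] - row["gap_primary"]


def gap_did(rows):
    """Mean gap change in early tracking countries minus that in late tracking
    countries, with every country carrying equal weight."""
    early = mean(gap_change(r) for r in rows if r["early"])
    late = mean(gap_change(r) for r in rows if not r["early"])
    return early - late


def leave_one_out(rows):
    """Re-compute the estimate once for each country left out of the sample."""
    return {
        r["name"]: gap_did([x for x in rows if x["name"] != r["name"]])
        for r in rows
    }


baseline = gap_did(countries)
loo = leave_one_out(countries)
for name, estimate in loo.items():
    print(f"without {name}: {estimate:.2f} (baseline {baseline:.2f})")
```

If the sign and rough size of the estimate survive every deletion, as they do for reading in Table 8, no single country is driving the result.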

Dataset Timing
In this section, we check the robustness of our results using earlier waves of the data and using TIMSS data from both grade four and grade eight. These exercises allow us to check if the effect of early tracking is stable over time, and if our models produce results similar to other studies using the same data. To use older data, we must drop four countries in the reading analysis and up to seven countries in the math analysis due to missing data. When we do the analysis comparing TIMSS data from grade four and grade eight, we only have data on two early tracking countries.
The first analysis in this section changes the combination of dataset years used. The main analysis matches PIRLS/TIMSS 2011 with PISA 2015 data, which allowed us to observe students at different schooling levels who were born in the same year, thus eliminating potential birth cohort effects. We deal with any potential biasing time effects by interacting the change in educational expenditures, GDP growth, the share of female teachers, and the gender equality index with the gender of the pupil. Matching two datasets of the same year eliminates the possibility of any time effects, but biasing birth cohort effects may then exist. Matching PIRLS/TIMSS 2011 data with PISA 2012 data removes most time effects but introduces potential cohort effects. For this exercise, we do not include the country controls used in the main analysis (that is, changes in GDP growth, educational expenditures, share of female teachers, and the gender equality index over time) because those over-time changes are irrelevant when we compare datasets from the same year or from just one year apart.
The results of matching PIRLS/TIMSS 2011 with PISA 2012 data can be found in the first column of Tables 9 and 10. They show that the negative effect of early tracking on gender differences in reading achievement is slightly smaller than, but similar to, the main results. The results for math using these two datasets are very similar to the results in the main model: a negative but small and statistically insignificant effect of early tracking.
Our second exercise looks at how the effect of early tracking may have changed over time. Theoretically, the effect of early tracking on gender differences should not change significantly over small periods of time. To test this, we calculate the DiD estimate by pooling primary school data from 2006 PIRLS (for reading) and 2007 TIMSS (for math) with secondary school data from PISA 2012; results are given in the second column in Tables 9 and 10. In these earlier data, we do not have information on four countries that were in the main analysis (Czech Republic, Finland, Ireland, and Portugal). The estimated effect of early tracking on the gender gap in reading was over 12 points in the earlier data. When pairing the earlier data for the math analysis, we must drop seven countries (Romania, Belgium, Finland, Spain, Ireland, Portugal, and Poland) from the analysis due to data constraints. The result is parallel to the main findings: a small, negative, and statistically insignificant effect of early tracking.
Third, the impossibility of controlling for migrant status in the main models might have led to biased results. Unlike the 2011 datasets, TIMSS 2007 and PIRLS 2006 measure parents' country of birth. In the third column of Tables 9 and 10, this variable was added to the model in column two. When comparing columns two and three, any bias related to the student's migrant status can be ruled out; the migration variable does not change the estimate in any meaningful way.
Finally, we consider that TIMSS data are not only available at grade four but also measure students' test scores four years later, at grade eight. Matching TIMSS grade four with TIMSS grade eight data, instead of only using PISA data at the secondary level, has two advantages. First, since PISA and TIMSS do not measure exactly the same thing, we can test whether results are sensitive to the particular secondary school data used. The second advantage is that PISA and TIMSS grade eight surveys are carried out for slightly different student ages. In the PISA study, students are usually slightly older. We did not use TIMSS grade four and grade eight data as our main model because TIMSS has data on math but not reading. In addition, for our set of control variables, only a limited number of countries are available (only two early tracking countries and eight late tracking countries). Results using data from TIMSS 2011 grade four and TIMSS 2015 grade eight are presented in column four of Table 10. The coefficient of interest is slightly larger than the coefficient in the main model, but it too is statistically insignificant.
In sum, the robustness checks show that the effect of early tracking on gender differences in reading is consistent across different years and countries included in the analysis. We conclude that the main results for the effect of early tracking on the gender gap in reading scores are robust. In mathematics, the main results are statistically insignificant and remain so throughout almost all robustness checks. The one exception is when we include five non-European late-tracking countries; their inclusion leads to a statistically significant coefficient measuring the effect of early tracking. Otherwise, there is overwhelming evidence that for the 20 European countries in the sample, early tracking does not lead to an increase in the gender gap in math test scores.

Conclusions
In this paper, we investigated the impact of early educational tracking on the widening gender gap in educational performance between primary and secondary school. We estimated a difference-in-differences model using cross-country variation in the age of first educational tracking as a policy variable. We used large-scale international test studies at the primary and secondary school level for reading and math. Applying these data allowed us to compare boys' and girls' test scores between early and late tracking countries in secondary school, conditional on gender differences that already existed in primary school. By applying a cross-country DiD model that takes into account a wide range of control variables absent in previous literature and all characteristics of a complex sample design, this study could produce causal estimates of the effect of early tracking on the change in the gender gap in test scores between primary and secondary school.
The results for reading indicate that early tracking increases gender differences between primary and secondary school. A wide range of robustness checks confirm the results. Our empirical results on the gender gap in reading scores are in line with the first stream of the maturity hypothesis, which predicted stronger gender gaps in reading for countries with early tracking than for countries with late tracking. The reading results are also consistent with our gender roles hypothesis, which predicted that early tracking widens the gender gap both in reading and mathematics. For math scores, almost across the board, there is no evidence that early tracking affects the gender gap in test scores. The sole exception is when we include five non-European late-tracking countries in the sample. Overall, the results do not give evidence to conclude that early tracking impacts the gender gap in math scores.
The findings for reading are in line with the results in Hermann and Kopasz (2019), while van Hek et al. (2019) found the opposite. The latter study used non-causal methods and is thus not directly comparable. As reproduced in this study, the results for the effect of early tracking on the gender gap in mathematics are sensitive to several details of the analysis. Using less recent data, Bedard and Cho (2010) concluded that early tracking was related to worse performance for girls in mathematics. Hermann and Kopasz (2019), however, have shown early tracking to be positively related to girls' performance in mathematics. Bodovski et al. (2020), like us, found a statistically insignificant relationship between early tracking and gender gaps in mathematics.
The key conclusion drawn from this paper is that early tracking is likely to contribute to a widening gender gap between primary and secondary school in reading. The reading results give clear implications: early tracking increases gender differences in academic achievement.
Notes: Standard errors clustered on school level; significance levels are * p<., ** p<., *** p<.; provided senate weights were used and modified such that each country carries a weight of one; for the dependent variable "Reading Score" plausible values were used and standardized at mean  and standard deviation  for each survey separately.
Notes: Standard errors were clustered on school level; significance levels are * p<., ** p<., *** p<.; provided senate weights were used and modified such that each country carries a weight of one; for the dependent variable "Mathematics Score" plausible values were used and standardized at mean  and standard deviation  for each survey separately.