Gain Scores, ANCOVA, and Propensity Matching Procedures for Evaluating Treatments in Education

Abstract Researchers have several options available to analyze data from interventions when participants have not been randomly allocated into conditions. Among these are the gain score, ANCOVA, and propensity matching procedures. Each of these attempts to account for pre-treatment differences among the conditions, but they do so differently. These procedures are reviewed and methods for estimating them in R are shown. The choice among these procedures can be difficult, and different situations are shown where they perform differently. The primary conclusion of this paper is that models should be hypothesized for how the data may arise, data simulated from these models, and the properties of statistical procedures evaluated on the simulated data. A goal of this paper is to present these procedures without extensive mathematics in order to allow a broad readership to use these methods in their own situations.

In much education research it is not practical (and sometimes impossible) to allocate students into conditions: for example, to change students' social class; to change their early childhood experiences; to change their pre-study knowledge; to make them have specific psychological conditions, etc. This is true in many disciplines: e.g., an astronomer cannot assign a star to go supernova; a historian cannot randomly decide whether Trotsky or Stalin succeeds Lenin; see also the evaluations at www.povertyactionlab.org.
Because of this, researchers rely on statistical procedures, often with the belief that certain statistical procedures somehow "control" the effects of other variables, usually called covariates. These procedures do not physically "control" or in any other way affect these covariates. Believing that they do is what Braun (2013) describes as magical thinking. Unfortunately this misleading language has been passed on to generations of students. Three procedures are discussed: gain scores, ANCOVA, and propensity matching. All of these provide accurate solutions in certain circumstances to certain research questions. The difficulty is knowing when each of these procedures, if any, is appropriate. Graphical models and simulations are used to illustrate how a researcher might decide which to use, and two examples are used to illustrate steps that can help guide this choice.
The following steps can be taken before conducting statistical analysis of any intervention:
1. describe models for how the data might arise,
2. simulate data according to these models, and
3. evaluate different statistical procedures on the simulated data.
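To make these steps concrete, here is a minimal sketch of the workflow in R. The data model, sample size, and number of replications are all illustrative assumptions (the paper's own simulations, described later, differ in detail): a latent Knowledge variable drives both test scores and the chance of treatment, and the true treatment effect is set to zero.

```r
# Step 1: assume a data model (all names and values are illustrative).
# Knowledge drives Pre, Post, and allocation; the true effect is zero.
set.seed(42)
simulate_once <- function(n = 200) {
  knowledge <- rnorm(n)
  treatment <- rbinom(n, 1, plogis(knowledge))   # allocation depends on Knowledge
  pre  <- knowledge + rnorm(n)
  post <- knowledge + rnorm(n)                   # no true treatment effect
  # Step 3: apply competing procedures to the same simulated data set
  c(gain   = coef(summary(lm(post - pre ~ treatment)))["treatment", "t value"],
    ancova = coef(summary(lm(post ~ treatment + pre)))["treatment", "t value"])
}
tvals <- replicate(1000, simulate_once())        # Step 2: many data sets
rowMeans(tvals)   # mean t values; a mean far from zero signals bias
```

Under this assumed model the gain-score t values should average near zero while the ANCOVA t values do not, illustrating how the simulation step can expose bias before any real data are analyzed.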

A Mathematics Intervention and Review of Procedures
A mathematics intervention example was chosen to represent evaluations where the researcher wants to know whether a treatment works and has a prior score, measured before the treatment, that is on the same scale as the outcome measure, but cannot randomly allocate people into a treatment and a control condition.
Suppose you want to assess whether a year-long program of a weekly after-school "math club" improves students' scores at the end of the year. Denote being in the club with Treatment = 1 and 0 otherwise, the assessment before treatment as Pre, and the end of year assessment as Post (subscripts will not be used on variables in text or in figures, but all vary by student). There would be complaints if students were randomly assigned to this intervention. First, let's assume students volunteer for the club. Who volunteers for an after-school math club is not random. Assume there is some latent variable, call it Propensity, that is the probability that a student (or their guardian) volunteers. This is an unmeasured variable. Propensity is likely to be influenced by many things, including how much math knowledge the student has already learned and aspects of their home environment (e.g., diligence doing homework). Call this Knowledge. Knowledge will also affect the two test scores. Knowledge and Propensity will be influenced by many other variables, and will also influence other variables. Some of these may be observed, but most will not be. For illustration, only simple models are used in this section.
It is often useful to draw relationships using the mathematics concept of a graph (not everyone agrees; see Imbens & Rubin, 2015, p. 22). A graph, in its mathematical sense, is a set of nodes (called vertices in some texts), some of which are connected by edges. In causal models these edges are often shown with an arrow on one end to denote the direction of causality. Following the style used in structural equation modeling, ellipses will be used here for unmeasured constructs and rectangles for observed variables. The seminal reference applying graphs to causal models in science is Pearl (2009), and good introductions include Elwert (2013), Morgan & Winship (2015), and Pearl et al. (2016). Figure 1A shows the model described above. Each observed variable also includes an error term; these are not shown in order to make the figures less cluttered (they would be small ellipses with arrows pointed towards the observed variables). Models are simplifications. For example, there will be other variables that influence Knowledge and Propensity, and are influenced by them. The primary interest for the researcher is estimating the edge Treatment → Post, enclosed with a red ellipse. For the model in Figure 1A, the students' scores on Pre do not influence whether someone is in the treatment.
This might be because treatment allocation is decided before the tests are scored. Figure 1B supposes that the Pre scores do influence propensity, and that Knowledge does not. This might be the case if school administrators were deciding who was in the math club based solely on the scores from this assessment, and the other factors influencing Propensity were unrelated to Knowledge (e.g., whether the student was available when the club occurred). Figure 1C allows both of these to influence Propensity. More complex models are considered in Example #2 of this paper.

Review of Statistical Methods using R
Comparing scores on Post by Treatment using something like a t-test is not a good measure of the intervention when the two groups begin systematically different in ways associated with the outcome variable. Differences on the outcome variable could be due either to the intervention or to the pre-existing differences. Three procedures will be considered in this paper: gain scores, ANCOVA, and propensity matching. Each of these is sometimes described as "controlling for previous performance." Before doing any inferential statistical modelling, descriptive statistics and visualizations should be examined (Wilkinson, 1999; Wright, 2003). Example plots are shown in Figure 2 for hypothetical post and pre scores for two groups. The scatterplot includes ellipses to show where approximately 50% of each group lie; these are made with the dataEllipse function of car (Fox & Weisberg, 2019). The colored lines are regression lines for each group. Their slopes are less than one (shown with the grey line, for Pre = Post), which is consistent with regression towards the mean (Galton, 1886). The histograms for each variable are shown to the right of the scatterplot. These show the distributions for both groups on both variables, and include a kernel density curve made with R's density function. The mean for each group is shown by the vertical dashed lines. Further information about making plots in R can be found in Murrell (2019). The a and b in the plot are discussed below.
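The regression-toward-the-mean point can be checked with a few lines of R. This is a hypothetical sketch, not the data behind Figure 2: Pre and Post are noisy, standardized measures of the same latent knowledge, so the fitted slope of Post on Pre falls below the Pre = Post line.

```r
# Simulated pre and post scores that share a latent cause; the slope of
# the fitted regression is roughly their correlation, well below 1.
set.seed(1)
n <- 500
knowledge <- rnorm(n)
pre  <- as.vector(scale(knowledge + rnorm(n)))   # standardized scores
post <- as.vector(scale(knowledge + rnorm(n)))
slope <- unname(coef(lm(post ~ pre))["pre"])
slope   # noticeably less than 1
```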
The first two procedures are often discussed with reference to Lord's paradox (1967; 1969). Lord described a situation: students' weights, before and after a year of college, where interest was in the gender difference. Lord imagined two statisticians proposing different methods for the analysis. The first statistician proposed subtracting the two weights and comparing the means of these gain scores. The second statistician proposed predicting final weight from gender after conditioning on the initial weight using an ANCOVA. The statisticians reached different conclusions. The first statistician found no difference in weight gain, for either females or males, so no gender difference. The second found males weighed more than females after conditioning on pre-weights. Several authors have shown when and why these approaches can produce different effects (e.g., Hand, 1994; Holland & Rubin, 1983; Kim & Steiner, 2020; Pearl, 2016; Wainer, 1991; Wright, 2006, 2020). Since Lord described this paradox, propensity matching (e.g., Rosenbaum, 2002; Rubin, 2006) has become very popular, though many express concern that it is sometimes used without due care (e.g., Pearl, 2009; Sekhon, 2009). Therefore these three procedures will be reviewed. The R environment and language (R Core Team, 2019) will be used for simulations, so the code to estimate these models in R is presented here.

Gain scores
The simplest of the procedures considered, computationally, is the gain score method. Let Gain = Post − Pre, where it is assumed that these gain scores have approximately the same meaning for each level of Pre. It must make sense to equate, for example, Tom's increase from 96 to 99 with Jerry's increase from 47 to 50. Analyses can then be conducted on this variable using Treatment with or without the other observed variables. Lord's (1967) first statistician conducted a t-test between the two groups on the gain score (i.e., no other covariates). Eqn. 1 shows this as a regression model:

Gain_i = β0 + β1 Treatment_i + · · · + e_i  (1)

The procedure estimates β0 and β1, usually by finding the values that make the sum of the squared e_i as small as possible (i.e., least squares). If only the single predictor variable Treatment is used this is equivalent to a t-test. The three dots, · · ·, are shown to emphasize that further covariates could be included. Most interest is in β1, the value associated with the treatment, and the usual null hypothesis is H0: β1 = 0. In R, where Covs is the set of any covariates beyond the treatment (Group) and initial score (Pre), this would be:

gainmod <- lm(Post - Pre ~ Group + Covs)
The lm function stands for linear model. The gain scores are calculated on the left of the ~ and the model is on the right. If you have several data sources, you can tell R which object each variable is in. If the data are all stored in the same data frame, pass it with the data argument (e.g., lm(Post - Pre ~ Group + Covs, data = mydata)). The result is stored here as an object called gainmod. Information from this object can be printed by typing gainmod, or by placing it inside functions like summary(gainmod) (summary output), anova(gainmod) (ANOVA table), confint(gainmod) (confidence intervals of the fixed effects), predict(gainmod) (predicted values), and plot(gainmod) (useful plots for checking assumptions). These functions all find the class of the object gainmod (its class is "lm") and report results accordingly. For introductions to using R for statistics see Crawley (2015), Field et al. (2012), Matloff (2020), and Venables & Ripley (2002; R is an implementation of the language S, so books about S are also applicable). For discussions of the philosophy behind and history of R see Chambers (2008) for details and Chambers (2009) for an abbreviated discussion. Useful sources about the programming language include Chambers (1998), Matloff (2011), Venables & Ripley (2000), and Wickham (2019). The intercept, β0, is the mean gain for the control group. The difference between the two gain scores, shown in the final line of the output, is the estimate of β1; here it is negative, meaning the gain by the treatment group is less than the gain by the control group. The mean gain for the treatment group is the sum of these two estimates. The equivalent t-test provides a little more output. Both show the same t-value and p-value. The gain is positive for the control group and negative for the treatment group. The t-value is negative in the regression output but positive in the t-test output only because of how the t-test function codes the group variable.
Since the hope is that the treatment has a positive effect, this result is appropriately reported as a negative t-value. The mean gain score for each group is shown in Figure 2 by the two lines marked a with arrows. These lines go from the means of the two variables for each group, shown with a circle in the middle of each ellipse, to the value where the group would have been had there been no gain (both the horizontal and vertical values being the mean for the Pre score). The conclusion from the gain score procedure would likely be that the treatment had a negative effect on achievement.
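As a toy illustration of the accessor functions listed above, the gain-score model can be fit to a small fabricated data set (all numbers are invented; the treatment group's extra gain is set to 0.5):

```r
# Fit the gain-score regression on made-up data and inspect the "lm" object.
set.seed(2)
Pre   <- rnorm(40)
Group <- rep(0:1, each = 20)                 # 20 control, 20 treatment (invented)
Post  <- Pre + 0.5 * Group + rnorm(40)       # invented: treatment gains 0.5 more
gainmod <- lm(Post - Pre ~ Group)
class(gainmod)                               # "lm"
coef(summary(gainmod))                       # estimates, SEs, t and p values
confint(gainmod)                             # confidence intervals
```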
ANCOVA

Lord's (1967) second statistician conducted an ANCOVA. This is the procedure most often described as controlling for covariates. There have been decades of warnings about the limitations of this procedure (e.g., Kahneman, 1965; Meehl, 1970). The phrase ANCOVA can mean different things to different people, but here it will refer to the following model:

Post_i = β0 + β1 Treatment_i + β2 Pre_i + · · · + e_i

If β2 is fixed at 1, this becomes eqn. 1. This ANCOVA also tests whether β1 = 0, like eqn. 1, but this β1 is different: it is the effect after conditioning on Pre, and the outcome variable is different. More covariates are often added to the regression, further obfuscating the meaning of β1. Cox & Donnelly (2011, p. 111) suggest using notation that shows all the variables being conditioned upon when reporting an effect, to make the complexity more transparent. If the additional covariates were cov1, cov2, and cov3, the treatment effect could be written as β_treatment|pre,cov1,cov2,cov3. Sometimes further variables are added to regressions/ANCOVAs with the hope that this somehow gets closer to isolating the causal impact of Treatment on Post. If the convention were to report estimates with this longer notation it might make clearer that adding more variables is unlikely to simplify the meaning of the effects. Adding or removing a covariate changes the meaning of the parameter being estimated. It is important to think carefully about which covariates to include, based on their role in the overall causal model, and to evaluate the performance of these statistical models with simulated data sets. In R, where Covs is the set of any covariates other than Treatment and Pre scores, this would be:

ancmod <- lm(Post ~ Treatment + Pre + Covs)

The results can be found using the same R functions listed above because both models are produced with the lm function, so both produce class "lm" objects. After conditioning on Pre, the treatment effect is positive and statistically significant. Thus, the typical conclusion from this analysis would be that the treatment had a positive effect on achievement.
Because the gain score approach and the ANCOVA approach lead to different conclusions, both cannot be right, which is why Lord (1967) called this a paradox.
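The paradox is easy to reproduce by simulation. The sketch below uses invented weights where neither group changes on average over the year but the groups differ at baseline; the two statisticians' analyses then disagree, just as Lord described.

```r
# Lord's paradox in miniature (all values invented): stable true weights,
# noisy measurements in September and June, and no systematic gain.
set.seed(3)
n <- 1000
group  <- rep(0:1, each = n)                              # two groups
stable <- rnorm(2 * n, mean = ifelse(group == 1, 180, 130), sd = 15)
w_sept <- stable + rnorm(2 * n, sd = 5)                   # start-of-year weight
w_june <- stable + rnorm(2 * n, sd = 5)                   # end-of-year weight
# Statistician 1: gain scores show no group difference
gain_t <- coef(summary(lm(w_june - w_sept ~ group)))["group", "t value"]
# Statistician 2: ANCOVA shows a clear group difference
anc_t  <- coef(summary(lm(w_june ~ group + w_sept)))["group", "t value"]
c(gain_t = gain_t, ancova_t = anc_t)
```

The gain-score t hovers around zero while the ANCOVA t is large, because conditioning on a noisy baseline leaves part of the stable group difference in the group coefficient.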

Propensity Score Matching
Trying to reach causal conclusions when people are not randomly allocated into groups is difficult (Campbell & Stanley, 1963). Propensity matching attempts to create an accurate model of who chooses (or is chosen) to be in the intervention, and then uses this information to compare people with similar propensities to be treated. Propensity matching was developed in a series of papers by Rosenbaum and Rubin. The seminal textbook is Rosenbaum (2002) and many of their contributions have been republished in Rubin (2006). The phrase propensity matching now applies to several different approaches that aim to achieve these goals. An excellent introduction to propensity matching, which covers many of these approaches, is Leite (2017). He describes six steps for propensity score analysis (p. 7).
1. Prepare data. This includes choosing which covariates to include, which requires knowledge of how the different variables may relate and therefore knowledge of the research domain. Leite includes dealing with missing values in this step.
2. Propensity score estimation. This involves using the covariates to estimate the probability that each person will be in the treatment group (e.g., with logistic regression).
3. Propensity score method implementation. The analyst decides whether to create "matched" groups or use some other technique (e.g., weighting according to the propensity score in a regression).
4. Covariate balance estimation. Depending on the previous step, the analyst evaluates how successful the implementation was. For example, if matched groups were created, are their scores similar on the covariates?
5. Estimate the effect of the treatment. The method depends on previous steps.
6. Sensitivity analysis. A variety of methods can be used here, including the focus of Example #2 in this paper: seeing if the results vary by whether particular covariates are included (#2a), and varying assumptions of the data-creation model (#2b).
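The steps above can be sketched in base R without any packages, using weighting (one of the step 3 options) rather than matching. Everything here is invented for illustration: a single covariate x drives allocation, and the true treatment effect is 0.5.

```r
# Steps 2-5 with inverse-propensity weighting on invented data.
set.seed(4)
n  <- 2000
x  <- rnorm(n)                                    # a covariate (step 1 input)
tr <- rbinom(n, 1, plogis(x))                     # allocation depends on x
y  <- 1 + x + 0.5 * tr + rnorm(n)                 # true treatment effect = 0.5
ps <- glm(tr ~ x, family = binomial)$fitted       # step 2: propensity scores
w  <- ifelse(tr == 1, 1 / ps, 1 / (1 - ps))       # step 3: weighting
# step 4: balance check -- weighted covariate means should be close
balance <- c(weighted.mean(x[tr == 1], w[tr == 1]),
             weighted.mean(x[tr == 0], w[tr == 0]))
# step 5: estimate the average treatment effect
ate <- weighted.mean(y[tr == 1], w[tr == 1]) -
       weighted.mean(y[tr == 0], w[tr == 0])
ate
```

The raw difference in group means is badly biased upward here, while the weighted estimate lands near the true 0.5; step 4 corresponds to checking that the weighted means of x are similar in the two groups.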
Here the Matching package (Sekhon, 2011) and its Match function are used. To use a package you first must install it (e.g., install.packages("Matching")). This needs to be done only once, so that it is on your computer, plus whenever you want to make sure that you have the most up-to-date version. Next, you load it (e.g., library(Matching)); this you need to do within each R session in which you want to use the package's functions.
The function can be called using just its name (here Match), but if you have several packages loaded, another package might also contain a function named Match. If you are not sure, it is safest to call it as Matching::Match. In the propensity formula, Covs may or may not include Pre, depending on what the analyst believes. The propensity score estimation is done with a logistic regression, and the matching and estimation steps are done by the Match function. The goal of this paper is not to argue for or against a particular propensity matching approach; there is still much debate about this. Leite's (2017) coverage seems well balanced.

library(Matching)
propval <- glm(Treatment ~ Covs, family = binomial)$fitted
PrMatmod <- Match(PostTest, Treatment, propval, estimand = "ATE", BiasAdjust = TRUE, ties = TRUE)

The Match function has several options (type ?Match in R to see these once the package is installed). The estimand = "ATE" means the estimate is the average treatment effect: the average estimated difference due to treatment for both those given the treatment and those not given the treatment. The user also has the option to estimate the treatment effect separately for those in the treatment condition or for those in the control condition. These would be valuable for different applied problems: the former to estimate the effect for those likely to sign up for the treatment, and the latter the value if you could encourage those who would not normally sign up to do so. If you type summary(PrMatmod) the computer will print the results. Further information can be extracted from this object. Type PrMatmod$est to get the estimated treatment effect; to see other information that can be extracted, type str(PrMatmod). Propensity matching is usually used with many variables to predict the group allocation, but for illustration here it is done with just pre. Here the results are consistent with the ANCOVA approach, showing a positive effect from the treatment.¹

library(Matching)
propval <- glm(group ~ pre, family = binomial)$fitted
PrMatmod <- Match(post, group, propval, estimand = "ATE", BiasAdjust = TRUE, ties = TRUE)
summary(PrMatmod)

Propensity matching methods can be used in conjunction with the gain score and ANCOVA approaches. For the gain score approach, the gain score would simply be used as the dependent variable. Propensity matching can also be combined with ANCOVA: after either matching or weighting by the propensity scores, additional covariates (including those used to estimate propensity) are used to predict the outcome.
This approach, sometimes called doubly robust, can increase the power for detecting a treatment effect. These extensions are not explored in the current paper. Further, there are many other methods that can be used for addressing inadequacies of the basic ANCOVA (e.g., non-linearity, measurement error, clustered data).
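A minimal sketch of the doubly robust idea, with invented data (the true effect is again 0.5): the outcome regression includes the covariate and is run with inverse-propensity weights, so a correct specification of either the propensity model or the outcome model protects the treatment estimate.

```r
# Doubly robust sketch on invented data: weighted ANCOVA-style regression.
set.seed(5)
n  <- 2000
x  <- rnorm(n)
tr <- rbinom(n, 1, plogis(x))                    # allocation depends on x
y  <- x + 0.5 * tr + rnorm(n)                    # true treatment effect = 0.5
ps <- glm(tr ~ x, family = binomial)$fitted      # propensity scores
w  <- ifelse(tr == 1, 1 / ps, 1 / (1 - ps))      # inverse-propensity weights
drmod <- lm(y ~ tr + x, weights = w)             # outcome model, weighted
unname(coef(drmod)["tr"])                        # near the true 0.5
```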

For more information about conducting propensity matching in R see Leite (2017) and Matloff (2017).

Examples using Simulation
Four simulations are presented as examples of how a researcher might go about deciding which approach to use. The key steps are: a) create a set of plausible data models, b) create lots of data sets from these models, c) conduct the statistical procedures that you wish to compare on these data sets, and d) compare the findings. These examples illustrate this approach, but the topics were chosen because they are important in deciding among procedures: how people are allocated to groups, and colliders.

Varying How Students are Allocated into Groups
Simulation methods are used here to explore potential bias for the three statistical procedures under the three data models shown in Figure 1. R (R Core Team, 2019, Version 3.6.1) is used. Other software (e.g., Python, SAS, SPSS, Stata) could have been used, but using R allows use of its propensity matching packages (e.g., Keller & Tipton, 2016) and it is freely available to all readers. Simulation methods allow the data-creation models to be varied to examine further research questions. In this example, the relationships among Knowledge, Pre, and Treatment are varied.

Example #1a: Having the true effect equal zero
Data were created for each of the models depicted in Figure 1 and the three statistical models applied. For the gain score model, Treatment was used to predict the difference in the two scores (equivalent to a t-test on the gain scores). For the ANCOVA, Treatment and Pre were used to predict Post. For the propensity matching, Pre was used to predict Treatment, propensity values were taken from this, and Sekhon's (2011) Match function was used to estimate the effect of treatment. Normally propensity matching is used with a larger number of variables. This is done in Example #2.
To make these results easier to interpret, in Example #1a the true treatment effect is zero. Thus, the correct answer for all of these models is β1 = 0, so the mean of unbiased t and z values should be near zero. In #1b different effect sizes are used. Values for the latent variable Knowledge were drawn from a unit Normal distribution, i.e., Knowledge ∼ N(µ = 0, σ = 1). Pre was drawn as Knowledge plus Normal error, then standardized so that it has a mean of 0 and standard deviation of 1. Propensity was drawn as Knowledge plus Normal error, Pre plus Normal error, and Pre + Knowledge plus Normal error, for panels A, B, and C, respectively, of Figure 1. Each of these was standardized. Treatment was decided by a Bernoulli process with probability equal to the propensity variable, so like the flip of a weighted coin. Post was drawn as Knowledge plus Normal error, then standardized. It is not affected by Treatment, so the true treatment effect is zero. The three statistical models were estimated. This was repeated 10,000 times so that the estimates are quite precise. The code is in the Appendix. Table 1 shows the mean t (from the gain score and ANCOVA output) and z values (from the propensity matching output) for this simulation. Because this is a simulation, it is known that the true effect is 0. The gain score model provides unbiased estimates for Figure 1A, where Pre does not influence the propensity to be in the treatment group and therefore does not influence being in the treatment group. The other two statistical methods provide biased estimates here. They both suggest the treatment had a negative effect. The opposite occurs for data produced according to Figure 1B, where Knowledge does not affect propensity. Here ANCOVA and propensity matching produce mean estimates near zero, but the gain score procedure is biased, suggesting a positive effect for the treatment.
When both Pre and Knowledge affect Propensity, as in Figure 1C, the gain score procedure estimates a positive treatment effect while the ANCOVA and propensity matching procedures estimate negative treatment effects. The findings for the gain score and ANCOVA procedures are consistent with Wright (2006) and the graphical models of Pearl (2016). See also the discussion in Holland & Rubin (1983) and Steiner et al. (2011). The researcher would need to decide which of the data-creation models in Figure 1 is most appropriate in order to decide whether any of these three statistical models should be used. Some methods for this are described in Lockwood & McCaffrey (2020).

Example #1b: Varying the True Effect Size
Example #1a shows whether each procedure produces unbiased estimates when the true effect is zero. Here this is extended to negative and positive effect sizes, which allows the procedures' power to detect effects to be examined. The simulation was repeated, but a treatment effect was included when creating the Post values. The treatment effect was drawn from a uniform distribution from -1 to +1. For those in the treatment condition this was added to the knowledge latent variable (distributed N(µ = 0, σ = 1)) and a normally distributed error term. The resulting variable was standardized to have a mean of 0 and standard deviation of 1. One hundred thousand replications were done for each situation (A, B, or C from Figure 1) by procedure (gain score, ANCOVA, or propensity matching). Figure 3 was made by predicting the test statistic from the true effect size. The loess procedure was used with a span of only 25% and a quadratic curve so that the lines are smooth. The left panel shows situation A, where Pre scores have no influence on propensity to receive treatment. Here the gain score has a mean test statistic of zero when there is no true effect, negative means with negative true effects, and positive means when the true effect is positive. The curve is convex and has a slope less than one, meaning the estimates are biased towards zero. The ANCOVA and propensity matching procedures give virtually the same estimates, so are shown with a single curve. Their estimates are too low, though because the curve is also convex it does intercept the no-bias line. For situation B, where the Pre score does influence propensity to receive treatment, the ANCOVA and propensity matching perform well and the gain score is more biased, overestimating the treatment effect (again, it is overestimated because low Pre scores had the higher propensity for being in the treatment). The curves are all slightly convex with slopes less than one.
For situation C, the gain score procedure and the ANCOVA and propensity matching procedures are biased in opposite directions. This shows the importance of understanding the allocation mechanism when deciding how to analyze data.

Causal Models, Colliders, and More Covariates
Causal models, and in particular something called a collider, are the focus of this second example. A common example used to introduce colliders is the smoking-birth weight paradox (e.g., Pearl et al., 2016). Medical researchers are interested in how a mother smoking affects many things, including infant mortality and birth weight. Infants of smokers appear to have higher infant mortality rates, but researchers found that conditioning on birth weight can reverse this effect. Researchers speculated that smoking might somehow protect low-weight infants. However, examining the possible causal models of this situation, which Pearl et al. (2016) do, reveals the likely reason. Figure 4 shows observed variables for smoking, birth weight, and infant mortality. There are other causes for both low birth weight and infant mortality, and these are depicted with an unmeasured variable called Other. Because many of these conditions affect infant mortality more than smoking does, conditioning on birth weight means you are comparing infants of mothers who smoked with infants who have conditions carrying higher infant mortality rates.
If the purpose is to measure the direct effect of smoking on infant mortality (Smoking → Mortality), it is important to consider backdoor (also called indirect) paths between Smoking and Mortality. A path is any set of edges between two nodes where no node is included more than once (so non-recursive). There are two backdoor paths in Figure 4. Paths can be either blocked or unblocked. If they are unblocked it means information can flow along them, confounding measurement of the direct path. Therefore, often a goal of choosing covariates is to block backdoor paths. How detrimental the effects of an unblocked backdoor path are depends on the product of the path coefficients (Loehlin, 1998). To understand blocking paths it is necessary to consider three ways in which three variables, call them X = eat cake, Y = be happy, and Z = smile, can be causally related:
Chain: X (eat cake) → Y (be happy) → Z (smile),
Fork: X (eat cake) ← Y (be happy) → Z (smile), and
Collider: X (eat cake) → Y (be happy) ← Z (smile).
Pearl (2009, pp. 16-17) describes two rules to determine if the path between two nodes, X and Z, is blocked and how this is affected by conditioning on the middle variable Y.
Pearl's first rule is that if a path contains a chain or a fork, it is unblocked unless the middle variable is conditioned upon. Chains and forks are associated with the phrases mediation and spurious correlation in the education literature. An example of an unblocked chain is that an exercise pamphlet can lead to students planning to exercise, and this can lead to more exercise (Hill et al., 2007). If students are prevented from the planning phase, then giving participants a pamphlet is not as effective. A common textbook example of a fork is the positive association between ice cream consumption and murder in cities (Peters, 2013; see also Vigen, 2015). Warm weather causes both of these to increase. If you could condition on the weather by looking at one particular kind of weather (e.g., sunny and 83°F), this would block the path and, assuming no other effects are present, the correlation would be near zero. In Figure 4, the path Smoking → Weight → Mortality includes a chain (middle variable Weight) and therefore begins unblocked. Much of the justification for using covariates is to block paths like this. One way to block this path would be to condition upon Weight. While it may seem useful to block this path, if the goal is to measure the complete causal effect of smoking, this path simply shows how the effects of smoking may be partially mediated by birth weight.
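The fork example can be demonstrated in a few lines of simulated data: ice cream consumption and murders are both driven by weather (all variables invented and standardized in spirit), so they correlate, but holding weather constant (here by residualizing both on it) removes the association.

```r
# A fork: weather drives both variables, inducing a spurious correlation.
set.seed(6)
n <- 5000
weather  <- rnorm(n)                      # the common cause
icecream <- weather + rnorm(n)
murder   <- weather + rnorm(n)
r_marginal <- cor(icecream, murder)       # sizable positive correlation
r_partial  <- cor(resid(lm(icecream ~ weather)),
                  resid(lm(murder ~ weather)))   # near zero
c(marginal = r_marginal, partial = r_partial)
```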
Pearl's second rule is that a path with a collider begins blocked, but is unblocked if the collider, or any variable influenced by it (called descendants in graph theory terminology), is conditioned upon. Colliders are less discussed in the education literature than forks and chains. Wright (2017) uses a river metaphor to describe a collider. Imagine two tributaries arriving from different directions at a deep sinkhole. The hole is the collider. The water from each would not reach the other; the path is blocked. If the hole is filled, water could flow between the tributaries and the path would be unblocked. Weight is a collider in Smoking → Weight ← Other → Mortality because it is influenced by both smoking and other causes. The method used above for blocking the first path (conditioning on Weight) unblocks this second path. It is unblocking this second path that creates the illusion that smoking decreases infant mortality (illusory if one thinks the ANCOVA measures the effect of smoking on infant mortality). The smoking-birth weight paradox example was shown because it clearly shows the effects of conditioning on a collider.
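The collider rule can also be demonstrated by simulation, in the spirit of the smoking-birth weight example. All coefficients below are invented for effect: smoking has a genuinely harmful direct effect, but the Other causes are rarer and far more dangerous, and both lower birth weight.

```r
# Conditioning on a collider (Weight) can flip the sign of an effect.
set.seed(7)
n <- 100000
smoke  <- rbinom(n, 1, 0.3)
other  <- rbinom(n, 1, 0.1)                        # rarer, far more dangerous
weight <- -1 * smoke - 3 * other + rnorm(n)        # both lower birth weight
die    <- rbinom(n, 1, plogis(-4 + 0.3 * smoke + 4 * other))
b_uncond <- coef(glm(die ~ smoke, family = binomial))["smoke"]
b_cond   <- coef(glm(die ~ smoke + weight, family = binomial))["smoke"]
c(unconditional = unname(b_uncond), conditional = unname(b_cond))
```

Unconditionally the smoking coefficient is positive, matching its true direct effect; adding the collider Weight to the model flips its sign, manufacturing the "protective" effect.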
The causal model in Figure 4 is similar to ones in education. Wright (2017) provides an education example involving a collider that is measured before the treatment. Figure 5 shows the data-creation model that could be assumed when educational systems attempt to estimate school effectiveness. Suppose this is for estimating the effectiveness of a 9th grade class. The effectiveness of the school is influenced by environmental factors like the economics of the neighborhood. These also influence previous schooling and therefore grades from earlier years (denoted Pre). Characteristics of the student and their family also influence grades before the 9th grade and afterwards (denoted Post). The critical edge for estimating school effectiveness is the one from School → Post. The backdoor path runs from School back through Environment to Pre, and from Pre through the student and family characteristics to Post. The variable Pre is a collider on this path, so conditioning on it unblocks the backdoor path, thereby causing problems for estimating the direct effect. Using methods often applied in the US to measure effectiveness, Wright (2017, 2018) showed that the estimated effectiveness can be negatively correlated with true effectiveness.

Example #2a: How Manifest and Latent Variables Relate
The example used in this section is more complex than Examples #1a and #1b, to reflect, at least to some extent, the complexity of causal models applicable for many education research situations. It is common to measure several other variables with the hope of blocking (and keeping blocked) all backdoor paths, thereby isolating the direct effect. Figure 6 shows the data-creation model assumed. There are three latent variables (Environment, Knowledge, and Grit) and two key observed variables (whether the person had the Treatment and the Post score). It is assumed here that Treatment is influenced by the Environment the student is in (e.g., geographic location). In addition, it is assumed that there are three measured variables related to each of the latent variables. For the simulation each of these observed variables has been created to be correlated approximately r = .5 with its associated latent variable, but how they are associated with the latent variable differs. One influences the latent variable (e.g., e → Environment), one is influenced by the latent variable (e.g., Pre ← Knowledge), and for the other one, another variable (depicted just with a circle) influences both (e.g., g ← # → Grit). While this model looks complex, like all models in social science it is still a simplification. Several of the nodes listed likely influence others (e.g., Environment → Grit; Grit → Pre), there are many other constructs that could play a role, and there are many other variables that could be measured. It is worth noting that observed variables are not placed along the paths from Treatment, Knowledge, and Grit to Post, and there are no nodes just affecting Treatment. These effects could provide additional ways to measure the treatment effect. There are several ways to address measurement of interventions and readers are encouraged to consult textbooks devoted to this (e.g., Imbens & Rubin, 2015; Morgan & Winship, 2015; Pearl et al., 2016).
The variable Pre is named in Figure 6, as opposed to just being called x, because of its central role in the gain score model. It is assumed that it is on the same scale as Post (both standardized, with means of zero and standard deviations of one). It is assumed that Post − Pre is meaningful and that this difference represents the same gain throughout the span of Pre. This is a vital assumption for the gain score approach. Pre is influenced by Knowledge; in graph theory terminology it is a descendant of Knowledge. According to Pearl's second rule, conditioning on Pre unblocks the path Treatment ← Environment → Knowledge ← Other → Post because it is a descendant of a collider (Example #2b explores this further). As with Example #1a, the data were created so that the true treatment effect is 0. Post is the sum of Knowledge, Grit, and a N(µ = 0, σ = 1) error term, and then standardized. Creating the observed variables so that they have correlations of approximately r = .5 with their associated latent variable required a few steps (this could be done in several ways). These will be shown for k1, Pre, and k3. kph is the unnamed latent variable influencing both k3 and Knowledge (know in the code). The R code is:

    kph <- rnorm(n)
    k1 <- rnorm(n)
    k3 <- scale(kph + .71*rnorm(n))
    know <- scale(.85*kph + .6935*k1 + .6*other + .6*env)
    pre <- scale(sqrt(.25)*know + sqrt(.75)*rnorm(n))

First, the variables kph and k1 are drawn from unit Normal distributions. k3 is created by adding kph to a unit Normal variable multiplied by slightly less than one, and this sum is standardized. The latent variable know is the standardized sum of kph, k1, other, and env, each weighted to produce the desired correlation. Pre is the standardized sum of weighted know and a Normal variable. The weights were chosen so that the correlation of each observed variable with its latent variable was .5. As shown in Example #2b, the results are sensitive to these weights.
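These weights can be checked numerically. The following self-contained sketch (assuming env and other are independent unit Normals, consistent with the description above) verifies that k1, k3, and Pre each correlate about .5 with know when the sample is large.

```r
# Sanity check of the stated r = .5 correlations with know.
set.seed(3)
n <- 1e5
env <- rnorm(n); other <- rnorm(n)   # assumed unit Normals
kph <- rnorm(n)
k1 <- rnorm(n)
k3 <- scale(kph + .71*rnorm(n))
know <- scale(.85*kph + .6935*k1 + .6*other + .6*env)
pre <- scale(sqrt(.25)*know + sqrt(.75)*rnorm(n))

round(c(cor(k1, know), cor(k3, know), cor(pre, know)), 2)
# each correlation is approximately .50
```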
The propensity to be in the treatment condition was the quantile (the qrank function in the Appendix) of the Environment variable (env). The contributions from Environment and Other to Knowledge are equal, and a Normally distributed random error is also added. The code is in the Appendix.
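The qrank function itself is in the paper's Appendix and is not reproduced above. One common definition, assumed here for illustration, scales ranks into the open interval (0, 1) and uses them as treatment probabilities:

```r
# Hypothetical sketch of quantile-based treatment allocation
# (qrank's exact definition is in the paper's Appendix; this is
# one common form).
qrank <- function(x) rank(x) / (length(x) + 1)

set.seed(4)
n <- 1000
env <- rnorm(n)
p_treat <- qrank(env)            # propensity increases with env
treat <- rbinom(n, 1, p_treat)   # stochastic allocation
mean(treat)                      # about .5 overall
```

With this definition, students in more advantaged environments are more likely, but not certain, to receive the treatment, which is what creates the confounding the statistical procedures must address.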
The simulated studies each used the same sample size n, and there were 10,000 replications per condition. This number of replications was used so that the standard error for the z value for each statistical model was approximately 0.01 (and thus the widths of the 95% confidence intervals were about 0.04).
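For reference, the three procedures can be estimated in R along the following lines. This is a sketch: the data set, variable names, and MatchIt settings are assumptions for illustration, not the paper's exact code.

```r
# Sketch of the three analyses on simulated data with a zero
# true treatment effect (names x1, x2 are illustrative).
# install.packages("MatchIt")  # if needed
library(MatchIt)

set.seed(5)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n); pre <- rnorm(n)
treat <- rbinom(n, 1, plogis(x1))
post <- pre + rnorm(n)          # no treatment effect
dat <- data.frame(post, pre, treat, x1, x2)

# Gain score: model the change from pre to post.
m_gain <- lm(I(post - pre) ~ treat, data = dat)

# ANCOVA: condition on pre and the other covariates.
m_ancova <- lm(post ~ treat + pre + x1 + x2, data = dat)

# Propensity matching: match, then compare the matched groups.
m_match <- matchit(treat ~ pre + x1 + x2, data = dat,
                   method = "nearest")
md <- match.data(m_match)
m_out <- lm(post ~ treat, data = md, weights = weights)
```

The treatment coefficients in m_gain, m_ancova, and m_out are the quantities whose mean t and z values are tabulated in the simulations.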
The mean t (for gain score and ANCOVA) and z (for propensity matching) values for the treatment are shown in Table 2. The first line shows the mean values when all nine observed variables are used as covariates. The mean for the gain score model is approximately zero, so it provides nearly unbiased estimates. It is important, however, not to conclude that the gain score method works well for the data-creation model in Figure 6. The analyst should test the sensitivity of these findings by varying the strength of the different relationships (Leite's [2017] sixth step). This is shown in Example #2b.
Both the ANCOVA and propensity matching procedures are biased, suggesting the treatment has a negative effect. As with Examples #1a and #1b, the ANCOVA procedure is more biased than propensity matching. Next, sets of covariates were excluded. For the gain score model, excluding any variable created a slight downward bias. For the ANCOVA and propensity matching procedures, excluding the three observed variables related to Environment increased the bias further. The increased bias, for both these statistical procedures, was most evident for two of the e variables, but excluding each of the three increased the bias. This shows all three of these variables are useful to include in this situation for these statistical procedures.

Table 2. The mean t (gain score and ANCOVA) and z values when using all the covariates, excluding each set of three covariates, and excluding each individual covariate for Example #2a. The data were created so that the true treatment effect was zero.
Excluding the three observed variables associated with Knowledge slightly increased the bias for ANCOVA and slightly decreased it for propensity matching. However, the effects of excluding the three individual observed variables differed. For both ANCOVA and propensity matching, excluding k1 and k3 decreased the bias, but excluding Pre increased the bias, substantially so for ANCOVA. Excluding all the Grit variables decreased the bias, though excluding just one of the g variables had only minimal effect for ANCOVA, and none of these greatly affected the propensity matching bias.
Example #2b: Improving the Pre score as a measure of Knowledge

Leite's (2017) sixth step for propensity matching, exploring the sensitivity of your conclusions to variations, is relevant for all simulations. Example #2a appears to show that the gain score model may be better for the data-creation model in Figure 6. However, it is important to examine how this conclusion is affected by changing the weights used to create the data.
Here only one variation is chosen. Many people create assessments to accurately measure Knowledge and may combine many such assessments into one Pre score. This aggregate score might correlate with Knowledge more highly than the r = .5 in Example #2a. Only one change was made to the code: the weights in the line creating pre were altered so that the correlation between Pre and Knowledge increases. The simulation was repeated and the results are shown in Table 3. Now the gain score model produces positively biased estimates, and the ANCOVA and propensity matching procedures produce negatively biased estimates, though these are not as extreme as in Example #2a. As with the previous example, the propensity matching method is less biased than ANCOVA.

Table 3. The mean t (gain score and ANCOVA) and z values when using all the covariates, excluding each set of three covariates, and excluding each individual covariate for Example #2b. The data were created so that the true treatment effect was zero.
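The exact replacement line is not reproduced here. One hypothetical way to raise the Pre-Knowledge correlation, an assumption for illustration and not necessarily the paper's weights, is to shift more of the weight onto know:

```r
# Hypothetical weights (not necessarily the paper's): with equal
# weights of sqrt(.5), cor(pre, know) is about .71 rather than .5.
pre <- scale(sqrt(.5)*know + sqrt(.5)*rnorm(n))
```

Because the weights on know and the error term sum in squares to one, the weight on know equals the implied correlation with Knowledge, which makes it easy to dial the measurement quality up or down in sensitivity analyses.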

Summary
Gain score, ANCOVA, and propensity matching procedures are all used with the hope of isolating the causal effect of an intervention on an outcome measure. Brief introductions were provided for these and included R code and output. No statistical procedure can guarantee to isolate this effect in all situations, but each can be useful depending on the model that led to the data. When more effects are added to the data-creation model it can become impossible to block all backdoor paths with any statistical procedure. There are situations where no statistical procedure can accurately estimate a treatment effect. The goal is to choose which procedure is most accurate and to warn readers about this limitation. It is important to make readers aware of the assumptions made when trying to reach causal conclusions.

Four simulations were presented to provide examples for how researchers could decide among these procedures. The overall purpose of this paper is to show that it is important to consider how the data may have arisen, and Examples #1a and #1b focus on understanding how people are allocated to groups. If there are several different plausible models that could account for the data, it is worth examining several of these using simulation methods, as was done here. Examples #1a and #1b showed that there are situations where the gain score model does better than the ANCOVA and propensity matching procedures, and situations where it does worse. In particular, if the covariate influences the propensity to be in the treatment condition, the gain score method is biased, but the ANCOVA and propensity matching methods are not. The results show the opposite is true when other variables influence propensity, but the covariate does not. These findings are consistent with Holland & Rubin (1983); Pearl (2016); Wright (2006). The purpose of Examples #2a and #2b was to show that the choice of covariates is important.
They showed that it is important to consider the way in which different observed variables may relate to the latent variables.
Unfortunately, the analyst will not know the true causal model underlying the data. Assumptions must be made in order to make causal conclusions (Cartwright, 2014). This is particularly true when random assignment is not used. The analyst can try several causal models to test whether the statistical approach is sensitive to these changes (Leite's sixth step). Researchers should be prepared for reviewers and readers to propose their own causal models. Enough information should be provided to allow these peers to simulate data to show whether the statistical procedure used would be appropriate for their choice of causal model. When there are differences, it is often necessary to conduct further research designed to evaluate which set of causal models is more appropriate. This can be time-consuming. Statistical procedures for intervention studies without random allocation are not simple. Campbell & Stanley (1963) describe several threats to validity. Some, but not all, of these are eased by using random assignment, which they encourage when practical.
It is important to conclude by stressing that plotting the data and exploratory/descriptive data analysis are an important step (e.g., Figure 2). However, the choice of plots and descriptive statistics is also often influenced by the assumptions made. If you are uncertain which procedures to use, you can use multiple approaches, but you should describe all of these in the write-up (i.e., not use several and report just the one with the p-value most likely to lead to publication). Steegen et al. (2016) take this to an extreme: they advise analysts to run many potential models and then report the distribution of results, weighted for plausibility. Here, the advice is to first think about how the data may have arisen, to try to limit the choice to a small number of plausible models.