Identification of the causal effect of a treatment T on an outcome Y in observational studies is typically based either on the unconfoundedness assumption (also called selection on observables, exogeneity or ignorability; see, e.g. de Luna and Johansson, Imbens and Wooldridge, Pearl) or on the availability of an instrument. The unconfoundedness assumption loosely says that all the variables affecting both the treatment T and the outcome Y are observed (we call them covariates) and can be controlled for. An instrument is usually defined as a variable that affects the treatment T and is related to the outcome Y only through T (and possibly the observed covariates). When available, instruments can be used to identify causal effects in parametric situations. Nonparametric identification is also possible with the help of instruments, and Angrist et al. develop a theory for the nonparametric identification and estimation of local average causal effects. Abadie and Frölich extended these results to the situation where the observed covariates are related to the instrument. Note also that nonparametric identification can be obtained with the related concept of (fuzzy) regression discontinuity designs; see Hahn et al., Battistin and Rettore, Dias et al. and Lee [10, Sec. 5.5.3]. When a causal effect is identified, a test of the unconfoundedness assumption may be devised by comparing the estimate of the causal effect obtained under the unconfoundedness assumption with the estimate obtained using the instrument (the classical Durbin–Wu–Hausman (DWH) test in a parametric setting). This was recently used by Donald et al. to propose a test of the unconfoundedness assumption in a nonparametric framework.
In this paper, we introduce general instrumental conditions under which it is possible to test for the unconfoundedness or exogeneity assumption. The instrumental assumptions are general and, for instance, they do not necessarily yield identification of a causal effect when the unconfoundedness assumption does not hold. Indeed, to obtain the nonparametric identification of a local average causal effect, stronger (and untestable) assumptions must be made on the instrument; see, e.g. Imbens and Angrist, Angrist et al., Angrist and Fernandez-Val, Donald et al. and Guo et al. In particular, these papers use a monotonicity assumption, which says that the instrument must affect the treatment in a monotone fashion, and they do not allow unobserved heterogeneity to affect both the instrument and the treatment. Based on our general instrumental conditions, we propose a statistic to test the unconfoundedness assumption. The proposed test is related to the use of two control groups to test the unconfoundedness assumption, an idea previously used, e.g. in Rosenbaum, de Luna and Johansson and Dias et al. Rosenbaum was probably the first to formalize the idea that two control groups provide information on the unconfoundedness assumption, and described actual observational studies where different control groups were available. One of our contributions in this context is the introduction of general assumptions under which an observed variable can be used to split an available control group in order to test the unconfoundedness assumption nonparametrically. However, the test statistic we eventually propose does not actually require the split to be done.
In Section 4, we present a motivating example where Swedish register data are used to study the causal effect of job practice (JP) on employment. We have access to a rich set of background characteristics on unemployed individuals, although the question remains whether the effect of JP on employment is confounded by unobserved heterogeneity. In this study, unemployed individuals have access to JP through their participation in a labor market program. During 1998 there were two such labor market programs available in Sweden, offering JP with different probabilities. Because we know that the two programs differ essentially only with respect to their propensity to offer JP, participation in the two programs may be assumed to affect employment differently only through JP. We thus argue that program participation fulfills our instrumental conditions. In contrast with usual instrumental assumptions, this allows potential unobserved heterogeneity in the program and JP assignment to be correlated. We apply the introduced test to check whether the estimated effect of JP on employment is biased due to unobservables affecting both JP and employment.
Before treating this motivating example in more detail in Section 4, Section 2 presents the model, introduces instrumental assumptions and develops the theoretical results which then allow us to introduce a test of the unconfoundedness assumption. Section 3 presents a simulation study of the finite sample properties of the proposed test. In particular, one of the designs used illustrates the situation where the monotonicity assumption mentioned above does not hold. The paper is concluded in Section 5.
2 Theory and method
We use the Neyman–Rubin model [16, 17] for causal inference when the interest lies in the causal effect of a binary treatment T, taking values in $\{0, 1\}$, on an outcome. Let us thus define $Y(0)$, $Y(1)$, called potential outcomes. The latter are interpreted as the outcomes resulting from the assignment $T = 0$, $T = 1$, respectively. We then observe $Y = T\,Y(1) + (1 - T)\,Y(0)$. Let us also assume that we observe a set of variables which are not affected by the treatment assignment. We will need to distinguish in particular $X$ and Z, two vectors of such variables, the latter of dimension one.
For , we consider () as a random vector variable with a given joint distribution, from which a random sample is drawn. Population parameters that are often of interest in this context are the average causal effect and the average causal effect on the treated or on the non-treated .
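For reference, these parameters are commonly written as follows in potential-outcome notation. The symbols $\tau$, $\tau^{t}$ and $\tau^{nt}$ are labels assumed here for illustration and need not coincide with the notation used in the rest of the paper.

```latex
\begin{align*}
\tau      &= E\{Y(1) - Y(0)\}               && \text{(average causal effect)} \\
\tau^{t}  &= E\{Y(1) - Y(0) \mid T = 1\}    && \text{(average causal effect on the treated)} \\
\tau^{nt} &= E\{Y(1) - Y(0) \mid T = 0\}    && \text{(average causal effect on the non-treated)}
\end{align*}
```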
(A.1) For ,
The common support assumption can be investigated by looking at the data. The unconfoundedness assumption may be considered as realistic in situations where the set of characteristics is rich enough, and when there is subject-matter theory to support the assumption.
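A common way of formalizing unconfoundedness together with common support is sketched below in conditional independence notation [Dawid]. It is given as a standard formulation under the assumption that the covariate vector is denoted $X$, not as the exact statement of (A.1).

```latex
% For a given t in {0, 1}:
\begin{align*}
& Y(t) \perp\!\!\!\perp T \mid X            && \text{(unconfoundedness)} \\
& \Pr(T = t \mid X) > 0                     && \text{(common support)}
\end{align*}
% Consequence used for estimation:
\begin{equation*}
E\{Y(t) \mid X\} = E\{Y(t) \mid X, T = t\} = E\{Y \mid X, T = t\}.
\end{equation*}
```

The second display shows why both parts matter: the first equality uses unconfoundedness, and the conditional expectation given $T = t$ is only well defined where the common support condition holds.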
2.2 Instrumental assumptions, test and power
Let us now consider situations where the variable Z takes values in (if not, it may be made dichotomous using a threshold) and fulfills the following assumption.
(A.2) For ,
Assumption (A.2) prohibits (a) a direct effect from Z to the potential outcomes $Y(t)$, i.e. an effect not going through T, and (b) unobserved variables affecting both Z and $Y(t)$. On the other hand, (A.2) allows unobserved variables to affect both Z and T, which is typically prohibited by usual instrumental assumptions [4–6]. Note that when assuming (A.2) in the sequel, Z and $Y(t)$ may also be independent conditional on a subset of $X$, and, e.g. Z may be randomized as discussed after Proposition 1. We also need the following regularity condition.
(A.3) If (A.1) and (A.2) hold, then , for respectively.
Assumption (A.3) is a regularity condition and is violated only in specific situations, of which Example 1 is typical.
Example 1 Let us assume that the vector has joint normal distribution, where U and V are two unobserved covariates and the set of observed covariates is empty. Assume now that the following model generates the data: (1)
where are jointly normal and independently distributed. Let and , where is the indicator function. Figure 1 gives a graphical representation of the model, where are omitted. We can write the conditional expectations where is function of the parameters in (1).
In Example 1, (A.1) and (A.2) will typically be violated, unless we assume that , in which case by joint normality, and thereby and . The constrained parametrization thus yields an example where (A.3) is violated since (A.1) and (A.2) hold while one can check that does not necessarily hold.
This type of example is called unstable [3, Sec. 2.4] in the sense that (A.1) and (A.2) will cease to hold as soon as the parameter values do not fulfill the constraint . Using directed acyclic graphs (see Footnote 1), it can be shown that assumption (A.3) holds as soon as the distribution is stable, where, e.g. a distribution parametrized with a parameter vector is said to be stable if no independence can be destroyed by varying the parameter ; see Pearl [3, Sec. 2.4] for a formal general definition. Note here that (A.3) does not imply any parametrized functional form.
Proposition 1 Assume (A.1)–(A.3), then (2)
Proof. By assumption (A.1) and (A.2) hold. Then, for ,
The conditional independence statement obtained in Proposition 1 is testable from the data when conditioning on (see next section). Finding evidence in the data against (2) yields evidence against the assumptions of the proposition. Thus, evidence against (2) can be interpreted as evidence against the unconfoundedness assumption (A.1) if (A.2) is known to hold from subject-matter considerations – (A.3) being a regularity condition. One application is a random experiment (where Z is a random assignment to a treatment) with restricted compliance T [4, 12]. Another example of application is treated in detail in Section 4. Note that while identification of the causal effect of T on Y may follow from (A.2) with linear models, see, e.g. Pearl [3, p. 248], this is not true in general, and stronger assumptions are needed to obtain nonparametric identification of a causal effect such as, e.g. a local average treatment effect [4–6]. In particular, our result does not rely on two assumptions typically made to obtain such identification; that the instrument must affect the treatment in a monotone fashion and that no unobserved heterogeneity is allowed to affect both the instrument and the treatment.
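The logic of this paragraph can be illustrated with a toy simulation. The sketch below is not the paper's design: it assumes no observed covariates, a randomized binary Z (so that (A.2) holds by construction), an unobserved variable `u` that always affects T, and a switch `confounded` deciding whether `u` also affects the potential outcome under no treatment. All names and parameter values are hypothetical. It compares the mean observed outcome between Z = 1 and Z = 0 within the group with T = 0.

```python
import random
from statistics import fmean


def mean_gap_by_z(confounded, n=200_000, seed=0):
    """Return mean(Y | T=0, Z=1) - mean(Y | T=0, Z=0) in a toy model.

    Hypothetical data generating process (an assumption, not the paper's):
      Z  ~ Bernoulli(0.5), randomized, so Z is independent of Y(0)
      U  ~ N(0, 1), unobserved, always affects T
      T  = 1{Z + U + e > 0}, with e ~ N(0, 1)
      Y(0) = U + noise if confounded, else pure noise
    """
    rng = random.Random(seed)
    y_by_z = {0: [], 1: []}
    for _ in range(n):
        z = rng.randint(0, 1)
        u = rng.gauss(0.0, 1.0)
        t = int(z + u + rng.gauss(0.0, 1.0) > 0.0)
        y0 = (u if confounded else 0.0) + rng.gauss(0.0, 1.0)
        if t == 0:
            # Y(0) is the observed outcome only for units with T = 0.
            y_by_z[z].append(y0)
    return fmean(y_by_z[1]) - fmean(y_by_z[0])


if __name__ == "__main__":
    print("gap, unconfounded:", round(mean_gap_by_z(False), 3))
    print("gap, confounded:  ", round(mean_gap_by_z(True), 3))
```

When `u` does not enter the outcome, the gap is close to zero; when it does, conditioning on T = 0 selects different values of `u` in the two Z groups and the gap moves away from zero, which is the kind of evidence against unconfoundedness that the test looks for.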
For a test based on (2) to have power against (A.1) we further need to have that Z and T are dependent conditional on . This is typically assumed for instrumental variables to be useful for identification. Examples of situations (expressed with directed acyclic graphs; see Footnote 1) for a test that would be based on (2) to have power against (A.1) are given in Figure 2, panels (a)–(c), while panel (d) shows a case where such a test would not have power. A caveat here is that (2) can be tested only when conditioning on . This has no practical consequence if the test rejects this null hypothesis. On the other hand, in cases where (2) is not rejected for , we have no information on whether it is violated for . In independent and related work, Guo et al. [14, eqs (3) and (4)] give an example where (2) holds for although not for , and yet a specific causal effect is identified without the help of Z when the earlier mentioned monotonicity assumption holds.
Different strategies may be adopted to test two null hypotheses given by Proposition 1, i.e.
Note that for , (A.1)–(A.3) need to hold only for and, thus, only is to be tested. Similarly, is relevant when is of interest, while both null hypotheses are relevant for . In this paper we propose a testing strategy2 based on the fact that under and we have and , for all , respectively, where
Given a random sample of n individuals indexed by i, , we consider a nonparametric estimator for , ,
where , , with denoting the cardinality of the set A, and and are nonparametric estimators of and , respectively. The two latter estimates may be obtained by nearest neighbor matching, or any other smoothing technique. Since and , respectively, under and , the test statistics (3)
will then, under the necessary regularity conditions, be asymptotically normally distributed with mean zero and variance one, where is the standard error of , for . For instance, if nearest neighbor matching estimators are used, then the asymptotic theory and in particular can be found in Abadie and Imbens . A subsampling estimator of is also available in this case in de Luna et al. . As noted above, when is of interest, then both hypotheses and are relevant and higher power may be obtained by considering the joint statistic
which is asymptotically distributed, since and are independent.
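As a rough sketch of how such statistics translate into p-values, the helper below computes a two-sided p-value for a single asymptotically standard normal statistic, and a p-value for a joint statistic. The joint form used here, the sum of the two squared statistics compared with a chi-square distribution with two degrees of freedom, is an assumption made for illustration; it is the usual way of combining two independent standard normal statistics, and the function names are hypothetical.

```python
import math


def two_sided_normal_pvalue(stat):
    """Two-sided p-value for a statistic that is N(0, 1) under the null."""
    # P(|N(0,1)| > |stat|) = erfc(|stat| / sqrt(2))
    return math.erfc(abs(stat) / math.sqrt(2.0))


def joint_chi2_pvalue(stat_0, stat_1):
    """Combine two independent N(0, 1) statistics (assumed joint form).

    Returns the sum of squares and its p-value under a chi-square
    distribution with 2 degrees of freedom, whose survival function
    has the closed form exp(-x / 2).
    """
    joint = stat_0 ** 2 + stat_1 ** 2
    return joint, math.exp(-joint / 2.0)


if __name__ == "__main__":
    print(two_sided_normal_pvalue(1.96))
    print(joint_chi2_pvalue(1.0, 1.0))
```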
We should note here that the statistics above are testing conditional mean independence, which is relevant when average causal effects are targeted. Alternatively, one may wish to use tests of conditional independence statements based on all the moments of the underlying distribution , thereby making the methods relevant when quantile or distributional causal effects are of interest.
3 Monte Carlo study
We use a Monte Carlo study to investigate the finite sample properties (empirical size and power) of the test in (3), where K-nearest neighbor matching is used as nonparametric estimator of and , together with the Abadie and Imbens  variance estimator. As noted above, in situations where is of interest and (A.1)–(A.3) are assumed to hold for instead for only , then C could be used instead of thereby increasing the power of the test. As a benchmark we also implement a parametric DWH test, where we first regress T on X and Z and then add the residuals from this fit as a covariate into the outcome equation for Y. The test for the unconfoundedness assumption is then a Wald test on the parameter for the included residual covariate (see, e.g. Wooldridge [26, Chap. 6], and Rivers and Vuong ). We use a robust covariance matrix .
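The benchmark described above can be sketched as a two-step control-function regression. The code below is a minimal illustration under stated assumptions: both steps are fitted by ordinary least squares (a linear first stage even when T is binary), and the robust covariance is the basic heteroskedasticity-consistent sandwich form (HC0), which may differ from the exact variant used in the study. The function names are hypothetical.

```python
import math

import numpy as np


def ols(design, response):
    """Least-squares coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    return beta, response - design @ beta


def dwh_control_function(y, t, x, z):
    """Wald test on the first-stage residual added to the outcome equation.

    Returns (wald_statistic, p_value), the p-value being based on a
    chi-square distribution with one degree of freedom.
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    ones = np.ones((len(y), 1))

    # Step 1: regress T on X and Z, keep the residuals.
    first_stage = np.hstack([ones, x, z[:, None]])
    _, v_hat = ols(first_stage, t)

    # Step 2: outcome equation with the residual as an extra covariate.
    second_stage = np.hstack([ones, t[:, None], x, v_hat[:, None]])
    beta, resid = ols(second_stage, y)

    # HC0 sandwich covariance: (X'X)^-1 X' diag(e^2) X (X'X)^-1.
    bread = np.linalg.inv(second_stage.T @ second_stage)
    meat = (second_stage * resid[:, None] ** 2).T @ second_stage
    cov = bread @ meat @ bread

    wald = float(beta[-1] ** 2 / cov[-1, -1])
    # Survival function of chi-square(1): P(|N(0,1)| > sqrt(wald)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value
```

A small p-value indicates that the residual carries information about the outcome beyond T and X, i.e. evidence against unconfoundedness within the assumed parametric model.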
We consider a data generating process (DGP) which mimics a situation with a randomized assignment to a treatment (Z) with non-perfect compliance ( below), where T denotes the actual treatment assignment, as well as more general situations where the effect of Z on T is allowed to be confounded by unobservables. For unit i, let
We let , , , , , and be independently distributed as . Moreover, we also let and consider two cases for : (homogeneous treatment effect) and (heterogeneous treatment effect). Parameters are varied in the study in order to study the empirical size and power of the test . Five designs, denoted D.1–D.5, are considered and described in Table 1. For the situation where we set (Design D.2), the instrumental variable Z is non-monotone, i.e. there exists individuals j for which and (called defiers), where , , are potential treatment values for individual j when switching to (everything else equal) k equal 0 or 1. The proportion of defiers when is 8.4%. Thus, for design D.2 the monotonicity assumption necessary for the nonparametric identification of the local average causal effect is violated [4–6]. Another assumption for identification made in the latter references is that , and, hence, the instrument does not recover identification in designs D.3 and D.5.
The two tests mentioned above – and DWH – are evaluated in testing the null hypothesis , and empirical size and power of the tests are obtained by letting . K-nearest neighbor matching estimators with and 7 are used to compute , and we restrict X to have common support when conditioning on and . We consider sample sizes n = 500, 1,500 and 3,000. In the continuous response cases, DWH should have correct size when irrespective of whether the instrument is monotone or not, or whether the relation with T is confounded or not. DWH is also expected to have correct size in the binary response case with homogeneous causal effect (). In contrast, DWH is expected to break down in all heterogeneous cases (), since the response model is then misspecified. To our knowledge, no nonparametric test has previously been proposed in the literature for situations in Table 1 where an average causal effect is not nonparametrically identified. On the other hand, using is expected to give correct size and to have power in all situations simulated.
The results from the Monte Carlo simulations are displayed in Figures 3 and 4. The empirical sizes are also displayed in Table 2. The nonparametric test with behaves well with all the DGPs considered, with empirical size close to 5% and power increasing with sample size. Results with other values for K can be obtained from the authors. Empirical sizes were comparable for all K values considered, while power increased with K: significantly so from to and only marginally from to 7. Power is further increased when using C instead of (see Table 3 for design D.1; a similar increase was obtained for the other designs), as expected since the former is based on stronger assumptions. On the other hand, the DWH test has too large an empirical size in the heterogeneous cases (). In the homogeneous treatment setup (), DWH behaves well with respect to its empirical size. This was expected, as noted in the previous section, thereby yielding an interesting benchmark. In such homogeneous cases, the nonparametric test has similar or better power than DWH, except for Design D.1, where DWH is based on correctly specified models. For Design D.2 (non-monotone instrument), has markedly higher power than DWH.
In summary, the results obtained show that the nonparametric test (3) performs well in situations where DWH is consistent. By making fewer assumptions, (3) is also shown to work with non-monotone instruments and instruments whose effect on the treatment is confounded by unobservables, i.e. in situations where a local average causal effect is not identified.
4 Effect of JP
We consider a case study where the interest lies in estimating the effect of JP for unemployed individuals on employment status. JP was offered within two separate labor market training (LMT) programs in Sweden during 1998. One program was run by the regular program provider in Sweden, the Swedish National Labor Market Board (AMV). The other program was offered by the Federation of Swedish Industries (Swit). To be eligible for the programs, the unemployed individuals had to be at least 20 years of age and enrolled at the public employment service. There was no difference in benefits for the two groups of trainees. The fundamental idea with the Swit program was to increase the contacts between the unemployed individuals and employers by providing JP. In a survey conducted in June 2000 on 1,000 program participants from both programs, 69.5% of the Swit participants and 52% of the AMV participants stated that they obtained access to JP (see Footnote 3). Except for the idea to provide more contacts with employers, the two programs were similar. Both programs tested the individual's motivation and ability before recruitment by similar selection procedures (see Johansson for a thorough description of the selection). The types of courses given within the Swit and the AMV programs are displayed in Table 4. The similarities of the two programs are apparent. Thus, despite differences in procurement between the two organizations (Swit and AMV), there do not seem to be any large differences between the types of LMT courses offered, nor in the selection of participants. The fact that the programs distinguish themselves only with respect to JP availability leads us to expect that the effect of LMT program choice on labor market outcome should differ only through the effect of JP. This suggests that LMT program choice has the property (A.2) of an instrument for JP.
Based on the survey, one can see in Table 5 that there is a statistically significant 18.1 percentage points difference in employment six months after leaving the program (the two programs have the same average length) when comparing individuals having JP with those without. In the table we have some individual background variables: (i) education, (ii) work handicap (see disabled), (iii) gender (1 if man and 0 if woman) and (iv) immigration status (1 if immigrant, 0 otherwise). Finally, since the propensity of receiving JP is higher in larger labor markets, which also have better labor market opportunities, we need to control for region of residence in the estimation of an effect of JP. Sweden was divided into four regions: Stockholm, Skåne, Västra Götaland and the rest of the country. Stockholm, Skåne and Västra Götaland are the three regions with the largest population. Note that we have good reasons to assume that the two LMT programs differ only in their JP prospects; thus, if the labor market opportunities affect the access to the LMT programs, this does not invalidate their use as an instrument for JP.
We can see some average differences between the two samples. Those with JP are (i) less often disabled and (ii) less likely to live in Stockholm. The level of education also differs: they have on average more compulsory and upper secondary education but also less college education than those with no JP. Based on these average differences, it is difficult to argue that those with JP have better labor market prospects than those without JP. The single factor suggesting that the JP population has better labor opportunities without JP is that they are less likely to be disabled. In order to further study the selection into JP, we used the covariates from the table and estimated a logistic regression model (a propensity score) including only main effects. The results from this estimation (not displayed) are that individuals who are from Stockholm or Västra Götaland, and those who are disabled, are less likely to receive JP. There are, for instance, no statistically significant (5% level) differences in education between the two groups. Figure 5, left panel, displays the estimated propensity scores. The latter gives evidence for the common support assumption in (A.1). In order to investigate the related assumption included in (A.2), we also fit the probability of getting into Swit versus AMV with a logistic regression including main effects, and Figure 5, right panel, also provides evidence for the latter assumption.
Because there are 969 JP (treated) individuals for only 528 non-treated individuals an estimate of the average causal effect of JP on the treated (ACT), , will typically suffer from severe bias due to difficulties in finding matches to the treated. Thus, we estimate instead the average causal effect of JP on the non-treated (ACNT), . A reasonable assumption is that individuals with higher than average return from JP are the ones who select themselves into JP. This means that ACNT yields a lower bound for ACT, .
Assumption (A.1) needs to be fulfilled only for in order for us to estimate ACNT, i.e. where the covariates are displayed in Table 5. A nearest neighbor matching estimator using the minimum Mahalanobis distance between the covariates of Table 5 is used to estimate the parameter , yielding 12 percentage points, with a standard error [23, Theorems 6 and 7] estimated at 5 percentage points. Hence, there is a significant effect of JP.
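The point estimate described above can be sketched as follows. This is a minimal illustration, not the code used in the study: for every non-treated unit it imputes the missing outcome under treatment by averaging the outcomes of the K treated units closest in Mahalanobis distance, and then averages the imputed-minus-observed differences. The function name is hypothetical, ties are broken arbitrarily, and the standard error of Abadie and Imbens is not implemented.

```python
import numpy as np


def knn_match_acnt(y, t, x, k=1):
    """K-nearest-neighbor matching estimate of the effect on the non-treated.

    y: outcomes, t: binary treatment indicator, x: covariates (n x p).
    Matching is with replacement, using the Mahalanobis distance based
    on the covariance of x in the full sample.
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t).astype(bool)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)

    # Inverse covariance (pseudo-inverse guards against singular cases).
    v_inv = np.linalg.pinv(np.atleast_2d(np.cov(x, rowvar=False)))

    x_control, y_control = x[~t], y[~t]
    x_treated, y_treated = x[t], y[t]

    # Squared Mahalanobis distance between every control and treated unit.
    diff = x_control[:, None, :] - x_treated[None, :, :]
    dist2 = np.einsum("ijp,pq,ijq->ij", diff, v_inv, diff)

    # Indices of the k closest treated units for each control unit.
    nearest = np.argsort(dist2, axis=1)[:, :k]
    y1_imputed = y_treated[nearest].mean(axis=1)

    return float(np.mean(y1_imputed - y_control))
```

The pairwise distance array has one entry per control-treated pair, which is fine for samples of the size considered here (969 treated and 528 non-treated) but would need a tree-based search for much larger data sets.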
4.1 Testing the unconfoundedness assumption
We test for the null hypothesis using in (3). Nonparametric estimation is performed with nearest neighbor matching on the covariates displayed in Table 5 using the minimum Mahalanobis distance, also for computing the standard deviation ; see Abadie and Imbens. The resulting value for the test statistic is 1.31. Hence, we cannot reject the unconfoundedness assumption (p-value of 0.18). We also perform a DWH test by estimating a linear probability model with the discrete covariates displayed in Table 5, yielding a p-value of 0.09. Thus, given the maintained assumption (A.2), neither test can reject, at the 5% level, the null hypothesis that the effect of JP on employment is not confounded, although the DWH test, by making stronger assumptions, has a p-value under 10%.
5 Conclusion
Identification of the causal effect of a treatment on an outcome in observational studies is typically based either on the unconfoundedness assumption or on the availability of an instrument (e.g. Angrist et al.). In this paper, by introducing general instrumental assumptions, we are able to propose an easy-to-use nonparametric test for the unconfoundedness assumption in situations where the same assumptions do not allow for the nonparametric identification of a causal effect. We illustrate the framework introduced with a study of the effect of JP for unemployed individuals on employment, where we argue that an instrument fulfilling our conditions is available through the existence of two LMT programs with different degrees of accessibility to JP.
In many applications, nonparametric identification of causal effects using instruments is non-trivial, e.g. when a non-testable monotonicity property for the instrument must hold [4–6] and/or when a large set of control variables is needed for the instrument to be valid. Using our weaker instrumental conditions, one may test for the unconfoundedness assumption. If the latter is not rejected, this gives the analyst some ground to proceed using an identification strategy based on the unconfoundedness assumption. We have operationalized the theoretical results with a test statistic based on K-nearest neighbor matching estimators. Other nonparametric regression estimators may be used instead, such as, e.g. local polynomial regression and splines. Finally, it is worth noting here that for duration outcomes with censored data, the test proposed herein may be implemented by making use of the matching estimators for censored duration responses presented in Fredriksson and Johansson and de Luna and Johansson.
This paper has benefited from useful comments from Martin Huber, Ingeborg Waernbaum, an editor, an anonymous referee and seminar participants at Johns Hopkins, Maryland University and the third Joint IZA/IFAU Conference on Labor Market Policy Evaluation. De Luna acknowledges the financial support of the Swedish Research Council through the Swedish Initiative for Research on Microdata in the Social and Medical Sciences (SIMSAM), the Ageing and Living Condition Program and grant 70246501. Johansson acknowledges the financial support of the Swedish Council for Working Life and Social Research (grant 2004–2005).
Pearl J. Causality, 2nd ed. Cambridge: Cambridge University Press, 2009.
Hahn J, Todd P, Van der Klaauw W. Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica 2001;69:201–9.
Dias M, Ichimura H, van den Berg G. The matching method for treatment evaluation with selective participation and ineligibles. IFAU Working Papers, 2008:6, Institute for Labour Market Policy Evaluation, Uppsala, 2008.
Lee M-J. Micro-econometrics for policy, program and treatment effects. Oxford: Oxford University Press, 2005.
Angrist J, Fernandez-Val I. ExtrapoLATE-ing: external validity and overidentification in the late framework. NBER Working Paper, 16566, National Bureau of Economic Research, Cambridge, MA, 2010.
Guo Z, Cheng J, Lorch S, Small D. Using an instrumental variable to test for unmeasured confounding. Working Papers, 2013.
Neyman J. Sur les applications de la théorie des probabilités aux experiences agricoles: essai des principes. Roczniki Nauk Rolniczych X 1923:1–51. In Polish, English translation by D. Dabrowska and T. Speed in Stat Sci 1990;5:465–72.
Dawid AP. Conditional independence in statistical theory. J R Stat Soc Ser B 1979;41:1–31.
Lauritzen S. Graphical models. Oxford: Oxford University Press, 1996.
de Luna X, Johansson P, Sjöstedt-de Luna S. Bootstrap inference for k-nearest neighbour matching estimators. IZA Discussion Papers 5361, Institute for the Study of Labor, Bonn, 2010.
Wooldridge J. Econometric analysis of cross section and panel data. Cambridge: MIT Press, 2002.
White H. Maximum likelihood estimation of misspecified models. Econometrica 1982;50:1–25.
Johansson P, Martinson S. Det nationella it-programmet – en slutrapport om swit. Forskningsrapporter, 2000:8, Institute for Labour Market Policy Evaluation, Uppsala, 2000.
de Luna X, Johansson P. Non-parametric inference for the effect of a treatment on survival times with application in the health and social sciences. J Stat Plann Inference 2010;140:2122–37.
Footnote 1: Directed acyclic graphs, e.g. Figure 1, together with a stable (also called faithful) distribution for the variables are used to describe conditional independence relations between variables; see Lauritzen for a general account of graphical models and de Luna et al. for their use together with potential outcomes.
Footnote 2: One related strategy could be to use the concept of two independent control groups. Under we can use Z to obtain two independent control groups (one defined by and one by ) for estimating , yielding and , respectively. Under the difference has expectation zero and this forms the basis for a test statistic. However, since we need to compute two nonparametric estimators of , the resulting statistic has poor finite sample properties, for instance, when the covariates have different support in the two control groups created. This has been confirmed in simulation experiments not presented here.
Footnote 3: A detailed description of the survey can be found in Johansson and Martinson. The survey contained a total of 19 questions. These concerned (i) the individual's background, (ii) the individual's labor market training and (iii) the individual's present labor market situation.