
Generalizing empirical findings to new environments, settings, or populations is essential in most scientific explorations. This article treats a particular problem of generalizability, called “transportability”, defined as a license to transfer information learned in experimental studies to a different population, on which only observational studies can be conducted. Given a set of assumptions concerning commonalities and differences between the two populations, Pearl and Bareinboim [1] derived sufficient conditions that permit such transfer to take place. This article summarizes their findings and supplements them with an effective procedure for deciding when and how transportability is feasible. It establishes a necessary and sufficient condition for deciding when causal effects in the target population are estimable from both the statistical information available and the causal information transferred from the experiments. The article further provides a complete algorithm for computing the transport formula, that is, a way of combining observational and experimental information to synthesize a bias-free estimate of the desired causal relation. Finally, the article examines the differences between transportability and other variants of generalizability.

The problem of transporting knowledge from one population to another is pervasive in science. Conclusions that are obtained in a laboratory setting are transported and applied elsewhere, in an environment that differs in many aspects from that of the laboratory. Experiments conducted on a group of subjects are intended to inform policies on a different group, usually more general and in which the studied group is just one of its parts.

Surprisingly, the conditions under which this extrapolation can be legitimized were not formally articulated until very recently [1–3]. Although the problem has been discussed in many areas of statistics, economics, and the health sciences, under rubrics such as “external validity” [4, 5], “meta-analysis” [6–8], “overgeneralization” [9], “quasi experiments” [10], (Ch. 3 [11]), and “heterogeneity” [12], these discussions are limited to verbal narratives in the form of heuristic guidelines for experimental researchers – no formal treatment has been attempted of the practical problem of generalizing across populations posed in this article (see Section 6 for related work).

Recent developments in causal inference enable us to tackle this problem formally. First, the distinction between statistical and causal knowledge has received syntactic representation through causal diagrams [13–16]. Second, graphical models provide a language for representing differences and commonalities among domains, environments, and populations [1]. Finally, the inferential machinery provided by the do-calculus [13, 16, 17] is particularly suitable for combining these two advances into a coherent framework and developing effective algorithms for knowledge transfer.

Armed with these tools, we consider transferring causal knowledge between two populations $\Pi$ and $\Pi^*$. In population $\Pi$, experiments can be performed and causal knowledge gathered. In $\Pi^*$, potentially different from $\Pi$, only passive observations can be collected but no experiments conducted. The problem is to infer a causal relationship R in $\Pi^*$ using knowledge obtained in $\Pi$. Clearly, if nothing is known about the relationship between $\Pi$ and $\Pi^*$, the problem is trivial; no transfer can be justified. Yet the fact that all experiments are conducted with the intent of being used elsewhere (e.g., outside the laboratory) implies that scientific explorations are driven by the assumption that certain populations share common characteristics and that, owing to these commonalities, causal claims would be valid in new settings even where experiments cannot be conducted.

To formally articulate commonalities and differences between populations, a graphical representation named selection diagrams was devised in Ref. 1, which represents differences in the form of unobserved factors capable of causing such differences. Given an arbitrary selection diagram, our challenge is to decide whether commonalities override differences to permit the transfer of information across the two populations. We show that this challenge can be met by an effective procedure that decides when and how transportability is feasible.

The article is organized as follows. In Section 2, we motivate the problem of transportability using three simple examples and informally summarize the findings of Pearl and Bareinboim [1]. In Section 3, we formally define the notion of selection diagrams and transportability, exemplify how it can be reduced to a problem of symbolic transformation in do-calculus, and provide examples for models that prohibit transportability. In Section 4, we provide a graphical criterion for deciding transportability in arbitrary diagrams. In Section 5, we provide an effective procedure for deciding transportability, which returns a correct transport formula whenever such exists. In Section 6, we compare transportability to other problems of generalizing empirical findings. Section 7 provides concluding remarks.

2 Motivation

To motivate the formal treatment of transportability, we use three simple examples taken from Ref. 1 and graphically depicted in Figure 1.

Example 1. Consider the problem of transferring experimental results between two locations. We first conduct a randomized trial in Los Angeles (LA) and estimate the causal effect of treatment X on outcome Y for every age group $Z = z$, denoted $P(y \mid do(x), z)$. We now wish to generalize the results to the population of New York City (NYC), but we find the distribution $P(x, y, z)$ in LA to be different from the one in NYC (call the latter $P^*(x, y, z)$). In particular, the average age in NYC is significantly higher than that in LA. How are we to estimate the causal effect of X on Y in NYC, denoted $R = P^*(y \mid do(x))$?^{1}

The selection diagram for this example (Figure 1(a)) conveys the assumption that the only differences between the two populations are factors determining age distributions, shown as $S \to Z$, while age-specific effects are invariant across cities. Difference-generating factors are represented by a special set of variables called selection variables S (or simply S-variables), which are graphically depicted as square nodes (■).^{2} From this assumption, the overall causal effect in NYC can be derived as follows^{3}:

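$$
\begin{aligned}
P^*(y \mid do(x)) &= \sum_z P^*(y \mid do(x), z)\, P^*(z \mid do(x)) \\
&= \sum_z P^*(y \mid do(x), z)\, P^*(z) \\
&= \sum_z P(y \mid do(x), z)\, P^*(z) \qquad [1]
\end{aligned}
$$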

Figure 1

Causal diagrams depicting Examples 1–3. In (a) Z represents “age.” In (b) Z represents “linguistic skills” while age (in hollow circle) is unmeasured. In (c) Z represents a biological marker situated between the treatment (X) and a disease (Y).

The last line constitutes a transport formula for R. It combines experimental results obtained in LA, $P(y \mid do(x), z)$, with observational aspects of the NYC population, $P^*(z)$, to obtain an experimental claim $P^*(y \mid do(x))$ about NYC.^{4}
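As a numerical illustration, the sketch below applies the recalibration formula $\sum_z P(y \mid do(x), z)\, P^*(z)$ to made-up numbers. The age groups, the effect sizes, and both age distributions are hypothetical and serve only to show how the same age-specific effects yield different population-level effects once the age distribution changes.

```python
# Hypothetical age-specific effects P(y=1 | do(x=1), z) estimated in the LA trial.
effect_la = {"young": 0.30, "middle": 0.50, "old": 0.80}

# Hypothetical age distributions: P(z) in LA and P*(z) in NYC.
age_la = {"young": 0.50, "middle": 0.30, "old": 0.20}
age_nyc = {"young": 0.20, "middle": 0.30, "old": 0.50}


def recalibrate(effect_by_z, z_dist):
    """Weight the z-specific effects by a distribution over z."""
    if abs(sum(z_dist.values()) - 1.0) > 1e-9:
        raise ValueError("z_dist must sum to one")
    return sum(effect_by_z[z] * z_dist[z] for z in z_dist)


effect_in_la = recalibrate(effect_la, age_la)    # average effect in LA itself
effect_in_nyc = recalibrate(effect_la, age_nyc)  # transported effect for NYC
print(round(effect_in_la, 3), round(effect_in_nyc, 3))
```

Only the weights change between the two calls; the age-specific effects are reused unchanged, which is exactly the invariance assumption encoded by the absence of a selection node pointing to Y.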

Our first task in this article will be to explicate the assumptions that render this extrapolation valid. We ask, for example, what we must assume about other confounding variables besides age, both latent and observed, for eq. [1] to be valid, or whether the same transport formula would hold if Z were not age but some proxy for age, say, “language skills” (Figure 1(b)). More intricate yet, what if Z stood for an exposure-dependent variable, say hypertension level, that stands between X and Y (Figure 1(c))?

Let us examine the proxy issue first.

Example 2. Let the variable Z in Example 1 stand for subjects’ language skills, and let us assume that Z does not affect exposure X or outcome Y, yet it correlates with both, being a proxy for age which is not measured in either study (see Figure 1(b)). Given the observed disparity $P(z) \neq P^*(z)$, how are we to estimate the causal effect $P^*(y \mid do(x))$ for the target population of NYC from the z-specific causal effect $P(y \mid do(x), z)$ estimated at the study population of LA?

Our intuition dictates, and correctly so, that since language skills have no causal effect on either the treatment or the outcome, the proper transport formula would be

$$P^*(y \mid do(x)) = P(y \mid do(x)) \qquad [2]$$

namely, the causal effect is “directly” transportable with no calibration needed (to be shown later on). This will be the case even if the observed joint distribution is the same as in Example 1 where Z stands for age. We see, therefore, that the proper transport formula depends on the causal context in which population differences are embedded, not merely on the joint distribution over the observed variables.

This example also demonstrates why the invariance of Z-specific causal effects should not be taken for granted. While justified in Example 1, with Z = age, it fails in Example 2, in which Z was equated with “language skills.” The intuition is clear. A NYC person at skill level $Z = z$ is likely to be in a totally different age group from his skill-equals in LA and, since it is age, not skill, that shapes the way individuals respond to treatment, it is only reasonable that LA residents would respond differently to treatment than their NYC counterparts at the very same skill level.

Example 3. Examine the case where Z is an X-dependent variable, say a disease bio-marker, standing on the causal pathways between X and Y as shown in Figure 1(c). Assume further that the disparity $P(z \mid x) \neq P^*(z \mid x)$ is discovered in each level of X and that, again, both the average and the z-specific causal effect $P(y \mid do(x), z)$ are estimated in the LA experiment, for all levels of X and Z. Can we, based on the information given, estimate the average (or z-specific) causal effect in the target population of NYC?

Assuming that the disparity in $P(z \mid x)$ stems only from a difference in subjects’ susceptibility to X, as encoded in the selection diagram of Figure 1(c), we will demonstrate in Section 3 that the correct transport formula should be

$$P^*(y \mid do(x)) = \sum_z P(y \mid do(x), z)\, P^*(z \mid x) \qquad [3]$$

which is different from both eqs. [1] and [2]. It calls instead for the z-specific effects to be weighted by the conditional probability $P^*(z \mid x)$, estimated at the target population.
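The weighted recalibration $\sum_z P(y \mid do(x), z)\, P^*(z \mid x)$ can be sketched in a few lines. Everything numeric here is hypothetical: the two treatment levels, the two bio-marker levels, the LA effect table, and the NYC conditional distribution are invented for illustration only.

```python
# Hypothetical LA estimates of P(y=1 | do(x), z), indexed as effect_la[x][z].
effect_la = {
    0: {"low": 0.10, "high": 0.40},
    1: {"low": 0.20, "high": 0.70},
}

# Hypothetical NYC observational estimates of P*(z | x), indexed as [x][z].
marker_given_x_nyc = {
    0: {"low": 0.80, "high": 0.20},
    1: {"low": 0.40, "high": 0.60},
}


def weighted_recalibration(x):
    """Weight the z-specific LA effects by the NYC distribution of z given x."""
    weights = marker_given_x_nyc[x]
    return sum(effect_la[x][z] * weights[z] for z in weights)


transported = {x: weighted_recalibration(x) for x in effect_la}
print({x: round(v, 3) for x, v in transported.items()})
```

Unlike the age example, the weights depend on the treatment level x, because the bio-marker lies downstream of the treatment.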

Figure 2

Selection diagram with two “difference-producing” factors (S and $S'$); the derivation of transportability is more involved using Lemma 1, and it is shown step by step using the algorithm in Section 5.

In these three intuitive examples transportability amounts to simple operations (i.e., recalibration, direct transport, and weighted recalibration); however, in more elaborate examples, the full power of formal analysis would be required. For instance, Pearl and Bareinboim [1] showed that, in the problem depicted in Figure 2, where both the Z-determining mechanism and the U-determining mechanism are suspect of being different, the transport formula for the relation is given by

This formula instructs us to estimate and in the experimental population, then combine them with the estimates of and in the target population. Pearl and Bareinboim [1] derived this formula using the following lemma, which translates the property of transportability to the existence of a syntactic reduction using a sequence of do-calculus rules.

Lemma 1 [1]. Let D be the selection diagram characterizing $\Pi$ and $\Pi^*$, and S a set of selection variables in D. The relation $R = P^*(y \mid do(x))$ is transportable from $\Pi$ to $\Pi^*$ if the expression $P(y \mid do(x), s)$ is reducible, using the rules of do-calculus, to an expression in which S appears only as a conditioning variable in do-free terms.

The logic of this reduction is simple. Terms lacking an S variable are estimable at the source population while those lacking the do-operator are estimable non-experimentally at the target population. If such a reduction exists, the resulting expression gives the transport formula for R.
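The bookkeeping behind this logic can be sketched as a small classifier over the factors of a candidate expression. The `Term` encoding and the function names are hypothetical, and the sketch assumes the convention that a target quantity such as $P^*(z)$ is written as $P(z \mid s)$; it checks only the final form of an expression and does not search for the do-calculus reduction itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One probability factor of a reduced expression (hypothetical encoding)."""
    text: str              # human-readable form of the factor
    has_do: bool           # True if the factor contains a do-operator
    conditions_on_s: bool  # True if S appears among the conditioning variables


def data_source(term):
    """Say where a factor can be estimated, or None if nowhere."""
    if not term.conditions_on_s:
        return "source population"                    # no S-variable involved
    if not term.has_do:
        return "target population (observational)"    # S, but do-free
    return None                                       # S inside a do-term


def is_transport_formula(terms):
    """An expression qualifies only if every factor has a data source."""
    return all(data_source(t) is not None for t in terms)


# Factors of the recalibration formula, writing P*(z) as P(z | s).
reduced = [Term("P(y | do(x), z)", True, False), Term("P(z | s)", False, True)]
# The unreduced query still mixes S with a do-operator.
unreduced = [Term("P(y | do(x), s)", True, True)]

print(is_transport_formula(reduced), is_transport_formula(unreduced))
```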

Lemma 1 is declarative but not computationally effective, for it does not specify the sequence of rules leading to the needed reduction, nor does it tell us if such a sequence exists. It is useful primarily as a verification tool, to confirm the transportability of a given relation once we are in possession of a “witness” sequence.

To overcome this deficiency, Pearl and Bareinboim [1] proposed a recursive procedure (their Theorem 3), which can handle many cases, among them Figure 2, but is not “complete”, that is, diagrams exist that support transportability and which the recursive procedure fails to recognize as such. The procedure developed in this article is guaranteed to make a correct identification in all cases. We summarize our contributions as follows:

• We derive a general graphical condition for deciding transportability of causal effects. We show that transportability is feasible if and only if a certain graph structure does not appear as an edge subgraph of the inputted selection diagram.

• We provide necessary or sufficient graphical conditions for special cases of transportability, for instance, controlled direct effects (CDE).

• We construct a complete algorithm for deciding transportability of joint causal effects and returning a proper transport formula whenever those effects are transportable.

3 Preliminaries

The semantical framework in our analysis rests on structural causal models (SCM) as defined next, also called probabilistic causal models or data-generating models.

Definition 1 (Structural Causal Model [22, p. 203]). A SCM is a 4-tuple $M = \langle U, V, F, P(u) \rangle$ where:

1. U is a set of background or exogenous variables, representing factors outside the model, which nevertheless affect relationships within the model.

2. V is a set of endogenous variables, assumed to be observable. Each of these variables is functionally dependent on some subset $PA_i$ of $U \cup V$.

3. F is a set of functions $\{f_1, \ldots, f_n\}$ such that each $f_i$ determines the value of $V_i \in V$, $v_i = f_i(pa_i, u)$.

4. A joint probability distribution $P(u)$ over U.

In the structural causal framework [22, Ch. 7], actions are modifications of functional relationships, and each action $do(x)$ on a causal model M produces a new model $M_x = \langle U, V, F_x, P(u) \rangle$, where $F_x$ is obtained after replacing $f_X \in F$ for every $X \in \mathbf{X}$ with a new function that outputs a constant value x given by $do(x)$. See Appendix 1 for a gentle introduction to structural models or Ref. 23 for a more detailed discussion.
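A minimal sketch of this idea follows, under several simplifying assumptions: the class name `SCM`, its methods, and the toy binary model are hypothetical, mechanisms are stored in causal order, and each mechanism reads from a dictionary of already-computed values. The point is only that an action replaces a mechanism with a constant and leaves every other mechanism untouched.

```python
import random


class SCM:
    """Minimal structural causal model: an exogenous sampler plus mechanisms."""

    def __init__(self, sample_u, mechanisms):
        self.sample_u = sample_u      # rng -> dict of exogenous values
        self.mechanisms = mechanisms  # variable -> function, in causal order

    def do(self, **assignments):
        """Return a new model in which each named mechanism is a constant."""
        replaced = dict(self.mechanisms)  # copy; insertion order is preserved
        for var, value in assignments.items():
            if var not in replaced:
                raise KeyError(var)
            replaced[var] = lambda values, value=value: value
        return SCM(self.sample_u, replaced)

    def sample(self, rng):
        """Draw exogenous values, then evaluate the mechanisms in order."""
        values = self.sample_u(rng)
        for var, f in self.mechanisms.items():
            values[var] = f(values)
        return values


# Hypothetical binary model: X copies U, and Y is the exclusive or of X and U.
model = SCM(
    sample_u=lambda rng: {"U": rng.randint(0, 1)},
    mechanisms={"X": lambda v: v["U"], "Y": lambda v: v["X"] ^ v["U"]},
)
intervened = model.do(X=1)

rng = random.Random(0)
obs = [model.sample(rng) for _ in range(200)]
exp = [intervened.sample(rng) for _ in range(200)]
```

In the observational samples Y is always 0 because X equals U, whereas under the intervention Y tracks U, which illustrates how an action can change the behavior of a downstream variable without changing its mechanism.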

We follow the conventions given in Ref. 22. We will denote variables by capital letters and their values by small letters. Similarly, sets of variables will be denoted by bold capital letters, sets of values by bold letters. We will use the typical graph-theoretic terminology with the corresponding abbreviations $Pa(\mathbf{Y})_G$, $An(\mathbf{Y})_G$, and $De(\mathbf{Y})_G$, which will denote respectively the set of observable parents, ancestors, and descendants of the node set $\mathbf{Y}$ in G. By convention, these sets will include the arguments as well, for instance, the ancestral set $An(\mathbf{Y})_G$ will include $\mathbf{Y}$. We will usually omit the graph subscript whenever the graph in question is assumed or obvious. A graph $G_{\mathbf{Y}}$ will denote the induced subgraph of G containing the nodes in $\mathbf{Y}$ and all arrows between such nodes. Finally, $G_{\overline{\mathbf{X}}\,\underline{\mathbf{Z}}}$ stands for the edge subgraph of G where all incoming arrows into $\mathbf{X}$ and all outgoing arrows from $\mathbf{Z}$ are removed.

Key to the analysis of transportability is the notion of “identifiability,” defined below, which expresses the requirement that causal effects be computable from a combination of data P and assumptions embodied in a causal graph G.

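Definition 2 (Causal Effects Identifiability [22, p. 77]). The causal effect of an action $do(\mathbf{x})$ on a set of variables $\mathbf{Y}$ such that $\mathbf{Y} \cap \mathbf{X} = \emptyset$ is said to be identifiable from P in G if $P_{\mathbf{x}}(\mathbf{y})$ is uniquely computable from $P(\mathbf{V})$ in any model that induces G.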

Causal models and their induced graphs are normally associated with one particular domain (also called setting, study, population, environment). In the transportability case, we extend this representation to capture properties of several domains simultaneously. This is made possible if we assume that there are no structural changes between the domains, that is, all structural equations share the same set of arguments, though the functional forms of the equations may vary arbitrarily.^{5},^{6}

Definition 3 (Selection Diagram). Let $\langle M, M^* \rangle$ be a pair of SCM relative to domains $\langle \Pi, \Pi^* \rangle$, sharing a causal diagram G. $\langle M, M^* \rangle$ is said to induce a selection diagram D if D is constructed as follows:

1. Every edge in G is also an edge in D;

2. D contains an extra edge $S_i \to V_i$ whenever there might exist a discrepancy $f_i \neq f_i^*$ or $P(U_i) \neq P^*(U_i)$ between M and $M^*$.

In words, the S-variables locate the mechanisms where structural discrepancies between the two domains are suspected to take place.^{7} Alternatively, one can see a selection diagram as a carrier of invariance claims between the mechanisms of both domains – the absence of a selection node pointing to a variable represents the assumption that the mechanism responsible for assigning value to that variable is the same in the two domains.^{8}
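The two construction rules of a selection diagram translate directly into code. In the sketch below, the function names, the edge-list encoding, and the `S_` naming of selection nodes are hypothetical, and the example graph is an assumed encoding in the spirit of Figure 1(a), not a transcription of it.

```python
def selection_diagram(edges, suspected):
    """Return the edges of D: every edge of G plus S_v -> v for suspected v."""
    nodes = {n for edge in edges for n in edge}
    unknown = set(suspected) - nodes
    if unknown:
        raise ValueError(f"not in the causal diagram: {sorted(unknown)}")
    d_edges = list(edges)                                   # rule 1
    d_edges += [(f"S_{v}", v) for v in sorted(suspected)]   # rule 2
    return d_edges


def invariant_mechanisms(edges, suspected):
    """Variables with no selection node, i.e. mechanisms assumed shared."""
    nodes = {n for edge in edges for n in edge}
    return sorted(nodes - set(suspected))


# Assumed causal diagram: age Z affects treatment X and outcome Y.
g_edges = [("Z", "X"), ("Z", "Y"), ("X", "Y")]
d_edges = selection_diagram(g_edges, suspected={"Z"})
print(d_edges)
print(invariant_mechanisms(g_edges, suspected={"Z"}))
```

Listing the invariant mechanisms explicitly makes the second reading of a selection diagram visible: every variable without a selection node carries a claim that its mechanism is the same in both domains.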

Armed with a selection diagram and the concept of identifiability, transportability of causal effects (or transportability, for short) can be defined as follows:

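Definition 4 (Causal Effects Transportability). Let D be a selection diagram relative to domains $\langle \Pi, \Pi^* \rangle$. Let $\langle P, I \rangle$ be the pair of observational and interventional distributions of $\Pi$, and $P^*$ be the observational distribution of $\Pi^*$. The causal effect $R = P^*_{\mathbf{x}}(\mathbf{y})$ is said to be transportable from $\Pi$ to $\Pi^*$ in D if $P^*_{\mathbf{x}}(\mathbf{y})$ is uniquely computable from $P, P^*, I$ in any model that induces D.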

In some broad sense, one can view transportability as a special case of identifiability, where the pair of structures $\langle M, M^* \rangle$ constitutes a global model, and the task is to infer a property of one population from the sum total of the information available (i.e., $P, P^*, I$). However, the unique challenges of dealing with two diverse environments under two different experimental regimes, and the special problems that emerge from this combination, can benefit appreciably from viewing transportability as a distinct major extension of identifiability. To illustrate, all identifiable causal relations in $\Pi^*$ are also transportable, because they can be computed directly from $P^*$ and require no experimental information from $\Pi$. This observation engenders the following definition of trivial transportability.

Definition 5 (Trivial Transportability). A causal relation R is said to be trivially transportable from $\Pi$ to $\Pi^*$, if $R(\Pi^*)$ is identifiable from $(G^*, P^*)$.

The following observation establishes another connection between identifiability and transportability. For a given causal diagram G, one can produce a selection diagram D such that identifiability in G is equivalent to transportability in D. First set $D = G$, and then add selection nodes pointing to all variables in D, which represents that the target domain does not share any commonality with its pair – this is equivalent to the problem of identifiability because the only way to achieve transportability is to identify R from scratch in the target domain.

Another special case of transportability occurs when a causal relation has identical form in both domains – no recalibration is needed. This is captured by the following definition.

Definition 6 (Direct Transportability). A causal relation R is said to be directly transportable from $\Pi$ to $\Pi^*$, if $R(\Pi^*) = R(\Pi)$.

A graphical test for direct transportability of $R = P^*_x(y \mid z)$ follows from do-calculus and reads: $(S \perp\!\!\!\perp Y \mid X, Z)_{G_{\overline{X}}}$; in words, X blocks all paths from S to Y once we remove all arrows pointing to X and condition on Z. As a concrete example, the z-specific effect $P_x(y \mid z)$ in Figure 1(a) is the same in both domains; hence, it is directly transportable. Also, the effect $P_x(y)$ in Figure 1(b) is the same in both domains; hence, it is directly transportable.
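The separation test can be sketched with a standard d-separation check (ancestral graph, moralization, then reachability) applied after deleting the arrows into X. The function names and both edge lists are assumptions made for illustration: the first graph encodes age Z with a selection node, the second encodes language skills Z as a proxy for an unmeasured age variable A, which is written here as an explicit node.

```python
from collections import deque


def ancestors(edges, nodes):
    """Return the given nodes together with all of their ancestors."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(edges, xs, ys, zs):
    """Moralization test: are xs and ys d-separated given zs in the DAG?"""
    keep = ancestors(edges, set(xs) | set(ys) | set(zs))
    adj = {n: set() for n in keep}
    parents = {}
    for a, b in edges:
        if a in keep and b in keep:
            adj[a].add(b)
            adj[b].add(a)
            parents.setdefault(b, []).append(a)
    for ps in parents.values():            # marry parents of a common child
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    frontier = deque(x for x in xs if x not in zs)
    seen = set(frontier)
    while frontier:                        # search while avoiding zs
        n = frontier.popleft()
        if n in ys:
            return False
        for m in adj[n] - seen - set(zs):
            seen.add(m)
            frontier.append(m)
    return True


def directly_transportable(edges, s_nodes, y, x, z=()):
    """Check (S independent of Y given X, Z) after removing arrows into X."""
    mutilated = [(a, b) for a, b in edges if b not in x]
    return d_separated(mutilated, set(s_nodes), {y}, set(x) | set(z))


fig_a = [("S", "Z"), ("Z", "X"), ("Z", "Y"), ("X", "Y")]
fig_b = [("S", "Z"), ("A", "Z"), ("A", "X"), ("A", "Y"), ("X", "Y")]
print(directly_transportable(fig_a, {"S"}, "Y", {"X"}, {"Z"}))
print(directly_transportable(fig_b, {"S"}, "Y", {"X"}))
```

Under these assumed encodings, the z-specific effect passes the test in the first graph and the unconditional effect passes it in the second, while conditioning on the proxy Z in the second graph opens a path through the unmeasured age variable.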

These two cases will act as a basis to decompose the problem of transportability into smaller and more manageable subproblems. For instance, let us estimate the effect in the bio-marker example depicted in Figure 1(c).

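$$
\begin{aligned}
P^*_x(y) &= \sum_z P^*_x(y \mid z)\, P^*_x(z) && [4] \\
&= \sum_z P^*_x(y \mid z)\, P^*(z \mid x) && [5] \\
&= \sum_z P_x(y \mid z)\, P^*(z \mid x) && [6]
\end{aligned}
$$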

In eq. [4], the target relation R is conditioned on Z. The effect $P^*_x(z)$ in eq. [5] is trivially transportable since it is identifiable in $\Pi^*$, and $P^*_x(y \mid z)$ in eq. [6] is directly transportable since $(S \perp\!\!\!\perp Y \mid X, Z)_{D_{\overline{X}}}$.

Now we turn our attention to conditions that preclude transportability. The following lemma provides an auxiliary tool to prove non-transportability and is based on refuting the uniqueness property required by Definition 4.

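Lemma 2. Let $\mathbf{X}, \mathbf{Y}$ be two sets of disjoint variables, in population $\Pi$ and $\Pi^*$, and let D be the selection diagram. $P^*_{\mathbf{x}}(\mathbf{y})$ is not transportable from $\Pi$ to $\Pi^*$ if there exist two causal models $M_1$ and $M_2$ compatible with D such that $P_1(\mathbf{V}) = P_2(\mathbf{V})$, $P^*_1(\mathbf{V}) = P^*_2(\mathbf{V})$, $P_1(\mathbf{V} \setminus \mathbf{Z} \mid do(\mathbf{Z})) = P_2(\mathbf{V} \setminus \mathbf{Z} \mid do(\mathbf{Z}))$ for any set $\mathbf{Z}$, all families have positive distribution, and $P^*_1(\mathbf{Y} \mid do(\mathbf{X})) \neq P^*_2(\mathbf{Y} \mid do(\mathbf{X}))$.

Proof. Let I be the set of interventional distributions $P(\mathbf{V} \setminus \mathbf{Z} \mid do(\mathbf{Z}))$, for any set $\mathbf{Z}$. The latter inequality rules out the existence of a function from $P, P^*, I$ to $P^*_{\mathbf{x}}(\mathbf{y})$. ■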

While the problems of identifiability and transportability are related, Lemma 2 indicates that proofs of non-transportability are more involved than those of non-identifiability. Indeed, to prove non-transportability requires the construction of two models agreeing on $\langle P, I, P^* \rangle$, while non-identifiability requires the two models to agree solely on the observational distribution P.

The simplest non-transportable structure is an extension of the famous “bow arc” graph named here “s-bow arc,” see Figure 3(a). The s-bow arc has two endogenous nodes: X, and its child Y, sharing a hidden exogenous parent U, and a S-node pointing to Y. This and similar structures that prevent transportability will be useful in our proof of completeness, which requires a demonstration that whenever the algorithm fails to transport a causal relation, the relation is indeed non-transportable.

Theorem 1. $P^*_x(y)$ is not transportable in the s-bow arc graph.

Proof. The proof will show a counterexample to the transportability of $P^*_x(y)$ through two models $M_1$ and $M_2$ that agree in $\langle P, I, P^* \rangle$ and disagree in $P^*_x(y)$.

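Assume that all variables are binary. Let the model $M_1$ be defined by the following system of structural equations: $X_1 = U$, $Y_1 = ((X \oplus U) \oplus S)$, and $M_2$ by the following one: $X_2 = U$, $Y_2 = ((X \oplus U) \lor S)$, where $\oplus$ represents the exclusive or function.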

Lemma 3. The two models agree in the distributions .

Proof. We show that the following equations must hold for and :

for all values of . The equality between is obvious since and X has the same structural form in both models. Second, let us construct the truth table for Y:

X    S    U    Y_{1}    Y_{2}
0    0    0    0        0
0    0    1    1        1
0    1    0    1        1
0    1    1    0        1
1    0    0    1        1
1    0    1    0        0
1    1    0    0        1
1    1    1    1        1

To show that the equality between holds, we rewrite it as follows:

[7]

In eq. [7], the expressions for are functions of the tuples , which evaluate to the same value in both models. Similarly, the expressions for are functions of the tuples , which also evaluate to the same value in both models.

We further assert the equality between the interventional distributions in , which can be written using the do-calculus as

[8]

Evaluating this expression points to the tuples and , which map to the same value in both models. ■

Figure 3

(a) Smallest selection diagram in which $P^*_x(y)$ is not transportable (s-bow graph). (b) A selection diagram in which, even though there is no S-node pointing to Y, the effect of X on Y is still not transportable due to the presence of a sC-tree (see Corollary 2).

Lemma 4. There exist values ofsuch that .

Proof. Fix , and let us rewrite the desired quantity in as

[9]

Since is a function of the tuples , it evaluates in to and in to .

Hence, together with the uniformity of , it follows that and , which finishes the proof. ■

By Lemma 2, Lemmas 3 and 4 prove Theorem 1. ■
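The counterexample can be checked by brute-force enumeration. The sketch below reads the two mechanisms for Y off the truth table in the proof of Lemma 3 (Y_{1} is the exclusive or of X, U, and S; Y_{2} is the exclusive or of X and U, or-ed with S). It further assumes that S = 0 indexes the source domain, that S = 1 indexes the target domain, and that U is uniform; the function names are hypothetical, and only interventions on X are enumerated.

```python
from fractions import Fraction

HALF = Fraction(1, 2)  # assumed: U is a fair coin


def y1(x, u, s):
    return x ^ u ^ s        # mechanism for Y in the first model


def y2(x, u, s):
    return (x ^ u) | s      # mechanism for Y in the second model


def observational(f_y, s):
    """Joint distribution of (x, y) in domain s, where X copies U."""
    dist = {}
    for u in (0, 1):
        key = (u, f_y(u, u, s))
        dist[key] = dist.get(key, Fraction(0)) + HALF
    return dist


def interventional(f_y, s, x):
    """Distribution of y under do(X = x) in domain s."""
    dist = {0: Fraction(0), 1: Fraction(0)}
    for u in (0, 1):
        dist[f_y(x, u, s)] += HALF
    return dist


same_p = observational(y1, 0) == observational(y2, 0)
same_p_star = observational(y1, 1) == observational(y2, 1)
same_i = all(interventional(y1, 0, x) == interventional(y2, 0, x) for x in (0, 1))
target_1 = interventional(y1, 1, 1)
target_2 = interventional(y2, 1, 1)
print(same_p, same_p_star, same_i, target_1[1], target_2[1])
```

The two models are indistinguishable from everything that can be measured (both observational distributions and the source experiment), yet they give different answers for the target effect of setting X to 1, which is the pattern Lemma 2 requires.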

4 Characterizing transportable relations

The concept of confounded components (or C-components) was introduced in [26] to represent clusters of variables connected through bidirected edges and was instrumental in establishing a number of conditions for ordinary identification (Def. 2). If G is not a C-component itself, it can be uniquely partitioned into a set of C-components. We now recast C-components in the context of transportability.^{9}
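The partition into C-components amounts to finding connected components over the bidirected arcs alone. The sketch below is a generic illustration with a hypothetical function name and a hypothetical three-node example; it ignores directed edges entirely, since they play no role in the partition.

```python
def c_components(nodes, bidirected):
    """Partition nodes into clusters connected through bidirected arcs."""
    adj = {n: set() for n in nodes}
    for a, b in bidirected:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in nodes:                 # iterate in the given node order
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                    # depth-first search from start
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adj[n] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical graph: X and Y share a latent parent; Z is unconfounded.
parts = c_components(["X", "Z", "Y"], [("X", "Y")])
print([sorted(p) for p in parts])
```

A graph is a single C-component exactly when this function returns one cluster containing every node.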

Definition 7 (sC-component). Let G be a selection diagram such that a subset of its bidirected arcs forms a spanning tree over all vertices in G. Then G is a sC-component (selection confounded component).

A special subset of C-components that embraces the ancestral set of Y was noted by Shpitser and Pearl [27] to play an important role in deciding identifiability – this observation can also be applied to transportability, as formulated in the next definition.

Definition 8 (sC-tree). Let G be a selection diagram such that all observable nodes have at most one child, there is a node Y which is a descendant of all nodes, and there is a selection node pointing to Y. Then G is called a Y-rooted sC-tree (selection confounded tree).

The presence of this structure (and generalizations) will prove to be an obstacle to transportability of causal effects. For instance, the s-bow arc in Figure 3(a) is a Y-rooted sC-tree, in which we know that $P^*_x(y)$ is not transportable.

In certain classes of problems, the absence of such structures will prove sufficient for transportability. One such class is explored below and consists of models in which the set X coincides with the parents of Y.

Theorem 2. Let G be a selection diagram. Then for any node Y, the causal effect $P^*_{Pa(Y)}(y)$ is transportable if there is no subgraph of G which forms a Y-rooted sC-tree.

Theorem 2 provides a tractable transportability condition for the CDE – a key concept in modern mediation analysis, which permits the decomposition of effects into their direct and indirect components [35, 36]. CDE is defined as the effect of X on Y when all other parents of Y (acting as mediators) are held constant, and it is identifiable if and only if $P_{Pa(Y)}(y)$ is identifiable (Pearl, 2009, p. 128).

The selection diagram in Figure 1(a) does not contain any Y-rooted sC-trees as subgraphs and therefore the direct effect (causal effects of Y’s parents on Y) is indeed transportable. In fact, the transportability of CDE can be determined by a more visible criterion:

Corollary 1. Let G be a selection diagram. Then for any node Y, the direct effect $P^*_{Pa(Y)}(y)$ is transportable if there is no S node pointing to Y.

The next corollary demonstrates that sC-trees are obstacles to the transportability of $P^*_x(y)$ even when they do not involve Y, i.e., transportability is not a local problem – if there exists a node W that is an ancestor of Y but not necessarily “near” it, transportability is still prohibited (see Figure 3(b)). This fact anticipates that transporting causal effects for singletons is not necessarily easier than the general problem of transportability.

Corollary 2. Let G be a selection diagram, and $\mathbf{X}$ and $\mathbf{Y}$ a set of variables. If there exists a node W that is an ancestor of some node $Y \in \mathbf{Y}$ such that there exists a W-rooted sC-tree which contains any variables in $\mathbf{X}$, then $P^*_{\mathbf{x}}(\mathbf{y})$ is not transportable.

We now generalize the definition of sC-trees (and Theorem 3) in two ways: first, Y is augmented to represent a set of variables; second, S-nodes can point to any variable within the sC-component, not necessarily to root nodes. For instance, consider the graph G in Figure 4. Note that there is no Y-rooted sC-tree nor W-rooted sC-tree in G (where W is an ancestor of Y), and so the previous results cannot be applied even though the effect of X on Y is not transportable in G – still, there exists a Y-rooted sC-forest in G, which will prevent the transportability of the causal effect.

Definition 9 (sC-forest). Let G be a selection diagram, where $\mathbf{Y}$ is the maximal root set. Then G is a $\mathbf{Y}$-rooted sC-forest if G is a sC-component, all observable nodes have at most one child, and there is a selection node pointing to some vertex of G (not necessarily in $\mathbf{Y}$).

Figure 4

Example of a selection diagram in which $P^*_x(y)$ is not transportable; there is no sC-tree, but there is a sC-forest.

Building on [27], we introduce a structure that witnesses non-transportability characterized by a pair of sC-forests. Transportability will be shown impossible whenever such structure exists as an edge subgraph of the given selection diagram.

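Definition 10 (s-hedge). Let $\mathbf{X}, \mathbf{Y}$ be sets of variables in G. Let $F, F'$ be $\mathbf{R}$-rooted sC-forests such that $F \cap \mathbf{X} \neq \emptyset$, $F' \cap \mathbf{X} = \emptyset$, $F' \subseteq F$, $\mathbf{R} \subset An(\mathbf{Y})_{G_{\overline{\mathbf{X}}}}$. Then F and $F'$ form a s-hedge for $P^*_{\mathbf{x}}(\mathbf{y})$ in G.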

For instance, in Figure 4, the sC-forests $F'$ and $F$ form a s-hedge for $P^*_x(y)$.^{10} The idea here is similar to the hedge, and we can see a s-hedge as the growth of a sC-forest $F'$, which does not intersect $\mathbf{X}$, into a larger sC-forest F that does intersect $\mathbf{X}$.

We state below the formal connection between s-hedges and non-transportability.

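Theorem 4. Assume there exist $F, F'$ that form a s-hedge for $P^*_{\mathbf{x}}(\mathbf{y})$ in $\Pi$ and $\Pi^*$. Then $P^*_{\mathbf{x}}(\mathbf{y})$ is not transportable from $\Pi$ to $\Pi^*$.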

To prove that s-hedges characterize non-transportability in selection diagrams, we construct in the next section an algorithm which transports any causal effect that does not contain a s-hedge.

5 A complete algorithm for transportability of joint effects

The algorithm proposed to solve transportability is called sID (see Figure 5) and extends previous analysis and algorithms of identifiability given in [13, 26, 27, 32, 34]. We choose to start with the version provided by Shpitser (called ID) since the hedge structure is explicitly employed, which will prove instrumental in establishing completeness. We build on two observations developed throughout the article:

Figure 5

Modified version of identification algorithm capable of recognizing transportable relations.

1. Transportability: Causal relations can be partitioned into trivially and directly transportable ones.

2. Non-transportability: The existence of a s-hedge as an edge subgraph of the inputted selection diagram can be used to prove non-transportability.

The algorithm sID first applies the typical c-component decomposition on top of the inputted selection diagram D (which, by definition, is also a causal diagram of ), partitioning the original problem into smaller blocks (call these blocks sc-factors) until either the entire expression is transportable or it runs into the problematic s-hedge structure.

More specifically, for each sc-factor Q, sID tries to directly transport Q. If it fails, sID tries to trivially transport Q, which is equivalent to solving an ordinary identification problem. sID alternates between these two types of transportability, and whenever it exhausts the possibility of applying these operations, it exits with failure, returning a counterexample for transportability – that is, the graph local to the faulty call witnesses the non-transportability of the causal query since it contains a s-hedge as an edge subgraph.

Before showing the more formal properties of sID, we demonstrate how sID works through the transportability of in the graph in Figure 2.

Since and , where , , and , we invoke line 4 and try to transport respectively , , and . Thus the original problem reduces to transporting .

Evaluating the first expression, sID triggers line 2, noting that nodes that are not ancestors of Z can be ignored. This implies that with induced subgraph , where stands for the hidden variable between X and Z. sID goes to line 5, in which in the local call . In the sequel, sID goes to line 9 since contains only one sC-component. Note that in the ordinary identifiability problem the procedure would fail at this point, but sID proceeds to line 10 testing whether . The test comes true, which makes sID directly transport with data from the experimental population , i.e., .

Evaluating the second expression, sID again triggers line 2, which implies that with induced subgraph . sID goes to line 5, in which in the local call . Thus it proceeds to line 6 testing whether there are more than one sC-components. The test comes true (since ), which makes sID to trivially transport with observational data from , i.e., .

Evaluating the third expression, sID goes to line 5 in which , where . It proceeds to line 6 testing whether there is more than one component, which is true in this case. It reaches line 8, in which . Thus it tries to transport over the induced graph , which stands for ordinary identification, and yields (after trivial simplifications) . The return of these calls composed coincide with the expression provided in the first section.

We prove next soundness and completeness of sID.

Theorem 5 (soundness). Whenever sID returns an expression for $P^*_{\mathbf{x}}(\mathbf{y})$, it is correct.

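Theorem 6. Assume sID fails to transport $P^*_{\mathbf{x}}(\mathbf{y})$ (executes line 11). Then there exist $\mathbf{X}' \subseteq \mathbf{X}$, $\mathbf{Y}' \subseteq \mathbf{Y}$, such that the graph pair $D, C_0$ returned by the fail condition of sID contains as edge subgraphs sC-forests F, $F'$ that form a s-hedge for $P^*_{\mathbf{x}'}(\mathbf{y}')$.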

Theorem 7. The rules of do-calculus, together with standard probability manipulations, are complete for establishing transportability of all effects of the form .

Many problems in statistics and causal inference can be framed as problems of generalizability, though inherently different from that of transportability.

Consider, for example, classical statistical inference, which can be viewed as a generalization from properties of a random sample of a population to properties of the population itself. Two centuries of statistical analysis have rendered this task well understood and fairly complete.

Next consider the problem of causal inference, that is, estimating causal effects from observational studies (given a set of causal assumptions). This class of problems can be viewed as a generalization from a population under an observational regime to a population under an experimental regime. Since the imposition of an experimental regime (e.g., forcing individuals to receive treatment) induces a behavioral change in the population, the problem can be viewed as generalization between two diverse populations. Fortunately, the disparities between the two populations are local (assuming atomic interventions), involving only the treatment assignment mechanism, and so, with the help of model assumptions, a complete solution to the problem can be obtained (using do-calculus). We can decide algorithmically whether the assumptions at hand are sufficient for estimating a given causal effect and, if the answer is affirmative, we can derive its estimand.

An important variant in causal inference is the task of estimating causal effects from surrogate experiments, namely, experiments in which a surrogate set of variables Z are manipulated, rather than the one (X) whose effect we seek to estimate.^{11} This variant too can be viewed as an exercise in generalization, this time from a population under regime $do(Z = z)$ to that same population under regime $do(X = x)$. A complete solution to this problem is reported in Ref. 37.

Another challenge of generalizability flavor arises, in both observational and experimental studies, when samples are not randomly drawn from the population of interest , but are selected preferentially, depending on the values taken by a set of variables. This problem, known as “selection bias” (or “sampling selection bias”), has received due attention in epidemiology, statistics, and economics [38–41] and can be viewed as a generalization from the sampled population to the population at large, when little is known about their relationships save for qualitative assumptions about the selection mechanism. Graphical models were used to improve the understanding of the problem [42–45] and gave rise to several conditions for recovering from selection bias when the probability of selection is available.

Likewise, Refs. 21, 46, 47 tackle variants of the sample selection problem assuming that certain relationships are invariant between the two groups (i.e., sample and population). The former assumed knowledge of the probability of selection in each principal stratum, while the latter exploited (using propensity score analysis) the availability of the probability of selection in each combination of covariates.

More recently, Didelez et al. [49] studied conditions for recovering from selection bias when no quantitative knowledge is available about selection probabilities. Bareinboim and Pearl [50] extended these conditions and provided a complete characterization, together with an algorithm, for deciding when a bias-free estimate of the odds ratio (OR) can be recovered from selection-biased data. They also developed methods using instrumental variables that recover other effect measures when information about the target population is available for some variables (see also Ref. 51).

The problem of transportability is fundamentally different from the other problems of generalizability discussed above. Transportability deals with two distinct populations that are different both in their inherent characteristics (encoded by the S variables) and the regimes under which they are studied (i.e., experimental vs. observational).

Hernán and VanderWeele [52] addressed a problem related to transportability in the context of “compound treatments,” namely, treatments that can be implemented in multiple versions (e.g., “exercise at least 15 minutes a day”). Transportability arises when we wish to predict the response of a population that implements one version of the treatment from a study on another population, in which another version is implemented. Petersen [53] showed that this problem is a variant of the general problem treated in Ref. 1, to which this article provides an algorithmic solution.

Finally, it is important to mention two recent extensions of the results reported in this article. Bareinboim and Pearl [2] have addressed the problem of transportability in cases where only a limited set of experiments can be conducted at the source environment. Subsequently, the results were generalized to the problem of “meta-transportability,” that is, pooling experimental results from multiple and disparate sources to synthesize a consistent estimate of a causal relation at yet another environment, potentially different from each of the sources [3].

7 Conclusions

Informal discussions concerning the difficulties of generalizing experimental results across populations have been going on for almost half a century [4, 5, 54–56] and appear to accompany every textbook in experimental design. By and large, these discussions have led to the obvious conclusions that researchers should be extremely cautious about unwarranted generalization, that many threats may await the unwary, and that extrapolation across studies requires “some understanding of the reasons for the differences” [54, p. 11].

The formalization offered in this article embeds this discussion in a precise mathematical language and provides researchers with theoretical guarantees that, if certain conditions can be ascertained, generalization across populations can be accomplished, protected from the threats and dangers that the informal literature has accumulated.

Given judgmental assessments of how target populations may differ from those under study, the article offers a formal representational language for making these assessments precise (Definition 3) and, subsequently, deciding whether, and how, causal relations in the target population can be inferred from those obtained in experimental studies. Corollary 4 in this article provides a complete (necessary and sufficient) graphical condition for deciding this question and, whenever satisfied, we further provide an algorithm for computing the correct transport formula (Figure 5). The transport formula specifies the proper way of modifying the experimental results so as to account for differences in the populations. These transport formulae enable the investigator to select the essential measurements in both the experimental and observational studies and combine them into a bias-free estimand of the target quantity.

While the results of this article concern the transfer of causal information from experimental to observational studies, the method can also be of benefit in transporting statistical findings from one observational study to another [57]. The rationale for such transfer is twofold. First, information from the first study may enable researchers to avoid repeated measurement of certain variables in the target population. Second, by pooling data from both populations, we increase the precision with which their commonalities are estimated and, indirectly, also increase the precision with which the target relationship is transported. Substantial reduction in sampling variability can thus be achieved through this decomposition [58].

Of course, our analysis is based on the assumption that the analyst is in possession of sufficient background knowledge to determine, at least qualitatively, where two populations may differ from one another. In practice, such knowledge may only be partially available. Still, as in every mathematical exercise, the benefit of the analysis lies primarily in understanding what must be assumed about reality for generalization to be valid, what knowledge is needed for a given task to succeed, and how sensitive conclusions are to knowledge that we do not possess.

Acknowledgment

A preliminary version of this article was presented at the 26th AAAI Conference, Toronto, CA, July, 2012 [59]. We appreciate the insightful comments provided by two anonymous referees. This article benefited from discussions with Onyebuchi Arah, Stuart Baker, Susan Ellenberg, Eleazar Eskin, Constantine Frangakis, Sander Greenland, David Heckerman, James Heckman, Michael Hoefler, Marshall Joffe, Rosa Matzkin, Geert Molengergh, William Shadish, Ian Shrier, Dylan Small, Corwin Zigler, and Song-Chun Zhu.

This research was supported in part by grants from NSF #IIS-1249822 and ONR #N00014-13-1-0153 and #N00014-10-1-0933.

Appendix 1: causal assumptions in nonparametric models

The tools presented in this article were developed in the framework of nonparametric SCM, which subsumes and unifies many approaches to causal inference.^{12}

A SCM M conveys a set of assumptions about how the world operates. This contrasts with the statistical tradition in which a model is defined as a set of distributions (see footnote 15). A causal model is better viewed as a set of assumptions about Nature, with the understanding that each assumption (i.e., that the set of arguments of does not include variable ) constrains the set of distributions (like ) that the model can generate.

The formal structure of SCMs was defined in Section 3; here we illustrate their power as inference engines. Consider a simple SCM model depicted in Figure 6(a), which represents the following three functions:

$$z = f_Z(u_Z), \qquad x = f_X(z, u_X), \qquad y = f_Y(x, u_Y)$$ [10]

where in this particular example, $U_Z$, $U_X$, and $U_Y$ are assumed to be jointly independent but otherwise arbitrarily distributed. Each of these functions represents a causal process (or mechanism) that determines the value of the left variable (output) from the values on the right variables (inputs) and is assumed to be invariant unless explicitly intervened on. The absence of a variable from the right-hand side of an equation encodes the assumption that nature ignores that variable in the process of determining the value of the output variable. For example, the absence of variable Z from the arguments of $f_Y$ conveys the empirical claim that variations in Z will leave Y unchanged, as long as variables $U_Y$ and X remain constant.

Figure 6

The diagrams associated with (a) the structural model of eq. [10] and (b) the modified model of eq. [11], representing the intervention $do(X = x_0)$.

Representing Interventions, counterfactuals, and causal effects

This feature of invariance permits us to derive powerful claims about causal effects and counterfactuals, even in nonparametric models, where all functions and distributions remain unknown. This is done through a mathematical operator called $do(x)$, which simulates physical interventions by deleting certain functions from the model, replacing them with a constant $X = x$, while keeping the rest of the model unchanged [61–63]. For example, to emulate an intervention $do(x_0)$ that holds X constant (at $X = x_0$) in model M of Figure 6(a), we replace the equation for x in eq. [10] with $x = x_0$, and obtain a new model, $M_{x_0}$,

$$z = f_Z(u_Z), \qquad x = x_0, \qquad y = f_Y(x, u_Y)$$ [11]

the graphical description of which is shown in Figure 6(b).

The joint distribution associated with the modified model, denoted $P(z, y \mid do(x_0))$, describes the postintervention distribution of variables Y and Z (also called “controlled” or “experimental” distribution), to be distinguished from the preintervention distribution, $P(x, y, z)$, associated with the original model of eq. [10]. For example, if X represents a treatment variable, Y a response variable, and Z some covariate that affects the amount of treatment received, then the distribution $P(Z = z, Y = y \mid do(X = x_0))$ gives the proportion of individuals that would attain response level $Y = y$ and covariate level $Z = z$ under the hypothetical situation in which treatment $X = x_0$ is administered uniformly to the population.^{13}

In general, we can formally define the postintervention distribution by the equation

$$P_M(y \mid do(x)) = P_{M_x}(y)$$ [12]

In words, in the framework of model M, the postintervention distribution of outcome Y is defined as the probability that model $M_x$ assigns to each outcome level $Y = y$. From this distribution, which is readily computed from any fully specified model M, we are able to assess treatment efficacy by comparing aspects of this distribution at different levels of $x_0$.^{14}
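The definition can be made concrete on a small, fully specified model. The sketch below is illustrative only: it assumes a chain in which Z affects X and X affects Y, each with its own independent disturbance, and the mechanisms `f_z`, `f_x`, `f_y` and the disturbance ranges are hypothetical choices, not taken from the article. Intervening replaces the mechanism for X by a constant and leaves everything else untouched.

```python
from fractions import Fraction
from itertools import product

# Hypothetical mechanisms, chosen only for illustration.
def f_z(u_z):
    return u_z

def f_x(z, u_x):
    # P(X=1 | Z=1) = 4/5, P(X=1 | Z=0) = 1/5
    return int(u_x < (4 if z else 1))

def f_y(x, u_y):
    # P(Y=1 | X=1) = 7/10, P(Y=1 | X=0) = 3/10
    return int(u_y < (7 if x else 3))

# Independent, uniformly distributed disturbances (assumed).
U_Z, U_X, U_Y = range(2), range(5), range(10)

def distribution(do_x=None):
    """Exact joint distribution over (z, x, y).

    When do_x is given, the mechanism f_x is replaced by the constant do_x;
    the mechanisms for Z and Y are left unchanged.
    """
    dist = {}
    weight = Fraction(1, len(U_Z) * len(U_X) * len(U_Y))
    for u_z, u_x, u_y in product(U_Z, U_X, U_Y):
        z = f_z(u_z)
        x = f_x(z, u_x) if do_x is None else do_x
        y = f_y(x, u_y)
        dist[(z, x, y)] = dist.get((z, x, y), 0) + weight
    return dist

def prob(dist, event):
    """Probability of the set of (z, x, y) triples selected by `event`."""
    return sum(p for key, p in dist.items() if event(*key))

pre = distribution()           # preintervention model
post = distribution(do_x=1)    # modified model under do(X = 1)
print(prob(post, lambda z, x, y: y == 1))
```

In this particular model the covariate Z keeps its original distribution under the intervention, while X is fixed at the chosen level for every unit.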

Identification, d-separation and causal calculus

A central question in causal analysis is the question of identification in partially specified models: Given assumptions set A (as embodied in the model), can the controlled (postintervention) distribution, $P(Y = y \mid do(X = x))$, be estimated from data governed by the preintervention distribution $P(z, x, y)$?

In linear parametric settings, the question of identification reduces to asking whether some model parameter, $\beta$, has a unique solution in terms of the parameters of P (say the population covariance matrix). In the nonparametric formulation, the notion of “has a unique solution” does not directly apply since quantities such as $Q = P(y \mid do(x))$ have no parametric signature and are defined procedurally by simulating an intervention in a causal model M, as in eq. [11]. The following definition captures the requirement that Q be estimable from the data:

Definition 11 (Identifiability).^{15} A causal query Q is identifiable, given a set of assumptions A, if for any two (fully specified) models $M_1$ and $M_2$ that satisfy A, we have

$$P(M_1) = P(M_2) \Rightarrow Q(M_1) = Q(M_2)$$ [13]

In words, the functional details of $M_1$ and $M_2$ do not matter; what matters is that the assumptions in A (e.g., those encoded in the diagram) would constrain the variability of those details in such a way that equality of P’s would entail equality of Q’s. When this happens, Q depends on P only, and should therefore be expressible in terms of the parameters of P.

When a query Q is given in the form of a do-expression, for example $Q = P(y \mid do(x))$, its identifiability can be decided systematically using an algebraic procedure known as the do-calculus [13]. It consists of three inference rules that permit us to map interventional and observational distributions whenever certain conditions hold in the causal diagram G.

The conditions that permit the application of these inference rules can be read off the diagrams using a graphical criterion known as d-separation [65].

Definition 12 (d-separation). A set S of nodes is said to block a path p if either

1. p contains at least one arrow-emitting node that is in S, or

2. p contains at least one collision node that is outside S and has no descendant in S.

If S blocks all paths from set X to set Y, it is said to “d-separate X and Y,” and then, it can be shown that variables X and Y are independent given S, written $X \perp\!\!\!\perp Y \mid S$.^{16}

D-separation reflects conditional independencies that hold in any distribution that is compatible with the causal assumptions A embedded in the diagram. To illustrate, the path $U_Z \to Z \to X \to Y$ in Figure 6(a) is blocked by $S = \{Z\}$ and by $S = \{X\}$, since each emits an arrow along that path. Consequently we can infer that the conditional independencies $U_Z \perp\!\!\!\perp Y \mid Z$ and $U_Z \perp\!\!\!\perp Y \mid X$ will be satisfied in any probability function that this model can generate, regardless of how we parametrize the arrows. Likewise, the path $U_Z \to Z \to X \leftarrow U_X$ is blocked by the null set $\{\emptyset\}$, but it is not blocked by $S = \{Y\}$ since Y is a descendant of the collision node X. Consequently, the marginal independence $U_Z \perp\!\!\!\perp U_X$ will hold in the distribution, but $U_Z \perp\!\!\!\perp U_X \mid Y$ may or may not hold.^{17}
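These readings can be checked mechanically. The sketch below is one possible implementation, not the article's: it tests d-separation with the standard moralization criterion (restrict to the ancestors of the sets involved, connect parents that share a child, drop edge directions, delete the conditioning set, and test connectivity), which is equivalent to the path-blocking definition above. The graph encoded in `FIG_6A` is an assumption, namely our reading of Figure 6(a) as the chain Z → X → Y with a separate disturbance pointing into each variable.

```python
def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, xs, ys, s):
    """True if the set s d-separates xs from ys in the DAG `parents`."""
    xs, ys, s = set(xs), set(ys), set(s)
    keep = ancestors(parents, xs | ys | s)
    # Build the moral graph of the ancestral subgraph.
    adj = {v: set() for v in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:                      # undirected version of p -> child
            adj[p].add(child)
            adj[child].add(p)
        for i, p in enumerate(ps):        # "marry" parents of a common child
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # Search from xs while never entering the conditioning set.
    frontier = [x for x in xs if x not in s]
    seen = set(frontier)
    while frontier:
        v = frontier.pop()
        if v in ys:
            return False
        for w in adj[v]:
            if w not in seen and w not in s:
                seen.add(w)
                frontier.append(w)
    return True

# Assumed encoding of Figure 6(a): child -> list of parents.
FIG_6A = {"Z": ["U_Z"], "X": ["Z", "U_X"], "Y": ["X", "U_Y"]}

print(d_separated(FIG_6A, {"U_Z"}, {"Y"}, {"Z"}))    # blocked by Z
print(d_separated(FIG_6A, {"U_Z"}, {"U_X"}, set()))  # collider X blocks the path
print(d_separated(FIG_6A, {"U_Z"}, {"U_X"}, {"Y"}))  # conditioning on Y opens it
```

Conditioning on Y, a descendant of the collision node X, is what turns the last query from separated to connected.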

The rules of do-calculus

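Let X, Y, Z, and W be arbitrary disjoint sets of nodes in a causal DAG G. We denote by $G_{\overline{X}}$ the graph obtained by deleting from G all arrows pointing to nodes in X. Likewise, we denote by $G_{\underline{X}}$ the graph obtained by deleting from G all arrows emerging from nodes in X. To represent the deletion of both incoming and outgoing arrows, we use the notation $G_{\overline{X}\underline{Z}}$.

The following three rules are valid for every interventional distribution compatible with G.

Rule 1 (Insertion/deletion of observations):

$$P(y \mid do(x), z, w) = P(y \mid do(x), w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}}}$$ [14]

Rule 2 (Action/observation exchange):

$$P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\underline{Z}}}$$ [15]

Rule 3 (Insertion/deletion of actions):

$$P(y \mid do(x), do(z), w) = P(y \mid do(x), w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\,\overline{Z(W)}}}$$ [16]

where $Z(W)$ is the set of Z-nodes that are not ancestors of any W-node in $G_{\overline{X}}$.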

To establish identifiability of a query Q, one needs to repeatedly apply the rules of do-calculus to Q, until the final expression no longer contains a do-operator^{18}; this renders it estimable from non-experimental data. The do-calculus was proven to be complete for the identifiability of causal effects in the form $Q = P(y \mid do(x), z)$ [69, 70], which means that if Q cannot be expressed in terms of the probability of observables P by repeated application of these three rules, such an expression does not exist.
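As a worked illustration, assume the chain diagram Z → X → Y in which each variable has its own independent disturbance (our reading of Figure 6(a)), and write $G_{\underline{X}}$ for G with all arrows emerging from X deleted and $G_{\overline{X}}$ for G with all arrows pointing to X deleted. Under those assumptions, a single application of Rule 2 removes the do-operator from the effect of X on Y, and a single application of Rule 3 shows that intervening on X leaves Z untouched:

```latex
\begin{align*}
P(y \mid do(x)) &= P(y \mid x)
  && \text{Rule 2, since } (Y \perp\!\!\!\perp X)_{G_{\underline{X}}} \\
P(z \mid do(x)) &= P(z)
  && \text{Rule 3, since } (Z \perp\!\!\!\perp X)_{G_{\overline{X}}}
\end{align*}
```

In the first line, deleting the arrow X → Y leaves Y attached only to its own disturbance, so no path connects X and Y. In the second, deleting the arrow Z → X disconnects Z from X.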

We shall see that, to establish transportability, the goal will be different; instead of eliminating do-operators, we will need to separate them from a set of variables S that represent disparities between populations.

Appendix 2

Theorem 2. Let G be a selection diagram. Then for any node Y, the direct effect is transportable if there is no subgraph of G which forms a Y-rooted sC-tree.

Proof. We know from Tian [71, Theorem 22] that whenever there exists no subgraph of G satisfying all of the following: (i) ; (ii) has only one c-component, T itself; (iii) all variables in T are ancestors of Y in , the direct effect on Y is identifiable, as sC-trees are structures of this type. Further, Shpitser and Pearl [27, Theorem 2] showed that the same holds for C-trees, which also implies the nonexistence of sC-trees. Since no such structure shows up in G, the target quantity is identifiable, and hence transportable.

It remains to show that the same holds whenever there exists a subgraph that is a C-tree and in which no S node points to Y, i.e., there is no Y-rooted sC-tree at all. It is true that , given that all directed paths from S to Y are closed. This follows from the following facts: (1) all paths from S passing through Y’s ancestors were cut in ; (2) all bidirected paths were also closed given that the conditioning set contains only root nodes, and a connection from S must pass through at least one collider; (3) transportability does not depend on descendants of Y (by argument similar to Tian [71, Lemma 9]). Thus, it follows that we can write , concluding the proof. ■

Corollary 1. Let G be a selection diagram. Then for any node Y, the direct effect is transportable if there is no S node pointing to Y.

Proof. Follows directly from Theorem 2. ■

Lemma 5. The exclusive OR (XOR) function is commutative and associative.

Proof. Follows directly from the definition of the XOR function. ■

Remark 1. The construction given below is a strict generalization of Theorem 1. It is useful because it simplifies the construction provided in Theorem 1 and sets the tone for proofs over generic graph structures, which will prove instrumental in the sequel for establishing non-transportability in arbitrary structures.

Theorem 3. Let G be a Y-rooted sC-tree. Then the effects of any set of nodes in G on Y are not transportable.

Proof. The proof will proceed by constructing a family of counterexamples. For any such G and any set , we will construct two causal models and that will agree on , but disagree on the interventional distribution .

Let the two models , agree on the following features. All variables in are binary. All exogenous variables are distributed uniformly. All endogenous variables except Y are set to the bit parity (sum mod 2) of the values of their parents. The two models differ in respect to Y’s definition. Consider the function for Y, to be defined as follows:

Lemma 6. The two models agree in the distributions .

Proof. Since the two models agree on and all functions except , it suffices to show that maintains the same input/output behavior in both models for each domain.

Subclaim 1: Let us show that both models agree in the observational and interventional distributions relative to domain , i.e., the pair . The index variable S is set to 0 in , and evaluates to in both models, which proves the subclaim.

Subclaim 2: Let us show that both models agree in the observational distribution relative to , i.e., . The index variable S is set to 1 in , and evaluates to in , and 1 in . Since the evaluation in can be rewritten as , it remains to show that always evaluates to 0.

This is indeed the case, as the following observations show: (a) each variable in has exactly two endogenous children; (b) the given tree has Y as the root; (c) all functions are XOR. Together these imply that Y computes the bit parity of the sum of all U nodes, which turns out to be even, and so evaluates to 0, proving the subclaim. ■

Lemma 7. For any set , .

Proof. Given the functional description and the discussion in the previous Lemma, the function evaluates always to 1 in .

Now let us consider . Note that performing the intervention and cutting the edges going toward creates an asymmetry on the sum of the bidirected edges departing from , and consequently in the sum performed by Y. It will be the case that some will appear only once in the expression of Y. Therefore, depending on the assignment , we will need to evaluate the sum (mod 2) over in Y or its negation, which given the uniformity of the distribution of will yield in both cases. ■

By Lemma 2, Lemmas 6 and 7 together prove Theorem 3. ■
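The parity argument behind Lemmas 6 and 7 can be checked by brute force on a small case. The sketch below is not the article's construction (the function for Y is not reproduced here); it assumes a hypothetical three-node tree Z → X → Y with one fair-coin disturbance `u1` shared by Z and Y and another, `u2`, shared by X and Y, and it computes only the parity of Y's parents. Observationally each disturbance reaches Y twice and cancels; after an intervention one copy is cut off and the parity becomes a fair coin.

```python
from collections import Counter
from itertools import product

def parity_at_y(u1, u2, do=None):
    """Bit parity of Y's parents (X, u1, u2) in the hypothetical tree.

    `do` maps a variable name ("Z" or "X") to a forced value; a forced
    variable ignores its own parents.
    """
    do = do or {}
    z = do.get("Z", u1)        # Z := u1 unless intervened on
    x = do.get("X", z ^ u2)    # X := Z xor u2 unless intervened on
    return x ^ u1 ^ u2

def distribution(do=None):
    """Distribution of the parity when u1 and u2 are independent fair coins."""
    counts = Counter(parity_at_y(u1, u2, do) for u1, u2 in product((0, 1), repeat=2))
    return {value: counts[value] / 4 for value in (0, 1)}

print(distribution())           # every disturbance counted twice: parity is 0
print(distribution({"X": 0}))   # the copies arriving through X are cut
print(distribution({"Z": 1}))   # only u1 survives in the sum
```

The observational parity is constant, whereas under either intervention it is uniform regardless of the value assigned.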

Corollary 2. Let G be a selection diagram, and let and be sets of variables. If there exists a node W which is an ancestor of some node and such that there exists a W-rooted sC-tree which contains any variables in , then is not transportable.

Proof. Fix a W-rooted sC-tree T, and a path p from W to Y. Consider the graph . Note that in this graph . From the last theorem, is not transportable; it is now easy to construct in such a way that the mapping from to is one to one, while making sure all distributions are positive.

Remark 2. The previous results comprised cases in which there exist sC-trees involved in the non-transportability of Y – i.e., Y or some of its ancestors were roots of a given sC-tree. In the problem of identifiability, the counterpart of sC-trees (i.e., C-trees) suffices to characterize non-identifiability for singleton Y. But transportability is more subtle and this is not the case here – it not only depends on X and Y “locations” in the graph, but also the relative position of the S-nodes. Consider Figures 4 and 7(a) (called sp-graph). In these graphs there is no sC-tree but the effect of X on Y is still non-transportable.

The main technical subtlety here is that in sC-trees, a S-node combines its effect with a X-node intersecting in the root node (considering only the bidirected edges), which is not the case for non-transportability in general. Note that in the graphs in Figure 4, and the sp-graph, the nodes S and X intersect first through ordinary edges and meet through bidirected edges only on the Y node. This implies a certain “asynchrony” because, in the structural sense, the existence of a S-node implies a difference in the structural equations between domains, but this difference alone does not imply non-transportability (for instance, is transportable in the sp-graph even though the equations of Z differ between the two models).

Figure 7

Selection diagrams in which is not transportable, there is no sC-tree but there is a sC-forest. These diagrams will be used as basis for the general case; the first diagram is named sp-graph and the second one sb-graph.

The key idea to produce a proof for non-transportability in these cases is to keep the effect of S-nodes after intersecting with X “dormant” until they reach the target Y and then manifest. We implement this idea in the next two proofs, which can be seen as base cases, and should pave the way for the most general problem.

Theorem 8. is not transportable in the sp-graph (Figure 7(a)).

Proof. We will construct two causal models and compatible with the sp-graph that will agree on , but disagree on the interventional distribution .

Let us assume that all variables in are binary, and let be the common cause of X and Y, be the common cause of Z and Y, and be the random disturbance exclusive to Z. Let and be defined as follows:

and:

Both models agree in respect to , which is defined as follows: .

Lemma 8. The two models agree in the distributions .

Proof. Subclaim 1: Let us show that both models agree in the observational and interventional distributions relative to domain , i.e., the pair . In both models X has the same expression, which entails the same (uniform) probabilistic behavior in both cases. The index variable S is set to 0 in , and Z evaluates to in and in . Clearly, for any value of , since is the same and uniformly distributed in both models, we obtain the same (uniform) input/output probabilistic behavior in and (note that can freely vary independently of X). In a similar way, Y evaluates to in both models, which entails the same (uniform) input/output probabilistic behavior in both models. In regard to , it is clear that Z does not depend (probabilistically) on the specific value of X, and so the equality between both models follows. For the case when we have , Y evaluates to in and in , and given the uniformity of , they preserve the same (uniform) input/output probabilistic behavior. (For a more elaborate argument, see Theorem 4 below.)

Subclaim 2: Let us show that both models agree in the observational distribution relative to . The index variable S is set to 1 in , evaluates to in , and in . Again, for any value of X, together with the uniformity of , we obtain the same (uniform) input/output probabilistic behavior in both models (note again that can freely vary independently of variations of X, and so Z). Further, evaluates to 1 in both models, which yields the same (uniform) input/output behavior in both models. (To guarantee positivity, we can apply the trick of making a new such that returns 0 half the time, and the other half (i.e., set , where C is a fair coin).) ■

Lemma 9. There exist values of such that .

Proof. Fix . First notice that evaluates to in and in . Given that is uniformly distributed, both quantities coincide (and they represent the effect of X on Z, which is transportable in G). Now the evaluation of in reduces to , while it reduces to 1 in , which shows the disagreement and finishes the proof of this Lemma. ■

By Lemma 2, Lemmas 8 and 9 together prove Theorem 8. ■

Remark 3. There exists a different sort of asymmetry in the case of Figure 7(b) (called sb-graph), and the nodes X and S do not intersect before meeting Y – i.e., they have disjoint paths and Y lies precisely in their intersection.

Still, this case is not the same as having a sC-tree because in sb-graphs we need to keep the equality from the S nodes to Y until S intersects X on Y. Employing a construct similar to that used for the sp-graph, we keep the effect of S dormant until it reaches Y, where it emerges.

Theorem 9. is not transportable in the sb-graph (Figure 7(b)).

Proof. We construct two causal models and compatible with the sb-graph that will agree on , but disagree on the interventional distribution .

Let us assume that all variables in are binary, and let be the common cause of X and Y, be the common cause of Z and Y, and be the random disturbance exclusive to X. Let and agree with the following definitions:

and disagree in respect to Z as follows:

Both models also agree in respect to , which is defined as follows:

.

Lemma 10. The two models agree in the distributions .

Proof. Subclaim 1: Let us show that both models agree in the observational and interventional distributions relative to domain , i.e., the pair . The index variable S is set to 0 in , and are defined in the same way in both models, and so it suffices to analyze Y, which in this case evaluates to in both models, preserving the same (uniform) probabilistic behavior. Given that, it is not difficult to see that both models also evaluate in the same way when considering the interventions in I.

Subclaim 2: Let us show that both models agree in the observational distribution relative to . The index variable S is set to 1 in . Given that are defined in the same way in both models, the uniformity of U makes them evaluate in the same way in both models, and Y evaluates to 1 in both models. (As in Lemma 8, the same trick to make the distribution positive could be applied here.) ■

Lemma 11. There exist values of such that .

Proof. Fix . First notice that evaluates to in both models, and the evaluation of in reduces to 1, while it reduces to in . It follows that in , evaluates to 1 with probability 1, while in it evaluates to 1 with probability , which disagree by construction, finishing the proof of this Lemma. ■

By Lemma 2, Lemmas 10 and 11 together prove Theorem 9. ■

Remark 4. There are two complementary components to forge a general scheme to prove arbitrary non-transportability. First, the construct of Theorem 3 shows how to prove non-transportability for general structures such as sC-trees. Second, the specific proofs of non-transportability for the sp-graph (Theorem 8) and sb-graph (Theorem 9) partition the possible interactions between X, S and Y. In the former, X and S intersect before meeting with Y, while in the latter they have disjoint paths and Y lies in their intersection. The proof for the general case, which we show below, combines these analyses.

Theorem 4. Assume there exist that form a s-hedge for in and . Then is not transportable from to .

Proof. We first consider counterexamples with the induced graph , and assume, without loss of generality, that H is a forest. We construct two causal models and that will agree on , but disagree on the interventional distribution .

Let F be an -rooted sC-forest, let be the set of observable variables and be the set of unobservable variables in F. Let us assume that all variables in are binary. Call the set of variables pointed by S-nodes in , which by the definition of sC-forest is guaranteed to be non-empty.

In model 1, let each compute the bit parity of all its observable and unobservable parents (i.e., , where the xor is applied for each element of the set and the result computed so far), while in model 2, let compute the bit parity of all its parents except that any node in disregards the parents' values if the parent is in F (i.e., if is in , and , otherwise).

Define as follows:

where is constructed in a similar way as in and above, and is an additional fair coin exclusively pointing to W. Let us call the collection of such coins. Furthermore, let us assume that each is also a fair coin (i.e., ).

Lemma 12. The two models agree in the distribution of and there exists a value assignment for such that .

Proof. For , the result follows directly since the systems of equations in both models reduce to the construction given in Theorem 4 of [27]. ■

Lemma 13. The two models agree in the distributions .

Proof. Let us show that both models agree in the observational distribution P relative to domain . The selection variable S is set to 0 in , and note that both systems are the same as in except that now each variable has an extra variable pointing to it that should be taken into account in W’s evaluation, and in turn in the whole system.

We have a forest over the endogenous nodes and all functions compute the bit parity of the value of their parents, and so we can view each node as computing the sum mod 2 of its exogenous ancestors in H. We want to show that the distribution of each family is equally likely for each possible assignment (i.e., , for all ).

Let us partition the analysis into two cases. First consider the case of in which there exists an S-node in the respective sC-tree. Note that the evaluation of relies only on the value of in its respective tree since has an even number of endogenous children in F, and it is counted twice, so evaluates to zero (i.e., it does not affect ’s evaluation). For now, let us assume that there is only one that affects the evaluation of . Given the uniformity of , it suffices to show that can vary independently for any configuration of the parents of .

For any configuration of , consider the corresponding evaluation of , and also . We want to show that it is possible to flip the current value of from to while preserving the parents’ evaluation . Assume this is not so. This implies that the evaluations of and count the same ’s, a contradiction.

To see why, consider the set of parents of that are descendants of . Now, for each of these parents flip the minimum number of variables from , and call this set . (Note that this is always possible since we need at most one U for each parent, which should exist by construction of sC-forest.) Now, make , and note that since flipping the values of compensates the flip of . But it is also true now that evaluates to since, in the same way as before, all other variables in are cancelled out in ’s evaluation, including the ones in . This proves the claim.

Consider the following two facts: Subclaim 1: Let X and Y be two binary variables such that and . Then the probabilistic input/output behavior of is the same as that of Y. The variable whenever , which happens with probability . Since , the expression reduces to . Subclaim 2: Let X and Y be two binary variables such that . Then the probabilistic input/output behavior of is the same as that of X (or Y). This follows directly from Subclaim 1. It is clear that if there are multiple nodes from in the evaluation of , the same construction is also valid given the subclaim above. It is also not difficult to generalize this argument to consider root sets that are not singletons, including roots that have no S-nodes as ancestors.
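The subclaims rest on a standard fact about XOR that can be checked with exact arithmetic. The sketch below assumes one reading of the subclaims, namely that one variable is a fair coin and the other is an independent binary variable with arbitrary bias; the function name `xor_probability` is hypothetical. Under that assumption, the XOR of the two is again a fair coin.

```python
from fractions import Fraction


def xor_probability(p_x, p_y):
    """Return P(X xor Y = 1) for independent binary X and Y,
    where p_x = P(X = 1) and p_y = P(Y = 1)."""
    # X xor Y = 1 exactly when (X=1, Y=0) or (X=0, Y=1).
    return p_x * (1 - p_y) + (1 - p_x) * p_y


HALF = Fraction(1, 2)

# Biases chosen only for illustration; any value in [0, 1] behaves the same.
BIASES = [Fraction(0), Fraction(1, 10), Fraction(1, 3), Fraction(9, 10), Fraction(1)]

# XOR with a fair coin is a fair coin, whatever the bias of the other input.
RESULTS = [xor_probability(HALF, p) for p in BIASES]
```

Every entry of `RESULTS` equals one half, because the expression simplifies to (1/2)(1 - p) + (1/2)p = 1/2 regardless of p.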

Finally, let us consider the case of . It suffices to show that the function from to is 1–1 when we fix . We use the same argument as Shpitser. Assume this is not so, and fix two instantiations of that map to the same value of , and differ by the set . Since the bidirected edges form a spanning tree, there exists with an odd number of parents in (and were not in , by construction). Order them topologically and let the topmost be called X. Note that if we flip all values in , the value of X will also flip, a contradiction. Given the uniformity of , the claim follows. We can put this together with the previous claim, and the result follows. We can add fair coins as the input to all other variables outside F, which will imply the claim for the whole graph G.

In regard to the equality between I, note that given that the equality of both models holds for P, and removing edges due to interventions will just make some nodes from to have an odd number of children, it is not difficult to see based on the previous argument that this just creates more variables that are free to vary, which will entail the same probabilistic uniform behavior in both models. Another way to see this fact is to consider the new exogenous variables from that have only one child after the intervention as analogous to , and so the same argument follows. ■

Finally, Lemma 2 together with Lemmas 12 and 13 prove Theorem 4. ■

Theorem 5 (soundness). Whenever sID returns an expression for , it is correct.

Proof. Noting that the selection diagram given as input to sID is also a causal diagram over , and trivial transportability is equivalent to identifiability in , the correctness of the identifiability calls was already established elsewhere [27, 34].

It remains to show the correctness of the test in line 10 of sID. First note that, by construction, in each local call is always a set of pre-treatment covariates. But now the correctness follows directly by S-admissibility of together with Corollary 1 in Ref. 1. Further note that the set of S-nodes outside the local component will not affect separability of the S-nodes inside it (following the topology of the hedge), and other S-nodes outside can be removed from the expression before the test. More specifically, note that the effect in each local call that uses line 10 can be expressed in its expanded form (using a typical C-component decomposition). Given that the independence imposed by S-admissibility holds, and that both populations share the same causal graph G, the functions of can be replaced with the respective functions in , which implies the result. ■

Remark 5. The next results are similar to the identification counterparts given in Refs. 26, 69.

Theorem 6. Assume sID fails to transport (executes line 11). Then there exist , , such that the graph pair returned by the fail condition of sID contains as edge subgraphs sC-forests F, that form an s-hedge for .

Proof. Before failure, sID evaluated false consecutively at lines 5, 6, and 10, so D local to this call is an sC-component; let be its root set. We can remove some directed arrows from D while preserving as root, yielding a -rooted sC-forest F. Since by construction is closed under descendants and only directed arrows were removed, both are sC-forests. Also by construction, together with the fact that and from the recursive call are clearly subsets of the original input, this finishes the proof. ■

Corollary 3 (completeness). sID is complete.

Proof. The result follows from Theorem 6 where is not transportable in H. But now, it is easy to add the remaining variables from G, making them independent of H (e.g., as random coins). So, the models in the counterexample induce G, and witness the non-transportability of . ■

Corollary 4. is transportable from to in G if and only if there is no s-hedge for in G for any and .

Proof. Follows directly from the previous Corollary. ■

Theorem 7. The rules of do-calculus, together with standard probability manipulations, are complete for establishing transportability of all effects of the form .

Proof. It was shown elsewhere [69] that the steps of sID other than line 10 correspond to sequences of standard probability manipulations and applications of the rules of do-calculus. Line 10 consists of a conditional independence judgment and standard probability operations for the replacement of the functions based on the invariance allowed by the S-admissibility of the local in each recursive call (as discussed above in the proof of correctness). ■

References

1. Pearl J, Bareinboim E. Transportability of causal and statistical relations: a formal approach. In: Proceedings of the Twenty-Fifth National Conference on Artificial Intelligence (AAAI 2011). Menlo Park, CA: AAAI Press, 2011:247–54.

2. Bareinboim E, Pearl J. Causal transportability with limited experiments. Technical Report R-408, Cognitive Systems Laboratory, Department of Computer Science, UCLA, 2013, submitted.

3. Bareinboim E, Pearl J. Meta-transportability of causal effects: a formal approach. In: Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2013), 2013, forthcoming.

4. Campbell D, Stanley J. Experimental and quasi-experimental designs for research. Chicago: Wadsworth Publishing, 1963.

5. Manski C. Identification for prediction and decision. Cambridge, MA: Harvard University Press, 2007.

6. Glass GV. Primary, secondary, and meta-analysis of research. Educ Res 1976;5:3–8.

7. Hedges LV, Olkin I. Statistical methods for meta-analysis. Academic Press, 1985.

8. Owen AB. Karl Pearson's meta-analysis revisited. Ann Stat 2009;37:3867–92.

9. Höfler M, Gloster A, Hoyer J. Causal effects in psychotherapy: counterfactuals counteract overgeneralization. Psychother Res 2010, DOI: 10.1080/10503307.2010.501041.

10. Shadish W, Cook T, Campbell D. Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton-Mifflin, 2nd ed., 2002.

11. Adelman L. Experiments, quasi-experiments, and case studies: a review of empirical methods for evaluating decision support systems. IEEE Trans Syst Man Cybern 1991;21:293–301.

12. Morgan S, Winship C. Counterfactuals and causal inference: methods and principles for social research (Analytical Methods for Social Research). New York: Cambridge University Press, 2007.

13. Pearl J. Causal diagrams for empirical research. Biometrika 1995;82:669–710.

14. Greenland S, Pearl J, Robins J. Causal diagrams for epidemiologic research. Epidemiology 1999;10:37–48.

15. Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. Cambridge, MA: MIT Press, 2nd ed., 2001.

16. Pearl J. Causality: models, reasoning, and inference. New York: Cambridge University Press, 2nd ed., 2009.

17. Koller D, Friedman N. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

18. Westergaard H. Scope and method of statistics. Am Stat Assoc 1916;15:229–76.

19. Yule G. On some points relating to vital statistics, more especially statistics of occupational mortality. J R Stat Soc 1934;97:1–84.

20. Lane P, Nelder J. Analysis of covariance and standardization as instances of prediction. Biometrics 1982;38:613–21.

21. Cole S, Stuart E. Generalizing evidence from randomized clinical trials to target populations. Am J Epidemiol 2010;172:107–15.

22. Pearl J. Causality: models, reasoning, and inference. New York: Cambridge University Press, 2000.

23. Pearl J. Causal inference in statistics: an overview. Stat Surv 2009;3:96–146.

24. Pearl J, Verma T. A theory of inferred causation. In: Allen J, Fikes R, Sandewall E, editors. Principles of knowledge representation and reasoning: Proceedings of the Second International Conference. San Mateo, CA: Morgan Kaufmann, 1991:441–52.

25. Bareinboim E, Brito C, Pearl J. Local characterizations of causal Bayesian networks. In: Croitoru M, Corby O, Howse J, Rudolph S, Wilson N, editors. GKR-IJCAI, Lecture Notes in Artificial Intelligence (7205), Springer-Verlag, 2012:1–17.

26. Tian J, Pearl J. A general identification condition for causal effects. In: Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI 2002). Menlo Park, CA: AAAI Press/The MIT Press, 2002:567–73.

27. Shpitser I, Pearl J. Identification of joint interventional distributions in recursive semi-Markovian causal models. In: Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI 2006). Menlo Park, CA: AAAI Press, 2006:1219–26.

28. Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. New York: Springer-Verlag, 1993.

29. Galles D, Pearl J. Testing identifiability of causal effects. In: Besnard P, Hanks S, editors. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI 1995). San Francisco: Morgan Kaufmann, 1995:185–95.

30. Pearl J, Robins J. Probabilistic evaluation of sequential plans from causal models with hidden variables. In: Besnard P, Hanks S, editors. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI 1995). San Francisco: Morgan Kaufmann, 1995:444–53.

31. Halpern J. Axiomatizing causal reasoning. In: Cooper G, Moral S, editors. Uncertainty in artificial intelligence. San Francisco, CA: Morgan Kaufmann, 1998:202–10, also J Artif Intell Res 2000;12:317–37.

32. Kuroki M, Miyakawa M. Identifiability criteria for causal effects of joint interventions. J R Stat Soc 1999;29:105–17.

33. Verma T, Pearl J. Equivalence and synthesis of causal models. In: Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence (UAI 1990). Cambridge, MA, 1990:220–27, also in Bonissone P, Henrion M, Kanal LN, Lemmer JF, editors. Uncertainty in artificial intelligence 6. Elsevier Science Publishers, B.V., 1990:255–68, 1991.

34. Huang Y, Valtorta M. Identifiability in causal Bayesian networks: a sound and complete algorithm. In: Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI 2006). Menlo Park, CA: AAAI Press, 2006:1149–56.

35. Pearl J. Direct and indirect effects. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI 2001). San Francisco, CA: Morgan Kaufmann, 2001:411–20.

36. Pearl J. The mediation formula: a guide to the assessment of causal pathways in nonlinear models. In: Berzuini C, Dawid P, Bernardinelli L, editors. Causality: statistical perspectives and applications. New York: Wiley, Chapter 12, 2012.

37. Bareinboim E, Pearl J. Causal inference by surrogate experiments: z-identifiability. In: de Freitas N, Murphy K, editors. Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI 2012), AUAI Press, 2012:113–20.

38. Cornfield J. A method of estimating comparative rates from clinical data; applications to cancer of the lung, breast, and cervix. J Natl Cancer Inst 1951;11:1269–75.

39. Whittemore A. Collapsibility of multidimensional contingency tables. J R Stat Soc Ser B 1978;40:328–40.

40. Geng Z, Guo J, Fung W-K. Criteria for confounders in epidemiological studies. J R Stat Soc Ser B 2002;64:3–15.

41. Heckman JJ. Sample selection bias as a specification error. Econometrica 1979;47:153–61.

42. Robins JM, Hernán M, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550–60.

43. Hernán M, Hernández-Díaz S, Robins J. A structural approach to selection bias. Epidemiology 2004;15:615–25.

44. Lauritzen SL, Richardson TS. Discussion of McCullagh: sampling bias and logistic models. J R Stat Soc Ser B 2008;70:140–50.

45. Geneletti S, Richardson S, Best N. Adjusting for selection bias in retrospective, case-control studies. Biostatistics 2009;10.

46. Weisberg H, Hayden V, Pontes V. Selection criteria and generalizability within the counterfactual framework: explaining the paradox of antidepressant-induced suicidality? Clin Trials 2009;6:109–18.

47. Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. J R Stat Soc Ser A 2011;174:369–86.

48. Angrist J, Imbens G, Rubin D. Identification of causal effects using instrumental variables (with comments). J Am Stat Assoc 1996;91:444–72.

49. Didelez V, Kreiner S, Keiding N. Graphical models for inference under outcome-dependent sampling. Stat Sci 2010;25:368–87.

50. Bareinboim E, Pearl J. Controlling selection bias in causal inference. In: Girolami M, Lawrence N, editors. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012), JMLR (22), 2012:100–08.

51. Pearl J. A solution to a class of selection-bias problems. Technical Report R-405, Cognitive Systems Laboratory, Department of Computer Science, UCLA, 2012.

52. Hernán M, VanderWeele T. Compound treatments and transportability of causal inference. Epidemiology 2011;22:368–77.

53. Petersen M. Compound treatments, transportability, and the structural causal model: the power and simplicity of causal graphs. Epidemiology 2011;22:378–81.

54. Cox D. The planning of experiments. New York: John Wiley and Sons, 1958.

55. Heckman J. Randomization and social policy evaluation. In: Manski C, Garfinkle I, editors. Evaluations: welfare and training programs. Cambridge, MA: Harvard University Press, 1992:201–30.

56. Hotz VJ, Imbens G, Mortimer JH. Predicting the efficacy of future training programs using past experiences at other locations. J Econ 2005;125:241–70.

57. Pearl J, Bareinboim E. Transportability of causal and statistical relations: a formal approach. Technical Report R-372, Cognitive Systems Laboratory, Department of Computer Science, UCLA, 2011.

58. Pearl J. Some thoughts concerning transfer learning with applications to meta-analysis and data sharing estimation. Technical Report R-387, Cognitive Systems Laboratory, Department of Computer Science, UCLA, 2012.

59. Bareinboim E, Pearl J. Transportability of causal effects: completeness results. In: Hoffmann J, Selman B, editors. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (AAAI 2012), 2012:698–704.

60. Bollen KA, Pearl J. Eight myths about causality and structural equation models. In: Morgan SL, editor. Handbook of causal analysis for social research (in press), New York: Springer, 2013, Chapter 15.

61. Haavelmo T. The statistical implications of a system of simultaneous equations. Econometrica 1943;11:1–12, reprinted in Hendry DF, Morgan MS, editors. The foundations of econometric analysis. Cambridge University Press, 1995:477–90.

62. Strotz R, Wold H. Recursive versus nonrecursive systems: an attempt at synthesis. Econometrica 1960;28:417–27.

63. Pearl J. Trygve Haavelmo and the emergence of causal calculus. Technical Report R-391, Cognitive Systems Lab, Department of Computer Science, UCLA; to appear: Econometric Theory, special issue on Haavelmo Centennial, 2012.

64. Lehmann EL, Casella G. Theory of point estimation (Springer Texts in Statistics). New York: Springer, 2nd ed., 1998.

65. Pearl J. Probabilistic reasoning in intelligent systems. San Mateo, CA: Morgan Kaufmann, 1988.

66. Hayduk L, Cummings G, Stratkotter R, Nimmo M, Grygoryev K, Dosman D, et al. Pearl's d-separation: one more step into causal thinking. Struct Equ Modeling 2003;10:289–311.

67. Glymour M, Greenland S. Causal diagrams. In: Rothman K, Greenland S, Lash T, editors. Modern epidemiology. Philadelphia, PA: Lippincott Williams & Wilkins, 3rd ed., 2008:183–209.

68. Berkson J. Limitations of the application of fourfold table analysis to hospital data. Biometrics Bull 1946;2:47–53.

69. Shpitser I, Pearl J. Identification of conditional interventional distributions. In: Dechter R, Richardson T, editors. Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI 2006). Corvallis, OR: AUAI Press, 2006:437–44.

70. Huang Y, Valtorta M. Pearl’s calculus of intervention is complete. In: Dechter R, Richardson T, editors. Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence. Corvallis, OR: AUAI Press, 2006:217–24.

71. Tian J. Studies in causal reasoning and learning. PhD Thesis, Computer Science Department, University of California, Los Angeles, CA, 2002.

See Def. 3 below for the formal construction of selection diagrams. In all diagrams, dashed arcs (e.g., ) represent the presence of latent variables affecting both X and Y.

3

This result can be derived by purely graphical operations if we write as , thus attributing the difference between and to a fictitious event . The invariance of the age-specific effect then follows from the conditional independence , which implies , and licenses the derivation of the transport formula.

4

Eq. [1] reflects the familiar method of “standardization” – a statistical extrapolation method that can be traced back to a century-old tradition in demography and political arithmetic [18–21]. We will show that standardization is only valid under certain conditions.
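As a rough illustration of standardization, the sketch below reweights stratum-specific experimental effects by the covariate distribution of the target population. It assumes that Eq. [1] has the usual standardization form, P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z); the age strata, all numbers, and the function name `standardize` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def standardize(effect_by_stratum, target_weights):
    """Reweight stratum-specific effects P(y | do(x), z) by P*(z)."""
    if sum(target_weights.values()) != 1:
        raise ValueError("target weights must sum to 1")
    return sum(effect_by_stratum[z] * target_weights[z] for z in target_weights)


# Hypothetical z-specific effects measured in the source experiment.
effect_by_age = {"young": Fraction(4, 5), "old": Fraction(2, 5)}

# Hypothetical age distribution observed in the target population.
target_age_distribution = {"young": Fraction(3, 10), "old": Fraction(7, 10)}

transported_effect = standardize(effect_by_age, target_age_distribution)
```

With these made-up inputs the result is (4/5)(3/10) + (2/5)(7/10) = 13/25. The reweighting is justified only when the stratum-specific effects are invariant across the two populations, which is the condition the footnote says must be checked.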

The assumption that there are no structural changes between domains can be relaxed as follows. Starting with the structure in the target population , make , and then add S-nodes to D following the same procedure as in Def. 3.

7

Transportability analysis assumes that enough structural knowledge about both domains is available to substantiate the production of their respective causal diagrams. In the absence of such knowledge, causal discovery algorithms might be used to help in inferring the diagrams from data [15, 22, 24].

8

These invariance assumptions are analogous to the missing arrows in causal graphs [25], which allow one to identify causal effects from observational data.

9

Departing from results given in [28–32], the advent of C-components complements the notion of inducing path, which was earlier introduced in [33], and led to a breakthrough result proving completeness of the do-calculus for non-parametric identification of causal effects by [27, 34].

10

Note that, by definition, at least one S-node has to appear in both .

11

A surrogate variable is different from an instrumental variable in that the former should lead to the identification of causal effects even in nonparametric models; IV methods are limited to “local” causal effects (so-called LATE [48]).

12

We use the acronym SCM for both parametric and non-parametric representations (also called Structural Equation Models (SEM)), though historically, SEM practitioners preferred the parametric representation and often confused it with regression equations [60].

13

Equivalently, can be interpreted as the joint probability of under a randomized experiment among units receiving treatment level . Readers versed in potential-outcome notations may interpret as the probability , where is the potential outcome under treatment .

14

Counterfactuals are defined similarly through the equation (see [16, Ch. 7]), but will not be needed for the discussions in this article.

15

This definition appears similar to, but differs fundamentally from, the standard statistical definition [64, p. 22], which deals with the unidentifiability of the parameter set from a distribution . In our case, the query is not a parameter of P (see [22, p. 77]).

16

See Hayduk et al. [66], Glymour and Greenland [67], and Pearl [16, p. 335] for a gentle introduction to d-separation.

17

This special handling of collision nodes (or colliders, e.g., ) reflects a general phenomenon known as Berkson’s paradox [68], whereby observations on a common consequence of two independent causes render those causes dependent. For example, the outcomes of two independent coins are rendered dependent by the testimony that at least one of them is a tail.
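The two-coin example can be verified by enumerating the four equally likely outcomes. The sketch below is a minimal illustration; the helper `prob` and the event names are hypothetical. It compares the probability that the first coin shows a tail before and after conditioning on the report that at least one coin is a tail.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes of two independent fair coins.
OUTCOMES = list(product("HT", repeat=2))


def prob(event, given=lambda o: True):
    """Conditional probability of `event` given `given`, by enumeration."""
    conditioned = [o for o in OUTCOMES if given(o)]
    return Fraction(sum(1 for o in conditioned if event(o)), len(conditioned))


def first_is_tail(o):
    return o[0] == "T"


def second_is_tail(o):
    return o[1] == "T"


def at_least_one_tail(o):
    return "T" in o


# Without the testimony the coins are independent.
p_unconditional = prob(first_is_tail)
p_given_second = prob(first_is_tail, second_is_tail)

# With the testimony, learning the second coin changes beliefs about the first.
p_given_report = prob(first_is_tail, at_least_one_tail)
p_given_report_and_second = prob(
    first_is_tail, lambda o: at_least_one_tail(o) and second_is_tail(o)
)
```

Before conditioning, both probabilities equal 1/2. Given the report, the probability that the first coin is a tail is 2/3, and it drops to 1/2 once the second coin is also known to be a tail, so the two coins are no longer independent.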

18

Such derivations are illustrated in graphical detail in Ref. [16, p. 87].
