A directed acyclic graph (DAG) can represent the factorization of a joint distribution of a set of random variables. To be more precise, a Bayesian network is a pair (G, P), where G is a DAG and P is a joint probability distribution of variables in one-to-one correspondence with the nodes of G, with the property that each variable is conditionally independent of its non-descendants given its parents. It follows from this definition that the joint probability P factors according to G, as the product of the conditional probabilities of each node given its parents. Thus a discrete Bayesian network is fully specified by a DAG and a set of conditional probability tables, one for each node given its parents [1, 2].
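The factorization can be illustrated with a minimal numerical sketch. The three-node DAG (0 → 1, 0 → 2, 1 → 2), the table values, and all variable names below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical binary DAG: 0 -> 1, 0 -> 2, 1 -> 2.
p0 = np.array([0.6, 0.4])                  # P(X0)
p1_given_0 = np.array([[0.7, 0.3],         # P(X1 | X0), rows indexed by X0
                       [0.2, 0.8]])
p2_given_01 = np.array([[[0.9, 0.1],       # P(X2 | X0, X1), indexed [x0, x1, x2]
                         [0.5, 0.5]],
                        [[0.4, 0.6],
                         [0.1, 0.9]]])

# Joint distribution: product of each node's table given its parents.
joint = np.einsum("a,ab,abc->abc", p0, p1_given_0, p2_given_01)
```

Each entry `joint[x0, x1, x2]` is the product of one entry from each conditional probability table, so the three tables determine the joint distribution completely.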
A causal Bayesian network is a Bayesian network enhanced with a causal interpretation. Work initiated by Pearl [3, 4] investigated the identification of causal effects in causal Bayesian networks when some variables are assumed observable and others are hidden. In a non-parametric setting, with no assumptions about the state space of variables, there is a complete algorithm for determining which causal effects between variables are identifiable [5–8].
As powerful as this theory is, however, it does not address identifiability when assumptions are made on the nature of the variables. Indeed, by specializing to finite state spaces, causal effects that were non-identifiable according to the theory above may become identifiable. One particular example, with DAG shown in Figure 1, has been studied by Kuroki and Pearl. If the state space of hidden variable 0 is finite, and observable variables 1 and 4 have state spaces of larger sizes, then the causal effect of variable 2 on variable 3 can be determined, for generic parameter choices.
In this paper we study in detail identification properties of certain small Bayesian networks, as a first step toward developing a systematic understanding of identification in the presence of finite hidden variables. While this includes an analysis of the model with the DAG above, our motivation is different from that of Kuroki and Pearl, and results were obtained independently. We make a thorough study of networks with up to five binary variables, one of which is unobservable and parental to all observable ones, as shown in Table 3 of the Appendix. These investigations lead us to develop some basic tools and arguments that can be applied more generally to questions of parameter identifiability.
In addition, for each such binary model in Table 3, we determine a value $k \in \mathbb{N} \cup \{\infty\}$ such that the marginalization from the full joint distribution to that over the observable variables is generically k-to-one. Although we restrict this exhaustive study to binary models for simplicity, straightforward modifications to our arguments would extend them to larger state spaces. A typical requirement for such an extended identifiability result is that the state spaces of observable variables be sufficiently large, relative to that of the hidden variable, as in the result of Kuroki and Pearl described earlier. Interestingly, that result restricted to finite state spaces follows easily from our framework, and can be obtained for continuous state spaces of observable variables using arguments of Allman et al.
We use the term “DAG model” for the collection of all Bayesian networks with the same DAG and specification of state spaces for the variables. With the conditional probability tables of nodes given their parents forming the parameters of the model, we thus allow these tables to range over all valid tables of a fixed size to give the parameter space of such a model.
That some of the DAG models we consider have non-identifiable parameters is a consequence of the well-known non-uniqueness (in most circumstances) of non-negative rank decompositions of matrices. An example is the infinite-to-one parameterization of model 4-2a in Table 3. For greater detail on this issue see the work of Mond et al. and Kubjas et al.
In dealing with discrete unobserved variables, another well-understood identifiability issue is sometimes called label swapping. If the latent variable has r states, there are r! parameter choices, obtained by permuting the state labels of the latent variable, that generate the same observable distribution. Thus the parameterization map is generically at least r!-to-one. For models with a single binary latent variable, it is thus commonly expected that parameterizations are either infinite-to-one due to a parameter space of too high a dimension, or 2-to-one due to label swapping. Our work, however, finds surprisingly simple examples such that the mapping is 4-to-one, so that subtler non-identifiability issues arise.
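Label swapping can be checked numerically. The sketch below assumes a hypothetical model with one binary latent variable and three binary observed children; the function name `observable` and all parameter values are illustrative only.

```python
import numpy as np

def observable(p0, M1, M2, M3):
    """Observable distribution P(x1, x2, x3) = sum_h p0[h] M1[h,x1] M2[h,x2] M3[h,x3]."""
    return np.einsum("h,ha,hb,hc->abc", p0, M1, M2, M3)

# Hypothetical parameters; rows of each matrix are indexed by the latent state.
p0 = np.array([0.3, 0.7])
M1 = np.array([[0.9, 0.1], [0.2, 0.8]])
M2 = np.array([[0.6, 0.4], [0.1, 0.9]])
M3 = np.array([[0.7, 0.3], [0.25, 0.75]])

perm = [1, 0]  # swap the two latent state labels
P = observable(p0, M1, M2, M3)
P_swapped = observable(p0[perm], M1[perm], M2[perm], M3[perm])
```

The two parameter choices are distinct points of the parameter space, yet `P` and `P_swapped` agree entry by entry, since the sum over latent states does not depend on how those states are labeled.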
Our analysis arises from an algebraic viewpoint of the identifiability problem. With finite state spaces the parameterization maps for DAG models with hidden variables are polynomial. Given a distribution arising from the model, the parameters are identifiable precisely when a certain system of multivariate polynomial equations has exactly one solution (up to label swapping of states for hidden variables). Though in principle computational algebra software can be used to investigate parameter identifiability, the necessary calculations are usually intractable for even moderate size DAGs and/or state spaces. In addition, one runs into issues of complex versus real roots, and the difficulty of determining when real roots lie within stochastic bounds. While our arguments are fundamentally algebraic, they do not depend on any machine computations.
If a single polynomial p in one variable is given, of degree n, then it is well known that the map from $\mathbb{C}$ to $\mathbb{C}$ that it defines will be generically n-to-one. Indeed the equation $p(x) = a$ will be of degree n for each choice of a, and generically will have n distinct roots. This fact generalizes to polynomial maps from $\mathbb{C}^N$ to $\mathbb{C}^M$; there always exists a $k \in \mathbb{N} \cup \{\infty\}$ such that the map is generically k-to-one. However if such a map has real coefficients, and is instead viewed as a map from (a subset of) $\mathbb{R}^N$ to $\mathbb{R}^M$, it may not have a generic k-to-one behavior. For instance, from a typical graph of a cubic one sees there can be sets of positive measure on which it is 3-to-one, and others on which it is one-to-one, as well as an exceptional set of measure zero on which the cubic is 2-to-one. While this exceptional set arises since a polynomial may have repeated roots, the lack of a generic k-to-one behavior is due to passing from considering a complex domain for the function, to a real one.
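The behavior of a real cubic can be made concrete with a small sketch. The particular cubic $x^3 - 3x$ and the function name are our own illustrative choices; the count uses the discriminant $108 - 27a^2$ of $x^3 - 3x - a$.

```python
def real_preimage_count(a):
    """Number of distinct real x with x**3 - 3*x == a."""
    # Discriminant of the depressed cubic x^3 + p x + q with p = -3, q = -a.
    disc = 108.0 - 27.0 * a * a
    if disc > 0:
        return 3  # three distinct real roots
    if disc < 0:
        return 1  # one real root, two complex conjugate roots
    return 2      # a double root and a simple root (p != 0 rules out a triple root)
```

For $|a| < 2$ the map is 3-to-one, for $|a| > 2$ it is one-to-one, and only at $a = \pm 2$ is it 2-to-one, so no single k describes its generic behavior over the reals.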
The fact that the polynomial parameterizations for the models investigated here have a generic k-to-one behavior on their parameter space thus depends on the particular form of the parameterizations. For those binary models in Table 3, we prove this essentially one model at a time, while obtaining the value for k. In the case of finite k, our arguments actually go further and characterize the k elements of the fiber of the parameterization map in terms of a generic one of them. Of course when k = 2 this is nothing more than label swapping, but for the cases of k = 4 more is required. Precise statements appear in later sections. In some cases, we also give descriptions of an exceptional subset of the parameter space where the generic behavior may not hold. In all cases, the reader can deduce such a set from our arguments.
After setting terminology in Section 2, in Section 3 we establish that, when all variables have fixed finite state spaces, Markov equivalent DAGs specify parameter equivalent models. Thus in answering generic identifiability questions one need only consider Markov equivalence classes of DAGs. In Section 4 we revisit the fundamental result due to Kruskal, as developed in Allman et al. for identifiability questions. We give explicit identifiability procedures for the DAG to which this result applies most directly (model 3-0), and also for the DAG of model 4-3b. These two DAGs are basic cases whose known identifiability is then leveraged in Section 5 to determine generic identifiability results for all the binary DAGs we catalog. Although in Section 5 we do not push our arguments toward exhaustive consideration of non-binary models, in many cases it would be straightforward to do so. For instance, if all variables associated to a DAG have the same size state space, little in our arguments needs to be modified.
Finally, in Section 6 we construct an explicit distribution for the generically 4-to-one parameterization of model 4-3e in which there are two different causal effects consistent with the observable distribution. This is possible because the parameter sets that give rise to this distribution differ in ways beyond label swapping. Determining causal effects coherently in this context is thus impossible. This example provokes a general caution: The parameterization of a discrete DAG model can be k-to-one with k larger than one would expect from label swapping, and when this occurs quantifying causal effects can be highly problematic.
We view the main contribution of this paper not as the determination of parameter identifiability for the specific binary models we consider, but rather as the development of the techniques by which we establish our results. Ultimately, one would like fairly simple graphical rules to determine which parameters are identifiable, and perhaps even to yield formulas for them in terms of the joint distribution. Establishing similar results for more general graphical models, not specified by a DAG, is also desirable. Some work in this context already exists (see, e.g., Stanghellini and Vantaggi).
2 Discrete DAG models and parameter identifiability
The models we consider are specified in part by DAGs $G = (V, E)$ in which nodes $v \in V$ represent random variables $X_v$, and directed edges in E imply certain independence statements for the joint distribution of all variables $\{X_v\}_{v \in V}$. A bipartition $V = O \sqcup H$ is given, in which variables associated to nodes in O or H are observable or hidden, respectively. Finally, we fix finite state spaces, of size $n_v$ for each variable $X_v$.
A DAG G entails a collection of conditional independence statements on the variables associated to its nodes, via d-separation, or an equivalent separation criterion in terms of the moral graph on ancestral sets. A joint distribution P of variables satisfies these statements precisely when it has a factorization according to G as
$$P = \prod_{v \in V} P(X_v \mid X_{\mathrm{pa}(v)}),$$
with $\mathrm{pa}(v)$ denoting the set of parents of v in G. We refer to the conditional probabilities $\theta = \big(P(X_v \mid X_{\mathrm{pa}(v)})\big)_{v \in V}$ as the parameters of the DAG model, and denote the space of all possible choices of parameters by $\Theta$. The parameterization map for the joint distribution of all variables, both observable and hidden, is denoted as
$$\phi : \Theta \to \Delta^{(\prod_{v \in V} n_v) - 1},$$
where $\Delta^k$ is the k-dimensional probability simplex of stochastic vectors in $\mathbb{R}^{k+1}$. Thus $\phi(\Theta)$ is precisely the collection of all probability distributions satisfying the conditional independence statements associated to G (and possibly additional ones).
Since the probability distribution for the model with hidden variables is obtained from that of the fully observable model, its parameterization map is
$$\phi^+ = \sigma \circ \phi : \Theta \to \Delta^{(\prod_{v \in O} n_v) - 1},$$
where $\sigma$ denotes the appropriate map marginalizing over hidden variables. The set $\phi^+(\Theta)$ is thus the collection of all observable distributions that arise from the hidden variable model. This collection depends not only on the DAG and designated state spaces of observable variables, but also on the state spaces of hidden variables, even though the sizes of hidden state spaces are not readily apparent from an observable joint distribution.
With all variables having finite state spaces, the parameter space $\Theta$ can be identified with the closure of an open subset of $\mathbb{R}^L$, for some L. We refer to L as the dimension of the parameter space. The dimension of $\Theta$ is easily seen to be
$$\dim(\Theta) = \sum_{v \in V} (n_v - 1) \prod_{w \in \mathrm{pa}(v)} n_w. \tag{1}$$
In the case of all binary variables, this simplifies to
$$\dim(\Theta) = \sum_{k} m_k 2^k, \tag{2}$$
where $m_k$ is the number of nodes in G with in-degree k.
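As an illustration of the dimension count, the following sketch assumes the usual tally in which a node with n states contributes n − 1 free parameters for each joint state of its parents. The function names and the dictionary encoding of a DAG (node mapped to its list of parents) are hypothetical conveniences.

```python
from math import prod

def parameter_dimension(parents, n_states):
    """Sum over nodes v of (n_v - 1) times the product of the parents' state-space sizes."""
    return sum(
        (n_states[v] - 1) * prod(n_states[w] for w in parents[v])
        for v in parents
    )

def binary_parameter_dimension(parents):
    """All-binary case: a node with in-degree k contributes 2**k parameters."""
    return sum(2 ** len(pa) for pa in parents.values())

# Hidden node 0 parental to observed nodes 1, 2, 3, with no edges among them.
model_3_0 = {0: [], 1: [0], 2: [0], 3: [0]}
dim = parameter_dimension(model_3_0, {v: 2 for v in model_3_0})
```

For this DAG the count is 1 + 2 + 2 + 2 = 7, which equals the dimension $2^3 - 1$ of the space of observable distributions on three binary variables.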
If a statement is said to hold for generic parameters or generically then we mean it holds for all parameters in a set of the form $\Theta \setminus E$, where the exceptional set E is a proper algebraic subset of $\Theta$. (Recall an algebraic subset is the zero set of a finite collection of multivariate polynomials.) As proper algebraic subsets of $\mathbb{R}^L$ are always of Lebesgue measure zero, a statement that holds generically can fail only on a set of measure zero.
As an example of this language, for any DAG model with all variables finite and observable, generic parameters lead to a distribution faithful to the DAG, in the sense that those conditional independence statements implied by d-separation rules will hold, and no others. Equivalently, a generic distribution from such a model is faithful to the DAG.
There are several notions of identifiability of parameters of a model; we refer the reader to Allman et al. The strictest notion, that the parameterization map is one-to-one, is easily seen to hold when all DAG variables are observable with mild additional assumptions (e.g., positivity of all parameters). If a model has hidden variables, then this is too strict a notion of identifiability, as the well-known issue of label swapping arises: One can permute the names of the states of hidden variables, making appropriate changes to associated parameters, without changing the joint distribution of the observable variables. For a model with one r-state hidden variable, label swapping implies that for any generic $\theta$ there are at least $r! - 1$ other points $\theta'$ that give the same observable distribution as $\theta$. But since these are isolated parameter points that differ only by state labeling, this issue does not generally limit the usefulness of a model, provided that we remain aware of it when interpreting parameters.
The strongest useful notion of identifiability for models with hidden variables is that for generic $\theta$, if $\theta'$ gives the same observable distribution as $\theta$, then $\theta$ and $\theta'$ differ only up to label swapping for hidden variables. This notion is our primary focus in this paper, which we refer to as generic identifiability up to label swapping. In particular, for models with a single binary hidden variable it is equivalent to the parameterization map being generically 2-to-one.
3 Markov equivalence and parameter identifiability
Two DAGs on the same sets of observable and hidden nodes are said to be Markov equivalent if they entail the same conditional independence statements through d-separation. (Note this notion does not distinguish between observable and hidden variables; all are treated as observable.) Thus for fixed choices of state spaces of the variables, two different but Markov equivalent DAGs, $G_1 \neq G_2$, have different parameter spaces and different parameterization maps, yet they give rise to the same set of joint distributions of all variables.
For studying identifiability questions, it is helpful to first explore the relationship between parameterizations for Markov equivalent graphs. A simple example, with no hidden variables, is instructive. Consider the DAGs on two observable nodes which are equivalent, since neither entails any independence statements. Now the particular probability distribution with requires parameters on the first DAG to be while parameters on the second DAG can be for any . Thus this particular distribution has identifiable parameters for only one of these DAGs. (Here and in the rest of the paper conditional probability tables specifying parameters have rows corresponding to states of conditioning, i.e., parent, variables.)
Of course, this probability distribution was a special one, and is atypical for these models, which are easily seen to have generically identifiable parameters (as do all DAG models without hidden variables). Nonetheless, it illustrates the need for “generic” language and careful arguments for results such as the following.
Theorem 1. With all variables having fixed finite state spaces, consider two Markov equivalent DAGs, $G_1$ and $G_2$, possibly with hidden nodes. If the parameterization map of the observable distribution for $G_1$ is generically k-to-one for some $k \in \mathbb{N}$, then that for $G_2$ is also generically k-to-one.
In particular if such a model has parameters that are generically identifiable up to label swapping, so does every Markov equivalent model.
This theorem is a consequence of the following:
Lemma 2. With all variables having finite state spaces, consider two Markov equivalent DAGs, $G_1$ and $G_2$, with parameter spaces $\Theta_1$, $\Theta_2$ and parameterization maps $\phi_1$, $\phi_2$, for the joint distribution of all variables. Then there are generic subsets $S_i \subseteq \Theta_i$ and a rational homeomorphism $\psi : S_1 \to S_2$, with rational inverse, such that for all $\theta \in S_1$, $\phi_1(\theta) = \phi_2(\psi(\theta))$.
Proof. Recall that an edge $i \to j$ of a DAG is said to be covered if $\mathrm{pa}(j) = \mathrm{pa}(i) \cup \{i\}$. By Chickering, Markov equivalent DAGs differ by applying a sequence of reversals of covered edges.
We thus first assume the differ by the reversal of a single covered edge of . Let , so , . Now any is a collection of conditional probabilities , including . From these, successively define Using these last two conditional probabilities, along with those specified by for all , define parameters . Now is defined and continuous on the set where and are strictly positive.
One easily checks that the same construction applied to the edge in gives the inverse map.
If differ by a sequence of edge reversals, one defines the as subsets where all parameters related to the reversed edges are strictly positive, and let be the composition of the maps for the individual reversals.□
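The reparameterization under a covered-edge reversal can be sketched in the simplest case, a hypothetical two-node DAG 1 → 2 with binary variables, where the edge is covered because node 1 has no parents. The numerical values are illustrative only.

```python
import numpy as np

# Parameters on the DAG 1 -> 2.
p1 = np.array([0.4, 0.6])                     # P(X1)
p2_given_1 = np.array([[0.9, 0.1],            # P(X2 | X1), rows indexed by X1
                       [0.3, 0.7]])

joint = p1[:, None] * p2_given_1              # P(X1, X2), indexed [x1, x2]

# Parameters on the reversed DAG 1 <- 2, obtained by rational operations.
p2 = joint.sum(axis=0)                        # P(X2); must be strictly positive
p1_given_2 = (joint / p2).T                   # P(X1 | X2), rows indexed by X2

joint_reversed = (p2[:, None] * p1_given_2).T  # back to [x1, x2] indexing
```

Both parameter choices produce the same joint distribution, and the division by `p2` shows why the map is only defined where that marginal is strictly positive.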
Proof of Theorem 1. Suppose has a generic subset S on which is k-to-one and the map of Lemma 2 is invertible. Then will be a generic subset of , and the identity from Lemma 2 shows that is k-to-one on . Thus we need only establish the existence of such an S.
Let , be the generic sets of Lemma 2. Let be a generic set on which is k-to-one. We may thus assume are all proper algebraic subsets. Since is generically k-to-one with finite k, the set must be contained in a proper algebraic subset of , say . We may therefore take .□
4 Two special models
In this section, we explain how one may explicitly solve for parameter values from a joint distribution of the observable variables for models specified by two specific DAGs with hidden nodes.
Parameter identifiability of the model with DAG shown in Figure 2 is an instance of a more general theorem of Kruskal (see also [18, 19]). However, known proofs of the full Kruskal theorem do not yield an explicit procedure for recovering parameters. Nonetheless, a proof of a restricted theorem (the essential idea of which is not original to this work, and has been rediscovered several times) does. We include this argument for Theorem 3, since it is still not widely known and provides motivation for the approach to the proof of Theorem 4 for models associated to a second DAG, shown in Figure 3. Our analysis of the second model appears to be entirely novel. For both models, we characterize the exceptional parameters for which these procedures fail, giving a precise characterization of a set containing all non-identifiable parameters.
4.1 Explicit cases of Kruskal’s theorem
Parameters for the model are:
$p_0 = P(X_0)$, a stochastic vector giving the distribution for the $n_0$-state hidden variable $X_0$.
For each of $i = 1, 2, 3$, a stochastic $n_0 \times n_i$ matrix $M_i = P(X_i \mid X_0)$.
The Kruskal row rank of a matrix M is the maximal number r such that every set of r rows of M is linearly independent.
Note that the Kruskal row rank of a matrix may be less than its rank, which is the maximal r such that some set of r rows is independent.
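The distinction between rank and Kruskal row rank can be checked by brute force on small matrices. The following is a minimal sketch, with a hypothetical function name and numerical tolerance; it enumerates all row subsets and so is only practical for small matrices.

```python
from itertools import combinations

import numpy as np

def kruskal_row_rank(M, tol=1e-10):
    """Largest r such that every set of r rows of M is linearly independent."""
    M = np.asarray(M, dtype=float)
    n_rows = M.shape[0]
    best = 0
    for r in range(1, n_rows + 1):
        subsets = combinations(range(n_rows), r)
        if all(np.linalg.matrix_rank(M[list(rows)], tol=tol) == r for rows in subsets):
            best = r
        else:
            break  # if some r-subset is dependent, larger subsets containing it are too
    return best

# Two identical rows: the rank is 2, but the Kruskal row rank is only 1.
M = np.array([[0.5, 0.5],
              [0.5, 0.5],
              [0.1, 0.9]])
```

Here some pair of rows is independent, giving rank 2, while the pair of identical rows is dependent, so only every single row is independent and the Kruskal row rank is 1.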
Our special case of Kruskal’s theorem is the following:
Theorem 3. Consider the model represented by the DAG of model 3-0, where variables $X_i$ have $n_i \ge 2$ states, with $n_1, n_2 \ge n_0$, and with parameters $p_0$ and $M_i$, $i = 1, 2, 3$, as above. Then generic parameters of the model are identifiable up to label swapping, and an algebraic procedure for determination of the parameters from the joint probability distribution can be given.
More specifically, if $p_0$ has no zero entries, $M_1, M_2$ have rank $n_0$, and $M_3$ has Kruskal row rank at least 2, then the parameters can be found through determination of the roots of certain $n_0$th degree univariate polynomials and solving linear equations. The coefficients of these polynomials and linear systems are rational expressions in the joint distribution.
Proof. For simplicity, consider first the case $n_0 = n_1 = n_2 = n$. Let $P = P(X_1, X_2, X_3)$ be a probability distribution of observable variables arising from the model, viewed as an $n \times n \times n_3$ array.
Marginalizing P over $X_3$ (i.e., summing over the 3rd index), we obtain a matrix which, in terms of the unknown parameters, is the matrix product
$$P_{\cdot\cdot+} = M_1^T \operatorname{diag}(p_0) M_2.$$
Similarly, the slices $P_{\cdot\cdot i}$ of P with third index fixed at i (i.e., the conditional distributions given $X_3 = i$, up to normalization) are
$$P_{\cdot\cdot i} = M_1^T \operatorname{diag}(p_0) \operatorname{diag}(M_3(\cdot, i)) M_2,$$
where $M_3(\cdot, i)$ is the ith column of $M_3$.
Assuming $M_1, M_2$ are non-singular, and $p_0$ has no zero entries, $P_{\cdot\cdot+}$ is invertible and we see
$$P_{\cdot\cdot+}^{-1} P_{\cdot\cdot i} = M_2^{-1} \operatorname{diag}(M_3(\cdot, i)) M_2. \tag{3}$$
Thus the entries of the columns of $M_3$ can be determined (without order) by finding the eigenvalues of the $P_{\cdot\cdot+}^{-1} P_{\cdot\cdot i}$, and the rows of $M_2$ can be found by computing the corresponding left eigenvectors, normalizing so the entries add to 1. (If $M_3$ has repeated entries in the ith column, the eigenvectors may not be uniquely determined. However, since the matrices for various i commute, and $M_3$ has Kruskal row rank 2 or more, the set of these matrices do uniquely determine a collection of simultaneous 1-dimensional eigenspaces. We leave the details to the reader.) This determines $M_2$ and $M_3$, up to the simultaneous ordering of their rows.
A similar calculation with $P_{\cdot\cdot i} P_{\cdot\cdot+}^{-1}$ determines $M_1$, and $M_3$, up to the row order. Since the rows of $M_3$ are distinct (because it has Kruskal rank 2), fixing some ordering of them fixes a consistent order of the rows of all of the $M_i$.
Finally, one determines $p_0$ from $M_1^{-T} P_{\cdot\cdot+} M_2^{-1} = \operatorname{diag}(p_0)$.
The hypotheses on the rank and Kruskal rank of the parameter matrices can be expressed through the non-vanishing of minors, so all assumptions on parameters used in this procedure can be phrased as the non-vanishing of certain polynomials. As a result, the exceptional set where it cannot be performed is contained in a proper algebraic subset of the parameter set.
Since the computations to perform the procedure involve computing eigenvalues and eigenvectors of matrices whose entries are rational in the joint distribution, the second paragraph of the theorem is justified.
In the more general case of $n_1, n_2 \ge n_0$, one can apply the argument above to $n_0 \times n_0 \times n_3$ subarrays of P corresponding to $n_0 \times n_0$ submatrices of $M_1$ and $M_2$ that are invertible. All such subarrays will lead to the same eigenvalues of the matrices analogous to those of eq. (3), so eigenvectors can be matched up to reconstruct entire rows of $M_1$ and $M_2$. The vector $p_0$ is determined by a formula similar to that above, using a subarray of the marginalization $P_{\cdot\cdot+}$.□
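The procedure in the proof can be sketched numerically for the square case in which the hidden variable and the first two observed variables have the same number of states. The function name `recover_parameters`, the array layout `P[x1, x2, x3]`, and the test parameters are our own assumptions. The sketch also departs from the proof in one respect: after the first eigen-decomposition it recovers the remaining parameters by linear solves rather than by a second eigen-decomposition, which keeps the row orderings consistent automatically. It assumes the eigenvalues (the entries of the first column of the third matrix) are distinct.

```python
import numpy as np

def recover_parameters(P):
    """Recover (p0, M1, M2, M3), up to a common reordering of latent states."""
    P_plus = P.sum(axis=2)                        # equals M1^T diag(p0) M2
    A = np.linalg.solve(P_plus, P[:, :, 0])       # equals M2^{-1} diag(M3[:, 0]) M2
    _, left_vecs = np.linalg.eig(A.T)             # columns are left eigenvectors of A
    left_vecs = np.real(left_vecs)
    M2 = (left_vecs / left_vecs.sum(axis=0)).T    # normalize so rows of M2 sum to 1
    M2_inv = np.linalg.inv(M2)
    B = P_plus @ M2_inv                           # column h equals p0[h] * (row h of M1)
    p0 = B.sum(axis=0)
    M1 = (B / p0).T
    # diag(M3[:, i]) = M2 (P_plus^{-1} P[:, :, i]) M2^{-1}
    M3 = np.stack(
        [np.diag(M2 @ np.linalg.solve(P_plus, P[:, :, i]) @ M2_inv)
         for i in range(P.shape[2])],
        axis=1,
    )
    return p0, M1, M2, M3

# Hypothetical binary parameters used to generate an observable distribution.
true = (np.array([0.3, 0.7]),
        np.array([[0.9, 0.1], [0.2, 0.8]]),
        np.array([[0.6, 0.4], [0.1, 0.9]]),
        np.array([[0.7, 0.3], [0.25, 0.75]]))
P = np.einsum("h,ha,hb,hc->abc", *true)
p0_hat, M1_hat, M2_hat, M3_hat = recover_parameters(P)
P_hat = np.einsum("h,ha,hb,hc->abc", p0_hat, M1_hat, M2_hat, M3_hat)
```

The recovered parameters may list the latent states in either order, so a comparison with the generating parameters should be made up to that reordering; the reconstructed array `P_hat` should match `P` regardless.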
4.2 Another special model
Parameters for the model are:
, a stochastic vector giving the distribution for the -state hidden variable .
Stochastic matrices of size ; of size for ; and of size .
Theorem 4. Consider the model represented by the DAG of model 4-3b, where variables have states, with . Then generic parameters of the model are identifiable up to label swapping, and an algebraic procedure for determination of the parameters from the joint probability distribution can be given.
More specifically, suppose have no zero entries, the and matrices have rank , and there exists some with such that for all , the entries of satisfy inequality (7). Then from the resulting joint distribution the parameters can be found through determination of the roots of certain nth degree univariate polynomials and solving linear equations. The coefficients of these polynomials and linear systems are rational expressions in the entries of the joint distribution.
Proof. Consider first the case . With viewed as an array, we work with slices of P,
Note that these slices can be expressed as (4)where is the diagonal matrix given in terms of parameters by and and are as in the statement of the theorem.
Equation (4) implies for and that (5)and the hypotheses on the parameters imply the needed invertibility. But this shows the rows of are left eigenvectors of this product.
In fact, if , , then the eigenvalues of this product are distinct, for generic parameters. To see this, note the eigenvalues are (6)for , so distinctness of eigenvalues is equivalent to (7)for all . Thus a generic choice of leads to distinct eigenvalues.
With distinct eigenvalues, the eigenvectors are determined up to scaling. But since each row of must sum to 1, the rows of are therefore determined by P.
The ordering of the rows of the has not yet been determined. To do this, first fix an arbitrary ordering of the rows of , say, which imposes an arbitrary labeling of the states for . Then using eq. (4), from we can determine and with their rows ordered consistently with . For , using eq. (4) again, from we can determine and with a consistent row order. Thus and are determined.
To determine the remaining parameters, again appealing to eq. (4), we can recover the distribution using With no longer hidden, it is straightforward to determine the remaining parameters.
The general case of is handled by considering subarrays, just as in the proof of the preceding theorem.□
Remark. In the case of all binary variables, the expression in eq. (6) is just the conditional odds ratio for the observed variables , conditioned on . Inequality (7) can thus be interpreted as saying there is a non-zero 3-way interaction between the variables , which is the generic situation.
5 Small binary DAG models
All variables are assumed binary throughout this section. In Table 3 of the Appendix, we list each of the binary DAG models with one latent node which is parental to up to 4 observable nodes. We number the graphs as A-Bx, where A is the number of observed variables, B is the number of directed edges between the observed variables, and x is a letter appended to distinguish between several graphs with these same features. As the table presents only the case that all variables are binary, the observable distribution lies in a space of dimension $2^A - 1$.
The primary information in this table is in the column for k, indicating the parameterization map is generically k-to-one. As discussed in the introduction, the existence of such a k is not obvious, and does not follow from the behavior of general polynomial maps in real variables.
The models 4-3e and 4-3f, for which the parameterization maps are generically 4-to-one, are particularly interesting cases, as for these models there are non-identifiability issues that arise neither from overparameterization (in the sense of a parameter space of larger dimension than the distribution space) nor from label swapping. While these models are ones that can plausibly be imagined as being used for data analysis, they have a rather surprising failure of identifiability, which is explored more precisely in Section 6.
We now turn to establishing the results in Table 3.
For many of the models A-Bx the dimension of the parameter space computed by eq. (2) exceeds the dimension of the probability simplex in which the joint distribution of observed variables lies. In these cases, the following proposition applies to show the parameterization is generically infinite-to-one. We omit its proof for brevity.
Proposition 5. Let $f : S \to \mathbb{R}^m$ be any map defined by real polynomials, where S is an open subset of $\mathbb{R}^n$ and $n > m$. Then f is generically infinite-to-one.
This proposition applies to all models in Table 3 with an infinite-to-one parameterization, with the single exception of 4-2a. For that model, amalgamating and together, and likewise and , we obtain a model with two 4-state observed variables that are conditionally independent given a binary hidden variable . One can show that the probability distributions for this model form an 11-dimensional object, and then a variant of Proposition 5 applies.
For models 3-0 and 4-3b (and the Markov equivalent 4-3a), specializing Theorems 3 and 4 of the previous section to binary variables yields the claims in the table.
For the remaining models, the strategy is to first marginalize or condition on an observable variable to reduce the model to one already understood. One then attempts to “lift” results on the reduced model back to the original one.
We consider in detail only some of the models, indicating how the arguments we give can be adapted to others with minor modifications.
5.1 Model 4-1
Referring to Figure 4, since node 2 is a sink, marginalizing over gives an instance of model 3-0 with the same parameters, after discarding . Thus generically all parameters except are determined, up to label swapping.
But note that if the (unknown) joint distribution of is written as an matrix U, with and , then the matrix product has entries which form the observable joint distribution. Since generically is invertible, from the observable distribution and each of the already identified label swapping variants of we can find U. From U we marginalize to obtain and . Under the generic condition that are strictly positive, is as well, and so we can compute .
Models 4-0 and 4-2d are handled similarly, by marginalizing over the sink nodes 4 and 3, respectively.
An alternative argument for models 4-1 and 4-0 proceeds by amalgamating the observed variables, , into a single 4-state variable, and applying Theorem 3 directly to that model. We leave the details to the reader.
5.2 Models 4-2b,c
Up to renaming of nodes, the DAGs for models 4-2b and 4-2c are Markov equivalent. Thus by Theorem 1, it is enough to consider model 4-2c, as shown in Figure 5.
We condition on , to obtain two related models. Letting denote the conditioned variable at node i, the resulting observable distributions are With a hidden variable and observed variables , these distributions arise from a DAG like that of model 3-0. With parameters for the original model , matrices for , and matrices , and the standard basis vector, parameters for the conditioned models are
the stochastic matrix , and
for , the stochastic matrix , whose rows are the and rows of .
In particular, can be identified up to reordering its rows, and is invertible. But let U denote the (unknown) matrix with . Then has as its entries the observable distribution . Thus can be determined from P. Since U is the distribution of the induced model on with no hidden variables, it is then straightforward to identify all remaining parameters of the original model.
Thus all parameters are identifiable generically, up to label swapping. More specifically, they are identifiable provided that for either or 1 the three matrices have rank 2, and and the jth column of have non-zero entries.
5.3 Models 4-3e,f
Due to Markov equivalence, we need consider only 4-3e, as shown in Figure 6.
By conditioning on , , we obtain two models of the form of 3-0. One checks that the induced parameters for these conditioned models are generic. Indeed, in terms of the original parameters they are , , which are generically non-singular since they are simply submatrices of , and at the hidden node which generically has non-zero entries.
Thus for generic parameters on the original model, up to label swapping we can determine and , . However, we do not have an ordering of the states of that is consistent for the recovered parameters for the two models. Generically we have four choices of parameters for the two models taken together. Each of these four choices leads to a possible joint distribution ; viewing this joint distribution as a matrix, the four versions differ only by independently interchanging the two entries in each column, thus keeping the same marginalization . Generically, each of the four distributions for yields different parameters and . The matrices , , are then obtained using the same rows as in , though the ordering of the rows is dependent on the choice made previously.
Having obtained four possible parameter choices, it is straightforward to confirm that they all lead to the same joint distribution. Thus the parameterization map is generically 4-to-one.
6 Identification of causal effects
Here we examine the impact of k-to-one model parameterizations on the causal effect of one observable variable on another, when a latent variable acts as a confounder. For simplicity we assume that the latent variable is binary, though our discussion can be extended to a more general setting.
According to Theorem 3.2.2 (Adjusting for Direct Causes), p. 73 of Pearl , the causal effect of on can be obtained from model parameters by an appropriate sum over the states of the other direct causes of . This sum is invariant under a relabeling of the states of those direct causes, and therefore the causal effect is not affected by label swapping if one of these is latent. As an instance, the causal effect of on in model 4-2b is (8)Thus when label swapping is the only source of parameter non-identifiability, causal effects are uniquely determined by the observable distribution.
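The invariance of such an adjustment sum under relabeling can be checked numerically. The sketch below does not use model 4-2b itself; it assumes a hypothetical three-variable setting in which a binary latent variable H is a parent of both a cause X and an effect Y, with X also a parent of Y, so that the causal effect is $\sum_h P(h) P(y \mid x, h)$. The function name and values are illustrative.

```python
import numpy as np

def causal_effect(p_h, p_y_given_hx):
    """Return E[x, y] = sum_h p_h[h] * p_y_given_hx[h, x, y]."""
    return np.einsum("h,hxy->xy", p_h, p_y_given_hx)

p_h = np.array([0.3, 0.7])                           # P(H)
p_y_given_hx = np.array([[[0.9, 0.1], [0.4, 0.6]],   # P(Y | H, X), indexed [h, x, y]
                         [[0.5, 0.5], [0.2, 0.8]]])

effect = causal_effect(p_h, p_y_given_hx)
# Swap the labels of the latent states in every table that involves H.
effect_swapped = causal_effect(p_h[::-1], p_y_given_hx[::-1])
```

Because the latent state is summed out, `effect` and `effect_swapped` coincide, which is the sense in which label swapping alone does not affect the causal effect.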
Things are more complex when parameter non-identifiability arises in other ways. For example, model 4-3e has one binary latent variable but a 4-to-one parameterization. In Table 1 two choices of parameters, (1), (2), are given for this model, as well as the common observable distribution they produce. These parameters and their two variants from label swapping at node 0 give the four elements of the fiber of the observable distribution.
For any parameters of model 4-3e, the causal effect of on is again as given in eq. (8). However, due to the 4-to-one parameterization, there can be two different causal effects that are consistent with an observable distribution. As such, there may be distributions such that one causal effect leads to the conclusion that there is a positive effect of on (i.e., setting gives a higher probability of than setting ), while the other causal effect leads to the conclusion that there is a negative effect of on . Indeed, the observable distribution in Table 1 is such an instance. In Table 2 the two causal effects corresponding to that given distribution are shown. Here parameters (2) lead to a positive effect of on , while parameters (1) lead to a negative effect.
More generally, for generic observable distributions of this model, choices of parameters that differ by more than label swapping can give different causal effects. Whether the resulting effects have the same sign or opposite signs, however, varies from one distribution to another.
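The notion of a positive or negative effect used above can be written as a sign condition. The symbols X (cause) and Y (effect), both binary, are placeholders chosen here for illustration, and the comparison is assumed to be between setting X to state 1 and setting it to state 0.

```latex
\Delta \;=\; P\bigl(Y = 1 \mid \mathrm{do}(X = 1)\bigr) \;-\; P\bigl(Y = 1 \mid \mathrm{do}(X = 0)\bigr),
\qquad
\text{positive effect} \iff \Delta > 0,
\qquad
\text{negative effect} \iff \Delta < 0 .
```

Under this reading, two parameter choices in the same fiber give conflicting conclusions exactly when their values of Δ have opposite signs.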
Paraphrasing Pearl, the problem of identifying causal effects in non-parametric models has been “placed to rest” by the proof of completeness of the do-calculus and related graphical criteria. In this paper we show that the introduction of modest (parametric) assumptions on the size of the state spaces of variables allows for identifiability of parameters that otherwise would be non-identifiable. Causal effects can be computed from identified parameters, if desired, but our techniques allow for the recovery of all parameters. In the process of proving parameter identifiability for several small networks, we use techniques inspired by a theorem of Kruskal, and other novel approaches. This framework can be applied to other models as well.
We have at least three reasons to extend the work described in this paper. The first is to develop new techniques and to prove new theoretical results for parameter identifiability; this provides the foundation of our work. The second is to reach the stage at which one can easily determine parameter identifiability for DAG models with hidden variables that are used in statistical modeling; this motivates our work. A third and related focus of future work is to address the scalability of our approach and to automate it. As noted above, many of our arguments do not depend on variables being binary. Also, a strategy that we used successfully to handle larger models is to first marginalize or condition on an observable variable to reduce the model to one already understood, and then to “lift” results on the reduced model back to the original one. We are working toward turning this strategy into an algorithm.
The authors thank the American Institute of Mathematics, where this work was begun during a workshop on Parameter Identification in Graphical Models, and continued through AIM’s SQuaRE program.
Table 3 shows all DAGs with four or fewer observable nodes and one hidden node that is a parent of all observable ones (see Section 5 for model naming convention). Markov equivalent graphs appear on the same line. The dimension of the parameter space is , and is the dimension of the probability simplex in which the joint distribution lies. The parameterization map is generically k-to-one.
Neapolitan RE. Probabilistic reasoning in expert systems: theory and algorithms. New York, NY: John Wiley and Sons, 1990.
Neapolitan RE. Learning Bayesian networks. Upper Saddle River, NJ: Pearson Prentice Hall, 2004.
Pearl J. Causality: models, reasoning, and inference, 2nd ed. Cambridge: Cambridge University Press, 2009.
Huang Y, Valtorta M. Pearl’s calculus of intervention is complete. In Proceedings of the twenty-second conference on uncertainty in artificial intelligence (UAI-06), 2006:217–24.
Pearl J. The do-calculus revisited. In Proceedings of the twenty-eighth conference on uncertainty in artificial intelligence (UAI-12), 2012:4–11.
Shpitser I, Pearl J. Complete identification methods for the causal hierarchy. J Mach Learn Res 2008;9:1941–79.
Tian J, Pearl J. A general identification condition for causal effects. In Proceedings of the eighteenth national conference on artificial intelligence (AAAI-02), 2002:567–73.
Mond D, Smith J, van Straten D. Stochastic factorizations, sandwiched simplices and the topology of the space of explanations. R Soc Lond Proc Ser Math Phys Eng Sci 2003;459:2821–45.
Kubjas K, Robeva E, Sturmfels B. Fixed points of the EM algorithm and nonnegative rank boundaries. In revision, 2013.
Stanghellini E, Vantaggi B. Identification of discrete concentration graph models with one hidden binary variable. Bernoulli 2013;19:1920–37. Available at: http://dx.doi.org/10.3150/12-BEJ435.
Lauritzen SL. Graphical models. Oxford Statistical Science Series, vol. 17. New York: The Clarendon Press Oxford University Press, Oxford Science Publications, 1996.
Meek C. Strong completeness and faithfulness in Bayesian networks. In Proceedings of the eleventh annual conference on uncertainty in artificial intelligence (UAI-95), San Francisco, CA: Morgan Kaufmann, 1995:411–18.
Chickering DM. A transformational characterization of equivalent Bayesian network structures. In Proceedings of the eleventh annual conference on uncertainty in artificial intelligence (UAI-95), San Francisco, CA: Morgan Kaufmann, 1995:87–98.
About the article
Published Online: 2014-12-03
Published in Print: 2015-09-01
Research funding: American Institute of Mathematics Structured Quartet Research Ensemble grant.