Parameter Identifiability of Discrete Bayesian Networks with Hidden Variables

Elizabeth S. Allman
  • Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK, USA
/ John A. Rhodes
  • Corresponding author
  • Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK, USA
/ Elena Stanghellini
  • Dipartimento di Economia Finanza e Statistica, Università di Perugia, Perugia, Italy
/ Marco Valtorta
  • Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
Published Online: 2014-12-03 | DOI: https://doi.org/10.1515/jci-2014-0021

Abstract

Identifiability of parameters is an essential property for a statistical model to be useful in most settings. However, establishing parameter identifiability for Bayesian networks with hidden variables remains challenging. In the context of finite state spaces, we give algebraic arguments establishing identifiability of some special models on small directed acyclic graphs (DAGs). We also establish that, for fixed state spaces, generic identifiability of parameters depends only on the Markov equivalence class of the DAG. To illustrate the use of these results, we investigate identifiability for all binary Bayesian networks with up to five variables, one of which is hidden and parental to all observable ones. Surprisingly, some of these models have parameterizations that are generically 4-to-one, and not 2-to-one as label swapping of the hidden states would suggest. This leads to an interesting conflict in interpreting causal effects.

Keywords: parameter identifiability; discrete Bayesian network; hidden variables

1 Introduction

A directed acyclic graph (DAG) can represent the factorization of a joint distribution of a set of random variables. To be more precise, a Bayesian network is a pair (G,P), where G is a DAG and P is a joint probability distribution of variables in one-to-one correspondence with the nodes of G, with the property that each variable is conditionally independent of its non-descendants given its parents. It follows from this definition that the joint probability P factors according to G, as the product of the conditional probabilities of each node given its parents. Thus a discrete Bayesian network is fully specified by a DAG and a set of conditional probability tables, one for each node given its parents [1, 2].

A causal Bayesian network is a Bayesian network enhanced with a causal interpretation. Work initiated by Pearl [3, 4] investigated the identification of causal effects in causal Bayesian networks when some variables are assumed observable and others are hidden. In a non-parametric setting, with no assumptions about the state space of variables, there is a complete algorithm for determining which causal effects between variables are identifiable [5–8].

Figure 1

The DAG of a Bayesian network studied by Kuroki and Pearl [9], denoted 4-2b in the Appendix

As powerful as this theory is, however, it does not address identifiability when assumptions are made on the nature of the variables. Indeed, by specializing to finite state spaces, causal effects that were non-identifiable according to the theory above may become identifiable. One particular example, with DAG shown in Figure 1, has been studied by Kuroki and Pearl [9]. If the state space of hidden variable 0 is finite, and observable variables 1 and 4 have state spaces of larger sizes, then the causal effect of variable 2 on variable 3 can be determined, for generic parameter choices.

In this paper we study in detail identification properties of certain small Bayesian networks, as a first step toward developing a systematic understanding of identification in the presence of finite hidden variables. While this includes an analysis of the model with the DAG above, our motivation is different from that of Kuroki and Pearl [9], and results were obtained independently. We make a thorough study of networks with up to five binary variables, one of which is unobservable and parental to all observable ones, as shown in Table 3 of the Appendix. These investigations lead us to develop some basic tools and arguments that can be applied more generally to questions of parameter identifiability.

In addition, for each such binary model in Table 3, we determine a value k ∈ ℕ ∪ {∞} such that the marginalization from the full joint distribution to that over the observable variables is generically k-to-one. Although we restrict this exhaustive study to binary models for simplicity, straightforward modifications to our arguments would extend them to larger state spaces. A typical requirement for such an extended identifiability result is that the state spaces of observable variables be sufficiently large, relative to that of the hidden variable, as in the result of Kuroki and Pearl [9] described earlier. Interestingly, that result restricted to finite state spaces follows easily from our framework, and can be obtained for continuous state spaces of observable variables using arguments of Allman et al. [10].

We use the term “DAG model” for the collection of all Bayesian networks with the same DAG and specification of state spaces for the variables. With the conditional probability tables of nodes given their parents forming the parameters of the model, we thus allow these tables to range over all valid tables of a fixed size to give the parameter space of such a model.

That some of the DAG models we consider have non-identifiable parameters is a consequence of the well-known non-uniqueness (in most circumstances) of non-negative rank decompositions of matrices. An example is the infinite-to-one parameterization of model 4-2a in Table 3. For greater detail on this issue see the work of Mond et al. [11] and Kubjas et al. [12].

In dealing with discrete unobserved variables, another well-understood identifiability issue is sometimes called label swapping. If the latent variable has r states, there are r! parameter choices, obtained by permuting the state labels of the latent variable, that generate the same observable distribution. Thus the parameterization map is generically at least r!-to-one. For models with a single binary latent variable, it is thus commonly expected that parameterizations are either infinite-to-one due to a parameter space of too high a dimension, or 2-to-one due to label swapping. Our work, however, finds surprisingly simple examples such that the mapping is 4-to-one, so that subtler non-identifiability issues arise.

Our analysis arises from an algebraic viewpoint of the identifiability problem. With finite state spaces the parameterization maps for DAG models with hidden variables are polynomial. Given a distribution arising from the model, the parameters are identifiable precisely when a certain system of multivariate polynomial equations has exactly one solution (up to label swapping of states for hidden variables). Though in principle computational algebra software can be used to investigate parameter identifiability, the necessary calculations are usually intractable for even moderate size DAGs and/or state spaces. In addition, one runs into issues of complex versus real roots, and the difficulty of determining when real roots lie within stochastic bounds. While our arguments are fundamentally algebraic, they do not depend on any machine computations.

If a single polynomial p(x) in one variable is given, of degree n, then it is well known that the map from ℂ to ℂ that it defines will be generically n-to-one. Indeed the equation p(x) = a will be of degree n for each choice of a, and generically will have n distinct roots. This fact generalizes to polynomial maps from ℂⁿ to ℂᵐ; there always exists a k ∈ ℕ ∪ {∞} such that the map is generically k-to-one. However if p(x) has real coefficients, and is instead viewed as a map from (a subset of) ℝ to ℝ, it may not have a generic k-to-one behavior. For instance, from a typical graph of a cubic one sees there can be sets of positive measure on which it is 3-to-one, and others on which it is one-to-one, as well as an exceptional set of measure zero on which the cubic is 2-to-one. While this exceptional set arises since a polynomial may have repeated roots, the lack of a generic k-to-one behavior is due to passing from considering a complex domain for the function, to a real one.
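This behavior is easy to see numerically; below is a minimal sketch (the cubic x^3 − 3x and the sample targets are arbitrary illustrative choices, not taken from the models of this paper), counting complex versus real preimages in Python:

    import numpy as np

    # p(x) = x^3 - 3x, first as a map C -> C, then restricted to a map R -> R.
    # Over C the equation p(x) = a generically has 3 distinct roots, so the map is
    # generically 3-to-one; over R the number of preimages depends on a.
    def preimages(a, real_only):
        roots = np.roots([1.0, 0.0, -3.0, -a])   # roots of x^3 - 3x - a
        if real_only:
            roots = roots[np.abs(roots.imag) < 1e-9].real
        return roots

    for a in [0.5, 1.9, 2.1, 5.0]:               # arbitrary sample targets
        n_complex = len(preimages(a, real_only=False))   # always 3
        n_real = len(preimages(a, real_only=True))       # 3 when |a| < 2, 1 when |a| > 2
        print(a, n_complex, n_real)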

The fact that the polynomial parameterizations for the models investigated here have a generic k-to-one behavior on their parameter space thus depends on the particular form of the parameterizations. For those binary models in Table 3, we prove this essentially one model at a time, while obtaining the value for k. In the case of finite k, our arguments actually go further and characterize the k elements of ϕ⁻¹(ϕ(θ)) in terms of a generic θ. Of course when k = 2 this is nothing more than label swapping, but for the cases of k = 4 more is required. Precise statements appear in later sections. In some cases, we also give descriptions of an exceptional subset of Θ where the generic behavior may not hold. In all cases, the reader can deduce such a set from our arguments.

After setting terminology in Section 2, in Section 3 we establish that, when all variables have fixed finite state spaces, Markov equivalent DAGs specify parameter equivalent models. Thus in answering generic identifiability questions one need only consider Markov equivalence classes of DAGs. In Section 4 we revisit the fundamental result due to Kruskal [13], as developed in Allman et al. [10] for identifiability questions. We give explicit identifiability procedures for the DAG to which this result applies most directly (model 3-0), and also for the DAG of model 4-3b. These two DAGs are basic cases whose known identifiability is then leveraged in Section 5 to determine generic identifiability results for all the binary DAGs we catalog. Although in Section 5 we do not push our arguments toward exhaustive consideration of non-binary models, in many cases it would be straightforward to do so. For instance, if all variables associated to a DAG have the same size state space, little in our arguments needs to be modified.

Finally, in Section 6 we construct an explicit distribution for the generically 4-to-one parameterization of model 4-3e in which there are two different causal effects consistent with the observable distribution. This is possible because the parameter sets that give rise to this distribution differ in ways beyond label swapping. Determining causal effects coherently in this context is thus impossible. This example provokes a general caution: The parameterization of a discrete DAG model can be k-to-one with k larger than one would expect from label swapping, and when this occurs quantifying causal effects can be highly problematic.

We view the main contribution of this paper not as the determination of parameter identifiability for the specific binary models we consider, but rather as the development of the techniques by which we establish our results. Ultimately, one would like fairly simple graphical rules to determine which parameters are identifiable, and perhaps even to yield formulas for them in terms of the joint distribution. Establishing similar results for more general graphical models, not specified by a DAG, is also desirable. Some work in this context already exists (see, e.g., Stanghellini and Vantaggi [14]).

2 Discrete DAG models and parameter identifiability

The models we consider are specified in part by DAGs G = (V, E) in which nodes v ∈ V represent random variables Xv, and directed edges in E imply certain independence statements for the joint distribution of all variables [15]. A bipartition V = O ∪ H is given, in which variables associated to nodes in O or H are observable or hidden, respectively. Finally, we fix finite state spaces, of size nv for each variable Xv.

A DAG G entails a collection of conditional independence statements on the variables associated to its nodes, via d-separation, or an equivalent separation criterion in terms of the moral graph on ancestral sets. A joint distribution of variables satisfies these statements precisely when it has a factorization according to G as

    P = ∏_{v ∈ V} P(Xv | Xpa(v)),

with pa(v) denoting the set of parents of v in G. We refer to the conditional probabilities θ = (P(Xv | Xpa(v)))_{v ∈ V} as the parameters of the DAG model, and denote the space of all possible choices of parameters by Θ = Θ_{G,{nv}}. The parameterization map for the joint distribution of all variables, both observable and hidden, is denoted

    ϕ : Θ → Δ^{∏_{v ∈ V} nv − 1},

where Δk is the k-dimensional probability simplex of stochastic vectors in ℝ^{k+1}. Thus ϕ(Θ) is precisely the collection of all probability distributions satisfying the conditional independence statements associated to G (and possibly additional ones).

Since the probability distribution for the model with hidden variables is obtained from that of the fully observable model, its parameterization map is

    ϕ+ = σ ∘ ϕ : Θ → Δ^{∏_{v ∈ O} nv − 1},

where σ denotes the appropriate map marginalizing over hidden variables. The set ϕ+(Θ) is thus the collection of all observable distributions that arise from the hidden variable model. This collection depends not only on the DAG and designated state spaces of observable variables, but also on the state spaces of hidden variables, even though the sizes of hidden state spaces are not readily apparent from an observable joint distribution.

With all variables having finite state spaces, the parameter space Θ can be identified with the closure of an open subset of [0,1]^L, for some L. We refer to L as the dimension of the parameter space. The dimension of Θ is easily seen to be

    dim(Θ) = Σ_{v ∈ V} (nv − 1) ∏_{w ∈ pa(v)} nw.    (1)

In the case of all binary variables, this simplifies to

    dim(Θ) = Σ_{v ∈ V} 2^{|pa(v)|} = Σ_{k ≥ 0} mk 2^k,    (2)

where mk is the number of nodes in G with in-degree k.
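Equation (1) is simple to evaluate for any DAG and any choice of state space sizes; the following sketch (the parent-dictionary encoding of the graph is our own convention) computes it, and reproduces eq. (2) for the all-binary case:

    def dim_theta(parents, n):
        """dim(Theta) from eq. (1): sum over v of (n_v - 1) * prod over w in pa(v) of n_w."""
        total = 0
        for v, pa in parents.items():
            prod = 1
            for w in pa:
                prod *= n[w]
            total += (n[v] - 1) * prod
        return total

    # Model 3-0: hidden node 0 parental to observable nodes 1, 2, 3, all binary.
    parents = {0: [], 1: [0], 2: [0], 3: [0]}
    n = {v: 2 for v in parents}
    print(dim_theta(parents, n))   # 7, matching eq. (2): m_0 = 1, m_1 = 3, so 1 + 3*2 = 7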

If a statement is said to hold for generic parameters or generically then we mean it holds for all parameters in a set of the form Θ \ E, where the exceptional set E is a proper algebraic subset of Θ. (Recall an algebraic subset is the zero set of a finite collection of multivariate polynomials.) As proper algebraic subsets of ℝⁿ are always of Lebesgue measure zero, a statement that holds generically can fail only on a set of measure zero.

As an example of this language, for any DAG model with all variables finite and observable, generic parameters lead to a distribution faithful to the DAG, in the sense that those conditional independence statements implied by d-separation rules will hold, and no others [16]. Equivalently, a generic distribution from such a model is faithful to the DAG.

There are several notions of identifiability of parameters of a model; we refer the reader to Allman et al. [10]. The strictest notion, that the parameterization map is one-to-one, is easily seen to hold when all DAG variables are observable with mild additional assumptions (e.g., positivity of all parameters). If a model has hidden variables, then this is too strict a notion of identifiability, as the well-known issue of label swapping arises: One can permute the names of the states of hidden variables, making appropriate changes to associated parameters, without changing the joint distribution of the observable variables. For a model with one r-state hidden variable, label swapping implies that for any generic θ1 ∈ Θ there are at least r! − 1 other points θj ∈ Θ with ϕ+(θ1) = ϕ+(θj). But since these are isolated parameter points that differ only by state labeling, this issue does not generally limit the usefulness of a model, provided that we remain aware of it when interpreting parameters.

The strongest useful notion of identifiability for models with hidden variables is that for generic θ1 ∈ Θ, if ϕ+(θ1) = ϕ+(θ2), then θ1 and θ2 differ only up to label swapping for hidden variables. This notion, which we refer to as generic identifiability up to label swapping, is our primary focus in this paper. In particular, for models with a single binary hidden variable it is equivalent to the parameterization map being generically 2-to-one.

3 Markov equivalence and parameter identifiability

Two DAGs on the same sets of observable and hidden nodes are said to be Markov equivalent if they entail the same conditional independence statements through d-separation. (Note this notion does not distinguish between observable and hidden variables; all are treated as observable.) Thus for fixed choices of state spaces of the variables, two different but Markov equivalent DAGs, G1 ≠ G2, have different parameter spaces Θ1, Θ2, and different parameterization maps, yet ϕ1(Θ1) = ϕ2(Θ2).

For studying identifiability questions, it is helpful to first explore the relationship between parameterizations for Markov equivalent graphs. A simple example, with no hidden variables, is instructive. Consider the DAGs on two observable nodes

    1 → 2,    1 ← 2,

which are equivalent, since neither entails any independence statements. Now the particular probability distribution P(X1 = i, X2 = j) = Pij with

    P = [[1/2, 0], [1/2, 0]]

requires parameters on the first DAG to be

    P(X1) = (1/2, 1/2),    P(X2|X1) = [[1, 0], [1, 0]],

while parameters on the second DAG can be

    P(X2) = (1, 0),    P(X1|X2) = [[1/2, 1/2], [t, 1 − t]]

for any t ∈ [0,1]. Thus this particular distribution has identifiable parameters for only one of these DAGs. (Here and in the rest of the paper conditional probability tables specifying parameters have rows corresponding to states of conditioning, i.e., parent, variables.)
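A quick numeric check of this example (t below is an arbitrary value in [0,1]):

    import numpy as np

    # DAG 1 -> 2: joint P(X1, X2) = diag(P(X1)) @ P(X2 | X1).
    p1 = np.array([0.5, 0.5])
    M2given1 = np.array([[1.0, 0.0],
                         [1.0, 0.0]])
    P_first = np.diag(p1) @ M2given1

    # DAG 2 -> 1: joint has (i, j) entry P(X2 = j) * P(X1 = i | X2 = j).
    p2 = np.array([1.0, 0.0])
    t = 0.3
    M1given2 = np.array([[0.5, 0.5],
                         [t, 1.0 - t]])
    P_second = (np.diag(p2) @ M1given2).T

    print(np.allclose(P_first, P_second))   # True: both equal [[1/2, 0], [1/2, 0]]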

Of course, this probability distribution was a special one, and is atypical for these models, which are easily seen to have generically identifiable parameters (as do all DAG models without hidden variables). Nonetheless, it illustrates the need for “generic” language and careful arguments for results such as the following.

Theorem 1: With all variables having fixed finite state spaces, consider two Markov equivalent DAGs, G1 and G2, possibly with hidden nodes. If the parameterization map ϕ1+ is generically k-to-one for some k, then ϕ2+ is also generically k-to-one.

In particular if such a model has parameters that are generically identifiable up to label swapping, so does every Markov equivalent model.

This theorem is a consequence of the following:

Lemma 2: With all variables having finite state spaces, consider two Markov equivalent DAGs, G1 and G2, with parameter spaces Θi and parameterization maps ϕi, i ∈ {1, 2}, for the joint distribution of all variables. Then there are generic subsets Si ⊆ Θi and a rational homeomorphism ψ : S1 → S2, with rational inverse, such that for all θ ∈ S1, ϕ1(θ) = ϕ2(ψ(θ)).

Proof. Recall that an edge i → j of a DAG is said to be covered if pa(j) = pa(i) ∪ {i}. By Chickering [17], Markov equivalent DAGs differ by applying a sequence of reversals of covered edges.

We thus first assume the Gi differ by the reversal of a single covered edge i → j of G1. Let W = paG1(i) = paG2(j), so paG1(j) = W ∪ {i} and paG2(i) = W ∪ {j}. Now any θ ∈ Θ1 is a collection of conditional probabilities P(Xv | Xpa(v)), including P(Xi|W) and P(Xj|Xi, W). From these, successively define

    P(Xi, Xj | W) = P(Xj | Xi, W) P(Xi | W),
    P(Xj | W) = Σ_k P(Xi = k, Xj | W),
    P(Xi | Xj, W) = P(Xi, Xj | W) / P(Xj | W).

Using these last two conditional probabilities, along with those specified by θ for all v ≠ i, j, define parameters ψ(θ) ∈ Θ2. Now ψ is defined and continuous on the set S1 where P(Xi|W) and P(Xj|Xi, W) are strictly positive.

One easily checks that the same construction applied to the edge j → i in G2 gives the inverse map.

If G1, G2 differ by a sequence of edge reversals, one defines the Si as subsets where all parameters related to the reversed edges are strictly positive, and takes ψ to be the composition of the maps for the individual reversals.□
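For finite state spaces the map ψ of a single covered-edge reversal is easy to write out explicitly; below is a minimal sketch (the array shapes, helper name, and example values are our own conventions, with the common parent set W flattened to a single axis):

    import numpy as np

    def reverse_covered_edge(P_i_given_W, P_j_given_iW):
        """Covered edge reversal i -> j  =>  j -> i, as in the proof of Lemma 2.
        Shapes: P_i_given_W is (nW, ni); P_j_given_iW is (nW, ni, nj), where nW
        indexes the joint state of the common parents W.  Returns P(Xj | W) of
        shape (nW, nj) and P(Xi | Xj, W) of shape (nW, nj, ni).  Requires the
        input probabilities to be strictly positive (the set S1 of the lemma)."""
        P_ij_given_W = P_i_given_W[:, :, None] * P_j_given_iW   # P(Xi, Xj | W)
        P_j_given_W = P_ij_given_W.sum(axis=1)                  # marginalize out Xi
        P_i_given_jW = P_ij_given_W / P_j_given_W[:, None, :]   # Bayes' rule
        return P_j_given_W, np.swapaxes(P_i_given_jW, 1, 2)

    # Binary Xi, Xj with one binary common parent W (arbitrary example values):
    P_i_given_W = np.array([[0.7, 0.3], [0.4, 0.6]])
    P_j_given_iW = np.array([[[0.2, 0.8], [0.5, 0.5]],
                             [[0.9, 0.1], [0.3, 0.7]]])
    P_j_given_W, P_i_given_jW = reverse_covered_edge(P_i_given_W, P_j_given_iW)
    print(P_j_given_W.sum(axis=1), P_i_given_jW.sum(axis=2))    # all rows sum to 1

Composing such reversal steps, with positivity maintained at each step, gives the map ψ for any pair of Markov equivalent DAGs.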

Proof of Theorem 1. Suppose Θ1 has a generic subset S on which ϕ1+ is k-to-one and the map ψ of Lemma 2 is invertible. Then ψ(S) will be a generic subset of Θ2, and the identity

    ϕ2+(θ) = ϕ1+(ψ⁻¹(θ))

from Lemma 2 shows that ϕ2+ is k-to-one on ψ(S). Thus we need only establish the existence of such an S.

Let S1 = Θ1 \ E1, S2 = Θ2 \ E2 be the generic sets of Lemma 2. Let S1′ = Θ1 \ E1′ be a generic set on which ϕ1+ is k-to-one. We may thus assume E1, E1′, E2 are all proper algebraic subsets. Since ϕ1+ is generically k-to-one with finite k, the set (ϕ1+)⁻¹(ϕ1+(E1)) must be contained in a proper algebraic subset of Θ1, say E1″. We may therefore take S = Θ1 \ (E1′ ∪ E1″).□

4 Two special models

In this section, we explain how one may explicitly solve for parameter values from a joint distribution of the observable variables for models specified by two specific DAGs with hidden nodes.

Parameter identifiability of the model with DAG shown in Figure 2 is an instance of a more general theorem of Kruskal [13] (see also [18, 19]). However, known proofs of the full Kruskal theorem do not yield an explicit procedure for recovering parameters. Nonetheless, a proof of a restricted theorem (the essential idea of which is not original to this work, and has been rediscovered several times) does. We include this argument for Theorem 3, since it is still not widely known and provides motivation for the approach to the proof of Theorem 4 for models associated to a second DAG, shown in Figure 3. Our analysis of the second model appears to be entirely novel. For both models, we characterize the exceptional parameters for which these procedures fail, giving a precise characterization of a set containing all non-identifiable parameters.

4.1 Explicit cases of Kruskal’s theorem

The model we consider has the DAG of model 3-0 in Table 3, also shown in Figure 2 for convenience.

Figure 2

The DAG of model 3-0, the Kruskal model

Parameters for the model are:

  • 1.

    p0 = P(X0) ∈ Δ^{n0−1}, a stochastic vector giving the distribution for the n0-state hidden variable X0.

  • 2.

    For each of i = 1, 2, 3, an n0 × ni stochastic matrix Mi = P(Xi|X0).

We use the following terminology.

Definition: The Kruskal row rank of a matrix M is the maximal number r such that every set of r rows of M is linearly independent.

Note that the Kruskal row rank of a matrix may be less than its rank, which is the maximal r such that some set of r rows is independent.

Our special case of Kruskal’s theorem is the following:

Theorem 3: Consider the model represented by the DAG of model 3-0, where variables Xi have ni ≥ 2 states, with n1, n2 ≥ n0. Then generic parameters of the model are identifiable up to label swapping, and an algebraic procedure for determination of the parameters from the joint probability distribution P(X1, X2, X3) can be given.

More specifically, if p0 has no zero entries, M1,M2 have rank n0, and M3 has Kruskal row rank at least 2, then the parameters can be found through determination of the roots of certain n0th degree univariate polynomials and solving linear equations. The coefficients of these polynomials and linear systems are rational expressions in the joint distribution.

Proof. For simplicity, consider first the case n0 = n1 = n2 = n. Let P = P(X1, X2, X3) be a probability distribution of observable variables arising from the model, viewed as an n × n × n3 array.

Marginalizing P over X3 (i.e., summing over the 3rd index), we obtain a matrix which, in terms of the unknown parameters, is the matrix product

    P+ = P(X1, X2) = M1ᵀ diag(p0) M2.

Similarly, if M3 = (mij), then the slices of P with third index fixed at i (i.e., the conditional distributions given X3 = i, up to normalization) are

    Pi = P(X1, X2, X3 = i) = M1ᵀ diag(p0) diag(M3(·, i)) M2,

where M3(·, i) is the ith column of M3.

Assuming M1, M2 are non-singular, and p0 has no zero entries, P+ is invertible and we see

    (P+)⁻¹ Pi = M2⁻¹ diag(M3(·, i)) M2.    (3)

Thus the entries of the columns of M3 can be determined (without order) by finding the eigenvalues of the (P+)⁻¹ Pi, and the rows of M2 can be found by computing the corresponding left eigenvectors, normalizing so the entries add to 1. (If M3 has repeated entries in the ith column, the eigenvectors may not be uniquely determined. However, since the matrices (P+)⁻¹ Pi for various i commute, and M3 has Kruskal row rank 2 or more, the set of these matrices does uniquely determine a collection of simultaneous 1-dimensional eigenspaces. We leave the details to the reader.) This determines M2 and M3, up to the simultaneous ordering of their rows.

A similar calculation with Pi (P+)⁻¹ determines M1 and M3, up to the row order. Since the rows of M3 are distinct (because it has Kruskal row rank 2), fixing some ordering of them fixes a consistent order of the rows of all of the Mi.

Finally, one determines p0 from (M1ᵀ)⁻¹ P+ M2⁻¹ = diag(p0).

The hypotheses on the rank and Kruskal rank of the parameter matrices can be expressed through the non-vanishing of minors, so all assumptions on parameters used in this procedure can be phrased as the non-vanishing of certain polynomials. As a result, the exceptional set where it cannot be performed is contained in a proper algebraic subset of the parameter set.

Since the computations to perform the procedure involve computing eigenvalues and eigenvectors of matrices whose entries are rational in the joint distribution, the second paragraph of the theorem is justified.

In the more general case of n1, n2 ≥ n0, one can apply the argument above to n0 × n0 × n3 subarrays of P corresponding to submatrices of M1 and M2 that are invertible. All such subarrays will lead to the same eigenvalues of the matrices analogous to those of eq. (3), so eigenvectors can be matched up to reconstruct entire rows of M1 and M2. The vector p0 is determined by a formula similar to that above, using a subarray of the marginalization P+.□
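A numerical sketch of this procedure for all-binary variables, with arbitrary test parameters (not taken from the paper), is given below; it recovers the parameters up to the ordering of the hidden states:

    import numpy as np

    # True parameters for model 3-0, all variables binary (arbitrary test values).
    p0 = np.array([0.35, 0.65])
    M1 = np.array([[0.8, 0.2], [0.3, 0.7]])    # P(X1 | X0)
    M2 = np.array([[0.6, 0.4], [0.1, 0.9]])    # P(X2 | X0)
    M3 = np.array([[0.25, 0.75], [0.7, 0.3]])  # P(X3 | X0)

    # Observable joint P(X1, X2, X3), marginalizing the hidden X0.
    P = np.einsum('k,ka,kb,kc->abc', p0, M1, M2, M3)

    # Recovery following the proof of Theorem 3.
    P_plus = P.sum(axis=2)                      # P(X1, X2) = M1^T diag(p0) M2
    A = np.linalg.inv(P_plus) @ P[:, :, 0]      # = M2^{-1} diag(M3 first column) M2, eq. (3)
    evals, evecs = np.linalg.eig(A.T)           # left eigenpairs of A
    evals, evecs = evals.real, evecs.real
    M2_hat = (evecs / evecs.sum(axis=0)).T      # rows of M2, in some order
    M3_hat = np.column_stack([evals, 1.0 - evals])
    B = P_plus @ np.linalg.inv(M2_hat)          # = M1^T diag(p0), ordered as M2_hat
    p0_hat = B.sum(axis=0)
    M1_hat = (B / p0_hat).T

    print(np.sort(p0_hat), np.sort(p0))                           # agree up to label swapping
    print(np.allclose(np.sort(M3_hat[:, 0]), np.sort(M3[:, 0])))  # True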

4.2 Another special model

The model we consider next has the DAG of model 4-3b in Table 3, reproduced in Figure 3 for convenience.

Figure 3

The DAG of model 4-3b

Parameters for the model are:

  • 1.

    p0 = P(X0) ∈ Δ^{n0−1}, a stochastic vector giving the distribution for the n0-state hidden variable X0.

  • 2.

    Stochastic matrices M1 = P(X1|X0) of size n0 × n1; Mi = P(Xi|X0, X1) of size n0·n1 × ni for i = 2, 3; and M4 = P(X4|X0, X3) of size n0·n3 × n4.

Theorem 4: Consider the model represented by the DAG of model 4-3b, where variables Xi have ni ≥ 2 states, with n2, n4 ≥ n0. Then generic parameters of the model are identifiable up to label swapping, and an algebraic procedure for determination of the parameters from the joint probability distribution P(X1, X2, X3, X4) can be given.

More specifically, suppose p0, M1, M3 have no zero entries, the n0 × n2 and n0 × n4 matrices

    M2i = P(X2 | X0, X1 = i), 1 ≤ i ≤ n1,  and  M4j = P(X4 | X0, X3 = j), 1 ≤ j ≤ n3,

have rank n0, and there exist some i, i′ with 1 ≤ i < i′ ≤ n1 such that for all 1 ≤ j < j′ ≤ n3, 1 ≤ k < k′ ≤ n0 the entries of M3 satisfy inequality (7). Then from the resulting joint distribution the parameters can be found through determination of the roots of certain n0th degree univariate polynomials and solving linear equations. The coefficients of these polynomials and linear systems are rational expressions in the entries of the joint distribution.

Proof. Consider first the case n0=n2=n4=n. With P=P(X1,X2,X3,X4) viewed as an n1×n×n3×n array, we work with n×n slices of P, Pi,j=P(X1=i,X2,X3=j,X4),

that is, we essentially condition on X1,X3, though omit the normalization.

Note that these slices can be expressed as

    Pi,j = (M2i)ᵀ Di,j M4j,    (4)

where Di,j = diag(P(X0, X1 = i, X3 = j)) is the diagonal matrix given in terms of parameters by

    Di,j(k, k) = p0(k) M1(k, i) M3((k, i), j),

and M2i and M4j are as in the statement of the theorem.

Equation (4) implies for 1 ≤ i, i′ ≤ n1 and 1 ≤ j, j′ ≤ n3 that

    (Pi,j)⁻¹ Pi,j′ (Pi′,j′)⁻¹ Pi′,j = (M4j)⁻¹ (Di,j)⁻¹ Di,j′ (Di′,j′)⁻¹ Di′,j M4j,    (5)

and the hypotheses on the parameters imply the needed invertibility. But this shows the rows of M4j are left eigenvectors of this product.

In fact, if i ≠ i′, j ≠ j′, then the eigenvalues of this product are distinct, for generic parameters. To see this, note the eigenvalues are

    M3((k, i), j′) M3((k, i′), j) / (M3((k, i), j) M3((k, i′), j′)),    (6)

for 1 ≤ k ≤ n, so distinctness of eigenvalues is equivalent to

    M3((k, i), j′) M3((k, i′), j) M3((k′, i), j) M3((k′, i′), j′) ≠ M3((k, i), j) M3((k, i′), j′) M3((k′, i), j′) M3((k′, i′), j),    (7)

for all 1 ≤ k < k′ ≤ n. Thus a generic choice of M3 leads to distinct eigenvalues.

With distinct eigenvalues, the eigenvectors are determined up to scaling. But since each row of M4j must sum to 1, the rows of M4j are therefore determined by P.

The ordering of the rows of the M4j has not yet been determined. To do this, first fix an arbitrary ordering of the rows of M41, say, which imposes an arbitrary labeling of the states for X0. Then using eq. (4), from Pi,1 (M41)⁻¹ we can determine Di,1 and M2i with their rows ordered consistently with M41. For j ≠ 1, using eq. (4) again, from ((M2i)ᵀ)⁻¹ Pi,j we can determine Di,j and M4j with a consistent row order. Thus M2 and M4 are determined.

To determine the remaining parameters, again appealing to eq. (4), we can recover the distribution P(X0, X1, X3) using

    ((M2i)ᵀ)⁻¹ Pi,j (M4j)⁻¹ = diag(P(X0, X1 = i, X3 = j)).

With X0 no longer hidden, it is straightforward to determine the remaining parameters.

The general case of n0 ≤ n2, n4 is handled by considering subarrays, just as in the proof of the preceding theorem.□
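The structure of eq. (5) is easy to verify numerically; the sketch below (random generic parameters, 0-based state indices, and array conventions of our own choosing) checks that the rows of M4j are indeed left eigenvectors of the matrix product built from observable slices:

    import numpy as np

    rng = np.random.default_rng(0)
    def stoch(*shape):                  # random stochastic array; last axis sums to 1
        a = rng.random(shape)
        return a / a.sum(axis=-1, keepdims=True)

    # Generic parameters for model 4-3b, all variables binary.
    p0 = stoch(2)                       # P(X0)
    M1 = stoch(2, 2)                    # P(X1 | X0),      indexed [k, i]
    M2 = stoch(2, 2, 2)                 # P(X2 | X0, X1),  indexed [k, i, a]
    M3 = stoch(2, 2, 2)                 # P(X3 | X0, X1),  indexed [k, i, j]
    M4 = stoch(2, 2, 2)                 # P(X4 | X0, X3),  indexed [k, j, b]

    # Observable joint P(X1, X2, X3, X4), marginalizing the hidden X0.
    P = np.einsum('k,ki,kia,kij,kjb->iajb', p0, M1, M2, M3, M4)
    Pij = lambda i, j: P[i, :, j, :]    # the slice P(X1=i, X2, X3=j, X4)

    # Product of eq. (5) with (i, i', j, j') = (0, 1, 0, 1) in 0-based indexing.
    prod = (np.linalg.inv(Pij(0, 0)) @ Pij(0, 1) @
            np.linalg.inv(Pij(1, 1)) @ Pij(1, 0))

    # Rows of M4j for j = 0, i.e. P(X4 | X0, X3 = 0), are left eigenvectors of prod.
    M4_j0 = M4[:, 0, :]
    lhs = M4_j0 @ prod
    print(np.allclose(lhs[0] / M4_j0[0], (lhs[0] / M4_j0[0])[0]))   # row 0 only rescaled
    print(np.allclose(lhs[1] / M4_j0[1], (lhs[1] / M4_j0[1])[0]))   # row 1 only rescaled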

Remark. In the case of all binary variables, the expression in eq. (6) is just the conditional odds ratio for the observed variables X1, X3, conditioned on X0. Inequality (7) can thus be interpreted as saying there is a non-zero 3-way interaction between the variables X0, X1, X3, which is the generic situation.

5 Small binary DAG models

All variables are assumed binary throughout this section. In Table 3 of the Appendix, we list each of the binary DAG models with one latent node which is parental to up to 4 observable nodes. We number the graphs as A-Bx where A = |O| = |V| − 1 is the number of observed variables, B = |E| − |O| is the number of directed edges between the observed variables, and x is a letter appended to distinguish between several graphs with these same features. As the table presents only the case that all variables are binary, the observable distribution lies in a space of dimension 2^A − 1.

The primary information in this table is in the column for k, indicating the parameterization map is generically k-to-one. As discussed in the introduction, the existence of such a k is not obvious, and does not follow from the behavior of general polynomial maps in real variables.

The models 4-3e and 4-3f, for which the parameterization maps are generically 4-to-one, are particularly interesting cases, as for these models there are non-identifiability issues that arise neither from overparameterization (in the sense of a parameter space of larger dimension than the distribution space) nor from label swapping. While these models are ones that can plausibly be imagined as being used for data analysis, they have a rather surprising failure of identifiability, which is explored more precisely in Section 6.

We now turn to establishing the results in Table 3.

For many of the models A-Bx the dimension of the parameter space computed by eq. (2) exceeds the dimension 2^A − 1 of the probability simplex in which the joint distribution of observed variables lies. In these cases, the following proposition applies to show the parameterization is generically infinite-to-one. We omit its proof for brevity.

Proposition 5: Let f : S → ℝ^m be any map defined by real polynomials, where S is an open subset of ℝ^n and n > m. Then f is generically infinite-to-one.

This proposition applies to all models in Table 3 with an infinite-to-one parameterization, with the single exception of 4-2a. For that model, amalgamating X1 and X2 together, and likewise X3 and X4, we obtain a model with two 4-state observed variables that are conditionally independent given a binary hidden variable X0. One can show that the probability distributions for this model form an 11-dimensional object, and then a variant of Proposition 5 applies.

For models 3-0 and 4-3b (and the Markov equivalent 4-3a), specializing Theorems 3 and 4 of the previous section to binary variables yields the claims in the table.

For the remaining models, the strategy is to first marginalize or condition on an observable variable to reduce the model to one already understood. One then attempts to “lift” results on the reduced model back to the original one.

We consider in detail only some of the models, indicating how the arguments we give can be adapted to others with minor modifications.

5.1 Model 4-1

Referring to Figure 4, since node 2 is a sink, marginalizing over X2 gives an instance of model 3-0 with the same parameters, after discarding P(X2|X0,X1). Thus generically all parameters except P(X2|X0,X1) are determined, up to label swapping.

Figure 4

The DAG of model 4-1

But note that if the (unknown) joint distribution of X0, X1, X2, X3 is written as an 8 × 2 matrix U, with

    U((i, j, k), ℓ) = P(X0 = ℓ, X1 = i, X2 = j, X3 = k),

and M4 = P(X4|X0), then the matrix product U M4 has entries

    (U M4)((i, j, k), ℓ) = P(X1 = i, X2 = j, X3 = k, X4 = ℓ),

which form the observable joint distribution. Since generically M4 is invertible, from the observable distribution and each of the already identified label swapping variants of M4 we can find U. From U we marginalize to obtain P(X0, X1, X2) and P(X0, X1). Under the generic condition that P(X0), P(X1|X0) are strictly positive, P(X0, X1) is as well, and so we can compute P(X2|X0, X1) = P(X0, X1, X2) / P(X0, X1).
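A numeric sketch of this lifting step follows (random generic parameters; the true M4 is supplied in place of the output of the Theorem 3 step, and the array conventions are our own):

    import numpy as np

    rng = np.random.default_rng(1)
    def stoch(*shape):
        a = rng.random(shape)
        return a / a.sum(axis=-1, keepdims=True)

    # Generic parameters for model 4-1: hidden X0 parental to X1..X4, extra edge 1 -> 2.
    p0 = stoch(2)
    M1, M3, M4 = stoch(2, 2), stoch(2, 2), stoch(2, 2)   # P(X1|X0), P(X3|X0), P(X4|X0)
    M2 = stoch(2, 2, 2)                                  # P(X2|X0, X1), indexed [k, i, a]

    # Joint P(X0, X1, X2, X3) flattened to the 8 x 2 matrix U of the text.
    U_true = np.einsum('k,ki,kia,kj->iajk', p0, M1, M2, M3).reshape(8, 2)
    P_obs = U_true @ M4                                  # observable P(X1, X2, X3, X4)

    # Lifting step: with M4 in hand, recover U, then P(X0, X1) and P(X2 | X0, X1).
    U = P_obs @ np.linalg.inv(M4)
    P_120 = U.reshape(2, 2, 2, 2).sum(axis=2)            # P(X1, X2, X0), axes (i, a, k)
    P_10 = P_120.sum(axis=1)                             # P(X1, X0), axes (i, k)
    M2_hat = (P_120 / P_10[:, None, :]).transpose(2, 0, 1)   # P(X2 | X0, X1), [k, i, a]
    print(np.allclose(M2_hat, M2))                       # True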

Models 4-0 and 4-2d are handled similarly, by marginalizing over the sink nodes 4 and 3, respectively.

An alternative argument for models 4-1 and 4-0 proceeds by amalgamating the observed variables, X1,X2, into a single 4-state variable, and applying Theorem 3 directly to that model. We leave the details to the reader.

5.2 Models 4-2b,c

Up to renaming of nodes, the DAGs for models 4-2b and 4-2c are Markov equivalent. Thus by Theorem 1, it is enough to consider model 4-2c, as shown in Figure 5.

Figure 5

The DAG of model 4-2c

We condition on X1 = j, j = 1, 2, to obtain two related models. Letting Xi(j) denote the conditioned variable at node i, the resulting observable distributions are

    P(X2(j), X3(j), X4(j)) = P(X2, X3, X4 | X1 = j) = P(X1 = j)⁻¹ P(X1 = j, X2, X3, X4).

With a hidden variable X0(j) and observed variables X2(j), X3(j), X4(j), these distributions arise from a DAG like that of model 3-0. With parameters for the original model p0 = P(X0), 2×2 matrices Mi = P(Xi|X0) for i = 1, 4, and 4×2 matrices Mi = P(Xi|X0, X1), i = 2, 3, and ej the standard basis vector, parameters for the conditioned models are

  • 1.

    the vector p0(j) = P(X0(j)) = P(X0 | X1 = j) = P(X1 = j)⁻¹ P(X0, X1 = j) = (1 / (p0ᵀ M1 ej)) diag(p0) M1 ej,

  • 2.

    the 2×2 stochastic matrix M4(j) = P(X4(j) | X0(j)) = M4, and

  • 3.

    for i = 2, 3, the 2×2 stochastic matrix Mi(j) = P(Xi(j) | X0(j)), whose rows are the (1, j) and (2, j) rows of Mi.

Now if p0 and column j of M1 have non-zero entries, it follows that p0(j) has no zero entries. If additionally M2(j),M3(j),M4 all have rank 2, by Theorem 3 the parameters of these conditioned models are identifiable, up to the labeling of the states of the hidden variable. As these assumptions are generic conditions on the parameters of the original model, we can generically identify the parameters of the conditioned models.

In particular, M4 can be identified up to reordering its rows, and is invertible. But let U denote the (unknown) 8 × 2 matrix with U((i, j, k), ℓ) = P(X0 = ℓ, X1 = i, X2 = j, X3 = k). Then P = U M4 has as its entries the observable distribution P(X1, X2, X3, X4). Thus U = P M4⁻¹ can be determined from P. Since U is the distribution of the induced model on X0, X1, X2, X3 with no hidden variables, it is then straightforward to identify all remaining parameters of the original model.

Thus all parameters are identifiable generically, up to label swapping. More specifically, they are identifiable provided that for either j = 1 or 2 the three matrices M4, M2(j), M3(j) have rank 2, and p0 and the jth column of M1 have non-zero entries.

5.3 Models 4-3e,f

Due to Markov equivalence, we need consider only 4-3e, as shown in Figure 6.

Figure 6

The DAG of model 4-3e

By conditioning on X1 = j, j = 1, 2, we obtain two models of the form of 3-0. One checks that the induced parameters for these conditioned models are generic. Indeed, in terms of the original parameters they are P(Xi | X0, X1 = j), i = 2, 3, 4, which are generically non-singular since they are simply submatrices of P(Xi | X0, X1), and at the hidden node

    P(X0 | X1 = j) = P(X1 = j | X0) P(X0) / Σ_ℓ P(X1 = j | X0 = ℓ) P(X0 = ℓ),

which generically has non-zero entries.

Thus for generic parameters on the original model, up to label swapping we can determine P(X0|X1=j) and P(Xi|X0,X1=j), i=2,3,4. However, we do not have an ordering of the states of X0 that is consistent for the recovered parameters for the two models. Generically we have four choices of parameters for the two models taken together. Each of these four choices leads to a possible joint distribution P(X0,X1); viewing this joint distribution as a matrix, the four versions differ only by independently interchanging the two entries in each column, thus keeping the same marginalization P(X1). Generically, each of the four distributions for P(X0,X1) yields different parameters P(X0) and P(X1|X0). The matrices P(Xi|X0,X1), i=2,3,4, are then obtained using the same rows as in P(Xi|X0,X1=j), though the ordering of the rows is dependent on the choice made previously.

Having obtained four possible parameter choices, it is straightforward to confirm that they all lead to the same joint distribution. Thus the parameterization map is generically 4-to-one.

6 Identification of causal effects

Here we examine the impact of k-to-one model parameterizations on the causal effect of one observable variable on another, when a latent variable acts as a confounder. For simplicity we assume that the latent variable is binary, though our discussion can be extended to a more general setting.

According to Theorem 3.2.2 (Adjusting for Direct Causes), p. 73 of Pearl [4], the causal effect of Xi on Xj can be obtained from model parameters by an appropriate sum over the states of the other direct causes of Xj. This sum is invariant under a relabeling of the states of those direct causes, and therefore the causal effect is not affected by label swapping if one of these is latent. As an instance, the causal effect of X1 on X2 in model 4-2b is

    P(X2 | do(X1 = x1)) = P(X2 | X1 = x1, X0 = 1) P(X0 = 1) + P(X2 | X1 = x1, X0 = 2) P(X0 = 2).    (8)

Thus when label swapping is the only source of parameter non-identifiability, causal effects are uniquely determined by the observable distribution.
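As a minimal sketch (array layout and 0-based state indexing are our own conventions), eq. (8) is the usual adjustment over the binary latent direct cause:

    import numpy as np

    def causal_effect_x1_on_x2(p0, M2, x1):
        """P(X2 | do(X1 = x1)) from eq. (8), adjusting for the binary latent X0.
        M2[k, i, :] = P(X2 | X0 = k, X1 = i); p0 = P(X0)."""
        return p0[0] * M2[0, x1, :] + p0[1] * M2[1, x1, :]

    # Arbitrary example parameter values (0-based states):
    p0 = np.array([0.4, 0.6])
    M2 = np.array([[[0.7, 0.3], [0.2, 0.8]],    # P(X2 | X0 = 0, X1 = .)
                   [[0.5, 0.5], [0.6, 0.4]]])   # P(X2 | X0 = 1, X1 = .)
    print(causal_effect_x1_on_x2(p0, M2, x1=0))
    print(causal_effect_x1_on_x2(p0, M2, x1=1))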

Things are more complex when parameter non-identifiability arises in other ways. For example, model 4-3e has one binary latent variable but a 4-to-one parameterization. In Table 1 two choices of parameters, (1), (2), are given for this model, as well as the common observable distribution they produce. These parameters and their two variants from label swapping at node 0 give the four elements of the fiber of the observable distribution.

Table 1

A rational example for model 4-3e. The parameter choices (1) and (2) lead to the same observable distribution, shown at the bottom. For the 4×2 matrix parameters, row indices refer to states of a pair of parents i<j ordered lexicographically as (1,1), (1,2), (2,1), (2,2), with the first entry referring to parent i, and the second to parent j

For any parameters of model 4-3e, the causal effect of X1 on X2 is again as given in eq. (8). However, due to the 4-to-one parameterization, there can be two different causal effects that are consistent with an observable distribution. As such, there may be distributions such that one causal effect leads to the conclusion that there is a positive effect of X1 on X2 (i.e., setting X1=2 gives a higher probability of X2=2 than setting X1=1), while the other causal effect leads to the conclusion that there is a negative effect of X1 on X2. Indeed, the observable distribution in Table 1 is such an instance. In Table 2 the two causal effects corresponding to that given distribution are shown. Here parameters (2) lead to a positive effect of X1 on X2, while parameters (1) lead to a negative effect.

Table 2

The causal effects of X1 on X2 for the example in Table 1

More generally, for generic observable distributions of this model, choices of parameters that differ other than by label swapping can give different causal effects. However, it varies whether the effects have the same or different signs.

7 Conclusion

Paraphrasing Pearl [6], the problem of identifying causal effects in non-parametric models has been “placed to rest” by the proof of completeness of the do-calculus and related graphical criteria. In this paper we show that the introduction of modest (parametric) assumptions on the size of the state spaces of variables allows for identifiability of parameters that otherwise would be non-identifiable. Causal effects can be computed from identified parameters, if desired, but our techniques allow for the recovery of all parameters. In the process of proving parameter identifiability for several small networks, we use techniques inspired by a theorem of Kruskal, and other novel approaches. This framework can be applied to other models as well.

We have at least three reasons to extend the work described in this paper. The first is to develop new techniques and to prove new theoretical results for parameter identifiability; this provides the foundation of our work. The second is to reach the stage at which one can easily determine parameter identifiability for DAG models with hidden variables that are used in statistical modeling; this motivates our work. A third and related focus of future work is to address the scalability of our approach and to automate it. As noted above many of our arguments do not depend on variables being binary. Also, a strategy that we used successfully to handle larger models is to first marginalize or condition on an observable variable to reduce the model to one already understood, and then to “lift” results on the reduced model back to the original one. We are working toward turning this strategy into an algorithm.

Acknowledgments

The authors thank the American Institute of Mathematics, where this work was begun during a workshop on Parameter Identification in Graphical Models, and continued through AIM’s SQuaRE program.

Appendix

Table 3 shows all DAGs with four or fewer observable nodes and one hidden node that is a parent of all observable ones (see Section 5 for model naming convention). Markov equivalent graphs appear on the same line. The dimension of the parameter space is dim(Θ), and 2^A − 1 is the dimension of the probability simplex in which the joint distribution lies. The parameterization map is generically k-to-one.

Table 3

Small binary DAG models

References

1. Neapolitan RE. Probabilistic reasoning in expert systems: theory and algorithms. New York, NY: John Wiley and Sons, 1990.

2. Neapolitan RE. Learning Bayesian networks. Upper Saddle River, NJ: Pearson Prentice Hall, 2004.

3. Pearl J. Causal diagrams for empirical research. Biometrika 1995;82:669–710.

4. Pearl J. Causality: models, reasoning, and inference, 2nd ed. Cambridge: Cambridge University Press, 2009.

5. Huang Y, Valtorta M. Pearl’s calculus of intervention is complete. In Proceedings of the twenty-second conference on uncertainty in artificial intelligence (UAI-06), 2006:217–24.

6. Pearl J. The do-calculus revisited. In Proceedings of the twenty-eighth conference on uncertainty in artificial intelligence (UAI-12), 2012:4–11.

7. Shpitser I, Pearl J. Complete identification methods for the causal hierarchy. J Mach Learn Res 2008;9:1941–79.

8. Tian J, Pearl J. A general identification condition for causal effects. In Proceedings of the eighteenth national conference on artificial intelligence (AAAI-02), 2002:567–73.

9. Kuroki M, Pearl J. Measurement bias and effect restoration in causal inference. Biometrika 2014;101:423–37.

10. Allman E, Matias C, Rhodes J. Identifiability of parameters in latent structure models with many observed variables. Ann Statist 2009;37:3099–132.

11. Mond D, Smith J, van Straten D. Stochastic factorizations, sandwiched simplices and the topology of the space of explanations. R Soc Lond Proc Ser Math Phys Eng Sci 2003;459:2821–45.

12. Kubjas K, Robeva E, Sturmfels B. Fixed points of the EM algorithm and nonnegative rank boundaries. In revision, 2013.

13. Kruskal J. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl 1977;18:95–138.

14. Stanghellini E, Vantaggi B. Identification of discrete concentration graph models with one hidden binary variable. Bernoulli 2013;19:1920–37. Available at: http://dx.doi.org/10.3150/12-BEJ435.

15. Lauritzen SL. Graphical models. Oxford Statistical Science Series, vol. 17. New York: The Clarendon Press, Oxford University Press, Oxford Science Publications, 1996.

16. Meek C. Strong completeness and faithfulness in Bayesian networks. In Proceedings of the eleventh annual conference on uncertainty in artificial intelligence (UAI-95), San Francisco, CA: Morgan Kaufmann, 1995:411–18.

17. Chickering DM. A transformational characterization of equivalent Bayesian network structures. In Proceedings of the eleventh annual conference on uncertainty in artificial intelligence (UAI-95), San Francisco, CA: Morgan Kaufmann, 1995:87–98.

18. Rhodes J. A concise proof of Kruskal’s theorem on tensor decomposition. Linear Algebra Appl 2010;432:1818–24.

19. Stegeman A, Sidiropoulos ND. On Kruskal’s uniqueness condition for the Candecomp/Parafac decomposition. Linear Algebra Appl 2007;420:540–52.

About the article

Published Online: 2014-12-03

Published in Print: 2015-09-01


Research funding: American Institute of Mathematics Structured Quartet Research Ensemble grant.


Citation Information: Journal of Causal Inference, ISSN (Online) 2193-3685, ISSN (Print) 2193-3677, DOI: https://doi.org/10.1515/jci-2014-0021.