In this article, we propose a new method to learn the underlying acyclic mixed graph of a linear non-Gaussian structural equation model with given observational data. We build on an algorithm proposed by Wang and Drton, and we show that one can augment the hidden variable structure of the recovered model by learning multidirected edges rather than only directed and bidirected ones. Multidirected edges appear when more than two of the observed variables have a hidden common cause. We detect the presence of such hidden causes by looking at higher order cumulants and exploiting the multi-trek rule. Our method recovers the correct structure when the underlying graph is a bow-free acyclic mixed graph with potential multidirected edges.
Building on the theory of causal discovery from observational data, we propose a method to learn linear non-Gaussian structural equation models with hidden variables. Importantly, we allow each hidden variable to be a parent of multiple of the observed variables, and we represent this graphically via multidirected edges. Therefore, given observational data, we seek to find an acyclic mixed graph whose vertices correspond to the observed variables, and which has directed and multidirected edges.
Consider an acyclic mixed graph $G = (V, D, M)$, where $V$ is the set of vertices, $D$ is the set of directed edges, and $M$ is the set of multidirected edges, i.e., $M$ consists of tuples of vertices such that the vertices in each tuple have a hidden common cause. The graph in Figure 1c has only directed and bidirected edges, while the one in Figure 1b also has a 3-directed edge. Both of these mixed graphs represent the directed acyclic graph (DAG) in Figure 1a with the variables 1 and 5 unobserved. Representing hidden variables via such edges is quite common in causal discovery.
We restrict our attention to the class of linear structural equation models (LSEMs). A mixed graph $G$ gives rise to an LSEM, which is the set of all joint distributions of a random vector $X = (X_v)_{v \in V}$ such that the variable $X_v$ associated with vertex $v$ is a linear function of a noise term $\varepsilon_v$ and the variables $X_w$, as $w$ varies over the set of parents of $v$, denoted $\mathrm{pa}(v)$ (i.e., the set of all vertices $w$ such that $w \to v \in D$). Thus,

$$X_v = \sum_{w \in \mathrm{pa}(v)} b_{vw} X_w + \varepsilon_v \quad \text{for all } v \in V.$$
When the variables are Gaussian, we are only able to recover the mixed graph up to Markov equivalence from observational data [1,2]. When the variables are non-Gaussian, however, it is possible to recover the full graph from observational data. This line of work originated with the paper by Shimizu et al., which showed that a linear non-Gaussian structural equation model corresponding to a DAG, i.e., a graph without confounders, can be identified from observational data using independent component analysis (ICA). Instead of ICA, the subsequent DirectLiNGAM and Pairwise LiNGAM methods use an iterative procedure to estimate a causal ordering; Wang and Drton gave a modified method that is also consistent in high-dimensional settings in which the number of variables exceeds the sample size.
Hoyer et al. considered the setting where the data are generated by a linear non-Gaussian acyclic model (LiNGAM), but some variables are unobserved. Their method, like ours, recovers an acyclic mixed graph with multidirected edges. However, it uses overcomplete ICA and requires the number of latent variables in the model to be known in advance. Furthermore, current implementations of overcomplete ICA algorithms often suffer from local optima and cannot guarantee convergence to a global one. To avoid using overcomplete ICA while still identifying unobserved confounders, Tashiro et al. proposed a procedure, called ParcelLiNGAM, which tests subsets of observed variables. Wang and Drton, however, showed that this procedure works whenever the graph is ancestral and proposed a new procedure, called BANG, which uses patterns in higher-order moment data and can identify bow-free acyclic mixed graphs, a set of mixed graphs much larger than the set of ancestral graphs. The BANG algorithm, however, recovers a mixed graph which only contains directed and bidirected edges.
Our method builds on the BANG procedure. We can identify a bow-free acyclic mixed graph, and, in addition, join bidirected edges together into a multidirected edge whenever more than two of the observed variables have a hidden common cause. Note, however, that while the BANG algorithm works for all types of confounding, we require that the unobserved confounding is linear. We do believe our results could be extended to work in the case of non-linear confounding, but we have left it to future work.
The rest of the article is organized as follows. In Section 2, we give the necessary background on mixed graphs, LSEMs, and multi-treks. In Section 3, we present our algorithm (MBANG) and we prove that it recovers the correct graph as long as the empirical moments are close enough to the population moments. In Section 4, we present numerical results, including simulations of different graphs and error distributions, and an application of our algorithm to a real dataset. We conclude with a short discussion in Section 5.
In this section, we introduce the key concepts that we use throughout the article, including mixed graphs with multidirected edges, LSEMs, multi-treks, and cumulants.
2.1 Mixed graphs with multidirected edges
The notion of a mixed graph is widely used in graphical modeling, where bidirected edges depict unobserved confounding. In this article, a mixed graph is also allowed to contain multidirected edges to depict slightly more complicated unobserved confounding.
We call $G = (V, D, M)$ a mixed graph, where $V$ is the set of vertices, $D$ is the set of directed edges, and $M$ is the set of multidirected edges (all $k$-directed edges for $k \geq 2$) (see part (b)).
For $k \geq 2$, a $k$-directed (or multidirected) edge between distinct nodes $v_1, \ldots, v_k$, denoted by $v_1 \leftrightarrow \cdots \leftrightarrow v_k$, is the union of $k$ directed edges $h \to v_1, \ldots, h \to v_k$ with the same hidden source $h$ and sinks $v_1, \ldots, v_k$, respectively. Multidirected edges are unordered (Figure 2).
Our method is able to recover bow-free acyclic mixed graphs. A mixed graph is bow-free if it does not contain a bow, where a bow consists of two vertices $v, w$ such that there is both a directed and a multidirected edge between $v$ and $w$. In other words, $v \to w \in D$ and there exists $m \in M$ such that $v, w \in m$. A mixed graph is acyclic if it does not contain any directed cycles, where a directed cycle is a sequence of directed edges of the form $v_1 \to v_2 \to \cdots \to v_k \to v_1$ (Figure 3).
Acyclic mixed graphs with potential multidirected edges are all one can hope to recover from observational data. Suppose for a moment that we have data coming from a DAG where a subset of the variables is unobserved, e.g., consider the DAG in Figure 4, where we only observe variables 1, 4, and 5.
Hoyer et al. showed that any DAG model in which some of the variables are unobserved is observationally and causally equivalent to a unique canonical model, where a canonical model is a non-Gaussian LSEM corresponding to an acyclic mixed graph such that none of the latent variables have any parents, and each latent variable has at least two children. This means that the distribution of the observed variables in the original model is identical to that in the canonical model, and causal relationships of observed variables in both models are identical. Therefore, we can focus our attention on the set of canonical models, which can be represented precisely by acyclic mixed graphs with multidirected edges. The graph in Figure 4b is the canonical model corresponding to the one in Figure 4a.
Our algorithm recovers the canonical model equivalent to the original model, except in situations in which there is more than one unobserved cause of several of the observed variables. More precisely, if $v_1, \ldots, v_k$ are observed and $h$ is an unobserved cause whose children are precisely $v_1, \ldots, v_k$, then no other unobserved cause whose children are a subset of $\{v_1, \ldots, v_k\}$ will be detected.
Let $G = (V, D, M)$ be an acyclic mixed graph as defined in the previous section. It induces a statistical model, called an LSEM, for the joint distribution of a collection of random variables $(X_v)_{v \in V}$, indexed by the graph's vertices. The model hypothesizes that each variable $X_v$ is a linear function of its parent variables and a noise term $\varepsilon_v$:

$$X_v = \mu_v + \sum_{w \in \mathrm{pa}(v)} b_{vw} X_w + \varepsilon_v, \quad v \in V, \qquad (1)$$
where $\mathrm{pa}(v)$ is the set of parents of vertex $v$. The noise variables $\varepsilon_v$ are assumed to have mean 0, and the coefficients $\mu_v$ and $b_{vw}$ are unknown real parameters. Since we can center the variables $X_v$, we assume that the coefficients $\mu_v$ are all equal to 0. In addition, the multidirected edge structure $M$ of $G$ defines dependencies between the noise terms $\varepsilon_v$; that is, if $v$ and $w$ are not connected by a multidirected edge, then $\varepsilon_v$ and $\varepsilon_w$ are independent variables. In particular, if $G$ is a DAG, i.e., it does not have any multidirected edges, then all noise terms are mutually independent. Typically termed a system of structural equations, the system (1) specifies cause–effect relations whose straightforward interpretability explains the widespread use of these models [10,11].
We can rewrite the system (1) in matrix form as

$$X = BX + \varepsilon,$$
where $X = (X_v)_{v \in V}$, $B = (b_{vw})$ is the coefficient matrix satisfying $b_{vw} = 0$ whenever $w \notin \mathrm{pa}(v)$, and $\varepsilon = (\varepsilon_v)_{v \in V}$ is the noise vector. Note that since $G$ is assumed to be acyclic, we can permute the rows and columns of the coefficient matrix $B$ and apply the same permutation to the entries of the vector $X$ so that the matrix $B$ becomes strictly lower triangular. Therefore, the matrix $I - B$ is invertible, and the system (1) can further be rewritten as

$$X = (I - B)^{-1}\varepsilon.$$
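As a quick numerical sanity check of this reduced form, the following sketch simulates a small LSEM by solving $X = (I - B)^{-1}\varepsilon$; the graph, coefficients, and noise distribution are arbitrary choices of ours, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-vertex DAG 1 -> 2 -> 3: in a topological order the
# coefficient matrix B is strictly lower triangular, so I - B is invertible.
p, n = 3, 1000
B = np.array([[0.0,  0.0, 0.0],
              [0.8,  0.0, 0.0],
              [0.0, -0.5, 0.0]])   # B[v, w] = direct effect b_{vw} of w on v

eps = rng.uniform(-1, 1, size=(p, n))     # non-Gaussian, mean-zero noise
X = np.linalg.solve(np.eye(p) - B, eps)   # X = (I - B)^{-1} eps

# The samples satisfy the structural equations X_v = sum_w b_{vw} X_w + eps_v.
assert np.allclose(X, B @ X + eps)
```

Each column of `X` is one joint sample; solving the linear system once is equivalent to propagating the noise through the graph in topological order.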
2.3 Cumulants and the multi-trek rule
We recall the notion of a cumulant tensor for a random vector $X$.
Let $X = (X_1, \ldots, X_n)$ be a random vector of length $n$. The $k$th cumulant tensor of $X$ is defined to be the $n \times \cdots \times n$ ($k$ times) table $C^{(k)} = (c_{i_1 \cdots i_k})$, whose entry at position $(i_1, \ldots, i_k)$ is

$$c_{i_1 \cdots i_k} = \mathrm{cum}(X_{i_1}, \ldots, X_{i_k}) = \sum_{(A_1, \ldots, A_L)} (-1)^{L-1} (L-1)!\, E\Big[\prod_{j \in A_1} X_j\Big] \cdots E\Big[\prod_{j \in A_L} X_j\Big],$$

where the sum is taken over all partitions $(A_1, \ldots, A_L)$ of the set $\{i_1, \ldots, i_k\}$.
Note that if each of the variables $X_i$ has zero mean, then we can restrict the sum to partitions in which each block $A_j$ has size at least two. For example:

$$\mathrm{cum}(X_i, X_j, X_k) = E[X_i X_j X_k],$$
$$\mathrm{cum}(X_i, X_j, X_k, X_l) = E[X_i X_j X_k X_l] - E[X_i X_j]E[X_k X_l] - E[X_i X_k]E[X_j X_l] - E[X_i X_l]E[X_j X_k].$$
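For zero-mean samples, these two formulas can be estimated directly from empirical moments. The following sketch (function names are ours) checks them on Gaussian data, for which all cumulants of order at least three vanish:

```python
import numpy as np

def cum3(x, y, z):
    """Third joint cumulant of zero-mean samples: E[XYZ]."""
    return np.mean(x * y * z)

def cum4(w, x, y, z):
    """Fourth joint cumulant of zero-mean samples:
    E[WXYZ] - E[WX]E[YZ] - E[WY]E[XZ] - E[WZ]E[XY]."""
    return (np.mean(w * x * y * z)
            - np.mean(w * x) * np.mean(y * z)
            - np.mean(w * y) * np.mean(x * z)
            - np.mean(w * z) * np.mean(x * y))

rng = np.random.default_rng(1)
g = rng.standard_normal(200_000)
# For Gaussian data both estimates should be close to zero.
print(cum3(g, g, g), cum4(g, g, g, g))
```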
We now recall the notion of a multi-trek from ref. .
A $k$-trek in a mixed graph $G$ between nodes $v_1, \ldots, v_k$ is an ordered collection of $k$ directed paths $(P_1, \ldots, P_k)$, where $P_i$ has sink $v_i$, and either all $P_i$ have the same source vertex, or there exists a multidirected edge $m \in M$ such that the sources of all $P_i$ lie in $m$ (Figure 5).
The following consequence of the multi-trek rule connects cumulants of LSEMs to multi-treks.
Let $G = (V, D, M)$ be an acyclic mixed graph, and let $i_1, \ldots, i_k \in V$. Then,

$$\mathrm{cum}(X_{i_1}, \ldots, X_{i_k}) = 0$$

for the $k$th cumulant tensor of any random vector $X$ whose distribution lies in the LSEM corresponding to $G$ if and only if there is no $k$-trek between $i_1, \ldots, i_k$ in $G$.
Note that the first part of Theorem 1 assumes that the cumulant vanishes for all distributions in the model. Because the model is the image of an algebraic map, the authors of ref. show that the theorem statement also holds for generic error distributions. In other words, whenever the error distribution is generic non-Gaussian, $\mathrm{cum}(X_{i_1}, \ldots, X_{i_k}) = 0$ if and only if the vertices $i_1, \ldots, i_k$ do not have a common ancestor (either latent or observed).
3 Main result
In this section, we present our algorithm and we show that it recovers the correct graph given enough samples. In the chart below, we illustrate how the algorithm works when applied to observational data coming from the graph in Figure 1a.
Given a data matrix $\mathbf{X}$ whose columns are i.i.d. samples from an LSEM on an unknown bow-free acyclic mixed graph $G = (V, D, M)$ with an unknown direct effects matrix $B$, we aim to recover the graph $G$ and the matrix $B$. The first step is to apply the BANG procedure and obtain an estimated coefficient matrix $\hat{B}$ together with a bow-free acyclic mixed graph $G' = (V, D, M')$, which contains directed and bidirected edges only. The bidirected edges $M'$ are obtained from $M$ by replacing each multidirected edge $v_1 \leftrightarrow \cdots \leftrightarrow v_k$ by bidirected edges $v_i \leftrightarrow v_j$, one for each pair $i < j$. We call such a set $M'$ the bidirected subdivision of $M$. We then "remove" the directed edges by removing the direct effects given by $\hat{B}$. This is done by replacing the original data matrix $\mathbf{X}$ with $\mathbf{Y} = (I - \hat{B})\mathbf{X}$. This new matrix can be thought of as observations from an LSEM corresponding to the acyclic mixed graph $\bar{G} = (V, \emptyset, M)$, which is the same as $G$ with the directed edges removed. However, we only know the bidirected subdivision $\bar{G}' = (V, \emptyset, M')$ of $\bar{G}$. Finally, using the higher order cumulants of $\mathbf{Y}$ and a clique-finding algorithm on the graph $\bar{G}'$, we identify all multidirected edges in $M$. The algorithm is summarized below.
|Algorithm 1: MBANG procedure|
|1:||Input: $\mathbf{X}$ arising from an unknown LSEM on $G = (V, D, M)$ with unknown direct effects matrix $B$.|
|2:||Apply the BANG procedure to estimate the coefficient matrix $\hat{B}$ and the mixed graph $G' = (V, D, M')$.|
|3:||Let $\mathbf{Y} = (I - \hat{B})\mathbf{X}$ be the new observation matrix, corresponding to the graph $\bar{G} = (V, \emptyset, M)$ which has no directed edges.|
|4:||Initialize $R = \emptyset$, $P = V$, $X = \emptyset$.|
|5:||Apply Algorithm 2 to $\mathbf{Y}$ and $\bar{G}' = (V, \emptyset, M')$ to recover the set of multidirected edges $M$.|
|6:||Output: the graph $G = (V, D, M)$ and the matrix $\hat{B}$.|
Algorithm 2 finds those multidirected edges that contain the most vertices and are consistent with the cumulant structure of the data matrix $\mathbf{Y}$. It is based on the Bron–Kerbosch algorithm for finding all cliques in an undirected graph, applied to the bidirected edges in $\bar{G}'$.
3.1 The BANG procedure
In this section, we briefly describe the BANG procedure. It is an algorithm which takes as input a data matrix $\mathbf{X}$ and returns a bow-free acyclic mixed graph $G'$, which contains directed and bidirected edges only, consistent with $\mathbf{X}$, as well as a direct effects matrix $\hat{B}$.
It first identifies sibling and ancestor relations, and then it distinguishes parent from non-parent ancestors. Recall that two vertices are siblings if there is a multidirected edge between them, a vertex $v$ is a parent of a vertex $w$ if there is a directed edge $v \to w$, and a vertex $v$ is an ancestor of a vertex $w$ if there is a directed path $v \to \cdots \to w$.
However, the BANG algorithm returns bow-free acyclic mixed graphs in which the latent variable structure is recorded using bidirected edges only, i.e., it recovers the bidirected subdivision $M'$ of the true set of multidirected edges $M$. This means it cannot determine whether more than two vertices have a common cause. We resolve this problem using cumulant information.
3.2 Finding the multidirected edge structure
More generally, as depicted in Figure 8, we first apply the BANG algorithm to the data matrix $\mathbf{X}$ arising from an LSEM on $G = (V, D, M)$ to obtain an estimate $\hat{B}$ of the direct effects matrix and the bow-free acyclic mixed graph $G' = (V, D, M')$ with directed and bidirected edges only, where there is a bidirected edge between $v$ and $w$ in $M'$ if and only if there is a multidirected edge in $M$ containing both $v$ and $w$, i.e., $M'$ is the bidirected subdivision of $M$. Then, we form a new data matrix $\mathbf{Y} = (I - \hat{B})\mathbf{X}$, and we consider the graph $\bar{G} = (V, \emptyset, M)$, which contains only multidirected edges. In other words, the data matrix $\mathbf{Y}$ can be thought of as a matrix of samples coming from an LSEM on the graph $\bar{G}$. However, we only know the graph $\bar{G}' = (V, \emptyset, M')$. The bidirected edges recovered by the BANG algorithm can now be "merged" together to obtain the true multidirected edge structure $M$. Since $\bar{G}'$ contains only bidirected edges, we look for all cliques (of bidirected edges) $v_1, \ldots, v_k$ in $\bar{G}'$ such that $\mathrm{cum}(Y_{v_1}, \ldots, Y_{v_k}) \neq 0$, where $\mathrm{cum}(Y_{v_1}, \ldots, Y_{v_k})$ is the $k$th cumulant of $Y_{v_1}, \ldots, Y_{v_k}$. We do this by adapting the Bron–Kerbosch algorithm, which is used for finding all maximal cliques in an undirected graph.
Algorithm 2 is a direct modification of the Bron–Kerbosch algorithm. It is a recursive algorithm that maintains three disjoint sets of vertices $R$, $P$, and $X$, and aims to output all maximal cliques in the graph which contain all vertices in $R$, do not contain any of the vertices in $X$, and could contain some of the vertices in $P$. The only addition that Algorithm 2 makes to the Bron–Kerbosch algorithm is the test of whether or not specific cumulants are non-zero in Line 7. In this cumulant test, if several vertices have a common ancestor, their higher order cumulant is non-zero by Theorem 1. In our algorithm, we consider the bidirected cliques of the graph $\bar{G}'$, and we use the cumulant test to see if the vertices in such a clique have a common ancestor. If that is the case, then they should be connected by one multidirected edge rather than a clique of bidirected edges.
|Algorithm 2: Determine multidirected edges|
|1:||Input: the data matrix $\mathbf{Y}$, the graph $\bar{G}'$, and disjoint vertex sets $R$, $P$, $X$. Denote the elements of $R$ to be $v_1, \ldots, v_m$.|
|2:||if $P$ and $X$ are both empty then|
|3:||Report $R$ as a multidirected edge.|
|4:||end if|
|5:||for all vertices $v \in P$ do|
|6:||if $R \subseteq N(v)$, where $N(v)$ is the set of vertices adjacent to $v$ in $\bar{G}'$, then|
|7:||if $\mathrm{cum}(Y_{v_1}, \ldots, Y_{v_m}, Y_v) \neq 0$ or $\exists\, i$ s.t. $\mathrm{cum}(Y_{v_i}, Y_{v_1}, \ldots, Y_{v_m}, Y_v) \neq 0$ then|
|8:||call Algorithm 2 with $R \cup \{v\}$, $P \cap N(v)$, and $X \cap N(v)$.|
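The recursion can be sketched as follows. This is our own rendering rather than the authors' implementation: cumulant estimation is only implemented up to order four, a fixed threshold `tol` stands in for the vanishing test, and cliques of size at least three are reported as multidirected edges while pairs stay bidirected:

```python
import numpy as np

def cum(Y, idx):
    """Joint cumulant of the rows of Y listed in idx (zero-mean data).
    For this sketch only orders 2-4 are implemented."""
    r = [Y[i] for i in idx]
    if len(idx) == 2:
        return np.mean(r[0] * r[1])
    if len(idx) == 3:
        return np.mean(r[0] * r[1] * r[2])
    if len(idx) == 4:
        pair = lambda a, b: np.mean(r[a] * r[b])
        return (np.mean(r[0] * r[1] * r[2] * r[3])
                - pair(0, 1) * pair(2, 3)
                - pair(0, 2) * pair(1, 3)
                - pair(0, 3) * pair(1, 2))
    raise NotImplementedError("sketch handles orders 2-4 only")

def nonzero_cum(Y, R, v, tol=0.05):
    """The Line 7 test, including the relaxed repeated-index variant."""
    idx = sorted(R) + [v]
    if abs(cum(Y, idx)) > tol:
        return True
    if len(idx) >= 4:          # a repeated index would exceed order 4
        return False
    return any(abs(cum(Y, [i] + idx)) > tol for i in idx)

def algorithm2(Y, adj, R, P, X, out, tol=0.05):
    """Bron-Kerbosch recursion pruned by the cumulant test."""
    if not P and not X:
        if len(R) >= 3:        # size-2 cliques remain bidirected edges
            out.append(sorted(R))
        return
    for v in list(P):
        if R <= adj[v] and (len(R) < 2 or nonzero_cum(Y, R, v, tol)):
            algorithm2(Y, adj, R | {v}, P & adj[v], X & adj[v], out, tol)
        P = P - {v}
        X = X | {v}

# Toy data: one hidden exponential cause for rows 0,1,2, another for rows 3,4.
rng = np.random.default_rng(5)
n = 200_000
h1 = rng.exponential(1, n) - 1
h2 = rng.exponential(1, n) - 1
Y = np.vstack([h1 + rng.uniform(-1, 1, n) for _ in range(3)]
              + [h2 + rng.uniform(-1, 1, n) for _ in range(2)])
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}  # bidirected subdivision

out = []
algorithm2(Y, adj, set(), set(range(5)), set(), out)
print(out)
```

On this toy data the bidirected triangle on rows 0, 1, 2 has a clearly non-vanishing third cumulant (the centered exponential has third moment 2), so it is merged into a single 3-directed edge, while the pair 3, 4 is left as a bidirected edge.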
In practice, since Theorem 1 holds only for generic distributions, random variables consistent with a model that has a multidirected edge between $v_1, \ldots, v_k$ may still have a zero cumulant. For example, if $Y_1$, $Y_2$, and $Y_3$ are all caused by a hidden parent variable whose distribution is symmetric around 0, we could obtain $\mathrm{cum}(Y_1, Y_2, Y_3) = 0$. To solve this problem, we relax the criterion $\mathrm{cum}(Y_{v_1}, \ldots, Y_{v_k}) \neq 0$. In our implementation, we also check if there exists $i$ such that $\mathrm{cum}(Y_{v_i}, Y_{v_1}, \ldots, Y_{v_k}) \neq 0$. Note that there is a $k$-trek between $v_1, \ldots, v_k$ if and only if there is a $(k+1)$-trek between $v_i, v_1, \ldots, v_k$ for any $i \in \{1, \ldots, k\}$. If there is a multidirected edge between $v_1, \ldots, v_k$ but $\mathrm{cum}(Y_{v_1}, \ldots, Y_{v_k}) = 0$ due to non-genericity of the model, we find that oftentimes $\mathrm{cum}(Y_{v_i}, Y_{v_1}, \ldots, Y_{v_k}) \neq 0$, which allows us to reach the correct conclusion. This is illustrated in the following example.
Let $h$ be a hidden variable, and let

$$Y_i = h + \varepsilon_i, \quad i = 1, 2, 3,$$

where $h, \varepsilon_1, \varepsilon_2, \varepsilon_3$ are i.i.d. symmetric with mean 0. Then we have

$$E[Y_1 Y_2 Y_3] = E[h^3] = 0,$$

which implies $\mathrm{cum}(Y_1, Y_2, Y_3) = 0$. At the same time,

$$\mathrm{cum}(Y_1, Y_1, Y_2, Y_3) = \mathrm{cum}(h, h, h, h) = E[h^4] - 3E[h^2]^2.$$

This shows that $\mathrm{cum}(Y_1, Y_1, Y_2, Y_3)$ is likely to be non-zero.
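This effect is easy to verify numerically. In the sketch below (our own illustration), $h$ is uniform on $[-1, 1]$, so the repeated-index cumulant equals $E[h^4] - 3E[h^2]^2 = 1/5 - 1/3 = -2/15$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

h = rng.uniform(-1, 1, n)                 # symmetric hidden cause: E[h^3] = 0
Y1, Y2, Y3 = (h + rng.uniform(-1, 1, n) for _ in range(3))

c3 = np.mean(Y1 * Y2 * Y3)                # estimates cum(Y1, Y2, Y3) = E[h^3]

# cum(Y1, Y1, Y2, Y3) = E[Y1^2 Y2 Y3] - E[Y1^2]E[Y2 Y3] - 2 E[Y1 Y2]E[Y1 Y3]
c4 = (np.mean(Y1 * Y1 * Y2 * Y3)
      - np.mean(Y1 * Y1) * np.mean(Y2 * Y3)
      - 2 * np.mean(Y1 * Y2) * np.mean(Y1 * Y3))

print(c3, c4)   # c3 is near 0 while c4 is near -2/15
```

The third-order test alone would miss this 3-directed edge, but the repeated-index fourth cumulant detects it.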
3.3 Theoretical results
In this section, we show that Algorithm 1 recovers the correct bow-free acyclic mixed graph given enough samples. We begin with a population moment result.
Suppose $X$ is generated from an LSEM corresponding to a bow-free acyclic mixed graph $G$. Then for a generic choice of the coefficient matrix $B$ and generic error moments, when given the population moments of $X$, Algorithm 1 produces the correct graph $G$.
Note that the only assumptions that the BANG procedure makes are that the graph is bow-free acyclic and the error terms are non-Gaussian, i.e., the error terms are generic.
Algorithm 1 first applies the BANG procedure to produce a bow-free acyclic mixed graph $G'$ containing bidirected and directed edges only, as well as the coefficient matrix $\hat{B}$. In ref. , the authors show that the BANG algorithm recovers the correct bow-free acyclic mixed graph (with directed and bidirected edges only) and the correct coefficient matrix $B$. Note that this means that the graph $G'$ will have a bidirected edge between any two nodes $v$ and $w$ which are part of a multidirected edge in $G$. Consider the data matrix $\mathbf{Y} = (I - B)\mathbf{X}$, which corresponds to the graph $\bar{G}$ obtained by removing all directed edges in $G$, and the graph $\bar{G}'$ obtained from $G'$ by removing all directed edges. Our task is to merge together some of the bidirected edges in $\bar{G}'$ in order to obtain the graph $\bar{G}$.
There is a multidirected edge between $v_1, \ldots, v_k$ in $\bar{G}$ (or, equivalently, in $G$) if and only if any two of $v_1, \ldots, v_k$ are joined by a bidirected edge in $\bar{G}'$, i.e., they form a clique, and

$$\mathrm{cum}(Y_{v_1}, \ldots, Y_{v_k}) \neq 0, \qquad (2)$$

where $\mathrm{cum}(Y_{v_1}, \ldots, Y_{v_k})$ is the $k$th cumulant of $Y_{v_1}, \ldots, Y_{v_k}$ (assuming the distribution of $\varepsilon$ is generic). Therefore, finding all multidirected edges is equivalent to finding all maximal cliques in $\bar{G}'$ for which (2) holds. This is precisely what Algorithm 2 does – it applies the Bron–Kerbosch algorithm for finding all cliques in an undirected graph (which can equally well be applied to the graph $\bar{G}'$ since it only has bidirected edges) and for each such clique it checks whether (2) holds.
Therefore, Algorithm 1 produces the correct graph .□
Suppose the columns of $\mathbf{X}$ are generated by an LSEM which corresponds to a bow-free acyclic mixed graph $G$. Then, for generic choices of the effect coefficient matrix $B$ and generic error moments, there exist $\epsilon, \delta_1, \delta_2 > 0$ such that if the sample moments are within a ball of radius $\epsilon$ of the population moments of $X$, then Algorithm 1 will produce the correct graph $G$ when comparing the absolute value of the sample statistics to $\delta_1$ as a proxy for the independence tests in the BANG procedure and when comparing the absolute value of the cumulants to $\delta_2$ as a proxy in the tests for the vanishing of cumulants in Algorithm 2.
First, consider the adjusted samples $\mathbf{Y} = (I - \hat{B})\mathbf{X}$. Note that by definition the cumulants of $Y$ have the form

$$\mathrm{cum}(Y_{i_1}, \ldots, Y_{i_k}) = \sum_{(A_1, \ldots, A_L)} (-1)^{L-1} (L-1)!\, E\Big[\prod_{j \in A_1} Y_j\Big] \cdots E\Big[\prod_{j \in A_L} Y_j\Big],$$

which is a rational function of the moments of $Y$. Since $Y = (I - \hat{B})X$, the coordinates of $Y$ are linear combinations of the coordinates of $X$ with coefficients given by the entries of the matrix $I - \hat{B}$. Therefore, the cumulants of $Y$ can be expressed as rational functions of the moments of $X$ and the entries of $\hat{B}$. Thus, for all $k$, every cumulant of $Y$ of order $k$ is a continuous function of finitely many moments of $X$ and of $\hat{B}$.
Then, there exists $\epsilon_1 > 0$ such that whenever the empirical moments of $X$ are within a ball of radius $\epsilon_1$ of the population moments and the estimated matrix $\hat{B}$ is within $\epsilon_1$ of the true one, all of the estimated cumulants of $Y$ are within a ball of radius $\delta_2$ of the corresponding population cumulants of $Y$.
Next, we know from ref.  that there exist $\epsilon_2, \delta_1 > 0$ such that if the sample moments of $X$ are within a ball of radius $\epsilon_2$ of the population moments, BANG will output the correct graph $G'$ (where $M'$ is the bidirected subdivision of $M$) if it uses $\delta_1$ as a proxy for its independence tests. And, shrinking $\epsilon_2$ if necessary, if the sample moments are within a ball of radius $\epsilon_2$ of the population moments, the estimated direct effects matrix $\hat{B}$ will be within a ball of radius $\epsilon_1$ of the true direct effects matrix. This is because the estimated matrix $\hat{B}$ is a rational function of the sample moments of $X$.
Thus, choosing $\epsilon = \min(\epsilon_1, \epsilon_2)$, we see that if the sample moments of $X$ are within $\epsilon$ of the population moments, and if we use $\delta_1$ as a proxy for the independence tests in BANG and $\delta_2$ as a proxy in the tests for vanishing cumulants, the BANG procedure and the test for cumulants being zero or not will produce the correct results, and, therefore, Algorithm 1 will yield the correct graph $G$.□
4 Numerical results
In this section, we examine how well Algorithm 1 performs numerically.
We carried out a number of simulations using four different types of error distributions: a uniform distribution; the Student's $t$-distribution with 10 degrees of freedom; a Gamma distribution; and the chi-squared distribution with 2 degrees of freedom. We shift these distributions to have zero mean. The $t$-distribution is used to test the performance of the algorithm in cases when the error distribution resembles a Gaussian.
We generate random bow-free acyclic mixed graphs as follows. First, we uniformly select a prescribed number of directed edges among all pairs of vertices (respecting a fixed topological order, so that the result is a DAG), and we choose the direct effect coefficient for each edge uniformly at random from a fixed interval. Afterward, some of the vertices of this DAG which have no parents are regarded as unobserved variables, and bow structures caused by marginalizing these vertices are removed from the graph. The resulting graph is a bow-free acyclic mixed graph with potential multidirected edges.
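A sketch of this generation procedure is below. The coefficient interval $[-1, 1]$, the number of hidden vertices, and the choice of which parentless vertices to hide are our own assumptions; the article does not pin them down:

```python
import numpy as np
from itertools import combinations

def random_bow_free_graph(p=7, n_directed=8, n_hidden=2, rng=None):
    """Sample a DAG on p vertices, hide parentless vertices, and return
    the induced bow-free acyclic mixed graph on the observed vertices."""
    if rng is None:
        rng = np.random.default_rng()
    # 1. Sample a DAG: directed edges i -> j with i < j (a topological order).
    candidates = list(combinations(range(p), 2))
    idx = rng.choice(len(candidates), size=n_directed, replace=False)
    directed = {candidates[i] for i in idx}
    # 2. Hide some vertices that have no parents.
    parentless = [v for v in range(p) if all(j != v for (_, j) in directed)]
    hidden = set(parentless[:n_hidden])
    observed = [v for v in range(p) if v not in hidden]
    # 3. A hidden vertex with >= 2 children yields a multidirected edge among
    #    them (its children are observed, since hidden vertices are parentless).
    multi = [tuple(sorted(j for (i, j) in directed if i == h)) for h in hidden]
    multi = [m for m in multi if len(m) >= 2]
    # 4. Remove bows: drop directed edges between confounded pairs.
    confounded = {frozenset(pq) for m in multi for pq in combinations(m, 2)}
    directed_obs = {e for e in directed
                    if e[0] in observed and frozenset(e) not in confounded}
    # Direct effect coefficients, uniform on [-1, 1] (assumed interval).
    coeff = {e: rng.uniform(-1, 1) for e in directed_obs}
    return observed, directed_obs, multi, coeff

obs, D, M, coeff = random_bow_free_graph(rng=np.random.default_rng(3))
print(obs, D, M)
```

By construction, the output is acyclic (all edges respect the vertex order) and bow-free (no directed edge survives between two vertices sharing a hidden cause).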
To evaluate the performance of our method, we calculate the proportion of trials in which the algorithm recovers the true graph. We generate graphs with seven vertices, and after some of them are made hidden, the final graphs usually contain five or six observed variables. We test three settings with different levels of sparsity: sparse graphs have 5 directed edges before marginalization, medium graphs have 8, and dense graphs have 12. We do not restrict the number of multidirected edges; however, almost all dense graphs have multidirected edges, and most of the medium graphs have at least one multidirected edge.
In the first step of our algorithm, we perform the BANG procedure, and we set all the nominal levels of its hypothesis tests to the value suggested by the authors of the BANG algorithm. The tolerance value used in our cumulant tests is 0.05.
For each of the three settings (sparse, medium, and dense), we tested 100 graphs at each of three sample sizes: 10,000, 25,000, and 50,000.
In Figure 9, we show the proportion of graphs that were recovered precisely by the BANG and MBANG algorithms. For the BANG algorithm, we recognize each multidirected edge as the corresponding set of bidirected edges.
In Figure 10, we show the percentage of correctly identified bidirected and multidirected edges among all the true bidirected and multidirected edges for each setting, conditioning on the cases in which all bidirected edges have been correctly identified and a bidirected clique has been found for each true multidirected edge. The horizontal axis represents the sample size, and the vertical axis represents the percentage.
In practice, we normalize the data matrix $\mathbf{Y}$ by dividing each row by its standard deviation before initiating Algorithm 2 in order to control the variability of its cumulants.
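Assuming the variables sit in the rows of the adjusted data matrix, this normalization is one line (a sketch with random placeholder data):

```python
import numpy as np

rng = np.random.default_rng(4)
# Placeholder adjusted data: 5 variables (rows) on very different scales.
Y = rng.uniform(-1, 1, size=(5, 1000)) * np.array([[1.0], [10.0], [0.1], [2.0], [5.0]])

# Divide each row by its standard deviation so that all variables share a
# common scale before the cumulant tests.
Y_norm = Y / Y.std(axis=1, keepdims=True)
print(Y_norm.std(axis=1))   # every row now has unit standard deviation
```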
Among all 3,600 graphs we tested in this data set, the BANG algorithm recovered 1,883 correctly, and the MBANG algorithm recovered 1,774 correctly. Hence, given that the BANG result was correct, the MBANG algorithm identified 94.2% of the graphs correctly. Among all dense graphs, this proportion was reduced to 88.0%.
The drastic drop in empirical performance as the number of vertices increases could be partially due to multiple testing; setting a different cutoff for the cumulants or a smaller level for the independence tests may perform better.
4.2 A real data set
In a paper by Grace et al., an LSEM was used to examine the relationships between land productivity and the richness of plant diversity; the full model is shown in Figure 11a. Wang and Drton choose a subset of the variables and consider the model in Figure 11b as the ground truth model. Figure 12 shows the graphical model they discover using the BANG procedure with nominal test level 0.01.
Since this is a real data set, it is possible that some of the hidden variables affect more than one bidirected edge, or there exist other hidden variables that can affect the hidden variables detected in the BANG procedure. After applying our MBANG algorithm with 0.05 tolerance, we found that all bidirected edges can be grouped in three 3-directed edges: (“PlotSoilSuit,” “PlotProd,” “SiteBiomass”), (“PlotSoilSuit,” “SiteBiomass,” “SiteProd”), and (“PlotSoilSuit,” “SiteBiomass,” “SiteProd”), see Figure 13.
5 Further discussion
In this article, we proposed a high-order cumulant-based algorithm for discovering hidden variable structure in non-Gaussian LSEMs. Note that our Algorithm 2 can be used on top of any procedure that recovers a mixed graph with directed and bidirected edges only and an estimate of the direct effects matrix $B$. We chose the BANG algorithm as this first step since it applies to a large class of graphs: bow-free acyclic mixed graphs.
Second, the performance of our algorithm is closely related to the error distribution, and to the density and size of the graph. For example, when the errors are uniform and the number of vertices is 7, as the number of edges increases from 8 to 20, the correct rate drops from 78% to 1%. Also, if the density is medium, as the number of vertices increases from 5 to 7, the correct rate drops from 78% to about 40%. Compared to the uniform distribution, the exponential distribution shows a much milder drop in the correct rate. For example, when the sample size is 50,000, the correct rate for exp(1) drops from 70% to 61% as the graph density increases from sparse to very dense (almost complete). Note, however, that this drop is also present in the output of the BANG algorithm itself.
Finally, while we proved that Algorithm 1 finds the true graph given enough samples, it would be interesting to know the exact number of samples needed in both BANG and Algorithm 2, as well as the nominal levels for the independence tests in BANG and in our cumulant tests.
The authors thank Mathias Drton and Samuel Wang for helpful discussions regarding their algorithm. The authors also thank Jean-Baptiste Seby for a helpful discussion at an early stage of the project.
Funding information: ER was supported by an NSERC Discovery Grant (DGECR-2020-00338). YL was supported by a WLIURA in the summer of 2020.
Conflict of interest: Authors state no conflict of interest.
[1] Maathuis M, Drton M, Lauritzen S, Wainwright M, editors. Handbook of graphical models. Chapman and Hall/CRC handbooks of modern statistical methods. London, UK: CRC Press; 2019. MR3889064.
[2] Lauritzen S. Graphical models. Oxford: Clarendon Press; 1996.
[3] Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A. A linear non-Gaussian acyclic model for causal discovery. J Machine Learn Res. 2006;7:2003–30.
[4] Shimizu S, Inazumi T, Sogawa Y, Hyvärinen A, Kawahara Y, Washio T, et al. DirectLiNGAM: a direct method for learning a linear non-Gaussian structural equation model. J Machine Learn Res. 2011;12:1225–48.
[5] Hyvärinen A, Smith SM. Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. J Machine Learn Res. 2013;14:111–52.
[6] Wang YS, Drton M. High-dimensional causal discovery under non-Gaussianity. Biometrika. 2019;107(1):41–59.
[7] Hoyer PO, Shimizu S, Kerminen AJ, Palviainen M. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. Int J Approx Reason. 2008;49(2):362–78.
[8] Tashiro T, Shimizu S, Hyvärinen A, Washio T. ParceLiNGAM: a causal ordering method robust against latent confounders. Neural Comput. 2014;26(1):57–83.
[9] Wang YS, Drton M. Causal discovery with unobserved confounding and non-Gaussian data. Preprint: arXiv:2007.11131.
[10] Pearl J. Causality: models, reasoning, and inference. Second edn. Cambridge, UK: Cambridge University Press; 2009. MR2548166.
[11] Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. Adaptive computation and machine learning. Second edn. Cambridge, MA: MIT Press; 2000. With additional material by David Heckerman, Christopher Meek, Gregory F. Cooper and Thomas Richardson, A Bradford Book. MR1815675.
[12] Comon P, Jutten C. Handbook of blind source separation: independent component analysis and applications. Cambridge, MA, USA: Academic Press, Inc.; 2010.
[13] Robeva E, Seby J-B. Multi-trek separation in linear structural equation models. SIAM J Appl Algebra Geomet. 2021;5(2):278–303.
[14] Bron C, Kerbosch J. Algorithm 457: finding all cliques of an undirected graph. Commun ACM. 1973;16(9):575–7.
[15] Grace JB, Anderson TM, Seabloom EW, Borer ET, Adler PB, Harpole WS, et al. Integrative modelling reveals mechanisms linking productivity and plant species richness. Nature. 2016;529(7586):390–3.
© 2021 Yiheng Liu et al., published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.