# Learning linear non-Gaussian graphical models with multidirected edges

Yiheng Liu, Elina Robeva and Huanqing Wang

# Abstract

In this article, we propose a new method to learn the underlying acyclic mixed graph of a linear non-Gaussian structural equation model with given observational data. We build on an algorithm proposed by Wang and Drton, and we show that one can augment the hidden variable structure of the recovered model by learning multidirected edges rather than only directed and bidirected ones. Multidirected edges appear when more than two of the observed variables have a hidden common cause. We detect the presence of such hidden causes by looking at higher order cumulants and exploiting the multi-trek rule. Our method recovers the correct structure when the underlying graph is a bow-free acyclic mixed graph with potential multidirected edges.

MSC 2010: 11B83; 11C08; 33B10

## 1 Introduction

Building on the theory of causal discovery from observational data, we propose a method to learn linear non-Gaussian structural equation models with hidden variables. Importantly, we allow each hidden variable to be a parent of multiple of the observed variables, and we represent this graphically via multidirected edges. Therefore, given observational data, we seek to find an acyclic mixed graph whose vertices correspond to the observed variables, and which has directed and multidirected edges.

Consider an acyclic mixed graph G = ( V , D , ) , where V = { 1 , , p } is the set of vertices, D is the set of directed edges, and is the set of multidirected edges, i.e., consists of tuples of vertices such that the vertices in each tuple have a hidden common cause. The graph in Figure 1c has only directed and bidirected edges, while the one in Figure 1b also has a 3-directed edge. Both of these mixed graphs represent the directed acyclic graph (DAG) in Figure 1a with the variables 1 and 5 unobserved. Representing hidden variables via such edges is quite commonly used for causal discovery [1].

### Figure 1

The graph in (a) is directed acyclic. After marginalization of vertices 1 and 5, the resulting mixed graph with directed and bidirected edges only is that in (c). If we allow multidirected edges, we can capture the hidden variable structure better via the graph in Figure (b), which also has a 3-directed edge. (a) Example DAG. (b) Mixed graph. (c) Mixed graph.

We restrict our attention to the class of linear structural equation models (LSEMs). A mixed graph G gives rise to an LSEM, which is the set of all joint distributions of a random vector X = ( X 1 , , X p ) such that the variable X i associated with vertex i is a linear function of a noise term ε i and the variables X j , as j varies over the set of parents of i , denoted pa ( i ) (i.e., the set of all vertices j such that j i E ). Thus,

X i = j pa ( i ) b i j X j + ε i , i V .

When the variables X 1 , , X p are Gaussian, we are only able to recover the mixed graph up to Markov equivalence from observational data [1,2]. When the variables are non-Gaussian, however, it is possible to recover the full graph from observational data. This line of work originated with the paper [3] by Shimizu et al. in which a linear non-Gaussian structural equation model corresponding to a DAG, i.e., a graph without confounders, can be identified from observational data using independent component analysis (ICA). Instead of ICA, the subsequent DirectLiNGAM [4] and Pairwise LiNGAM [5] methods use an iterative procedure to estimate a causal ordering; Wang and Drton [6] gave a modified method that is also consistent in high-dimensional settings in which the number of variables p exceeds the sample size n .

Hoyer et al. [7] considered the setting where the data are generated by a linear non-Gaussian acyclic model (LiNGAM), but some variables are unobserved. Their method, like ours, recovers an acyclic mixed graph with multidirected edges. However, it uses overcomplete ICA and requires the number of latent variables in the model to be known in advance. Furthermore, current implementations of overcomplete ICA algorithms often suffer from local optima and cannot guarantee convergence to a global one. To avoid using overcomplete ICA while still identifying unobserved confounders, Tashiro et al. [8] proposed a procedure, called ParcelLiNGAM, which tests subsets of observed variables. Wang and Drton [9], however, showed that this procedure works whenever the graph is ancestral and proposed a new procedure, called BANG, which uses patterns in higher-order moment data and can identify bow-free acyclic mixed graphs, a set of mixed graphs much larger than the set of ancestral graphs. The BANG algorithm, however, recovers a mixed graph which only contains directed and bidirected edges.

Our method builds on the BANG procedure. We can identify a bow-free acyclic mixed graph, and, in addition, join bidirected edges together into a multidirected edge whenever more than two of the observed variables have a hidden common cause. Note, however, that while the BANG algorithm works for all types of confounding, we require that the unobserved confounding is linear. We do believe our results could be extended to work in the case of non-linear confounding, but we have left it to future work.

The rest of the article is organized as follows. In Section 2, we give the necessary background on mixed graphs, LSEMs, and multi-treks. In Section 3, we present our algorithm (MBANG) and we prove that it recovers the correct graph as long as the empirical moments are close enough the population moments. In Section 4, we present numerical results, including simulations of different graphs and error distributions, and an application of our algorithm to a real dataset. We conclude with a short discussion in Section 5.

## 2 Background

In this section, we introduce the key concepts that we use throughout the article, including mixed graphs with multidirected edges, LSEMs, multi-treks, and cumulants.

### 2.1 Mixed graphs with multidirected edges

The notion of a mixed graph is widely used in graphical modeling, where bidirected edges depict unobserved confounding. In this article, a mixed graph is also allowed to contain multidirected edges to depict slightly more complicated unobserved confounding.

### Definition 1

1. (a)

We call G = ( V , D , ) a mixed graph, where V = { 1 , , p } is the set of vertices, D V × V is the set of directed edges, and is the set of multidirected edges (all k -directed edges for k 2 ) (see part (b)).

2. (b)

For k 2 , a k -directed (or multidirected) edge between distinct nodes i 1 , , i k V , denoted by ( i 1 , , i k ) , is the union of k -directed edges with the same hidden source and sinks i 1 , , i k , respectively. Multidirected edges are unordered (Figure 2).

### Figure 2

Examples of multidirected edges. (a) A 2-directed (or bidirected) edge between i 1 , i 2 . (b) A 4-directed edge between i 1 , i 2 , i 3 , i 4 .

Our method is able to recover bow-free acyclic mixed graphs. A mixed graph is bow-free if it does not contain a bow, and a bow consists of two vertices i , j V such that there is both a directed and a multidirected edge between i and j . In other words, i j D and there exists h such that i , j h . A mixed graph is acyclic if it does not contain any directed cycles, where a directed cycle is a sequence of directed edges of the form i 1 i 2 , i 2 i 3 , , i i 1 (Figure 3).

### Figure 3

Examples of a bow and a cycle. (a) A bow between i and j . (b) A cycle.

Acyclic mixed graphs with potential multidirected edges are all one can hope to recover from observational data. Suppose for a moment that we have data coming from a DAG where a subset of the variables is unobserved, e.g., consider the DAG in Figure 4, where we only observe variables 1, 4, and 5.

### Figure 4

An example of two observationally and causally equivalent models from ref. [7]. (a) Original model. (b) Canonical model.

Hoyer et al. [7] showed that any DAG model in which some of the variables are unobserved is observationally and causally equivalent to a unique canonical model, where a canonical model is a non-Gaussian LSEM corresponding to an acyclic mixed graph G = ( V , D , ) such that none of the latent variables have any parents, and each latent variable has at least two children. This means that the distribution of the observed variables in the original model is identical to that in the canonical model, and causal relationships of observed variables in both models are identical. Therefore, we can focus our attention on the set of canonical models, which can be represented precisely by acyclic mixed graphs with multidirected edges. The graph in Figure 4b is the canonical model corresponding to the one in Figure 4a.

Our algorithm recovers the canonical model equivalent to the original model, except in situations in which there is more than one unobserved cause of several of the observed variables. More precisely, if i 1 , , i k are observed and is an unobserved cause whose children are precisely i 1 , , i k , then the presence of no other unobserved cause whose children are a subset of { i 1 , , i k } will be found.

### 2.2 LSEMs

Let G = ( V , D , ) be an acyclic mixed graph as defined in the previous section. It induces a statistical model, called an LSEM, for the joint distribution of a collection of random variables ( X i , i V ) , indexed by the graph’s vertices. The model hypothesizes that each variable is a linear function of its parent variables and a noise term ε i :

(1) X i = b 0 i + j pa ( i ) b j i X j + ε i , i V ,

where pa ( i ) = { j V : j i D } is the set of parents of vertex i . The variables ε i , i V are assumed to have mean 0, and the coefficients b 0 i and b j i are unknown real parameters. Since we can center the variables X i , we assume that the coefficients b 0 i are all equal to 0. In addition, the multidirected edge structure of G defines dependencies between the noise terms ε i , that is, if i and j are not connected by a multidirected edge, then ε i and ε j are independent variables. In particular, if G is a DAG, i.e., it does not have any multidirected edges, then all noise terms ε i are mutually independent. Typically termed a system of structural equations, the system (1) specifies cause–effect relations whose straightforward interpretability explains the widespread use of the models [10,11].

We can rewrite the system (1) as

X = B X + ε ,

where X = ( X 1 , , X p ) T , B = ( b i j ) is the coefficient matrix satisfying b i j = 0 whenever i j D , and ε is the noise vector. Note that since G is assumed to be acyclic, we can permute the rows and columns of the coefficient matrix B and apply the same permutation to the vector X so that the matrix B becomes lower triangular. Therefore, the matrix I B is invertible, and the system (1) can further be rewritten as

X = ( I B ) 1 ε .

### 2.3 Cumulants and the multi-trek rule

We recall the notion of a cumulant tensor for a random vector [12].

### Definition 2

Let Z = ( Z 1 , , Z p ) be a random vector of length p . The k th cumulant tensor of Z is defined to be a p × × p ( k times) table, C ( k ) , whose entry at position ( i 1 , , i k ) is

C i 1 , , i k ( k ) = ( A 1 , , A L ) ( 1 ) L 1 ( L 1 ) ! E [ j A i Z j ] E [ j A L Z j ] ,

where the sum is taken over all partitions ( A 1 , , A L ) of the set { i 1 , , i k } .

Note that if each of the variables Z i has zero mean, then we can restrict the summing over partitions for which each A i has size at least two. For example:

C i 1 , i 2 , i 3 , i 4 ( 4 ) = E ( Z i 1 Z i 2 Z i 3 Z i 4 ) E ( Z i 1 Z i 2 ) E ( Z i 3 Z i 4 ) E ( Z i 1 Z i 3 ) E ( Z i 2 Z i 4 ) + E ( Z i 1 Z i 4 ) E ( Z i 2 Z i 3 ) .

We now recall the notion of a multi-trek from ref. [13].

### Definition 3

A k -trek in a mixed graph G = ( V , D , ) between k nodes i 1 , i 2 , , i k is an ordered collection of k directed paths ( P 1 , , P k ) , where P j has sink i j and either P 1 , , P k have the same source of vertex, or there exists a multidirected edge h such that the sources of P 1 , , P k all lie in h (Figure 5).

### Figure 5

Examples of multi-treks. (a) A 4-trek between i 1 , i 2 , i 3 , i 4 . (b) A 4-trek between i 1 , i 2 , i 3 , i 4 .

The following consequence of the multi-trek rule [13] connects cumulants of LSEMs to multi-treks.

### Theorem 1

[13] Let G = ( V , D , ) be an acyclic mixed graph, and let i 1 , , i k V . Then,

C i 1 , , i k ( k ) = 0

for the kth cumulant tensor C ( k ) of any random vector whose distribution lies in the LSEM corresponding to G if and only if there is no k -trek between i 1 , , i k in G .

Note that the first part of Theorem 1 assumes that the cumulant vanishes for all distributions in the model. Because the model is the image of an algebraic map, the authors of ref. [13] show that the theorem statement also holds for generic error distributions. In other words, whenever the error distribution is generic non-Gaussian, C i 1 , , i k ( k ) = 0 if and only if the vertices i 1 , , i k do not have a common ancestor (either latent or observed).

## 3 Main result

In this section, we present our algorithm and we show that it recovers the correct graph given enough samples. In the chart below, we illustrate how the algorithm works when applied to observational data coming from the graph in Figure 1a.

Given a p × n data matrix Y whose columns are i.i.d. samples from an LSEM on an unknown bow-free acyclic mixed graph G = ( V , D , ) with an unknown direct effect matrix B , we aim to recover the graph G and the matrix B . The first step is to apply the BANG procedure [9] and obtain the coefficient matrix B together with a bow-free acyclic mixed graph G 1 = ( V , D , ) , which contains directed and bidirected edges only. The bidirected edges are obtained from by replacing each multidirected edge h = ( i 1 , , i k ) by k 2 bidirected edges, one for each pair i s , i t { i 1 , , i k } . We call such a set the bidirected subdivision of . We then, “remove” the directed edges by removing the direct effects given by B . This is done by replacing the original data matrix Y with X = Y B Y . This new matrix X can be thought of as observations from an LSEM corresponding to the acyclic mixed graph G = ( V , , ) , which is the same as G with the directed edges removed. However, we only know the bidirected subdivision of . Finally, using the higher order cumulants of X and a clique-finding algorithm on the graph G 1 = ( V , , ) , we identify all multidirected edges in . The algorithm is summarized below.

Algorithm 1: MBANG procedure
1: Input: Y R p × n arising from an unknown LSEM on G = ( V , D , ) with unknown direct effects matrix B .
2: Apply the BANG procedure to estimate the coefficient matrix B and the mixed graph G 1 = ( V , D , ) .
3: Let X = Y B Y be the new observation matrix, corresponding to the graph G = ( V , , ) which has no directed edges.
4: Initiate R = Q = , P = { 1 , 2 , , p } .
5: Apply Algorithm 2 to X , G 1 = ( V , , ) , R , P and Q to recover the set of multidirected edges.
6: Output: the graph G = ( V , D , ) and the matrix B .

Algorithm 2 finds those multidirected edges that contain the most vertices and are consistent with the cumulant structure of the data matrix X . It is based on the Bron–Kerbosh algorithm [14] for finding all cliques in an undirected graph, applied to the bidirected edges in G 1 = ( V , , ) .

### 3.1 The BANG procedure

In this section, we briefly describe the BANG procedure [9]. It is an algorithm which takes as input a p × n data matrix Y and returns a bow-free acyclic mixed graph G , which contains directed and bidirected edges only, consistent with Y as well as a direct effects matrix B .

Suppose the observed data are drawn from an LSEM whose corresponding graph is the one in Figure 6a. Figure 6b shows how the BANG algorithm works.

### Figure 6

An example of the BANG algorithm. (a) Graph example. (b) BANG flowchart.

It first identifies sibling and ancestor relations, and then it distinguishes parent and non-parent ancestors. Recall that two vertices are siblings if there is a multidirected edge between them, a vertex i is a parent of a vertex j if there is a directed edge i j , and a vertex i is an ancestor of a vertex j if there is a directed path i i 0 i k j .

However, the BANG algorithm returns bow-free acyclic mixed graphs in which the latent variable structure is recorded using bidirected edges only, i.e., it recovers the bidirected subdivision of the true set of multidirected edges . This means it cannot determine whether more than two vertices have a common cause. We resolve this problem using cumulant information.

### 3.2 Finding the multidirected edge structure

Consider the two models in Figure 7. By Theorem 1, we know that C 2 , 3 , 4 ( 3 ) = 0 in Case 1, whereas C 2 , 3 , 4 ( 3 ) 0 in Case 2. In other words, the exact model can be recovered by testing whether C 2 , 3 , 4 ( 3 ) = 0 .

### Figure 7

The same bidirected edge structure depicts two different hidden variable models. (a) Case 1. (b) Case 2.

More generally, as depicted in Figure 8, we first apply the BANG algorithm to the data matrix Y R p × n arising from an LSEM on G = ( V , D , ) , to obtain an estimate of the direct effects matrix B and the bow-free acyclic mixed graph G 1 = ( V , D , ) with directed and bidirected edges only, where there is a bidirected edge between i and j in G 1 if and only if there is a multidirected edge in G containing both i and j , i.e., is the bidirected subdivision of . Then, we form a new data matrix X = Y B Y and we consider the graph G = ( V , , ) which contains only multidirected edges. In other words, the data matrix X can be thought of as a matrix of samples coming from an LSEM on the graph G . However, we only know the graph G 1 = ( V , , ) . The bidirected edges recovered by the BANG algorithm can now be “merged” together to obtain the true multidirected edge structure . Since G 1 contains only bidirected edges, we look for all cliques (of bidirected edges) { i 1 , , i k } in G such that C i 1 , , i k ( k ) 0 , where C ( k ) is the k th cumulant of X . We do this by adapting the Bron–Kerbosch algorithm [14], which is used for finding all maximal cliques in an undirected graph.

### Figure 8

MBANG flow chart.

Algorithm 2 is a direct modification of the Bron–Kerbosch algorithm [14]. It is a recursive algorithm that maintains three disjoint sets of vertices R , P , Q V and aims to output all maximal cliques in the graph which contain all vertices in R , do not contain any of the vertices in Q , and could contain some of the vertices in P . The only addition to the Bron–Kerbosch algorithm that Algorithm 2 does is the test of whether or not specific cumulants are non-zero in Line 7. In the cumulant test, if several vertices have a common ancestor, their higher order cumulant is non-zero by Theorem 1. In our algorithm, we consider the bidirected cliques of the graph G 1 and we use the cumulant test to see if the vertices in such a clique have a common ancestor. If that is the case, then they should instead be connected by one multidirected edge rather than a clique of bidirected edges.

Algorithm 2: Determine multidirected edges
1: Input: X R p × n , G 1 = ( V , , ) , R , P and Q . Denote the elements of R to be i 1 , , i k .
2: if P and Q are both empty then
3: Report R as a multidirected edge.
4: end if
5: for all vertex v P do
6: if N ( v ) , where N ( v ) is the set of vertices adjacent to v in then
7: if C i 1 , , i k , v ( k + 1 ) 0 or j { i 1 , , i k } s.t. C i 1 , , i k , v , j ( k + 2 ) 0 then
8: call Algorithm 2 ( X , R { v } , P N ( v ) , Q N ( v ) )
9: end if
10: end if
11: P = P \ { v }
12: Q = Q { v }
13: end for

In practice, since Theorem 1 holds for generic distributions, sometimes random variables consistent with a model that has a multidirected edge between i 1 , , i k may still have a zero cumulant. For example, if i 1 , i 2 , and i 3 are all caused by a hidden parent variable whose distribution is symmetric around 0, we could obtain C i 1 , i 2 , i 3 ( 3 ) = 0 . To solve this problem, we relax the criterion C i 1 , , i k ( k ) 0 . In our implementation, we also check if there exists j { i 1 , , i k } such that C i 1 , , i k , j ( k + 1 ) 0 . Note that there is a k -trek between i 1 , , i k if and only if there is a ( k + 1 ) -trek between i 1 , , i k , j for j = i 1 , , i k . If there is a multidirected edge h such that i 1 , , i k h but due to non-genericity of the model C i 1 , , i k k = 0 , we find that often times C i 1 , , i k , j ( k + 1 ) 0 , which allows us to reach the correct conclusion. This is illustrated in the following example.

### Example 1

Let Z 0 = ε 0 be a hidden variable, and let

Z 1 = ε 0 + ε 1 , Z 2 = ε 0 + ε 2 , Z 3 = c 0 + ε 3 ,

where ε j are i.i.d. symmetric with mean 0 for 0 j 3 . Then we have

E [ Z 1 Z 2 Z 3 ] = 0 , E [ Z 1 Z 2 ] E [ Z 3 ] = 0 , E [ Z 1 ] E [ Z 2 ] E [ Z 3 ] = 0 ,

which implies C 1 , 2 , 3 ( 3 ) = 0 . At the same time,

E [ Z 1 Z 2 Z 3 Z 1 ] = E [ ε 0 4 ] + E [ ε 0 2 ] E [ ε 1 2 ] E [ Z 1 Z 1 ] E [ Z 2 Z 3 ] = ( E [ ε 0 2 ] + E [ ε 1 2 ] ) E [ ε 0 2 ] E [ Z 1 Z 2 ] E [ Z 1 Z 3 ] = E [ ε 0 2 ] 2 C 1 , 2 , 3 , 1 ( 4 ) = E [ ε 0 4 ] 3 E [ ε 0 2 ] 2 .

This shows that C 1 , 2 , 3 , 1 ( 4 ) is likely to be non-zero.

### 3.3 Theoretical results

In this section, we show that Algorithm 1 recovers the correct bow-free acyclic mixed graph given enough samples. We begin with a population moment result.

### Theorem 2

Suppose Y is generated from an LSEM corresponding to a bow-free acyclic mixed graph G = ( V , D , ) . Then for a generic choice of the coefficient matrix B and generic error moments, when given the population moments of Y , Algorithm 1 produces the correct graph G .

### Proof

Note that the only assumptions that the BANG procedure makes are that the graph is bow-free acyclic and the error terms are non-Gaussian, i.e., the error terms are generic.

Algorithm 1 first applies the BANG procedure to produce a bow-free acyclic mixed graph G 1 = ( V , D , ) containing bi-directed edges and directed edges only as well as the coefficient matrix B . In ref. [9], the authors show that the BANG algorithm recovers the correct bow-free acyclic mixed graph (with directed and bidirected edges only) and the correct coefficient matrix B . Note that this means that the graph G 1 will have a bidirected edge b between two nodes i and j which are part of a multidirected edge h in G . Consider the data matrix X = Y B Y which corresponds to a graph G = ( V , , ) obtained by removing all directed edges in G , and the graph G 1 = ( V , , ) obtained from G 1 by removing all directed edges. Our task is to merge together some of the bidirected edges in G 1 in order to obtain the graph G .

There is a multidirected edge between i 1 , , i k in G (or, equivalently, in G ), if and only if any two of i 1 , , i k are joined by a bidirected edge in G 1 , i.e., they will form a clique, and

(2) C i 1 , , i k ( k ) 0 ,

where C ( k ) is the k th cumulant of X (assuming the distribution of X is generic). Therefore, finding all multidirected edges is equivalent to finding all maximal cliques in G 1 for which (2) holds. This is precisely what Algorithm 2 does – it applies the Bron–Kerbosch algorithm for finding all cliques in an undirected graph (which can equally well be applied to the graph G 1 since it only has bidirected edges) and for each such clique it checks whether (2) holds.

Therefore, Algorithm 1 produces the correct graph G .□

### Theorem 3

Suppose Y ( 1 ) , Y ( 2 ) , , Y ( n ) are generated by an LSEM which corresponds to a bow-free acyclic mixed graph G = ( V , D , ) . Then, for generic choices of the effect coefficient matrix B , and generic error moments, there exist δ 1 , δ 2 , δ 3 > 0 such that if the sample moments are within a δ 1 ball of the population moments of Y , then Algorithm 1 will produce the correct graph G when comparing the absolute value of the sample statistics to δ 2 as a proxy for the independence tests in the BANG procedure and when comparing the absolute value of the cumulants to δ 3 as a proxy in the tests for the vanishing of cumulants in Algorithm 2.

### Proof

First, consider the adjusted samples X ( i ) = Y ( i ) B Y ( i ) . Note that by definition the cumulants of X have the form

C i 1 , , i k ( k ) cum ( X i 1 , , X i k ) = ( A 1 , , A L ) ( 1 ) L 1 ( L 1 ) ! E [ j A 1 X j ] E [ j A L X j ] ,

which is a rational function of the moments of X . Since X = Y B Y , the coordinates of X are linear combinations of the coordinates of Y with coefficients given by the entries of the matrix ( I B ) . Therefore, the cumulants of X can be expressed as rational functions of the moments of Y and the entries of B . Thus, for all k 1 , the cumulant of X of order k is a continuous function of finitely many moments of Y and B .

Let

δ 3 = 1 2 min PC i 1 , , i k ( k ) 0 C i 1 , , i k ( k ) , 1 k p .

Then, there exists δ > 0 such that whenever the empirical moments of Y are within a δ ball of its population moments and the estimated B is within δ of the true one, all of the estimated cumulants of X are within a δ 3 ball of its population cumulants C ( k ) .

Next, we know from ref. [9] that there exist δ , δ 2 > 0 such that if the sample moments of Y are within a δ ball of the population moments BANG will output the correct graph G 1 = ( V , D , ) (where is the bidirected subdivision of ) if it uses δ 2 as a proxy for its independence tests. And, if the sample moments are within a δ ball of the population moments, the estimated directed effects matrix B will be within a δ ball from the true direct effects matrix. This is because the estimated matrix B is a rational function of the sample moments [9].

Thus, choosing δ 1 = min ( δ , δ ) we see that if the sample moments of Y are within δ 1 of its population moments, and if we use δ 2 as a proxy for its independence tests in BANG and δ 3 as a proxy in the tests for vanishing cumulants, the BANG procedure and the test for cumulants being zero or not will produce the correct results, and, therefore, Algorithm 1 will yield the correct graph G = ( V , D , ) .□

## 4 Numerical results

In this section, we examine how well Algorithm 1 performs numerically.

### 4.1 Simulations

We carried out a number of simulations using four different types of error distributions: the uniform distribution on [ 10 , 10 ] ; the Student’s t -distribution with 10 degrees of freedom; the Gamma distribution with shape = 2 and rate = 4 ; and the chi-squared distribution with 2 degrees of freedom. We shift these distributions to have zero mean. The t -distribution is used to test the performance of the algorithm in cases when the distribution resembles the Gaussian distribution.

We generate random bow-free acyclic mixed graphs as follows. First, we uniformly select a prescribed number of directed edges from the set { ( i , j ) i < j } , and we choose the direct effect coefficients for each edge uniformly from ( 1 , 0.6 ) ( 0.6 , 1 ) . Afterward, some of the vertices of this DAG which have no parents are regarded as unobserved variables, and bow structures caused by marginalizing these vertices are removed from the graph. The resulting graph is a bow-free acyclic mixed graph with potential multidirected edges.

To evaluate the performance of our method, we calculate the proportion of time that the algorithm recovers the true graph. We generate graphs with seven vertices, and after some of them are made hidden, the final graphs usually contain five or six observed variables. We test three settings with different levels of sparsity. Sparse graphs have 5 directed edges before marginalization, medium graphs have 8, and dense graphs have 12. We do not restrict the number of multidirected edges, however, almost all dense graphs have multidirected edges, and most of the medium graphs have at least one multidirected edge.

In the first step of our algorithm, we perform the BANG procedure, and we set all the nominal levels of the hypothesis test to α = 0.01 , as suggested by the authors of the BANG algorithm [9]. The tolerance value used in our cumulant tests we use is 0.05.

For each of the three settings: sparse, medium, and dense, we tested 100 graphs by taking different numbers of samples: 10,000, 25,000, and 50,000.

In Figure 9, we show the proportion of graphs that were recovered precisely by the BANG and MBANG algorithms. For the BANG algorithm, we recognize each multidirected edge as the corresponding set of bidirected edges.

### Figure 9

Random graph results: correct graphs.

In Figure 10, we show the percent of correctly identified bidirected and multidirected edges among all the true bidirected and multidirected edges for each setting conditioning on the case all bidirected edges have been correctly identified and a bidirected clique has been found for each true multidirected edge. The x axis represents the sample size and the y axis represents the percentage.

### Figure 10

Random graph results: correct multidirected edges.

In practice, we normalize the data matrix X by dividing each row by its standard deviation before initiating Algorithm 2 in order to control the variability of its cumulants.

Among all 3,600 graphs we tested in this data set, the BANG algorithm recovered 1,883 correctly, and the MBANG algorithm recovered 1,774 correctly. Hence, given that the BANG result was correct, the MBANG algorithm identified 94.2% of the graphs correctly. Among all dense graphs, this proportion was reduced to 88.0%.

The drastic drop in empirical performance as the number of vertices increases could be partially due to multiple testing and setting a different cutoff for the cumulants or a smaller level for the independence tests may perform better.

### 4.2 A real data set

In a paper by Grace et al. [15], an LSEM was used to examine the relationships between land productivity and the richness of plant diversity, the full model of which is shown in Figure 11a. Wang and Drton [9] choose a subset of the variables and consider the model in Figure 11b as the ground truth model. Figure 12 shows the graphical model they discover using the BANG procedure with nominal test level 0.01.

### Figure 11

True model. (a) Full model from ref. [15]. (b) A subset of the model from ref. [15] used in ref. [9].

### Figure 12

Model discovered by BANG [9].

Since this is a real data set, it is possible that some of the hidden variables affect more than one bidirected edge, or there exist other hidden variables that can affect the hidden variables detected in the BANG procedure. After applying our MBANG algorithm with 0.05 tolerance, we found that all bidirected edges can be grouped in three 3-directed edges: (“PlotSoilSuit,” “PlotProd,” “SiteBiomass”), (“PlotSoilSuit,” “SiteBiomass,” “SiteProd”), and (“PlotSoilSuit,” “SiteBiomass,” “SiteProd”), see Figure 13.

### Figure 13

Model discovered by MBANG.

## 5 Further discussion

In this article, we proposed a high-order cumulant-based algorithm for discovering hidden variable structure in non-Gaussian LSEMs. Note that our Algorithm 2 can be used on top of any procedure that recovers a mixed graph G = ( V , D , ) with directed and bidirected edges only and an estimate of the direct effects matrix B . We chose the BANG algorithm as this first step since it applies to a large class of graphs: bow-free acyclic mixed graphs.

Second, the performance of our algorithm is closely related to the error distribution, and the density and size of the graph. For example, when the errors are unif ( 5 , 5 ) and the number of vertices is 7, as the edge number increases from 8 to 20, the correct rate drops from 78 to 1%. Also, if the density is medium, as the number of vertices increases from 5 to 7, the correct rate drops from 78% to about 40%. Compared to the uniform distribution, the exponential distribution has a much milder drop in correct rate. For example, when the sample size is 50,000, the correct rate for exp(1) drops from 70 to 61% as the graph density increases from sparse to very dense (almost complete). Note, however, that this drop is also present in the output of the BANG algorithm itself.

Finally, while we proved that Algorithm 1 finds the true graph given enough samples, it would be interesting to know the exact amount of samples needed in both BANG and Algorithm 2, as well as the nominal levels for the independence tests in BANG and in our cumulant tests.

# Acknowledgements

The authors thank Mathias Drton and Samuel Wang for helpful discussions regarding their algorithm [9]. The authors also thank Jean-Baptiste Seby for a helpful discussion at an early stage of the project.

1. Funding information: ER was supported by an NSERC Discovery Grant (DGECR-2020-00338). YL was supported by a WLIURA in the summer of 2020.

2. Conflict of interest: Authors state no conflict of interest.

### References

[1] Maathuis M, Drton M, Lauritzen S, Wainwright M, editors. Handbook of graphical models. Chapman and Hall/CRC handbooks of modern statistical methods. London, UK: CRC Press; 2019. p. MR3889064. Search in Google Scholar

[2] Lauritzen S. Graphical models. Oxford: Clarendon Press; 1996. Search in Google Scholar

[3] Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A. A linear non-Gaussian acyclic model for causal discovery. J Machine Learn Res. 2006;7:2003–30. Search in Google Scholar

[4] Shimizu S, Inazumi T, Sogawa Y, Hyvärinen A, Kawahara Y, Washio T, et al. DirectLiNGAM: a direct method for learning a linear non-Gaussian structural equation model. J Machine Learn Res. 2011;12:1225–48. Search in Google Scholar

[5] Hyvärinen A, Smith SM. Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. J Machine Learn Res. 2013;14:111–52. Search in Google Scholar

[6] Wang YS, Drton M. High-dimensional causal discovery under non-gaussianity. Biometrika. 2019;107(1):41–59. Search in Google Scholar

[7] Hoyer PO, Shimizu S, Kerminen AJ, Palviainen M. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. Internat. 2008;49(2):362–78. Search in Google Scholar

[8] Tashiro T, Shimizu S, Hyvärinen A, Washio T. ParceLiNGAM: a causal ordering method robust against latent confounders. Neural Comput. 2014;26(1):57–83. Search in Google Scholar

[9] Wang YS, Drton M. Causal discovery with unobserved confounding and non-Gaussian data. Preprint: arXiv:2007.11131. Search in Google Scholar

[10] Pearl J. Causality: Models, reasoning, and inference. Second edn. Cambridge, UK: Cambridge University Press; 2009. MR 2548166. Search in Google Scholar

[11] Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. Adaptive computation and machine learning. Second edn. Cambridge, MA: MIT Press; 2000. With additional material by David Heckerman, Christopher Meek, Gregory F. Cooper and Thomas Richardson, A Bradford Book. MR 1815675. Search in Google Scholar

[12] Comon P, Jutten C. Handbook of blind source separation: Independent component analysis and applications. Cambridge, MA, USA: Academic Press, Inc; 2010. Search in Google Scholar

[13] Robeva E, Seby J-B. Multi-trek separation in linear structural equation models. SIAM J Appl Algebra Geomet. 2021;5(2):278–303. Search in Google Scholar

[14] Bron C, Kerbosch J. Algorithm 457: finding all cliques of an undirected graph. Commun ACM. 1973;16(9):575–7. Search in Google Scholar

[15] Grace JB, Anderson TM, Seabloom EW, Borer ET, Adler PB, Harpole WS, et al. Integrative modelling reveals mechanisms linking productivity and plant species richness. Nature. 2016;529(7586):390–3. Search in Google Scholar