Dependence and dependence structures: estimation and visualization using the unifying concept of distance multivariance

Distance multivariance is a multivariate dependence measure, which can detect dependencies between an arbitrary number of random vectors each of which can have a distinct dimension. Here we discuss several new aspects, present a concise overview and use it as the basis for several new results and concepts: In particular, we show that distance multivariance unifies (and extends) distance covariance and the Hilbert-Schmidt independence criterion HSIC, moreover also the classical linear dependence measures: covariance, Pearson's correlation and the RV coefficient appear as limiting cases. Based on distance multivariance several new measures are defined: a multicorrelation which satisfies a natural set of multivariate dependence measure axioms and $m$-multivariance which is a dependence measure yielding tests for pairwise independence and independence of higher order. These tests are computationally feasible and under very mild moment conditions they are consistent against all alternatives. Moreover, a general visualization scheme for higher order dependencies is proposed, including consistent estimators (based on distance multivariance) for the dependence structure. Many illustrative examples are provided. All functions for the use of distance multivariance in applications are published in the R-package 'multivariance'.


Introduction
The detection of dependence is a common statistical task, which is crucial in many applications. There have been many methods employed and proposed, see e.g. [24], [43] and [29] for recent surveys. Usually these focus on the (functional) dependence of pairs of variables. Thus when the dependence of many variables is studied the resulting networks (correlation networks, graphical models) only show the pairwise dependence. As long as pairwise dependence is present, this might be sufficient (and for the detection of such dependencies total multivariance and m-multivariance provide efficient tests). But recall that pairwise independence does not imply the independence of all variables if more than two variables are considered. Thus, in particular, if all variables are pairwise independent, many methods of classical statistical inference would discard the data although some higher order dependence might be present. Such dependence can only be detected directly with a multivariate dependence measure. The classical examples featuring 3-dependence are a die in the shape of a tetrahedron with specially colored sides (see Example 9.1) and certain events in multiple coin throws (Example 9.2). In Example 9.3 a generalization to higher orders is presented.
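The coin-throw construction behind Examples 9.1 and 9.2 can be checked by direct enumeration. A minimal sketch (our own illustrative code, not taken from the paper or the R-package): two fair coin throws together with their XOR are pairwise independent but 3-dependent.

```python
from itertools import product
from fractions import Fraction

# two fair coin throws x, y and z = x XOR y: all four outcomes are equally likely
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]
p = Fraction(1, len(outcomes))

def prob(event):
    # probability of an event, given as a predicate on the outcome triple
    return sum(p for o in outcomes if event(o))

# every pair of coordinates is independent ...
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a, b in product([0, 1], repeat=2):
        assert prob(lambda o: o[i] == a and o[j] == b) == \
               prob(lambda o: o[i] == a) * prob(lambda o: o[j] == b)

# ... but the triple is not: P(X=0, Y=0, Z=0) = 1/4 while the product is 1/8
print(prob(lambda o: o == (0, 0, 0)))  # 1/4
```

The tetrahedron die of Example 9.1 encodes a distribution of the same type via its colored sides.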
To avoid misconceptions when talking about independence one should note that the term "mutual independence" is ambiguous, some authors use it as a synonym for pairwise independence, others for independence. For the latter also the terms "total independence" or "joint independence" are used. We use the terms: pairwise independence, its extension m-independence (see Section 2) and independence.
In [7,8] the basics of distance multivariance and total distance multivariance were developed, which can be used to measure multivariate dependence. Incidentally, a variant of total distance multivariance based on the Euclidean distance was simultaneously developed in [10]. Moreover, distance multivariance names and extends a concept introduced in [4]. Here we recall and extend the main definitions and properties (Sections 2 and 4). In particular, the moment conditions required in [8] for the independence tests are considerably relaxed (Theorem 2.5, Tests 4.1 and 4.3), invariance properties are explicitly discussed (Propositions 2.3 and 2.4) and resampling tests are introduced (Section 4). Moreover, on this basis the following new results and concepts are developed:
• A general scheme for the visualization of higher order dependence which can be used with any multivariate dependence measure (Section 6). For the setting of multivariance we provide explicit consistent estimators for the (higher order) dependence structure. In particular the method for the clustered dependence structure is based on the fact that multivariance is a truly multivariate dependence measure: On the one hand it can measure the dependence of multiple (more than 2) random variables. On the other hand each random variable can be multivariate and each can have a distinct dimension.
• Global tests for pairwise (and higher order) independence: Pairwise independence is a fundamental requisite e.g. for the law of large numbers (in its basic form; a result which is used in almost every statistical estimation). Recently in [47] a test for pairwise independence of identically distributed random variables was proposed. We derive in Section 5 a test for pairwise (and higher order) independence which is applicable to any mixture of marginal distributions and dimensions.
• Unification of dependence measures (Section 3):
- In [38] it was shown that for independence testing methods based on reproducing kernels and methods based on distances are equivalent. They considered the bivariate setting. We present in Section 3.2 a different, explicit and very elementary relation. Moreover, this transfers also to the setting of multiple random vectors.
- It was clear that the RV-coefficient structurally corresponds to distance covariance, see e.g. [24]. We show in Section 3.1 that the RV-coefficient and, in particular, covariance (the most classical dependence measure of all!) are covered as limiting cases by the framework of multivariance.
- Hilbert space methods require characteristic kernels. Multivariance requires the full support of the underlying measures. We show that these assumptions are equivalent (Proposition 2.2).
• Multivariate correlations: A formalization of desired properties of dependence measures goes back at least to Rényi's axioms [34]. We discuss in Section 3.6 a multivariate extension of the axioms and provide several multicorrelations, e.g. (14), (15) and (56).
Recently also several other dependence measures for multiple random vectors were proposed: e.g. [47] propose tests for pairwise independence, banded independence and independence based on distance covariance or based on the approach of [32]. The latter presented tests for independence of multiple random vectors using kernels. In [23] also distance covariance (a measure for the dependence of two random vectors; introduced in [42]) was generalized to tests of independence of multiple random vectors. All these approaches are related to distance multivariance, see Section 3 for a detailed discussion. Empirical comparisons can be found in Examples 7.2 and 7.3. It is remarkable that, although the above measures are able to detect higher order dependencies, all real data examples which were provided so far feature only pairwise dependence. Certainly the predominant statistical methods cause a publication bias for such datasets. Nevertheless, we want to point out that many available datasets feature higher order dependence. Based on a data driven study we collected over 350 datasets featuring statistically significant higher order dependencies [5]. One of these datasets is discussed in further detail in Section 7.2, and all of these datasets are distributed as part of various R-packages, without the context of higher order dependence. This indicates that higher order dependence can be detected frequently; what remains open (with the potential to be the basis for various new research projects) are intrinsic explanations of higher order dependence within each field of research of the underlying data.
Besides the real data examples the presentation of this paper is complemented by a comprehensive collection of further examples (in Section 7 and in the supplementary Section 9): illustrating higher order dependencies (Section 9.1), discussing various properties of distance multivariance (Section 9.2) and comparing to other dependence measures (Section 7.1). Technical details and further results are collected in Section 8.
The R code for the evaluation of distance multivariance and the corresponding tests is provided in the R-package multivariance [6]. Finally, based on (some of) the results of this paper we have the following recommendations for questions common in independence testing:
1. Are at least some variables dependent? Detection of any kind of dependence:
(a) The global independence test based on total multivariance can be used to detect any kind of dependence; alternatively 2-multivariance can be used to test for pairwise (in)dependence. The latter and m-multivariance can also be used to reduce the statistical curse of dimension from which total multivariance might suffer. For all settings fast distribution free (conservative) tests exist, and these are applicable for large samples and a large number of random vectors. In its current implementation the computation of the test statistic takes less than 2 seconds for 100 variables with 1000 samples each (or for 1000 variables with 300 samples each) on a dated i7-6500U CPU laptop. Slower, but approximately sharp, are the corresponding resampling tests. Faster approximately sharp tests are discussed in [3].
(b) As a complementary approach to the global tests one could perform multiple tests as suggested in [4]. This requires $2^n - n - 1$ individual tests, where n denotes the number of random vectors. Hence it is only applicable for small n. [4] also provides a multiple testing approach to m-dependence.

2. Which variables depend on each other? Dependence structure: Especially if some dependence was detected, the algorithm of Section 6 can be used to analyze which variables depend on each other, yielding either a full or clustered dependence structure. The method is based on multiple tests, but variables are clustered (or related tuples are excluded from further tests) as soon as a positive detection occurs. This can considerably reduce the computation time in comparison to 1.(b).
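The clustered procedure can be sketched generically. The snippet below is only a schematic stand-in for the algorithm of Section 6: it uses a union-find clustering and, for self-containedness, an arbitrary pairwise detector based on Pearson correlation with a hypothetical threshold in place of a multivariance-based test.

```python
import numpy as np

def clustered_structure(data, detects_dependence):
    """Greedy sketch: merge variables into one cluster as soon as a pairwise
    detection occurs; pairs inside an already merged cluster are not tested again."""
    n = len(data)
    parent = list(range(n))           # union-find forest over variable indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if find(i) != find(j) and detects_dependence(data[i], data[j]):
                parent[find(i)] = find(j)   # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# stand-in detector (threshold chosen arbitrarily for the illustration)
detect = lambda a, b: abs(np.corrcoef(a, b)[0, 1]) > 0.5

x = np.arange(100.0)
y = 2 * x + 1                        # strongly dependent on x
z = (-1.0) ** np.arange(100)         # essentially unrelated to x and y
print(clustered_structure([x, y, z], detect))  # [[0, 1], [2]]
```

In the actual method the detector would be a (resampling) test based on sample multivariance, and the skipping of tuples within an already detected cluster is what reduces the computation time.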

Distance multivariance
In the following distance multivariance is introduced. Some parts are essential for the (less technical) comparison to other dependence measures in Section 3, other parts are required for the introduction of m-multivariance (Section 5). Furthermore, several new results are included which make distance multivariance more accessible and applicable. Tests using distance multivariance will be discussed in Section 4. Let $X_i$ be $\mathbb{R}^{d_i}$-valued random variables with characteristic functions $f_{X_i}(t_i) = \mathbb{E}(e^{\mathrm{i}\, t_i \cdot X_i})$ for $i = 1, \ldots, n$. Then distance multivariance is defined by

$$M_\rho(X_1,\ldots,X_n) := \sqrt{\int \Big|\mathbb{E}\Big(\prod_{i=1}^n \big(e^{\mathrm{i}\, t_i\cdot X_i} - f_{X_i}(t_i)\big)\Big)\Big|^2 \, \rho(dt)} \quad (1)$$

and total distance multivariance is

$$\overline{M}_\rho(X_1,\ldots,X_n) := \Big(\sum_{2\le m\le n}\ \sum_{1\le i_1<\ldots<i_m\le n} M_\rho^2(X_{i_1},\ldots,X_{i_m})\Big)^{1/2} \quad (2)$$

where $\rho = \otimes_{i=1}^n \rho_i$ and each $\rho_i$ is a symmetric measure with full topological support on $\mathbb{R}^{d_i}$ such that $\int 1\wedge|t_i|^2\, \rho_i(dt_i) < \infty$, and $t = (t_1,\ldots,t_n)$ with $t_i \in \mathbb{R}^{d_i}$. To simplify notation and definitions we will just use the term 'multivariance' instead of 'distance multivariance', and we will drop the subscript ρ if the measure is the full measure ρ.
Random variables X 1 , . . . , X n are called m-independent, if X i1 , . . . , X im are independent for any distinct i j ∈ {1, . . . , n} for j = 1, . . . , m. This concept is essential for the statement (and proof) of the following theorem. It is also the basis for the estimators for m-independence which will be developed in Section 5.
For statistical applications the following representations, which require the moment condition (5), are very useful. Let $(X_1',\ldots,X_n')$ be an independent copy of $(X_1,\ldots,X_n)$. Then

$$M^2(X_1,\ldots,X_n) = \mathbb{E}\Big(\prod_{i=1}^n \Psi_i(X_i,X_i')\Big) \quad (3)$$

with

$$\Psi_i(X_i,X_i') := -\psi_i(X_i-X_i') + \mathbb{E}\big(\psi_i(X_i-X_i')\,\big|\,X_i\big) + \mathbb{E}\big(\psi_i(X_i-X_i')\,\big|\,X_i'\big) - \mathbb{E}\big(\psi_i(X_i-X_i')\big). \quad (4)$$

Note that in [8] a technical looking moment condition was required for the above representations; we show in Section 8.2 that the following more comprehensible condition is equivalent to it (for non constant random variables):

finite joint ψ-moments: for all $S \subset \{1,\ldots,n\}$: $\mathbb{E}\Big(\prod_{i\in S} \psi_i(X_i)\Big) < \infty$. (5)

A direct consequence of (3) is the factorization of $M$ and $\overline{M}$ for independent subsets, i.e., if $(X_i)_{i\in I}$ and $(X_i)_{i\in I^c}$ are independent for some $I \subset \{1,\ldots,n\}$, then the expectation in (3) factorizes into the corresponding expectations over $I$ and $I^c$. Furthermore, the expectations in (3) yield strongly consistent (see [8, Theorem 4.3] and Corollary 2.7) and numerically feasible estimators. Hereto denote $N$ samples of $(X_1,\ldots,X_n)$ by $x^{(j)} = (x_1^{(j)},\ldots,x_n^{(j)})$, $j = 1,\ldots,N$. Then sample multivariance is

$$^N\!M(x^{(1)},\ldots,x^{(N)}) := \sqrt{\frac{1}{N^2}\sum_{j,k=1}^{N} \prod_{i=1}^{n} (A_i)_{jk}} \quad (8)$$

and sample total multivariance is

$$^N\overline{M}(x^{(1)},\ldots,x^{(N)}) := \sqrt{\frac{1}{N^2}\sum_{j,k=1}^{N} \Big(\prod_{i=1}^{n} \big(1+(A_i)_{jk}\big) - 1 - \sum_{i=1}^{n} (A_i)_{jk}\Big)} \quad (9)$$

where $(A_i)_{jk}$ denotes the element in the $j$-th row and $k$-th column of the doubly centered distance matrix $A_i$ defined by

$$(A_i)_{jk} := -\psi_i\big(x_i^{(j)}-x_i^{(k)}\big) + \frac{1}{N}\sum_{l=1}^{N}\psi_i\big(x_i^{(l)}-x_i^{(k)}\big) + \frac{1}{N}\sum_{m=1}^{N}\psi_i\big(x_i^{(j)}-x_i^{(m)}\big) - \frac{1}{N^2}\sum_{l,m=1}^{N}\psi_i\big(x_i^{(l)}-x_i^{(m)}\big). \quad (10)$$

The matrices $A_i$ are positive definite [8, Remark 4.2.b], since the considered distances $\psi_i(\cdot-\cdot)$ are given by

$$\psi_i(x_i) = \int_{\mathbb{R}^{d_i}} 1-\cos(t_i\cdot x_i)\, \rho_i(dt_i). \quad (11)$$

Functions defined via (11) appear in various areas: e.g. they are called variogram (e.g. [30]), continuous negative definite function (e.g. [2]) or characteristic exponent of a Lévy process with Lévy measure $\rho_i$ (e.g. [36]), and they are closely related to the symbol of generators of Markov processes ([22], [9]). The choice of $\rho_i$ and $\psi_i$ is discussed in more detail in Remark 2.8; the standard choices are the Euclidean distance $\psi_i(t_i) = |t_i|$, for $\alpha_i \in (0,2)$ the stable distances $\psi_i(t_i) = |t_i|^{\alpha_i}$, and bounded functions of the form $\psi_i(t_i) = 1-\exp(-\delta_i|t_i|^{\alpha_i})$ with $\delta_i > 0$. But also other functions like Minkowski distances are possible; various examples are given in [7, Table 1].
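The estimators (8)-(10) are straightforward to implement. The following is an illustrative reimplementation (not the R-package 'multivariance'), restricted to one-dimensional marginals and ψ_i(x) = |x| for brevity:

```python
import numpy as np

def doubly_centered(x, psi=np.abs):
    """Doubly centered psi-distance matrix A_i of (10), for 1-d samples x."""
    D = psi(x[:, None] - x[None, :]).astype(float)
    return -(D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean())

def sample_multivariance(xs, psi=np.abs):
    """Sample multivariance (8): square root of the mean entrywise product."""
    A = np.array([doubly_centered(x, psi) for x in xs])
    return np.sqrt(np.mean(np.prod(A, axis=0)))

# pairwise independent XOR triple (the coin example): each pair is independent
x = np.tile([0, 0, 1, 1], 25).astype(float)
y = np.tile([0, 1, 0, 1], 25).astype(float)
z = (x != y).astype(float)           # z = x XOR y

print(sample_multivariance([x, y]))     # 0.0: no pairwise dependence
print(sample_multivariance([x, y, z]))  # ~0.354: the 3-dependence is detected
```

On this pairwise independent triple all pairwise statistics vanish while the 3-fold statistic is positive, illustrating why a truly multivariate measure is needed.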
We call a function ψ characterizing if for any random vector X the function $z \mapsto \mathbb{E}(\psi(X - z))$ characterizes the distribution of X uniquely, or equivalently, if for finite measures µ the mapping $\mu \mapsto \int \psi(x - \cdot)\,\mu(dx)$ is injective. The following proposition provides a characterization of the required support property of ρ in terms of ψ; it actually solves an open problem of [7]. Moreover it also links the setting of multivariance to other dependence measures, see Section 3.
Proposition 2.2. Let $\psi_i$ be given by (11) for a symmetric measure $\rho_i$ such that $\int 1 \wedge |t_i|^2\, \rho_i(dt_i) < \infty$. Then $\psi_i$ is characterizing if and only if $\rho_i$ has full topological support.
Proof. The statement is a consequence of Theorem 8.1 (see Section 8). Hereto note that the distributions of two random vectors coincide if and only if their characteristic functions coincide on a dense subset, i.e., µ almost surely for a measure µ with full topological support.
There are important scaled versions of the estimators in (8) and (9):
• normalized sample (total) multivariance: the doubly centered matrices in (8) and (9) are replaced by their scaled versions given in (12). In the case of normalized sample total multivariance the sum in (9) is additionally scaled by the number of summands in the definition of total multivariance (2), i.e., by $2^n - n - 1$, yielding (13). By this scaling the test statistics for multivariance and total multivariance have expectation 1 (in the case of independent variables).
• sample multicorrelation: (8) is replaced by its scaled version (14);
• unnormalized sample multicorrelation: (8) is replaced by its scaled version (15).
Note that Mcor is newly introduced in this paper, see in particular Table 1 for a comparison. For even n the multicorrelation R and Mcor coincide, but for odd n they differ. Only R is always bounded by 1, hence Mcor is called unnormalized. But only for Mcor the value 1 has an explicit interpretation. The population versions of the above sample measures are given by scaling $\Psi_i$ in (3) with the terms on the right of (12), (14) and (15); e.g. this yields normalized multivariance and normalized total multivariance, where implicitly the finiteness of the corresponding moments is assumed, i.e.,

finite first ψ-moments: for all $i = 1,\ldots,n$: $\mathbb{E}(\psi_i(X_i)) < \infty$. (17)

For the scaling of the multicorrelations one has to assume finite ψ-moments of order n:

for all $i = 1,\ldots,n$: $\mathbb{E}(\psi_i^n(X_i)) < \infty$. (18)
Note that the scaling factors given in (14) and (15) depend on n; thus the total multicorrelations do not have such a simple representation as $\overline{M}$ (or its sample version (13)), in fact the bounds in (20) hold (analogously also for $\overline{M}$cor). Therefore the total multicorrelations seem more of a theoretic interest, but the corresponding m-multicorrelations (which will be defined in Remark 5.5.3) have efficient estimators. Moreover, also the lower bound in (20) has an efficient sample version analogously to (13). For a comparison of these multicorrelations see Section 3.6. The introduced dependence measures and their sample versions have in particular the following properties.
2. permutation invariant, i.e., $M(X_1,\ldots,X_n) = M_{\otimes_{i=1}^n \rho_{k_i}}(X_{k_1},\ldots,X_{k_n})$ for all permutations $k_1,\ldots,k_n$ of $1,\ldots,n$. Moreover, the sample versions are in addition invariant with respect to permutations of the samples, i.e., the equality $^N\!M(x^{(1)},\ldots,x^{(N)}) = {}^N\!M(x^{(l_1)},\ldots,x^{(l_N)})$ holds for all permutations $l_1,\ldots,l_N$ of $1,\ldots,N$. (This should not be confused with the permutations for a resampling test, where components of the samples are permuted separately, see (42).) Note that the latter and (3) imply that for dichotomous 0-1 coded data a swap of the coding does not change the value of the multivariance.
5. rotation invariant for isotropic ψ i , i.e., if ψ i (x i ) = g i (|x i |) for some g i and all i = 1, . . . , n, then M (X 1 , . . . , X n ) = M (R 1 X 1 , . . . , R n X n ) for all rotations R i on R di . Note that in this case, since also (3) and (4) hold, M is invariant with respect to Euclidean isometries.
Proof. For multivariance M the property (a) follows by direct calculation using (3) and (4), (b) is obvious by (3), (c) holds since ψ i is symmetric and for (d) note that the translations cancel in (4). Moreover, since a rotation preserves Euclidean distances also (e) holds. Total multivariance M is just a sum of multivariances, hence it inherits these properties. For sample multivariance N M the same arguments apply using (8) and (10). For the sample permutation invariance in (b) note that permutations of samples correspond to permutations of rows and columns of the centered distance matrices. Analogously also the scaling factors given in (12), (14) and (15) have these properties. Therefore they hold also for all scaled and sample versions.
Moreover, for special functions ψ i the scaled dependence measures feature one further invariance.
Proposition 2.4 (Scale invariance). Let $\psi_i(x_i) = |x_i|^{\alpha_i}$ with $\alpha_i \in (0,2)$. Then the scaled measures $M$, $\overline{M}$, $R$, $\overline{R}$, $Mcor$, $\overline{M}cor$ and the corresponding sample versions are scale invariant, that is, they are invariant under the componentwise rescalings $X_i \mapsto a_i X_i$ with $a_i > 0$.
Proof. For the homogeneous $\psi_i$ the factors $a_i^{\alpha_i}$ appear identically in the unscaled measures and in the scaling constants, hence they cancel.
The key for statistical tests based on multivariance is the following convergence result. The presented result relaxes the required moments considerably in comparison to [8, Thm. 4.5, 4.10, Cor. 4.16, 4.18]; moreover also a new parameter β is introduced which will be useful in Section 6.

Theorem 2.5 (Asymptotics of sample multivariance). Let $X_i$, $i = 1,\ldots,n$ be non-constant random variables and let $X^{(k)}$, $k = 1,\ldots,N$ be independent copies of $X = (X_1,\ldots,X_n)$. Let either of the following conditions hold:

all $\psi_i$ are bounded (22)

or

finite second ψ-moments: for all $i = 1,\ldots,n$: $\mathbb{E}(\psi_i^2(X_i)) < \infty$. (23)

Then, denoting by $^N\!\mathcal{M}$ and $^N\overline{\mathcal{M}}$ the normalized sample versions of multivariance and total multivariance, for any $\beta > 0$:

$$N^{\beta}\cdot{}^N\!\mathcal{M}^2(X^{(1)},\ldots,X^{(N)}) \to \infty \text{ in probability, if } X_1,\ldots,X_n \text{ are } (n-1)\text{-independent but dependent,} \quad (24)$$
$$N\cdot{}^N\!\mathcal{M}^2(X^{(1)},\ldots,X^{(N)}) \xrightarrow{d} Q, \text{ if } X_1,\ldots,X_n \text{ are independent,} \quad (25)$$
$$N^{\beta}\cdot{}^N\overline{\mathcal{M}}{}^2(X^{(1)},\ldots,X^{(N)}) \to \infty \text{ in probability, if } X_1,\ldots,X_n \text{ are dependent,} \quad (26)$$
$$N\cdot{}^N\overline{\mathcal{M}}{}^2(X^{(1)},\ldots,X^{(N)}) \xrightarrow{d} \overline{Q}, \text{ if } X_1,\ldots,X_n \text{ are independent,} \quad (27)$$

where $Q$ and $\overline{Q}$ are Gaussian quadratic forms with $\mathbb{E}Q = 1 = \mathbb{E}\overline{Q}$.
Proof. Here we explain the main new ideas, the details are provided in Section 8.3. For the convergence in (25) and (27) there exist at least two methods of proof: As in [8] the convergence of empirical characteristic functions can be used. For this step a slightly relaxed (but technical) version of the log moment condition (see Remark 2.6) is necessary and sufficient, cf. [8, Remark 4.6.b]. An alternative approach (Theorem 8.3 in Section 8) uses the theory of degenerate V-statistics; this requires moments of second order with respect to $\psi_i$, but no further condition. Thus, in particular, for bounded $\psi_i$ the latter removes the log moment condition.
For β = 1 the divergence in (24) and (26) was proved in [8] under the condition (5), which ensures that the representations (3) of the limits of sample (total) multivariance are well defined and finite. Using the characteristic function representation (1) (which is always well defined, but possibly infinite) the divergence can be proved without (5), see Lemma 8.5 ff. in Section 8. Moreover, the arguments used therein work for any β > 0.
Note that in Theorem 2.5 the parameter β was only considered in the dependent cases. In the case of independent random variables one obtains the following result.
For almost sure convergence one has to look at the proof(s) of (25) and (27). Therein a key step is an application of the central limit theorem, which requires (in the given setting) exactly the factor N for convergence (in distribution) to a standard normally distributed random variable. Replacing therein N by $N^\beta$ with β < 1 and using Marcinkiewicz's law of large numbers, e.g. [25, Theorem 3.23], (or the law of the iterated logarithm) yields the limit 0 almost surely.
The choice of $\psi_i$ is intertwined with the invariance properties (Propositions 2.3 and 2.4) and the moment conditions (22) and (23). For the population measures also condition (5), and for the scaled measures also (17) and (18), have to be considered. In particular, it is possible to choose $\psi_i$ (or to transform the random variables) such that these conditions are satisfied regardless of the underlying distributions.
The standard choice is $\psi_i(x_i) = |x_i|^{\alpha_i}$, classically with $\alpha_i = 1$ (other choices might provide higher power in tests; a general $\alpha_i$ selection procedure is to our knowledge not yet available).
Nevertheless there are many other options for $\psi_i$, see [7, Table 1] for various examples, and there are at least a few reasons why one might choose a $\psi_i$ which is not (a power of) the Euclidean distance:
i) For unbounded $\psi_i$ condition (23) is required in Theorem 2.5. If the existence of these moments is unknown for the underlying distribution the convergence results might not hold. Here the use of a slower growing or bounded $\psi_i$ is a safer approach, see Example 9.13.
ii) The empirical size/power of the tests (details are given in Section 4) can depend on the functions $\psi_i$ used, see Example 9.11. Especially if some information on the dependence scale is known, the parameter $\delta_i > 0$ in $\psi_i(x_i) = 1 - e^{-\delta_i|x_i|^{\alpha_i}}$ can be adapted accordingly, see [8, Example 5.2] for an example using multivariance. Adaptive procedures for $\delta_i$ can be found in [19].
iii) A non-linear dependence of multivariance on sample distances might be desired, e.g. there might be application based reasons to use the Minkowski distance [20,Section 2.4.4].
An alternative approach to ensure the moment conditions is the following: transform each $X_i$ by a measurable function $f_i: \mathbb{R}^{d_i} \to D_i \subset \mathbb{R}^{s_i}$; independence of the $X_i$ implies independence of the $f_i(X_i)$. Moreover, if $d_i = s_i$ and the $f_i$ are bijective, then also the converse implication holds. Thus one way to ensure all moment conditions in Theorem 2.5, and preserve the (in)dependence, is to transform the random variables by bounded (bounded $D_i$) bijective functions $f_i$. But beware that with this approach the multivariance is neither translation invariant nor homogeneous, cf. Example 9.13.
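As an illustration of the transformation approach, take $f_i = \tanh$ (our arbitrary choice of a bounded bijection): even for Cauchy-type data, which has no finite first moment, every ψ-moment of the transformed variable exists trivially. A minimal sketch:

```python
import numpy as np

# deterministic Cauchy-type sample via quantiles: no finite moments of any order
u = np.arange(1, 20) / 20.0
x = np.tan(np.pi * (u - 0.5))

# bounded bijective transform: (in)dependence is preserved, and since
# |tanh| < 1 all psi-moment conditions hold for the transformed data
tx = np.tanh(x)

assert np.all(np.diff(np.tanh(np.sort(x))) > 0)  # strictly monotone, hence injective
assert np.all(np.abs(tx) < 1)                    # bounded image D_i = (-1, 1)
print(np.abs(x).max(), np.abs(tx).max())
```

As remarked above, multivariance computed from such transformed data is neither translation invariant nor homogeneous in the original variables.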

Comparison of multivariance to other dependence measures
In this section we compare multivariance to other dependence measures for random vectors X i ∈ R di . We only consider dependence measures which are closely related, in the sense that they are also based on characteristic functions or appear as special cases. In the papers introducing and discussing these measures comparisons with further dependence measures can be found.

Classical covariance, Pearson's correlation and the RV coefficient are limiting cases of multivariance
Let $n = 2$ and $\psi_i(x_i) = |x_i|^2$. Note that $|\cdot|^2$ is not characterizing in the sense of Proposition 2.2. It actually does not correspond to a Lévy measure, cf. [7, Table 1]. Thus the characteristic function representation (left hand side of (30)) does not hold and a value 0 of the right hand side does not characterize independence. Nevertheless, $|\cdot|^2$ is a continuous negative definite function and it is the limit for $\alpha_i \uparrow 2$ of $|\cdot|^{\alpha_i}$, which are valid for multivariance. Moreover, the right hand side of (30) is well defined also for $|\cdot|^2$, and it corresponds to classical linear dependence measures: Hereto denote by $X_{i,k}$ the components of the vectors $X_i$, i.e., $X_i = (X_{i,1},\ldots,X_{i,d_i})$ where $X_{i,k} \in \mathbb{R}$. By direct (but extensive) calculation the expectation representation in (30) becomes

$$M^2(X_1,X_2) = 4\sum_{k=1}^{d_1}\sum_{l=1}^{d_2} \operatorname{Cov}^2(X_{1,k},X_{2,l}).$$

Especially for $d_1 = d_2 = 1$ the (absolute value of) classical covariance is recovered. Note that for $n = 2$ and $\psi_i(\cdot) = |\cdot|^2$ the scaling constants (12), (14) and (15) become $2\operatorname{Var}(X_i)$, thus normalized multivariance coincides in this setting with both multicorrelations and with the absolute value of classical correlation. For arbitrary $d_1$ and $d_2$ the squared multicorrelations also coincide with the extension of correlation to random vectors developed in [15]. The corresponding sample versions $^N\!M$, $^N\!R$ and $^N\!Mcor$ coincide for $d_1 = d_2 = 1$ with (the absolute value of) Pearson's correlation coefficient, and $^N\!R^2$ and $^N\!Mcor^2$ coincide for arbitrary $d_1$ and $d_2$ with the RV coefficient of [35] (see also [24]). Note that also for $n > 2$ the right hand side of (30) with $\psi_i(x_i) = |x_i|^2$ is a well defined expression, which can be understood as an extension of covariance, Pearson's correlation and the RV coefficient to more than two random vectors.
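For $d_1 = d_2 = 1$ this limiting case can be verified numerically: with $\psi(x) = x^2$ the doubly centered matrices collapse to rank one, sample multivariance reduces to $2|\mathrm{Cov}_N|$ and its normalized version to Pearson's $|r|$. A sketch with 1/N-normalized sample moments (the data is arbitrary):

```python
import numpy as np

def doubly_centered(x, psi):
    D = psi(x[:, None] - x[None, :])
    return -(D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean())

x = np.arange(10.0)
y = x ** 2                           # any square-integrable dependent pair will do

A1 = doubly_centered(x, np.square)   # psi(x) = |x|^2, the limiting case alpha -> 2
A2 = doubly_centered(y, np.square)
M = np.sqrt(np.mean(A1 * A2))        # sample multivariance for n = 2

cov = np.cov(x, y, bias=True)[0, 1]  # 1/N-normalized sample covariance
assert np.isclose(M, 2 * abs(cov))

# dividing by the scaling constants 2 Var(X_i) recovers |Pearson correlation|
M_norm = M / np.sqrt(2 * x.var() * 2 * y.var())
assert np.isclose(M_norm, abs(np.corrcoef(x, y)[0, 1]))
print(M_norm)
```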

Multivariance unifies distance covariance and HSIC
In the case of two random variables (that is, $n = 2$) multivariance coincides with generalized distance covariance [7] and the following (simplified) representations hold (using [7, Eq. (30)], direct calculations, [24, Eq. (3.2)] and the notation $\bar\psi = 1 - \psi$). The last line is included to emphasize that further interesting representations exist; this one actually provides a characterization of independence using (the classical linear dependence measure) covariance. Other equivalent representations are Brownian distance covariance [41] (for $\psi(\cdot) = |\cdot|$) and its generalization Gaussian distance covariance [7, Section 7]. Note that (34) is for $\psi_i(x_i) = |x_i|^{\alpha_i}$ distance covariance [42], and (33) is for $\psi_i(x_i) = 1 - e^{-\delta_i\tilde\psi_i(x_i)}$ (where $\tilde\psi_i$ can be any continuous real-valued negative definite function, e.g. $|\cdot|^{\alpha_i}$, and $\delta_i > 0$) the Hilbert Schmidt Independence Criterion (HSIC, [18]) with kernel $k_i(x,y) = e^{-\delta_i\tilde\psi_i(x-y)}$. For the latter just note that for any continuous positive definite function $\phi$ the function $\phi(0) - \phi$ is continuous negative definite (cf. [22, Corollary 3.6.10]), i.e., it fits into the framework of multivariance. The equivalence of kernel based approaches and distance based approaches was noted in [38], see also [39] for a recent discussion. But note that the approach in [38] to the correspondence of kernels and distance functions only works for the case $n = 2$, whereas the above correspondence also extends to the multivariate setting.
In other words, in the case n = 2 multivariance with bounded measures ρ i coincides with HSIC and special cases of unbounded ρ i yield distance covariance. Therefore, in general, multivariance is an extension of these measures to more than two variables. But note that there is also at least one alternative extension as we will discuss in the next section.
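The HSIC correspondence can also be checked numerically: for $\psi_i(x) = 1 - e^{-\delta_i|x|^2}$, i.e. the Gaussian kernel $k_i(x,y) = e^{-\delta_i|x-y|^2}$, the doubly centered ψ-distance matrix equals the doubly centered Gram matrix, so squared sample multivariance coincides with the biased sample HSIC. A sketch for $n = 2$, one-dimensional marginals and $\delta_i = 1$:

```python
import numpy as np

def doubly_centered(K):
    return K - K.mean(axis=0) - K.mean(axis=1)[:, None] + K.mean()

x = np.linspace(0.0, 3.0, 40)
y = np.sin(x)                        # some dependent pair

def gram(v):                         # Gaussian kernel, delta = 1
    return np.exp(-(v[:, None] - v[None, :]) ** 2)

# multivariance route: psi = 1 - k, then A = -(doubly centered psi-distances)
A1 = -doubly_centered(1.0 - gram(x))
A2 = -doubly_centered(1.0 - gram(y))
M2 = np.mean(A1 * A2)                # squared sample multivariance

# kernel route: biased HSIC estimate (1/N^2) tr(K1 H K2 H), H the centering matrix
N = len(x)
H = np.eye(N) - np.ones((N, N)) / N
hsic = np.trace(H @ gram(x) @ H @ gram(y)) / N ** 2

assert np.isclose(M2, hsic)
print(M2)
```

The key step is that the constant matrix of ones is annihilated by the double centering, so centering $1 - K$ and flipping the sign is the same as centering the Gram matrix $K$ itself.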
As discussed in Remark 2.8, the cases with bounded measures have the advantage that most moment conditions are trivially satisfied and that in the case of HSIC the parameters $\delta_i$ provide a somewhat natural bandwidth selection parameter. In contrast, using unbounded measures $\rho_i$ corresponding to $|\cdot|^{\alpha_i}$ provides (scaled) measures with superior invariance properties (Propositions 2.3 and 2.4). Note that also in this case the parameters $\alpha_i$ offer some variability.
As a side remark, note that by the above it is straightforward that multivariance with $\tilde\psi_i$ is the derivative (in the bandwidth parameter at $\delta_i = 0$) of multivariance corresponding to $1 - e^{-\delta_i\tilde\psi_i}$; this relation of distance covariance and HSIC was noted in [4]. Incidentally, it is also the key for relating Lévy processes to their generators, e.g. see the introduction of [9].
Finally note that also the other measures discussed in the next section reduce for the case $n = 2$ to the above setting, thus they are included (or closely related, as [23], which considers a joint measure ρ without product structure).

Independence of more than two random vectors
As a consequence of Theorem 2.1 the multivariances of all subfamilies of the variables X 1 , . . . , X n characterize jointly their independence. In fact, this was suggested in [4] as an approach to independence via multiple testing, i.e., via computing the p-value for each of these 2 n − n − 1 multivariances separately. The approach is complementary to the global test using total multivariance.
In [4] multivariance is considered in disguise: expanding the integrand of (30) and using the linearity of the expectation yields a linear combination of expectations of products of the factors $e^{\mathrm{i}\,t_i\cdot X_i}$ and $f_{X_i}(t_i)$. This representation of the product is also called the Möbius transformation of the characteristic functions. Without the characteristic function representation (with $\psi_i$ based on kernels $k_i$) the multivariance of 3 random variables appeared before under the name "(complete) Lancaster interaction" in [37].
Other popular multivariate dependence measures based on characteristic functions are of the following form, which is here stated using our setting (with the notation $\bar\psi = 1 - \psi$ and $\rho_i(\mathbb{R}^{d_i}) = 1$; to reformulate it for positive definite kernels use the correspondence provided in Section 3.2). Such dependence measures go back at least to [26, (1.3)]. It is important to note that the equality in (36) does not hold in general for unbounded measures $\rho_i$, e.g. for $n = 3$, $X_1, X_2$ dependent (satisfying (5)) and $X_3$ constant the term (36a) is infinite but (36b) is finite. Nevertheless, dependence measures of type (36) for $\rho = \otimes_{i=1}^n \rho_i$ with bounded and unbounded $\rho_i$ were recently discussed in [16] (in the unbounded case [16, Lemma 1a] only provides a rather complicated sample version, which actually corresponds to (36b); a proof can be found in Section 8.4), for finite $\rho_i$ representation (36) corresponds to the also recently introduced measure dHSIC of [32], and for an unbounded (joint measure) ρ associated to $\psi(\cdot) = |\cdot|$ it was considered in [23] (in this case (36b) has a slightly different form).
Here the kernels are of the form $k_i(x,y) = \phi(x-y)$ where $\phi$ is a continuous positive definite function; for the non translation invariant case see Section 3.5. Moreover, note that we assume here $\phi(0) = 1$ to avoid distracting constants in the presentation. HSIC and dHSIC additionally require that the kernel is characterizing, which is by Proposition 2.2 equivalent to the full support property of ρ.
The above illustrates that also for measures derived via (36) various approaches can be unified using the framework of continuous negative definite functions and Lévy measures.
To compare (36) with multivariance, note that in [8, Section 3.5] it was shown that for any given multivariance there exist special kernels (beyond the restrictions of the above papers) which turn (36b) into multivariance. With the usual kernels the left hand sides of (30) and (36) coincide if the given random variables are $(n-1)$-independent [7, Corollary 3.3]. Without $(n-1)$-independence multivariance does not characterize independence, but total multivariance $\overline{M}(X_1,\ldots,X_n)$, given by

$$\overline{M}{}^2(X_1,\ldots,X_n) = \sum_{2\le m\le n}\ \sum_{1\le i_1<\ldots<i_m\le n} M^2(X_{i_1},\ldots,X_{i_m}), \quad (37)$$

does characterize independence. The approach (36) and total multivariance (37) require similar moment conditions (the variant in [23] requires a joint first moment) and the computational complexity of the sample versions is similar (the variant in [23] has a higher complexity, but they also provide an approximate estimator with the same complexity). Total multivariance needs one product of doubly centered distance matrices whereas (36b) needs three products of different distance matrices (which actually coincide with those used for the double centering). Nevertheless, both approaches differ: in Section 8.5 we calculate explicitly the difference of the population measures for the case $n = 3$, indicating that it is by no means theoretically obvious which approach might be more advantageous. Here certainly further investigations are required. A practical difference is the fact that the current implementation of dHSIC [33] requires $N > 2n$; for multivariance there is no such explicit restriction.
Generally, papers on dependence measures differ not only in their measures, but also in their methods of testing. For the approach (36) various methods have been proposed, of which the resampling method seems most popular. For multivariance we introduce the resampling method in Section 4. But there are also further (and faster) methods available for multivariance: distribution-free tests are used in [8] (see Theorem 4.4), and in [3] tests based on moments of the finite sample or limit distribution and/or using eigenvalues of the associated Gaussian process are developed, see also [19].

Pairwise independence
In Section 5 we introduce m-multivariance. In particular, 2-multivariance provides a global test for pairwise independence without any condition (when using bounded ψ_i) or under the mild moment condition (23), see Test 5.4. A related approach to pairwise independence using distance covariance was developed in [47], but in contrast it requires assumptions which are needed for the application of a (generalized) central limit theorem. The methods are compared in Example 7.3.

Generalizations
The setting of HSIC and also extensions of distance covariance are applicable to more general spaces than R^d. In these settings the representation via characteristic functions and the characterization of independence (might) fail. Nevertheless, the representations given in (3) can canonically be extended to negative definite kernels n(x_i, x_i′) replacing ψ(x_i − x_i′). Thus it seems a natural guess that the key properties required for testing can be recovered in this setting, but to our knowledge this has not been studied yet.
For distance covariance there also exist further modifications, such as the affinely invariant distance correlation of [14]. This extension also seems possible for multivariance. It is only defined for random vectors with non-singular covariance matrices, and in this setting it would be a candidate to satisfy the set of axioms given in the next section [31, Example 3].

Axiomatic classification of dependence measures
Rényi [34] proposed in 1959 a set of axioms which a dependence measure should satisfy. These have been challenged over the years, most recently e.g. in [31]. They propose "four simple axioms" which a dependence measure d should satisfy, and distance correlation is called the "simplest and most appealing" measure which satisfies these axioms. All axioms were proposed for pairwise comparisons of random variables or vectors. We present here a multivariate extension to n non-constant random vectors (constants are removed to avoid technical difficulties, cf. [31]):
(A1) d(X_1, . . . , X_n) = 0 if and only if X_1, . . . , X_n are independent.
(A2) d(X_1, . . . , X_n) is invariant with respect to similarity transforms of the X_i.
(A3) d(X_1, . . . , X_n) = 1 if X_1, . . . , X_n are related by similarity transforms (see (38) for details).
(A4) d(X_1^(k), . . . , X_n^(k)) converges to d(X_1, . . . , X_n) if (X_1^(k), . . . , X_n^(k)) converges in distribution to (X_1, . . . , X_n) (under a uniform moment condition, which ensures the finiteness of the measures).
Note that [31] uses a further common axiom, normalization (d(. . .) ∈ [0, 1]), which above is only indirectly assumed, and their (A3) was stronger: it contained "if and only if" with a seemingly more restrictive relation which actually explicitly forced the dimensions of the random vectors to be identical. Note that in the related (original) axiom [34, Axiom E] also only the "if" part was required, and a footnote explicitly advised against strengthening it.
In the setting of multivariance we say that random variables X_i and X_k are related by similarity transforms if they satisfy (38). A prerequisite for the continuity (A4) is the finiteness of the measure d, cf. [31]. Thus all considerations for (normalized) multivariance are under the moment condition (5), and for the multicorrelation we have to assume (18). Based on Propositions 2.3 and 2.4 the invariance with respect to similarity transforms holds for ψ(x) = |x|^α, and it seems (cf. [7, Section 5]) that for other unbounded and all bounded ψ the invariance fails. Therefore we only consider ψ(x) = |x|^α. Table 1 indicates which axioms are satisfied by the measures; all properties follow by direct calculations (the continuity uses the dominated convergence theorem; for the normalization a generalized Hölder inequality is used, see also [8, Proposition 4.13]). For multicorrelation the properties vary according to whether the number of variables is even or odd: R always yields a measure with values in [0, 1], whereas M cor always yields the reference value 1 for variables related by similarity transforms. Note that for a multivariate normal distribution the value of total distance multivariance is (for the special case ψ(x) = |x|) linked to its correlation by [10, Proposition 2].
By Table 1 the four axioms are simultaneously satisfied by M cor. But recall that R and M cor lack efficient sample versions. In the sample setting also N · NM² and its total counterpart provide statistically interpretable values (indirectly via the corresponding p-value; and also a rough direct interpretation: they are positive and their expectation is 1 for independent random variables, thus values much larger than one hint at dependence). Moreover, normalized multivariance requires only the moment condition (5), whereas multicorrelation requires the more restrictive condition (18). Finally, note that in the case n = 2 the multicorrelations coincide. Thus, in particular, M cor_2 (defined in Section 5) provides a measure with an efficient sample estimator. For this measure a value of 0 characterizes only pairwise independence, but the value 1 occurs if and only if the random variables are related by similarity transforms.

Table 1: Dependence measure axioms which are satisfied by (variants) of (total) multivariance for ψ(x) = |x|^α.
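To make the sample statistics discussed above concrete, the following minimal sketch (in Python rather than the R-package 'multivariance'; the function names are ours, and we assume the Euclidean case ψ(x) = |x|) builds the doubly centered distance matrices A_i and computes the sample multivariance as the root of the mean of their entrywise product:

```python
import numpy as np

def doubly_centered_distance_matrix(x):
    """A = -C D C with D the Euclidean distance matrix and C = I - J/N."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a vector of scalars as N samples in R^1
    D = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2))
    N = len(x)
    C = np.eye(N) - np.ones((N, N)) / N
    return -C @ D @ C

def sample_multivariance(samples):
    """Root of the mean of the entrywise product of the matrices A_i."""
    N = len(samples[0])
    P = np.ones((N, N))
    for x in samples:
        P *= doubly_centered_distance_matrix(x)
    return np.sqrt(max(P.mean(), 0.0))  # guard against tiny negative means
```

Note that a constant component yields a zero matrix A_i and hence a sample multivariance of 0, matching the convention that constants are independent of everything.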
A first discussion of the behavior of (total) multivariance when one enlarges the family of random variables can be found in [8, Proposition 3.7, Remark 3.8], which translates directly to multicorrelation.

Testing independence using multivariance
In this section we extend the discussion of [8, Section 4.5]. We use the notation of Section 2; in particular, X^(k), k = 1, . . . , N, are samples of independent copies of (X_1, . . . , X_n). Based on Theorem 2.5, and recalling the fact that constant random variables are always independent, the following structure of a test for independence is obvious.
Test 4.1 (Test for n-independence, given (n − 1)-independence). Let ψ_i be bounded or (23) be satisfied. Then a test for independence is given by (39). The value R will be discussed below.
Remark 4.2. Note that also without the assumption of (n − 1)-independence (39) provides a test for independence for which the type I error can be controlled by the choice of R, since the distribution of the test statistic under the hypothesis of independence is known, see (25). But in this case it is unknown whether the test statistic diverges if the hypothesis does not hold. Thus one cannot control the type II error and the test will not be consistent against all alternatives (regardless of the satisfied moment conditions). A trivial example is the case where one random variable is constant, and thus the test statistic is always 0. But note that with the assumption of (n − 1)-independence this problem does not appear, since (n − 1)-independence implies (given that at least one random variable is constant) that the random variables are independent.
Analogous to Test 4.1, using total multivariance instead of multivariance, one gets the following test for independence.

Test 4.3 (Test for (n-)independence). Let ψ_i be bounded or (23) be satisfied. Then a test for independence is given by (40).
To get a test with significance level α ∈ (0, 1) the natural choice for R in (39) and (40) is the (1 − α)-quantile of the (limiting) distribution of the test statistic under H_0, i.e., assuming that the X_i are independent. To find this distribution explicitly, or at least to have good estimates, is non-trivial, see [3] for an extensive discussion. As a starting point one can follow [42, Theorem 6], where a general estimate for quadratic forms of Gaussian random variables given in [40] is used to construct a test for independence based on distance covariance. In our setting this directly yields the following result.
Theorem 4.4. Tests 4.1 and 4.3 with R given by (41) are (conservative) tests with significance level α. Here F_{χ²₁} is the distribution function of the chi-squared distribution with one degree of freedom.
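The (1 − α)-quantile of χ²₁ needed for such a conservative rejection level can be computed with the Python standard library alone via the relation χ²₁ = Z² (a sketch; the function names are ours, not part of the 'multivariance' package):

```python
from statistics import NormalDist

def conservative_rejection_level(alpha):
    """(1 - alpha)-quantile of chi-squared with 1 df.

    Since chi^2_1 = Z^2 for a standard normal Z, the quantile is
    (Phi^{-1}(1 - alpha/2))^2.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z

def conservative_test(scaled_statistic, alpha=0.05):
    """Reject independence if N times the normalized statistic exceeds R."""
    return scaled_statistic > conservative_rejection_level(alpha)
```

For α = 0.05 this gives the familiar R ≈ 3.84; since the quadratic-form bound ignores the marginal distributions, the resulting test is conservative, as discussed below.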
In the case of univariate Bernoulli random variables the significance level α is achieved (in the limit) by Test 4.1 with R given in (41), see [3, Remark 4.27]. But in other cases it might be very conservative, e.g. Example 9.11 (Figure 22). Recall that total multivariance is the sum of 2^n − n − 1 distance multivariances (this is the number of summands in (2)). Thus one distance multivariance with a large value might be averaged out by many small summands, see Example 9.14. Hereto m-multivariance (which will be introduced in the next section) provides an intermediate remedy. It is also a sum of multivariances, but with fewer summands. Thus the 'averaging out' (also known as the 'statistical curse of dimension') will still be present, but less dramatically.
Note that R in Theorem 4.4 is provided by a general estimate for quadratic forms. In general it yields conservative tests, since it does not take the specific underlying (marginal) distributions into account. Less conservative tests can be constructed if the distributions are known, or by estimating these distributions. The latter can be done by a resampling approach or by a spectral approach, similar to the case of distance covariance (see [38, Section 7.3]). Methods related to the spectral approach are developed in [3].
In the following the resampling approach for M is introduced. The procedure is certainly standard to experts; nevertheless it seems important to recall it (to avoid ambiguity): Suppose we are given i.i.d. samples x^(1), . . . , x^(N) with unknown dependence, i.e., for each i the dependence of the components x^(i)_1, . . . , x^(i)_n is unknown. Now, resampling each component separately yields (almost) independent components. Thus Test 4.1 (respectively Test 4.3 with M) becomes a resampling test (resampling without replacement / permutation test) with L ∈ N replications using the rejection level R given by (42), where each p_k is a random permutation of 1, . . . , N. If one allowed each p_k(1), . . . , p_k(N) to be any sample of 1, . . . , N, this would also be a resampling test (resampling with replacement / bootstrap test), but note that the permutation test can be implemented more efficiently.
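The resampling scheme just described can be sketched generically; below is a minimal Python version (our own illustration, not the package implementation) which permutes each component independently and returns a permutation p-value for an arbitrary statistic:

```python
import numpy as np

def permutation_pvalue(samples, statistic, L=1000, seed=None):
    """Permutation test: each component is permuted separately,
    which emulates sampling under the hypothesis of independence.

    samples: list of arrays, one per component, each of length N.
    statistic: callable mapping such a list to a scalar test statistic.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(samples)
    exceed = 0
    for _ in range(L):
        permuted = [x[rng.permutation(len(x))] for x in samples]
        if statistic(permuted) >= observed:
            exceed += 1
    # standard finite-sample correction: the observed sample counts itself
    return (exceed + 1) / (L + 1)
```

Plugging in the scaled sample multivariance as the statistic yields Test 4.1 in its resampling form; the empirical (1 − α)-quantile of the resampled statistics then serves as the rejection level R.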
Similarly, Test 4.1 (respectively Test 4.3 with M) becomes a Monte Carlo test with L ∈ N replications using the rejection level R given by (43), where x^(i,l)_k, k = 1, . . . , n, i = 1, . . . , N, l = 1, . . . , L, are independent samples and for each fixed k the x^(i,l)_k are distributed as X_k.

Remark 4.5. In [32, Section 3.2] two related resampling tests are introduced for dHSIC. But note that they use slightly different terminology, i.e., therein the 'permutation test' considers samples as in (42), but instead of random permutations all permutations are considered. For the 'bootstrap test' they use all resamplings of the sample distribution of each variable. This yields L = (N!)^n and L = N^{N·d} resamplings, respectively, which is infeasible even for relatively small N. Thus in [32, Section 4.2] they also use randomly selected samples instead of all samples, and they call the resulting estimators 'Monte-Carlo approximations' of the estimators.

m-multivariance
Pairwise independence is the prime requirement for various fundamental tools in stochastics, e.g. the classical law of large numbers. Especially when working with many variables (n large) a multiple testing approach might not be feasible. Thus a global test for pairwise independence has many applications, see also the motivation in [47]. Here we construct such a test, together with further generalizations which allow the successive testing of 2-independence, 3-independence, etc.
Define for m ∈ {2, . . . , n} the m-multivariance M_m by (44). Theorem 2.1 then instantly yields the following characterization.
As in the case of multivariance, using (8), a strongly consistent estimator for M_m is the sample m-multivariance (46). Analogous to the case of normalized (total) multivariance, the normalized sample m-multivariance NM_m is given by the same expression with the A_i replaced by the normalized matrices defined in (12). For (sample) m-multivariance the invariance properties (Propositions 2.3 and 2.4) hold analogously. To ensure that the expectation representation of m-multivariance (analogous to (3)) is finite, the following condition (which is weaker than (5)) is required: finite joint ψ-moments for all families of size at most m, i.e., for all S ⊂ {1, . . . , n} with |S| ≤ m.

Note that the sum over 1 ≤ i_1 < . . . < i_m ≤ n has (n choose m) summands, which might be a lot to compute. These sums can be simplified using the multinomial theorem. In particular, for m = 2, 3 the resulting expressions of sample m-multivariance are easier to evaluate (analogous representations hold for the normalized sample m-multivariance). Thus at least for small m these estimators are easy to compute and, analogous to the case of (total) multivariance, they can be used to test m-independence by the next results.
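For m = 2 the multinomial simplification is, entrywise, Σ_{i<j} A_i ∘ A_j = ((Σ_i A_i)² − Σ_i A_i²)/2, so only O(n) matrix passes are needed instead of (n choose 2) products. A sketch (our own helper name; we assume the statistic averages over the (n choose 2) pairs, matching the normalization that gives expectation 1 in the limit):

```python
import numpy as np

def sample_2_multivariance(A_list):
    """Sample 2-multivariance from (doubly centered) matrices A_i,
    averaged over the n-choose-2 pairs, using the entrywise identity
    sum_{i<j} A_i*A_j = ((sum_i A_i)^2 - sum_i A_i^2) / 2."""
    S = sum(A_list)                      # one pass: sum of all matrices
    S2 = sum(A * A for A in A_list)      # one pass: sum of entrywise squares
    pairwise = (S * S - S2) / 2.0        # = sum over all pairs i < j
    n = len(A_list)
    return np.sqrt(max(pairwise.mean() / (n * (n - 1) / 2), 0.0))
```

The identity holds for arbitrary matrices, which makes it easy to verify against the brute-force sum over all pairs.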
Theorem 5.2. Let X_i, i = 1, . . . , n, be non-constant random variables and let X^(k), k = 1, . . . , N, be independent copies of (X_1, . . . , X_n). If either the ψ_i are bounded or (23) holds, then for m ≤ n the convergence (50) holds in the case of m-independence, where Q is a Gaussian quadratic form with EQ = 1, and the divergence (51) holds otherwise.
Proof. Let the assumptions of Theorem 5.2 be satisfied. Then (50) holds, since in this case (25) implies the convergence of each of the (n choose m) summands of (46) to a Gaussian quadratic form with expectation 1. Thus, due to the normalizing factor in (46), the limiting distribution has expectation 1. Further note that given (23) all these quadratic forms can be expressed as stochastic integrals with respect to the same process. The divergence (51) follows by (24), which implies under the given assumptions that at least one summand of (46) diverges.
Analogous to the case of (total) multivariance, the above theorem immediately yields a test for m-independence which is (under the given moment conditions) consistent against all alternatives.
Test 5.3 (Test for m-independence). Let the ψ_i be bounded or (23) be satisfied. Then a test for m-independence is given by rejecting the hypothesis if N · NM_m² > R, with R as discussed in Section 4. (Note that one has to replace M by M_m in (42) and (43) to get R for the resampling test and the Monte Carlo test, respectively.)

Test 5.4 (Test for pairwise independence). Let the ψ_i be bounded or (23) be satisfied. Then a test for pairwise independence is given by rejecting the hypothesis if N · NM_2² > R, with R as discussed in Section 4. (Note that one has to replace M by M_2 in (42) and (43) to get R for the resampling test and the Monte Carlo test, respectively.)

Examples of the use of m-multivariance are given in Section 7, e.g. Example 7.3, and in the supplement. To round up this section we discuss some related estimators.
1. Analogously to total multivariance one can define total m-multivariance for X = (X_1, . . . , X_n), which yields an estimator for 3-independence. In fact, in the case of 2-independence this provides (assuming (47) and using [8, Corollary 4.7]) a weakly consistent estimator for M_3. Hereto just note that the sums of all mixed terms of the form ((A_i)_kl)² (A_j)_kl with i ≠ j are estimators for multivariances like M(X_i, X_i, X_j), and the factorization for independent subsets applies.
2. The 2-multicorrelation coincides with the (analogously defined) R_2, since for n = 2 the scaling factors in (14) and (15) coincide. Moreover these factors have for each summand in (56) the same exponent, thus (in contrast to R and M cor) one gets efficient sample representations by replacing the A_i in (48) by those in (14) (or equivalently (15)). This correlation satisfies all the dependence measure axioms of Section 3.6 when one replaces (A1) by the characterization of pairwise independence, see Table 1.

Dependence structure visualization and detection
In this section a visualization of higher order dependencies of random variables X 1 , . . . , X n using an undirected graph is introduced. The population version and estimation procedures are discussed. In Section 7 and in the supplement various examples are presented. The implementation of the visualization in R relies in particular on the package igraph [12].
The dependence structure graph consists of three elements (cf. Figure 1): • Circled nodes denote random variables.
• Non-circled nodes ('dependency nodes') are primarily used to denote the dependence of the connected nodes and a label might denote the strength of the dependence or a related quantity (in our sample setting it is the value of the test statistic N · N M 2 or the order of dependence) of the connected nodes. Secondarily they might be used to represent the 'random variable' which consists of all components of a connected cluster, e.g. in the right graph in Figure 1 the node with label '97.1' represents the cluster of X 1 , X 2 , X 11 .
A visualization of the full dependence structure is constructed by adding the corresponding 'dependency nodes' and edges for any m-tuple of X_1, . . . , X_n which is m-dependent but (m − 1)-independent. In general this graph can be very overloaded, see Example 9.9.
The direct approach to the full dependence structure based on samples is to test successively all (m − 1)-independent m-tuples for m-independence for m = 2, . . . , n, adjust the p-values appropriately for multiple testing and add the significant dependency nodes and edges. For such a test procedure a direct visualization of the tests' p-values was introduced in [17]: the dependogram. Note that the full dependence structure visualizes the (lowest order) significant findings of a dependogram, see Example 7.1 for more details. In practice a visualization of the full dependence structure is only feasible for small n, since for n random variables there are 2^n − n − 1 = Σ_{k=2}^{n} (n choose k) tuples to consider.

To overcome (or at least to reduce) the drawbacks of the full dependence structure one can use a clustered dependence structure. Hereto each set of connected vertices in an undirected graph will be called a cluster. Then the clustered dependence structure graph is constructed by the following algorithm:

0. Include the circled nodes for X_1, . . . , X_n.
1. Let k be the number of clusters currently in the graph and Y i , i = 1, . . . , k be random variables which have as components the connected X j of cluster i, e.g. Y 1 = (X 1 , X 2 , X 11 ) if X 1 , X 2 , X 11 are connected via some edges. (In the very first run this amounts to: k := n and Y i := X i .) Moreover, set m = 2.
2. If m > k the graph construction is finished, otherwise: For all m-dependent subsets of Y_1, . . . , Y_k add the corresponding dependency nodes and edges (connected to some non-circled node representing the cluster, if the cluster consists of more than one random variable) to the graph. If new nodes were introduced, go to step 1, otherwise repeat this step with m increased by 1.
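The steps above can be sketched as follows (a Python illustration with a generic dependence test passed in as a callback; the merging via union-find and the restart with pairs after each merge are our reading of steps 1 and 2):

```python
from itertools import combinations

def clustered_dependence_structure(variables, is_dependent):
    """Sketch of the clustered dependence structure algorithm.

    variables: labels for X_1, ..., X_n.
    is_dependent: any m-dependence test taking a tuple of clusters
    (frozensets of labels) -- a hypothetical callback, in practice a
    test based on multivariance with multiple-testing correction.
    """
    clusters = [frozenset([v]) for v in variables]   # step 0
    dependency_nodes = []   # each entry: the tuple of clusters joined by a node
    m = 2
    while m <= len(clusters):                        # step 2
        found = [combo for combo in combinations(range(len(clusters)), m)
                 if is_dependent(tuple(clusters[i] for i in combo))]
        if not found:
            m += 1
            continue
        dependency_nodes += [tuple(clusters[i] for i in combo) for combo in found]
        # merge all clusters of each dependent tuple (union-find)
        parent = list(range(len(clusters)))
        def root(i):
            while parent[i] != i:
                i = parent[i]
            return i
        for combo in found:
            for i in combo[1:]:
                parent[root(i)] = root(combo[0])
        groups = {}
        for i in range(len(clusters)):
            groups.setdefault(root(i), []).append(i)
        clusters = [frozenset().union(*(clusters[i] for i in g))
                    for g in groups.values()]        # step 1: new Y_i
        m = 2                                        # restart with pairs
    return clusters, dependency_nodes
```

For instance, with a callback declaring {X_a, X_b} pairwise dependent and the remaining dependence of order 3, the loop first merges a and b and then detects the 3-dependence of the new cluster with the remaining variables.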
Since dependence and independence are not transitive, some information might be lost in the clustered dependence structure. Nevertheless, note that clustering preserves dependence, e.g. if at least one of the random variables X i , i ∈ I is dependent with one of X k , k ∈ K then also (X i , i ∈ I ∪ J) is dependent with (X k , k ∈ K ∪ L).
The visualization algorithm for a clustered dependence structure based on samples is analogous to the above, except that in step 2 the m-independence has to be tested. Here one can choose (as we do) to skip sets of variables which have been tested before, i.e., sets which remained unchanged after the last cluster detection. But in any case the p-values have to be adjusted appropriately for multiple testing.
The appropriate adjustment of p-values due to multiple testing is the subject of many debates. For the full dependence structure and for the clustered dependence structure the situation is complicated by the fact that the total number of tests is unknown at the beginning, and the results of the tests in one step influence (by indicating that some tuples are lower order dependent, or by clustering) the data for the tests thereafter. Thus adjusting p-values after clustering would usually require new tests. An approach which avoids this uses Holm's method separately for each set of multiple tests in step 2 of the algorithm, but one has to keep in mind that thereby the global type I error bound increases with each set of tests.
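Holm's step-down method, as used here for each batch of tests in step 2, is easy to state in code (a generic sketch, independent of the 'multivariance' package):

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm's step-down method: compare the i-th smallest p-value with
    alpha / (k - i) for i = 0, ..., k-1 and stop at the first
    non-rejection. Controls the family-wise error rate at level alpha."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    reject = [False] * k
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alpha / (k - rank):
            reject[idx] = True
        else:
            break   # step-down: all larger p-values are also not rejected
    return reject
```

Note that Holm's method is uniformly more powerful than the plain Bonferroni correction while requiring no independence assumption on the tests.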
In general one might also distinguish between visualizations of the results of tests using a given significance level (in this case there is a bound for the type I error based on the significance level, which also depends on the correction for multiple testing used) and visualizations using consistent estimators (if these exist they might also be based on tests, but then the significance or rejection level is adapted based on the sample size, which might make it harder to get explicit error estimates). In the case of tests with a fixed significance level, the range of significance levels which yield the same test results might give an additional indication of the reliability of the detection.

Remark 6.1 (Comment on detection errors). The probability of a type I error can be estimated and/or bounded by the choice of the rejection level or significance level. But a type II error bound or estimate might not be available. In this case one has to keep in mind that, due to the successive estimation/testing procedure, a type II error (i.e., a not detected dependence) for some tuple can yield a detected higher order dependence for a superset of the tuple. Thus in this case the higher order dependence still indicates that the components of the tuple are not independent (but they might not be lower order independent).
All of the above applies to the use of any multivariate dependence measure or test in the dependence structure detection algorithms. Now we turn explicitly to the case of multivariance.

Dependence structure detection using distance multivariance
For consistent estimates using multivariance the following observation is essential (cf. Theorem 2.5 and Corollary 2.7):

Corollary 6.2 (Consistent dependence estimation). Let X_i, i = 1, . . . , n, be (n − 1)-independent non-constant random variables and let X^(k), k = 1, . . . , N, be independent copies of (X_1, . . . , X_n). If either the ψ_i are bounded or (23) holds, then for any β ∈ (0, 1) the test statistic grows almost surely slower than N^{1−β} in the case of independence and faster in the case of dependence.

Hence using R = R(N) := N^{1−β} · C for any fixed constants β ∈ (0, 1), C > 0 in the independence Tests 4.1, 4.3, 5.3, 5.4 provides strongly consistent tests, in the sense that (under the assumptions of the tests) the test result converges almost surely to the correct statement as N → ∞. Clearly the convergence speed of the estimator depends on the choice of β and C (see below for a rough error estimate).
Therefore there are several options for the dependence structure detection using Test 4.1: the very fast but conservative rejection level given in Theorem 4.4, the (extremely) slow but approximately sharp rejection level provided by the resampling approach (42) (or by the Monte Carlo approach (43)) and the value R := N 1−β · C for consistent tests.
Of the above options the conservative estimate and the resampling approach provide (almost) directly also the corresponding p-values, but only the conservative and the consistent approach are feasible. For the resampling approach the sample size would have to be adapted (increased!) if the p-values are adjusted, yielding in general an extremely slow algorithm. For the consistent estimator the corresponding p-value can only be estimated (using one of the other methods) and the convergence rates have not been analyzed in detail yet, thus the actual type I error for a given finite sample is not directly available. Nevertheless note that fast and approximately sharp methods to estimate the p-values of multivariance are developed in the preprint [3], see also [19]. Moreover, an approximation of an upper bound for the type I error of the consistent estimator is given by the following: If one performs k independent sharp tests with significance levels γ_i then the probability of a type I error is 1 − Π_{i=1}^{k} (1 − γ_i). In the setting of multivariance the tests are in the limit (under H_0) independent and γ_i ≤ 1 − F_{χ²₁}(N^{1−β} · C). Posterior to testing the number of tests performed is known, say k, and the bound becomes 1 − (F_{χ²₁}(N^{1−β} · C))^k for the consistent estimator. Concerning β and C note that for the estimator discussed in Corollary 6.2 with β close to 1 the convergence to 0 (in the case of independence) becomes slower, and for β close to 0 the divergence to ∞ (in the case of dependence) becomes slower; here β = 1/2 seems a balanced choice. For the value of C an optimal recommendation is still open; we will use C = 2 in our examples, where it seems a reasonable choice. Naturally, the constant C could also be based on a rejection level for a fixed sample size, e.g. choose C such that 5 · C is the rejection level for a sample of size 25 for a significance level of 0.05.
Then at least for this sample size the probability of a type I error is known, but this would still require some p-value estimation.
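The consistent rejection level R(N) = N^(1−β) · C and the resulting posterior type I error bound can be evaluated with the standard library alone (F_{χ²₁} via the normal CDF; the function names are ours; we use the bound 1 − F_{χ²₁}(R)^k, which follows from γ_i ≤ 1 − F_{χ²₁}(R)):

```python
from statistics import NormalDist

def chi2_1_cdf(x):
    """F_{chi^2_1}(x) = P(Z^2 <= x) = 2*Phi(sqrt(x)) - 1 for x > 0."""
    return 2 * NormalDist().cdf(x ** 0.5) - 1 if x > 0 else 0.0

def consistent_rejection_level(N, beta=0.5, C=2.0):
    """R(N) = N^(1 - beta) * C, the level used for consistent tests."""
    return N ** (1 - beta) * C

def type_one_error_bound(N, k, beta=0.5, C=2.0):
    """Approximate posterior bound 1 - F_{chi^2_1}(R(N))^k for k
    asymptotically independent tests performed with level R(N)."""
    return 1 - chi2_1_cdf(consistent_rejection_level(N, beta, C)) ** k
```

With the default β = 1/2 and C = 2 one gets R(25) = 10, and the bound shrinks as N grows, reflecting the consistency of the procedure.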
Finally, note that the above are basic algorithms. There are certainly several variants and extensions possible, e.g. a further speedup might be obtained by using total multivariance and m-multivariance for initial tests of independence (but beware of the problem of multiple vs. single tests). Furthermore, if pairwise dependence is detected (and clustered) this can be further analyzed in the framework of graphical models. Hereto also note that [32, Section 5.2] (see also [10, Section 6]) provides a method for the detection of causal relations of variables using multivariate dependence measures, this can be used to refine an undirected graph visualizing the dependence structure into a directed graph. Moreover, clearly the visual layout allows many variants, e.g. one might also use different line types and thicknesses to indicate the dependence strength or order, also the denoted values could for example be replaced by p-values (since for given marginals there is a one-to-one correspondence between the value of multivariance and its (exact) p-value).

Examples
In the supplementary Section 9 a comprehensive collection of illustrative examples is provided. These discuss in detail several (toy) examples of higher order dependence and their visualization, including the full and clustered dependence structure detection. Moreover, the detection power, empirical size and various other properties of multivariance are studied.
Here in the main text we only discuss two types of examples: comparisons with other multivariate dependence measures and two basic real data examples.
The presented tables additionally contain a test called 'Comb', which combines the tests of m- and total multivariance by Holm's method. This provides a reference for readers with an interest in a joint test procedure, rather than comparing individual tests in their own realm. For a full explanation of the setting, terms and parameter values of the studies we refer to the introduction of the supplementary Section 9.

Empirical comparison of multivariance with other dependence measures
As discussed in Section 3.3 there are several dependence measures closely related to distance multivariance and its variants. For these, empirical power comparisons are provided in Examples 7.2 and 7.3. But we begin with an example of a different visualization of higher order dependence, which was proposed alongside a copula based dependence measure.
Example 7.1 (Dependogram vs. visualization). In [17] copula based higher order dependence tests were proposed together with a dependogram, which provides a graphical representation of the test results of multiple testing. Our proposed visualization is closely related, it provides a visualization of the (lowest order) significant dependencies.
Example 7.2 (Comparison with the methods of [23]). Here we compare our estimators to those presented in [23]. To avoid confusion, note that in their main representation results there is a sign typo: the second sum in Lemmata 1 and 2 should have a minus sign. In general, one should note that the estimators for multivariance have complexity O(N²), whereas the exact estimators of [23] (e.g. Q_N, S_N) have higher complexity. To reduce the complexity they introduce approximate estimators (e.g. Q_N, J_N) which have the same complexity as ours.
Note that these approximate estimators are not permutation invariant with respect to the order of the samples. In fact their positive finding (significant p-values) in the real data example [23, 6.2 Financial data] is an artifact due to this shortcoming. Their estimators yield for the same data with permuted samples p-values of about 0.3 and above. Therefore we strongly advise against the use of their approximate estimators in the given form. This problem can be reduced by permuting the samples prior to the use of their estimators. Nevertheless, we decided to use their estimators for a comparison, since these are the most recent estimators related to the approach discussed in Section 3.3 corresponding to (36). Moreover, [23] also provides several variants and comparative tables including other measures. The following tables are computed with their parameter settings, e.g. α = 0.1. We only include their best exact and approximate estimators (for each particular example); for further comparisons see the full tables in [23].
One might argue that the comparison with 2-multivariance is unjust, since it is a measure of pairwise dependence, whereas the other measures detect any kind of dependence. Hereto note that also the combination of the measures in 'Comb' has a higher detection rate than the approximate estimators.

Example 7.3 (Comparison with the methods of [47]). In [47] several measures of dependence were introduced. The main contribution is a measure dCov for pairwise dependence, which is closely related to NM_2. The examples in [47] use the parameters N ∈ {60, 100}, n ∈ {50, 100, 200, 400, 800} and α = 0.05, which we also use here to provide values which can be compared to the other dependence measures given in their tables. Let X_i, i = 1, . . . , n, be random variables with values in R such that (X_1, . . . , X_n) ∼ N(0, Σ). We consider [47, Example 2]; hereto let Σ ∈ R^{n×n} with Σ_ij = 1 for i = j and otherwise (for i ≠ j) set: a) auto-regressive structure: Σ_ij = (0.25)^{|i−j|}; b) band structure: Σ_ij = 0.25 for 0 < |i − j| < 3 and 0 otherwise; c) block structure: Σ = I_{n/5} ⊗ A, where I_k ∈ R^{k×k} is the identity matrix and A ∈ R^{5×5} with A_ij = 1 for i = j and 0.25 otherwise. In all cases the performance of 2-multivariance is very similar to their estimator, see Figure 6. Note that due to computation time restrictions we used for the table in Figure 6 the resampling distribution of one sample to compute all resampling p-values (instead of resampling each sample separately).
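The three covariance structures of this example can be generated as follows (a sketch; the function and parameter names are ours):

```python
import numpy as np

def sigma_autoregressive(n, rho=0.25):
    """a) Sigma_ij = rho^|i-j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :]).astype(float)

def sigma_band(n, rho=0.25, width=3):
    """b) Sigma_ij = rho for 0 < |i-j| < width, 1 on the diagonal, else 0."""
    d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return np.where(d == 0, 1.0, np.where(d < width, rho, 0.0))

def sigma_block(n, rho=0.25, k=5):
    """c) Sigma = I_{n/k} (Kronecker) A with A_ii = 1 and A_ij = rho."""
    A = np.full((k, k), rho)
    np.fill_diagonal(A, 1.0)
    return np.kron(np.eye(n // k), A)
```

Samples for the power study can then be drawn, e.g., via np.random.default_rng().multivariate_normal(np.zeros(n), Sigma, size=N).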

Real data examples
As stated in the introduction, looking at other papers considering multivariate dependence measures (e.g. those discussed in Section 3.3) one notices that, although these are capable of detecting dependencies of higher order, the real data examples feature pairwise dependence. From our point of view this seems first of all to be due to the fact that the concept of higher order dependencies is not popular (or even unknown) in applied statistics. Therefore, on the one hand there is a very strong publication bias towards datasets with pairwise dependencies; on the other hand, even if datasets statistically feature higher order dependencies, an explanation by field experts is still missing. Nevertheless, we refer to [5] for a collection of more than 350 datasets which feature higher order dependence. In the following we present two examples for which 2-multivariance and total multivariance detect some dependence. In terms of dependence structure detection they are more delicate: The first example illustrates the difference between the clustered and full dependence structure and it indicates an application of higher order dependencies to model selection. The second example discusses detected higher order dependencies which are actually based on pairwise dependence due to the small sample size, a conservative detection method and the chosen significance level (see also Remark 6.1).
Example 7.4 (Quine's student survey data). We consider a classical data set of a student survey [1] (see also [45, R-package: MASS, dataset: quine]), which contains 146 samples of the variables: age (actually the class level), gender, cultural background, type of learner and the number of days school was missed. The dataset was extensively used in [1] to discuss model selection in a multi-factor analysis of variance to model the number missed school days.
The conservative tests using 2-multivariance and total multivariance detect no dependencies (p-values: 0.0767, 0.1565); the corresponding resampling tests reject independence with actual p-values of 0.00.
The dependence structure detection yields the structures shown in Figure 8. Here the full structure provides a refinement of the clustered structure. For the detection we used resampling tests with 10000 resamples and significance level α = 0.01. Based on the multiple tests actually performed, the approximate probability of a type I error is 0.0297 for the clustered structure and 0.0199 for the full structure. Due to the large number of resamples used, this example might seem to be just an (impractical) proof of concept, but note that the same results can also be obtained with the faster methods developed in [3].
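The reported error probabilities are consistent with the familywise bound for independent level-α tests, 1 − (1 − α)^m; a quick check, assuming m = 3 tests behind the clustered structure and m = 2 behind the full structure (the values of m are inferred from the reported numbers):

```python
def familywise_error(m, alpha=0.01):
    """P(at least one false rejection) among m independent level-alpha tests."""
    return 1 - (1 - alpha) ** m

# m = 3 -> ~0.0297 (clustered structure), m = 2 -> ~0.0199 (full structure)
clustered = familywise_error(3)
full = familywise_error(2)
```

Both values agree with the approximate type I error probabilities stated above after rounding to four decimals.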
For the variables age, gender and missed-days a 3-dependence (with lower order independence) was detected (Figure 8). To judge whether this is really a sensible finding in terms of the field of study is beyond our expertise. Nevertheless the detected dependencies naturally suggest candidates for a minimal model for the number of missed days: based on the detected full dependence structure, the missed days depend only on the cultural background and on the interaction term of age and gender.

Example 7.5 (Decathlon). We consider a dataset of decathlon athletes [44]. To consider these (and smaller subsets) as independent samples we only keep the personal best of each athlete, leaving 2709 samples, and order these by the achieved total points in increasing order (the field is denser for lower points, contributing more to the required i.i.d. setting for the samples tested). It is well known that the 10 disciplines are dependent, e.g. [11, 46]. We are interested in how many samples (using the real measurements of the results in each discipline) are required to detect a dependence using the presented methods: for 2-multivariance M_2 the dependence is, at a significance level of α = 0.05, first detected for N = 5 and finally for all N > 11. For total multivariance M it is also detected for all N > 11.
Here we used the dependence structure detection based on conservative tests, thus it is interesting to see which structures are detected for various sample sizes, see Figure 9. The detection of the higher order dependence indicates early on that these variables are dependent, but due to the conservative tests, the actual lower order dependence is missed. With increasing sample size only the dominant pairwise dependencies are detected. Also note that due to the repeated testing and the given significance level the probability of a type I error is large. Using the consistent estimator only some pairwise dependencies are detected for N = 2706 and no dependencies are detected for N ∈ {50, 100, 200}.
Notably there are some natural variants: 1. Instead of the results one could consider the achieved points in each discipline, which are obtained by non-linear transformations of the results. This yields almost the same inference. 2. Starting with the elite athletes instead of our order changes the detection: in this case 2-multivariance detects a dependence for all N > 25, but total multivariance requires many more samples: N > 177. Thus here a curse of dimension is at work (compare with Example 9.14), which might indicate that for top athletes some disciplines are less dependent than for other athletes. We leave further analysis and interpretation to field experts. The setting also naturally lends itself to clustering methods (for dependent random variables) based on distance multivariance, a topic which is beyond the scope of the current paper.

Further results and proofs
Here we collect several results which are essential for (parts of) the previous sections, but which were postponed to this section due to their technicality.

A theorem characterizing the support of Lévy measures
Note that in [7, after Definition 2.3] it was stated that it is unknown how to characterize the (full) support of Lévy measures in terms of the corresponding continuous negative definite function. The following result provides a characterization (via Proposition 2.2); it is related to [48, Corollary 2].
where ρ is a symmetric measure integrating 1 ∧ |·|², and let X, Y be R^d-valued random vectors with characteristic functions f_X, f_Y; assume E(ψ(X)) < ∞ and E(ψ(Y)) < ∞. Then

Proof. In addition to the stated assumptions, let Z, X', Y' be independent random variables which are also independent of X, Y and satisfy E(ψ(Z)) < ∞. The claim follows by Tonelli and the (generalized) triangle inequality for continuous negative definite functions (66). Thus the following implications hold: and the last line is equivalent to the first. This completes the proof.

Moment condition
In [8] the following condition was used: where X_0, X_0', X_1, X_1' are independent and have the same marginal distributions as X (for the dimensions d_i); X_1, X_1' also have the same joint distribution, but the marginal distributions of X_0, X_0' are independent (for further details see [8, Def. 2.3.a]). We show that for non-constant random vectors X_i the joint ψ-moment condition (5) and (65) are equivalent. If a random vector is constant, condition (65) becomes trivial since the corresponding factor is equal to 0.
By this inequality (5) implies (65). For the converse implication we begin with the following observation.

Proof of the asymptotics of sample distance multivariance (Theorem 2.5)
Here we are in the setting of Section 2. The asymptotics (25) and (27) of the test statistic were proved in [8, Thm. 4.5, 4.10, Cor. 4.16, 4.18] and [7, Cor. 4.8] under condition (23). The following theorem provides a proof using an alternative condition. Combining the results yields the convergence statements (25) and (27) of Theorem 2.5.

Theorem 8.3. Let X_i, i = 1, ..., n be non-constant random variables such that and let X^{(k)}, k = 1, ..., N be independent copies of X = (X_1, ..., X_n). Then where Q and Q' are Gaussian quadratic forms with EQ = 1 = EQ'.
where (67) was used for the bounds. Therefore N · N^{−2} Σ_{j,k=1}^N Φ(X^{(j)}, X^{(k)}) converges in distribution to a Gaussian quadratic form by [28, Thm. 4.3.2, p. 141]. Note that in the limit in [28] the terms E(Φ(X, X)) and a sum Σ_{i=1}^∞ λ_i appear, which cancel in our setting; this equality also implies that the limit for normalized multivariance has expectation 1, cf. [3, Lemma 2.3 and Remark 4.9.1]. Finally, (68) follows by Slutsky's theorem. To avoid a false impression, note that (73) seems natural, since the strong law of large numbers implies that the expectations in (71) are approximated by the corresponding sums in (70). But the additional factor N in (73) makes the proof technical; we only sketch it here: for (73) it is, using the Markov inequality, sufficient to show that the second moment of the left hand side converges to 0. This moment and its limit can be calculated explicitly, based on and similar to [3, Theorem 4.15], where the second moment of NΦ_S(j, k) is analyzed in depth. Considering analogously NΦ(j, k) := Σ_{S ⊂ {1,...,n}, |S| > 1} NΦ_S(j, k) instead of NΦ yields the result for total multivariance.
Remark 8.4. Based on the methods developed in the preprint [3, e.g. Section 7.7], the second moment in (72) seems to be bounded already under the weaker assumption E(ψ_i(X_i)) < ∞ for all i = 1, ..., n; see also the special case ψ_i(x_i) = |x_i| discussed in [10]. To make this rigorous one would have to rewrite (or at least discuss) the steps in [28] in much more detail, which is beyond the scope of this paper. Moreover, this clearly also requires a discussion of whether (and why) the counterexample, which shows that the log moment condition in (23) (see also Remark 2.6) is necessary for the convergence of the empirical characteristic functions, is somehow compensated by the L²(ρ) norm.
Lemma 8.5. Let X^{(k)}, k = 1, ..., N be independent copies of X = (X_1, ..., X_n). Then, without any moment assumptions, we have

Proof. In this proof we omit (X^{(1)}, ..., X^{(N)}) and (X_1, ..., X_n) in the notation. Note that for ρ_ε instead of ρ the joint ψ-moment condition (5) is satisfied. Thus N · N M diverges for N → ∞ by Lemma 8.5. This also applies, in the case of dependence, to at least one summand of total multivariance (and the others are non-negative); therefore the divergence (26) also follows.

The population representation of [16, Lemma 1a]
Note that the γ, β terms appearing in [16, Lemma 1a] correspond in our notation to γ_{j,l} = ψ_l(x Turning their sample sums into expectations and observing the independence implied by the indices yields the population version of their T_n: Σ_{S ⊂ {1,...,n}, |S| > 0} Σ_{S' ⊂ {1,...,n}, |S'| > 0} Now note that (−1)^{|S|+|S'|} = (−1)^{|S\S'|+|S'\S|} can be distributed as a factor −1 to each factor in the products corresponding to S\S' and S'\S. Then the formula (1 − E(ψ_i(X_i))) + 1 (82) (1 − E(ψ_i(X_i))) + 1. (83) Finally, after using the linearity of the expectation, the last two products in the first row cancel with the second product in (82), and the third product in (82) cancels with the last two products in (83); also the trailing "+1" cancel. For the remaining products the linearity of the expectation and the definition of φ in (76) yield

The difference of dHSIC and total multivariance for n = 3

Expanding the product in (3) yields, by careful accounting, the representation: where π(1, 2, 3) is the set of all permutations of the vector (1, 2, 3). Define and note that H_0 = H_1 = 0 and H_2(X_1, ..., X_n) = M_2(X_1, ..., X_n), where M_2 is 2-multivariance as defined in (48). Using

Supplement
The supplementary Section 9 can be found at pages 35 ff. of the current manuscript. The clustered dependence structure is detected using the conservative tests with significance level α (with Holm's correction for multiple tests) or using the consistent estimators of Corollary 6.2 with β = 1/2 and C = 2.
In the power tables the normalized multivariance M, the total multivariance M, and the m-multivariances M_2 and M_3 are studied using the distribution-free and the resampling tests. To avoid overloading, we include only a selection of the test results in the tables. Hereto note that using M_3 as a test should (in order to provide a test which is consistent against all alternatives) be preceded by a test for pairwise independence, e.g. using M_2. Here we performed these tests independently. But we also include in the tables a test 'Comb' which combines the tests based on M_2, M_3 and, for n > 3, also M into a global test at the same significance level (rejecting independence if at least one p-value, adjusted by Holm's method, is significant).
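Holm's adjustment, as used by 'Comb', can be sketched in a few lines (a minimal implementation; the function name is ours):

```python
def holm_adjust(pvals):
    """Holm step-down adjustment: multiply the i-th smallest p-value by
    (m - i), cap at 1, and enforce monotonicity of the adjusted values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# 'Comb' rejects joint independence if any adjusted p-value is significant
adjusted = holm_adjust([0.004, 0.03, 0.02])   # ~ [0.012, 0.04, 0.04]
```

With the adjusted values, rejecting whenever min(adjusted) ≤ α controls the familywise error at level α.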
For most examples the clustered dependence structure is illustrated using the test based scheme of Section 6 (with conservative p-value). Explicit values in the graphs are based on a successful detection with N = 100 samples, if not stated otherwise.

Detection and visualization of higher order dependencies
The generation of samples with higher order dependencies is explained by a detailed description of the two classical examples (Examples 9.1 and 9.2) and a basic example for dependence of arbitrary order (Example 9.3). These provide reference examples to detect and build more involved dependence structures, which also illustrate different aspects of higher order dependence: higher order dependencies with continuous marginal distributions (Example 9.4), disjoint clusters (Example 9.5), a mixture of pairwise and higher order dependence (Example 9.6), iterated dependencies (Example 9.7) and joint dependence of all variables (such that all are connected by dependencies of higher order) without any pairwise dependence (Example 9.8). The full dependence structures for the examples are collected in Example 9.9.

Example 9.1 (Colored tetrahedron). Consider a dice shaped as a tetrahedron with sides colored red, green and blue, and stripes of all three colors on the fourth side. The events that a particular color is on the bottom side when throwing this dice are pairwise independent, but they are not independent. Both properties follow by direct calculation. Thus this provides an example of three variables which are 2-independent, but dependent. In Figure 10 the empirical powers of the tests are reported. It may seem surprising that the empirical power of the test based dependence structure detection is not 1 although the others have power 1. Hereto recall that the distribution-free test is sharp for Bernoulli random variables, thus (due to the correction for multiple tests) it is expected that in 5% of the cases a (false) detection of pairwise dependence already occurs. Furthermore, note that the distribution-free test for the normalized total multivariance has for N = 10 an empirical power of 0 due to averaging (see also Example 9.14).

Example 9.2 (Two coins - three events).
Throw two fair coins and consider the three events: the first shows head, the second shows tail, both show the same. Then again a direct calculation shows pairwise independence, but dependence. The probability that all three events occur simultaneously is 0. Alternatively, the same (but with a joint probability of 1/4 as in Example 9.1) holds for the events: the first shows head, the second shows head, both show the same. Figure 11 shows the dependence structure and empirical power for the case with joint probability 0. The results are indistinguishable from Example 9.1.

Figure 11: Three events of two coins (Ex. 9.2): dependence structure and empirical power.
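The pairwise-but-not-joint independence of these three events can be checked by direct simulation; the following is a minimal sketch (sample size and seed are our choices):

```python
import random

rng = random.Random(2)
N = 100000
throws = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(N)]  # 1 = head

# events: first shows head (A), second shows tail (B), both show the same (C)
A = [c1 for c1, c2 in throws]
B = [1 - c2 for c1, c2 in throws]
C = [int(c1 == c2) for c1, c2 in throws]

# pairwise: joint frequencies ~ 1/4 = P(A)P(B), as for independent events
p_ab = sum(a * b for a, b in zip(A, B)) / N
p_ac = sum(a * c for a, c in zip(A, C)) / N
# jointly: A and B force different coins, so C fails; frequency is exactly 0
p_abc = sum(a * b * c for a, b, c in zip(A, B, C)) / N
```

Independence of all three events would require the joint frequency to be near 1/8; here it is exactly 0.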
A simple generalization yields the next important example, featuring higher order dependence in its 'purest' form.

Example 9.3 (n coins - (n + 1) events). Throw n fair coins and consider the n + 1 events: the first shows head, the second shows head, ..., the n-th shows head, there is an odd number of heads. Then by direct calculation these are n-independent, but dependent (the joint probability of the events is 0 for even n and (1/2)^n for odd n). To get an intuition, note that given n of these events one can directly compute the (n + 1)-th event. But fewer events do not provide enough information to determine any further event: any option is equally likely. Figure 12 shows the dependence structure and the empirical power of the dependence measures. Total multivariance suffers a loss of power compared to the previous examples due to averaging (only one of the 2^n − n − 1 summands diverges, see also Example 9.14). Moreover, one starts to see that the distribution-free method is conservative for total multivariance (recall that even with univariate Bernoulli marginals it is only sharp for multivariance, not for total multivariance). The low detection rate of the test based dependence structure detection is again due to the sharp rejection level for Bernoulli random variables and the p-value adjustment due to multiple testing of all k-tuples for each k ∈ {2, ..., n + 1}.

The previous examples only used dichotomous data. Obviously the same dependence structures can also appear (and be detected) for other marginal distributions. A basic example is the following.
Example 9.4 (Perturbed coins). Let (Y_1, Y_2, Y_3) be the random variables corresponding to the events of n = 2 coins in Example 9.3 and let Z_1, Z_2, Z_3 be i.i.d. standard normal random variables. Now set X_i := Y_i + rZ_i for i = 1, 2, 3 and some fixed r ∈ R. For these the same dependence structure as in Example 9.2 (Figure 11) is detected. Figure 13 shows the dependence structure and the empirical power for r ∈ {0.25, 0.5, 0.75, 1}. Note that the rate of detection of the test based dependence structure algorithm improves in comparison to the previous examples (for N large), whereas the consistent estimator requires larger samples. The former is due to the fact that only in the case of univariate Bernoulli distributed random variables the distribution-free method is sharp for multivariance. In all other cases it becomes conservative and therefore the rate of falsely detected pairwise dependencies is reduced. Increasing the value of r reduces the empirical power. This is expected, since the dependence structure becomes blurred by the variability of the Z_i's.

Example 9.5 (Several disjoint dependence clusters). We look at samples of (X_1, ..., X_26) where (X_1, X_2, X_3) are as in Example 9.3 with 2 coins, (X_7, ..., X_11) are as in Example 9.3 with 4 coins, (X_4, X_5, X_6), (X_12, X_13, X_14) and (X_15, X_16, X_17) are as in Example 9.1, (X_18, ..., X_21) and (X_22, ..., X_25) are as in Example 9.3 with 3 coins, and X_26 ∼ N(0, 1). Furthermore, each of these tuples is independent of the others. Note that we added X_26 to make the detection much harder, since now the factorization for independent subsets (6) implies M(X_1, ..., X_26) = 0. Figure 14 shows that the detection algorithm and 3-multivariance (with resampling) perform well, whereas total multivariance suffers from averaging (see also Example 9.14) and the distribution-free dependence tests are too conservative.
Example 9.6 (Star dependence structure). Consider samples of (X_1, X_2, X_3, X_1', X_2', X_3', X_1'', X_2'', X_3'') where X_1, X_2, X_3 are as in Example 9.3 with 2 coins. Then the structure in Figure 15 is detected. Here the graph was slightly cleaned up: vertices representing only pairwise multivariance were reduced to edges with labels. The variables are Bernoulli distributed and thus (as e.g. in Example 9.3) the detection rate of 95% reflects the 5% falsely detected pairwise dependencies.

Figure 15: The star dependence structure of Ex. 9.6.
Example 9.7 (Iterated dependence structure). Consider samples of random variables (X_1, ..., X_13) where X_1, ..., X_10 are independent but X_1, X_2, X_11 are dependent (while all subtuples are independent); the same holds for X_1, ..., X_5, X_12 and for X_1, ..., X_9, X_13. Such examples can be constructed by letting X_11 = f(X_1, X_2) for some (special) f, and analogously for the others. If such a structure is detected the graph looks like Figure 16.
For the dependence we used f(x_1, ..., x_k) = (Σ_{i=1}^k x_i) mod 2, and X_i, i = 1, ..., 10 were i.i.d. Bernoulli random variables. The dependence structure is reasonably well detected given 100 samples by the test based algorithm; the consistent estimator requires a much larger sample size. Both total multivariance and 3-multivariance also detect the dependence, see Figure 16.

Example 9.8. Consider (X_1, ..., X_15) where X_k := (Σ_{i=k−3}^{k−1} X_i) mod 2 for k ∈ {4, 7, 10, 13}, X_15 := (X_13 + X_14 + X_1) mod 2, and the remaining variables are i.i.d. Bernoulli distributed.
Since here only quadruple dependence is present, only total multivariance is used to detect it. The dependence structure detection works surprisingly well, also with small sample sizes, see Figure 17.

Example 9.9 (The full dependence structures). For Examples 9.1 to 9.5 the clustered dependence structure and the full dependence structure coincide. For Examples 9.6, 9.7 and 9.8 the full dependence structures are given in Figures 18, 19 and 20, respectively. The full graph is not as easy to comprehend as the clustered graphs; to improve it we used the order of dependence as labels of the dependency nodes. Moreover, besides the full graph also individual graphs depicting only the dependence of a certain order (for tuples for which no lower order dependence was detected) are presented.

Figure 19: The full dependence structure of Ex. 9.7 (iterated dependence structure).
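The parity construction f(x_1, ..., x_k) = (x_1 + ... + x_k) mod 2 behind Examples 9.3, 9.7 and 9.8 can be sketched in a few lines; the following simulation (with our own naming and seed) uses Example 9.7's X_11 = f(X_1, X_2) as test case:

```python
import random

def parity(*xs):
    """f(x_1, ..., x_k) = (x_1 + ... + x_k) mod 2: the XOR-type function used
    to build higher order dependent variables from independent inputs."""
    return sum(xs) % 2

rng = random.Random(3)
N = 100000
X1 = [rng.randint(0, 1) for _ in range(N)]
X2 = [rng.randint(0, 1) for _ in range(N)]
X11 = [parity(a, b) for a, b in zip(X1, X2)]   # e.g. X_11 = f(X_1, X_2)

# every pair is independent (joint frequency ~ 1/4), but the triple is not:
p_pair = sum(a * c for a, c in zip(X1, X11)) / N
p_triple = sum(a * b * c for a, b, c in zip(X1, X2, X11)) / N  # exactly 0
```

If X_1 = X_2 = 1 then X_11 = 0, so the three variables can never be 1 simultaneously, while independence would predict a joint frequency near 1/8.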

Empirical studies of properties of distance multivariance
Note that the papers introducing distance multivariance [7, 8] contain only two very elementary examples. Thus, alongside illustrating the new measures and methods provided in the current paper, we also provide the first detailed empirical study of distance multivariance. The following aspects of multivariance, total multivariance and m-multivariance are discussed: the empirical size of the tests (Example 9.10), the dependence of the distribution of the test statistic on marginal distributions, sample size, dimension and the choice of ψ (Example 9.11), the computational complexity (Example 9.12), the moment conditions (Example 9.13) and the statistical curse of dimensions (Example 9.14). The section closes with a generalization of total multivariance (Example 9.15).
In Figure 21 the empirical sizes are depicted. The resampling methods have (as expected for a sharp test) an empirical size close to 0.05. For Bernoulli marginals also the distribution-free method for multivariance is close to 0.05. In the other cases (and for m- and total multivariance) the tests are conservative.
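A generic resampling test of the kind used throughout can be sketched as follows; the statistic is a user-supplied placeholder (e.g. a wrapper around a multivariance estimator), not the implementation of the R-package 'multivariance':

```python
import random

def resampling_pvalue(statistic, columns, n_resamples=1000, rng=None):
    """Generic resampling test: permute each variable's column independently
    to emulate the hypothesis of independence, then compare the statistic on
    the original data with its resampled distribution."""
    rng = rng or random.Random()
    observed = statistic(columns)
    exceed = 0
    for _ in range(n_resamples):
        shuffled = [rng.sample(col, len(col)) for col in columns]
        if statistic(shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_resamples + 1)
```

The "+1" correction keeps the p-value strictly positive, so the resulting test is valid even for a small number of resamples.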
Next we analyze the effect of various parameters on the distribution of the test statistic.
Example 9.11 (Influence of sample size, marginal distributions and ψ_i). The distribution of the test statistic N · N M^2 under the hypothesis of independence depends on the marginal distributions of the random variables and also on the number of variables n, as Figure 22 illustrates (see also Figure 24). The empirical distributions are based on 3000 samples each. Moreover, the distribution also clearly depends on the choice of the reference measure ρ, or equivalently (see (11) and Remark 2.8) on the distances ψ_i. For Figure 23 we used ψ_i(x_i) = |x_i|^α with α ∈ {0.5, 1, 1.5}; the plots show that in general for α = 1.5 the upper tail of the distribution of the test statistic comes closer to the distribution-free limit, which is the χ²_1-distribution. Note that the χ²_1-distribution is matched in the case of Bernoulli distributed random variables; in this case the choice of ψ_i has no effect on the empirical distribution of the test statistic, since ψ_i(0) = 0 and ψ_i(1) = 1 for all α.

For independent normally distributed random variables the dependence of the test statistic on the number of variables n is depicted in Figure 24, and the dependence on the sample size N is illustrated in Figure 25. Roughly, the distribution spreads with the number of variables and shrinks to a limiting distribution (as stated in Theorem 2.5) with increasing sample size.

Example 9.13 (Infinite moments - cf. Remark 2.8). Similar to Example 9.4, let (Y_1, Y_2, Y_3) be the random variables corresponding to the events of n = 2 coins in Example 9.3 and let Z_1, Z_2, Z_3 be independent Cauchy distributed random variables. Now set X_i := Y_i + rZ_i³ for i = 1, 2, 3 and some fixed r ∈ R (here we only use r = 0.001). Note that E(|X_i|^{1/3}) = ∞, thus clearly the moment condition (23) does not hold for the standard ψ_i(·) = |·|.
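The heavy-tailed setup of this example can be generated as follows (a sketch with our own seed and sample size; the arctan transformation anticipates method c) below):

```python
import math
import random

rng = random.Random(5)
r = 0.001

def sample_x():
    """One sample of (X_1, X_2, X_3): Y_i from the events of n = 2 coins in
    Example 9.3, perturbed by cubed Cauchy noise, X_i = Y_i + r * Z_i^3."""
    c1, c2 = rng.randint(0, 1), rng.randint(0, 1)
    y = (c1, c2, (c1 + c2) % 2)      # head, head, odd number of heads
    z = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(3)]  # Cauchy
    return [yi + r * zi ** 3 for yi, zi in zip(y, z)]

X = [sample_x() for _ in range(1000)]

# the bounded transformation of method c): arctan maps into (-pi/2, pi/2)
X_bounded = [[math.atan(x) for x in xs] for xs in X]
```

The cubed Cauchy noise makes even the 1/3-moment infinite, while the transformed data X_bounded trivially has moments of all orders.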
Now we compare three methods: a) we don't care (thus we use the standard method); b) we use ψ_i(·) = ln(1 + |·|^2), which increases slowly enough such that the moments exist; c) we consider the bounded random variables arctan(X_i) instead of X_i (cf. Remark 2.8.2). The results are shown in Figure 27. It turns out that method a) is not reliable, while method b) works reasonably well. In our setup method c) works best, but recall that this method destroys the translation and scale invariance of the test statistic; thus already if we shift our data it might not work anymore.

Example 9.14 ((Total and m-)multivariance - statistical curse of dimensions). Let X_1, ..., X_n be independent random variables and set Y_1 := X_2. Then (due to the independence of the X_i) But the corresponding difference of the estimators might be negative, as a direct calculation shows. Empirically we study this setting with X_i i.i.d. Bernoulli random variables. The empirical power of the independence test with resampling for N M_2 and N M is shown in Figure 28 for increasing n and various sample sizes. As expected, the decrease of power is rapid for total multivariance and not as severe for 2-multivariance. The resampling method was used since the distribution-free test is not sharp in this setting: for univariate Bernoulli marginals it is sharp only for multivariance, not for total or m-multivariance (m < n).

We close this section with an extension of total multivariance which introduces a further parameter to tune the power of the tests. Recall that, in order to avoid distracting constants, we assumed in Section 3 that the kernels of HSIC (and the related measures) satisfy k_i(x_i, x_i) = 1. Without this assumption additional constants appear naturally. As a special case one is led to the following dependence measure, which was incidentally suggested before [27] and which is, for ψ_i(x_i) = |x_i|, a special case of the joint distance covariance developed in [10].
Example 9.15 (Total distance multivariance with parameter λ). Let λ > 0 and define λ-total multivariance by

M_ρ²(λ; X_1, ..., X_n) := Σ_{k=2}^{n} λ^{n−k} Σ_{1 ≤ i_1 < ... < i_k ≤ n} M_ρ²(X_{i_1}, ..., X_{i_k}).

Thus one puts the weight λ^{n−k} on the multivariance of each k-tuple for k = 2, ..., n. Therefore with λ < 1 the n-tuple gets the biggest weight, and with λ > 1 the 2-tuples (i.e., pairwise dependence) get the biggest weight. This might be used to improve the detection rate of total multivariance, as Figures 29 and 30 show. If the random variables are (n − 1)-independent then clearly the detection improves as λ gets closer to 0 (Figure 29). If some lower order dependence is present then some optimal λ seems to exist (Figure 30), but a priori its value seems unclear.
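In code, λ-total multivariance is a weighted sum over all k-tuples; the following sketch assumes a user-supplied function mv_sq returning the squared multivariance of a tuple (hypothetical, e.g. wrapping an existing estimator):

```python
from itertools import combinations

def lambda_total_multivariance_sq(mv_sq, variables, lam):
    """Sketch of lambda-total multivariance: weight lambda^(n-k) on the
    squared multivariance of every k-tuple, k = 2, ..., n.
    `mv_sq` maps a tuple of variables to its squared multivariance."""
    n = len(variables)
    total = 0.0
    for k in range(2, n + 1):
        for tup in combinations(variables, k):
            total += lam ** (n - k) * mv_sq(tup)
    return total

# toy check with a dummy mv_sq that just counts tuples: for n = 4 and
# lam = 1 this returns the number of k-tuples, C(4,2)+C(4,3)+C(4,4) = 11
count = lambda_total_multivariance_sq(lambda t: 1.0, list(range(4)), lam=1.0)
```

Tuning lam then shifts the weight between the full n-tuple (lam < 1) and the pairwise terms (lam > 1), exactly as described above.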