A machine learning-based approach for estimating and testing associations with multivariate outcomes

We propose a method for summarizing the strength of association between a set of variables and a multivariate outcome. Classical summary measures are appropriate when linear relationships exist between covariates and outcomes, while our approach provides an alternative that is useful in situations where complex relationships may be present. We utilize ensemble machine learning to detect nonlinear relationships and covariate interactions and propose a measure of association that captures these relationships. A hypothesis test about the proposed associative measure can be used to test the strong null hypothesis of no association between a set of variables and a multivariate outcome. Simulations demonstrate that this hypothesis test has greater power than existing methods against alternatives where covariates have nonlinear relationships with outcomes. We additionally propose measures of variable importance for groups of variables, which summarize each group's association with the outcome. We demonstrate our methodology using data from a birth cohort study on childhood health and nutrition in the Philippines.


Introduction
Neurocognitive impairment may affect 250 million children under five years globally. [1] The Healthy Birth, Growth, and Development knowledge integration initiative was established, in part, to inform global public health programs toward optimizing neurocognitive development. [2] Neurocognitive development is often investigated through cohort studies that enroll pregnant women and follow their children through early childhood and adolescence. Covariate information about the child's parents, environment, and somatic growth is recorded at regular intervals, and in early adolescence, children complete tests that measure diverse domains of neurocognitive development such as motor, mathematics, and language skills. Researchers are often interested in assessing the association between covariate information and neurocognitive development. Such an assessment may be useful for developing effective prediction algorithms that identify children at high risk for neurocognitive deficits. [3]
A common approach for describing the strength of association between covariates and a multivariate outcome is canonical correlation analysis. [4,5] Canonical correlation maximizes the correlation between a linear combination of the multivariate outcome and a linear combination of the covariates. Several test statistics have been proposed for significance testing of canonical correlation, including Wilks' Λ [6], the Hotelling-Lawley trace [4,5], the Pillai-Bartlett trace [7,8], and Roy's largest root [9]. One potential drawback to canonical correlation analysis is that it may fail to identify associations when nonlinear relationships exist between covariates and outcomes. This has led to recent interest in nonlinear extensions of canonical correlation analysis [10,11,12].
Another common approach in settings with multivariate outcomes is latent variable analysis. Many definitions of latent variables appear in the literature, and we refer interested readers to more thorough discussions in [13,14,15,16,17]. A commonly used definition is that a latent variable is a hypothetical construct that cannot be measured by the researcher and that describes some underlying characteristic of the participant. Others have strongly rejected the use of latent variables, instead opting for empirical explanations of observed phenomena [18]. For our purposes it suffices to say that latent variable analysis tends to entail an unsupervised grouping of the observed outcomes into a set of lower-dimensional latent features. One technique commonly employed to this end is principal components analysis (PCA). Researchers often use PCA to reduce the observed multivariate outcome to a low-dimensional (e.g., univariate) outcome and may examine the factor loadings to ascribe a posteriori meaning to the reduced outcomes. Researchers may further use these outcomes to test for associations with covariates. However, such tests may fail to identify associations between predictors and outcomes due to the unsupervised nature of the outcomes' construction.
In this work, we propose an alternative method for measuring association between a multivariate outcome and a set of covariates. Our approach is related to canonical correlation analysis, but rather than maximizing the correlation between linear functions of covariates and outcomes, we maximize, over convex combinations of the outcomes, the predictive accuracy of a flexible machine learning algorithm for the combined outcome. The method identifies the univariate outcome that is "most easily predicted" by the covariates, along with a summary measure of how well that "easiest-to-predict" outcome may be predicted. This approach adapts to complex relationships between covariates and outcomes and identifies associations in situations where traditional canonical correlation does not. However, in contrast to recent nonparametric canonical correlation proposals, our method provides a measure of association on a familiar scale, and also provides asymptotically justified inference, including confidence intervals and hypothesis tests.
In certain situations, our method may further provide a novel means of constructing a clinically interpretable latent variable. For example, in studies of neurocognitive development with measured genetic information, our method could be used to identify a univariate measure of neurocognitive development that reflects the components of neurocognitive development most strongly associated with genetics. This univariate outcome may be used to represent the latent trait of "heritable intelligence," and could be used to study associations and gene/environment interactions. Nevertheless, we view our method primarily as a latent variable approach only in the sense of Harman (1960) [19], who described latent variables as a convenient means of summarizing a number of variables in many fewer factors. Our approach fits in this definition by providing a single summary measure of the strength of association between covariates and multivariate outcomes.

Defining the target parameter
Suppose we observe n independent copies of O := (X_1, ..., X_D, Y_1, ..., Y_J), where X := (X_1, ..., X_D) is a D-dimensional vector of covariates and Y := (Y_1, ..., Y_J) is a J-dimensional vector of continuously-valued outcomes. Without loss of generality, we assume that each outcome has mean zero and standard deviation one and otherwise make no assumptions about the true distribution P_0 of O. We define Y_ω := Σ_{j=1}^J ω_j Y_j as a convex combination of the outcomes, where ω_j ≥ 0 for all j and Σ_j ω_j = 1. We are interested in finding ω_0, the weights that make Y_{ω_0} "easiest to predict" based on X. We formalize this notion presently. Subsequently, a summary measure of how well Y_{ω_0} can be predicted using X will serve as a summary measure of the strength of the association between X and Y. Rejecting a null hypothesis of zero association allows us to conclude that there is an association between X and at least one of the components of Y.
To formalize, suppose we are given weights ω and a function ψ_{0,ω} that takes x as input and outputs ψ_{0,ω}(x), a real-valued prediction of Y_ω. In the sequel, we discuss how one might construct such a function using the observed data. A measure of how well ψ_{0,ω} predicts Y_ω is the mean squared-error, ξ_{0,ω}(ψ_{0,ω}) := E_0[{Y_ω − ψ_{0,ω}(X)}²], which measures the average squared distance between the prediction made by ψ_{0,ω} and the outcome Y_ω. Here we use the notation E_0{f(X, Y)} to denote the average value of f(X, Y) under the true distribution P_0. Mean squared-error can be scaled to obtain a measure of accuracy that does not depend on the scale of the outcome. Specifically, let μ_{0,ω} := E_0(Y_ω) denote the marginal mean of Y_ω and define ρ²_{0,ω}(ψ_{0,ω}) := 1 − ξ_{0,ω}(ψ_{0,ω}) / E_0{(Y_ω − μ_{0,ω})²} as the proportional reduction in mean squared-error when using ψ_{0,ω} as opposed to μ_{0,ω} to predict Y_ω. We refer to this as a nonparametric R², as this measure may be interpreted similarly to the R² measure in the familiar context of linear models: values near one indicate that nearly all of the variance in Y_ω is explained by ψ_{0,ω}. However, we note that the squared notation here is a misnomer in that ρ²_{0,ω}(ψ_{0,ω}) can be negative, which indicates that the marginal mean of Y_ω provides a better fit than ψ_{0,ω}. Nevertheless, we maintain the colloquial squared notation.
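To make the definition concrete, the nonparametric R² for a fixed weight vector ω can be computed from predictions as a proportional reduction in mean squared-error. The sketch below is ours, not part of the proposal: it uses simulated data for which 0.75x happens to be the oracle predictor of the composite outcome.

```python
import numpy as np

def nonparametric_r2(y_matrix, omega, predictions):
    """Proportional reduction in MSE from using `predictions` rather than
    the marginal mean to predict the composite outcome Y_omega."""
    y_omega = y_matrix @ omega                          # convex combination of outcomes
    mse_pred = np.mean((y_omega - predictions) ** 2)    # xi(psi)
    mse_mean = np.mean((y_omega - y_omega.mean()) ** 2) # xi(mu)
    return 1.0 - mse_pred / mse_mean

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
y = np.column_stack([x + rng.normal(size=n), 0.5 * x + rng.normal(size=n)])
omega = np.array([0.5, 0.5])
r2 = nonparametric_r2(y, omega, 0.75 * x)  # 0.75*x = E(Y_omega | X) in this toy model
```

A useless predictor (e.g., all zeros) yields a value near zero, and a predictor worse than the marginal mean yields a negative value, illustrating why "squared" is a misnomer.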
We define ω_0 as the set of weights that is optimal with respect to nonparametric R², ω_0 := argmax_ω ρ²_{0,ω}(ψ_{0,ω}). We assume that such a maximizer exists and is unique. The value ρ²_{0,ω_0}(ψ_{0,ω_0}) describes how well the optimally combined outcome may be predicted. This measure provides a summary of the association between X and Y that is closely related to canonical correlation [5]. Indeed, if ψ_{0,ω} is restricted to linear combinations of X, then ρ²_{0,ω_0}(ψ_{0,ω_0}) is nearly identical to the squared correlation of the first canonical variates. However, canonical correlation reflects only the strength of linear associations between X and Y, while ψ_{0,ω_0} may include nonlinear functions of X. Thus, ρ²_{0,ω_0}(ψ_{0,ω_0}) may more accurately summarize the association between X and Y in situations where relationships between covariates and outcomes are not well described by linear models. Our goals are thus: (i) find a suitable prediction function for a convex combination of the multivariate outcome; (ii) approximate the optimal weights ω_0; (iii) summarize the strength of the association between the prediction function and the optimally weighted outcome.
Restricting ψ_{0,ω} to linear functions of X, as in canonical correlation analysis, enables simultaneous optimization over the choice of prediction function and outcome weights. Furthermore, if X and Y are of relatively low dimension, then canonical correlation may provide a reasonable measure of association between X and Y. However, when considering more complex functions of X, simultaneous optimization appears to be a much more difficult task. Furthermore, when more aggressive data-fitting approaches are employed, overfitting becomes a concern and may prevent simple correlative statistics from being used to summarize associations. To overcome these difficulties, we propose an approach centered around cross-validation. For goal (i), we develop a prediction function for each component of Y using a cross-validation-based ensemble machine learning technique, which ensures that the relationship between X and Y is captured as accurately as possible. For goal (ii), we propose to maximize a cross-validated performance measure in order to determine outcome weights. For goal (iii), we propose to cross-validate the entire procedure to obtain a summary of the association between X and Y. We propose closed-form variance estimators for this associative measure that can be used to generate confidence intervals and hypothesis tests.

Super learning
First, we note that given ω, the maximizer of ρ²_{0,ω}(ψ_ω) over all functions mapping an observation x to a real value is ψ_{0,ω} := E_0(Y_ω | X). Due to linearity of expectation, E_0(Y_ω | X) = Σ_{j=1}^J ω_j E_0(Y_j | X). This implies that for any ω, a prediction function for Y_ω may be constructed by building a prediction function for each Y_j and weighting those prediction functions by ω. An alternative approach is to use multivariate regression methods that simultaneously predict the whole vector Y. [20] However, we find the univariate approach appealing in that it allows a rich collection of univariate learning methods to be utilized. In particular, we propose to use regression stacking [21,22], also known as super learning [23], to construct a prediction function for each outcome. Super learning entails constructing a library of M_j candidate learners for each of the J outcomes. The library of estimators can include parametric model-based estimators as well as machine learning algorithms. Parametric model-based methods include linear regression of Y_j on X, while machine learning-based methods include random forests [24], gradient boosted machines [25], and deep neural networks [26]. If X is high-dimensional, learners can incorporate dimension-reduction strategies. For example, we may include a main-terms linear regression of Y_j on all variables in X, while also including a linear regression using only a subset of X chosen based on their univariate associations with Y_j. The library of learners can also include different choices of tuning parameters for machine learning algorithms, an important consideration for methods that require proper selection of many tuning parameters (e.g., deep learning).
Because there is no way to know a priori which of these myriad estimators will be best, a cross-validated empirical criterion is used to adaptively combine, or ensemble, the various learners. This approach is appealing in the present application in that the best learner for each outcome might be different. By allowing the data to determine the best ensemble of learners for each outcome, we expect to obtain a better fit over all outcomes and thereby obtain a more accurate summary of the association between X and Y. Indeed, the super learner is known to be optimal in the sense that, under mild assumptions, its goodness-of-fit is essentially equivalent to that of the (unknown) best-fitting candidate estimator. [27,28] A full treatment of super learning [23] and an accompanying R package [29] are available. Here, we provide a brief overview of the procedure.
Super learning is often based on K*-fold cross-validation, and we illustrate the method for K* = 2 in Figure 1. The super learner for a given outcome Y_j may be implemented in the following steps:

Step 1: Partition the data.
Starting with a data set consisting of observations {O_i : i ∈ F*_n} for some F*_n ⊆ {1, ..., n}, we randomly partition the data into K* splits of approximately equal size. We define the k-th training sample as the observed data with observations in the k-th split removed. We use T*_{n,k} ⊂ F*_n to denote the indices of observations in the k-th training sample and V*_{n,k} to denote the indices of observations in the k-th validation sample for k = 1, ..., K*.

Figure 1: A flowchart illustrating a two-fold super learner procedure, which maps a set of observed data indices F*_n into Ψ_{j,β_{n,j}}(F*_n), an ensemble prediction function for outcome Y_j. Notation: T*_{n,k}, the k-th training sample; V*_{n,k}, the k-th validation sample; Ψ_{j,m}(T*_{n,k}), the m-th learner for the j-th outcome fit in the k-th training sample; ξ_{n,k}(Ψ_{j,m}), the k-th cross-validated mean squared-error; ξ_n(Ψ_{j,m}), the cross-validated mean squared-error; β_{n,j}, an M_j-length vector of super learner weights; Ψ_{j,m}(F*_n), the m-th learner for the j-th outcome fit using all observations in F*_n.
Step 2: Fit learners in training data. We now use the observations in the k-th training sample to fit each of the M_j candidate learners. We use Ψ_{j,m} to denote the m-th candidate learner for the j-th outcome. The learner Ψ_{j,m} takes as input a set of indices and returns a prediction function for Y_j. We use Ψ_{j,m}(T*_{n,k}) to denote the m-th candidate learner for the j-th outcome fit in the k-th training sample. For example, Ψ_{j,m} might correspond with fitting an ordinary least squares linear regression of Y_j on X, in which case Ψ_{j,m}(T*_{n,k}) corresponds to the fitted linear predictor based on the regression fit in the k-th training sample and Ψ_{j,m}(T*_{n,k})(x) corresponds to the linear predictor of the fitted regression evaluated at the observation x.
Step 3: Evaluate learners in validation data. Next, we use the data in the k-th validation sample to evaluate each of the learners Ψ_{j,m}(T*_{n,k}), m = 1, ..., M_j, via the mean squared-error ξ_{n,k}(Ψ_{j,m}) := (1/|V*_{n,k}|) Σ_{i ∈ V*_{n,k}} {Y_{j,i} − Ψ_{j,m}(T*_{n,k})(X_i)}², where we use |V*_{n,k}| to denote the number of observations in the k-th validation sample. We obtain an overall estimate of the performance of the m-th learner via the cross-validated mean squared-error ξ_n(Ψ_{j,m}) := (1/K*) Σ_{k=1}^{K*} ξ_{n,k}(Ψ_{j,m}).

Step 4: Find ensemble weights.
The cross-validation selector m̂ := argmin_m ξ_n(Ψ_{j,m}) is defined as the single learner with the smallest cross-validated mean squared-error, and one may use Ψ_{j,m̂}(F*_n) as the learner for the j-th outcome; i.e., we refit the cross-validation-selected learner using all the data and use this as the prediction function. However, it is often possible to improve on the fit of the cross-validation selector by considering ensemble learners of the form Ψ_{j,β_j} = Σ_m β_{j,m} Ψ_{j,m}, where the weights β_{j,m} are non-negative for all m and sum to one. The prediction function Ψ_{j,β_j} takes as input X, computes the prediction of Y_j using each of the M_j learners, and returns a convex combination of these predictions. We can compute β_{n,j} := argmin_{β_j} ξ_n(Ψ_{j,β_j}), the choice of weights that minimizes cross-validated mean squared-error, using standard optimization software.
Step 5: Refit learners using full data. We next refit the learners that received non-zero weight in β n,j using all observations in F * n . We denote these fits Ψ j,m (F * n ) for m = 1, . . . , M j .
Step 6: Combine learners into super learner. The super learner is the convex combination of the M_j learners using weights β_{n,j} and is denoted by Ψ_j(F*_n) := Ψ_{j,β_{n,j}}(F*_n). We conclude this subsection by noting that, while the super learner is an appealing choice in many applications, the remainder of our proposal does not rely on the super learner being used to develop a learner for Y_ω. In the remainder, we often refer to {Ψ_j : j = 1, ..., J} as a set of super learners; however, these could just as well be any learning algorithms.
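A minimal sketch of Steps 1 through 6 for a single outcome follows. An intercept-only learner and ordinary least squares stand in for a realistic library, and a one-dimensional grid stands in for a proper simplex-constrained optimizer (the SuperLearner R package solves this optimization differently); all names here are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 5
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Two candidate learners: intercept-only and ordinary least squares.
def fit_mean(xt, yt):
    mu = yt.mean()
    return lambda xv: np.full_like(xv, mu)

def fit_ols(xt, yt):
    X = np.column_stack([np.ones(len(xt)), xt])
    b = np.linalg.lstsq(X, yt, rcond=None)[0]
    return lambda xv: b[0] + b[1] * xv

learners = [fit_mean, fit_ols]
folds = np.array_split(rng.permutation(n), K)     # Step 1: partition

# Steps 2-3: cross-validated predictions, then MSE, for each learner.
cv_preds = np.zeros((n, len(learners)))
for v in folds:
    t = np.setdiff1d(np.arange(n), v)             # k-th training sample
    for m, fit in enumerate(learners):
        cv_preds[v, m] = fit(x[t], y[t])(x[v])

# Step 4: convex ensemble weights by grid search over the simplex.
grid = np.linspace(0, 1, 101)
cv_mse = [np.mean((y - (w * cv_preds[:, 1] + (1 - w) * cv_preds[:, 0])) ** 2)
          for w in grid]
w_ols = grid[int(np.argmin(cv_mse))]

# Steps 5-6: refit on all data and combine with the selected weights.
fits = [fit(x, y) for fit in learners]
def super_learner(xv):
    return (1 - w_ols) * fits[0](xv) + w_ols * fits[1](xv)
```

Because the data are strongly linear, the ensemble places nearly all weight on the OLS learner, mirroring how the cross-validation selector adapts the ensemble to each outcome.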

Choosing weights for composite outcome
The super learner is an aggressive learning algorithm and as such we must be concerned about overfitting when selecting weights for the composite outcome. If the super learner overfits the j-th outcome, then Y j may be erroneously upweighted in Y ω . To avoid overfitting, we propose to use an additional layer of K • -fold cross-validation when selecting outcome weights. Figure 2 illustrates our proposal for K • = 2.
Step 1: Partition the data.
Starting with a data set consisting of observations {O_i : i ∈ F•_n} for some F•_n ⊆ {1, ..., n}, we randomly partition the data into K• splits of approximately equal size. As above, we use T•_{n,k} ⊂ F•_n to denote the indices of observations in the k-th training sample and V•_{n,k} to denote the indices of observations in the k-th validation sample for k = 1, ..., K•.

Figure 2: A flowchart illustrating a two-fold outcome weighting procedure, which maps a set of observed data indices F•_n into Ω_n(F•_n), a vector of convex weights for the outcome Y. Notation: T•_{n,k}, the k-th training sample; V•_{n,k}, the k-th validation sample; Ψ_j(T•_{n,k}), the super learner for the j-th outcome fit in the k-th training sample; Ψ_ω(T•_{n,k}), the composite super learner based on weights ω; ρ²_{n,ω}(Ψ_ω), the cross-validated nonparametric R² for the composite super learner.
Step 2: Fit super learner for each outcome in training data. For each split k, we use the training sample to fit each outcome, resulting in learner fits Ψ j (T • n,k ) for j = 1, . . . , J. We note that the super learner itself requires cross-validation so that this step will include nested cross-validation. That is, we repeat the super learning procedure outlined in the previous section with F * n = T • n,k for k = 1, . . . , K • .
Step 3: Compute cross-validated R². For a given ω, we can compute the super learner prediction function for Y_ω as Ψ_ω(T•_{n,k}) := Σ_{j=1}^J ω_j Ψ_j(T•_{n,k}). For any ω, we can use data in the k-th validation sample to compute an estimate of the mean squared-error for Ψ_ω(T•_{n,k}), ξ_{n,ω,k}(Ψ_ω) := (1/|V•_{n,k}|) Σ_{i ∈ V•_{n,k}} {Y_{ω,i} − Ψ_ω(T•_{n,k})(X_i)}². The cross-validated mean squared-error is the average over the validation folds, ξ_{n,ω}(Ψ_ω) := (1/K•) Σ_{k=1}^{K•} ξ_{n,ω,k}(Ψ_ω). We further define Ψ̄_ω as the empirical average weighted outcome among observations in F•_n and ξ_{n,ω}(Ψ̄_ω) as the empirical mean squared-error for predicting the composite outcome Y_ω using the composite sample mean Ψ̄_ω. For any ω, we can combine these elements to compute the cross-validated nonparametric R² measure as ρ²_{n,ω}(Ψ_ω) := 1 − ξ_{n,ω}(Ψ_ω) / ξ_{n,ω}(Ψ̄_ω).

Step 4: Maximize cross-validated R². The maximizer ω_n := argmax_ω ρ²_{n,ω}(Ψ_ω) of cross-validated nonparametric R² may be computed using standard optimization software. In our simulation and data analysis, we used an interior algorithm with augmented Lagrange multipliers [30,31].
We conclude this section by noting that the theoretical parameter estimated by ω_n is ω_{0,n} := argmax_ω ρ²_{0,ω}(Ψ_ω), the weights that maximize the true (rather than cross-validated) nonparametric R² of the fitted composite super learner. We index this parameter by n to denote that it is data-adaptive in the sense that it is unknown without first using the data to construct the super learners [32].
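The weight-selection criterion in Steps 1 through 4 can be sketched as follows. For brevity, a per-outcome OLS fit stands in for the super learner (so the nested cross-validation layer is omitted), and with J = 2 the simplex search reduces to a grid over one free parameter; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 300, 5
x = rng.normal(size=n)
# Y1 is strongly predicted by X; Y2 is pure noise, so Y1 should dominate.
y = np.column_stack([x + 0.25 * rng.normal(size=n), rng.normal(size=n)])
folds = np.array_split(rng.permutation(n), K)

def cv_r2(omega):
    """Cross-validated nonparametric R^2 for the composite outcome Y_omega.
    A per-outcome OLS fit stands in for the super learner."""
    y_omega = y @ omega
    sq_pred = sq_mean = 0.0
    for v in folds:
        t = np.setdiff1d(np.arange(n), v)
        preds = np.zeros(len(v))
        for j in range(y.shape[1]):
            X = np.column_stack([np.ones(len(t)), x[t]])
            beta = np.linalg.lstsq(X, y[t, j], rcond=None)[0]
            preds += omega[j] * (beta[0] + beta[1] * x[v])
        sq_pred += np.sum((y_omega[v] - preds) ** 2)        # fold MSE of predictor
        sq_mean += np.sum((y_omega[v] - y_omega[t].mean()) ** 2)  # fold MSE of mean
    return 1.0 - sq_pred / sq_mean

# Grid search over the simplex (J = 2, so one free parameter).
omega_n = max((np.array([w, 1.0 - w]) for w in np.linspace(0, 1, 51)),
              key=cv_r2)
```

As expected, the selected weights concentrate on the outcome that is easiest to predict from X.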

Estimating associations
We may report the maximized value ρ²_{n,ω_n}(Ψ_{ω_n}) based on the full data F_n := {1, ..., n} as a relevant summary of the association between X and Y. However, this estimate may be overly optimistic due to overfitting in the weight selection, particularly in situations where Y is high-dimensional. Nevertheless, if one finds that ρ²_{n,ω_n}(Ψ_{ω_n}) is less than or equal to zero, then it is likely reasonable to conclude that there is no meaningful association between X and Y. Otherwise, we propose to utilize an outer layer of K-fold cross-validation to accurately assess the association. This proposal is illustrated in Figure 3 for K = 2.

Figure 3: A flowchart illustrating computation of a two-fold cross-validated measure of association between X and Y. This procedure maps the full data set into ρ²_{n,Ω}(Ψ_Ω), a real number that measures the strength of association. Notation: T_{n,k}, the k-th training sample; V_{n,k}, the k-th validation sample; Ψ_j(T_{n,k}), the super learner for the j-th outcome fit in the k-th training sample; Ω(T_{n,k}), outcome weights computed in the k-th training sample; Ψ_{Ω(T_{n,k})}(T_{n,k}), the super learner fits from the k-th training sample combined using weights also computed in the k-th training sample; ρ²_{n,Ω}(Ψ_Ω), the cross-validated nonparametric R² for the composite super learner.
Step 1. Partition the data. Starting with the full data set {O i : i ∈ F n }, randomly partition the data into K splits of approximately equal size. As above, we use T n,k ⊂ F n to denote the indices of observations in the k-th training sample and V n,k to denote the indices of observations in the k-th validation sample for k = 1, . . . , K.
Step 2. Fit super learner for each outcome in training data. For each split k, we use the training sample to fit each outcome, resulting in learner fits Ψ j (T n,k ) for j = 1, . . . , J. That is, we repeat the super learning procedure outlined in the previous subsection with F * n = T n,k for k = 1, . . . , K.
Step 3. Fit outcome weights in training data. In the previous section, we wrote the procedure that finds outcome weights as a mapping Ω from a subset of observations to a vector of numbers. We now apply this procedure to the training data. That is, we repeat the outcome weighting procedure from the previous subsection with F • n = T n,k for k = 1, . . . , K.
Step 4. Combine super learners based on outcome weights. Given weights Ω(T_{n,k}), we compute the super learner prediction function for Y_{Ω(T_{n,k})} as Ψ_{Ω(T_{n,k})}(T_{n,k}) := Σ_{j=1}^J Ω_j(T_{n,k}) Ψ_j(T_{n,k}).
Step 5. Compute cross-validated nonparametric R². We use each validation sample to compute the mean squared-error for the combined super learner, ξ_{n,Ω,k}(Ψ_Ω) := (1/|V_{n,k}|) Σ_{i ∈ V_{n,k}} {Y_{Ω(T_{n,k}),i} − Ψ_{Ω(T_{n,k})}(T_{n,k})(X_i)}², which provides an evaluation of the performance of the composite super learner for predicting the composite outcome based on the weights learned in the training sample. We define ξ_{n,Ω}(Ψ_Ω) := (1/K) Σ_{k=1}^K ξ_{n,Ω,k}(Ψ_Ω) as the average of this measure across the K splits. We evaluate the fit of the composite empirical mean Ψ̄ for the composite outcome via ξ_{n,Ω,k}(Ψ̄_Ω) := (1/|V_{n,k}|) Σ_{i ∈ V_{n,k}} {Y_{Ω(T_{n,k}),i} − Ψ̄_{Ω(T_{n,k})}}².
Similarly as above, we define ξ_{n,Ω}(Ψ̄_Ω) := (1/K) Σ_{k=1}^K ξ_{n,Ω,k}(Ψ̄_Ω) as the average across cross-validation folds. Finally, we compute the estimated cross-validated nonparametric R² measure of association as ρ²_{n,Ω}(Ψ_Ω) := 1 − ξ_{n,Ω}(Ψ_Ω) / ξ_{n,Ω}(Ψ̄_Ω). We conclude by noting that ρ²_{n,Ω}(Ψ_Ω) estimates the data-adaptive parameter ρ²_{0,n,Ω}(Ψ_Ω), which describes the true cross-validated performance of the estimated composite prediction function for predicting the composite outcome.
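The structure of the outer cross-validation layer can be sketched as below. To keep the example short, the per-outcome learner is an OLS fit and the training-fold weights are chosen without the inner cross-validation layer described in the previous section, so this illustrates only the shape of Steps 1 through 5; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 300, 5
x = rng.normal(size=n)
y = np.column_stack([x + 0.5 * rng.normal(size=n), rng.normal(size=n)])
folds = np.array_split(rng.permutation(n), K)

def fit_ols(xt, yt):
    X = np.column_stack([np.ones(len(xt)), xt])
    b = np.linalg.lstsq(X, yt, rcond=None)[0]
    return lambda xv: b[0] + b[1] * xv

num = den = 0.0
for v in folds:
    t = np.setdiff1d(np.arange(n), v)
    fits = [fit_ols(x[t], y[t, j]) for j in range(2)]

    def train_r2(w):
        # Weights judged on the training sample only (inner CV omitted here).
        yw = w * y[t, 0] + (1 - w) * y[t, 1]
        pred = w * fits[0](x[t]) + (1 - w) * fits[1](x[t])
        return 1.0 - np.mean((yw - pred) ** 2) / np.var(yw)

    w = max(np.linspace(0, 1, 51), key=train_r2)
    # Evaluate training-fold weights and fits on the held-out fold.
    yw_v = w * y[v, 0] + (1 - w) * y[v, 1]
    pred_v = w * fits[0](x[v]) + (1 - w) * fits[1](x[v])
    mean_t = w * y[t, 0].mean() + (1 - w) * y[t, 1].mean()
    num += np.sum((yw_v - pred_v) ** 2)   # accumulates xi(Psi)
    den += np.sum((yw_v - mean_t) ** 2)   # accumulates xi(mean)

rho2_cv = 1.0 - num / den
```

Because both the fits and the weights come from the training folds, rho2_cv is an honest estimate of how well the learned composite outcome can be predicted in new data.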

Inference
An asymptotically justified confidence interval can be constructed for ρ²_{n,Ω} using influence function-based methodology [32]. We propose to construct a Wald-style 100(1 − α)% confidence interval about the estimate log{ξ_{n,Ω}(Ψ_Ω)/ξ_{n,Ω}(Ψ̄_Ω)} and back-transform this interval to obtain an interval on the original scale. Thus, our interval is of the form 1 − exp[ log{ξ_{n,Ω}(Ψ_Ω)/ξ_{n,Ω}(Ψ̄_Ω)} ± z_{1−α/2} σ_n / n^{1/2} ], where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution and σ²_n is a consistent estimate of the asymptotic variance of log{ξ_{n,Ω}(Ψ_Ω)/ξ_{n,Ω}(Ψ̄_Ω)}. Similarly, a level α one-sided test of the null hypothesis of no association between X and Y can be performed by rejecting the null hypothesis whenever log{ξ_{n,Ω}(Ψ_Ω)/ξ_{n,Ω}(Ψ̄_Ω)} / (σ_n / n^{1/2}) < z_α.
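The back-transformation can be sketched as follows, taking the two mean squared-errors and a variance estimate as given (the argument sigma_n here is a placeholder input, not the influence function-based estimator itself).

```python
import math
from statistics import NormalDist

def r2_confidence_interval(mse_model, mse_mean, sigma_n, n, alpha=0.05):
    """Wald interval for the nonparametric R^2: build the interval for
    log(mse_model / mse_mean), then back-transform via R^2 = 1 - exp(.)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    log_ratio = math.log(mse_model / mse_mean)
    half_width = z * sigma_n / math.sqrt(n)
    # exp is increasing, so the upper log-ratio limit gives the lower R^2 limit.
    return (1 - math.exp(log_ratio + half_width),
            1 - math.exp(log_ratio - half_width))

lo, hi = r2_confidence_interval(mse_model=0.4, mse_mean=1.0, sigma_n=1.0, n=400)
```

Here the point estimate is R² = 1 − 0.4/1.0 = 0.6, and the interval is asymmetric around it because the Wald interval is symmetric only on the log mean squared-error ratio scale.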

Variable importance
Often we are not only interested in an overall summary of the association between X and Y, but also in the relative importance of each component of X. Ensemble machine learning may be criticized as a "black box" approach, in which it is difficult to understand the relative contributions of different variables. [33] To provide a better understanding of the "black box," we propose to study differences in the estimated association between X and Y when considering different subsets of variables. Our proposal is similar to variable importance measures proposed for specific machine learning algorithms, such as random forests, [24] in that it measures the change in predictive performance with and without each predictor variable considered. Although existing approaches may have poorly behaved statistical inference, [34] the present approach yields straightforward, asymptotically justified inference.
To assess the importance of a subset S ⊂ {1, ..., D} of variables X̃ := {X_s : s ∈ S}, we propose to repeat our procedure, but restricting to the variables not in this subset. We obtain an estimate ρ²_{n,Ω̃}(Ψ_Ω̃) of the cross-validated performance measure for the composite prediction algorithm based on this reduced set of variables. The joint importance of the variables in X̃ is defined by a contrast between ρ²_{0,n,Ω}(Ψ_Ω) and ρ²_{0,n,Ω̃}(Ψ_Ω̃), such as the additive difference ρ²_{0,n,Ω}(Ψ_Ω) − ρ²_{0,n,Ω̃}(Ψ_Ω̃). (3) Point estimates for (3) are constructed by plugging in the estimates, and asymptotically justified confidence intervals and hypothesis tests about these estimates are constructed using influence functions as in the previous section.
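The contrast in (3) can be illustrated by refitting with and without a variable group. Here a univariate outcome and an OLS fit stand in for the full composite super learner procedure, and the variable indices are ours.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 400, 5
x = rng.normal(size=(n, 3))
y = x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)   # x[:, 2] is irrelevant
folds = np.array_split(rng.permutation(n), K)

def cv_r2(cols):
    """Cross-validated R^2 of an OLS fit using only the columns in `cols`."""
    num = den = 0.0
    for v in folds:
        t = np.setdiff1d(np.arange(n), v)
        Xt = np.column_stack([np.ones(len(t)), x[np.ix_(t, cols)]])
        b = np.linalg.lstsq(Xt, y[t], rcond=None)[0]
        Xv = np.column_stack([np.ones(len(v)), x[np.ix_(v, cols)]])
        num += np.sum((y[v] - Xv @ b) ** 2)
        den += np.sum((y[v] - y[t].mean()) ** 2)
    return 1.0 - num / den

full = cv_r2([0, 1, 2])
importance_x0 = full - cv_r2([1, 2])   # drop the strong predictor
importance_x2 = full - cv_r2([0, 1])   # drop the irrelevant variable
```

The strong predictor receives a large additive importance, while the irrelevant variable's importance is near zero (and can even be slightly negative due to cross-validation noise).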

Simulations
We evaluated the finite-sample performance of the proposed estimators in two simulations. In the first, we studied the operating characteristics of our estimator. This simulation confirms that the asymptotic theory developed in previous sections leads to reasonable finite-sample performance for the estimators. In the second simulation, we compared our proposed method to existing methods for assessing associations between X and Y . This simulation shows that our method correctly identifies associations in situations where existing methods do not.
Simulation 1: operating characteristics of the estimator
The optimal R-squared for each outcome is about 0.60. The true optimal weighting scheme for the outcomes is ω_0 = (1/3, 1/3, 1/3), and the true optimal R-squared for the optimally combined outcome is approximately 0.81. We considered sample sizes of n = 100, 500, 1000, and 5000. For computational simplicity, we considered a small super learner library consisting of a main-terms linear regression (function SL.glm in the SuperLearner package), an intercept-only regression (SL.mean), and a forward-stepwise selection linear regression model (SL.step.forward). We used ten folds for each layer of cross-validation, K = K* = K• = 10.
The estimates of predictive performance were biased downwards in smaller sample sizes (Figure 4). However, even with n = 100 the average estimated performance was only about 5% too small. The confidence interval coverage was less than nominal in small samples, but was nominal in larger samples. We excluded X_2 and then excluded X_7 to estimate the additive difference in R-squared values (equation (3)) for these two variables. Examining the data-generating mechanism, we note that X_2 was important for predicting each individual outcome and should be important for predicting the composite outcome. The optimal R-squared for predicting each individual outcome without X_2 was 0.52, and the additive importance of X_2 for each individual outcome was 0.60 − 0.52 = 0.08. For the composite outcome, the optimal weights were unchanged (1/3, 1/3, 1/3), but the optimal R-squared without X_2 decreased to 0.68, so the additive importance of X_2 for the composite outcome was 0.81 − 0.68 = 0.13. In contrast, X_7 was important only for predicting Y_1 and had no effect on predicting Y_2 or Y_3. Therefore, the optimal composite outcome was expected to upweight Y_2 and Y_3 when excluding X_7, and the importance of X_7 for predicting the optimal composite outcome was expected to be minimal. The optimal weights without X_7 were (0.24, 0.38, 0.38), and the optimal R-squared without X_7 was 0.79, leading to an additive importance of 0.81 − 0.79 = 0.02.
The importance measures for X_2 exhibited substantial bias (about 25% of the true value) when n = 100, but both measures were unbiased at sample sizes of 1,000 and greater (Figure 5). Coverage of the nominal 95% confidence intervals was below nominal in small samples, but exceeded 90% for sample sizes greater than 500.

Simulation 2: power of associative hypothesis tests
In this simulation, X was simulated as above. The outcome Y was simulated as a 10-dimensional multivariate normal variate with mean µ = (µ 1 (x), . . . , µ 10 (x)) T and covariance matrix Σ. Letting Σ i,j denote the (i, j) entry in Σ, we set Σ i,i = 5 for i = 1, . . . , 10. We also set Σ 1,2 = Σ 1,3 = Σ 4,7 = −2 and Σ 3,9 = Σ 4,9 = Σ 5,9 = 2, while also appropriately setting the corresponding lower triangle elements to these values. All other off-diagonal elements were set to zero. We studied three settings. The first setting had no association between X and Y , and we set µ j (x) = 0 for j = 1, . . . , 10. The second setting had a linear association between X and one component of Y .
Here, we set µ_6(x) = −2 + 0.75x_1 and µ_j(x) = 0 for j ≠ 6. The third setting had a nonlinear association between X and one component of Y. Here, we set µ_6(x) = −2 + 0.75(x_1 − 2)² and µ_j(x) = 0 for j ≠ 6. We studied the power of level 0.05 tests of the hypothesis of no association between X and Y. Power was estimated as 100 times the proportion of the 1,000 total simulations in which the null hypothesis was rejected. We tested this hypothesis with the proposed one-sided hypothesis test with all layers of cross-validation set to 5-fold. We used a super learner library that included an intercept-only regression, a main-terms linear regression, and a generalized additive model. [35] We also tested this hypothesis using four common tests of canonical correlation: Wilks' Λ, the Hotelling-Lawley trace, the Pillai-Bartlett trace, and Roy's largest root. Rather than comparing these statistics to their respective approximate asymptotic distributions, we used permutation tests to compute p-values. [36] We also studied the power of a test based on a principal components approach. For this test, we constructed a composite outcome based on the first principal component of Y and subsequently fit a main-terms linear regression on X. The null hypothesis of no association was rejected whenever the p-value associated with the global F-test for this linear regression was less than 0.05.
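The permutation-testing logic, and the reason a linear statistic can miss a quadratic signal, can be sketched as follows. This toy example is ours: x is drawn centered at 2 so that the quadratic term is linearly uncorrelated with x, and the "flexible" statistic uses the oracle quadratic feature purely for illustration (our actual proposal uses the super learner-based test, not an oracle feature).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 150
x = rng.normal(loc=2.0, size=n)   # centered at 2: (x - 2)^2 is uncorrelated with x
y = -2.0 + 0.75 * (x - 2.0) ** 2 + rng.normal(size=n)

def perm_pvalue(feature, y, stat, n_perm=500):
    """Permutation p-value: compare the observed statistic to its
    distribution under random re-pairings of y with the feature."""
    observed = stat(feature, y)
    count = sum(stat(feature, rng.permutation(y)) >= observed
                for _ in range(n_perm))
    return (1 + count) / (1 + n_perm)

corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
p_linear = perm_pvalue(x, y, corr)                 # linear statistic: misses the signal
p_flexible = perm_pvalue((x - 2.0) ** 2, y, corr)  # oracle quadratic feature: detects it
```

The linear statistic has no power here even though the association is strong, which is the phenomenon driving the right panel of the simulation results.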
Under the null hypothesis, each canonical correlation-based test and the PCA-based test had approximately nominal type I error (Figure 6, left panel). However, our proposed test was found to be conservative, falsely rejecting the null hypothesis less than 1% of the time. In the second scenario, the canonical correlation-based tests had greater power to reject the null hypothesis than our proposed test at the two smallest sample sizes (middle panel). The PCA-based test had poor power, as the first principal component fails to adequately reflect the component of Y for which a relationship with X exists.

Group Variables Health care
Health care access, use of preventive health care Household Child:adult ratio, child dependency ratio, crowding index, urban score Socioeconomic status Total income, socioeconomic status Water and sanitation Sanitation, access to clean water Parental Mother age, father age, mother height, mother education (y), father education (y), marital status, mother age first child, parity Growth Weight-for-age Z-score, height-for-age z-score [37] (0, 6, 12, 18, 24 mo) Other Mother smoked during pregnancy, child's sex, gestational age at birth In the third scenario, only our proposed test had power to correctly reject the null hypothesis (right panel).

Data Analysis
The ongoing Cebu Longitudinal Health and Nutrition Study (CLHNS) enrolled Filipino women who gave birth in 1983-1984, and the children born to these women have been followed prospectively. [38,39] The long-term follow-up of these children has enabled researchers to quantify the long-term effects of prenatal and early-childhood nutrition and health on adolescent and adult health, economic, and developmental outcomes. [40,41] We are interested in assessing the strength of association between prenatal and early-childhood data and later schooling achievement. We are also interested in understanding the extent to which somatic growth (height and body weight) early in life is associated with later neurocognitive outcomes. [42,43] In 1994, achievement tests were administered to 2,166 children in three subjects: mathematics, English language, and Cebuano language.
We applied the present methods to assess the association of the three test scores with variables collected from birth to age two years (Table 1). For variables that had missing values, we created an indicator of missingness and set missing values of the original variable to zero. The library of candidate super learner algorithms included random forest, gradient boosted machine, and elastic net regression algorithms. The tuning parameters for each algorithm were chosen via nested 5-fold cross-validation. We used 10-fold cross-validation for each layer of cross-validation.
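The missing-data encoding described above (zero-fill plus a missingness indicator) can be sketched as follows; `encode_missing` is a hypothetical helper, with `None` marking a missing entry:

```python
def encode_missing(values):
    """Zero-fill a covariate's missing entries and add a 0/1 indicator
    column so learners can distinguish 'missing' from a true zero."""
    filled = [0.0 if v is None else float(v) for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator
```

For example, `encode_missing([1, None, 3])` returns `([1.0, 0.0, 3.0], [0, 1, 0])`.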
We computed variable importance measures by repeating the procedure, eliminating groups of variables and estimating the additive change in performance. We found that the child's sex and parental information were responsible for the largest proportion of the association with achievement test score performance (Figure 7). However, the somatic growth variables, as a group, modestly increased the association, with an estimated change in association of 0.01 (95% CI: 0.00, 0.02; P = .02).
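The group importance calculation can be sketched as a leave-one-group-out difference. Here `fit_score` is a hypothetical stand-in for the full cross-validated association measure applied to a given set of covariate columns:

```python
def group_importance(fit_score, columns, groups):
    """For each named group, the additive change in the association
    measure when that group's columns are dropped from the covariate set."""
    full = fit_score(columns)
    return {name: full - fit_score([c for c in columns if c not in members])
            for name, members in groups.items()}
```

A toy `fit_score` that credits 0.1 per retained column illustrates the bookkeeping; in the actual analysis, each call re-runs the cross-validated super learner procedure.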

Discussion
Our proposed method provides a new means of summarizing associations with multivariate outcomes and our simulations demonstrate that it provides powerful inference for tests of association when complex and nonlinear relationships are present. We found that existing tests provide greater power in small samples against alternatives with linear relationships between covariates and outcomes. This is unsurprising as our approach relies on several layers of cross-validation, which may stretch small samples too thin. We also found that our test was conservative under the null hypothesis. This can be explained by the fact that the true value of cross-validated R-squared was often less than zero, while our test was based on testing a value of zero for this parameter. We suggest that a permutation test could be used to construct a hypothesis test with better operating characteristics; however, this may often prove to be computationally intractable in practice. In such cases, we may instead opt for the more conservative, but less computationally intensive test.
A possible criticism of our approach is the data-adaptive nature of the parameter ρ^2_{0n,Ω}(Ψ_Ω). One could argue that this parameter is a useful measure of association only insofar as the relevant prediction functions approximate ψ_{0,ω_0}, the unknown true conditional mean of the optimally combined outcome. This strengthens the argument for using super learning to obtain a prediction function that is as accurate as possible. A more direct measure of association is (1) evaluated at ω_0. Future work will be devoted to strategies for direct estimation of this quantity. On the other hand, as mentioned above, in some settings the goal is to construct a composite outcome and a prediction function for that outcome. In these cases, our associative measure is directly relevant, as it approximates how well we might expect to predict the composite outcome of future patients.
Though our proposal is computationally intensive, the vast majority of the computational burden lies in fitting the candidate learners, which can be implemented in a parallelized framework. The computational burden of the procedure may be further reduced by clever choices of the number of cross-validation folds. When K = K* = K°, the procedure requires fitting K^3 × Σ_{j=1}^J M_j total candidate learners. However, if K° = K − 1 and K* = K − 2, it is possible to re-use candidate learner fits at different points in the procedure, thereby drastically decreasing its computational burden. In fact, we can show that this approach requires only (K^3 + 5K)/6 × Σ_{j=1}^J M_j candidate learner fits. Thus, for K = 10, the latter approach is expected to execute in 17.5% of the time it would take if K = K* = K° = 10. An R package implementing our methods is available (link to be added upon acceptance).
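The fit-count accounting above can be checked directly. In this sketch, M_total stands for Σ_{j=1}^J M_j, the total number of candidate learners in the library:

```python
def naive_fit_count(K, M_total):
    """Fits required when all three cross-validation layers use K folds
    (K = K* = K°) with no re-use across layers."""
    return K ** 3 * M_total

def reuse_fit_count(K, M_total):
    """Fits required when learner fits are re-used across layers
    (K° = K - 1, K* = K - 2); K^3 + 5K is always divisible by 6,
    so integer division is exact."""
    return (K ** 3 + 5 * K) // 6 * M_total
```

For K = 10 and a single learner, the counts are 1000 versus 175 fits, giving the 17.5% figure quoted above.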