Published by De Gruyter, September 28, 2013

Efficient Analysis of Q-Level Nested Hierarchical General Linear Models Given Ignorable Missing Data

Yongyun Shin and Stephen W. Raudenbush

Abstract

This article extends single-level missing data methods to efficient estimation of a Q-level nested hierarchical general linear model given ignorable missing data with a general missing pattern at any of the Q levels. The key idea is to reexpress a desired hierarchical model as the joint distribution of all variables including the outcome that are subject to missingness, conditional on all of the covariates that are completely observed, and to estimate the joint model under normal theory. The unconstrained joint model, however, identifies extraneous parameters that are not of interest in subsequent analysis of the hierarchical model and that rapidly multiply as the number of levels, the number of variables subject to missingness, and the number of random coefficients grow. Therefore, the joint model may be extremely high dimensional and difficult to estimate well unless constraints are imposed to avoid the proliferation of extraneous covariance components at each level. Furthermore, the over-identified hierarchical model may produce considerably biased inferences. The challenge is to represent the constraints within the framework of the Q-level model in a way that is uniform without regard to Q; in a way that facilitates efficient computation for any number of Q levels; and also in a way that produces unbiased and efficient analysis of the hierarchical model. Our approach yields Q-step recursive estimation and imputation procedures whose qth-step computation involves only level-q data given higher-level computation components. We illustrate the approach with a study of growth in body mass index, analyzing a national sample of elementary school children.

1 Introduction

A seminal contribution to statistical methodology is the development of efficient methods for handling missing data within the framework of a general linear model (GLM [17]). These methods provide efficient estimation of the GLM given incomplete data. In particular, model-based multiple imputation now provides state-of-the-art methods for handling missing data [5]. These approaches are founded on the assumption, comparatively mild in many applications, that missing data are ignorable [2, 4].

This article extends the methodology to an arbitrary Q-level hierarchical GLM where lower-level units are nested within higher-level units [8, 9]. Many multi-level observational studies and controlled experiments produce missing data. In cluster-randomized experiments, the dominant design involves the random assignment of whole schools, hospitals or communities, rather than students, patients or adults, to treatments [10]. Multi-level analysis is pervasive in health, education and social science studies [11–14]. Surveys involve multi-stage sampling designs [15]. A ubiquitous problem is that explanatory as well as outcome variables may be subject to missingness at any of the levels.

In longitudinal studies, hierarchical data subject to missingness may be estimated by maximum likelihood (ML) in a structural equation model (SEM) where latent means include missing data [16–20]. SEM software such as Mplus [21], Amos [22] and EQS [23] performs ML estimation of such models. When these models are formulated by multi-group analysis, the number of groups is the number of missing patterns [16, 19, 20].

Recent advances have extended the single-level methods to multi-level ignorable missing data. Liu et al. [24] applied Bayes inference to longitudinal designs having a fixed within-subject design with repeated measures at level 1 nested within persons at level 2 where the data are missing at both levels. Schafer and Yucel [25] developed Bayes and ML inference for a broader class of two-level designs in which the level-1 design could vary across level-2 units with level-1 data subject to missingness. Goldstein and Browne [26, 27] took a Bayesian approach to a two-level factor analysis where missing outcomes were imputed by a Gibbs sampling step. Shin and Raudenbush [14, 28] extended these methods to a two-level model where the outcome and covariates may have missing values at both levels. Shin and Raudenbush [29] and Shin [30] illustrated an efficient ML method to estimate a three-level model with incomplete data. To estimate a three-level hierarchical linear model with incomplete data, Yucel [31] modified a single-level imputation method [6, 32] and a two-level imputation method [25] to carry out a Gibbs sampler that sequentially imputes cluster-level missing values, intermediate-level missing values given the multiply imputed cluster-level data and then the lowest-level missing values given the multiply imputed data at higher levels. These advances guide our approach for continuous outcomes. Goldstein et al. [33] and Goldstein and Kounali [34] used a Markov chain Monte Carlo method to impute a mixture of continuous and discrete outcomes subject to missingness in a two-level model.

Shin and Raudenbush [28] illustrated two ways to handle two-level missing data: direct ML estimation () and a two-stage procedure of multiple imputation followed by a second-stage analysis of the multiply imputed data (). This article generalizes the two methods to an arbitrary number of Q levels and an arbitrary number of outcomes defined at any level. A key emphasis in this article is the difference in logic and assumptions between the two methods. Using , one first writes down a desired hierarchical model, then reparameterizes the model in terms of the joint distribution of the outcome and covariates subject to missingness given the completely observed covariates. Great care must be taken so that the transformation is one-to-one in order to ensure unbiased estimation [14, 28]. This task generally requires the imposition of constraints on regression surfaces at each level to avoid the proliferation of covariance components at higher levels of the joint distribution. One challenge for this article is to formulate these constraints within the framework of the Q-level model. We show that the unconstrained joint model identifies contextual effects [9] and interaction effects that are typically extraneous for the desired model. In contrast, generally implies that the imputation model should be unconstrained, allowing the data analyst to impose the desired constraints at the second stage when using conventional software to analyze the imputed data. A challenge with is that the need to avoid constraints at stage 1 may lead to the formulation of extremely high-dimensional imputation models that may be difficult to estimate well. The two methods have characteristic advantages and disadvantages. imposes more assumptions than does , because imposes distributional assumptions on all variables subject to missingness while for , such assumptions do not affect the observed data. Given the pluses and minuses, this article considers a hybrid approach that combines the two methods.

Our aim is to formulate a Q-level model that unifies single- and multi-level models into a single expression, facilitating extension of existing missing data methods to an arbitrary number of levels of a linear model with efficient estimation and computation. The model has two representations: a hierarchical linear model for a response variable conditional on covariates and a marginal model that represents the joint distribution of all variables – the response and covariates – that are subject to missingness conditional on the completely observed covariates. It is essential to clarify the relationship between these two models; in particular, the conditional model is always equivalent to the joint model after imposing certain constraints on the more general joint model. To clarify the needed constraints in a general Q-level setting, we find it revealing to reparameterize both forms of the model such that all variables subject to missingness are decomposed orthogonally by level. In this model, random components are correlated within but uncorrelated between levels. The required constraints then fall out naturally for any number of levels.

The next section defines the joint model and decomposes all variables subject to missingness into orthogonal random components. Section 3 considers the problem of estimation and multiple imputation. The orthogonal decomposition is helpful. The key insight is that if we stack the joint model by level, we can write down estimation formulas at level q using level-q data only given higher-level computation components that are uniform for all q even if we ignore all other-level data. This recursive representation yields efficient computations using conventional ML methods such as the EM algorithm [1]. Thus, orthogonal decomposition by level transforms a seemingly intractable computational problem into a sequence of familiar, solvable problems as described in Section 3. Section 4 introduces the conditional hierarchical linear model. We show that the joint model represents more parameters than desired in the hierarchical model and describe how to constrain the joint model for identification of the desired hierarchical model. Incorporating the constraints is essential to avoid bias for . For , the constraints do not reduce bias but may nonetheless be practically necessary for computation involving high-dimensional models. Analyzing data from a large, nationally representative longitudinal sample of children, we illustrate these methods to study predictors of the growth of body mass index (BMI) between ages 5 and 15 in Section 5. This is a three-level problem, with up to seven repeated measures on children who are sampled within their elementary schools. Section 6 concludes with a discussion of limitations and next steps in the Q-level research agenda.

2 Joint model

All the models described in this article are subsets of a multi-level p-variate model

(1)

where every element of Y is subject to missingness, may be a linear function of completely observed covariates, Z is a matrix of completely observed covariates having random effects b and . For simplicity of exposition, we shall assume in this section. We partition such that is a vector of variables at level q in a hierarchy of Q levels for . In the case of where occasions are nested within children, for example, elements of such as BMI and daily television viewing hours (TV) vary across occasions at the lower level 1, and elements of such as years of highest parent education and birth weight vary among children at the higher level 2.

The variance-covariance matrix V may be structured by the fact that varies at level q or higher. In the case of , for example, the BMI in varies within as well as between children, while the birth weight in varies between children but not within a child. Thus, we partition , a diagonal matrix having diagonal submatrices , and , and decompose orthogonally by level as

(2)

for , and where is a matrix of known covariates having level-r unit-specific random effects for subscript and superscript r denoting the level of variation. The orthogonal random effects are independent between levels so that , but correlated within levels by so that for . Then, , a matrix with submatrices for . We may now express eq. [2] as having for where , and decompose for by the level of variation as

(3)

To reveal the orthogonal decomposition explicitly, we show all random components of the joint model (1) in Table 1.

Table 1

All random components of Y in eq. [1]

1 2 3 … Q

The row label indicates a vector of variables that are decomposed into the random components listed in that row. The random components enable us to write down estimation formulas at level q that are uniform for all q and that use level-q data only given higher-level computation components. This representation facilitates construction of efficient computation formulas, as will be explained in Section 3. The column label shows the level at which the random components in the column vary. Table 1 lists random effects that are correlated within, but uncorrelated across, columns or levels. Column q in Table 1 lists level-q unit-specific random effects for

(4)

Thus, all random effects b of the joint model (1) may also be expressed as , displayed vertically in Table 1. Their parameters are . This expression is useful for deriving estimators, as will be described in the next section.
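The orthogonal decomposition can be seen numerically in the simplest case. The Python sketch below (sizes and seed are arbitrary; this illustrates the idea only, not the paper's estimator) splits a two-level outcome into a cluster-mean component and a within-cluster deviation; the cross-products of the two components sum to zero, which is the between-level uncorrelatedness that Table 1 exploits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-level data: J clusters of n observations each (hypothetical sizes).
J, n = 50, 20
cluster = np.repeat(np.arange(J), n)
y = rng.normal(size=J * n) + rng.normal(size=J)[cluster]  # level-1 + level-2 variation

# Decompose y orthogonally by level: cluster mean (level 2) + deviation (level 1).
cluster_mean = np.array([y[cluster == j].mean() for j in range(J)])[cluster]
within_dev = y - cluster_mean

# The components reproduce y, and their cross-products sum to (numerically) zero,
# so the level-1 and level-2 components are uncorrelated between levels.
assert np.allclose(cluster_mean + within_dev, y)
cross = np.sum((cluster_mean - cluster_mean.mean()) * within_dev)
assert abs(cross) < 1e-9
```

Because the within-cluster deviations sum to zero inside each cluster, the orthogonality holds exactly in the sample, not just in expectation.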

3 Estimation of the joint model

The orthogonal decomposition by level shown in Table 1 enables us to write down the joint model (1) as a familiar mixed linear model (2) for any level q. Next, we shall exploit the fact that if we stack these level-specific models such that there is an equation at level q and a second equation stacked at all levels higher than q, we can write down familiar estimation formulas that use level-q data only given higher-level computation components even when we completely ignore all data at other levels. Moreover, the estimation formulas remain uniform for all q to produce efficient computation formulas as will be explained below. Therefore, the orthogonal decomposition of the joint model (1) enables us to obtain general, recursive and familiar estimation formulas for the Q-level problem.

3.1 Structure of joint model for all data at or above level q

Based on the model (2) at level q, we stack , and to express the joint model (1) at level q or above as

(5)

for , , , and where and for the transpose of . Note that we express , and uniformly for all q to contain , and , respectively. For , for example, for , , , , ; for , , , and where , , , and ; and for , , , and where , , , and for , , and .

Eq. [5] for expresses the joint model (1) where may include covariates having random effects at all levels. Often, the model (5) itself is of interest [29, 30]. For a positive integer n, let denote an n by n identity matrix. In this article, we focus on estimation of the model (5) where for all q as many applications do. In what follows, we use the Kronecker product that multiplies matrix B to each scalar element of matrix A [35]. In particular, .
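As a quick check of the Kronecker product just defined, the numpy sketch below (matrices chosen arbitrarily) verifies the block structure of A ⊗ B and the block-diagonal form that results when the left factor is an identity matrix.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 5],
              [6, 7]])

# Kronecker product: B multiplies each scalar element of A, block by block.
K = np.kron(A, B)
assert K.shape == (4, 4)
assert np.array_equal(K[:2, :2], 1 * B)  # top-left block is a11 * B
assert np.array_equal(K[2:, :2], 3 * B)  # bottom-left block is a21 * B

# I ⊗ B stacks B along the diagonal: a block-diagonal matrix.
I2B = np.kron(np.eye(2, dtype=int), B)
assert np.array_equal(I2B[:2, :2], B) and np.array_equal(I2B[2:, 2:], B)
assert np.all(I2B[:2, 2:] == 0)
```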

3.2 Estimation

In deriving estimators, it is essential to aggregate the stacked-up joint model (5) at the highest level-Q cluster m. We refer to subscript m as a unit at the highest level Q for ; as the number of units at level q nested within the cluster m; and , hereafter. Let and aggregate all and in cluster m for . The level-q unit-specific random effects in column q of Table 1 are aggregated to form the aggregated eq. [4] for cluster m

Table 1 may also be aggregated to reveal all random components for cluster m. The aggregated table is of the same form as Table 1 with row label replacing , the same column label and the vector instead of in column q. The random components in row of the table combine to form the aggregated model (2)

(6)

for a conformable matrix of known covariates , and where . Then, for , and has for .

Next, we stack , and to express the aggregated model (5) uniformly for all q as

(7)

for , , , and where and . By expressing having for and so that , and for , we structure in a familiar form [28] that is uniform for all q and has recursive .

Shin and Raudenbush [28] illustrated how to efficiently estimate the two-level model (7) for by ML via the EM algorithm where and are vectors of arbitrary length. The key insight of the model (7) is to express the familiar two-level form uniformly at each level q and estimate it by the method of Shin and Raudenbush [28] given computation components at level , starting from the highest level until we estimate the desired model for with arbitrary Q levels at . The initial step is to estimate the single-level model (7) for , a special case of Shin and Raudenbush’s two-level model. We formalize the recursive estimation within each iteration of the EM algorithm after defining notation for estimation below. A major advantage of this approach is that the step-q estimation uses only level-q data given higher-level computation components for efficient computation as will be shown in the following section.

To express the relationship between complete and observed data, let be a matrix of dummy indicators for observed values in , so that and for [28]. At level Q, and for all m. Let and for to express the observed model (7)

(8)

Estimation of is carried out via the EM algorithm. The complete data (CD) for cluster m may be viewed as given . Let for the entire sample. If we denote in column q of Table 1 for unit i nested within cluster m, the CD log-likelihood of may be expressed as for the density of level-q unit-specific . The CD ML estimators are . The E components come from the conventional estimation of , which requires inversion of that may be extremely high dimensional. The orthogonal decomposition by level of the joint model (1) enables an expression of that is uniform for all q and based on level-q data only given computation components from higher levels. As will be explained below, this recursive expression yields successive Q-step estimation formulas from down to for efficient computation of the E step without directly inverting .
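The E step here generalizes the familiar single-level EM for a normal model with ignorable missing data. As a reference point only (this is not the paper's Q-level algorithm, and all simulation values are illustrative), the sketch below runs that single-level EM for a bivariate normal with values missing in the second coordinate: missing entries are replaced by their conditional means, and the retained conditional variance corrects the second moments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: bivariate normal with ~30% of x2 missing completely at random.
n = 2000
mu_true = np.array([1.0, -0.5])
S_true = np.array([[1.0, 0.6],
                   [0.6, 2.0]])
X = rng.multivariate_normal(mu_true, S_true, size=n)
miss = rng.random(n) < 0.3  # ignorable (MCAR) missingness indicator for x2

mu = np.array([X[:, 0].mean(), X[~miss, 1].mean()])  # crude starting values
S = np.eye(2)
for _ in range(200):
    # E step: replace each missing x2 by E[x2 | x1] under current (mu, S),
    # and retain Var[x2 | x1] to correct the expected second moments.
    b = S[0, 1] / S[0, 0]             # regression slope of x2 on x1
    cvar = S[1, 1] - b * S[0, 1]      # conditional variance of x2 given x1
    x2_hat = np.where(miss, mu[1] + b * (X[:, 0] - mu[0]), X[:, 1])
    Xc = np.column_stack([X[:, 0], x2_hat])
    # M step: complete-data ML estimates from expected sufficient statistics.
    mu = Xc.mean(axis=0)
    S = (Xc - mu).T @ (Xc - mu) / n
    S[1, 1] += miss.mean() * cvar     # E-step variance correction
```

Note that the estimate of the fully observed coordinate's mean always equals its sample mean, since that coordinate is never imputed.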

To estimate fixed effects, let for in the model (1) where is a matrix of completely observed covariates having fixed effects . We replace with in the level-q model (2) to express the stacked-up model (5) as for , and . Let aggregate such that the corresponding model (7) has for and . Then, and , so that the observed model (8) is . The desired parameters are for . For simplicity of notation, let and denote the inverses of and , respectively. Given current , the Fisher scoring equivalent to the Newton–Raphson update is

(9)

The following section describes efficient recursive computation of based on and .
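As a point of comparison for the Fisher scoring update in eq. [9], the sketch below performs the analogous update for fixed effects in a single-level linear-normal model with a known diagonal covariance (a hypothetical setup; the paper's V carries the full multi-level structure). In the linear-normal case, one scoring step solves the score equations exactly, yielding the generalized least squares estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small linear model with known heteroscedastic (diagonal) covariance V.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5])
v = 0.5 + rng.random(n)                      # diagonal entries of V
y = X @ beta_true + rng.normal(size=n) * np.sqrt(v)

# With V known, Fisher scoring converges in one step to the GLS solution
# beta = (X' V^{-1} X)^{-1} X' V^{-1} y.
Vinv_X = X / v[:, None]                       # V^{-1} X for diagonal V
beta_hat = np.linalg.solve(Vinv_X.T @ X, Vinv_X.T @ y)

# Standard errors from the inverse Fisher information (X' V^{-1} X)^{-1}.
se = np.sqrt(np.diag(np.linalg.inv(Vinv_X.T @ X)))
```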

Eq. [7] expresses a single-level GLM when is an identity matrix and has for all m. The clustering of sample data discussed above gives rise to a Q-level GLM. Next, we show that the aggregated joint model (7) enables us to write down efficient Q-step recursive estimation formulas whose qth-step computation involves level-q data only and thus is not unduly burdened by the number of levels Q, the number of variables p or the number of random effects.

3.3 Efficient computation

The conventional E step based on requires inversion of , which may be extremely high dimensional and thus costly to compute within each iteration of the EM algorithm. On the other hand, the E step based on eq. [8] produces Q-step estimation formulas whose qth step estimates where

(10)

for and . The key advantages of the E step via eq. [10] are that and stay uniform for all q; that computation of and uses level-q data only, given and ; that the expressions (10) enable efficient computation of and via computation of recursive components and ; and that the E step does not require direct inversion of . Estimation of and starts at the highest level with initial components

(11)

The recursion for computes and using level-q data only, given and at step q, and finally evaluates eq. [10] given and at , within each iteration of the EM algorithm.
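The computational payoff of avoiding a direct inversion can be seen in miniature with the Woodbury identity, which inverts a covariance of the form ZGZ' + σ²I at the cost of a q×q inverse rather than an n×n one. This is a simplified analogue of the idea, not the paper's recursion; all dimensions and values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

n, q = 500, 4                      # n observations, q random effects
Z = rng.normal(size=(n, q))
G = np.diag(rng.random(q) + 0.5)   # hypothetical random-effects covariance
s2 = 1.3                           # level-1 residual variance

V = Z @ G @ Z.T + s2 * np.eye(n)   # n x n marginal covariance

# Woodbury identity:
#   V^{-1} = (1/s2) I - (1/s2) Z (G^{-1} + Z'Z/s2)^{-1} Z' (1/s2),
# so only a q x q matrix is inverted instead of the n x n matrix V.
small = np.linalg.inv(np.linalg.inv(G) + Z.T @ Z / s2)
Vinv = np.eye(n) / s2 - (Z @ small @ Z.T) / s2**2

assert np.allclose(Vinv @ V, np.eye(n), atol=1e-6)
```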

To formulate the recursive computation, let

(12)

where , , and for . Note that computation of and requires level-q and only, given , and . The following result shows that and depend on level-q data only, given higher-level components and . See the Appendix for proofs.

Proposition 3.1. and depend on level-q data, and only, given and for all .

Proposition 3.1 enables a Q-step recursive computation of and whose qth step involves level-q data only, without directly inverting .

Theorem 3.2. and can be computed by a Q-step recursive procedure from level Q down to level 1 given , where step q involves level-q data only.

The E step for , for example, computes and in eq. [11] initially for the inverse of ; and in eqs [32] and [33] given and at ; and in eqs [32] and [33] given and to finally yield and in eq. [10] at .

Fisher scoring on may also be based on recursive computation of , ,

(13)

The following result shows that , and depend on level-q data only, given , , , , and .

Proposition 3.3. , and depend on level-q data, and only, given , , , , and for all .

Propositions 3.1 and 3.3 enable to be computed recursively.

Theorem 3.4. and can be computed by a Q-step recursive procedure from level Q down to level 1 given , where step q involves level-q data only.

The Fisher scoring on for , for example, computes , , , and initially; , , , and in eqs [32–36] from level-2 data, given , , , and at ; and in eqs [34] and [35] from level-1 data given , , , and to finally yield in eq. [9] at .

The recursive estimation efficiently handles missing data one level at a time. Consequently, the computation remains efficient even when many variables are subject to missingness at higher levels. In that case, the observed joint model (8) yields recursive computation that is not excessively burdened with respect to Q, p and the number of random effects. On the other hand, given missing data at level 1 only, this approach amounts to the conventional EM algorithm. The inverse of the Fisher information matrix yields .

Multiple imputation is based on given the ML . Let and be variances of and where and for diagonal and off-diagonal in . Then, and estimated by ML imply the joint distribution of a vector, , of distinct and from . Let the distributions of and be and estimated by ML, respectively. To propagate uncertainty in estimation of for proper imputation [2], we randomly generate from and from for all q, and then impute missing data given the for each imputation [28].
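After the multiply imputed data sets are analyzed, the completed-data results are pooled. The sketch below implements the standard pooling rules for a scalar estimand (Rubin's rules); `rubin_pool` is a hypothetical helper name, and the per-imputation estimates are made-up numbers. The between-imputation variance inflates the total variance, which is how the propagated parameter uncertainty shows up in the final inference.

```python
import numpy as np

def rubin_pool(est, var):
    """Pool M completed-data estimates and variances via Rubin's rules."""
    est, var = np.asarray(est, float), np.asarray(var, float)
    m = len(est)
    qbar = est.mean()             # pooled point estimate
    W = var.mean()                # within-imputation variance
    B = est.var(ddof=1)           # between-imputation variance
    T = W + (1 + 1 / m) * B       # total variance
    return qbar, T

# Hypothetical results from M = 5 completed-data analyses of one coefficient.
est = [1.02, 0.97, 1.05, 1.00, 0.96]
var = [0.040, 0.042, 0.039, 0.041, 0.043]
qbar, T = rubin_pool(est, var)
assert T > np.mean(var)  # imputation uncertainty inflates the variance
```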

Thus far, we have focused on estimating the joint model (1) for variables subject to missingness given completely observed covariates. However, our goal in this article is to estimate a general Q-level hierarchical GLM for a univariate response conditional on covariates where the covariates as well as the response may have ignorable missing data at any of the levels. To efficiently estimate the GLM, we have to reparameterize it in the form of the joint model (1). The next section will introduce the desired GLM, clarify its relationship with the joint model and describe methods to efficiently estimate the GLM via the joint model.

4 Hierarchical general linear model

The aim of this article is to estimate a Q-level hierarchical GLM that is a special case of the joint model (1) in which a univariate response is defined at the lowest level of aggregation. We show that the joint model over-identifies the desired model, in general, and describe how to constrain the joint model to be a one-to-one transformation of the GLM. Without the one-to-one correspondence, the estimated GLM via may be substantially biased [28]. For simplicity of explication, we first consider the desired GLM where all covariates having fixed effects are subject to missingness. This consideration is without loss of generality, because completely observed covariates having fixed effects do not affect the constraints on the joint model. After explaining the needed constraints for , we consider a more realistic GLM having both covariates subject to missingness and covariates completely observed.

We write the general Q-level hierarchical GLM

(14)

for having fixed effects , having random effects , and where A and are vectors of level-1 and level-r covariates having fixed effects and , respectively, is a vector of covariates having level-r unit-specific random effects independent across levels and . Both R and C are subject to missingness, while D is known. We assume and that carries an intercept as many applications do, although it is not required to have one.

The aim of this article is to efficiently estimate the Q-level hierarchical GLM (14) given incomplete data. To do so, we must reparameterize eq. [14] in the form of the joint distribution (1) of all variables subject to missingness – including the response and covariates at any level given D. We define the first element of as the response R and the remaining elements of as covariates A to partition and decompose Y into the response and covariates where we decompose and orthogonally by level. Then, eq. [14] is a special case of the model (1) for where implies partitioning

(15)

and C implies for . Then, for

(16)

We refer to as the form of the joint model (1) for efficient estimation of the GLM (14) in this article. Then, the GLM (14) implies

When , the reparameterization is one-to-one between eqs [1] and [14], and no difficulties arise in computation and interpretation as illustrated in Section 4.1. However, when , we find that the reparameterization required to equate the conditional model (14) to the corresponding joint model (1) can be quite challenging. Without imposing constraints, the joint model will over-identify the conditional model (14). We can readily comprehend this problem in the case of two-level models shown in Sections 4.2 and 4.3. We then generalize our approach in the subsequent section. We see that the problem of over-identification can become severe, as covariates, levels and random coefficients are added to the model.

4.1 Single-level model

Eq. [14] for , , and expresses the conventional ordinary least squares (OLS) regression model as the conditional distribution of R given covariates A. Efficient estimation of the model from ignorable missing data [2, 4] is straightforward when we estimate the corresponding joint model (1)

(17)

for , and . The OLS model (14) implies and to yield the one-to-one transformations and . That is, the parameters in the OLS model (14) and the variance and covariance components in are one-to-one functions of the parameters in the joint model (17). Equivalently, the parameters in the OLS model are one-to-one functions of without redundantly counting the number of parameters in .
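The one-to-one transformation in the single-level case can be checked numerically: from the joint moments of (R, A), the conditional slopes are obtained by solving the covariance of A against its covariance with R, and the intercept follows from the means. The sketch below (simulated data with arbitrary values) recovers exactly the coefficients of a direct OLS fit on the same data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate (R, A) jointly: two covariates and a response (illustrative values).
n = 1000
A = rng.normal(size=(n, 2)) @ np.array([[1.0, 0.3],
                                        [0.0, 1.0]])
R = 0.7 + A @ np.array([1.5, -2.0]) + rng.normal(size=n)

# Sample moments of the joint distribution of (R, A).
Y = np.column_stack([R, A])
mu = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)

# Joint-to-conditional transformation: slopes solve Sigma_AA beta = Sigma_AR,
# and the intercept is mu_R - beta' mu_A.
beta = np.linalg.solve(S[1:, 1:], S[1:, 0])
alpha = mu[0] - beta @ mu[1:]

# The same coefficients come out of a direct OLS fit with an intercept.
Xd = np.column_stack([np.ones(n), A])
coef = np.linalg.lstsq(Xd, R, rcond=None)[0]
assert np.allclose([alpha, *beta], coef)
```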

The general forms of the joint model (1) implied by eq. [16] and the conditional model (14) remain intact when we consider the hierarchical linear model with arbitrary Q levels. A central concern of this article is that, when we move beyond the single-level case for , the desired model (14) is not a one-to-one transformation of the joint model. To see how this works, we consider two-level data where level-1 units (e.g. students) are nested within level-2 units (e.g. schools) before we consider arbitrary Q levels. We shall consider the cases of the two-level model with a random intercept and the two-level model with random coefficients.

4.2 Random-intercept model

A comparatively simple two-level hierarchical linear model with a random intercept is of form

(18)

a special case of model (14) with , , , and . The corresponding joint model (1) is

(19)

where , and . We can see now that the desired model (18) constrains the joint model (19) by and such that

(20)
(21)

To see how many constraints the desired model (18) has placed on the joint model (19), note that the constrained model (18) identifies parameters, while the unconstrained joint model (19) identifies parameters in . Therefore, the constrained model (18) has fewer parameters than does the unconstrained joint model (19). The key constraints occur in the variances and covariances (20) and (21), where the association between R and A is constrained to be the same at both levels. An alternative form of the unconstrained model (19) replaces with in eq. [20] and with in eq. [21]. This would allow the association between R and A to differ at the two levels, inducing what is known in social science and public health applications as a contextual effects model [14]. The constraints (20) and (21) impose , that is, no contextual effects.
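The content of constraints (20) and (21) can be made concrete by regressing R on the group-mean-centered covariate and the group mean separately: the difference between the two slopes is the contextual effect that the constraints set to zero. The sketch below (illustrative values, not the paper's data) builds noise-free two-level data with distinct within- and between-group slopes, so the decomposition recovers both exactly.

```python
import numpy as np

rng = np.random.default_rng(5)

# Two-level covariate with both within- and between-group variation.
J, n = 40, 25
g = np.repeat(np.arange(J), n)
x = rng.normal(size=J * n) + rng.normal(size=J)[g]
xbar = np.array([x[g == j].mean() for j in range(J)])[g]

# Build R with within-group slope 1.0 and between-group slope 3.0
# (noise-free, so the fitted slopes match exactly).
b_within, b_between = 1.0, 3.0
R = b_within * (x - xbar) + b_between * xbar

# Regress R on the group-mean-centered covariate and the group mean.
X = np.column_stack([np.ones(J * n), x - xbar, xbar])
bw, bb = np.linalg.lstsq(X, R, rcond=None)[0][1:]

# Contextual effect = between-group minus within-group association.
assert np.allclose([bw, bb], [b_within, b_between])
assert np.isclose(bb - bw, 2.0)
```

Constraining `bw == bb`, as in eqs [20] and [21], is exactly the "no contextual effects" assumption.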

4.3 Random-coefficients model

As the number of random coefficients increases, the number of potentially extraneous parameters generated will increase non-linearly if no constraints are imposed. To show how severe the over-identification can become, consider a random-coefficients model that adds level-1 covariates having random coefficients to the model (18)

(22)

another special case of model (14) for , , , and where , , , and . The corresponding joint model (1) is

(23)

for , , and . Note that , and as in the model (19). The desired model (22) implies constraining the joint model (23) by and such that

(24)
(25)

To see how many constraints the desired model (22) has placed on the joint model (23), note that the constrained model (22) identifies parameters, while the unconstrained joint model (23) has components in . Therefore, the model (22) has fewer parameters than does the unconstrained joint model (23). Again, the key constraints occur in the variances and covariances (24) and (25), where not only is the association between R and A constrained to be the same at both levels, but the covariance components and that yield extraneous interaction effects between and C are also set to zero. That is, the desired model (22) has no contextual effects of A and no interaction effects between and C. Next, we extend the model (14) to an arbitrary number of Q levels.

4.4 Q-level model

We now focus on how to efficiently estimate the hierarchical GLM (14) for the arbitrary number of Q levels. Unlike the single-level case, however, the joint distribution (1) over-identifies the desired model (14). This over-identification poses a major computational challenge, as it represents the components of that are extraneous for subsequent analysis and that rapidly multiply as Q, p and increase. The consequence is that estimation of the over-identified hierarchical model (14) may produce substantially biased inferences in the case of or computational problems in the case of .

To show the over-identification explicitly, we reexpress all random effects of the joint model (1) in Table 1 according to the decomposition as listed in Table 2. Column q lists level-q unit-specific random effects that may now be partitioned as

Table 2

All random components of Y in eq. [1]

1 2 3 … Q
R
A

The random effects and their variances and covariances are useful for explaining the over-identification problem and the constraints. Notice that the level-q unit-specific generates covariance components in . For with three columns, for example, columns 1–3 generate , and components in , and at levels 1–3, respectively, for , and . Overall, the random effects of the joint model (1) produce covariance components between R and C, while the desired model (14) implies elements in . Consequently, the potential for severe over-identification exists if no constraints are imposed on eq. [1].

A key task, then, is to formulate a general approach to imposing constraints, one that applies to any value of Q and any number of covariates. The following theorem gives the conditional model that the joint model (1) identifies, so that constraining the joint model amounts to constraining this conditional model. Let and be the inverse of .

Theorem 4.1. Joint model (1) represents covariance components between R and C and is a one-to-one transformation of

(26)
for and .

For each covariate in , the conditional model (26) expresses the association between the covariate and R to be distinct at each level , while the desired model (14) represents a single effect of the covariate on R. Consequently, the joint model (1) produces parameters extraneous for subsequent analysis. The extraneous parameters, representing the contextual effects of C and the interaction effects between D and , rapidly multiply, as Q, p and grow. Let and , so that . The following corollary to Theorem 4.1 establishes one-to-one correspondence between a general contextual effects model and a constrained joint model (1) where each level-q covariate has a distinct effect at every level without involving interaction effects.

Corollary 4.2Joint model (1) under constraints

(27)
(27)
identifies a general contextual effects model given and D
(28)
for.

Eq. [28] is nested within Model (26) for . We define the contextual effects of covariates in at level , as [14]. Corollary 4.2 may involve expressing constraints of different forms. For example, if it is desirable for each covariate in A to have a single effect on R in the model (28), then the corollary would constrain for all q in addition to the constraints (27). For another example, if the contextual effects of A are desired at level 2 but no other levels in the model (28), then the additional constraints would be and for . Shin and Raudenbush [28] imposed and a single effect of each covariate on the joint model (1) to identify the desired model (14) for . The following corollary to Theorem 4.1 establishes the one-to-one correspondence between the model (14) and a constrained joint model (1).

Corollary 4.3. Joint model (1) under constraints

(29)
identifies hierarchical model (14).

Model (14) under the constraints (29) is equivalent to model (28) for that implies , and for all and all q.

Given Q-level incomplete data, eq. [1] under constraints (29) identifies hierarchical model (14), while the joint model under partial constraints such as constraints (27) may identify desired contextual effects of C or interaction effects between D and [14]. All these applications may be carried out via , or a hybrid method of imputation following estimation of the constrained joint model. The choice will depend on computational feasibility and the goal of the application. Given an analyst’s model (14), constrains the joint model (1) to just identify the analyst’s model, while is more generally applicable by enabling the data analyst to explore, in addition, contextual effects of C and interaction effects involving D. Consequently, is tailored to estimation of the analyst’s model, whereas estimates an over-identified joint model so that it enables the data analyst to explore multiple hierarchical models for correct specification of the analyst’s model. When is desired, but produces an unconstrained joint model (1) that is extremely high dimensional and thus difficult to estimate well, the hybrid method enables estimation of fewer parameters and thus reduces computational burden in estimation by imposing partial constraints such as eq. [27] on the joint model.

Now, we consider a more general version of model (14):

(30)

for known covariates having fixed effects and every other component defined identically to its counterpart in model (14), where level-q covariates have fixed effects . The corresponding joint model (1) has , and everything else is the same as previously defined.

The next section illustrates an application to three-level large-scale survey data subject to missingness at all levels. We illustrate , which is more generally applicable than and the hybrid method. Estimation and multiple imputation are carried out by C programs written by the first author. The imputation program uses RANDLIB 1.3, a random number generation library of C routines by Barry W. Brown, James Lovato, Kathy Russell and John Venier. Analysis of imputed data and complete-case analysis are carried out with HLM 7 [36]. The convergence criterion is that the difference in observed log-likelihoods between two consecutive iterations be less than . Statistical significance is assessed at level . The user-friendly two-level program implementing is expected to be released in the software package HLM 7 in 2014. The user-friendly three-level program is under development at the time of writing.

5 Illustrative examples

In this section, we aim to identify determinants of BMI during childhood that may span three levels (occasions nested within children, who attend schools) via analysis of the Early Childhood Longitudinal Study – Kindergarten Cohort of 1998 (ECLS-K [15]). Specifically, we consider ethnic and social disparities in the growth of BMI and ask how environmental exposures such as television watching and school quality are associated with growth in BMI.

The ECLS-K is a nationally representative sample of 21,260 kindergartners in the United States who attended 1,018 schools in 1998. The study followed the children in fall-kindergarten (K) of 1998, spring-K of 1999, fall-first grade (G1) of 1999, spring-G1 of 2000, spring-third grade (G3) of 2002, spring-fifth grade (G5) of 2004 and spring-eighth grade (G8) of 2007. Due to cost constraints, only a random subsample (41–54%) of students transferring schools was followed from K to G5. Furthermore, the fall-G1 data collection was limited to 27% of base-year students in a 30% subsample of the schools. Therefore, the ECLS-K contains many item and unit nonresponses. For example, only 5,044 first graders had their BMI measured in fall of 1999. Consequently, researchers have analyzed the ECLS-K without the third wave in longitudinal studies of obesity [11, 13, 37]. With the G8 data available since 2009, longitudinal analysis poses further challenges, as less than 7% of the children attended the same school from K to G8.

A longitudinal analysis of the ECLS-K should involve all seven waves of data to be efficient. Furthermore, missing data may be present at multiple levels. The approach in this article enables all waves and all available data to be analyzed for efficient and unbiased inferences. The “all available data” include children with item as well as unit nonresponse, because a child with time-varying characteristics missing but individual or school characteristics observed strengthens inferences at higher levels [29, 30]. Mobile students transferring schools are nested within their fall-K schools and analyzed.

Following previous studies of the ECLS-K [11, 12, 37, 38], we analyze raw BMI, the ratio of body weight in kilograms to height in meters squared. Table 3 summarizes the data for analysis of 21,210 children who attended 1,018 schools in 1998, after dropping 50 children with most characteristics missing, including gender and race. Also dropped are 6 eighth-grade BMIs, ranging from 98 to 207, that are influential on the fitted regression, and 13 anomalous heights and weights, such as a 20-pound weight and a height reduced by more than 10 inches.
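As a quick illustration of the BMI definition above, a minimal sketch (the function name is ours):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms over height in meters squared."""
    return weight_kg / height_m ** 2

# Example: a child weighing 30 kg at 1.35 m tall.
print(round(bmi(30.0, 1.35), 2))  # 16.46
```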

Table 3

Data for analysis

Level | Variable | Description | Mean (SD) | Missing (%)
Time | BMI | Body mass index | |
 | | Fall-K | 16.27 (2.20) | 2,219 (10)
 | | Spr.-K | 16.40 (2.30) | 1,450 (7)
 | | Fall-G1 | 16.62 (2.60) | 16,170 (76)
 | | Spr.-G1 | 16.90 (2.86) | 5,805 (27)
 | | Spr.-G3 | 18.66 (3.88) | 7,476 (35)
 | | Spr.-G5 | 20.57 (4.75) | 10,241 (48)
 | | Spr.-G8 | 22.80 (5.29) | 12,450 (59)
 | TV | Daily television viewing in hours | |
 | | Fall-K | 2.01 (1.08) | 2,772 (13)
 | | Spr.-K | 2.01 (1.08) | 2,772 (13)
 | | Fall-G1 | 2.44 (1.65) | 16,374 (77)
 | | Spr.-G1 | 1.88 (1.25) | 6,018 (28)
 | | Spr.-G3 | 1.91 (1.22) | 8,123 (38)
 | | Spr.-G5 | 2.05 (1.22) | 10,468 (49)
 | | Spr.-G8 | 3.09 (1.87) | 12,227 (58)
Child | HOMESAFETY | Safety around home | 1.66 (0.55) | 2,316 (11)
 | AGE | Age in months | 68.41 (4.35) | 2,132 (10)
 | BIRTHWEIGHT | Birth weight in lb | 6.91 (1.35) | 1,472 (7)
 | SES | Socioeconomic status | 0.00 (0.80) | 1,088 (5)
 | FEMALE | 1 if female | 0.49 (0.50) | 0 (0)
 | BLACK | 1 if African-American | 0.15 (0.36) | 0 (0)
 | HISPANIC | 1 if Hispanic | 0.18 (0.38) | 0 (0)
 | ASIAN | 1 if Asian | 0.06 (0.25) | 0 (0)
 | PACIFIC | 1 if Pacific Islander | 0.01 (0.10) | 0 (0)
 | ALASKAN | 1 if American Indian/Alaskan | 0.02 (0.13) | 0 (0)
 | OTHER | 1 if multiracial or others | 0.03 (0.16) | 0 (0)
School | GRAFFITI | Graffiti around school | 0.55 (0.72) | 262 (26)
 | PRIVATE | 1 if private school | 0.26 (0.44) | 0 (0)

After dropping the influential and anomalous observations, the standard deviation of G8 BMIs falls from 6.29 to 5.29. With seven occasions nested within most children, there are a total of 148,451 occasions at level 1 nested within 21,210 children at level 2 attending 1,018 schools at level 3. BMI and the daily number of hours spent watching television (TV) are time varying [11, 13, 37]. BMI ranges from 7.1 to 57.5. To construct TV, reported daily television viewing hours exceeding 7 per weekday and 10 per weekend day were capped at 7 and 10 hours, respectively. TV was measured for the first time in spring-K and then once in every other data collection. Because each child's viewing is unlikely to differ appreciably between fall-K and spring-K, the analysis uses the TV measured in spring-K for the first two data collections rather than treating all fall-K values as missing. In addition, the analysis considers six dummy time indicators from spring-K to G8 to control for the natural growth in BMI at level 1; base-year home neighborhood safety (HOMESAFETY), base-year age in months (AGE), birth weight in pounds (BIRTHWEIGHT), base-year socioeconomic status (SES), a female indicator (FEMALE) and six race/ethnicity indicators at the child level or level 2; and base-year school neighborhood safety (GRAFFITI, the amount of graffiti around the school) and a private school indicator (PRIVATE) at the school level or level 3.
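The truncation of reported viewing hours described above can be sketched as follows (the function and constant names are ours; the 7- and 10-hour caps are from the text):

```python
def cap_tv_hours(hours, cap):
    """Truncate an implausibly large reported viewing time at a ceiling;
    missing values stay missing."""
    return None if hours is None else min(hours, cap)

WEEKDAY_CAP, WEEKEND_CAP = 7.0, 10.0  # caps used in constructing TV

print(cap_tv_hours(12.0, WEEKDAY_CAP))  # 7.0
print(cap_tv_hours(3.5, WEEKEND_CAP))   # 3.5
print(cap_tv_hours(None, WEEKDAY_CAP))  # None
```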

An unsafe neighborhood is associated with elevated BMI among adults [28]. HOMESAFETY has three levels: not safe or low (0); somewhat safe or medium (1); and very safe or high (2), while GRAFFITI, for which lower values indicate greater safety, has four levels: none (0); a little (1); some (2); and a lot (3). Preliminary analysis suggests that any association between the safety factors and BMI beyond linear is unlikely. BMI and TV are missing for 38% and 44% of their values overall and for 76% and 77% in fall-G1, respectively. HOMESAFETY, AGE, BIRTHWEIGHT and SES are missing for 5–11% of children, while GRAFFITI is missing for 26% of the schools. Complete-case analysis entails removing the 262 schools with missing GRAFFITI, dropping the 5,381 students attending those schools and their data from the analysis. The resulting inference is inefficient and may be considerably biased, as illustrated below. The 2,166 missing birth weights in fall-K were recovered from later data collections. The 21,210 students are 49% female and 55% white. Of the 1,018 schools, 26% are private. The mean BMI grows with acceleration until G5.

5.1 Random-intercept model

The analysis aims to efficiently identify environmental factors in childhood BMI, such as television watching and school quality, after controlling for natural growth as well as ethnic and social disparities in BMI. At level 1, over time within children, we model change in BMI as a function of incompletely observed TV and time. At level 2, between children within schools, we include incompletely observed measures of safety around the home, age, birth weight and socioeconomic status, and completely observed female and race/ethnicity indicators. At level 3, between schools, we include an incompletely observed measure of school safety (GRAFFITI) and a completely observed private school indicator. In terms of our general model (30), we therefore have , , , , , , , , , and where through are dummy indicators for spring-K through spring-G8. Rather than subjecting the mean growth in BMI to a polynomial curve [11, 37], controls for it as the difference in mean BMIs between each time point and fall-K.

Table 4 displays the output. Age and birth weight have been centered around their respective sample means. The complete-case analysis of the desired model (30), under column heading “CC”, involves 12,446 children attending 706 schools who have both BMI and TV observed on one or more occasions. These children contribute a total of 55,082 occasions. The next column, under “”, presents the that uses five imputations based on the corresponding unconstrained joint model (1) for , and of respective dimensions 2-by-2, 6-by-6 and 7-by-7. The number of covariance components between R and C is , while the desired hierarchical model has effects of C on R. Consequently, six covariance components of the unconstrained joint model are extraneous to subsequent analysis. An asterisk “*” marks statistical significance at a significance level . The CC standard errors are 8–83% higher than those under the . Under CC, females average 0.09 BMI units higher than males, and Pacific Islander and multiracial or other-race students average 1.16 and 0.27 BMI units higher, respectively, than their white counterparts, while these effects are insignificant under the , controlling for other covariates. The effect estimates for females and Pacific Islanders under CC are 25 and 5 times higher than their counterparts under the . The effect estimates for TV, SES, BLACK, ALASKAN and OTHER under CC are also 10–67% higher than their counterparts. The CC variances are comparatively underestimated.

Table 4

Random-intercept model (30) by complete-case analysis (CC), by the and by the based on the unconstrained joint model (1), and random-coefficients model (30) by the . Statistical significance marked by “*” at a significance level

Estimate (Std. Err.)
Covariate | CC | | unconstrained | random slope
Intercept | 15.88 (0.11)* | 15.94 (0.08)* | 11.59 (0.37)* | 15.93 (0.09)*
TV | 0.10 (0.01)* | 0.08 (0.01)* | 0.18 (0.01)* | 0.09 (0.01)*
T2 | 0.13 (0.02)* | 0.13 (0.02)* | 0.13 (0.02)* | 0.13 (0.02)*
T3 | 0.34 (0.04)* | 0.32 (0.03)* | 0.28 (0.03)* | 0.33 (0.03)*
T4 | 0.63 (0.03)* | 0.63 (0.02)* | 0.64 (0.02)* | 0.63 (0.02)*
T5 | 2.36 (0.03)* | 2.36 (0.02)* | 2.37 (0.02)* | 2.36 (0.03)*
T6 | 4.21 (0.03)* | 4.24 (0.02)* | 4.23 (0.02)* | 4.23 (0.02)*
T7 | 6.38 (0.03)* | 6.40 (0.02)* | 6.29 (0.03)* | 6.39 (0.03)*
HOMESAFETY | –0.01 (0.05) | 0.00 (0.04) | 0.01 (0.04) | –0.01 (0.04)
AGE | 0.03 (0.01)* | 0.03 (0.00)* | 0.03 (0.00)* | 0.03 (0.00)*
BIRTHWEIGHT | 0.31 (0.02)* | 0.32 (0.01)* | 0.32 (0.02)* | 0.32 (0.01)*
SES | –0.30 (0.04)* | –0.27 (0.03)* | –0.28 (0.03)* | –0.27 (0.03)*
FEMALE | 0.09 (0.05)* | 0.00 (0.04) | 0.02 (0.04) | 0.01 (0.04)
BLACK | 0.52 (0.09)* | 0.47 (0.06)* | 0.35 (0.06)* | 0.48 (0.07)*
HISPANIC | 0.61 (0.08)* | 0.62 (0.06)* | 0.55 (0.06)* | 0.62 (0.06)*
ASIAN | 0.06 (0.13) | –0.07 (0.09) | –0.09 (0.09) | –0.10 (0.09)
PACIFIC | 1.16 (0.37)* | 0.25 (0.20) | 0.16 (0.20) | 0.31 (0.20)
ALASKAN | 0.66 (0.20)* | 0.53 (0.17)* | 0.47 (0.16)* | 0.53 (0.17)*
OTHER | 0.27 (0.16)* | 0.16 (0.12) | 0.15 (0.13) | 0.16 (0.13)
GRAFFITI | 0.04 (0.05) | 0.00 (0.04) | 0.00 (0.04) | 0.00 (0.04)
PRIVATE | –0.03 (0.07) | –0.07 (0.06) | –0.04 (0.06) | –0.07 (0.06)
 | 0.15 (0.03) | 0.17 (0.02) | 0.20 (0.03) |
 | 6.95 (0.10) | 7.01 (0.08) | 6.98 (0.08) | 6.98 (0.07)
 | 3.08 (0.02) | 3.07 (0.02) | 3.05 (0.02) | 3.07 (0.02)

The estimated natural growth of BMI is similar under CC and . From the , a white student of mean age, birth weight and socioeconomic status who does not watch television has an average BMI of 15.94 in fall-K. The 6-month BMI growth is 0.13 units in spring-K and then accelerates to 0.19 units (0.32–0.13) in fall-G1, to 0.31 units (0.63–0.32) in spring-G1, to 0.43 units [(2.36–0.63)/4] until spring-G3 and to 0.47 units [(4.24–2.36)/4] until spring-G5, and then decelerates to 0.36 units [(6.40–4.24)/6] until spring-G8. A polynomial curve may not reveal these details of the growth. Controlling for the natural growth and for demographic individual and organizational school characteristics, a 1-hour increment in daily television viewing elevates child BMI by 0.08 units on average, 64% of the 6-month BMI growth in kindergarten. Each month of base-year age and each additional pound of birth weight contribute 0.03 and 0.32 unit increases in BMI, respectively, while a one-unit increase in SES lowers child BMI by 0.27 units on average, ceteris paribus. Black, Hispanic and American Indian or Alaskan students average 0.47, 0.62 and 0.53 BMI units higher, respectively, than white counterparts, controlling for other covariates.
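The piecewise 6-month growth rates quoted above can be recovered from the fitted time-dummy coefficients in Table 4; a sketch (the wave labels are ours):

```python
# Fitted time-dummy coefficients: mean BMI gain since fall-K at each wave.
coef = {"sprK": 0.13, "fallG1": 0.32, "sprG1": 0.63,
        "sprG3": 2.36, "sprG5": 4.24, "sprG8": 6.40}

# Half-years elapsed between consecutive waves.
gaps = [("sprK", "fallG1", 1), ("fallG1", "sprG1", 1),
        ("sprG1", "sprG3", 4), ("sprG3", "sprG5", 4), ("sprG5", "sprG8", 6)]

for a, b, halfyears in gaps:
    rate = (coef[b] - coef[a]) / halfyears
    print(f"{a} -> {b}: {rate:.2f} BMI units per 6 months")
# Prints 0.19, 0.31, 0.43, 0.47 and 0.36, matching the text.
```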

The next column, under “ unconstrained”, illustrates how biased the resulting inferences may be relative to those of the desired model (30) under the , if we directly transform the corresponding unconstrained joint model (1) to the desired model. The transformed parameters are

(31)
and . We compare the estimates to their counterparts under the . Most strikingly, the key environmental effect of television watching (TV) under investigation increases by 126% to 0.18. The gaps in mean BMIs of black, Hispanic and American Indian or Alaskan students relative to white students are noticeably underestimated, and so are the intercept and the effects of T3 and T7. The level-1 and -2 error variances are understated, while the level-3 error variance is overstated. This example is comparatively benign, with only one covariate, TV, at level 1 and four covariates at level 2 subject to missingness. With more covariates subject to missingness at nested levels, this method has the potential to produce severely biased inferences. To correctly apply the for the desired hierarchical model (30), the transformation should follow estimation of the joint model under constraints (29): and according to Corollary 4.3. To see how the constraints work, replace the right-hand side in eq. [31] with the constraints.

5.2 Random-coefficients model with missing data

The analysis above reveals that black, Hispanic and American Indian or Alaskan students have elevated BMIs relative to white counterparts on average, controlling for other covariates in the model. The minority students may attend lower-quality schools than those their white counterparts attend, which, by hypothesis, contributes to the disparity in BMI. School quality may be indicated by school characteristics such as school safety, school-mean socioeconomic status, content of school meals, physical education time and school sector [12, 13, 37]. If this hypothesis is true, then the minority-student indicators may have randomly varying effects on BMI across schools of different quality. Among the minority students, Hispanic students stand out in BMI. Overall, Hispanic students are half as likely as white students to attend private schools. On average, they also attend schools with about three times as much graffiti around them as the schools white students attend.

The random-intercept model above is extended to a random-coefficients model where the Hispanic indicator has a random effect on BMI across schools. The desired model (30) has , and all other components identical to those of the random-intercept model. The corresponding unconstrained joint model (1) has an 8-by-8 covariance matrix and all other parameters identical to those of the joint model corresponding to the random-intercept model. The complete-case analysis produced estimates virtually identical to those under CC in Table 4. This is not surprising given that the slope of HISPANIC does not vary significantly across schools (variance estimate = 0.23, standard error = 0.17, p-value = 0.24). The based on imputations yields the estimates under “ random slope” in the table. Except for the level-3 covariance matrix, the estimates are close to their counterparts under the . The slope for the Hispanic indicator appears to vary at most modestly across schools (slope , standard error ). To test the null hypothesis that , which implies , let and be the parameters of the random-coefficients (full) and random-intercept (reduced) models, and and be the ML estimates, respectively, given the tth imputation for . The log-likelihoods and evaluated at the ML and , respectively, given the tth imputation yield . The test statistic recommended by Li et al. [39] is where and for [6]. The p-value is where is a random variable from the F distribution with 2 numerator and denominator degrees of freedom for . This test yields an approximate range of p-values between twice and one half the computed value [6, 40]. The computed p-value of 0.59 is precise enough to favor the random-intercept (null) model. Therefore, we do not find evidence that the slope for the Hispanic indicator varies randomly across schools.
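The pooling of per-imputation likelihood-ratio statistics can be sketched as follows. This is our paraphrase of one commonly stated form of the Li et al. [39] rule for combining m chi-square statistics on k degrees of freedom (the function name and example statistics are ours; consult [6, 39] for the exact formulas used in the paper):

```python
import math
from statistics import mean, variance

def pool_chi2(d, k):
    """Combine m chi-square statistics d (each on k df), one per imputation,
    into an approximate F statistic and its denominator df."""
    m = len(d)
    dbar = mean(d)
    # Between-imputation variability, measured on the sqrt scale.
    r = (1 + 1 / m) * variance([math.sqrt(x) for x in d])
    D = (dbar / k - (m + 1) / (m - 1) * r) / (1 + r)
    nu = k ** (-3 / m) * (m - 1) * (1 + 1 / r) ** 2  # denominator df
    return D, nu  # compare D to the F(k, nu) reference distribution

# Hypothetical per-imputation LR statistics, m = 5, k = 2.
D, nu = pool_chi2([3.1, 2.5, 4.0, 3.3, 2.8], k=2)
print(round(D, 2))  # 1.48
```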

The unconstrained joint model identifies 18 covariance components between R and C, while the desired random-coefficients model has six effects of C on R. Consequently, 12 covariance components between R and C are extraneous for subsequent analysis. Constraints to identify the desired model via are , , and by Corollary 4.3.

6 Discussion

This article presented methods for efficient and unbiased analysis of a Q-level hierarchical GLM given incomplete data with a general missing pattern at any of the Q levels. Our general approach uniformly expresses the Q-level model for that greatly facilitates extension of existing single-level and two-level efficient missing data methods to general Q-level data; reexpresses the desired model as a joint distribution of the variables, including the outcome, that are subject to missingness conditional on all of the covariates that are completely observed; and efficiently estimates the joint distribution. This approach confronts two major challenges. As the number of levels, Q, the number, p, of variables subject to missingness and the number, , of random coefficients increase in the hierarchical model, the joint distribution may become extremely high dimensional and difficult to estimate well. Moreover, the joint model, in general, over-identifies the desired hierarchical model. The problem of over-identification can grow severe as levels, covariates, and random coefficients are added to the hierarchical model. The consequence is that the over-identified hierarchical model may produce considerably biased inferences, as illustrated in this article. To overcome the computational challenges, we derived, within each iteration of the EM algorithm, recursive Q-step computation formulas for efficient estimation of the joint distribution, where computation at each step involves single-level data only, given higher-level computation components. The consequence is efficient computation that is not excessively burdened with regard to Q, p and . Furthermore, we showed how to impose constraints on the joint distribution within the framework of the Q-level hierarchical model in a way that is uniform without regard to Q and in a way that produces unbiased and efficient analysis of the hierarchical model.

This article considered three methods for efficient handling of missing data: , and the hybrid method. Given a Q-level hierarchical model with incomplete data, constrains the joint model to be a one-to-one transformation of the desired hierarchical model for unbiased analysis, efficiently estimates the constrained joint model by ML, and transforms it to the hierarchical model. Consequently, is tailored to estimation of the desired hierarchical model. On the other hand, generates multiple imputation given the unconstrained joint model estimated by ML, allowing the user to impose the desired constraints when using conventional software to analyze the imputed data. Therefore, is more generally applicable by enabling the data analyst to explore, in addition, contextual effects and interaction effects involving covariates having random coefficients. Theorem 4.1 provides the scope of hierarchical models that can be explored under the joint model. When is desired, but produces an unconstrained joint model that is extremely high dimensional and thus difficult to estimate well, the hybrid method of imputation following estimation of partially constrained joint model enables estimation of fewer parameters and thus reduces computational burden in estimation. The corollaries to Theorem 4.1 provide the scope of hierarchical models that may be explored by the data analyst under the partially constrained joint model.

We have illustrated the by a longitudinal analysis of the ECLS-K for and compared it to complete-case analysis, which was shown to be relatively inefficient and subject to biased inferences. We have also compared the to the based on the unconstrained joint model, which revealed the potential to produce substantially biased inferences. The proliferation of extraneous parameters was comparatively moderate with , and . We imposed six constraints to generate the results via the in Table 4. For a model having larger Q, p or , it may be desirable to use the hybrid method, which reduces the computational burden of and, at the same time, broadens the scope of by allowing an analyst to explore multiple hierarchical models for correct specification of the desired model. We may also take advantage of extra variables, not of direct interest in the desired hierarchical model but highly correlated with the variables subject to missingness, to impute missing data at multiple levels more precisely [28].

In the ECLS-K, a great majority of students transfer schools. Thus, it may be more appropriate to consider a cross-classified model for the analysis, relaxing the strict hierarchy of nesting a student within a single school. Estimation of a cross-classified model is challenging, because the growth trajectories of students who attend the same schools become dependent, producing a complicated network of dependence among children and schools [9]. All schools for each child and all children for each school may have to be analyzed at once to fully account for the dependence. With many covariates subject to missingness at multiple levels, the joint model of variables subject to missingness may be too high dimensional to estimate well. Therefore, a valuable future research topic is development of a method for efficient estimation of a cross-classified model given incomplete data.

One restriction of the general Q-level hierarchical model is that the covariates having random effects should be completely observed. If the covariates are subject to missingness, they should appear on the left-hand side of the corresponding joint model for efficient handling of the missing data as well as on the right-hand side of the model for estimation of the random coefficients. Such a joint model is not multivariate normal, and the factorization under joint normality that leads to the desired conditional hierarchical linear model does not apply. Consequently, the ML approach is challenging. Relaxing this assumption is beyond the scope of this article.

It took 21 seconds to complete each iteration in estimating each joint model in Table 4 on a 2.8 GHz laptop computer with 8 GB of memory. The estimated random-intercept and random-coefficients models took more than 5 hours to converge, at the 867th and 894th iterations, respectively. No attempt was made to accelerate convergence. The convergence criterion is that the difference in observed log-likelihoods between two consecutive iterations be less than . In some of our two-level test runs, we compared computation times for estimation of a joint model between our program, with a convergence criterion of the difference in log-likelihoods between two consecutive iterations less than [28], and an alternative program, with a convergence criterion of the percentage difference in log-likelihoods between two consecutive iterations less than and the Aitken acceleration [41]. The alternative program converged to practically identical estimates and standard errors, but up to 90% faster than our program in terms of the number of iterations. Considerable savings in computation time are anticipated with similar acceleration in the three-level applications. It took us about 2 minutes to generate a single imputation for the results in Table 4. So far, we have developed a three-level program implementing the only and thus cannot compare the computation times between and .
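The Aitken acceleration mentioned above extrapolates the limiting log-likelihood from three consecutive EM iterates; a minimal sketch, illustrated on a synthetic geometrically converging sequence rather than on the models in this article:

```python
def aitken_limit(l0, l1, l2):
    """Aitken extrapolation of the limiting log-likelihood from three
    consecutive EM log-likelihood values."""
    c = (l2 - l1) / (l1 - l0)        # estimated linear convergence rate
    return l1 + (l2 - l1) / (1 - c)  # extrapolated limit

# Synthetic EM-like log-likelihoods converging geometrically to 100.
ll = [100 - 5 * 0.5 ** k for k in range(3)]  # [95.0, 97.5, 98.75]
print(aitken_limit(*ll))  # 100.0
```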

In Section 5.2, we found no evidence that the slope for the Hispanic student indicator varies randomly across schools, based on the test statistic recommended by Li et al. [39]. This test provides an approximate range of p-values between twice and one half the computed p-value. More accurate p-values may be obtained at the expense of extra computational effort [2, 6, 39, 42]. Because the corollaries to Theorem 4.1 establish a one-to-one correspondence between hierarchical model (30) and joint model (1), the enables an alternative likelihood ratio test between the two analyst’s models directly based on their constrained joint models.

Our illustrative examples in Section 5 are based on a large sample. The performance of our estimators in terms of bias and efficiency involving a small sample is yet to be assessed. Therefore, simulation studies on the small-sample performance of our methods will be a useful future research area.

The analysis in this paper involved discrete covariates, the safety factors, subject to missingness at levels 2 and 3. Although it is improper for the normal linear joint model to describe the marginal distribution of the discrete factors, the implied conditional distribution is the desired hierarchical model. An advantage is that this allows the discrete covariates subject to missingness to be analyzed by the efficient missing data method [6, 28]. In addition, the impact of the joint distributional assumptions on the desired conditional model under the is comparatively weak, because the distributional assumptions do not affect the observed data. A valuable future extension of this approach is to a hierarchical generalized linear model given incomplete data.

Appendix

Proof of Proposition 3.1. The model (8) implies

(32)
(33)

for a symmetric matrix (33) showing the upper triangular components only where , , and in eq. [12], and

depend on level-q and only, given , and . ■

Proof of Theorem 3.2. The initial step is to compute and in eq. [11]. Suppose that and may be computed involving level-q data only given , and . Then, and may be computed based on level- data only given and by Proposition 3.1. ■

Proof of Proposition 3.3. Eq. [8] implies

(34)
(35)
(36)

where , , and in eq. [12], in eq. [33],

depend on level-q, and only, given , , , , and . ■

Proof of Theorem 3.4. The initial step is to compute , , , and for . Suppose that and can be computed from level-q data only, given , , , , and . Then, and can be computed from level- data only, given , , , , and by Proposition 3.3. ■

Proof of Theorem 4.1. Table 2 reveals covariance components in between R and C and implies and for all q to identify model (26). Conversely, eq. [26] implies , and which, along with the marginal C, is one-to-one with the joint model (1). ■

Fisher information

We express to compute where places for the ith level-q unit within cluster m and zeroes elsewhere for and for . Let be a vector of distinct elements in , so that and . The Fisher information is

(37)

for

Likelihood

The observed log-likelihood measures convergence to ML for . Recursive components are

(38)
(39)

Acknowledgments

This research was supported by the Institute of Education Sciences, U.S. Department of Education, through Grants R305D090022 to NORC and R305D130033 to VCU. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education. The first author was also supported by Grant 1U01HL101064 from the National Institutes of Health.

References

1. Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 1977;39:1–38.

2. Little R, Rubin D. Statistical analysis with missing data. New York: Wiley, 2002.

3. Meng X. Multiple-imputation inferences with uncongenial sources of input. Stat Sci 1994;9:538–58.

4. Rubin D. Inference and missing data. Biometrika 1976;63:581–92.

5. Rubin D. Multiple imputation for nonresponse in surveys. New York: J. Wiley & Sons, 1987.

6. Schafer J. Analysis of incomplete multivariate data. London: Chapman & Hall, 1997.

7. Schafer J. Multiple imputation in multivariate problems when imputation and analysis models differ. Stat Neerl 2003;57:19–35.

8. Goldstein H. Multilevel statistical models. London: Edward Arnold, 2003.

9. Raudenbush SW, Bryk AS. Hierarchical linear models. Newbury Park, CA: Sage, 2002.

10. Bingenheimer J, Raudenbush S. Statistical and substantive inferences in public health: issues in the application of multilevel models. Annu Rev Public Health 2004;25:53–77.

11. Danner F. A national longitudinal study of the association between hours of TV viewing and the trajectory of BMI growth among US children. J Pediatr Psychol 2008;33:1100–07.

12. Datar A, Sturm R. Physical education in elementary school and body mass index: evidence from the early childhood longitudinal study. Am J Public Health 2004;94:1501–06.

13. Gable S, Chang Y, Krull J. Television watching and frequency of family meals are predictive of overweight onset and persistence in a national sample of school-aged children. J Am Diet Assoc 2007;107:53–61.

14. Shin Y, Raudenbush SW. A latent cluster mean approach to the contextual effects model with missing data. JEBS 2010;35:26–53.

15. Tourangeau K, Nord C, Lê T, Sorongon A, Najarian M. Early childhood longitudinal study, kindergarten class of 1998–99 (ECLS-K), combined user's manual for the ECLS-K eighth-grade and K-8 full sample data files and electronic codebooks. Washington, DC: NCES, IES, DOE, 2009 (NCES 2009–004 edition).

16. Allison PD. Estimation of linear models with incomplete data. Sociol Methodol 1987;17:71–103.

17. Arbuckle JL. Full information estimation in the presence of incomplete data. In: Marcoulides GA, Schumacker RE, editors. Advanced structural equation modeling: issues and techniques. Mahwah, NJ: Lawrence Erlbaum, 1996.

18. Enders C, Peugh J. Using an EM covariance matrix to estimate structural equation models with missing data: choosing an adjusted sample size to improve the accuracy of inferences. Struct Equation Model 2004;11:1–19.

19. Muthén B. Latent variable modeling of growth with missing data and multilevel data. In: Cuadras CM, Rao CR, editors. Multivariate analysis: future directions 2. Amsterdam: North Holland, 1993:199–210.

20. Muthén B, Kaplan D, Hollis M. On structural equation modeling with data that are not missing completely at random. Psychometrika 1987;52:431–62.

21. Muthén L, Muthén B. Mplus user's guide, 6th ed. Los Angeles, CA: Muthén and Muthén, 2010.

22. Arbuckle JL. Amos 5.0 update to the Amos 5.0 user's guide [Computer software and manual]. Chicago: Smallwaters, 2003.

23. Bentler PM. EQS 6 structural equations program manual. Encino, CA: Multivariate Software, 2007.

24. Liu M, Taylor J, Belin T. Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics 2000;56:1157–63.

25. Schafer J, Yucel R. Computational strategies for multivariate linear mixed-effects models with missing values. JCGS 2002;11:437–57.

26. Goldstein H, Browne W. Multilevel factor analysis modelling using Markov chain Monte Carlo estimation. In: Marcoulides G, Moustaki I, editors. Latent variable and latent structure models. London: Lawrence Erlbaum, 2002.

27. Goldstein H, Browne W. Multilevel factor analysis models for continuous and discrete data. In: Maydeu-Olivares A, McArdle JJ, editors. Contemporary psychometrics: a Festschrift to Roderick P. McDonald. Mahwah, NJ: Lawrence Erlbaum, 2005.

28. Shin Y, Raudenbush SW. Just-identified versus over-identified two-level hierarchical linear models with missing data. Biometrics 2007;63:1262–68.

29. Shin Y, Raudenbush SW. The causal effect of class size on academic achievement: multivariate instrumental variable estimators with data missing at random. JEBS 2011;36:154–85.

30. Shin Y. Do black children benefit more from small classes? Multivariate instrumental variable estimators with ignorable missing data. JEBS 2012;37:543–74.

31. Yucel R. Multiple imputation inference for multivariate multilevel continuous data with ignorable non-response. Philos Trans R Soc A 2008;366:2389–403.

32. Schafer J. NORM: multiple imputation of incomplete multivariate data under a normal model [Computer software]. University Park, PA: Pennsylvania State University, Department of Statistics, 1999.

33. Goldstein H, Carpenter J, Kenward M, Levin K. Multilevel models with multivariate mixed response types. Stat Model 2009;9:173–97.

34. Goldstein H, Kounali D. Multilevel multivariate modelling of childhood growth, numbers of growth measurements and adult characteristics. J R Stat Soc Ser A 2009;172:599–613.

35. Magnus JR, Neudecker H. Matrix differential calculus with applications in statistics and econometrics. New York: Wiley, 1988.

36. Raudenbush SW, Bryk AS, Cheong Y, Congdon R, du Toit M. HLM 7: hierarchical linear and nonlinear modeling. Lincolnwood, IL: Scientific Software International, 2011.

37. Bhargava A, Jolliffe D, Howard L. Socio-economic, behavioural and environmental factors predicted body weights and household food insecurity scores in the early childhood longitudinal study-kindergarten. Br J Nutr 2008;100:438–44.

38. Sturm R, Datar A. Body mass index in elementary school children, metropolitan area food prices and food outlet density. Public Health 2005;119:1059–68.

39. Li KH, Meng X, Raghunathan TE, Rubin DB. Significance levels from repeated p-values with multiply-imputed data. Stat Sin 1991;1:65–92.

40. Li KH, Raghunathan TE, Rubin DB. Large-sample significance levels from multiply imputed data using moment-based statistics and an F reference distribution. JASA 1991;86:1065–73.

41. Aitken A. On Bernoulli's numerical solution of algebraic equations. Proc R Soc Edinb 1926;46:289–305.

42. Meng X, Rubin D. Performing likelihood ratio tests with multiply-imputed data sets. Biometrika 1992;79:103–11.

Published Online: 2013-09-28

©2013 by Walter de Gruyter Berlin / Boston