Selecting estimators is an essential step in modeling, and the Akaike information criterion (AIC) has been widely used for this purpose. AIC allows selecting maximum likelihood estimators (MLEs) based on parametric models that are not too badly specified. More general criteria have been developed, in particular the Takeuchi information criterion (TIC) and the generalized information criterion (GIC). A related criterion in the field of neural networks is the network information criterion (NIC). Two other well-known criteria are the Bayesian information criterion (BIC) and the deviance information criterion (DIC); both use Bayesian arguments and are not directly related to the present paper. A good reference book on information criteria is that of Konishi and Kitagawa.
Likelihood cross-validation (LCV) has also been widely used for comparing parametric models. Stone heuristically established that LCV is asymptotically equivalent to AIC. LCV, however, is more flexible in that it can be applied to estimators other than MLEs, for instance to penalized likelihood estimators: see Golub et al. and Wahba.
Cross-validation can also be applied to assessment risks other than the Kullback–Leibler risk. Leave-one-out cross-validation is the most natural variant and one of the most efficient [9, 10], but it is also the most computationally demanding, so approximation formulas have been derived, for instance for penalized splines [11, 12] and penalized likelihood [13, 14]. Commenges et al. derived an approximate cross-validation criterion in the context of prognosis.
In the present paper we consider the following general framework: estimators of the true density function are defined as minimizing an estimating function; the estimating function itself can be viewed as an estimator of a risk that we call an "estimating risk." Typically there is a model, that is, a parametric family of densities for the variable Y, and the estimator is chosen as minimizing the estimating risk. The estimators of the true density are then assessed using an "assessment risk," which allows choosing between the available estimators. The most conventional case is when the estimating risk is the cross-entropy, estimated by minus the normalized log-likelihood, and the assessment risk of the obtained estimator is the expected cross-entropy, which can be estimated by cross-validation or, in the parametric case, by the normalized AIC. These information risks are very appealing, but there are cases where other risks are relevant. As an example, the MLE could be assessed by the continuous ranked probability score (CRPS): this is detailed in Section 4.4. Another example is the estimation of the distribution of ordinal data through an approximation using models for continuous data. Models for ordinal variables that can take a large number of values are rather cumbersome; it is convenient to treat these data as continuous, using an estimating risk adapted to continuous data. However, if we wish to compare the obtained estimator to that obtained by a model for ordinal data, the assessment risk must still take into account that the data are really ordinal. Such an assessment risk can be estimated by cross-validation; cross-validation has good properties but is computationally very demanding. The main aim of this paper is to find an approximation of leave-one-out cross-validation that is valid for any estimating and assessment risks satisfying regularity conditions that will be detailed. This will be applied to the ordinal data example.
Section 2 presents the framework, the cross-validation criterion and its approximation. The criterion is universal in the sense that it can be applied to any estimating and assessment risks satisfying regularity conditions; we denote it by UACVR (U for universal, A for approximate, CV for cross-validation and R for regularity). Section 3 gives the asymptotic distributions of UACVR and of a difference of two UACVR values. Section 4 shows how UACVR specializes to particular cases: TIC appears as a special case when cross-entropy is used for defining both estimating and assessment risks, and AIC follows if the models are close to being well specified; other important cases, where the estimating and assessment risks are defined less symmetrically, are also given. Section 5 presents a simulation study. Section 6 presents an illustration of the use of UACVR for comparing estimators derived from threshold models with estimators obtained by continuous approximations in the case of ordered categorical data with repeated measurements; these data are psychometric scores from a large study on cognitive aging. Section 7 concludes.
2 The universal cross-validation criterion and its approximation
2.1 The estimating risk and its estimation by an estimating function
Suppose that a sample of n independent, identically distributed (i.i.d.) variables is available. Based on this sample, an estimator of the probability density function of the true distribution can be chosen in a model, that is, a parametric family of distributions. The main rules for designing estimators can be thought of as minimizing an estimating risk. The estimating risk is defined as the expectation, under the true distribution, of a loss function of the observation and the candidate density; we would like to choose the density minimizing this risk. For making consistent estimation possible, it is natural to require that whenever the model is well specified, the risk is minimized by the true distribution. Precisely, saying that the model is well specified amounts to saying that there is a parameter value for which the model density equals the true density; we then require that the risk be minimized at this value and, moreover, that this minimizer be unique. This is related to the concept of strictly proper scoring rules. In the scoring rule literature, the problem is formulated in terms of reward rather than loss; a correspondence between the two theories can be established by considering that minus a loss is a reward, and of course, while one tries to minimize the expected loss, one tries to maximize the expected reward.
We cannot compute the estimating risk, but a natural estimator of it is the estimating function, the empirical mean of the loss over the sample. The estimator defined as minimizing the estimating function is called an M-estimator. By the law of large numbers, the estimating function converges in probability toward the estimating risk. Under conditions given in Van der Vaart (see, e.g., Theorem 5.7), the M-estimator converges in probability toward the minimizer of the estimating risk. A simple set of sufficient conditions is that the parameter space is compact, the estimating risk is continuous and has a unique minimizer, and the loss is continuous in the parameter for every y.
Example 1: If we take the quadratic loss, the estimating risk is the expected squared error; the estimating function is the mean of the squared residuals, and the M-estimator is the least-squares estimator.
Example 2: If we take minus the log-density as loss function, the estimating risk is the cross-entropy of the model density with respect to the true density; the estimating function is minus the normalized log-likelihood, and the M-estimator is the MLE.
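To fix ideas, Example 2 can be sketched numerically. The following Python fragment (an illustration of ours, with model, data and function names that are not from the paper) minimizes the estimating function of a normal model and recovers the MLE:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)

def phi(theta, y):
    """Loss of Example 2: minus the log-density of a N(mu, sigma^2) model."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)  # log-parameterization keeps sigma > 0
    return 0.5 * np.log(2 * np.pi) + log_sigma + 0.5 * ((y - mu) / sigma) ** 2

def estimating_function(theta, y):
    """Empirical mean of the loss over the sample."""
    return phi(theta, y).mean()

res = minimize(estimating_function, x0=np.array([0.0, 0.0]), args=(y,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
# For this loss the M-estimator is the MLE: mu_hat is the sample mean and
# sigma_hat the square root of the (biased) sample variance.
```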
2.2 The assessment risk and its estimation by cross-validation
When several estimators are available, we wish to assess their performance by estimating an assessment risk; estimators with small assessment risks will be preferred. For constructing the risk of an estimator we may use an assessment loss function. The assessment risk, given in eq. (1), is the expectation under the true distribution of the assessment loss, where both Y and the estimator are random. The problem is to estimate the assessment risk without knowing the true density. A natural, albeit naive, estimator is the empirical mean of the assessment loss at the fitted estimator, given in eq. (2). However, the naive estimator is not completely satisfying because it does not take into account that the estimator depends on the observations; as a result it underestimates the assessment risk (the well-known overoptimism bias).
If another i.i.d. sample from the same distribution were available, a natural estimator of the assessment risk would be the empirical mean of the assessment loss of the estimator over this new sample; we call it the "oracle estimator." It is an unbiased estimator of the assessment risk but cannot be computed from the original sample. Its normalized variance tends toward the variance of the assessment loss under the true distribution.
A pseudo-oracle estimator of the assessment risk is often used by practitioners who split their original sample into a training and a validation sample. However, this practice entails a loss of efficiency, since only half of the data are used for computing the estimator and the other half for estimating its assessment risk. Cross-validation estimators of the assessment risk make more efficient use of the information. In particular, the leave-one-out cross-validation criterion averages, over the observations, the assessment loss at observation i of the estimator computed on the sample with observation i removed. It does nearly as well as if another sample were available, in terms of both bias and variance; indeed it can immediately be seen that its expectation equals the assessment risk of the estimator based on n − 1 observations. We shall see in Section 3 that the asymptotic variance of the approximate cross-validation criterion is precisely the same as that of the oracle estimator.
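The leave-one-out criterion can be sketched as follows (a toy Python illustration of ours with a normal model; all names are assumptions). The estimator is refitted n times, each time leaving one observation out, and the assessment loss is evaluated on the left-out point:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(size=60)

def phi(theta, y):
    """Estimating loss: minus the log-density of a N(mu, sigma^2) model."""
    mu, log_s = theta
    return 0.5 * np.log(2 * np.pi) + log_s + 0.5 * ((y - mu) * np.exp(-log_s)) ** 2

def fit(y):
    """M-estimator: minimizer of the estimating function."""
    return minimize(lambda th: phi(th, y).mean(), x0=np.zeros(2)).x

psi = phi  # assessment loss; here taken equal to the estimating loss

theta_full = fit(y)
naive = psi(theta_full, y).mean()  # naive (over-optimistic) estimate, eq. (2)
# leave-one-out: refit without observation i, assess on observation i
cv = np.mean([psi(fit(np.delete(y, i)), y[i]) for i in range(len(y))])
# cv exceeds naive: the gap is the overoptimism bias being corrected
```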
For comparing two estimators, the difference of assessment risks is relevant. This can be estimated by the difference of cross-validation estimates of the assessment risks.
2.3 The universal approximate cross-validation criterion
The leave-one-out cross-validation criterion may be computationally demanding since it requires running the maximization algorithm n times to find the n leave-one-out estimators. For this reason an approximate formula is very useful. In this section we propose a universal approximation of the cross-validation (UACVR) criterion for regular estimating and assessment loss functions.
Definition 1 (Universal approximation of the cross-validation). UACVR is defined by eq. (3). The leading term in eq. (3) is the naive estimator defined in eq. (2), while the second term is a correction accounting for parameter estimation. This correction term involves the Hessian of the estimating function and the gradients of the assessment and estimating functions (up to a multiplicative constant for the latter).
Under regularity assumptions on the estimating and assessment losses, the leave-one-out cross-validation criterion differs from UACVR by an asymptotically negligible term, which makes UACVR a good approximation for relatively large n, precisely when leave-one-out cross-validation becomes computationally too demanding. The regularity conditions are detailed in the Appendix and are essentially: A1: the estimating risk has a unique minimizer; A2: thrice differentiability of the estimating loss; A3: twice differentiability of the assessment loss.
Theorem 1. Under assumptions A1, A2 and A3, the leave-one-out cross-validation criterion and UACVR differ by the asymptotically negligible term given in eq. (4). UACVR applies only to regular parametric problems. Thus it does not apply to non- or semi-parametric estimators, and more generally to singular problems as treated by Watanabe. Also, some assessment functions do not satisfy the regularity assumptions: for instance, a non-parametric estimator of the area under the ROC curve can be used for assessing the discriminating ability of an estimator, but it is not continuous in the parameter. Nevertheless, UACVR may be useful in various important contexts as detailed in Section 4, including penalized likelihood estimators approximated on a spline basis, which is a way to avoid strong parametric assumptions.
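A generic numerical sketch of the idea behind eq. (3) is the first-order ("infinitesimal jackknife") expansion of the leave-one-out estimator: the parameter refitted without observation i is approximately the full-sample estimate plus (1/n) times the inverse Hessian of the estimating function applied to the gradient of the estimating loss at observation i. Plugging this into the criterion gives the naive term plus a correction built from the Hessian and the two gradients. The Python fragment below (our own illustration with a normal model and finite-difference derivatives, not the authors' code) compares this approximation with exact leave-one-out:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = rng.normal(size=80)

def phi(theta, yi):
    """Estimating loss: minus log-density of N(mu, sigma^2)."""
    mu, log_s = theta
    return 0.5 * np.log(2 * np.pi) + log_s + 0.5 * ((yi - mu) * np.exp(-log_s)) ** 2

psi = phi  # assessment loss; here equal to the estimating loss

def grad(f, theta, yi, eps=1e-5):
    """Central finite-difference gradient of f(theta, yi) in theta."""
    g = np.zeros_like(theta)
    for k in range(len(theta)):
        e = np.zeros_like(theta); e[k] = eps
        g[k] = (f(theta + e, yi) - f(theta - e, yi)) / (2 * eps)
    return g

theta_hat = minimize(lambda th: phi(th, y).mean(), np.zeros(2)).x
n, p = len(y), len(theta_hat)

# Hessian of the estimating function at theta_hat (finite differences)
H = np.zeros((p, p))
eps = 1e-4
mean_loss = lambda t, yy: phi(t, yy).mean()
for k in range(p):
    e = np.zeros(p); e[k] = eps
    H[:, k] = (grad(mean_loss, theta_hat + e, y)
               - grad(mean_loss, theta_hat - e, y)) / (2 * eps)

Hinv = np.linalg.inv(H)
naive = psi(theta_hat, y).mean()
# correction: average over i of grad(psi)_i' H^{-1} grad(phi)_i, divided by n
correction = np.mean([grad(psi, theta_hat, yi) @ Hinv @ grad(phi, theta_hat, yi)
                      for yi in y]) / n
uacv = naive + correction

# exact leave-one-out for comparison (n refits, warm-started at theta_hat)
cv = np.mean([psi(minimize(lambda th: phi(th, np.delete(y, i)).mean(),
                           theta_hat).x, y[i]) for i in range(n)])
# uacv is close to cv at a fraction of the refitting cost
```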
3 Asymptotic distribution and tracking interval
3.1 Asymptotic distribution of UACVR
Commenges et al., using results of Vuong, studied the asymptotic distribution of a difference of normalized AICs as an estimator of a difference of Kullback–Leibler risks, the normalized AIC being AIC divided by n. Here similar arguments are applied to study the asymptotic distribution of UACVR and of a difference of two UACVR values. By the continuous mapping theorem, the asymptotic distribution of UACVR is the same as that of the oracle estimator. Since the latter quantity is a mean, it immediately follows from the central limit theorem that UACVR is asymptotically normal (eq. (5)), with asymptotic variance the variance of the assessment loss under the true distribution (eq. (6)); this variance can be estimated by the empirical variance of the individual assessment losses.
3.2 Asymptotic distribution of a difference between UACVR values of two estimators
If two estimators are available, we would like to know which is the best according to the chosen assessment risk. Thus we have to estimate the difference of their assessment risks. The obvious estimator is the difference of the two UACVR values. We focus on the case where the two estimators do not converge to the same limit. Using the same arguments as above, we obtain that the difference is asymptotically normal (eq. (7)), with asymptotic variance the variance of the difference of the individual assessment losses, which can be estimated by its empirical counterpart.
Based on the same type of results, Commenges et al. proposed to construct a "tracking interval" for a difference of normalized AIC values. The tracking interval is a kind of confidence interval for the difference of risks. Because the variability of estimators of differences of risks is rather large in general, it is useful to have an interval estimate rather than just a point estimate. However, in the conventional theory of point and interval estimation the target parameter is fixed, whereas here it changes with n: we have a moving target, hence the name "tracking interval." Simulations in Commenges et al. showed that the variance of the difference of AICs was correctly estimated and that the corresponding tracking interval had good coverage properties. The same idea can be applied in the more general case treated here: the tracking interval is centered on the estimated difference of risks, with half-width equal to the relevant quantile of the standard normal distribution times the estimated standard error.
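As a minimal Python sketch (our own illustration, not the authors' code), a tracking interval can be computed from the per-observation assessment losses of the two estimators; the benefit of their positive correlation appears through the variance of the differences:

```python
import numpy as np
from scipy.stats import norm

def tracking_interval(loss1, loss2, alpha=0.05):
    """Tracking interval for the difference of assessment risks of two
    estimators, from their per-observation assessment losses evaluated
    on the same n observations."""
    d = np.asarray(loss1) - np.asarray(loss2)
    n = len(d)
    delta = d.mean()                  # estimated difference of risks
    se = d.std(ddof=1) / np.sqrt(n)   # exploits the positive correlation
    z = norm.ppf(1 - alpha / 2)
    return delta - z * se, delta + z * se

# toy data: estimator 1 has a risk higher by about 0.1
rng = np.random.default_rng(3)
base = rng.normal(1.0, 0.3, size=500)
lo, hi = tracking_interval(base + 0.1 + rng.normal(0, 0.05, 500), base)
# the interval lies around +0.1 and excludes 0
```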
Note that the variance of an estimated difference of risks is in general much lower than the variance of a single estimated risk. This has been shown by Commenges et al. for the expected cross-entropy assessment risk and comes from the fact that the individual assessment losses of the two estimators are often positively correlated.
4 Particular cases of UACVR
In this section we give seven frameworks in which UACVR applies (a non-exhaustive list).
4.1 MLEs and information assessment risk: TIC and AIC
Suppose we take minus the log-density as both estimating and assessment loss. Then the estimating function is minus the normalized log-likelihood. It estimates the estimating risk, here the cross-entropy of the model density with respect to the true density, which is the sum of the entropy of the true density and of the Kullback–Leibler divergence of the model density relative to it. The assessment risk is here the expected cross-entropy (eq. (8)), the sum of the entropy and of the expected Kullback–Leibler risk. The latter differs from the conventional Kullback–Leibler risk defined for a fixed density because it is applied here to an estimator; it was mentioned by Hall under the name "expected Kullback–Leibler loss." So, although the loss functions for estimating and assessment are the same, there is a dissymmetry: the estimating risk is a cross-entropy while, because the estimator is random, the assessment risk is an expected cross-entropy.
In that case the leading term of eq. (3) is minus the maximized (normalized) log-likelihood, the gradient of the assessment loss is the individual score, and UACVR is identical to a normalized version of TIC. If the model is well specified, the matrix K tends in probability toward the Fisher information matrix; the Hessian of the estimating function also tends toward it, so that the correction term tends toward p, the number of parameters. Thus, if the model is not too badly specified, TIC is approximately equal to AIC, and the normalized criterion estimates the expected cross-entropy of the estimator. In practice, Burnham and Anderson do not recommend the use of TIC when n is small because of the variability of the correction term. On the other hand, Konishi and Kitagawa show (see their Table 3.3) that the correction terms can be rather different when the models are misspecified.
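The collapse of the TIC correction toward p can be checked numerically. In this sketch (our own Python illustration, with our own names), a well-specified normal model is fitted by maximum likelihood and the trace correction is computed from the empirical outer products of the scores (K) and the averaged Hessian; it comes out close to p = 2:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.0, size=5000)

# well-specified N(mu, sigma^2) model fitted by ML (closed-form MLEs)
mu, s2 = y.mean(), y.var()
r = y - mu

# individual scores (gradient of the individual log-likelihood) wrt (mu, s2)
scores = np.stack([r / s2, -0.5 / s2 + 0.5 * r**2 / s2**2], axis=1)
K = scores.T @ scores / len(y)  # empirical mean of score outer products

# averaged Hessian of minus the individual log-likelihood at the MLE
H = np.array([[1.0 / s2,            r.mean() / s2**2],
              [r.mean() / s2**2,    -0.5 / s2**2 + (r**2).mean() / s2**3]])

correction = np.trace(np.linalg.solve(H, K))
# for a well-specified model, correction ~ p = 2, and TIC ~ AIC
```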
4.2 M-estimators and information assessment risk: GIC
Konishi and Kitagawa generalized TIC and AIC to the case where the estimator is an M-estimator. The criterion they proposed, obtained by correcting the bias of the log-likelihood, is the GIC. GIC is also a special case of UACVR, obtained when the assessment risk is the expected cross-entropy. They applied GIC in particular to penalized likelihood estimators. Thus UACVR, like GIC, can be applied to maximum a posteriori, maximum penalized likelihood and hierarchical likelihood estimators.
4.3 Restricted AIC
Liquet and Commenges proposed a modification of AIC and LCV for the case where estimators are based on the full information while they are assessed on a smaller (more targeted) information. More specifically, the estimator is based on the full sample but the assessment risk is based on a random variable Z which is a coarsened version of Y, for instance a dichotomization of Y. For this case, the restricted AIC (RAIC) was derived both by direct approximation of the risk and by approximation of the LCV. RAIC is a particular case of UACVR, obtained when the estimating loss is minus the log-density of Y and the assessment loss is minus the log-density of the coarsened variable Z.
4.4 Assessment of estimators by the CRPS
Gneiting and Raftery studied scoring rules, and particularly the CRPS. Its negative, which can be used as a loss function, is defined from the cumulative distribution function (c.d.f.) of a distribution in the model as the integrated squared difference between this c.d.f. and the step function jumping at the observation. The corresponding risk is a Cramér–von Mises-type distance. In some cases it may be interesting to assess MLEs using this assessment risk rather than the logarithmic loss, which may be too sensitive to low values of the density. UACVR can be used for estimating this risk. In that case, the leading term of the criterion is the empirical mean of the CRPS loss; for the correcting term, the Hessian is that of the log-likelihood (since the estimator is the MLE) and K must be computed with the individual score (gradient of the individual log-likelihood), up to a normalizing constant. The computation of the gradient of the assessment loss involves, for each i, the computation of p simple integrals, which can be done numerically.
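The CRPS loss can be evaluated by simple numerical integration. The sketch below (our own Python illustration) computes it for a standard normal predictive c.d.f. at an observed value of 0, for which a closed form exists (about 0.234); the quadrature reproduces it:

```python
import numpy as np
from scipy.stats import norm

def crps_loss(cdf, y, grid):
    """CRPS used as a loss: integral over x of (F(x) - 1{x >= y})^2,
    computed by the trapezoidal rule on a grid covering the support."""
    f = (cdf(grid) - (grid >= y).astype(float)) ** 2
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid)))

grid = np.linspace(-10.0, 10.0, 4001)
loss = crps_loss(norm.cdf, 0.0, grid)

# closed form for a N(0,1) forecast at y = 0: 2*pdf(0) - 1/sqrt(pi)
exact = 2 * norm.pdf(0.0) - 1 / np.sqrt(np.pi)
```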
4.5 Assessment of estimators by the Brier score
The Brier score can be used to assess estimators of the distribution of a categorical variable Y taking a finite number of values. Consider a model for this distribution, giving the probability of each category. The Brier loss is the sum, over categories, of the squared differences between the modeled probabilities and the indicators of the observed category (Kronecker symbol: one if the observation falls in the category, zero otherwise). Assume that we estimate the distribution by maximum likelihood and use the Brier score for assessment. In this case, the leading term of UACVR is the empirical mean of the Brier loss; for the correcting term, the Hessian is that of the log-likelihood (since the estimator is the MLE) and K must be computed with the individual score (gradient of the individual log-likelihood), up to a normalizing constant.
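In Python, the Brier loss of a fitted probability vector for one observation reads as follows (our own sketch, with illustrative values):

```python
import numpy as np

def brier_loss(p, y):
    """Brier loss of a probability vector p for the observed category y
    (0-based): sum over m of (p_m - delta_{m,y})^2."""
    d = np.zeros_like(p)
    d[y] = 1.0
    return float(np.sum((p - d) ** 2))

p = np.array([0.7, 0.2, 0.1])
loss_hit = brier_loss(p, 0)   # most probable category observed: 0.09+0.04+0.01
loss_miss = brier_loss(p, 2)  # least probable category observed: 0.49+0.04+0.81
```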
4.6 Conditional AIC
A referee suggested that UACVR might be useful for selecting random-effect models based on conditional assessment functions, that is, when the target is the density conditional on the random effects. A conditional Akaike criterion was proposed by Vaida and Blanchard; Greven and Kneib proposed a correction taking into account the uncertainty on the covariance matrix of the random effects; Braun et al. proposed a predictive cross-validation criterion. UACVR could directly apply to this case by considering an assessment loss based on the density of Y conditional on the estimated random effects. Since the estimated random effect is a function of the parameters and of Y, the assessment loss can indeed be written as a function of the observation and the parameters. For computing UACVR, the main task here would be to compute the gradient of the assessment loss, not forgetting the dependence of the estimated random effects on the parameters; this could easily be done by numerical differentiation.
4.7 Estimators based on continuous approximation of categorical data
Assume Y is an ordered categorical variable taking a finite number of integer values; here for simplicity we consider that Y is univariate. Several models are available for this type of variable. Cumulative probit models, further called "threshold link models," assume that Y takes a given value if a latent variable falls in the corresponding interval between consecutive thresholds, the extreme thresholds being infinite (eq. (9)). The latent variable itself can be modeled as a noisy linear form of explanatory variables, where the noise has a normal distribution with zero mean. The parameters are the thresholds and the parameters of the linear model. For identifiability one must add some constraints, for instance a fixed residual variance and a null intercept in the linear model for the latent variable. An estimator of the distribution can be obtained by maximum likelihood. The assessment risk can be the expected cross-entropy (ECE). Note that since Y is discrete, the densities are defined with respect to a counting measure; that is, the density evaluated at a value is the probability that Y takes this value.
One may also make a continuous approximation, which leads to simpler computations and may be more parsimonious, especially if Y is multivariate as in the illustration of Section 6. For example, we can consider a normal linear model for Y treated as continuous. Maximizing the likelihood of this model leads to a probability measure specified by a density relative to Lebesgue measure. This probability measure gives zero probability to each value l, and this yields an infinite value for the ECE (meaning strong rejection of this estimator). However, a natural estimator of the discrete distribution can be constructed from the continuous one by gathering at each value l the mass of the interval of unit length around l, the extreme intervals extending to infinity. UACVR can be computed for this estimator for estimating its ECE. The leading term can be interpreted as the log-likelihood of this estimator with respect to the counting measure. For the correcting term we need the Hessian of the log-likelihood and the gradient of the assessment loss. The latter is a ratio whose denominator is the probability, under the fitted continuous distribution, that Y falls in the interval around l; it can thus be interpreted as the conditional expectation (under the fitted distribution) of the individual score. Hence, if the score does not vary much on the interval, the gradient of the assessment loss is close to the individual score. Using the same arguments as in Section 4.1, we then obtain that UACVR is close to correcting the leading term by the number of parameters as in AIC; such a criterion, which we call AICd, was proposed by Proust-Lima et al., and this is likely to be a good approximation when the number of modalities of Y is large.
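The mass-gathering construction can be sketched as follows (our own Python illustration for a univariate normal fit; the parameter values and names are assumptions):

```python
import numpy as np
from scipy.stats import norm

def gathered_pmf(mu, sigma, levels):
    """Discrete estimator built from a continuous N(mu, sigma^2) fit by
    gathering at each level l the mass of the interval (l - 1/2, l + 1/2);
    the extreme intervals extend to minus and plus infinity."""
    edges = np.concatenate(([-np.inf], np.asarray(levels[:-1]) + 0.5, [np.inf]))
    cdf = norm.cdf(edges, loc=mu, scale=sigma)
    return np.diff(cdf)

pmf = gathered_pmf(mu=2.3, sigma=1.1, levels=np.arange(0, 6))
# pmf sums to one and can be plugged into the counting-measure log-likelihood
```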
5 Simulation: choice of estimators for ordered categorical data
We conducted a simulation study to illustrate the use of UACVR for comparing estimators derived from threshold link models and estimators obtained by a linear continuous approximation in the case of ordered categorical data (see Section 4.7). The aim was to assess the performance of UACVR as an estimator of the ECE defined in eq. (8), and to compare it with the naive normalized AIC criterion (denoted AIC) and the normalized AIC criterion computed with respect to the counting measure (denoted AICd). The performance of these criteria was studied in the case where the number of modalities of the response variable Y is small (Section 5.2.1) and where it is large (Section 5.2.2).
5.1 Simulation design
5.1.1 True distributions
For all the simulations, the data came from a cumulative probit model where the relationship between the latent variable and Y is as in eq. (9) and the linear form of the latent variable is specified by eq. (10), the two explanatory variables being generated from independent standard normal distributions. In order not to disadvantage the linear continuous approximation compared with the threshold link model, the parameters were chosen as the solution of a system of equations matching the two specifications.
5.1.2 The different models
For each generated sample, we fitted the cumulative probit model as previously defined, and a linear model assuming a linear continuous approximation of the response variable Y, with independent zero-mean normal errors. Both models were fitted by maximum likelihood using a Fortran program, whose correctness was checked by comparing its results with those obtained with the R package lcmm.
Samples of various sizes n were generated; for all simulations, N = 10,000 replications were used. The true assessment risk, the ECE, which is available only in a simulation study, was computed by a Monte Carlo approach: for each sample we computed the estimator; we generated a large number, M = 100,000, of observations independent of the sample; and we estimated the ECE by the global mean of the assessment losses over these observations and the N replications.
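This Monte Carlo computation of the true assessment risk can be sketched as follows (a toy Python illustration of ours, with a normal model standing in for the probit model and much smaller N and M to keep it fast). In expectation the estimated risk lies slightly above the entropy of the true distribution:

```python
import numpy as np

rng = np.random.default_rng(5)

def mc_assessment_risk(fit, psi, sample_y, n=300, N=200, M=10000):
    """Monte Carlo estimate of the true assessment risk (e.g. the ECE):
    for each of N replicated samples of size n, fit the estimator, then
    average the assessment loss over M fresh observations; finally
    average over replications.  fit(y) -> theta; psi(theta, y) -> losses."""
    fresh = sample_y(M)  # observations independent of the fitted samples
    risks = []
    for _ in range(N):
        theta = fit(sample_y(n))
        risks.append(psi(theta, fresh).mean())
    return float(np.mean(risks))

# toy check with the log-loss: the risk approaches the entropy of N(0,1)
ent = 0.5 * np.log(2 * np.pi * np.e)
risk = mc_assessment_risk(
    fit=lambda y: (y.mean(), y.std()),
    psi=lambda th, y: 0.5 * np.log(2 * np.pi) + np.log(th[1])
        + 0.5 * ((y - th[0]) / th[1]) ** 2,
    sample_y=lambda m: rng.normal(size=m))
```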
5.2 Results of the simulation
5.2.1 Small number of modalities
We consider here the case where the number of modalities of Y is relatively small. In Table 1 we present, for different sample sizes n, the results for the empirical criteria AIC, AICd and UACVR, which can be compared with the ECE. For every sample size, the cumulative probit model provided a better ECE than the linear model (positive difference). UACVR had a very small bias for all sample sizes. The two other criteria, AIC and AICd, were also in favor of the threshold model. However, as expected, the naive normalized AIC did not correctly estimate the ECE, because of the wrong probability measure (Lebesgue measure instead of a counting measure). The AICd criterion estimated the ECE relatively well, with a small bias. All the criteria were in agreement with the ECE for the choice of the model.
5.2.2 Large number of modalities
We consider here the case where the number of modalities of Y is relatively large. The results are presented in Table 2. For every sample size, the linear model provided a better ECE than the threshold model (negative difference). UACVR had a small bias for all sample sizes. The AICd criterion gave results similar to those of UACVR, while the AIC criterion failed to find the best estimator (positive difference).
5.2.3 Coverage of tracking intervals
Finally, we looked at the coverage of the tracking intervals and the percentage of cases where 0 was inside the tracking interval. The results are given in Table 3. The coverage rates appear to be too large. We checked that the distributions of UACVR were approximately normal; we found, however, that the estimated standard deviations were too large, by a factor varying from 1.2 to 1.8 for small and large numbers of modalities, respectively, and we were unable to find the reason for this overestimation. Nevertheless, the estimate gives the order of magnitude of the variability of UACVR.
For a small number of modalities, 0 was always outside the tracking interval, leading to an unequivocal choice. For a large number of modalities, the percentage increased with n. This may seem paradoxical but illustrates well the difference between a tracking interval and a confidence interval. What happens is that the misspecification risk of the linear model is rather large for a small number of modalities and very small for a large number of modalities. In the latter case the global risk is driven by the statistical risk, which decreases with n, so that the difference of risks, which is the target, also decreases with n and becomes very small for large n; the two models are then nearly equivalent and there is no point in choosing one rather than the other according to the chosen risk.
6 Illustration on the choice of estimators for psychometric tests
In epidemiological studies, cognition is measured by psychometric tests, which usually consist of the sum of items measuring one or several cognitive domains. A common example is the Mini-Mental State Examination (MMSE) score, computed as the sum of 30 binary items evaluating memory, calculation, orientation in space and time, language, and word recognition; for this reason it is called a "sumscore" and ranges from 0 to 30. Although psychometric tests are in essence ordered categorical data, they are most often analyzed as continuous data. Indeed, they usually have a large number of different levels and, especially in longitudinal studies, models for categorical data are numerically complex. Recently, Proust-Lima et al. defined a latent process mixed model to analyze repeated measures of discrete outcomes involving either a threshold link model or an approximation of it using continuous parameterized increasing functions. Comparison of models assuming either categorical data (using the threshold model) or continuous data (using continuous functions) was done with an AICd, computed with respect to the counting measure. In this illustration, we use UACVR to compare such latent process mixed models assuming either continuous or ordered categorical data, applied to the repeated measures of the MMSE and of its calculation subscore in a large sample from a French prospective cohort study.
6.1 Latent process mixed models
In brief, the latent process mixed model assumes that a latent process underlies the repeated measures of the observed variable for subject i and occasion j. The latent process is defined by a standard linear mixed model involving two distinct vectors of time-dependent covariates associated, respectively, with the vector of fixed effects and the vector of random effects. We further assume that the variance of the first random effect, which usually represents the random intercept, is fixed for identifiability; apart from this constraint, D is an unstructured variance matrix.
A measurement model links the latent process to the observed repeated measures through intermediary variables, which are noisy versions of the latent process at the measurement times, the noise terms being i.i.d. normal variables with zero expectation. For ordered categorical data, a standard threshold link model as defined in eq. (9) (Section 4.7) for the univariate case is well adapted, leading to a cumulative probit mixed model. For continuous data, the link is modeled by a monotonic increasing transformation. Three families of such transformations are considered: (i) rescaled beta c.d.f.'s; (ii) linear combinations of a basis of quadratic I-splines with m nodes; (iii) the linear transformation, which gives the standard linear mixed model.
Latent process mixed models are estimated in the maximum likelihood framework using the lcmm function of the R package lcmm. When assuming continuous data, the likelihood can be computed analytically using the Jacobian of the transformation. In contrast, when assuming ordered categorical data, the integration over the random effects has to be done numerically.
UACVR is computed from the log-likelihood obtained for the MLEs with respect to the counting measure (eq. (11)), where the cut-points are either the estimated thresholds, when a threshold model is considered, or derived from the estimated transformation, when monotonic increasing families of transformations are used. We also need to compute the gradient of the assessment loss as in Section 4.7; the integral is approximated by Gaussian quadrature.
6.2 Application: categorical psychometric tests
Data come from the French prospective cohort study PAQUID, initiated in 1988 to study normal and pathological aging. Subjects included in the cohort were 65 or older at the initial visit and were followed up to 10 times, with visits at 1, 3, 5, 8, 10, 13, 15, 17 and 20 years after the initial visit. At each visit, a battery of psychometric tests including the MMSE was completed. In the present analysis, all the subjects free of dementia at the 1-year visit who had at least one MMSE measure during the whole follow-up were included, resulting in a sample of 2,914 subjects. Data from baseline were removed to avoid modeling the first-passing effect. The observed distributions of the MMSE sumscore and of its calculation subscore are displayed in Figure 1.
The trajectory of the latent process was modeled as an individual quadratic function of age with correlated random effects for the intercept, slope and quadratic slope, and an adjustment for the binary covariates educational level (EL = 1 if the subject graduated from primary school) and gender (SEX = 1 if the subject is a man), plus their interactions with age and quadratic age. For the MMSE sumscore, in addition to the threshold link, the linear, beta c.d.f. and I-splines (with five equidistant nodes) continuous link functions were considered. For the calculation subscore, in addition to the threshold link, only the linear link was considered.
Table 4 gives the assessment criteria for estimators based on the different models, and Table 5 provides the differences in UACVR or AICd and their 95% tracking intervals. For the MMSE sumscore, the mixed model assuming the standard linear transformation yielded a clearly worse UACVR than the other models, which account for nonlinear relationships with the underlying latent process. The model involving a beta c.d.f. gave a risk similar to the one involving the less parsimonious I-splines transformation (0 was inside the 95% tracking interval). Finally, the mixed model considering a threshold link model, which is numerically demanding (because of a three-dimensional integral in the likelihood), gave the best assessment risk but remained relatively close to the simpler models assuming a beta c.d.f. or an I-splines transformation. For the interpretation of these values, Commenges et al. suggested qualifying differences of risks as "large," "moderate" or "small" according to their order of magnitude; moreover, for multivariate observations, it was suggested to divide by the total number of observations rather than by the number of independent observations. With this correction, the differences between the linear model and the other models can be qualified as "large," and the differences between the threshold model and both the beta c.d.f. and I-splines models are between "moderate" and "small." Of course, this gives only an idea of the difference of risks between estimators; a more intuitive and reliable interpretation scale remains to be found. Figure 2 displays the estimated link functions (A) and the predicted mean trajectories of the latent process according to educational level (B) for the models involving either a linear, a beta c.d.f., an I-splines or a threshold link function.
The estimated link functions, as well as the predicted trajectories of the latent process, are very close when assuming either a beta c.d.f., an I-splines or a threshold link function, but they differ greatly when assuming a linear link.
For the calculation subscore, the standard linear mixed model again gave a clearly higher risk than the mixed model assuming a threshold link.
We have proposed a universal approximate formula for leave-one-out cross-validation under regularity conditions: it is universal in the sense that it applies to any pair of estimating and assessment risks that can be correctly estimated from the observations. UACVR is often a very good approximation of leave-one-out cross-validation, which itself does nearly as well as an "oracle estimator" of the assessment risk that would be computable if we assessed the estimator on an independent replica of the sample. Another asset is that UACVR does not require the assumption that the models are well specified, and non-nested models can be compared. The result is in principle restricted to parametric models but extends to smooth semi- or non-parametric ones through spline representations of penalized likelihood estimators. The approximate formula not only allows fast computation, because the model is fitted only once, but also allows deriving the asymptotic distribution of the criterion.
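To make the idea concrete, here is a minimal numerical sketch (our own illustration, not the paper's code) of the universal approximate cross-validation principle for a one-parameter M-estimator: the estimating loss is the squared error (so the estimator is the sample mean), the assessment loss is the absolute error, and the approximate criterion adds a first-order correction to the apparent assessment risk instead of refitting the model n times. All names (`d_g`, `a_loss`, `uacv`) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=200)
n = y.size

# M-estimator: minimize sum of g(theta; y_i) = (y_i - theta)^2 / 2 -> sample mean
theta_hat = y.mean()

def d_g(theta, yi):      # gradient of the estimating loss
    return theta - yi

def a_loss(theta, yi):   # assessment loss (absolute error here)
    return abs(theta - yi)

def d_a(theta, yi):      # gradient of the assessment loss (a.e.)
    return np.sign(theta - yi)

H = 1.0                  # second derivative of g, constant for squared error

# Approximate leave-one-out assessment risk: apparent risk + correction
# (1/n^2) * sum_i d_a^T H^{-1} d_g, computed from a single fit.
uacv = np.mean([a_loss(theta_hat, yi) for yi in y]) \
     + np.mean([d_a(theta_hat, yi) * d_g(theta_hat, yi) / H for yi in y]) / n

# Exact leave-one-out cross-validation: n refits.
cv = np.mean([a_loss(np.delete(y, i).mean(), y[i]) for i in range(n)])

print(abs(uacv - cv))    # the two criteria agree closely
```

Even in this toy setting the gap between the approximate and exact criteria is orders of magnitude smaller than the criteria themselves, while only one fit is required.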
Estimating this distribution is important since the variability of UACVR, like that of any criterion used for estimator choice, may be large. Fortunately, as noted in Section 3, the variability of a difference of UACVR values between two estimators is smaller, but it still remains non-negligible. A simple formula allows estimating these variances and constructing so-called tracking intervals; our simulation study, however, shows that the coverage of these tracking intervals is too large, due to an overestimation of the variances. Why this happened here, while in other contexts [15, 19] the coverage rates were correct, is an open question, as is a possible correction for this overestimation; nevertheless, the estimates have the correct order of magnitude and the tracking intervals may be useful.
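As an illustration of how such a tracking interval can be formed, the following sketch (our own construction on simulated per-subject losses; it neglects the O(1/n) correction terms when estimating the variance) builds a 95% tracking interval for a difference of assessment risks from the empirical variance of the individual loss differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Hypothetical per-subject assessment losses under two competing estimators:
a1 = rng.normal(1.00, 0.30, size=n)   # e.g. losses under model 1
a2 = rng.normal(1.02, 0.30, size=n)   # e.g. losses under model 2

d = a1 - a2
delta_hat = d.mean()                   # estimated difference of risks
se = d.std(ddof=1) / np.sqrt(n)        # its standard error
lo, hi = delta_hat - 1.96 * se, delta_hat + 1.96 * se

print(f"difference: {delta_hat:.4f}, 95% tracking interval: ({lo:.4f}, {hi:.4f})")
```

When 0 lies inside the interval, the two estimators cannot be clearly distinguished on this assessment risk; pairing the losses subject by subject is what makes the variance of the difference smaller than that of either criterion alone.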
In this paper, UACVR has been applied to the issue of choosing between estimators of the distribution of longitudinal categorical data based either on cumulative probit mixed models or on mixed models using a continuous approximation. It has been shown that the naive AIC can be misleading, while a procedure called AICd (which had not previously been validated) yields results very close to UACVR, even if the latter is slightly better. Both quantities can be computed with the lcmm R package.
Appendix: Proof of Theorem 1
Under Assumptions A1–A3 below, we have formula (4).
In the proof, we apply the $O_p$ concept to vectors and matrices: saying that a matrix $H$ is $O_p(r_n)$ means that all its elements are $O_p(r_n)$. The proof is partly heuristic in that at the end we need an assumption for obtaining that a mean of $n$ remainder terms is itself an $O_p(n^{-2})$, or at least an $o_p(n^{-1})$, term.
Here $g(\theta; y)$ denotes the estimating loss and $a(\theta; y)$ the assessment loss.
A1. The parameter $\theta^*$ is the unique minimizer of the theoretical estimating risk $\theta \mapsto \mathrm{E}[g(\theta; Y)]$, and the M-estimator $\hat{\theta}$ is consistent for $\theta^*$.
A2. $\theta \mapsto g(\theta; y)$ is thrice differentiable for every $y$, and the third derivative is dominated by a fixed integrable function in a neighborhood of $\theta^*$.
A3. $\theta \mapsto a(\theta; y)$ is twice differentiable for every $y$, and the second derivative is dominated by a fixed integrable function in a neighborhood of $\theta^*$.
Recall that the estimating loss is $g(\theta; y)$, the assessment loss is $a(\theta; y)$, $\hat{\theta}$ minimizes $\sum_{j=1}^{n} g(\theta; Y_j)$ and the leave-one-out estimator $\hat{\theta}_{-i}$ minimizes $\sum_{j \neq i} g(\theta; Y_j)$; write $U(\theta; y) = \partial g(\theta; y)/\partial \theta$ for the gradient of the estimating loss. By definition of $\hat{\theta}_{-i}$ we have the relation
\begin{equation}
\sum_{j \neq i} U(\hat{\theta}_{-i}; Y_j) = 0. \tag{12}
\end{equation}
Taking a Taylor expansion of the terms of this equation around $\hat{\theta}$, and using the fact that $\sum_{j=1}^{n} U(\hat{\theta}; Y_j) = 0$ so that $\sum_{j \neq i} U(\hat{\theta}; Y_j) = -U(\hat{\theta}; Y_i)$, we obtain that
\begin{equation}
\hat{\theta}_{-i} - \hat{\theta} = \frac{1}{n-1} H_{n,-i}^{-1}\, U(\hat{\theta}; Y_i) + R_i, \tag{13}
\end{equation}
where $H_{n,-i} = (n-1)^{-1} \sum_{j \neq i} \partial U(\hat{\theta}; Y_j)/\partial \theta$ and $R_i$ is a Taylor remainder. Note that this implies that $\hat{\theta}_{-i} - \hat{\theta} = O_p(n^{-1})$, because $U(\hat{\theta}; Y_i) = O_p(1)$ and both $H_{n,-i}$ and $H_{n,-i}^{-1}$ are $O_p(1)$. But this in turn implies that $R_i$ is in fact an $O_p(n^{-2})$ (as a quadratic form of $O_p(n^{-1})$ terms, by Assumption A2). Now we show that $H_{n,-i}$ can be replaced by $H_n = n^{-1} \sum_{j=1}^{n} \partial U(\hat{\theta}; Y_j)/\partial \theta$ in eq. (13). We have $(n-1) H_{n,-i} = n H_n - \partial U(\hat{\theta}; Y_i)/\partial \theta$, hence $H_{n,-i} = H_n + O_p(n^{-1})$ and $H_{n,-i}^{-1} = H_n^{-1} + O_p(n^{-1})$. Equation (13) can thus be written $\hat{\theta}_{-i} - \hat{\theta} = (n-1)^{-1} H_n^{-1} U(\hat{\theta}; Y_i) + O_p(n^{-2})$. Using the fact that $(n-1)^{-1} = n^{-1} + O(n^{-2})$ we obtain
\begin{equation}
\hat{\theta}_{-i} - \hat{\theta} = \frac{1}{n} H_n^{-1}\, U(\hat{\theta}; Y_i) + O_p(n^{-2}). \tag{14}
\end{equation}
Developing now the assessment loss function $a(\hat{\theta}_{-i}; Y_i)$ around $\hat{\theta}$ yields (using Assumption A3):
$$a(\hat{\theta}_{-i}; Y_i) = a(\hat{\theta}; Y_i) + \frac{\partial a(\hat{\theta}; Y_i)}{\partial \theta}^{\!\top} (\hat{\theta}_{-i} - \hat{\theta}) + O_p(n^{-2}).$$
Replacing $\hat{\theta}_{-i} - \hat{\theta}$ in this equation by its approximation in eq. (14) we obtain
$$a(\hat{\theta}_{-i}; Y_i) = a(\hat{\theta}; Y_i) + \frac{1}{n} \frac{\partial a(\hat{\theta}; Y_i)}{\partial \theta}^{\!\top} H_n^{-1}\, U(\hat{\theta}; Y_i) + O_p(n^{-2}).$$
Taking the mean of the left-hand terms of these equations yields the leave-one-out criterion $n^{-1} \sum_{i=1}^{n} a(\hat{\theta}_{-i}; Y_i)$. Taking the mean of the terms on the right-hand side gives us a development with an error term which is the mean of $n$ error terms in $O_p(n^{-2})$. Because the number of error terms to consider increases with $n$, it is not true in general that such a mean preserves the order of the error terms; this is true under boundedness conditions on the expectations of these terms. At this stage the proof is heuristic: we assume conditions such that the mean of these terms is also an $O_p(n^{-2})$, or at least an $o_p(n^{-1})$, term. When this holds, we obtain the announced result given in formula (4).
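As a concrete sanity check of the one-step leave-one-out expansion, consider the Gaussian-mean case (a worked example of ours, not taken from the paper), where the approximation can be compared with the exact leave-one-out estimator:

```latex
% Squared-error estimating loss g(\theta; y) = (y - \theta)^2 / 2,
% so \hat{\theta} = \bar{Y}, H_n = 1 and
% \partial g(\hat{\theta}; Y_i)/\partial\theta = \bar{Y} - Y_i.
\begin{align*}
\hat{\theta}_{-i} - \hat{\theta}
  &= \frac{\bar{Y} - Y_i}{n-1}
  \quad \text{(exact, since } \hat{\theta}_{-i} = (n\bar{Y} - Y_i)/(n-1)\text{)},\\
\hat{\theta}_{-i} - \hat{\theta}
  &\approx \frac{1}{n} H_n^{-1}
     \frac{\partial g(\hat{\theta}; Y_i)}{\partial \theta}
   = \frac{\bar{Y} - Y_i}{n}
  \quad \text{(one-step approximation)}.
\end{align*}
% The two differ by (\bar{Y} - Y_i)/(n(n-1)) = O_p(n^{-2}); here the
% Taylor remainder vanishes because g is quadratic in \theta.
```

In this quadratic case the only error comes from replacing $(n-1)^{-1}$ by $n^{-1}$, which is exactly of the $O_p(n^{-2})$ order claimed for the individual remainder terms.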
Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csáki F, editors. Proceedings of the 2nd International Symposium on Information Theory. Budapest: Akadémiai Kiadó, 1973:267–81.
Takeuchi K. Distributions of information statistics and criteria for adequacy of models. Math Sci 1976;153:12–18.
Murata N, Yoshizawa S, Amari S-I. Network information criterion-determining the number of hidden units for an artificial neural network model. IEEE Trans Neural Netw 1994;5:865–72.
Konishi S, Kitagawa G. Information criteria and statistical modeling. Springer Series in Statistics. New York: Springer, 2008.
Stone M. Cross-validatory choice and assessment of statistical predictions (with discussion). J R Stat Soc B 1974;36:111–47.
Van der Laan M, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Stat Appl Genet Mol Biol 2004;3:1036.
Xiang D, Wahba G. A generalized approximate cross validation for smoothing splines with non-Gaussian data. Stat Sin 1996;6:675–92.
Commenges D, Joly P, Gegout-Petit A, Liquet B. Choice between semi-parametric estimators of Markov and non-Markov multi-state models from generally coarsened observations. Scand J Stat 2007;34:33–52.
Commenges D, Liquet B, Proust-Lima C. Choice of prognostic estimators in joint models by estimating differences of expected conditional Kullback-Leibler risks. Biometrics 2012;68:380–7.
Van der Vaart A. Asymptotic statistics. Cambridge: Cambridge University Press, 2000.
Watanabe S. Algebraic geometry and statistical learning theory. Vol. 25. Cambridge: Cambridge University Press, 2009.
Commenges D, Sayyareh A, Letenneur L, Guedj J, Bar-Hen A. Estimating a difference of Kullback-Leibler risks using a normalized difference of AIC. Ann Appl Stat 2008;2:1123–42.
Cover T, Thomas J. Elements of information theory. New York: John Wiley and Sons, 1991.
Burnham KP, Anderson DR. Model selection and multimodel inference: a practical information-theoretic approach. 2nd ed. New York: Springer-Verlag, 2002.
Braun J, Held L, Ledergerber B. Predictive cross-validation for the choice of linear mixed-effects models with application to data from the Swiss HIV cohort study. Biometrics 2012;68:53–61.
Proust-Lima C, Philipps V, Diakite A, Liquet B. LCMM: estimation of extended mixed models using latent classes and latent processes. R package version 1.6.6, 2014.
Proust C, Jacqmin-Gadda H, Taylor JM, Ganiayre J, Commenges D. A nonlinear model with latent process for cognitive evolution using multivariate longitudinal data. Biometrics 2006;62:1014–24.
Letenneur L, Commenges D, Dartigues JF, Barberger-Gateau P. Incidence of dementia and Alzheimer’s disease in elderly community residents of South-Western France. Int J Epidemiol 1994;23:1256–61.