
# The International Journal of Biostatistics

Ed. by Chambaz, Antoine / Hubbard, Alan E. / van der Laan, Mark J.


Online ISSN: 1557-4679
Volume 11, Issue 1 (May 2015)

# A Universal Approximate Cross-Validation Criterion for Regular Risk Functions

Daniel Commenges (corresponding author) / Cécile Proust-Lima / Cécilia Samieri / Benoit Liquet

• INSERM, ISPED, Centre INSERM U-897-Epidemiologie-Biostatistique, Bordeaux F-33000, France
• ISPED, Centre INSERM U-897-Epidemiologie-Biostatistique, Université de Bordeaux, Bordeaux F-33000, France
• School of Mathematics and Physics, The University of Queensland, St Lucia, Brisbane, Queensland 4066, Australia (B. Liquet)
Published Online: 2015-04-03 | DOI: https://doi.org/10.1515/ijb-2015-0004

## Abstract

Selection of estimators is an essential task in modeling. A general framework is that the estimators of a distribution are obtained by minimizing a function (the estimating function) and assessed using another function (the assessment function). A classical case is that both functions estimate an information risk (specifically cross-entropy); this corresponds to using maximum likelihood estimators and assessing them by the Akaike information criterion (AIC). In more general cases, the assessment risk can be estimated by leave-one-out cross-validation. Since leave-one-out cross-validation is computationally very demanding, we propose in this paper a universal approximate cross-validation criterion under regularity conditions (UACVR). This criterion can be adapted to different types of estimators, including penalized likelihood and maximum a posteriori estimators, and also to different assessment risk functions, including information risk functions and the continuous rank probability score (CRPS). UACVR reduces to the Takeuchi information criterion (TIC) when cross-entropy is the risk for both estimation and assessment. We provide the asymptotic distributions of UACVR and of a difference of UACVR values for two estimators. We validate UACVR using simulations and provide an illustration on real data in the psychometric context, where estimators of the distributions of ordered categorical data derived from threshold models are compared with estimators based on continuous approximations.

## 1 Introduction

Selecting estimators is an essential step in modeling, and Akaike information criterion (AIC) [1] has been widely used for this purpose. AIC allows selecting maximum likelihood estimators (MLE) based on parametric models that are not too badly specified. More general criteria have been developed, in particular the Takeuchi information criterion (TIC) [2] and the general information criterion (GIC) [3]. A related criterion in the field of neural networks is the network information criterion (NIC) [4]. Two other well-known criteria are the Bayesian information criterion (BIC) and the deviance information criterion (DIC); both use Bayesian arguments and are not directly related to the present paper. A good reference book for information criteria is by Konishi and Kitagawa [5].

Likelihood cross-validation (LCV) has also been widely used for comparing parametric models. Stone [6] heuristically established that LCV was asymptotically identical to AIC. LCV, however, is more flexible in that it can be applied to other estimators than MLEs, for instance, to penalized likelihood estimators: see Golub et al. [7] and Wahba [8].

Cross-validation can also be applied to other assessment risks than Kullback–Leibler risk. The leave-one-out cross-validation is the most natural and one of the most efficient [9, 10], but it is also the most computationally demanding so that approximation formulas have been derived. Approximate cross-validation formulas have been developed for penalized splines [11, 12] or penalized likelihood [13, 14]. Commenges et al. [15] derived an approximate cross-validation criterion in the context of prognosis.

In the present paper we consider the following general framework: estimators of the true density function are defined as minimizers of an estimating function; the estimating function itself can be viewed as an estimator of a risk, which we call the "estimating risk." Typically there is a model, that is, a family of densities for the variable $Y$, $(g^{\theta})_{\theta \in \Theta}$, $\Theta \subset \Re^p$, and the estimator is chosen to minimize the estimating risk. The estimators of the true density are then assessed using an "assessment risk," which allows choosing between different available estimators. The most conventional case is when the estimating risk is $E[-\log g^{\theta}(Y)]$, which is estimated by minus the normalized log-likelihood, and the assessment risk of the obtained estimator $g^{\hat{\theta}}$ is $E[-\log g^{\hat{\theta}}(Y)]$, which can be estimated by cross-validation or, in the parametric case, by the normalized AIC: $\mathrm{AIC}/2n$. These information risks are very appealing, but there are cases where other risks are relevant. For example, the MLE could be assessed by the continuous rank probability score (CRPS) [16]; this is detailed in Section 4.4. Another example is the estimation of the distribution of ordinal data through an approximation using models for continuous data. Models for ordinal variables that can take a large number of values are rather cumbersome; it is convenient to treat such data as continuous, using an estimating risk adapted to continuous data. However, if we wish to compare the obtained estimator with one obtained from a model for ordinal data, the assessment risk must still take into account that the data are really ordinal.
Such an assessment risk can be estimated by cross-validation; cross-validation has good properties but is computationally very demanding. The main aim of this paper is to find an approximation of leave-one-out cross-validation that is valid for any estimating and assessment risks satisfying regularity conditions that will be detailed. This will be applied to the ordinal data example.

Section 2 presents the framework, the cross-validation criterion and its approximation. The approximation is universal in the sense that it can be applied to any estimating and assessment risks satisfying regularity conditions. We denote the approximate criterion by UACVR (U for universal, A for approximate, CV for cross-validation and R for regularity). In Section 3 the asymptotic distributions of UACVR and of a difference of two UACVR values are given. Section 4 shows how UACVR specializes to particular cases: TIC appears as a special case when cross-entropy is used for defining both estimating and assessment risks, and AIC follows if the models are close to being well specified; other important cases, where the estimating and assessment risks are defined in a less symmetric way, are also given. Section 5 presents a simulation study. Section 6 presents an illustration of the use of UACVR for comparing estimators derived from threshold models and estimators obtained by continuous approximations in the case of ordered categorical data with repeated measurements; these data are psychometric scores from a large study on cognitive aging. Section 7 concludes.

## 2.1 The estimating risk and its estimation by an estimating function

Suppose that a sample of independent identically distributed (i.i.d.) variables $\mathcal{O}_n = (Y_i, i=1,\dots,n)$ is available. Based on $\mathcal{O}_n$, an estimator $g^{\hat{\theta}}$ (where $\hat{\theta}$ is short for $\hat{\theta}_n$) of the probability density function $f^*$ of the true distribution can be chosen in a model, that is, a family of distributions $(g^{\theta})_{\theta\in\Theta}$, $\Theta\subset\Re^p$. The main rules for designing estimators of $\theta$ can be thought of as minimizing an estimating risk. The estimating risk $\Phi(\theta)$ is defined as the expectation under the true distribution of a loss function $\phi(\theta, Y_i)$: $\Phi(\theta) = \mathrm{E}_*\{\phi(\theta, Y_i)\}$. We would like to choose $g^{\theta_0}$, where $\theta_0 = \mathrm{argmin}_\theta\, \Phi(\theta)$. For making consistent estimation possible, it is natural to require that whenever the model is well specified, the risk is minimized by the true distribution. Precisely, saying that the model is well specified amounts to saying that there is a value $\theta_*$ such that $g^{\theta_*} = f^*$. We then require that $\theta_* = \mathrm{argmin}_\theta\, \Phi(\theta)$; moreover, we require that this minimum be unique.
This is related to the concept of strictly proper scoring rules [16]. In the scoring rule literature, the problem is formulated in terms of reward rather than loss; a correspondence between the two theories is obtained by considering minus a loss as a reward, and of course while one tries to minimize the expected loss, one tries to maximize the expected reward.

We cannot compute the estimating risk, but a natural estimator of it is the estimating function $\Phi_{\mathcal{O}_n}(\theta) = n^{-1}\sum_{i=1}^n \phi(\theta, Y_i)$. The estimator $\hat{\theta}$ defined as minimizing $\Phi_{\mathcal{O}_n}(\theta)$ is called an M-estimator. By the law of large numbers, $\Phi_{\mathcal{O}_n}(\theta)$ converges in probability toward $\Phi(\theta) = \mathrm{E}_*\{\phi(\theta, Y_i)\}$. Under conditions given in van der Vaart [17] (see, e.g., Theorem 5.7), $\hat{\theta}$ converges in probability toward $\theta_0$. A simple set of sufficient conditions is that $\Theta$ is compact, $\Phi(\theta)$ is continuous with a unique minimizer, and $\phi(\theta, y)$ is continuous in $\theta$ for every $y$.

Example 1: If we take as loss function $\phi(\theta, Y_i) = [Y_i - E_{g^\theta}(Y_i)]^2$, the estimating risk is $\Phi(\theta) = E_*[Y_i - E_{g^\theta}(Y_i)]^2$; the estimating function is $\Phi_{\mathcal{O}_n}(\theta) = n^{-1}\sum_{i=1}^n [Y_i - E_{g^\theta}(Y_i)]^2$ and $\hat{\theta}$ is the least-squares estimator.

Example 2: If we take as loss function $\phi(\theta, Y_i) = -\log g^\theta(Y_i)$, the estimating risk is $\Phi(\theta) = E_*[-\log g^\theta(Y_i)]$, which is the cross-entropy of $g^\theta$ with respect to $f^*$; the estimating function is $\Phi_{\mathcal{O}_n}(\theta) = -n^{-1}\sum_{i=1}^n \log g^\theta(Y_i)$ and $\hat{\theta}$ is the MLE.
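Both examples can be reproduced numerically as M-estimations. The sketch below is our own illustration (simulated data, a generic optimizer, and a hypothetical normal model for Example 2, with additive constants dropped from the log-likelihood); it is not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=200)   # simulated sample O_n

# Example 1: squared-error loss, Phi_On(theta) = n^{-1} sum (Y_i - mu)^2
def ls_risk(theta):
    return np.mean((y - theta[0]) ** 2)

# Example 2: negative log-likelihood loss under a N(mu, sigma^2) model
# (additive constants dropped; they do not change the minimizer)
def nll_risk(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)      # log-parameterisation keeps sigma > 0
    return np.mean(np.log(sigma) + 0.5 * ((y - mu) / sigma) ** 2)

theta_ls = minimize(ls_risk, x0=[0.0]).x         # least-squares estimator
theta_ml = minimize(nll_risk, x0=[0.0, 0.0]).x   # MLE

# For both losses the M-estimator of mu is the sample mean
assert abs(theta_ls[0] - y.mean()) < 1e-3
assert abs(theta_ml[0] - y.mean()) < 1e-3
```

Here both estimating functions happen to share the same minimizer for the location parameter; in general the two loss functions lead to different estimators.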

## 2.2 The assessment risk and its estimation by cross-validation

When several estimators are available, we wish to assess their performance by estimating an assessment risk. Estimators with small assessment risks will be preferred. For constructing the risk of an estimator $g^{\hat{\theta}}$ we may use a loss function $\psi(g^{\hat{\theta}}, Y)$. The assessment risk is the expectation under $f^*$ of $\psi(g^{\hat{\theta}}, Y)$, where both $Y$ and $g^{\hat{\theta}}$ are random:

$$\Psi(g^{\hat{\theta}}) = E_*\{\psi(g^{\hat{\theta}}, Y)\}. \qquad (1)$$

The problem is to estimate the assessment risk (without knowing the true density $f^*$).
A natural, albeit naive, estimator is

$$\Psi_{\mathcal{O}_n}(g^{\hat{\theta}}) = n^{-1}\sum_{i=1}^n \psi(g^{\hat{\theta}}, Y_i). \qquad (2)$$

However, $\Psi_{\mathcal{O}_n}(g^{\hat{\theta}})$ is not completely satisfying because it does not take into account that $g^{\hat{\theta}}$ depends on the observations; as a result, $\Psi_{\mathcal{O}_n}(g^{\hat{\theta}})$ underestimates $\Psi(g^{\hat{\theta}})$ (the well-known over-optimism bias).

If another sample $\mathcal{O}'_n = (Y'_i, i=1,\dots,n)$, distributed as and independent of $\mathcal{O}_n$, were available, a natural estimator of the assessment risk would be $\Psi_{\mathcal{O}'_n}(g^{\hat{\theta}}) = n^{-1}\sum_{i=1}^n \psi(g^{\hat{\theta}}, Y'_i)$. We call $\Psi_{\mathcal{O}'_n}(g^{\hat{\theta}})$ the "oracle estimator." It is an unbiased estimator of the assessment risk but cannot be computed from $\mathcal{O}_n$ alone. Its variance is $\mathrm{var}_*\,\Psi_{\mathcal{O}'_n}(g^{\hat{\theta}}) = n^{-1}\mathrm{var}_*\{\psi(g^{\hat{\theta}}, Y'_i)\mid\hat{\theta}\}$, which tends toward $n^{-1}\kappa_*^2$, where $\kappa_*^2$ is the variance of $\psi(g^{\theta_0}, Y'_i)$.

A pseudo-oracle estimator of the assessment risk is often used by practitioners who split the original sample into a training and a validation sample. However, this practice entails a loss of efficiency, since only half of the data is used for computing the estimator $g^{\hat{\theta}}$ and only half for estimating its assessment risk. Cross-validation estimators of the assessment risk make more efficient use of the information. In particular, the leave-one-out cross-validation criterion is

$$\mathrm{CV}(g^{\hat{\theta}}) = n^{-1}\sum_{i=1}^n \psi(g^{\hat{\theta}_{-i}}, Y_i),$$

where $\hat{\theta}_{-i} = \mathrm{argmin}\,\Phi_{\mathcal{O}_{n|i}}$ and $\Phi_{\mathcal{O}_{n|i}} = \frac{1}{n-1}\sum_{j\neq i}\phi(\theta, Y_j)$. $\mathrm{CV}(g^{\hat{\theta}})$ does nearly as well as if another sample $\mathcal{O}'_n$ were available, in terms of both bias and variance. Indeed, it can immediately be seen that $E\{\mathrm{CV}(g^{\hat{\theta}})\} = \Psi(g^{\hat{\theta}_{n-1}})$.
We shall see in Section 3 that the asymptotic variance of the approximate cross-validation criterion $\mathrm{UACVR}(g^{\hat{\theta}})$ is precisely $n^{-1}\kappa_*^2$, the same as that of the oracle estimator.

For comparing two estimators, the difference of assessment risks is relevant. This can be estimated by the difference of cross-validation estimates of the assessment risks.
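As a concrete illustration of the leave-one-out criterion and of the over-optimism of the naive estimator, the following sketch (our own example, with squared-error loss for both $\phi$ and $\psi$ and a one-parameter location model) refits the M-estimator $n$ times:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(size=100)
n = len(y)

def phi(theta, yi):   # estimating loss phi(theta, Y_i): squared error
    return (yi - theta) ** 2

def psi(theta, yi):   # assessment loss psi(g^theta, Y_i); here psi = phi
    return (yi - theta) ** 2

def fit(sample):      # M-estimator: minimise the estimating function
    return minimize(lambda t: np.mean(phi(t[0], sample)), x0=[0.0]).x[0]

# Leave-one-out CV: refit n times, score each held-out observation
cv = np.mean([psi(fit(np.delete(y, i)), y[i]) for i in range(n)])

# Naive estimator Psi_On reuses the data and is over-optimistic
naive = np.mean(psi(fit(y), y))
assert naive < cv
```

For this particular loss the estimator is the sample mean, so the $n$ refits have a closed form; in general each refit requires a full optimization run, which motivates the approximation proposed below.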

## 2.3 The universal approximate cross-validation criterion

The leave-one-out cross-validation criterion may be computationally demanding, since the maximization algorithm must be run $n$ times to find the $\hat{\theta}_{-i}$, $i=1,\dots,n$. For this reason an approximate formula is very useful. In this section we propose a universal approximate cross-validation (UACVR) criterion for regular loss functions $\phi$ and $\psi$.

Definition 1 (Universal approximate cross-validation criterion)

$$\mathrm{UACVR}(g^{\hat{\theta}}) = \Psi_{\mathcal{O}_n}(g^{\hat{\theta}}) + \mathrm{Trace}\big(H_{\Phi_{\mathcal{O}_n}}^{-1} K\big), \qquad (3)$$

where $H_{\Phi_{\mathcal{O}_n}} = \frac{\partial^2 \Phi_{\mathcal{O}_n}}{\partial\theta^2}\big|_{\hat{\theta}}$ and $K = n^{-1}\sum_{i=1}^n \hat{v}_i \hat{d}_i^T$, with

$$\hat{v}_i = \frac{\partial \psi(g^\theta, Y_i)}{\partial\theta}\Big|_{\hat{\theta}} \quad \text{and} \quad \hat{d}_i = \frac{1}{n-1}\frac{\partial \phi(\theta, Y_i)}{\partial\theta}\Big|_{\hat{\theta}}.$$

The leading term in eq. (3) is the naive estimator of $\Psi(g^{\hat{\theta}})$ defined in eq. (2), while the second term is a correction accounting for parameter estimation.
This correction term involves $H_{\Phi_{\mathcal{O}_n}}$, the Hessian of the estimating function, together with $\hat{v}_i$ and $\hat{d}_i$, the gradients of the assessment and estimating loss functions (up to the multiplicative constant $1/(n-1)$ for the latter).

Under regularity assumptions on $\phi(\cdot,\cdot)$ and $\psi(\cdot,\cdot)$, the leave-one-out cross-validation criterion differs from UACVR by an asymptotically negligible term in $o_p(n^{-1})$, which makes UACVR a good approximation when $n$ is large enough that leave-one-out cross-validation becomes computationally too demanding. The regularity conditions are detailed in the Appendix; essentially they are: A1: $\Phi(\theta)$ has a unique minimizer; A2: $\phi(\theta, y)$ is three times differentiable in $\theta$; A3: $\psi(g^\theta, y)$ is twice differentiable in $\theta$.

Theorem 1: Under assumptions A1, A2 and A3, we have

$$\mathrm{CV}(g^{\hat{\theta}}) = \Psi_{\mathcal{O}_n}(g^{\hat{\theta}}) + \mathrm{Trace}\big(H_{\Phi_{\mathcal{O}_n}}^{-1} K\big) + o_p(n^{-1}). \qquad (4)$$

UACVR applies only to regular parametric problems. Thus it does not apply to non- or semi-parametric estimators, and more generally to singular problems as treated by Watanabe [18]. Also, some assessment functions do not satisfy the regularity assumptions: for instance, a non-parametric estimator of the area under the ROC curve can be used for assessing the discriminating ability of an estimator, but it is not continuous in the parameter $\theta$. Nevertheless, UACVR is useful in various important contexts, as detailed in Section 4, including penalized likelihood estimators approximated on a spline basis, which is a way to avoid strong parametric assumptions.
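To see eq. (3) at work, the sketch below (our own illustration, not from the paper) takes $\phi = \psi =$ squared error for a one-parameter location model, where $\hat{\theta}$ is the sample mean and exact leave-one-out CV has the closed form $(n/(n-1))^2\,\Psi_{\mathcal{O}_n}$; the UACVR correction recovers it up to $O(n^{-2})$:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=200)
n = len(y)
theta_hat = y.mean()          # minimiser of Phi_On for squared-error loss
resid = y - theta_hat

# Naive term Psi_On(g^theta_hat) of eq. (3)
psi_naive = np.mean(resid ** 2)

# Hessian of Phi_On at theta_hat: d^2/dtheta^2 of mean (y - theta)^2 = 2
H = np.array([[2.0]])
v = -2.0 * resid              # v_i: gradient of the assessment loss
d = v / (n - 1)               # d_i: gradient of phi, scaled by 1/(n-1)
K = np.array([[np.mean(v * d)]])

uacvr = psi_naive + np.trace(np.linalg.inv(H) @ K)

# Exact leave-one-out CV for the sample mean: (n/(n-1))^2 * Psi_On
cv_exact = (n / (n - 1)) ** 2 * psi_naive
assert abs(uacvr - cv_exact) < 1e-3   # difference is O(n^-2)
```

In this one-dimensional case the trace is a scalar product; for $p$ parameters $H$ and $K$ are $p \times p$ matrices obtained from the same derivatives.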

## 3.1 Asymptotic distribution of UACVR

Commenges et al. [19], using results of Vuong [20], studied the asymptotic distribution of a difference of normalized AICs as an estimator of a difference of Kullback–Leibler risks; the normalized AIC is defined as $\frac{1}{2n}\mathrm{AIC}$. Here similar arguments are applied to derive the asymptotic distributions of UACVR and of a difference of two UACVR values. By the continuous mapping theorem, the asymptotic distribution of $\mathrm{UACVR}(g^{\hat{\theta}})$ is the same as that of $\Psi_{\mathcal{O}_n}(g^{\theta_0})$. Since the latter quantity is a mean, it immediately follows by the central limit theorem that

$$n^{1/2}\big\{\mathrm{UACVR}(g^{\hat{\theta}}) - \Psi(g^{\theta_0})\big\} \xrightarrow{D} N(0, \kappa_*^2), \qquad (5)$$

where $\kappa_*^2 = \mathrm{var}_*\,\psi(g^{\theta_0}, Y)$ and $\mathrm{var}_*$ stands for the variance under the true distribution.
We can also write

$$n^{1/2}\big\{\mathrm{UACVR}(g^{\hat{\theta}}) - \Psi(g^{\hat{\theta}})\big\} \xrightarrow{D} N(0, \kappa_*^2), \qquad (6)$$

and $\kappa_*^2$ can be estimated by the empirical variance of $\psi(g^{\hat{\theta}}, Y_i)$, $i=1,\dots,n$.

## 3.2 Asymptotic distribution of a difference between UACVR values of two estimators

If two estimators $g^{\hat{\theta}}$ and $h^{\hat{\gamma}}$ are available, we would like to know which is the better according to the chosen assessment risk. Thus, we have to estimate the difference of their assessment risks: $\Delta^\psi(g^{\hat{\theta}}, h^{\hat{\gamma}}) = \Psi(g^{\hat{\theta}}) - \Psi(h^{\hat{\gamma}})$. The obvious estimator is $D_{\mathrm{UACVR}}(g^{\hat{\theta}}, h^{\hat{\gamma}}) = \mathrm{UACVR}(g^{\hat{\theta}}) - \mathrm{UACVR}(h^{\hat{\gamma}})$. We focus on the case where $g^{\theta_0} \neq h^{\gamma_0}$.
In that case, using the same arguments as above, we obtain

$$n^{1/2}\big\{D_{\mathrm{UACVR}}(g^{\hat{\theta}_n}, h^{\hat{\gamma}_n}) - \Delta^\psi(g^{\hat{\theta}_n}, h^{\hat{\gamma}_n})\big\} \xrightarrow{D} N(0, \omega_*^2), \qquad (7)$$

where $\omega_*^2 = \mathrm{var}_*\{\psi(g^{\theta_0}, Y) - \psi(h^{\gamma_0}, Y)\}$, which can be estimated by the empirical variance of $\psi(g^{\hat{\theta}}, Y_i) - \psi(h^{\hat{\gamma}}, Y_i)$.

Based on the same type of results, Commenges et al. [19] proposed to construct a "tracking interval" for a difference of normalized AIC values. The tracking interval is a kind of confidence interval for the difference of risks. Because the variability of estimators of differences of risks is rather large in general, it is useful to have an interval estimate rather than just a point estimate. However, in the conventional theory of point and interval estimation the target parameter is fixed, whereas here it changes with $n$. Thus we have a moving target, hence the name "tracking interval." Simulations in Commenges et al. [19] showed that the variance of the difference of AICs was correctly estimated and that the corresponding tracking interval had good coverage properties. The same idea can be applied in the more general case treated here. The tracking interval is given by $(A_n, B_n)$, where $A_n = D_{\mathrm{UACVR}}(g^{\hat{\theta}_n}, h^{\hat{\gamma}_n}) - z_{\alpha/2}\, n^{-1/2}\hat{\omega}_n$ and $B_n = D_{\mathrm{UACVR}}(g^{\hat{\theta}_n}, h^{\hat{\gamma}_n}) + z_{\alpha/2}\, n^{-1/2}\hat{\omega}_n$, with $z_u$ the $u$th quantile of the standard normal distribution.
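The interval $(A_n, B_n)$ is straightforward to compute from the per-observation assessment losses of the two estimators. The helper below is a minimal sketch; the function name, its arguments, and the use of the naive loss difference as default point estimate are our own illustrative choices (in a full implementation the point estimate would be $D_{\mathrm{UACVR}}$, including the trace corrections).

```python
import numpy as np
from scipy.stats import norm

def tracking_interval(psi_g, psi_h, d_uacvr=None, alpha=0.05):
    """Tracking interval (A_n, B_n) for a difference of assessment risks.

    psi_g, psi_h: per-observation losses psi(g^theta_hat, Y_i) and
    psi(h^gamma_hat, Y_i). If d_uacvr (the difference of the two UACVR
    values, trace corrections included) is not supplied, the naive
    difference of mean losses is used as the point estimate.
    """
    diff = np.asarray(psi_g) - np.asarray(psi_h)
    n = len(diff)
    if d_uacvr is None:
        d_uacvr = diff.mean()
    omega_hat = diff.std(ddof=1)          # empirical estimate of omega_*
    half = norm.ppf(1 - alpha / 2) * omega_hat / np.sqrt(n)
    return d_uacvr - half, d_uacvr + half
```

A strictly positive interval favors $h^{\hat{\gamma}}$ (its risk is smaller), a strictly negative one favors $g^{\hat{\theta}}$, and an interval containing zero leaves the comparison undecided at the chosen level.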

Note that $\omega_*$ is in general much smaller than $\kappa_*$. This was shown by Commenges et al. [13] for the expected cross-entropy assessment risk, and comes from the fact that $\psi(g^{\hat{\theta}}, Y_i)$ and $\psi(h^{\hat{\gamma}}, Y_i)$ are often positively correlated.

## 4 Particular cases of UACVR

In this section we give seven frameworks in which UACVR applies (a non-exhaustive list).

## 4.1 MLEs and information assessment risk: TIC and AIC

Suppose we take $\phi(\theta, Y_i) = \psi(g^\theta, Y_i) = -\log g^\theta(Y_i)$. Then the estimating function is minus the (normalized) log-likelihood. It estimates the estimating risk, here the cross-entropy [21] of $g^\theta$ with respect to the true density $f^*$:

$$\mathrm{E}_*\{-\log g^\theta(Y)\} = H(f^*) + \mathrm{KL}(g^\theta; f^*),$$

where $H(f^*) = -E_*\{\log f^*(Y)\}$ is the entropy of $f^*$ and $\mathrm{KL}(g^\theta; f^*) = E_*\big\{\log\frac{f^*(Y)}{g^\theta(Y)}\big\}$ is the Kullback–Leibler divergence of $g^\theta$ relative to $f^*$. The assessment risk is here the expected cross-entropy:

$$\mathrm{ECE}(g^{\hat{\theta}}) = E_*\big[E_*\{-\log g^{\hat{\theta}}(Y)\mid\mathcal{O}_n\}\big] = H(f^*) + \mathrm{EKL}(g^{\hat{\theta}}; f^*), \qquad (8)$$

where $\mathrm{EKL}(g^{\hat{\theta}}; f^*) = E_*\big\{\log\frac{f^*(Y)}{g^{\hat{\theta}}(Y)}\big\}$ is the expected Kullback–Leibler risk.
It differs from the conventional Kullback–Leibler risk defined for a fixed density because it is applied here to an estimator; it was mentioned by Hall [22] under the name "expected Kullback–Leibler loss." So, although the loss functions for estimating and assessment are the same, there is a dissymmetry: the estimating risk is a cross-entropy, while, because $g^{\hat{\theta}}$ is random, the assessment risk is an expected cross-entropy.

In that case the leading term of eq. (3) is minus the maximized (normalized) log-likelihood. Moreover, $\hat v_i$ is the individual score and $\hat d_i = \frac{1}{n-1}\hat v_i$, so that UACVR is identical to a normalized version of TIC [5]. If the model is well specified, $K$ tends in probability toward $I(\theta_0)$. The Hessian $H_{\Phi_{\mathcal{O}_n}}$ also tends toward $I(\theta_0)$, so that the correction term tends toward $p$, the number of parameters. Thus, if the model is not too badly specified, TIC is approximately equal to AIC. We have $\mathrm{UACVR} = \frac{1}{2n}\mathrm{TIC} \approx \frac{1}{2n}\mathrm{AIC}$, and this estimates the expected cross-entropy of the estimator, $\mathrm{ECE}(g^{\hat\theta})$. In practice, Burnham and Anderson [23] do not recommend the use of TIC when $n$ is small because of the variability of the correction term. On the other hand, Konishi and Kitagawa [5] show (see their Table 3.3) that the correction terms can be rather different when the models are misspecified.
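To make the correction term concrete, here is a minimal numerical sketch (not the paper's implementation) of $\operatorname{tr}(H^{-1}K)$ for an MLE, using a hypothetical $N(\mu, \sigma^2)$ model parameterized as $\theta = (\mu, \log\sigma)$; for a well-specified model the trace should be close to $p = 2$, as argued above:

```python
import numpy as np

def nll_i(theta, y):
    """Per-observation negative log-likelihood of N(mu, sigma^2)."""
    mu, log_sig = theta
    return 0.5 * np.log(2 * np.pi) + log_sig + 0.5 * ((y - mu) / np.exp(log_sig)) ** 2

def num_grad(f, theta, eps=1e-5):
    """Central-difference gradient."""
    g = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta); e[k] = eps
        g[k] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, size=200)
theta_hat = np.array([y.mean(), np.log(y.std())])   # closed-form MLE

n, p = y.size, 2
# K: average outer product of the individual scores
scores = np.array([num_grad(lambda t: nll_i(t, yi), theta_hat) for yi in y])
K = scores.T @ scores / n

def Phi(t):                                          # normalized estimating function
    return np.mean([nll_i(t, yi) for yi in y])

H = np.zeros((p, p))                                 # numerical Hessian of Phi
for k in range(p):
    e = np.zeros(p); e[k] = 1e-4
    H[:, k] = (num_grad(Phi, theta_hat + e) - num_grad(Phi, theta_hat - e)) / 2e-4

trace_term = np.trace(np.linalg.solve(H, K))         # close to p when well specified
uacvr = Phi(theta_hat) + trace_term / n              # normalized TIC
```

Here both $H$ and $K$ estimate $I(\theta_0)$, so the trace sits near 2 and the criterion agrees with the normalized AIC up to sampling noise.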

## 4.2 M-estimators and information assessment risk: GIC

Konishi and Kitagawa [3] generalized TIC and AIC to cases where $g^{\hat\theta}$ is an M-estimator. The criterion they proposed, obtained by correcting the bias of the log-likelihood, is the GIC. GIC is also a special case of UACVR, obtained when the assessment risk is the expected cross-entropy. They apply GIC in particular to penalized likelihood estimators. Thus UACVR, like GIC, can be applied to maximum a posteriori, maximum penalized likelihood and hierarchical likelihood estimators.

## 4.3 Restricted AIC

Liquet and Commenges [24] proposed a modification of AIC and LCV for the case where estimators are based on the full information while they are assessed on a smaller (more targeted) information. More specifically, the estimator is based on the sample $\mathcal{O}_n = (Y_i, i = 1, \dots, n)$ but the assessment risk is based on a random variable $Z$ which is a coarsened version of $Y$; for instance, $Z$ is a dichotomization of $Y$: $Z = 1_{Y > l}$. For this case, the restricted AIC (RAIC) was derived both by direct approximation of the risk and by approximation of the LCV. RAIC is the particular case of $\mathrm{UACVR}$ with $\varphi(\theta, Y_i) = -\log g^{\theta}(Y_i)$ and $\psi(g^{\theta}, Y_i) = -\log g^{\theta}(Z_i)$.
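The estimation/assessment split behind RAIC can be sketched as follows (an illustration of the loss functions only, not the RAIC formula itself): a hypothetical Gaussian MLE uses the full observations $Y_i$, while the assessment loss is evaluated on the coarsened $Z_i = 1_{Y_i > l}$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, size=500)
mu_hat, sig_hat = y.mean(), y.std()            # MLE from the full sample O_n

l = 0.0
z = (y > l).astype(float)                      # coarsened variable Z_i = 1_{Y_i > l}
p1 = 1.0 - norm.cdf((l - mu_hat) / sig_hat)    # model probability P(Z = 1)
# assessment loss psi: -log g^theta(Z_i), averaged over the sample
assess_loss = -np.mean(z * np.log(p1) + (1.0 - z) * np.log(1.0 - p1))
```

The same fitted parameters enter both losses; only the level of coarsening of the observation differs.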

## 4.4 Estimator assessment by CRPS

Gneiting and Raftery [16] studied scoring rules, in particular the CRPS. Its negative, which can be used as a loss function, is defined as

$$\mathrm{CRPS}^{*}(G(\cdot,\theta), Y) = \int_{-\infty}^{+\infty} \left\{G(u,\theta) - 1_{u \ge Y}\right\}^2 du,$$

where $G(\cdot,\theta)$ is the cumulative distribution function (c.d.f.) of a distribution in the model. The risk is a Cramér–von Mises-type distance: $d(G, G^{*}) = \int \{G(u) - G^{*}(u)\}^2 du$. In some cases it may be interesting to assess MLEs using this assessment risk rather than the logarithmic loss, which may be too sensitive to low values of the density. UACVR can be used for estimating this risk. In that case, the leading term of $\mathrm{UACVR}$ is $n^{-1}\sum_{i=1}^{n} \mathrm{CRPS}^{*}(G(\cdot,\hat\theta), Y_i)$; for the correcting term, $H_{\Phi_{\mathcal{O}_n}}$ is the Hessian of the log-likelihood (since $\hat\theta$ is the MLE) and $K$ must be computed with

$$\hat v_i = \frac{\partial \psi}{\partial \theta}\Big|_{\hat\theta} = 2\int_{-\infty}^{+\infty} \left\{G(u,\hat\theta) - 1_{u \ge Y_i}\right\}\frac{\partial G(u,\theta)}{\partial \theta}\Big|_{\hat\theta} du;$$

$\hat d_i$ is the individual score (gradient of the individual log-likelihood) divided by $n-1$. Thus the computation of $\hat v_i$, for each $i$, involves the computation of $p$ simple integrals, which can be done numerically.
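As a sketch of the numerical integration involved, the loss $\mathrm{CRPS}^{*}$ can be computed on a grid for a Gaussian model $G = N(\mu, \sigma^2)$ (a hypothetical choice for illustration) and checked against the closed form for the normal case given by Gneiting and Raftery:

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

def crps_star(mu, sigma, y, lo=-15.0, hi=15.0, m=40001):
    """CRPS*(G(., theta), Y) = int (G(u) - 1_{u >= Y})^2 du on a dense grid."""
    u = np.linspace(lo, hi, m)
    G = norm.cdf(u, loc=mu, scale=sigma)
    ind = (u >= y).astype(float)
    return trapezoid((G - ind) ** 2, u)

def crps_normal(mu, sigma, y):
    """Closed form for a normal forecast: sigma*(z(2*Phi(z)-1) + 2*phi(z) - 1/sqrt(pi))."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

val_num = crps_star(0.0, 1.0, 0.7)
val_cf = crps_normal(0.0, 1.0, 0.7)
```

The same grid-based scheme applies to each of the $p$ integrals appearing in $\hat v_i$, with the integrand multiplied by the gradient of $G$ with respect to $\theta$.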

## 4.5 Estimator assessment by Brier score

The Brier score [25] can be used to assess estimators of the distribution of a categorical variable, say $Y$, taking values $1, \dots, m$. Consider a model for this distribution and write $g^{\theta}(j) = P(Y = j)$. The Brier score is defined as $\sum_{j=1}^{m} (\delta_{Y,j} - g^{\theta}(j))^2$, where $\delta$ is the Kronecker symbol ($\delta_{Y,j} = 1$ if $Y = j$, zero otherwise). Assume that we estimate $\theta$ by maximum likelihood and use the Brier score for assessment. In this case, the leading term of $\mathrm{UACVR}$ is $n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{m} (\delta_{Y_i,j} - g^{\hat\theta}(j))^2$; for the correcting term, $H_{\Phi_{\mathcal{O}_n}}$ is the Hessian of the log-likelihood (since $\hat\theta$ is the MLE) and $K$ must be computed with

$$\hat v_i = \frac{\partial \psi}{\partial \theta}\Big|_{\hat\theta} = -2\,\frac{\partial g^{\theta}}{\partial \theta}\Big|_{\hat\theta}(Y_i) + 2\sum_{j=1}^{m} g^{\hat\theta}(j)\,\frac{\partial g^{\theta}}{\partial \theta}\Big|_{\hat\theta}(j);$$

$\hat d_i$ is the individual score (gradient of the individual log-likelihood) divided by $n-1$.
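The analytic form of $\hat v_i$ for the Brier loss is easy to check against numerical differentiation. A sketch for a hypothetical 3-class model with softmax probabilities $g^{\theta}(j)$ (last logit fixed at 0 for identifiability; all names are illustrative):

```python
import numpy as np

def probs(theta):
    """Class probabilities g^theta(j) via softmax with the last logit fixed at 0."""
    z = np.append(theta, 0.0)
    e = np.exp(z - z.max())
    return e / e.sum()

def brier(theta, y):
    """Brier loss sum_j (delta_{Y,j} - g^theta(j))^2 for an observation y."""
    g = probs(theta)
    delta = np.eye(g.size)[y]
    return np.sum((delta - g) ** 2)

def v_i(theta, y):
    """Analytic gradient: -2 dg/dtheta(y) + 2 sum_j g(j) dg/dtheta(j)."""
    g = probs(theta)
    J = (np.diag(g) - np.outer(g, g))[:, :-1]   # Jacobian dg_j / dtheta_k
    return -2 * J[y] + 2 * g @ J

theta, y, eps = np.array([0.3, -0.2]), 1, 1e-6
num = np.array([(brier(theta + e, y) - brier(theta - e, y)) / (2 * eps)
                for e in np.eye(2) * eps])      # central differences
```

The numerical gradient `num` matches `v_i(theta, y)` to high precision.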

## 4.6 Conditional AIC

A referee suggested that UACVR might be useful for selecting random effect models based on conditional assessment functions, that is, when the target is the density conditional on the random effects. The conditional Akaike criterion was proposed by Vaida and Blanchard [26]; Greven and Kneib [27] proposed a correction taking into account the uncertainty on the covariance matrix of the random effects; Braun et al. [28] proposed a predictive cross-validation criterion. UACVR could directly apply to this case by considering that the assessment loss is $-\log g^{\theta}(Y \mid \hat b)$, where $b$ is the random effect and $\hat b$ its estimator. Since $\hat b$ is a function of $\theta$ and $Y$, the assessment loss can indeed be written $\psi(\theta, y)$. For computing UACVR, the main task here would be to compute the gradient $\frac{\partial \psi(\theta, Y_i)}{\partial \theta}$, not forgetting the dependence of $\hat b$ on $\theta$. This can easily be done by numerical differentiation.
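A toy sketch of this numerical differentiation, for a hypothetical one-observation model $Y \mid b \sim N(b, \sigma^2)$ with a random intercept $b \sim N(0, \tau^2)$ and $\theta = (\log\sigma, \log\tau)$ (all choices are illustrative). Because $\hat b$ is recomputed inside $\psi$ at each perturbed $\theta$, the dependence of $\hat b$ on $\theta$ is automatically accounted for:

```python
import numpy as np

def b_hat(theta, y):
    """Posterior mode of b given y (shrinkage toward 0)."""
    sig2, tau2 = np.exp(2 * theta)
    return tau2 / (tau2 + sig2) * y

def psi(theta, y):
    """Conditional assessment loss: -log g^theta(y | b_hat(theta, y))."""
    sig2 = np.exp(2 * theta[0])
    b = b_hat(theta, y)
    return 0.5 * np.log(2 * np.pi * sig2) + 0.5 * (y - b) ** 2 / sig2

def grad_psi(theta, y, eps=1e-6):
    """Central-difference gradient of psi with respect to theta."""
    g = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta); e[k] = eps
        g[k] = (psi(theta + e, y) - psi(theta - e, y)) / (2 * eps)
    return g

g = grad_psi(np.array([0.0, 0.5]), y=1.3)
```

The step size matters little here; halving or tenfold-reducing `eps` changes the result only marginally, as expected for central differences.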

## 4.7 Estimators based on continuous approximation of categorical data

Assume $Y$ is an ordered categorical variable taking values $l = 0, 1, \dots, L$; for simplicity we consider here that $Y$ is univariate. Several models are available for this type of variable. Cumulative probit models, further called "threshold link models," assume that $Y_i = l$ if a latent variable $\Lambda_i$ takes values in the interval $(c_l, c_{l+1})$ for $l = 0, \dots, L$, with $c_0 = -\infty$ and $c_{L+1} = +\infty$:

$$Y_i = \sum_{l=0}^{L} 1_{\{\Lambda_i \in (c_l, c_{l+1})\}}\, l. \tag{9}$$

$\Lambda_i$ itself can be modeled as a noisy linear form of explanatory variables, $\Lambda_i = \beta x_i + \epsilon_i$, where $\epsilon_i$ has a normal distribution with mean zero and variance $\sigma^2$ and the $x_i$ are explanatory variables. The parameters are $\theta = (c_1, \dots, c_L, \beta, \sigma)$. For identifiability one must add some constraints, for instance $\sigma = 1$ and a null intercept in the linear model for $\Lambda_i$. An estimator of the distribution can be obtained by maximum likelihood, leading to $g^{\hat\theta}$. The assessment risk can be $\mathrm{ECE}(g^{\hat\theta})$. Note that since $Y$ is discrete, the densities are defined with respect to a counting measure; that is, $g^{\hat\theta}(l)$ is the probability that $Y = l$.
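The probabilities implied by eq. (9) with $\sigma = 1$ are simply differences of normal c.d.f. values, $P(Y = l) = \Phi(c_{l+1} - \beta x) - \Phi(c_l - \beta x)$ with $c_0 = -\infty$ and $c_{L+1} = +\infty$. A minimal sketch (the helper name is illustrative):

```python
import numpy as np
from scipy.stats import norm

def cumulative_probit_probs(x, beta, c):
    """P(Y = l), l = 0..L, from a cumulative probit model with sigma = 1."""
    cuts = np.concatenate(([-np.inf], np.asarray(c), [np.inf]))
    return np.diff(norm.cdf(cuts - np.dot(beta, x)))

# example with L = 3 thresholds, hence L + 1 = 4 modalities
p = cumulative_probit_probs(x=0.4, beta=1.2, c=[-1.0, 0.0, 1.0])
```

The probabilities are positive and sum to one by construction, since the extreme cuts map to c.d.f. values 0 and 1.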

One may also make a continuous approximation, which leads to simpler computations and may be more parsimonious, especially if $Y$ is multivariate as in the illustration of Section 6. For example, we can consider the model $Y_i = \beta x_i + \epsilon_i$. Maximizing the likelihood of this model for observations of $Y_i$ leads to a probability measure specified by the density $h_c^{\hat\gamma}$. This is, however, a density relative to Lebesgue measure. This probability measure gives zero probability to $\{Y_i = l\}$ for all $l$, which yields an infinite value for ECE (meaning strong rejection of this estimator). However, from $h_c$ a natural estimator of $f^{*}$ can be constructed by gathering at $l$ the mass around $l$: $h^{\hat\gamma}(l) = \int_{l-1/2}^{l+1/2} h_c^{\hat\gamma}(u)\,du$ for $l = 1, \dots, L-1$, with $h^{\hat\gamma}(0) = \int_{-\infty}^{1/2} h_c^{\hat\gamma}(u)\,du$ and $h^{\hat\gamma}(L) = \int_{L-1/2}^{+\infty} h_c^{\hat\gamma}(u)\,du$. UACVR can be computed for this estimator to estimate its ECE. The leading term of $\mathrm{UACVR}(h^{\hat\gamma})$ can be interpreted as the log-likelihood obtained by this estimator with respect to the counting measure. For the correcting term we need the Hessian of the log-likelihood of $h_c^{\hat\gamma}$ and we have to compute $\hat v_i = \frac{\partial \psi(h^{\gamma}, Y_i)}{\partial \gamma}\big|_{\hat\gamma}$. For instance, if $Y_i = l$ for $l = 1, \dots, L-1$, we have

$$\hat v_i = -\frac{\int_{l-1/2}^{l+1/2} \frac{\partial h_c^{\gamma}}{\partial \gamma}\big|_{\hat\gamma}(u)\,du}{\int_{l-1/2}^{l+1/2} h_c^{\hat\gamma}(u)\,du}.$$

Since the denominator is the probability under $h_c^{\hat\gamma}$ that $Y \in (l-1/2, l+1/2)$, $\hat v_i$ can be interpreted as minus the conditional expectation (under $h_c^{\hat\gamma}$) of the individual score. Thus, if $h_c^{\hat\gamma}$ does not vary much on $(l-1/2, l+1/2)$, $\hat v_i$ is close to $-(n-1)\,\hat d_i$. Using the same arguments as in Section 4.1, we obtain that UACVR is close to correcting by the number of parameters as in AIC; such a criterion, which we call AIC$_d$, was proposed by Proust-Lima et al. [29], and this is likely to be a good approximation if the number of modalities of $Y$ is large.
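The mass-gathering construction of $h^{\hat\gamma}$ can be sketched for a hypothetical Gaussian continuous approximation (names and data are illustrative): the fitted normal density is integrated over $(l-1/2,\, l+1/2)$, and the leading term of UACVR is then a log-likelihood with respect to the counting measure:

```python
import numpy as np
from scipy.stats import norm

def discretized_density(mu, sigma, L):
    """h^gamma(l) = integral of the continuous density over (l-1/2, l+1/2), l = 0..L."""
    edges = np.concatenate(([-np.inf], np.arange(L) + 0.5, [np.inf]))
    return np.diff(norm.cdf(edges, loc=mu, scale=sigma))

rng = np.random.default_rng(2)
# simulated ordinal data with L + 1 = 9 modalities (levels 0..8)
y = np.clip(np.round(rng.normal(4.0, 1.5, size=1000)), 0, 8)
mu_hat, sig_hat = y.mean(), y.std()     # Gaussian MLE of the continuous approximation
h = discretized_density(mu_hat, sig_hat, L=8)
# normalized log-likelihood of the discretized estimator w.r.t. the counting measure
loglik_d = np.mean(np.log(h[y.astype(int)]))
```

Because the mass is gathered onto the support of $Y$, `h` is a proper probability vector and `loglik_d` is finite, whereas the Lebesgue density itself would give ECE an infinite value.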

## 5.1 Design

We conducted a simulation study to illustrate the use of UACVR for comparing estimators derived from threshold link models with estimators obtained by a linear continuous approximation in the case of ordered categorical data (see Section 4.7). The aim was to assess the performance of UACVR as an estimator of ECE defined in eq. (8), and to compare it with the normalized naive AIC criterion (denoted AIC) and the normalized AIC criterion computed with respect to the counting measure (denoted AIC$_d$). The performance of these criteria was studied both when the number of modalities ($L+1$) of the response variable $Y$ is small (Section 5.2.1) and when it is large (Section 5.2.2).

## 5.1.1 True distributions

For all the simulations, the data came from a cumulative probit model in which the relationship between $Y_i$ and $\Lambda_i$ is as in eq. (9) and the linear form of $\Lambda_i$ is specified by

$$\Lambda_i = \beta_1 X_i^1 + \beta_2 X_i^2 + \epsilon_i; \qquad i = 1, \dots, n, \tag{10}$$

where $\epsilon_i$ and the two explanatory variables $X_i^1$ and $X_i^2$ were generated from independent standard normal distributions. In order not to disadvantage the linear continuous approximation compared to the threshold link model, the parameters $c_1, \dots, c_L$ were chosen as the solution of the following system:

$$\begin{cases} P(\Lambda_i < c_1) = P(\Lambda_i > c_L) \\ P(\Lambda_i < c_1) = P(c_1 < \Lambda_i < c_2) \\ c_{l+1} = c_l + m \quad \text{with} \quad m = (c_L - c_1)/(L-1). \end{cases}$$
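This system reduces to a one-dimensional root-finding problem: under model (10), $\Lambda_i \sim N(0, s^2)$ with $s^2 = \beta_1^2 + \beta_2^2 + 1$, the first equation forces symmetry ($c_L = -c_1$), and the spacing $m$ is then determined by $c_1$. A sketch of a solver (assuming this reduction; function names are illustrative):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def thresholds(L, s):
    """Solve for c_1, ..., c_L: equal tails, equal first-interval mass, equal spacing."""
    def f(c1):
        m = -2.0 * c1 / (L - 1)          # spacing (c_L - c_1)/(L - 1) with c_L = -c_1
        # P(Lambda < c_1) - P(c_1 < Lambda < c_1 + m)
        return norm.cdf(c1 / s) - (norm.cdf((c1 + m) / s) - norm.cdf(c1 / s))
    c1 = brentq(f, -10.0 * s, -1e-8)     # root of f on (-10 s, 0)
    m = -2.0 * c1 / (L - 1)
    return c1 + m * np.arange(L)

b1, b2 = -1.05, -1.85                    # values used in Section 5.2.1
s = np.sqrt(b1 ** 2 + b2 ** 2 + 1.0)     # sd of Lambda under model (10)
c = thresholds(L=4, s=s)                 # L + 1 = 5 modalities
```

The returned thresholds are symmetric about 0 and equally spaced, and the two tail/first-interval probabilities match by construction of the root.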

## 5.1.2 The different models

For each generated sample, we fitted the cumulative probit model defined previously, and a linear model assuming a linear continuous approximation of the response variable $Y$: $Y_i = \gamma_0 + \gamma_1 X_i^1 + \gamma_2 X_i^2 + \epsilon_i$, with the $\epsilon_i$ independent zero-mean normal variables with variance $\tau^2$. Both models were fitted by maximum likelihood using a Fortran program, which was validated by comparing its results with those obtained with the R package lcmm [30].

Samples of 300, 500 and 3,000 subjects were generated. For all simulations, $N = 10{,}000$ samples were generated. The true assessment risk, ECE, which is available only in a simulation study, was computed by a Monte Carlo approach: for each sample $\mathcal{O}_n^j$ we computed $g^{\hat\theta(j)}$; we generated a large number ($M = 100{,}000$) of observations $Y_k$ independent of $\mathcal{O}_n^j$, $j = 1, \dots, N$; and we estimated ECE by the global mean $\frac{1}{NM}\sum_{j=1}^{N}\sum_{k=1}^{M} -\log g^{\hat\theta(j)}(Y_k)$.
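The Monte Carlo scheme can be sketched for a toy normal model (assumed here for illustration; the paper's simulation uses the cumulative probit model, and $N$ and $M$ are reduced for speed):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N, n, M = 200, 50, 2000                  # far smaller than N = 10,000, M = 100,000
ece = 0.0
for _ in range(N):
    y = rng.normal(0.0, 1.0, size=n)             # training sample O_n^j
    mu, sig = y.mean(), y.std()                  # fitted density g^{theta_hat(j)}
    y_new = rng.normal(0.0, 1.0, size=M)         # fresh observations Y_k
    ece += -np.mean(norm.logpdf(y_new, mu, sig)) # inner mean over the Y_k
ece /= N                                         # outer mean over the samples
```

For a well-specified normal model the result sits slightly above the entropy $H(f^{*}) = \frac{1}{2}\log(2\pi e) \approx 1.419$, the excess being the expected Kullback–Leibler risk of eq. (8).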

## 5.2.1 Small number of modalities

We consider here the case where the number of modalities of $Y$ is relatively small ($L+1 = 5$). In this simulation we fixed $\beta_1 = -1.05$ and $\beta_2 = -1.85$. In Table 1 we present, for different sample sizes $n$, the results for the empirical criteria AIC, AIC$_d$ and UACVR, which can be compared with ECE. For every sample size, the cumulative probit model provided a better ECE than the linear model (positive difference). UACVR had a very small bias for all the sample sizes (of order $10^{-3}$). The two other criteria, AIC and AIC$_d$, were also in favor of the threshold model. However, as expected, the naive normalized AIC did not correctly estimate ECE because it uses the wrong probability measure (Lebesgue measure instead of the counting measure). The criterion AIC$_d$ estimated ECE relatively well, with a small bias between $10^{-3}$ and $10^{-2}$. All the criteria agreed with ECE on the choice of the model.

Table 1:

Performance of the criteria for a small number of modalities (L + 1 = 5) and different sample sizes.

Table 2:

Performance of the criteria for a large number of modalities (L + 1 = 20) and different sample sizes.

Table 3:

Performance of the 95% tracking interval in both situations ($L+1=5$ and $L+1=20$) and for the different sample sizes ($n=300,\phantom{\rule{thickmathspace}{0ex}}\phantom{\rule{thickmathspace}{0ex}}500$ and 3,000).

## 5.2.2 Large number of modalities

We consider here the case where the number of modalities of $Y$ is relatively large ($L+1 = 20$). In this simulation we fixed $\beta_1 = -0.15$ and $\beta_2 = -0.85$. The results are presented in Table 2. For every sample size, the linear model provided a better ECE than the threshold model (negative difference). UACVR had a small bias for all the sample sizes (of order $10^{-3}$ to $10^{-4}$). The AIC$_d$ criterion gave results similar to UACVR, while the AIC criterion failed to find the best estimator (positive difference).

## 5.2.3 Coverage of tracking intervals

Finally, we looked at the coverage of the tracking intervals and at the percentage of cases where 0 was inside the tracking interval. The results are given in Table 3. The coverage rates appear to be too large. We checked that the distributions of UACVR were approximately normal. We found, however, that the estimated standard deviations were too large, by a factor varying from 1.2 to 1.8 for small and large numbers of modalities respectively, and we were unable to find the reason for this overestimation. Nevertheless, the estimate gives the order of magnitude of the variability of UACVR.

For a small number of modalities, 0 was always outside the tracking interval, leading to an unequivocal choice. For a large number of modalities, the percentage increased with $n$. This may seem paradoxical but illustrates well the difference between a tracking interval and a confidence interval. The misspecification risk of the linear model is rather large for a small number of modalities and very small for a large number of modalities; in the latter case, the global risk is driven by the statistical risk. The statistical risk decreases with $n$, so the difference of risks, which is the target, also decreases with $n$, becoming very small for $n = 3{,}000$; in this case the two models are nearly equivalent and there is no point in choosing one over the other according to the chosen risk.

## 6 Illustration on the choice of estimators for psychometric tests

In epidemiological studies, cognition is measured by psychometric tests, which usually consist of the sum of items measuring one or several cognitive domains. A common example is the Mini-Mental State Examination (MMSE) score [31], computed as the sum of 30 binary items evaluating memory, calculation, orientation in space and time, language, and word recognition; for this reason it is called a "sumscore" and ranges from 0 to 30. Although psychometric tests are in essence ordered categorical data, they are most often analyzed as continuous data. Indeed, they usually have a large number of levels and, especially in longitudinal studies, models for categorical data are numerically complex. Recently, Proust-Lima et al. [29] defined a latent process mixed model for analyzing repeated measures of discrete outcomes involving either a threshold link model or an approximation of it using continuous parameterized increasing functions. Comparison of models assuming either categorical data (using the threshold model) or continuous data (using continuous functions) was done with AIC$_d$, computed with respect to the counting measure. In this illustration we use UACVR to compare such latent process mixed models, assuming either continuous or ordered categorical data, applied to the repeated measures of the MMSE and its calculation subscore in a large sample from a French prospective cohort study.

## 6.1 Latent process mixed models

In brief, the latent process mixed model assumes that a latent process $\left({\mathrm{\Lambda }}_{i}^{\ast }\left(t\right){\right)}_{t\ge 0}$ underlies the repeated measures of the observed variable ${Y}_{ij}$ for subject i ($i=1,...,n$) and occasion j ($j=1,...,{n}_{i}$). The latent process ${\mathrm{\Lambda }}_{i}^{\ast }\left(t\right)$ is defined as a standard linear mixed model: ${\mathrm{\Lambda }}_{i}^{\ast }\left(t\right)={X}_{i}\left(t{\right)}^{T}\mathrm{\beta }+{Z}_{i}\left(t{\right)}^{T}{b}_{i}$ for $t\ge 0$ where ${X}_{i}\left(t\right)$ and ${Z}_{i}\left(t\right)$ are distinct vectors of time-dependent covariates associated, respectively, with the vector of fixed effects $\mathrm{\beta }$ and the vector of random effects ${b}_{i}$ (${b}_{i}\sim \mathcal{N}\left(\mathrm{\mu },D\right)$). We further assume that ${b}_{i0}$, the first component of ${b}_{i}$ that usually represents the random intercept, is $\mathcal{N}\left(0,1\right)$ for identifiability; except for the variance of ${b}_{i0}$, D is an unstructured variance matrix.

A measurement model links the latent process with the observed repeated measures through intermediary variables which are noisy versions of the latent process at time ${t}_{ij}$: ${\mathrm{\Lambda }}_{ij}={\mathrm{\Lambda }}_{i}^{\ast }\left({t}_{ij}\right)+{\mathrm{\epsilon }}_{ij}$, where the ${\mathrm{\epsilon }}_{ij}$’s are i.i.d. normal variables with zero expectation. For ordered categorical data, a standard threshold link model as defined in eq. (9) (Section 4.7) for the univariate case is well adapted, leading to a cumulative probit mixed model. For continuous data, the link has been modeled as $H\left({Y}_{ij};\mathrm{\eta }\right)={\mathrm{\Lambda }}_{ij}$ where $H\left(.;\mathrm{\eta }\right)$ is a monotonic increasing transformation. Three families of such transformations are considered: (i) $H\left(y;\mathrm{\eta }\right)=\frac{h\left(y;{\mathrm{\eta }}_{1},{\mathrm{\eta }}_{2}\right)-{\mathrm{\eta }}_{3}}{{\mathrm{\eta }}_{4}}$ where $h\left(.;{\mathrm{\eta }}_{1},{\mathrm{\eta }}_{2}\right)$ is the beta c.d.f. with parameters $\left({\mathrm{\eta }}_{1},{\mathrm{\eta }}_{2}\right)$; (ii) $H\left(y;\mathrm{\eta }\right)={\mathrm{\eta }}_{1}+{\sum }_{l=2}^{m+2}{\mathrm{\eta }}_{l}{B}_{l}^{I}\left(y\right)$ where $\left({B}_{l}^{I}{\right)}_{l=2,m+2}$ is a basis of quadratic I-splines with m nodes; (iii) $H\left(y;\mathrm{\eta }\right)=\frac{y-{\mathrm{\eta }}_{1}}{{\mathrm{\eta }}_{2}}$ which gives the standard linear mixed model.

Latent process mixed models are estimated within the maximum likelihood framework using the lcmm function of the lcmm R package [30]. When assuming continuous data, the likelihood can be computed analytically using the Jacobian of $H$ [32]. In contrast, when assuming ordered categorical data, an integration over the random effects has to be done numerically [29].

UACVR is computed from the normalized log-likelihood loss $\Psi_{\mathcal{O}_n}$ obtained for the MLEs $\hat\theta$ with respect to the counting measure:

$$\begin{aligned} \Psi_{\mathcal{O}_n}(\hat\theta) &= -n^{-1}\sum_{i=1}^{n} \log \int_{-\infty}^{+\infty} \prod_{j=1}^{n_i} P(Y_{ij} \mid b_i)\, f_b(b_i)\, db_i \\ &= -n^{-1}\sum_{i=1}^{n} \log \int_{-\infty}^{+\infty} \prod_{j=1}^{n_i} \prod_{l=0}^{L} \left(P(Y_{ij} = l \mid b_i)\right)^{1_{\{Y_{ij}=l\}}} f_b(b_i)\, db_i \\ &= -n^{-1}\sum_{i=1}^{n} \log \int_{-\infty}^{+\infty} \prod_{j=1}^{n_i} \prod_{l=0}^{L} \left(P(c_l \le \Lambda_{ij} < c_{l+1} \mid b_i)\right)^{1_{\{Y_{ij}=l\}}} f_b(b_i)\, db_i, \end{aligned} \tag{11}$$

where $c_0 = -\infty$, $c_{L+1} = +\infty$, and either the $c_l$ ($l = 1, \dots, L$) are the estimated thresholds when a threshold model is considered, or $c_l = H(l - \frac{1}{2}, \hat\eta)$ ($l = 1, \dots, L$) when monotonic increasing families of transformations are used. We also need to compute $\hat v_i$ similarly as in Section 4.7. The integrals are approximated by Gaussian quadrature.
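A minimal sketch of the Gaussian quadrature in eq. (11), assuming (for illustration only, not the lcmm implementation) a single normal random intercept $b$ and unit-variance measurement error, so that $P(c_l \le \Lambda_{ij} < c_{l+1} \mid b) = \Phi(c_{l+1} - b) - \Phi(c_l - b)$:

```python
import numpy as np
from scipy.stats import norm

def marginal_loglik_i(y_obs, cuts, mu_b, sig_b, nq=30):
    """log int prod_j P(Y_ij | b) phi(b; mu_b, sig_b^2) db by Gauss-Hermite quadrature."""
    # probabilists' Gauss-Hermite rule: int f(z) exp(-z^2/2) dz ~ sum w_k f(z_k)
    nodes, weights = np.polynomial.hermite_e.hermegauss(nq)
    edges = np.concatenate(([-np.inf], cuts, [np.inf]))
    total = 0.0
    for z, w in zip(nodes, weights):
        b = mu_b + sig_b * z                       # change of variable b = mu + sig*z
        probs = np.diff(norm.cdf(edges - b))       # P(c_l <= Lambda_ij < c_{l+1} | b)
        total += w * np.prod(probs[y_obs])         # product over the n_i observations
    return np.log(total / np.sqrt(2 * np.pi))      # normalizing constant of phi

ll = marginal_loglik_i(np.array([1, 2, 1]), cuts=np.array([-1.0, 0.0, 1.0]),
                       mu_b=0.2, sig_b=0.8)
```

Since the integrand is smooth, the quadrature converges quickly in the number of nodes; doubling `nq` leaves the result essentially unchanged.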

## 6.2 Application: categorical psychometric tests

Data come from the French prospective cohort study PAQUID initiated in 1988 to study normal and pathological aging [33]. Subjects included in the cohort were 65 and older at initial visit and were followed up to 10 times with a visit at 1, 3, 5, 8, 10, 13, 15, 17 and 20 years after the initial visit. At each visit, a battery of psychometric tests including the MMSE was completed. In the present analysis, all the subjects free of dementia at the 1-year visit and who had at least one MMSE measure during the whole follow-up were included: this resulted in a sample size of 2,914 subjects. Data from baseline were removed to avoid modeling the first-passing effect. The observed distributions of the MMSE sumscore and of its calculation subscore are displayed in Figure 1.

Figure 1:

Distributions of MMSE sumscore and MMSE calculation subscore in the PAQUID sample (n = 2,914). Data were pooled from all available visits for a total of 10,846 observations.

The trajectory of the latent process was modeled as an individual quadratic function of age with correlated random effects for intercept, slope and quadratic slope (${Z}_{i}\left(t{\right)}^{\mathrm{T}}=\left(1,\mathrm{a}\mathrm{g}{\mathrm{e}}_{i}\left(t\right),{\mathrm{a}\mathrm{g}\mathrm{e}}_{i}^{2}\left(t\right)\right)$), and an adjustment for binary covariates educational level (EL = 1 if the subject graduated from primary school) and gender (SEX = 1 if the subject is a man) plus their interactions with age and quadratic age (so that ${X}_{i}\left(t{\right)}^{\mathrm{T}}={Z}_{i}\left(t{\right)}^{\mathrm{T}}\otimes \left(1,\mathrm{E}{\mathrm{L}}_{i},\mathrm{S}\mathrm{E}{\mathrm{X}}_{i}\right)$). For MMSE sumscore, in addition to the threshold link, the linear, beta c.d.f. and I-splines (with five equidistant nodes) continuous link functions were considered. For calculation subscore, in addition to the threshold link, only the linear link was considered.

## 6.3 Results

Table 4 gives the assessment criteria for estimators based on the different models, and Table 5 provides the differences in UACVR or AIC$_d$ and their 95% tracking intervals. For the MMSE sumscore, the mixed model assuming the standard linear transformation yielded a clearly worse UACVR than the other models, which account for nonlinear relationships with the underlying latent process. The model involving a beta c.d.f. gave a risk similar to that of the model involving the less parsimonious I-splines transformation ($D_{\mathrm{UACVR}} = -0.0070$, with 0 in the 95% tracking interval). Finally, the mixed model considering a threshold link model, which is numerically demanding (because of a three-dimensional integral in the likelihood), gave the best assessment risk but remained relatively close to the simpler models assuming a beta c.d.f. ($D_{\mathrm{UACVR}} = 0.0200$) or an I-splines transformation ($D_{\mathrm{UACVR}} = 0.0270$). For the interpretation of these values, Commenges et al. [19] suggested qualifying values of order $10^{-1}$, $10^{-2}$ and $10^{-3}$ as "large," "moderate" and "small," respectively; moreover, for multivariate observations it was suggested to divide by the total number of observations rather than by the number of independent observations. With this correction (which amounts to dividing the current values by a factor of $3.7 = 10{,}846/2{,}914$), the differences between the linear model and the other models can be qualified as "large," and the differences between the threshold model and both the beta c.d.f. and I-splines models are between "moderate" and "small." Of course, this gives only an idea of the difference of risks between estimators; a more intuitive and reliable interpretation scale remains to be found.
Figure 2 displays the estimated link functions in (A) and the predicted mean trajectories of the latent process according to educational level in (B) from the models involving either a linear, a beta c.d.f., I-splines or a threshold link function. The estimated link functions as well as the predicted trajectories of the latent process are very close when assuming either beta c.d.f., I-splines or a threshold link function but they greatly differ when assuming a linear link.

Figure 2:

(A) Estimated inverse link functions between MMSE sumscores and the underlying latent process and (B) predicted trajectories of the latent process of a woman according to educational level (with EL+ and EL– for, respectively, validated or non-validated primary school diploma) in latent process mixed models assuming either linear, beta c.d.f., I-splines or threshold link functions (PAQUID sample, n = 2,914); the trajectories for the latter three transformations are indistinguishable.

For the calculation subscore, the standard linear mixed model again gave a clearly higher risk than the mixed model assuming a threshold link ($D_{\mathrm{UACVR}}(\mathrm{linear}, \mathrm{threshold}) = 0.452$; 95% tracking interval: $[0.413, 0.492]$).

Table 4:

Number of parameters ($p$), naive normalized AIC (AIC), AICd, and UACVR for latent process mixed models involving different transformations $H$, applied to either the MMSE sumscore or its calculation subscore.

Table 5:

Difference of AICd ($D_{\mathrm{AIC}_d}$), difference of two UACVR values ($D_{\mathrm{UACVR}}$) and its 95% tracking interval between latent process mixed models involving different transformations $H_1$ and $H_2$, applied to either the MMSE sumscore or its calculation subscore.

## 7 Conclusion

We have proposed a universal approximate formula for leave-one-out cross-validation under regularity conditions: it is universal in the sense that it applies to any pair of estimating and assessment risks that can be correctly estimated from the observations. UACVR is often a very good approximation of leave-one-out cross-validation, which itself does nearly as well as an “oracle estimator” of the assessment risk that would be computable if we assessed the estimator on an independent replica of the sample. Another asset is that UACVR does not require the models to be well specified, and non-nested models can be compared. The result is in principle restricted to parametric models but extends to smooth semi- or non-parametric ones through spline representations of penalized likelihood estimators. The approximate formula not only allows fast computation, because the model is fitted only once, but also allows deriving the asymptotic distribution.

Estimating this distribution is important since the variability of UACVR, like that of any criterion used for estimator choice, may be large. Fortunately, as noted in Section 3, the variability of a difference of UACVR values between two estimators is smaller, but it still remains non-negligible. A simple formula allows these variances to be estimated and so-called tracking intervals to be constructed; however, our simulation study shows that the coverage of these tracking intervals is too large, due to an overestimation of the variances. It remains an open question why this happened here while in other contexts [15, 19] the coverage rates were correct, and whether a correction for this overestimation can be found; nevertheless, the estimates have the correct order of magnitude and the tracking intervals may be useful.
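Given an estimated risk difference and its estimated standard error, a tracking interval under the normal approximation is simply the estimate plus or minus a normal quantile times the standard error. A minimal sketch (the function name and interface are ours; the standard error below is back-computed from the interval half-width reported for the calculation subscore, not taken from the paper):

```python
import math

def tracking_interval(d_hat, se_hat, z=1.96):
    # 95% tracking interval for a difference of risks,
    # assuming approximate normality of the estimate.
    return (d_hat - z * se_hat, d_hat + z * se_hat)

# Calculation subscore, linear vs threshold: D = 0.452, se ~ 0.02.
lo, hi = tracking_interval(0.452, 0.02)   # approximately (0.413, 0.491)
```

The overestimation of the variances noted above would widen `se_hat`, and hence the interval, relative to the nominal 95% coverage.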

In this paper, UACVR has been applied to the choice between estimators of the distribution of longitudinal categorical data, based either on cumulative probit mixed models or on mixed models using a continuous approximation. It has been shown that the naive AIC can be misleading, while a procedure called AICd (which had not yet been validated) yields results very close to UACVR, even if the latter is slightly better. Both quantities can be computed with the lcmm R package.

## Appendix: Proof of Theorem 1

Under Assumptions A1–A3 below, we have formula (4).

In the proof, we apply the $o_p$ and $O_p$ concepts to vectors and matrices. Saying that a matrix $H$ is $O_p(1)$ means that all its elements are $O_p(1)$. The proof is partly heuristic in that we need at the end an assumption ensuring that a mean of $n$ remainder terms, each $O_p(n^{-2})$, is itself $O_p(n^{-2})$, or at least $o_p(n^{-1})$.

We assume:

• A1 $\theta_0$ is the unique minimizer of $\Phi(\theta)$ and the M-estimator $\hat{\theta}$ is consistent for $\theta_0$.

• A2 $\varphi(\theta, y)$ is thrice differentiable for every $y$ and the third derivative is dominated by a fixed function in a neighborhood of $\theta_0$.

• A3 $\psi(\theta, y)$ is twice differentiable for every $y$ and the second derivative is dominated by a fixed function in a neighborhood of $\theta_0$.

The proof is as follows. Assumption A2 is the essential assumption in the so-called classical conditions [17] for obtaining that $\sqrt{n}(\hat{\theta} - \theta_0)$ has an asymptotic normal distribution. It implies that $\hat{\theta}_{-i} - \hat{\theta} = O_p(n^{-1/2})$. A Taylor expansion of $\frac{\partial \Phi_{\mathcal{O}_{n|i}}}{\partial \theta}\big|_{\hat{\theta}_{-i}}$ around $\hat{\theta}$ yields
$$0 = \frac{\partial \Phi_{\mathcal{O}_{n|i}}}{\partial \theta}\Big|_{\hat{\theta}} + H_{\Phi_{\mathcal{O}_{n|i}}}\bigl(\hat{\theta}_{-i} - \hat{\theta}\bigr) + R_n^1,$$
where $H_{\Phi_{\mathcal{O}_{n|i}}} = \frac{\partial^2 \Phi_{\mathcal{O}_{n|i}}}{\partial \theta^2}\big|_{\hat{\theta}}$ and $R_n^1$ is a quadratic form in $\hat{\theta}_{-i} - \hat{\theta}$ involving third derivatives of $\Phi_{\mathcal{O}_{n|i}}$ taken at $\tilde{\theta}_n$ with $\|\tilde{\theta}_n - \hat{\theta}\| \le \|\hat{\theta}_{-i} - \hat{\theta}\|$. Thus $\|\tilde{\theta}_n - \hat{\theta}\|$ is also $O_p(n^{-1/2})$. Under Assumption A2 and using Lemma 2.12 of van der Vaart [17], $R_n^1$ is $O_p(n^{-1})$. Assumptions A1 and A2 imply that $I(\theta) = \frac{\partial^2 \Phi}{\partial \theta^2}\big|_{\theta}$ exists and is invertible in a neighborhood of $\theta_0$. By the strong law of large numbers, $H_{\Phi_{\mathcal{O}_n}} = \frac{\partial^2 \Phi_{\mathcal{O}_n}}{\partial \theta^2}\big|_{\hat{\theta}}$ and $H_{\Phi_{\mathcal{O}_{n|i}}}$ converge toward $I(\theta_0)$ and are thus invertible for sufficiently large $n$. It also follows that both these matrices and their inverses are $O_p(1)$.
Thus, from the above development we obtain
$$\hat{\theta}_{-i} - \hat{\theta} = -H_{\Phi_{\mathcal{O}_{n|i}}}^{-1} \frac{\partial \Phi_{\mathcal{O}_{n|i}}}{\partial \theta}\Big|_{\hat{\theta}} + R_n,$$
where $R_n = -H_{\Phi_{\mathcal{O}_{n|i}}}^{-1} R_n^1$ is $O_p(n^{-1})$.

By definition of $\Phi_{\mathcal{O}_n}(\theta)$ we have the relation
$$n\,\Phi_{\mathcal{O}_n}(\theta) = (n-1)\,\Phi_{\mathcal{O}_{n|i}}(\theta) + \varphi(\theta, Y_i). \qquad (12)$$
Taking derivatives of both sides and evaluating at $\hat{\theta}$ gives $0 = (n-1)\frac{\partial \Phi_{\mathcal{O}_{n|i}}}{\partial \theta}\big|_{\hat{\theta}} + \frac{\partial \varphi(\theta, Y_i)}{\partial \theta}\big|_{\hat{\theta}}$, and we obtain $\frac{\partial \Phi_{\mathcal{O}_{n|i}}}{\partial \theta}\big|_{\hat{\theta}} = -\hat{d}_i$.

Hence we have
$$\hat{\theta}_{-i} - \hat{\theta} = H_{\Phi_{\mathcal{O}_{n|i}}}^{-1} \hat{d}_i + R_n. \qquad (13)$$
Note that this implies that $\hat{\theta}_{-i} - \hat{\theta} = O_p(n^{-1})$, because $H_{\Phi_{\mathcal{O}_{n|i}}}^{-1} = O_p(1)$ and both $\hat{d}_i$ and $R_n$ are $O_p(n^{-1})$. But this in turn implies that $R_n$ is in fact $O_p(n^{-2})$ (as a quadratic form in $O_p(n^{-1})$ terms). We now show that $H_{\Phi_{\mathcal{O}_{n|i}}}$ can be replaced by $H_{\Phi_{\mathcal{O}_n}} = \frac{\partial^2 \Phi_{\mathcal{O}_n}}{\partial \theta^2}\big|_{\hat{\theta}}$ in eq. (13). Differentiating eq. (12) twice yields $H_{\Phi_{\mathcal{O}_n}} = \frac{n-1}{n} H_{\Phi_{\mathcal{O}_{n|i}}} + \frac{1}{n} H_{\varphi_i}$, where $H_{\varphi_i} = \frac{\partial^2 \varphi(\theta, Y_i)}{\partial \theta^2}\big|_{\hat{\theta}}$; since the last term is $O_p(n^{-1})$, we can write $H_{\Phi_{\mathcal{O}_{n|i}}} = H_{\Phi_{\mathcal{O}_n}} + O_p(n^{-1})$. Equation (13) can be written $H_{\Phi_{\mathcal{O}_{n|i}}}(\hat{\theta}_{-i} - \hat{\theta}) = \hat{d}_i + O_p(n^{-2})$, or, replacing $H_{\Phi_{\mathcal{O}_{n|i}}}$ by $H_{\Phi_{\mathcal{O}_n}} + O_p(n^{-1})$,
$$H_{\Phi_{\mathcal{O}_n}}\bigl(\hat{\theta}_{-i} - \hat{\theta}\bigr) = \hat{d}_i + O_p(n^{-1})\bigl(\hat{\theta}_{-i} - \hat{\theta}\bigr) + O_p(n^{-2}).$$
Using the fact that $\hat{\theta}_{-i} - \hat{\theta} = O_p(n^{-1})$ we obtain
$$\hat{\theta}_{-i} - \hat{\theta} = H_{\Phi_{\mathcal{O}_n}}^{-1} \hat{d}_i + O_p(n^{-2}). \qquad (14)$$
Expanding the assessment loss at $\hat{\theta}_{-i}$ around $\hat{\theta}$ yields (using Assumption A3)
$$\psi\bigl(g^{\hat{\theta}_{-i}}, Y_i\bigr) = \psi\bigl(g^{\hat{\theta}}, Y_i\bigr) + \bigl(\hat{\theta}_{-i} - \hat{\theta}\bigr)^T \hat{v}_i + O_p(n^{-2}).$$
Replacing $\hat{\theta}_{-i} - \hat{\theta}$ by its approximation in eq. (14), we obtain
$$\psi\bigl(g^{\hat{\theta}_{-i}}, Y_i\bigr) = \psi\bigl(g^{\hat{\theta}}, Y_i\bigr) + \hat{d}_i^T H_{\Phi_{\mathcal{O}_n}}^{-1} \hat{v}_i + O_p(n^{-2}).$$
Taking the mean of the left-hand sides of these equations yields $\mathrm{CV}(g^{\hat{\theta}})$. Taking the mean of the right-hand sides gives a development whose error term is the mean of $n$ error terms in $O_p(n^{-2})$. Because the number of error terms increases with $n$, it is not true in general that such a mean preserves the order of the error terms; this is true under boundedness conditions on the expectations of these terms. At this stage the proof is heuristic: we assume conditions such that the mean of these $O_p(n^{-2})$ terms is also $O_p(n^{-2})$, or at least $o_p(n^{-1})$. When this holds, we obtain the announced result given in formula (4).
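The resulting formula replaces $n$ model refits by a single fit plus per-observation corrections $\hat{d}_i^T H^{-1} \hat{v}_i$. As a numerical illustration (our own sketch, not the authors' code), take the scalar M-estimation problem with $\varphi(\theta,y)=\psi(\theta,y)=(y-\theta)^2/2$, for which the leave-one-out refits are available in closed form, and compare brute-force cross-validation with the one-step approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)
n = len(y)

# Estimating and assessment losses both (y - theta)^2 / 2,
# so the M-estimator minimizing Phi_n is the sample mean.
theta_hat = y.mean()

# Ingredients of the approximation (scalar parameter, so H is a number):
H = 1.0                            # second derivative of Phi_n at theta_hat
d = (theta_hat - y) / (n - 1)      # d_i = -dPhi_{n|i}/dtheta at theta_hat
v = theta_hat - y                  # v_i = dpsi(., y_i)/dtheta at theta_hat

# Approximate cross-validation: mean of psi at theta_hat plus corrections.
uacvr = np.mean(0.5 * (y - theta_hat) ** 2 + d * (1.0 / H) * v)

# Brute-force leave-one-out cross-validation (exact closed-form refits):
theta_loo = (n * theta_hat - y) / (n - 1)
cv = np.mean(0.5 * (y - theta_loo) ** 2)

assert abs(uacvr - cv) < 1e-4      # agreement up to O(1/n^2)
```

Here the residual error is exactly of order $n^{-2}$ because the losses are quadratic; with non-quadratic losses, the gradients and the Hessian would be taken from the fitted model and the same $O_p(n^{-2})$ agreement would hold under Assumptions A1–A3.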

## References

1. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csáki F, editors. Proceedings of the 2nd International Symposium on Information Theory. Budapest: Akadémiai Kiadó, 1973:267–81.

2. Takeuchi K. Distributions of information statistics and criteria for adequacy of models. Math Sci 1976;153:12–18.

3. Konishi S, Kitagawa G. Generalised information criteria in model selection. Biometrika 1996;83:875–90.

4. Murata N, Yoshizawa S, Amari S-I. Network information criterion – determining the number of hidden units for an artificial neural network model. IEEE Trans Neural Networks 1994;5:865–72.

5. Konishi S, Kitagawa G. Information criteria and statistical modeling. New York: Springer Series in Statistics, 2008.

6. Stone M. Cross-validatory choice and assessment of statistical predictions (with discussion). J R Stat Soc B 1974;39:111–47.

7. Golub G, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 1979;21:215–23.

8. Wahba G. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Ann Stat 1985;13:1378–402.

9. Van der Laan M, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Stat Appl Genet Mol Biol 2004;3:1036.

10. Xu G, Huang JZ. Asymptotic optimality and efficient computation of the leave-subject-out cross-validation. Ann Stat 2012;40:3003–30.

11. Gu C, Xiang D. Cross-validating non-Gaussian data. J Comput Graphical Stat 2001;10:581–91.

12. Xiang D, Wahba G. A generalized approximate cross validation for smoothing splines with non-Gaussian data. Stat Sin 1996;6:675–92.

13. Commenges D, Joly P, Gegout-Petit A, Liquet B. Choice between semi-parametric estimators of Markov and non-Markov multi-state models from generally coarsened observations. Scand J Stat 2007;34:33–52.

14. O'Sullivan F. A statistical perspective on ill-posed inverse problems. Stat Sci 1986;1:502–18.

15. Commenges D, Liquet B, Proust-Lima C. Choice of prognostic estimators in joint models by estimating differences of expected conditional Kullback-Leibler risks. Biometrics 2012;68:380–7.

16. Gneiting T, Raftery A. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 2007;102:359–78.

17. Van der Vaart A. Asymptotic statistics. Cambridge: Cambridge University Press, 2000.

18. Watanabe S. Algebraic geometry and statistical learning theory. Cambridge: Cambridge University Press, 2009.

19. Commenges D, Sayyareh A, Letenneur L, Guedj J, Bar-Hen A. Estimating a difference of Kullback-Leibler risks using a normalized difference of AIC. Ann Appl Stat 2008;2:1123–42.

20. Vuong Q. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 1989;57:307–33.

21. Cover T, Thomas J. Elements of information theory. New York: John Wiley and Sons, 1991.

22. Hall P. On Kullback-Leibler loss and density estimation. Ann Stat 1987;15:1491–519.

23. Burnham KP, Anderson DR. Model selection and multimodel inference: a practical information-theoretic approach. 2nd ed. New York: Springer-Verlag, 2002.

24. Liquet B, Commenges D. Choice of estimators based on different observations: modified AIC and LCV criteria. Scand J Stat 2011;38:268–87.

25. Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev 1950;78:1–3.

26. Vaida F, Blanchard S. Conditional Akaike information for mixed-effects models. Biometrika 2005;92:351–70.

27. Greven S, Kneib T. On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika 2010;97:773–89.

28. Braun J, Held L, Ledergerber B. Predictive cross-validation for the choice of linear mixed-effects models with application to data from the Swiss HIV cohort study. Biometrics 2012;68:53–61.

29. Proust-Lima C, Amieva H, Jacqmin-Gadda H. Analysis of multivariate mixed longitudinal data: a flexible latent process approach. Br J Math Stat Psychol 2012;66:470–87.

30. Proust-Lima C, Philipps V, Diakite A, Liquet B. lcmm: estimation of extended mixed models using latent classes and latent processes. R package version 1.6.6, 2014.

31. Folstein MF, Folstein SE, McHugh PR. "Mini-mental state". A practical method for grading the cognitive state of patients for the clinician. J Psychiatr Res 1975;12:189–98.

32. Proust C, Jacqmin-Gadda H, Taylor JM, Ganiayre J, Commenges D. A nonlinear model with latent process for cognitive evolution using multivariate longitudinal data. Biometrics 2006;62:1014–24.

33. Letenneur L, Commenges D, Dartigues JF, Barberger-Gateau P. Incidence of dementia and Alzheimer's disease in elderly community residents of South-Western France. Int J Epidemiol 1994;23:1256–61.

## Footnotes

Published in Print: 2015-05-01

Citation Information: The International Journal of Biostatistics, ISSN (Online) 1557-4679, ISSN (Print) 2194-573X.

© 2015 Walter de Gruyter GmbH, Berlin/Boston.