
# The International Journal of Biostatistics

Ed. by Chambaz, Antoine / Hubbard, Alan E. / van der Laan, Mark J.

IMPACT FACTOR 2018: 1.309

CiteScore 2018: 1.11

SCImago Journal Rank (SJR) 2018: 1.325
Source Normalized Impact per Paper (SNIP) 2018: 0.715

Mathematical Citation Quotient (MCQ) 2018: 0.03

Online ISSN: 1557-4679
Volume 13, Issue 2

# A Comparison of Methods for Estimating the Determinant of High-Dimensional Covariance Matrix

Zongliang Hu
/ Kai Dong
/ Wenlin Dai
/ Tiejun Tong
Published Online: 2017-09-21 | DOI: https://doi.org/10.1515/ijb-2017-0013

## Abstract

The determinant of the covariance matrix for high-dimensional data plays an important role in statistical inference and decision. It has many real applications including statistical tests and information theory. Due to the statistical and computational challenges with high dimensionality, little work has been proposed in the literature for estimating the determinant of high-dimensional covariance matrix. In this paper, we estimate the determinant of the covariance matrix using some recent proposals for estimating high-dimensional covariance matrix. Specifically, we consider a total of eight covariance matrix estimation methods for comparison. Through extensive simulation studies, we explore and summarize some interesting comparison results among all compared methods. We also provide practical guidelines based on the sample size, the dimension, and the correlation of the data set for estimating the determinant of high-dimensional covariance matrix. Finally, from a perspective of the loss function, the comparison study in this paper may also serve as a proxy to assess the performance of the covariance matrix estimation.

## 1 Introduction

High-dimensional data are becoming more common in scientific research, including gene expression studies, financial engineering, and signal processing. One significant feature of such data is that the dimension $p$ is larger than the sample size $n$; these are the so-called “large $p$ small $n$” data. For example, gene microarrays often measure thousands of gene expression values simultaneously for each individual. However, due to the cost or the limited availability of patients, the number of samples in microarray experiments is usually much smaller than the number of genes. It is common to see microarray data with fewer than 10 samples [1, 2, 3, 4, 5]. As seen in the literature, there are many statistical and computational challenges in analyzing the “large $p$ small $n$” data.

Let ${X}_{i}=\left({x}_{i1},\dots ,{x}_{ip}{\right)}^{T}$, $i=1,\dots ,n$, be independent and identically distributed (i.i.d.) random vectors from the multivariate normal distribution ${N}_{p}\left(\mu ,\mathrm{\Sigma }\right)$, where $\mu$ is a $p$-dimensional mean vector and $\mathrm{\Sigma }$ is a covariance matrix of size $p×p$. When $p$ is larger than $n$, the sample covariance matrix ${S}_{n}$ is a singular matrix. To overcome the singularity problem, various methods for estimating $\mathrm{\Sigma }$ have been proposed in the recent literature, e.g., the ridge-type estimators in [6] and [7], and the sparse estimators in [8, 9, 10] and [11]. Recently, [12] and [13] considered sparse covariance matrix estimation for time series data based on certain dependence measures, which relaxes the independence assumption among samples. For more references, see also [14, 15] and [16].

Apart from the covariance matrix estimation, there are situations where one needs an estimate of the determinant (or the log-determinant) of the covariance matrix for high-dimensional data. To illustrate, we write the log-likelihood function of the data as $\begin{array}{r}log\left(L\right)=-\frac{np}{2}log\left(2\pi \right)-\frac{n}{2}log|\mathrm{\Sigma }|-\frac{1}{2}\sum _{i=1}^{n}{\left({X}_{i}-\mu \right)}^{T}{\mathrm{\Sigma }}^{-1}\left({X}_{i}-\mu \right),\end{array}$

where $|\mathrm{\Sigma }|$ denotes the determinant of the covariance matrix $\mathrm{\Sigma }$. In classical multivariate analysis, the determinant $|\mathrm{\Sigma }|$, referred to as the generalized variance (GV), was introduced by [17] and [18] as a scalar measure of overall multidimensional scatter. It has many applications, including outlier detection, hypothesis testing, and classification. We present several examples below.

• Quadratic discriminant analysis (QDA) is an important method of classification. Assuming that the data in class $k$ follows ${N}_{p}\left({\mu }_{k},{\mathrm{\Sigma }}_{k}\right)$, the quadratic discriminant scores are given by

${d}_{k}\left(Y\right)=\left(Y-{\mu }_{k}{\right)}^{T}{\mathrm{\Sigma }}_{k}^{-1}\left(Y-{\mu }_{k}\right)+log|{\mathrm{\Sigma }}_{k}|-2log{\pi }_{k},\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}k=1,\dots ,K,$ where $Y$ is the new sample, $K$ is the total number of classes, and ${\pi }_{k}$ is the prior probability of observing a sample from class $k$. The classification rule is to assign $Y$ to class $k$ that minimizes ${d}_{k}\left(Y\right)$ among all classes. To implement QDA, it is obvious that we need an estimate of $|{\mathrm{\Sigma }}_{k}|$ or $log|{\mathrm{\Sigma }}_{k}|$.
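As a minimal sketch in Python/NumPy (the function name and toy data are illustrative, not from the paper), the quadratic discriminant scores above can be computed directly once estimates of ${\mu }_{k}$, ${\mathrm{\Sigma }}_{k}$, and ${\pi }_{k}$ are available:

```python
import numpy as np

def qda_scores(Y, mus, Sigmas, priors):
    """Quadratic discriminant scores d_k(Y) for each class k.

    mus / Sigmas / priors: per-class mean vectors, covariance matrices,
    and prior probabilities pi_k.  The smaller the score, the better the fit.
    """
    scores = []
    for mu, Sigma, pi in zip(mus, Sigmas, priors):
        diff = Y - mu
        # slogdet gives log|Sigma_k| stably; requires Sigma_k positive definite
        sign, logdet = np.linalg.slogdet(Sigma)
        d = diff @ np.linalg.solve(Sigma, diff) + logdet - 2.0 * np.log(pi)
        scores.append(d)
    return np.array(scores)

# toy example: two well-separated classes in 2-D
mu1, mu2 = np.array([0.0, 0.0]), np.array([5.0, 5.0])
S1, S2 = np.eye(2), 2.0 * np.eye(2)
d = qda_scores(np.array([4.8, 5.1]), [mu1, mu2], [S1, S2], [0.5, 0.5])
print(int(np.argmin(d)))  # -> 1: the point is assigned to the second class
```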

• To estimate the high-dimensional precision matrix $\mathrm{\Omega }={\mathrm{\Sigma }}^{-1}$, [19] and [20] proposed to solve the following optimization problem:

$\stackrel{ˆ}{\mathrm{\Omega }}=arg\underset{\mathrm{\Omega }>0}{min}\left\{\mathrm{t}\mathrm{r}\left({S}_{n}\mathrm{\Omega }\right)-log|\mathrm{\Omega }|+\lambda \parallel \mathrm{\Omega }{\parallel }_{1}\right\},$ where $\text{tr}\left(\cdot \right)$ is the trace, $\parallel \cdot {\parallel }_{1}$ is the ${\mathrm{\ell }}_{1}$ norm, and $\lambda$ is a tuning parameter. The purpose of the term $log|\mathrm{\Omega }|=-log|\mathrm{\Sigma }|$ is to ensure that the optimization problem has a unique global positive definite minimizer [10]. Other proposals in this direction include [21], [22], [23], and [24], among others.
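Solving this problem requires an iterative algorithm such as the graphical lasso; as a sketch, one can at least evaluate the penalized objective for a candidate $\mathrm{\Omega }$. The entrywise ${\mathrm{\ell }}_{1}$ penalty here is an assumption; some implementations penalize only off-diagonal entries.

```python
import numpy as np

def glasso_objective(Omega, S, lam):
    """Penalized objective tr(S @ Omega) - log|Omega| + lam * ||Omega||_1,
    with an entrywise l1 norm (an assumption; check against the solver used)."""
    sign, logdet = np.linalg.slogdet(Omega)
    assert sign > 0, "Omega must be positive definite"
    return np.trace(S @ Omega) - logdet + lam * np.abs(Omega).sum()

# sanity check: with lam = 0 and S = I, the objective 3t - 3*log(t) over
# candidates Omega = t*I is minimized at t = 1, i.e. Omega = I
S = np.eye(3)
ts = (0.5, 1.0, 2.0)
vals = [glasso_objective(t * np.eye(3), S, lam=0.0) for t in ts]
print(min(range(3), key=lambda i: vals[i]))  # -> 1
```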

• In probability theory and information theory, the differential entropy is commonly used by extending the concept of entropy to the continuous probability distribution [25, 26]. For a random vector from ${N}_{p}\left(\mu ,\mathrm{\Sigma }\right)$, the differential entropy is

$\begin{array}{r}h\left(\mathrm{\Sigma }\right)=\frac{p}{2}+\frac{plog\left(2\pi \right)}{2}+\frac{log|\mathrm{\Sigma }|}{2}.\end{array}$
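This formula is straightforward to evaluate numerically; a small sanity check (Python/NumPy, illustrative only):

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy h(Sigma) = p/2 + p*log(2*pi)/2 + log|Sigma|/2
    of a p-variate normal distribution."""
    p = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (p + p * np.log(2 * np.pi) + logdet)

# for Sigma = I_p, log|Sigma| = 0 and the entropy reduces to (p/2)(1 + log(2*pi))
p = 4
h = gaussian_entropy(np.eye(p))
print(abs(h - 0.5 * p * (1 + np.log(2 * np.pi))) < 1e-12)  # -> True
```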

• The minimum covariance determinant (MCD) method developed by [27] and [28] is a robust estimator of multivariate scatter. MCD aims to find a subset with $h$ samples (observations) having the smallest determinant of the covariance matrix. Specifically, let ${S}=\left\{I\subset \left\{1,\dots ,n\right\}:\mathrm{c}\mathrm{a}\mathrm{r}\mathrm{d}\left(I\right)=h\right\}$ be the collection of all subsets with $h$ samples, where $\mathrm{c}\mathrm{a}\mathrm{r}\mathrm{d}\left(I\right)$ is the cardinality of $I$. For any $I\in {S}$, let ${S}_{I}$ be the corresponding sample covariance. The subset with the minimum determinant is defined as

${I}_{m}=arg\underset{I\in {S}}{min}\left\{|{S}_{I}|\right\}.$ When $p$ is larger than $n$, MCD is ill-defined as ${S}_{I}$ is singular. To generalize the MCD method to high-dimensional data, we need an estimate for the determinant of the high-dimensional covariance matrix. For instance, [29] replaced $|{S}_{I}|$ with $|\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{I}\right)|$, and [30] modified $|{S}_{I}|$ by shrinking the subset-based sample covariance matrix toward a target matrix.
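A brute-force version of the MCD search can be sketched as follows. It enumerates all $h$-subsets, so it is exponential in $n$ and purely didactic; practical implementations use the FAST-MCD algorithm instead.

```python
import numpy as np
from itertools import combinations

def mcd_subset(X, h):
    """Brute-force MCD: return the h-subset of rows of X whose sample
    covariance has the smallest determinant.  Didactic sketch only."""
    n = X.shape[0]
    best, best_det = None, np.inf
    for I in combinations(range(n), h):
        S_I = np.cov(X[list(I)], rowvar=False)
        d = np.linalg.det(S_I)
        if d < best_det:
            best, best_det = I, d
    return best, best_det

# a tight cluster plus one far outlier: the optimal subset excludes the outlier
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(7, 2)), [[50.0, 50.0]]])
I_m, _ = mcd_subset(X, h=7)
print(7 in I_m)  # -> False: the outlier (row 7) is left out
```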

• Multivariate analysis of variance (MANOVA) is a procedure for testing the equality of mean vectors across multiple groups. Wilks’ $\mathrm{\Lambda }$ statistic for the hypothesis test [31] is given as

$\begin{array}{r}\mathrm{\Lambda }=\frac{|E|}{|H+E|},\end{array}$ where $E$ is the within-group sum of squares and cross-product matrix, and $H$ is the between-group sum of squares and cross-product matrix. However, $E$ is singular under the “large $p$ small $n$” setting. To apply MANOVA for high-dimensional data, [32] proposed replacing $E$ with a shrinkage estimator, in which the shrinkage intensity is computed based on the method by [33]. Ullah and Jones [34] compared the powers of three types of regularized Wilks’ $\mathrm{\Lambda }$ statistics, in which $E$ was replaced by the lasso, ridge and shrinkage estimator, respectively.
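A sketch of Wilks’ $\mathrm{\Lambda }$ in the classical $p<n$ regime, where $E$ is nonsingular (illustrative code, not the regularized versions discussed above):

```python
import numpy as np

def wilks_lambda(groups):
    """Wilks' Lambda = |E| / |H + E| for a list of (n_k x p) group samples.
    Requires p < total sample size so that E is nonsingular."""
    all_X = np.vstack(groups)
    grand = all_X.mean(axis=0)
    # E: within-group sum of squares and cross-products
    E = sum((Xk - Xk.mean(axis=0)).T @ (Xk - Xk.mean(axis=0)) for Xk in groups)
    # H: between-group sum of squares and cross-products
    H = sum(len(Xk) * np.outer(Xk.mean(axis=0) - grand, Xk.mean(axis=0) - grand)
            for Xk in groups)
    return np.linalg.det(E) / np.linalg.det(H + E)

rng = np.random.default_rng(1)
close = wilks_lambda([rng.normal(0, 1, (30, 3)), rng.normal(0, 1, (30, 3))])
apart = wilks_lambda([rng.normal(0, 1, (30, 3)), rng.normal(5, 1, (30, 3))])
print(apart < close)  # well-separated group means drive Lambda toward 0
```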

From the above examples, it is evident that an estimator of GV, or $log|\mathrm{\Sigma }|$, plays an important role in high-dimensional data analysis. For ease of notation, we let $\theta =log|\mathrm{\Sigma }|$ throughout the paper. In contrast to the covariance matrix estimation, the estimation of $\theta$ has received relatively little attention in the literature. In practice, one often estimates the covariance matrix first and then uses it to compute the log-determinant. Chiu et al. [35] considered a regression model and allowed the covariance matrix of the response vector ${X}_{i}=\left({x}_{i1},\dots ,{x}_{ip}{\right)}^{T}$ to vary with explanatory variables. Specifically, they proposed modeling each element of $log\mathrm{\Sigma }$ as a linear function of the explanatory variables. One property of this transformation is that the log-determinant $log|\mathrm{\Sigma }|$ is equal to $\mathrm{t}\mathrm{r}\left(log\mathrm{\Sigma }\right)$, the sum of the log eigenvalues of $\mathrm{\Sigma }$. Recently, [36] investigated the estimation of $\theta$ under various settings. In a “moderate” setting with $p\le n$, they proposed to estimate $\theta$ by the log-determinant of the sample covariance matrix, i.e., $log|{S}_{n}|$, and established a central limit theorem for $log|{S}_{n}|$ in the setting where $p$ can grow with $n$. For the “large $p$ small $n$” data, however, they showed that it is impossible to estimate $\theta$ consistently unless some structural assumption, such as sparsity, is imposed on the parameter.

In this paper, we conduct a comprehensive simulation study that evaluates the performance of the existing methods for estimating $\theta$. We follow a two-step procedure: we first estimate $\mathrm{\Sigma }$ with the existing methods, and then estimate $\theta$ by the plug-in estimator, $\stackrel{ˆ}{\theta }=log\left(|\stackrel{ˆ}{\mathrm{\Sigma }}|\right)$. In Section 2, we consider a total of eight methods for estimating $\theta$. A brief review on each of the methods is also given. In Section 3, we conduct simulation studies to evaluate and compare their performance under various settings. In particular, we will consider different types of correlation structures including a non-positive definite covariance matrix that is often ignored in the existing literature. We then explore and summarize some useful findings, and provide some practical guidelines for scientists in Section 4. Finally, we conclude the paper in Section 5 with some discussion. Technical details are provided in the Appendix.

## 2 Methods for estimating $\theta$

In this section, we review eight representative methods for estimating the covariance matrix, and then estimate the log-determinant $\theta$ using each of the eight estimates of $\mathrm{\Sigma }$. We also propose a new method for estimating $\theta$ under the assumption of a diagonal covariance matrix. For ease of presentation, we divide the eight methods into four categories: diagonal estimation, shrinkage estimation, sparse estimation, and factor model estimation.

## 2.1 Diagonal estimation

Method 1: Diagonal Estimator (DE)

Under the “large $p$ small $n$” setting, one naive approach is to estimate $\mathrm{\Sigma }$ by the diagonal sample covariance matrix, i.e., $\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)$. This estimator was first used in [37] to construct diagonal linear discriminant analysis. It was further studied in [38], where the authors demonstrated that a diagonal covariance matrix estimate can sometimes be reasonable when $p$ is much larger than $n$. Let $\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left(\mathrm{\Sigma }\right)=\text{diag}\left({\sigma }_{1}^{2},\dots ,{\sigma }_{p}^{2}\right)$, where ${\sigma }_{j}^{2}$ are the covariate-specific variances for $j=1,\dots ,p$, and $\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)=\text{diag}\left({s}_{1}^{2},\dots ,{s}_{p}^{2}\right)$, where ${s}_{j}^{2}$ is the sample variance estimating ${\sigma }_{j}^{2}$. By letting $\stackrel{ˆ}{\mathrm{\Sigma }}=\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)$, we define the first estimator of $\theta$ as $\begin{array}{r}{\stackrel{ˆ}{\theta }}_{\left(1\right)}=log|\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)|=\sum _{j=1}^{p}log{s}_{j}^{2}.\end{array}$(1)

We refer to ${\stackrel{ˆ}{\theta }}_{\left(1\right)}$ as the diagonal estimator (DE). To be specific, DE is proposed to estimate $log|\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left(\mathrm{\Sigma }\right)|$ rather than $log|\mathrm{\Sigma }|$.
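The DE estimator in eq. (1) amounts to summing the log sample variances; a minimal sketch (Python/NumPy, illustrative data):

```python
import numpy as np

def theta_de(X):
    """Diagonal estimator (eq. 1): sum of log sample variances of the
    n x p data matrix X.  Targets log|diag(Sigma)|, not log|Sigma|."""
    s2 = X.var(axis=0, ddof=1)  # unbiased sample variances s_j^2
    return np.sum(np.log(s2))

rng = np.random.default_rng(2)
X = rng.normal(0, 1, size=(10, 5))
# equivalently, the log-determinant of the diagonal sample covariance matrix
check = np.log(np.linalg.det(np.diag(X.var(axis=0, ddof=1))))
print(np.isclose(theta_de(X), check))  # -> True
```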

Method 2: Improved Diagonal Estimator (IDE)

It is noteworthy that DE may not perform well as an estimate of $log|\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left(\mathrm{\Sigma }\right)|$ when the sample size is small, mainly due to the unreliable estimates of the sample variances. Various approaches have been proposed for improving the variance estimation in the literature. See, for example, [39, 40, 41, 42], and [43].

To improve DE, we consider the optimal shrinkage estimator in [42], $\begin{array}{r}{\stackrel{ˆ}{\sigma }}_{j}^{2}=\left\{{h}_{p}\left(1\right){s}_{pool}^{2}{\right\}}^{\alpha }\left\{{h}_{1}\left(1\right){s}_{j}^{2}{\right\}}^{1-\alpha },\end{array}$

where ${s}_{pool}^{2}=\prod _{j=1}^{p}\left({s}_{j}^{2}{\right)}^{1/p}$, ${h}_{p}\left(1\right)=\left(\nu /2\right){\left\{\mathrm{\Gamma }\left(\nu /2\right)/\mathrm{\Gamma }\left(\nu /2+1/p\right)\right\}}^{p}$ with $\nu =n-1$, $\mathrm{\Gamma }\left(\cdot \right)$ is the Gamma function, and $\alpha \in \left[0,1\right]$ is the shrinkage parameter. Replacing ${s}_{j}^{2}$ in DE by ${\stackrel{ˆ}{\sigma }}_{j}^{2}$, we have $\begin{array}{r}\stackrel{ˆ}{\theta }=\sum _{j=1}^{p}log{\stackrel{ˆ}{\sigma }}_{j}^{2}={\stackrel{ˆ}{\theta }}_{\left(1\right)}+C,\end{array}$(2)

where $C=log\left\{{h}_{p}^{\alpha p}\left(1\right){h}_{1}^{\left(1-\alpha \right)p}\left(1\right)\right\}$ is a constant.

The estimation structure in eq. (2) shows that the DE estimator, ${\stackrel{ˆ}{\theta }}_{\left(1\right)}$, can be further improved by an additive constant. Specifically, the value ${C}_{0}$ satisfying $E\left({\stackrel{ˆ}{\theta }}_{\left(1\right)}+{C}_{0}\right)=log|\text{diag}\left(\mathrm{\Sigma }\right)|$ is optimal in the sense that ${\stackrel{ˆ}{\theta }}_{\left(1\right)}+{C}_{0}$ minimizes the mean squared error within the family of estimators $\left\{{\stackrel{ˆ}{\theta }}_{\left(1\right)}+C:\text{\hspace{0.17em}}C\in \left(-\mathrm{\infty },\mathrm{\infty }\right)\right\}$.

#### Theorem 1

Let ${s}_{j}^{2}={\sigma }_{j}^{2}{\chi }_{\nu ,j}^{2}/\nu$, where ${\chi }_{\nu ,j}^{2}$ are i.i.d. random variables with a common chi-squared distribution with $\nu$ degrees of freedom, and ${C}_{0}=-p\left\{log\left(2/\nu \right)+\psi \left(\nu /2\right)\right\}$, where $\psi \left(\cdot \right)={\mathrm{\Gamma }}^{\prime }\left(\cdot \right)/\mathrm{\Gamma }\left(\cdot \right)$ is the digamma function. Then for any fixed $\nu >0$, we have

1. ${\stackrel{ˆ}{\theta }}_{\left(1\right)}+{C}_{0}$ is an unbiased estimator of $log|\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left(\mathrm{\Sigma }\right)|$.

2. Assume also that ${\sigma }_{j}^{2}$ are i.i.d. random variables from a common distribution $F$ and $E\left(log{\sigma }_{1}^{2}\right)<\mathrm{\infty }$. Then

$\begin{array}{r}\frac{1}{p}\left({\stackrel{ˆ}{\theta }}_{\left(1\right)}+{C}_{0}-log|\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left(\mathrm{\Sigma }\right)|\right)\stackrel{a.s.}{⟶}0\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{a}\mathrm{s}\text{\hspace{0.17em}}\text{\hspace{0.17em}}p\to \mathrm{\infty },\end{array}$ where $\stackrel{a.s.}{⟶}$ denotes almost sure convergence.

The proof of Theorem 1 is given in the Appendix. By eq. (2) and Theorem 1, we define the second estimator of $\theta$ as $\begin{array}{r}{\stackrel{ˆ}{\theta }}_{\left(2\right)}=\sum _{j=1}^{p}log{s}_{j}^{2}-p\left\{log\left(2/\nu \right)+\psi \left(\nu /2\right)\right\}.\end{array}$

We refer to ${\stackrel{ˆ}{\theta }}_{\left(2\right)}$ as the improved diagonal estimator (IDE).
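A sketch of IDE with the bias correction ${C}_{0}$ from Theorem 1. The digamma function is approximated here by a central difference of `math.lgamma` to avoid a SciPy dependency; the Monte Carlo check below is illustrative only.

```python
import math
import numpy as np

def digamma(x, h=1e-6):
    """psi(x) = Gamma'(x)/Gamma(x) via a central difference of log-Gamma;
    accurate enough here and avoids importing scipy.special.digamma."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def theta_ide(X):
    """Improved diagonal estimator: DE plus the correction
    C0 = -p * (log(2/nu) + psi(nu/2)) with nu = n - 1 (Theorem 1)."""
    n, p = X.shape
    nu = n - 1
    s2 = X.var(axis=0, ddof=1)
    return np.sum(np.log(s2)) - p * (math.log(2.0 / nu) + digamma(nu / 2.0))

# Monte Carlo check of unbiasedness for diag(Sigma) = I (true value 0)
rng = np.random.default_rng(3)
est = np.mean([theta_ide(rng.normal(size=(6, 20))) for _ in range(2000)])
print(abs(est) < 0.5)  # averages close to the true log-determinant 0
```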

## 2.2 Shrinkage estimation

Recall that the sample covariance matrix ${S}_{n}$ is singular when the dimension is larger than the sample size. To overcome the singularity problem, other than the diagonal methods in Section 2.1, one may also estimate the covariance matrix by the following convex combination: $\begin{array}{r}{S}^{\ast }=\delta T+\left(1-\delta \right){S}_{n},\end{array}$

where $T$ is the target matrix, and $\delta \in \left[0,1\right]$ is the shrinkage parameter. Both the target matrix and the shrinkage parameter play an important role in the shrinkage estimation. For instance, if we let $T=\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)$ and $\delta =1$, then ${S}^{\ast }$ reduces to the DE estimator.

The appropriate choice of the target matrix has been extensively studied in the literature. See, for example, [6, 33, 44, 45], and [7] and the references therein. Note that $T$ is often chosen to be positive definite and well-conditioned, and consequently, the final estimate ${S}^{\ast }$ is also guaranteed positive definite and well-conditioned for any dimensionality. As suggested in [33] and [7], we consider a popular target matrix for nonhomogeneous variances: the “diagonal, unequal variance” matrix, i.e., the diagonal sample covariance matrix $\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)$.

We also note that, given the target matrix, the estimation of the shrinkage parameter $\delta$ is crucial to the final estimate. Two main approaches are available: (1) unbiased estimation, which replaces the unknown terms in the optimal shrinkage parameter by their unbiased estimators [33]; and (2) consistent estimation, which replaces the unknown terms with $\left(n,p\right)$-consistent estimators [7]. Taken together with the two target matrices, this yields the four shrinkage methods below for estimating the covariance matrix and, consequently, $\theta$.

Method 3: Unbiased Shrinkage Estimator with $T=I$ (USIE)

Letting the target matrix be $T=I$, [33] proposed an unbiased estimator for the shrinkage parameter, denoted by ${\stackrel{ˆ}{\delta }}_{1}^{\ast }$. This leads to ${S}^{\ast }={\stackrel{ˆ}{\delta }}_{1}^{\ast }I+\left(1-{\stackrel{ˆ}{\delta }}_{1}^{\ast }\right){S}_{n}$. We then define the third estimator of $\theta$ as $\begin{array}{r}{\stackrel{ˆ}{\theta }}_{\left(3\right)}=log|{\stackrel{ˆ}{\delta }}_{1}^{\ast }I+\left(1-{\stackrel{ˆ}{\delta }}_{1}^{\ast }\right){S}_{n}|.\end{array}$(3)

Method 4: Consistent Shrinkage Estimator with $T=I$ (CSIE)

Letting the target matrix be $T=I$, [7] proposed a consistent estimator for the shrinkage parameter, denoted by ${\stackrel{ˆ}{\delta }}_{2}^{\ast }$. This leads to ${S}^{\ast }={\stackrel{ˆ}{\delta }}_{2}^{\ast }I+\left(1-{\stackrel{ˆ}{\delta }}_{2}^{\ast }\right){S}_{n}$. We then define the fourth estimator of $\theta$ as $\begin{array}{r}{\stackrel{ˆ}{\theta }}_{\left(4\right)}=log|{\stackrel{ˆ}{\delta }}_{2}^{\ast }I+\left(1-{\stackrel{ˆ}{\delta }}_{2}^{\ast }\right){S}_{n}|.\end{array}$(4)

Method 5: Unbiased Shrinkage Estimator with $T=\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)$ (USDE)

Letting $T=\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)$, [33] also proposed an unbiased estimator for the shrinkage parameter, denoted by ${\stackrel{ˆ}{\delta }}_{3}^{\ast }$. This leads to ${S}^{\ast }={\stackrel{ˆ}{\delta }}_{3}^{\ast }\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)+\left(1-{\stackrel{ˆ}{\delta }}_{3}^{\ast }\right){S}_{n}$. We then define the fifth estimator of $\theta$ as $\begin{array}{r}{\stackrel{ˆ}{\theta }}_{\left(5\right)}=log|{\stackrel{ˆ}{\delta }}_{3}^{\ast }\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)+\left(1-{\stackrel{ˆ}{\delta }}_{3}^{\ast }\right){S}_{n}|.\end{array}$(5)

Method 6: Consistent Shrinkage Estimator with $T=\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)$ (CSDE)

Letting $T=\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)$, [7] also proposed a consistent estimator for the shrinkage parameter, denoted by ${\stackrel{ˆ}{\delta }}_{4}^{\ast }$. This leads to ${S}^{\ast }={\stackrel{ˆ}{\delta }}_{4}^{\ast }\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)+\left(1-{\stackrel{ˆ}{\delta }}_{4}^{\ast }\right){S}_{n}$. We then define the sixth estimator of $\theta$ as $\begin{array}{r}{\stackrel{ˆ}{\theta }}_{\left(6\right)}=log|{\stackrel{ˆ}{\delta }}_{4}^{\ast }\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({S}_{n}\right)+\left(1-{\stackrel{ˆ}{\delta }}_{4}^{\ast }\right){S}_{n}|.\end{array}$(6)

## 2.3 Sparse estimation

When $p$ is much larger than $n$, the shrinkage methods in Section 2.2 may not achieve a significant improvement over ${S}_{n}$. In such settings, to obtain a good estimate of $\mathrm{\Sigma }$, one may have to impose some structural assumptions such as sparsity on the parameters. Recently, [15] reviewed methods for estimating structured high-dimensional covariance and precision matrices. A typical sparsity assumption is that most of the off-diagonal elements in the covariance matrix are zero. To estimate the covariance matrix under a sparsity condition, various thresholding-based methods have been proposed in the literature that aim to locate the “large” off-diagonal elements. See, for example, [8, 9, 46, 47, 48, 49, 50, 51], and [52]. In particular, the adaptive thresholding estimator proposed by [49] achieves the optimal rate of convergence over a large class of sparse covariance matrices under the spectral norm. Moreover, the adaptive thresholding estimator also attains the optimal convergence rate under Bregman divergence losses over a large parameter class [15, 50]. We therefore take the adaptive thresholding method as a representative sparse estimator and use it to estimate $\theta$, the log-determinant of the covariance matrix.

Method 7: Adaptive Thresholding Estimator (ATE)

Bickel and Levina [8] proposed a universal thresholding method where all entries in the sample covariance matrix are thresholded by a common value $\gamma$. They required that the variances ${\sigma }_{j}^{2}$ are uniformly bounded by a constant $K$, and consequently, the variances of the entries of the sample covariance matrix are also uniformly bounded. However, it was shown that a universal thresholding method is suboptimal over a certain class of sparse covariance matrices.

To improve the method above, [49] proposed an adaptive thresholding estimator for the covariance matrix: $\begin{array}{r}{\stackrel{ˆ}{\mathrm{\Sigma }}}^{\ast }=\left({\stackrel{˜}{\sigma }}_{ij}^{\ast }{\right)}_{p×p}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{w}\mathrm{i}\mathrm{t}\mathrm{h}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\stackrel{˜}{\sigma }}_{ij}^{\ast }={s}_{{\gamma }_{ij}}\left({s}_{ij}\right),\end{array}$

where ${\gamma }_{ij}$ is the corresponding threshold of ${\stackrel{˜}{\sigma }}_{ij}^{\ast }$, and ${s}_{{\gamma }_{ij}}\left(\cdot \right)$ is a generalized thresholding operator [47], which is specified as the soft thresholding throughout our simulations. With the proper ${\gamma }_{ij}$, the estimator ${\stackrel{ˆ}{\mathrm{\Sigma }}}^{\ast }$ adaptively achieves the optimal rate of convergence over a large class of sparse covariance matrices under the spectral norm. Now by ${\stackrel{ˆ}{\mathrm{\Sigma }}}^{\ast }$, the seventh estimator of $\theta$ is $\begin{array}{r}{\stackrel{ˆ}{\theta }}_{\left(7\right)}=log|{\stackrel{ˆ}{\mathrm{\Sigma }}}^{\ast }|.\end{array}$(7)

We refer to ${\stackrel{ˆ}{\theta }}_{\left(7\right)}$ as the adaptive thresholding estimator (ATE).
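A simplified sketch of ATE with soft thresholding. It uses entry-adaptive thresholds of the form ${\gamma }_{ij}=\lambda \sqrt{{\stackrel{ˆ}{\theta }}_{ij}log\left(p\right)/n}$, where ${\stackrel{ˆ}{\theta }}_{ij}$ estimates the variance of the sample covariances; the constant $\lambda =2$ is an illustrative choice, and the thresholded matrix is not guaranteed positive definite (a known caveat of thresholding estimators).

```python
import numpy as np

def soft(x, t):
    """Entrywise soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ate_logdet(X, lam=2.0):
    """Sketch of the adaptive thresholding estimator: soft-threshold entry
    (i, j) of S_n at gamma_ij = lam * sqrt(theta_ij * log(p) / n), keeping
    the diagonal untouched.  Returns log|Sigma_hat|, or NaN if the
    thresholded matrix is not positive definite."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    # theta_ij = (1/n) sum_k (x_ki x_kj - s_ij)^2, the entrywise variances
    theta = np.einsum("ki,kj->ij", Xc**2, Xc**2) / n - S**2
    gamma = lam * np.sqrt(theta * np.log(p) / n)
    Sigma_hat = soft(S, gamma)
    np.fill_diagonal(Sigma_hat, np.diag(S))
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    return logdet if sign > 0 else np.nan

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 20))  # truly diagonal Sigma = I
print(np.isfinite(ate_logdet(X)))
```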

## 2.4 Factor model estimation

The sparsity condition on the covariance matrix assumes that most covariates are uncorrelated with each other. However, this assumption may not be realistic in practice. Recently, under the assumption of conditional sparsity, [54] introduced a principal orthogonal complement thresholding method based on the factor model. In this section, we briefly review their method and then apply it to estimate the log-determinant of the covariance matrix.

Method 8: Principal Orthogonal Complement Thresholding Estimator (POET)

Fan et al. [54] considered the approximate factor model: $\begin{array}{r}{y}_{g}=B{f}_{g}+{u}_{g},\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}g=1,\dots ,G,\end{array}$

where ${y}_{g}=\left({y}_{1g},\dots ,{y}_{pg}{\right)}^{T}$ is the observed response, $B=\left({b}_{1},\dots ,{b}_{p}{\right)}^{T}$ is the loading matrix, ${f}_{g}$ is a $Q×1$ vector of common factors, and ${u}_{g}=\left({u}_{1g},\dots ,{u}_{pg}{\right)}^{T}$ is the error vector. In this model, we can only observe ${y}_{g}$. Let $\begin{array}{r}\mathrm{\Sigma }=B\mathrm{c}\mathrm{o}\mathrm{v}\left({f}_{g}\right){B}^{T}+{\mathrm{\Sigma }}_{u},\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}g=1,\dots ,G,\end{array}$

where ${\mathrm{\Sigma }}_{u}$ is the covariance matrix of ${u}_{g}$. To estimate $\mathrm{\Sigma }$, [54] applied the spectral decomposition on the sample covariance matrix: $\begin{array}{r}{S}_{n}=\sum _{j=1}^{Q}{\stackrel{ˆ}{\lambda }}_{j}{\stackrel{ˆ}{\xi }}_{j}{\stackrel{ˆ}{\xi }}_{j}^{T}+{\stackrel{ˆ}{R}}_{Q},\end{array}$

where ${\stackrel{ˆ}{\lambda }}_{1}\ge {\stackrel{ˆ}{\lambda }}_{2}\ge \dots \ge {\stackrel{ˆ}{\lambda }}_{p}$ are the eigenvalues of ${S}_{n}$, ${\stackrel{ˆ}{\xi }}_{j}$, $j=1,\dots ,p$, are the corresponding eigenvectors, and ${\stackrel{ˆ}{R}}_{Q}=\sum _{j=Q+1}^{p}{\stackrel{ˆ}{\lambda }}_{j}{\stackrel{ˆ}{\xi }}_{j}{\stackrel{ˆ}{\xi }}_{j}^{T}$ is the principal orthogonal complement. For this decomposition, the first $Q$ principal components are kept and the thresholding is applied to ${\stackrel{ˆ}{R}}_{Q}$. Here, the generalized thresholding operator can be used. In addition, [54] also introduced a method to obtain an estimate of $Q$, denoted by $\stackrel{ˆ}{Q}$. Their final estimator of $\mathrm{\Sigma }$ is $\begin{array}{r}{\stackrel{ˆ}{\mathrm{\Sigma }}}_{\stackrel{ˆ}{Q}}=\sum _{j=1}^{\stackrel{ˆ}{Q}}{\stackrel{ˆ}{\lambda }}_{j}{\stackrel{ˆ}{\xi }}_{j}{\stackrel{ˆ}{\xi }}_{j}^{T}+{\stackrel{ˆ}{R}}_{\stackrel{ˆ}{Q}}^{{T}},\end{array}$(8)

where ${\stackrel{ˆ}{R}}_{\stackrel{ˆ}{Q}}^{{T}}$ is the thresholding result of ${\stackrel{ˆ}{R}}_{Q}$. Now by eq. (8), we define the last estimator of $\theta$ as $\begin{array}{r}{\stackrel{ˆ}{\theta }}_{\left(8\right)}=log|{\stackrel{ˆ}{\mathrm{\Sigma }}}_{\stackrel{ˆ}{Q}}|.\end{array}$(9)

We refer to ${\stackrel{ˆ}{\theta }}_{\left(8\right)}$ as the principal orthogonal complement thresholding estimator (POET).
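A sketch of POET with a fixed number of factors $Q$ and a fixed soft threshold. The estimator in [54] chooses both adaptively from the data; the values here are illustrative assumptions.

```python
import numpy as np

def poet_logdet(X, Q, tau=0.1):
    """Sketch of POET: keep the top-Q principal components of S_n and
    soft-threshold the orthogonal complement R_Q (off-diagonals only).
    Q is taken as given; tau is a hypothetical fixed threshold.  Returns
    log|Sigma_hat|, or NaN if the result is not positive definite."""
    S_n = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S_n)         # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]   # sort descending
    low_rank = (vecs[:, :Q] * vals[:Q]) @ vecs[:, :Q].T
    R_Q = S_n - low_rank                     # principal orthogonal complement
    R_thr = np.sign(R_Q) * np.maximum(np.abs(R_Q) - tau, 0.0)
    np.fill_diagonal(R_thr, np.diag(R_Q))    # keep diagonal untouched
    sign, logdet = np.linalg.slogdet(low_rank + R_thr)
    return logdet if sign > 0 else np.nan

# one strong common factor plus idiosyncratic noise
rng = np.random.default_rng(6)
f = rng.normal(size=(200, 1))
B = rng.normal(size=(15, 1))
X = f @ B.T + rng.normal(scale=0.5, size=(200, 15))
print(np.isfinite(poet_logdet(X, Q=1)))
```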

## 3 Simulation studies

In this section, we compare the numerical performance of the aforementioned eight estimators. We consider five different setups. In the first setup, we generate data from the multivariate normal distribution, ${N}_{p}\left(0,\mathrm{\Sigma }\right)$. In the second setup, we generate data from a mixture distribution where the covariance matrix is highly sparse. In the third setup, we simulate data from the log-normal distribution to assess the robustness of the eight methods under heavy-tailed data. In the fourth setup, we consider a special case where the covariance matrix is degenerate and the data are generated from a degenerate multivariate normal distribution. In the final setup, we use a realistic covariance matrix structure obtained from a real data set. To compare these methods, we compute the mean squared error (MSE) as below: $\mathrm{M}\mathrm{S}\mathrm{E}\left(\theta ,\stackrel{ˆ}{\theta }\right)=\frac{1}{Mp}\sum _{m=1}^{M}\left({\stackrel{ˆ}{\theta }}_{m}-\theta {\right)}^{2},$

where $M$ is the number of replications. Throughout the simulations, we take $M=500$.
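The MSE criterion can be computed as below. The harness uses the DE estimator on a toy design (not one of the paper's simulation designs) purely to illustrate the computation.

```python
import numpy as np

def mse(theta_hats, theta, p):
    """MSE(theta, theta_hat) = (1/(M*p)) * sum_m (theta_hat_m - theta)^2,
    i.e. the average squared error further scaled by the dimension p."""
    theta_hats = np.asarray(theta_hats)
    return np.sum((theta_hats - theta) ** 2) / (len(theta_hats) * p)

# M = 500 replications of the DE estimator for Sigma = I_p (true theta = 0)
rng = np.random.default_rng(7)
p, n, M = 20, 10, 500
hats = [np.sum(np.log(rng.normal(size=(n, p)).var(axis=0, ddof=1)))
        for _ in range(M)]
err = mse(hats, 0.0, p)
print(np.isfinite(err) and err > 0)  # -> True
```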

Figure 1

Log MSEs for data from normal distribution with $p$=50. The sample size ranges from 5 to 50. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

Figure 2

Log MSEs for data from normal distribution with $p$=300. The sample size ranges from 10 to 200. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

Figure 3

Log MSEs for data from normal distribution with $p$=300, and $\rho$ ranging from 0 to 0.9. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

## 3.1 Normal data

In this setup, we consider a block diagonal structure for the covariance matrix. This structure is widely adopted in the literature, e.g., [55] and [56]. Specifically, we let ${\mathrm{\Sigma }}_{2}={D}^{1/2}R\left(\rho \right){D}^{1/2},$

where $D=\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({\sigma }_{1}^{2},\dots ,{\sigma }_{p}^{2}\right)$ with ${\sigma }_{j}^{2}$ being i.i.d. from the distribution ${\chi }_{5}^{2}/5$, and $R$ follows a block diagonal structure with alternating $q×q$ blocks: $\begin{array}{r}R\left(\rho \right)=\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}{\left({\mathrm{\Sigma }}_{\rho },{\mathrm{\Sigma }}_{-\rho },{\mathrm{\Sigma }}_{\rho },{\mathrm{\Sigma }}_{-\rho },\dots \right)}_{p×p}.\end{array}$

In our simulations, we consider ${\mathrm{\Sigma }}_{\rho }=\left({\sigma }_{ij}\left(\rho \right){\right)}_{q×q}$ with ${\sigma }_{ij}\left(\rho \right)={\rho }^{|i-j|}$ for $1\le i,j\le q$. In addition, we set $\rho =0$, $0.3$, $0.6$ or $0.9$, to represent different levels of dependence, and $\left(p,q\right)=\left(50,5\right)$ or $\left(300,10\right)$, respectively.
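The covariance structure above can be constructed as follows (a sketch; the alternating $±\rho$ AR(1)-type blocks follow the displayed $R\left(\rho \right)$, and the function name is illustrative):

```python
import numpy as np

def build_sigma(p, q, rho, rng):
    """Block-diagonal Sigma = D^{1/2} R(rho) D^{1/2}: q x q blocks with
    sigma_ij = (+/-rho)^|i-j|, signs alternating across blocks, and
    heterogeneous variances sigma_j^2 ~ chi2_5 / 5."""
    assert p % q == 0
    idx = np.arange(q)
    block = lambda r: r ** np.abs(idx[:, None] - idx[None, :])
    R = np.zeros((p, p))
    for b in range(p // q):
        r = rho if b % 2 == 0 else -rho
        R[b * q:(b + 1) * q, b * q:(b + 1) * q] = block(r)
    d = rng.chisquare(5, size=p) / 5.0
    D_half = np.sqrt(d)
    return D_half[:, None] * R * D_half[None, :]

rng = np.random.default_rng(8)
Sigma = build_sigma(p=50, q=5, rho=0.6, rng=rng)
sign, logdet = np.linalg.slogdet(Sigma)  # the target theta = log|Sigma|
print(sign > 0 and np.isfinite(logdet))  # -> True: Sigma is positive definite
```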

Table 1

MSEs of $\stackrel{ˆ}{\theta }$ for data from normal distribution with $\rho =0.3,0.6,0.9$, $n=10,40$ and $p=50,100$, respectively. The number of factors $Q$ is either fixed or estimated by the method in [54], denoted by $\stackrel{ˆ}{Q}$. All MSEs are rounded to integers. The minimum MSE in each row is highlighted.

Figures 1 and 2 display the log(MSE) of the eight methods for different levels of dependence, dimension, and sample size. From these figures, we have the following findings. When the covariates are uncorrelated, IDE gives the best performance under a high dimension (e.g., $p=300$). However, if the dimension is not large (e.g., $p=50$) and the covariates are uncorrelated or weakly correlated, shrinking the covariance matrix toward an identity matrix leads to better performance under a small sample size. This is because when the sample size is small, the variances of the entries of the sample covariance matrix are large. CSIE and USIE stabilize both diagonal and off-diagonal entries, and at the same time the identity target has an explicit structure that requires little data to fit; consequently, the resulting estimators achieve a good bias–variance tradeoff. In addition, when the correlation and dimension are both large, imposing additional structural assumptions is necessary. In this situation, ATE and POET turn out to be the best two methods among the eight unless the sample size is relatively small. When the sample size is small, the pattern of ATE is very similar to that of DE. When the sample size and dimension are both large, ATE outperforms all other methods except POET.

Figure 3 displays the performance of the eight methods for different levels of dependence with $p=300$. The pattern is consistent with Figure 2. In particular, when the correlation and sample size are large, the performance of POET is satisfactory. From Figures 1 and 2, however, we note that the log(MSE) of POET tends to oscillate as the sample size increases. This may be because POET depends on the estimated number of factors $K$. In [54], the authors used a consistent estimator for $K$ and showed that POET is robust to an over-estimated number of factors under the spectral norm. Our simulations in Table 1, however, show that this robustness for estimating the covariance matrix may no longer hold when the purpose is to estimate the determinant. In particular, for small sample sizes, either an over-estimated or an under-estimated $K$ leads to a large bias in the determinant estimator.

Figure 4

Log MSEs for data from mixture normal distribution with $p$=50, and $\rho$ ranging from 0 to 0.9. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

Figure 5

Log MSEs for data from mixture normal distribution with $p$=300, and $\rho$ ranging from 0 to 0.9. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

## 3.2 Mixture normal data

In this setup, we consider a mixture model where the random vectors are generated from $X\sim {\alpha }_{1}{f}_{1}\left(X\right)+{\alpha }_{2}{f}_{2}\left(X\right),$

where ${f}_{1}\left(X\right)$ and ${f}_{2}\left(X\right)$ are the density functions of ${N}_{p}\left({\mu }_{3},{\mathrm{\Sigma }}_{3}\right)$ and ${N}_{p}\left({\mu }_{4},{\mathrm{\Sigma }}_{4}\right)$, respectively. For the covariance matrices, we consider a sparse block diagonal structure as follows: ${\mathrm{\Sigma }}_{3}={D}^{1/2}R\left(\rho \right){D}^{1/2}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{a}\mathrm{n}\mathrm{d}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\mathrm{\Sigma }}_{4}={D}^{1/2}R\left(-\rho \right){D}^{1/2},$

where $D=\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({\sigma }_{1}^{2},\dots ,{\sigma }_{p}^{2}\right)$ with ${\sigma }_{j}^{2}$ being i.i.d. from the distribution $\left(1/5\right){\chi }_{5}^{2}$, and $R\left(\rho \right)$ being the same as in Setup II. For simplicity, we set ${\alpha }_{1}={\alpha }_{2}=1/2$ and ${\mu }_{3}={\mu }_{4}=0$. Under this setting, the covariance matrix of $X$ simplifies to $\left({\mathrm{\Sigma }}_{3}+{\mathrm{\Sigma }}_{4}\right)/2$, a highly sparse matrix whose odd-lag off-diagonal entries within the diagonal blocks are zero, since ${\rho }^{k}$ and $\left(-\rho {\right)}^{k}$ cancel for odd $k$. We set $\left(p,q\right)=\left(50,5\right)$ or $\left(300,10\right)$, and $\rho =0$, $0.3$, $0.6$ or $0.9$.
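A sketch of this data-generating process (the tiling of $R\left(\rho \right)$ into $p/q$ blocks and all variable names are our reading of Setup II), which also exhibits the odd-lag cancellation numerically:

```python
import numpy as np

def ar1_block_diag(p, q, r):
    """Block-diagonal matrix of p // q blocks with entries r^|i - j| (our R(r))."""
    idx = np.arange(q)
    block = r ** np.abs(idx[:, None] - idx[None, :])
    out = np.zeros((p, p))
    for k in range(p // q):
        out[k * q:(k + 1) * q, k * q:(k + 1) * q] = block
    return out

rng = np.random.default_rng(1)
p, q, rho = 50, 5, 0.6
d = rng.chisquare(5, size=p) / 5.0          # sigma_j^2 i.i.d. from (1/5) chi^2_5
D_half = np.diag(np.sqrt(d))
Sigma3 = D_half @ ar1_block_diag(p, q, rho) @ D_half
Sigma4 = D_half @ ar1_block_diag(p, q, -rho) @ D_half
Sigma_mix = (Sigma3 + Sigma4) / 2.0         # covariance of the equal-weight mixture

# sampling: pick a component with probability 1/2, then draw from that normal
comp = rng.integers(2)
x = rng.multivariate_normal(np.zeros(p), Sigma3 if comp == 0 else Sigma4)
```

The odd off-diagonals of `Sigma_mix` within each block vanish exactly, while even lags survive at their common value.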

Figure 6

Log MSEs for data from heavy-tailed distribution with $p$=50, and $\rho$ ranging from 0 to 0.9. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

Figure 7

Log MSEs for data from heavy-tailed distribution with $p$=300, and $\rho$ ranging from 0 to 0.9. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

Figures 4 and 5 display the log(MSE) of the eight methods under different levels of dependence and sample size. When the sample size is large and the covariates are uncorrelated, IDE gives the best performance. When the sample size is small and the dimension is not large (e.g., $n=5,p=50$), shrinking the covariance matrix toward an identity matrix (e.g., USIE and CSIE) outperforms the other methods unless the correlation is very large (e.g., $\rho =0.9$). However, when the sample size and dimension are both large, the shrinkage methods become suboptimal. Instead, if the correlation is also large (e.g., $\rho =0.6$), ATE and POET outperform the other methods in most settings. As aforementioned, the performance of POET is not stable and may not be satisfactory when the sample size is not large.

## 3.3 Heavy-tailed data

In this setup, we simulate heavy-tailed data from a log-normal distribution, $lnN\left(\mu ,{\sigma }^{2}\right)$, whose mean and variance are ${e}^{\mu +{\sigma }^{2}/2}$ and $\left({e}^{{\sigma }^{2}}-1\right){e}^{2\mu +{\sigma }^{2}}$, respectively. First, we generate $n$ independent random vectors ${Z}_{i}=\left({z}_{i1},\dots ,{z}_{ip}{\right)}^{T}$, where all the components of ${Z}_{i}$ are sampled independently from $lnN\left(0,1\right)$. Let ${X}_{i}={\mathrm{\Sigma }}^{1/2}{Z}_{i}^{\ast }$ with ${Z}_{i}^{\ast }=\left({z}_{i1}-{e}^{1/2},\dots ,{z}_{ip}-{e}^{1/2}{\right)}^{T}/\left\{e\left(e-1\right){\right\}}^{1/2}$, where $\mathrm{\Sigma }$ is a $p×p$ positive definite matrix. Consequently, the mean vector and covariance matrix of ${X}_{i}$ are ${0}_{p×1}$ and ${\mathrm{\Sigma }}_{p×p}$, respectively. For the covariance matrix, we consider the block diagonal structure described in Section 3.1. We set $\left(p,q\right)=\left(50,5\right)$ or $\left(300,10\right)$, and $\rho =0$, $0.3$, $0.6$ or $0.9$.
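The standardization of ${Z}_{i}^{\ast }$ uses the $lnN\left(0,1\right)$ moments ${e}^{1/2}$ and $e\left(e-1\right)$ given above. A sketch of this generator, with an identity placeholder for $\mathrm{\Sigma }$ (the paper uses the block structure of Section 3.1):

```python
import numpy as np

n, p = 40, 50
rng = np.random.default_rng(2)

# lnN(0, 1) has mean e^{1/2} and variance e(e - 1); centre and scale accordingly
z = rng.lognormal(mean=0.0, sigma=1.0, size=(n, p))
z_star = (z - np.exp(0.5)) / np.sqrt(np.e * (np.e - 1.0))

# X_i = Sigma^{1/2} Z_i^*; Sigma is an identity stand-in here
Sigma = np.eye(p)
w, V = np.linalg.eigh(Sigma)
Sigma_half = V @ np.diag(np.sqrt(w)) @ V.T
X = z_star @ Sigma_half.T
```

Each row of `X` then has mean zero and covariance $\mathrm{\Sigma }$, but with much heavier tails than a normal draw.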

Figures 6 and 7 display the log(MSE) of the eight methods under different levels of dependence and sample size. When the dimension and correlation are both small, USIE and CSIE outperform the other methods. The reason is similar to the discussion in Section 3.1: heavy-tailed data may lead to unstable estimates of the entries of $\mathrm{\Sigma }$, and shrinking toward a simple identity target, which requires little data to fit, stabilizes the sample covariance matrix. In addition, as shown in Figure 7, when the dimension is large and the correlation is not small, ATE and POET outperform the other methods unless the sample size is small. Finally, we also note that IDE fails to provide a satisfactory performance even when the covariates are uncorrelated. As demonstrated in Theorem 1, the IDE estimator is derived under the normal distribution and may not be robust to heavy-tailed data.

## 3.4 Degenerate normal data

To further investigate the performance of the eight methods, we consider a degenerate setting in which the positive definiteness assumption on the covariance matrix is violated. Note that this setting is often overlooked in the literature. To construct a singular covariance matrix, we define the affine transformation $C$ as $\begin{array}{r}C={\left(\begin{array}{cccccccc}1& 0& \cdots & \cdots & \cdots & \cdots & \cdots & 0\\ 0& 1& 0& \ddots & \ddots & \ddots & \ddots & ⋮\\ ⋮& 0& 1& 0& \ddots & \ddots & \ddots & ⋮\\ ⋮& \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & ⋮\\ ⋮& \ddots & \ddots & \ddots & 0& 1& 0& ⋮\\ ⋮& \ddots & \ddots & \ddots & \ddots & 0& 1& 0\\ 0& \cdots & 0& 1/\sqrt{p-4}& 1/\sqrt{p-4}& \cdots & 1/\sqrt{p-4}& 0\end{array}\right)}_{p×p}.\end{array}$

We then apply the affine transformation to the covariance matrix in Setup II and form $\begin{array}{r}{\mathrm{\Sigma }}_{5}=C{\mathrm{\Sigma }}_{2}{C}^{T}.\end{array}$
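A sketch of this transformation; the exact placement of the $1/\sqrt{p-4}$ entries in the last row is our reading of the display, and an identity matrix stands in for ${\mathrm{\Sigma }}_{2}$. Since the last row of $C$ has a zero diagonal entry and lies in the span of the other rows, $C$ is singular and so is ${\mathrm{\Sigma }}_{5}$:

```python
import numpy as np

p = 50
C = np.eye(p)
C[-1, :] = 0.0
C[-1, 3:p - 1] = 1.0 / np.sqrt(p - 4)   # p - 4 equal entries; the exact column
                                        # range is our reading of the display

# apply to any positive definite Sigma2 (identity used as a stand-in here)
Sigma2 = np.eye(p)
Sigma5 = C @ Sigma2 @ C.T
```

Because $\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}\left(C\right)=p-1$, we get $|{\mathrm{\Sigma }}_{5}|=|C{|}^{2}|{\mathrm{\Sigma }}_{2}|=0$.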

Figure 8

Log MSEs for data from degenerate normal distribution with $p$=50. The sample size ranges from 5 to 50. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

It is obvious that $|{\mathrm{\Sigma }}_{5}|=0$ since $|C|=0$. We set $\left(p,q\right)=\left(50,5\right)$, and $\rho =0$, $0.3$, $0.6$ or $0.9$. Note that the log-determinant of ${\mathrm{\Sigma }}_{5}$ is negative infinity. Hence, for this degenerate setting, the MSE is defined on the determinant rather than on the log-determinant. Specifically, it is $\mathrm{M}\mathrm{S}\mathrm{E}\left({e}^{\theta },{e}^{\stackrel{ˆ}{\theta }}\right)=\frac{1}{Mp}\sum _{m=1}^{M}{\left({e}^{{\stackrel{ˆ}{\theta }}_{m}}-{e}^{\theta }\right)}^{2}.$

Figure 8 shows the log(MSE) of all eight methods for different levels of dependence and sample size. We can see that the simulation results differ from those in the previous three setups. POET gives the best performance among the eight methods. In addition, we note that, under this degenerate setting, POET performs extremely well when the sample size is very small. We explore possible reasons for this phenomenon in the next paragraph.

To estimate $\mathrm{\Sigma }$, [54] applied the spectral decomposition on the sample covariance matrix: ${S}_{n}=\sum _{j=1}^{Q}{\stackrel{ˆ}{\lambda }}_{j}{\stackrel{ˆ}{\xi }}_{j}{\stackrel{ˆ}{\xi }}_{j}^{T}+{\stackrel{ˆ}{R}}_{Q}.$

If the sample size is much smaller than the dimension $p$, most eigenvalues of ${S}_{n}$ are zero. As a result, ${\stackrel{ˆ}{R}}_{Q}$, the principal orthogonal complement of the largest $Q$ eigenvalues, is nearly a zero matrix. Consequently, the final POET estimator, ${\stackrel{ˆ}{\mathrm{\Sigma }}}_{\stackrel{ˆ}{Q}}=\sum _{j=1}^{\stackrel{ˆ}{Q}}{\stackrel{ˆ}{\lambda }}_{j}{\stackrel{ˆ}{\xi }}_{j}{\stackrel{ˆ}{\xi }}_{j}^{T}+{\stackrel{ˆ}{R}}_{\stackrel{ˆ}{Q}}^{{T}}$, tends to be highly degenerate when the sample size is small.
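The rank argument can be checked directly: with $n\ll p$ the sample covariance has rank at most $n-1$, so after removing the top $Q$ eigencomponents almost nothing remains. A sketch (variable names ours):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, Q = 5, 50, 3
X = rng.standard_normal((n, p))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (n - 1)              # sample covariance; its rank is at most n - 1

lam, xi = np.linalg.eigh(S)
lam, xi = lam[::-1], xi[:, ::-1]     # sort eigenvalues in decreasing order
low_rank = (xi[:, :Q] * lam[:Q]) @ xi[:, :Q].T
R_Q = S - low_rank                   # principal orthogonal complement
```

Here `R_Q` has rank at most $n-1-Q$, which for $n=5$ and $Q=3$ leaves essentially nothing to threshold.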

Finally, it is noteworthy that when the correlation is strong, the log(MSE) of POET also fluctuates as the sample size increases. This again confirms that both the correlation and the sample size have a large impact on the performance of POET.

## 3.5 Real data

In this setup, we generate a realistic covariance matrix from the Myeloma data [57], a real microarray data set including a total of 54,675 genes, with 351 samples in the first group and 208 samples in the second group. To generate the covariance matrix, we first select $100$ genes randomly from the first group and then compute the sample covariance matrix using the selected genes, denoted by ${\mathrm{\Sigma }}_{r}$. Next, to evaluate the performance of the estimators under different levels of dependence, we follow [58] and define the true covariance matrix as ${\mathrm{\Sigma }}_{1}=\left(1-\rho \right)\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left({\mathrm{\Sigma }}_{r}\right)+\rho {\mathrm{\Sigma }}_{r},$

where $\rho$ controls the level of dependence. We set $\rho =0$, $1/3$, $2/3$ or $1$. Note that $\rho =0$ corresponds to a diagonal covariance matrix, and $\rho =1$ treats the generated sample covariance matrix as the true covariance matrix.
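A sketch of this dependence family (the function name is ours, and a random Wishart-type matrix stands in for the gene-based ${\mathrm{\Sigma }}_{r}$):

```python
import numpy as np

def dependence_family(sigma_r, rho):
    """(1 - rho) * diag(sigma_r) + rho * sigma_r, interpolating dependence levels."""
    return (1.0 - rho) * np.diag(np.diag(sigma_r)) + rho * sigma_r

rng = np.random.default_rng(5)
G = rng.standard_normal((30, 100))
sigma_r = G.T @ G / 30        # stand-in for the sample covariance of 100 genes
```

At $\rho =0$ the family reduces to the diagonal of ${\mathrm{\Sigma }}_{r}$, and at $\rho =1$ it returns ${\mathrm{\Sigma }}_{r}$ itself; intermediate $\rho$ scales every off-diagonal entry by $\rho$.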

Figure 9 shows the log(MSE) of the eight methods for different levels of dependence and sample size. The comparison results are summarized as follows. When the sample size and correlation are both small, the methods that shrink the covariance matrix toward the identity matrix (e.g., USIE and CSIE) perform well. When the covariates are uncorrelated and the sample size is large, IDE has the best performance. In addition, when the sample size is large and the correlation is moderate (e.g., $n=80$ and $\rho =2/3$), shrinking the sample covariance matrix toward a diagonal target matrix (e.g., USDE and CSDE) performs well. When the correlation and sample size are both large, ATE outperforms or is at least comparable to USDE and CSDE. Finally, POET is not stable and is very sensitive to both the correlation and the sample size. When the correlation and sample size are not large, POET may fail to provide a satisfactory performance owing to a much larger bias than the other methods.

Figure 9

Log MSEs for real data with $p$=100. The sample size ranges from 10 to 80. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

## 4 Conclusion

In this section, we summarize the main findings from the comparison results and provide some practical guidelines for researchers.

1. Diagonal estimation

The diagonal estimator, DE, is the simplest method for estimating the determinant of a high-dimensional covariance matrix. It assumes that all covariates are uncorrelated. For independent normal data, IDE is an unbiased estimator of $log|\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left(\mathrm{\Sigma }\right)|$ and also provides the best performance, especially when the dimension is large. For such settings, IDE can be recommended for estimating the determinant of a high-dimensional covariance matrix. We note, however, that IDE is not robust and may perform unsatisfactorily when the independent normal assumption is violated.

Table 2

The time consumption of computing $\stackrel{ˆ}{\theta }$ with DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively. In ATE and POET, the tuning parameter was selected by 5-fold cross-validation. The data are generated as described in Section 3.1. Timings (in seconds) of 10 runs on an Intel Core(TM) 3.20 GHz processor.

2. Shrinkage estimation

For the shrinkage estimation, different choices of the target matrix and shrinkage parameter result in different performances for the determinant estimation. In general, when the dimension is not large (e.g., $p=50$), shrinkage toward an identity target matrix (e.g., CSIE and USIE) performs well under small sample sizes and weak correlation. This pattern is more evident for heavy-tailed data. With a diagonal target matrix, CSDE, the consistent estimator of [7], performs similarly to USDE. However, CSDE and USDE are seldom the best methods, especially when the sample size is not large.

For the shrinkage estimators, the optimal shrinkage intensity can be specified without any further tuning parameters. Consequently, time-consuming procedures such as cross-validation or the bootstrap can be avoided. Table 2 shows the computational time of the eight methods. As we can see, the shrinkage methods are much faster than ATE and POET. More importantly, if the sample size is as small as $n=5$ or $10$, selecting the tuning parameters in ATE and POET by cross-validation may result in a large bias. In this situation, the shrinkage estimators (e.g., shrinkage toward an explicit target matrix) can be very attractive. Nevertheless, as the sample size increases or the correlation becomes strong, the performance of the shrinkage methods may not be as competitive as that of the sparse method and the factor model method.

3. Sparse estimation

ATE shows robust behavior across our settings. Specifically, when the sample size is not very small, ATE performs better than or comparably to the other seven methods under various data structures and different levels of dependence. In practice, if the sample size is not very small and we have no prior information about the dependence level of the covariates, the sparse estimator can be recommended for estimating the determinant of a high-dimensional covariance matrix.

As shown in the simulations, when the sample size is very small, the performance of ATE is not as attractive as that of the shrinkage estimators or even the diagonal estimators. One possible reason is that ATE requires an adaptive thresholding parameter in practice; when the sample size is very small, the proposed cross-validation method may not provide a reliable estimate of the optimal threshold value.

4. Factor model estimation

The factor model estimation, POET, is very attractive for strongly correlated data sets when the sample size is not small. [54] assumed that the data are weakly correlated after extracting the common factors, which themselves can induce high levels of dependence among the covariates. This implies that POET may perform well if the data are strongly correlated. Note also that POET can select $K=0$ automatically if the true covariance matrix is sparse; their method then degenerates to a sparse estimator such as the hard thresholding estimator in [8] or ATE in [49].

POET, however, depends on the number of factors $K$, which is unknown in practice. To investigate the impact of the number of factors under different sample sizes and different levels of dependence, we simulated the MSE of POET for the log-determinant of the covariance matrix under Setup II. Results from Table 1 show that $K$ has a large impact on the determinant estimation. When the correlation is strong, $\stackrel{ˆ}{K}$, a consistent estimator of $K$, usually leads to a large MSE. [54] demonstrated that POET is robust to over-estimated and sensitive to under-estimated factors. For finite sample sizes, they suggested choosing a relatively large $K$ (e.g., not less than 8). However, our simulation studies showed that the robustness for estimating the covariance matrix may no longer hold for estimating the determinant. In particular, for small sample sizes, both under-estimated and over-estimated factors degrade the performance of POET. In view of this, we believe that future research is needed on selecting the optimal $K$ when the factor model method is applied to estimate the determinant of the covariance matrix.

To conclude, the sample size, the dependence level and the dimension of the data set have a great impact on the accuracy of estimation. In practice, we may need to select an appropriate estimation method according to the sample size and the prior information on the correlation structure of the covariates. When such prior information is not available, we recommend using ATE [49] to estimate the determinant of a high-dimensional covariance matrix, as it is robust to various correlations and data structures.

## 5 Discussion

In this paper, we have compared a total of eight methods for estimating the log-determinant of a high-dimensional covariance matrix. The performance of the eight methods depends on the sample size, the dependence structure and the dimension of the data. When the sample size is not small, we note that ATE [49] is always able to provide an average or above-average performance among the eight methods. Hence, if there is little prior information about the structure of the covariance matrix, we recommend using ATE to estimate the log-determinant $\theta$, or GV, in practice. In terms of computational time, the shrinkage methods are more convenient than ATE and POET because the latter two methods need to select the penalty parameters via cross-validation.

Since the log-determinant of a covariance matrix is a scalar, the two-step procedure may not provide the best estimation for $\theta$. One possible future direction is to circumvent the full covariance matrix estimation and estimate the log-determinant directly. Note that $log|\mathrm{\Sigma }|=\mathrm{t}\mathrm{r}\left(log\mathrm{\Sigma }\right)$, which is essentially the sum of the log-eigenvalues of $\mathrm{\Sigma }$. This suggests that random matrix theory or spectrum analysis may provide feasible solutions for estimating the log-determinant more accurately. The comparison study in this paper may also serve as a proxy to assess the performance of the covariance matrix estimation. Specifically, from a perspective of the loss function, if we define the loss function as $\begin{array}{r}\mathrm{L}\mathrm{o}\mathrm{s}\mathrm{s}\left(\stackrel{ˆ}{\mathrm{\Sigma }},\mathrm{\Sigma }\right)=\left(log|\stackrel{ˆ}{\mathrm{\Sigma }}|-log|\mathrm{\Sigma }|{\right)}^{2}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{o}\mathrm{r}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{L}\mathrm{o}\mathrm{s}\mathrm{s}\left(\stackrel{ˆ}{\mathrm{\Sigma }},\mathrm{\Sigma }\right)=\left(|\stackrel{ˆ}{\mathrm{\Sigma }}|-|\mathrm{\Sigma }|{\right)}^{2},\end{array}$

then the simulations conducted in Section 3 essentially provide a comparison of the eight methods for estimating $\mathrm{\Sigma }$ rather than $\theta$. Of course, we do not claim that the above loss functions should always be recommended. On the contrary, other popular criteria for evaluating the covariance matrix estimation are also available in the literature. For instance, letting $L$ be the likelihood function and $\stackrel{ˆ}{L}$ the corresponding estimate, we may use the distance between the log-likelihood and the estimated log-likelihood as a criterion: $D\left(L,\stackrel{ˆ}{L}\right)=\left\{log\left(L\right)-log\left(\stackrel{ˆ}{L}\right){\right\}}^{2}.$

In addition, we can also consider any of the following loss functions:

• $\mathrm{L}\mathrm{o}\mathrm{s}\mathrm{s}\left(\stackrel{ˆ}{\mathrm{\Sigma }},\mathrm{\Sigma }\right)=\parallel \stackrel{ˆ}{\mathrm{\Sigma }}-\mathrm{\Sigma }{\parallel }_{2}=\sqrt{{\lambda }_{max}\left\{\left(\stackrel{ˆ}{\mathrm{\Sigma }}-\mathrm{\Sigma }{\right)}^{T}\left(\stackrel{ˆ}{\mathrm{\Sigma }}-\mathrm{\Sigma }\right)\right\}}$, where ${\lambda }_{max}\left(\cdot \right)$ denotes the maximum eigenvalue [46, 47, 59].

• $\mathrm{L}\mathrm{o}\mathrm{s}\mathrm{s}\left(\stackrel{ˆ}{\mathrm{\Sigma }},\mathrm{\Sigma }\right)=\parallel \stackrel{ˆ}{\mathrm{\Sigma }}-\mathrm{\Sigma }{\parallel }_{F}=\sqrt{\sum _{i,j}\left({\stackrel{ˆ}{\sigma }}_{ij}-{\sigma }_{ij}{\right)}^{2}}$, where $\mathrm{\Sigma }=\left({\sigma }_{ij}{\right)}_{p×p}$ and $\stackrel{ˆ}{\mathrm{\Sigma }}=\left({\stackrel{ˆ}{\sigma }}_{ij}{\right)}_{p×p}$ [49, 54].

• $\mathrm{L}\mathrm{o}\mathrm{s}\mathrm{s}\left(\stackrel{ˆ}{\mathrm{\Sigma }},\mathrm{\Sigma }\right)=\parallel \stackrel{ˆ}{\mathrm{\Sigma }}-\mathrm{\Sigma }{\parallel }_{max}=\underset{i,j}{max}|{\stackrel{ˆ}{\sigma }}_{ij}-{\sigma }_{ij}|$ [54].
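These three losses are straightforward to compute; a sketch (function names ours), where the spectral loss is the largest singular value of the difference:

```python
import numpy as np

def spectral_loss(est, sigma):
    """Operator (spectral) norm of the difference."""
    d = est - sigma
    return np.sqrt(np.max(np.linalg.eigvalsh(d.T @ d)))

def frobenius_loss(est, sigma):
    """Entrywise root-sum-of-squares (Frobenius norm) of the difference."""
    return np.sqrt(np.sum((est - sigma) ** 2))

def max_loss(est, sigma):
    """Largest absolute entrywise deviation."""
    return np.max(np.abs(est - sigma))
```

For symmetric differences, `spectral_loss` agrees with `np.linalg.norm(est - sigma, 2)`.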

Further research is needed to investigate which loss function provides the best criterion for evaluating the estimation methods of the covariance matrix.

Finally, it is noteworthy that there is another category of publications in the literature on calculating the log-determinant of the covariance matrix [53, 60, 61, 62, 63, 64, 65]. We point out that they are very different from the study in our paper. Specifically, these papers assume that the covariance matrix $\mathrm{\Sigma }$ is known; yet, when the dimension is very large, the canonical methods (e.g., the Cholesky decomposition) for computing $log|\mathrm{\Sigma }|$ require a total of $O\left({p}^{3}\right)$ operations and may not be feasible in practice. The above papers have proposed more efficient algorithms, including random matrix theory and spectrum analysis, for fast computation of $log|\mathrm{\Sigma }|$.
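The identity $log|\mathrm{\Sigma }|=\mathrm{t}\mathrm{r}\left(log\mathrm{\Sigma }\right)=\sum _{i}log{\lambda }_{i}$ underlying these spectrum-based algorithms can be verified numerically against a direct determinant computation:

```python
import numpy as np

sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# sum of log-eigenvalues of a positive definite matrix ...
logdet_spectrum = np.sum(np.log(np.linalg.eigvalsh(sigma)))
# ... equals the log-determinant from a factorization-based routine
sign, logdet = np.linalg.slogdet(sigma)
```

For this $2×2$ example both routes give $log\left(2\cdot 1-{0.5}^{2}\right)=log1.75$.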

## A proof of Theorem 1

(1) From ${s}_{j}^{2}={\sigma }_{j}^{2}{\chi }_{\nu ,j}^{2}/\nu$, we have $log{s}_{j}^{2}=log{\sigma }_{j}^{2}+log\left({\chi }_{\nu ,j}^{2}/\nu \right)$. Then, $\sum _{j=1}^{p}log{s}_{j}^{2}=\sum _{j=1}^{p}log{\sigma }_{j}^{2}+\sum _{j=1}^{p}log{\chi }_{\nu ,j}^{2}-plog\nu$. Further, since $E\left(log{\chi }_{\nu ,j}^{2}\right)=log2+\psi \left(\nu /2\right)$, $\begin{array}{r}E\left(\sum _{j=1}^{p}log{s}_{j}^{2}\right)=\sum _{j=1}^{p}log{\sigma }_{j}^{2}+p\left\{log2+\psi \left(\nu /2\right)\right\}-plog\nu .\end{array}$

This leads to $\begin{array}{rcl}E\left\{{\stackrel{ˆ}{\theta }}_{\left(1\right)}+{C}_{0}\right\}& =& E\left(\sum _{j=1}^{p}log{s}_{j}^{2}\right)-p\left\{log2+\psi \left(\nu /2\right)\right\}+plog\nu \\ & =& \sum _{j=1}^{p}log{\sigma }_{j}^{2}=log|\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left(\mathrm{\Sigma }\right)|.\end{array}$

Hence, ${\stackrel{ˆ}{\theta }}_{\left(1\right)}+{C}_{0}$ is an unbiased estimator of $log|\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left(\mathrm{\Sigma }\right)|$.
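The centering constant in part (1) rests on $E\left(log{\chi }_{\nu }^{2}\right)=log2+\psi \left(\nu /2\right)$. For even $\nu$ this can be checked by Monte Carlo using only the closed form $\psi \left(m\right)={H}_{m-1}-\gamma$ (the choices $\nu =10$ and 200,000 draws are illustrative):

```python
import math
import numpy as np

nu = 10
gamma = 0.5772156649015329                      # Euler-Mascheroni constant
# psi(nu / 2) = psi(5) = H_4 - gamma for this even nu
psi_half_nu = sum(1.0 / k for k in range(1, nu // 2)) - gamma
expected = math.log(2.0) + psi_half_nu          # E[log chi^2_nu]

rng = np.random.default_rng(0)
mc = np.log(rng.chisquare(nu, size=200000)).mean()
```

The Monte Carlo average `mc` matches `expected` up to simulation error, confirming the unbiasedness correction.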

(2) Since $E\left(log{\sigma }_{1}^{2}\right)<\mathrm{\infty }$, the strong law of large numbers gives $\begin{array}{r}\frac{1}{p}\sum _{j=1}^{p}log{\sigma }_{j}^{2}\stackrel{a.s.}{⟶}E\left(log{\sigma }_{1}^{2}\right)\text{\hspace{0.17em}}\text{\hspace{0.17em}}as\text{\hspace{0.17em}}p\to \mathrm{\infty }.\end{array}$

Since $E\left(log{s}_{1}^{2}\right)=E\left\{E\left(log{s}_{1}^{2}|{\sigma }_{1}^{2}\right)\right\}=E\left(log{\sigma }_{1}^{2}\right)+log\left(2/\nu \right)+\psi \left(\nu /2\right)$, we have $\begin{array}{r}\frac{1}{p}\sum _{j=1}^{p}log{s}_{j}^{2}-log\left(2/\nu \right)-\psi \left(\nu /2\right)\stackrel{a.s.}{⟶}E\left(log{\sigma }_{1}^{2}\right)\text{\hspace{0.17em}}\text{\hspace{0.17em}}as\text{\hspace{0.17em}}p\to \mathrm{\infty }.\end{array}$

Combining the above two results yields $\begin{array}{r}\frac{1}{p}\sum _{j=1}^{p}log{s}_{j}^{2}-log\left(2/\nu \right)-\psi \left(\nu /2\right)-\frac{1}{p}\sum _{j=1}^{p}log{\sigma }_{j}^{2}\stackrel{a.s.}{⟶}0\text{\hspace{0.17em}}\text{\hspace{0.17em}}as\text{\hspace{0.17em}}p\to \mathrm{\infty }.\end{array}$

Finally, we have $\begin{array}{rcl}\frac{1}{p}\left\{{\stackrel{ˆ}{\theta }}_{\left(1\right)}+{C}_{0}-log|\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{g}\left(\mathrm{\Sigma }\right)|\right\}& =& \frac{1}{p}\sum _{j=1}^{p}log{s}_{j}^{2}-log\left(2/\nu \right)-\psi \left(\nu /2\right)-\frac{1}{p}\sum _{j=1}^{p}log{\sigma }_{j}^{2}\\ & \stackrel{a.s.}{⟶}& 0\text{\hspace{0.17em}}\text{\hspace{0.17em}}as\text{\hspace{0.17em}}p\to \mathrm{\infty }.\square \end{array}$

## Acknowledgements:

Tiejun Tong’s research was supported by the National Natural Science Foundation of China grant (No. 11671338), and the Hong Kong Baptist University grants FRG2/15-16/019, FRG2/15-16/038 and FRG1/16-17/018. The authors thank the editor, the associate editor and two reviewers for their constructive comments that have led to a substantial improvement of the paper.

## References

• 1.

Kaur S, Archer KJ, Devi MG, Kriplani A, Strauss JF, Singh R. Differential gene expression in granulosa cells from polycystic ovary syndrome patients with and without insulin resistance: identification of susceptibility gene sets through network analysis. J Clin Endocrinol Metab 2012;97:E2016–E2021.

• 2.

Kuster DW, Merkus D, Kremer A, van IJcken WF, de Beer VJ, Verhoeven AJ, et al. Left ventricular remodeling in swine after myocardial infarction: a transcriptional genomics approach. Basic Res Cardiol 2011;106:1269–1281.

• 3.

Mokry M, Hatzis P, Schuijers J, Lansu N, Ruzius FP, Clevers H, et al. Integrated genome-wide analysis of transcription factor occupancy, RNA polymerase II binding and steady-state RNA levels identify differentially regulated functional gene classes. Nucleic Acids Res 2012;40:148–158.

• 4.

Richard AC, Lyons PA, Peters JE, Biasci D, Flint SM, Lee JC, et al. Comparison of gene expression microarray data with count-based RNA measurements informs microarray interpretation. BMC Genomics 2014;15:649–659.

• 5.

Schurch NJ, Schofield P, Gierliński M, Cole C, Sherstnev A, Singh V, et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use. RNA 2016;22:839–851.

• 6.

Ledoit O, Wolf M. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J Empirical Finance 2003;10:603–621.

• 7.

Fisher TJ, Sun X. Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Comput Stat Data Anal 2011;55:1909–1918.

• 8.

Bickel PJ, Levina E. Covariance regularization by thresholding. Ann Stat 2008;36:2577–2604.

• 9.

Cai T, Yuan M. Adaptive covariance matrix estimation through block thresholding. Ann Stat 2012;40:2014–2042.

• 10.

Rothman AJ. Positive definite estimators of large covariance matrices. Biometrika 2012;99:733–740.

• 11.

Cai T, Ren Z, Zhou H. Optimal rates of convergence for estimating Toeplitz covariance matrices. Probab Theo Relat Fields 2013;156:101–143.

• 12.

Chen X, Xu M, Wu WB. Covariance and precision matrix estimation for high-dimensional time series. Ann Stat 2013;41:2994–3021.

• 13.

Basu S, Michailidis G. Regularized estimation in sparse high-dimensional time series models. Ann Stat 2015;43:1535–1567.

• 14.

Tong T, Wang C, Wang Y. Estimation of variances and covariances for high-dimensional data: a selective review. WIREs Comput Stat 2014;6:255–264.

• 15.

Cai T, Ren Z, Zhou H. Estimating structured high-dimensional covariance and precision matrices: optimal rates and adaptive estimation. Electron J Stat 2016;10:1–59.

• 16.

Fan J, Liao Y, Liu H. An overview of the estimation of large covariance and precision matrices. Econometrics J 2016;19:C1–C32.

• 17.

Wilks SS. Certain generalizations in the analysis of variance. Biometrika 1932;24:471–494.

• 18.

Wilks S. Multidimensional statistical scatter. In: Anderson TW, editor. Collected papers: contributions to mathematical statistics. New York: John Wiley & Sons, 1967:597–614.

• 19.

Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika 2007;94:19–35.

• 20.

Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008;9:432–441.

• 21.

Banerjee O, El Ghaoui L, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J Mach Learn Res 2008;9:485–516.

• 22.

Witten DM, Tibshirani R. Covariance-regularized regression and classification for high dimensional problems. J R Stat Soc Ser B 2009;71:615–636.

• 23.

Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron J Stat 2011;5:935–980.

• 24.

Yin J, Li H. Adjusting for high-dimensional covariates in sparse precision matrix estimation by ℓ1-penalization. J Multivariate Anal 2013;116:365–381.

• 25.

Bishop CM. Pattern recognition and machine learning. New York: Springer, 2006.

• 26.

Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. New York: Springer, 2002.

• 27.

Rousseeuw PJ. Multivariate estimation with high breakdown point. Math Stat Appl 1985;8:283–297.

• 28.

Rousseeuw PJ, Driessen KV. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999;41:212–223.

• 29.

Ro K, Zou C, Wang Z, Yin G. Outlier detection for high-dimensional data. Biometrika 2015;102:589–599.

• 30.

Boudt K, Rousseeuw P, Vanduffel S, Verdonck T. The minimum regularized covariance determinant estimator, 2017. arXiv preprint arXiv:1701.07086.

• 31.

Anderson TW. An introduction to multivariate statistical analysis. New York: Wiley, 1984.

• 32.

Tsai CA, Chen JJ. Multivariate analysis of variance test for gene set analysis. Bioinformatics 2009;25:897–903.

• 33.

Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 2005;4:32.

• 34.

Ullah I, Jones B. Regularised MANOVA for high-dimensional data. Aust N Z J Stat 2015;57:377–389.

• 35.

Chiu TY, Leonard T, Tsui KW. The matrix-logarithmic covariance model. J Am Stat Assoc 1996;91:198–210.

• 36.

Cai T, Liang T, Zhou H. Law of log determinant of sample covariance matrix and optimal estimation of differential entropy for high-dimensional Gaussian distributions. J Multivariate Anal 2015;137:161–172.

• 37.

Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002;97:77–87.

• 38.

Bickel PJ, Levina E. Some theory of Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli 2004;10:989–1010.

• 39.

Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001;17:509–519.

• 40.

Wright GW, Simon RM. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 2003;19:2448–2455.

• 41.

Cui X, Hwang JT, Qiu J, Blades NJ, Churchill GA. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 2005;6:59–75.

• 42.

Tong T, Wang Y. Optimal shrinkage estimation of variances with applications to microarray data analysis. J Am Stat Assoc 2007;102:113–122.

• 43.

Tong T, Jang H, Wang Y. James-Stein type estimators of variances. J Multivariate Anal 2012;107:232–243.

• 44.

Warton DI. Penalized normal likelihood and ridge regularization of correlation and covariance matrices. J Am Stat Assoc 2008;103:340–349.

• 45.

Warton DI. Regularized sandwich estimators for analysis of high-dimensional data using generalized estimating equations. Biometrics 2011;67:116–123.

• 46.

El Karoui N. Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann Stat 2008;36:2717–2756.

• 47.

Rothman AJ, Levina E, Zhu J. Generalized thresholding of large covariance matrices. J Am Stat Assoc 2009;104:177–186.

• 48.

Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. Ann Stat 2009;37:42–54.

• 49.

Cai T, Liu W. Adaptive thresholding for sparse covariance matrix estimation. J Am Stat Assoc 2011;106:672–684.

• 50.

Cai T, Zhou H. Optimal rates of convergence for sparse covariance matrix estimation. Ann Stat 2012;40:2389–2420.

• 51.

Mitra R, Zhang C. Multivariate analysis of nonparametric estimates of large correlation matrices, 2014. arXiv preprint arXiv:1403.6195.

• 52.

Wang T, Berthet Q, Samworth RJ. Statistical and computational trade-offs in estimation of sparse principal components. Ann Stat 2016;44:1896–1930.

• 53.

Barry RP, Pace RK. Monte Carlo estimates of the log determinant of large sparse matrices. Linear Algebra Appl 1999;289:41–54.

• 54.

Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements (with discussion). J R Stat Soc Ser B 2013;75:603–680.

• 55.

Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics 2007;8:86–100.

• 56.

Pang H, Tong T, Zhao H. Shrinkage-based diagonal discriminant analysis and its applications in high-dimensional data. Biometrics 2009;65:1021–1029.

• 57.

Zhan F, Barlogie B, Arzoumanian V, Huang Y, Williams DR, Hollmig K, et al. Gene-expression signature of benign monoclonal gammopathy evident in multiple myeloma is linked to good prognosis. Blood 2007;109:1692–1700.

• 58.

Tong T, Feng Z, Hilton JS, Zhao H. Estimating the proportion of true null hypotheses using the pattern of observed p-values. J Appl Stat 2013;40:1949–1964.

• 59.

Fan J, Liao Y, Mincheva M. High dimensional covariance matrix estimation in approximate factor models. Ann Stat 2011;39:3320–3356.

• 60.

Boutsidis C, Drineas P, Kambadur P, Kontopoulou E-M, Zouzias A. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra Appl 2017, in press.

• 61.

Fitzsimons J, Cutajar K, Osborne M, Roberts S, Filippone M. Bayesian inference of log determinants, 2017a. arXiv preprint arXiv:1704.01445.

• 62.

Fitzsimons J, Granziol D, Cutajar K, Osborne M, Filippone M, Roberts S. Entropic trace estimates for log determinants, 2017b. arXiv preprint arXiv:1704.07223.

• 63.

Han I, Malioutov D, Shin J. Large-scale log-determinant computation through stochastic Chebyshev expansions. In: Proceedings of the 32nd International Conference on Machine Learning, 2015:908–917.

• 64.

Peng W, Wang H. Large-scale log-determinant computation via weighted l2 polynomial approximation with prior distribution of eigenvalues. In: International Conference on High Performance Computing and Applications. Springer, 2015:120–125.

• 65.

Zhang Y, Leithead WE. Approximate implementation of the logarithm of the matrix determinant in Gaussian process regression. J Stat Comput Simul 2007;77:329–348.

Accepted: 2017-08-16

Citation Information: The International Journal of Biostatistics, Volume 13, Issue 2, 20170013, ISSN (Online) 1557-4679.

© 2017 Walter de Gruyter GmbH, Berlin/Boston.