As previously stated, the objective of this work is to make use of external gene annotations to choose a model for which clusters may be meaningfully interpreted both with respect to their expression profiles and their functional properties. To do so, we propose a novel model selection criterion that highlights the association between the clusters of expression profiles and the functional annotations associated with a subset of genes. Since gene annotations are binary variables (i.e. a gene is either annotated or unannotated), it may seem natural to directly use the SICL defined in Equation (5). However, in contrast to the situation considered by Baudry et al. (2014), gene annotation information is often incomplete. More precisely, for each of the *G* annotation terms, indexed by *g*, the available information **u**^{g} is as follows:

$$u_i^g = \begin{cases} 1 & \text{if gene } i \text{ is known to be implicated in function } g, \\ 0 & \text{if gene } i \text{ is not known to be implicated in function } g. \end{cases}$$

Note that $$u_i^g = 0$$ can indicate either that information is missing (i.e. gene *i* has not yet been identified for annotation *g*) or that gene *i* is known to be unrelated to annotation *g*. As such, $$u_i^g = 0$$ does not represent a true null level of the variable, and **u**^{g} is an incomplete binary variable. For this reason, the SICL criterion is not an appropriate measure of the link between an external annotation **u**^{g} and a classification **z**, and a specific criterion must be defined to incorporate the gene annotation information into the model selection step. To this end, we propose the integrated completed annotated likelihood (ICAL) criterion as follows.

For each gene annotation **u**^{g}, we first define the random matrix **b**^{g} of latent variables indicating the allocation of the annotations among the *K* clusters:

$$b_{ik}^g = \begin{cases} 1 & \text{with probability } p_k^g \text{ if } u_i^g = 1, \\ 0 & \text{if } u_i^g = 0. \end{cases} \qquad (6)$$

Each row of the matrix **b**^{g} is a random vector following a multinomial distribution with parameters $$u_i^g$$ and $$(p_1^g, \dots, p_K^g)$$ if $$u_i^g > 0$$, and is the null vector **0** if $$u_i^g = 0$$.
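As a brief worked step (a sketch only, assuming the rows of **b**^{g} are independent; the count symbol $$m_k^g$$ is introduced here purely for illustration), the log-likelihood of **b**^{g} under this multinomial model can be written as

$$\log f(b^g; p^g) = \sum_{i=1}^{n} \sum_{k=1}^{K} b_{ik}^g \log p_k^g = \sum_{k=1}^{K} m_k^g \log p_k^g, \qquad m_k^g = \sum_{i=1}^{n} b_{ik}^g,$$

where rows with $$u_i^g = 0$$ contribute nothing because they are null vectors. Subject to $$\sum_{k} p_k^g = 1$$, this expression is maximized at $$\hat{p}_k^g = m_k^g / \sum_{j} m_j^g$$, and plugging this estimate back in gives $$\sum_{k} m_k^g \log ( m_k^g / \sum_{j} m_j^g )$$, which has the same shape as the annotation term of the ICAL criterion.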

For the sake of simplicity, we first derive ICAL when a single external annotation **b**^{1} is available. ICAL aims to select the clustering model that maximizes the logarithm of the integrated annotated likelihood:

$$f(y, z, b^1; K) = \int_{\theta_K} f(y, z, b^1; K, \theta_K)\, \pi(\theta_K)\, d\theta_K. \qquad (7)$$

As in the definition of the SICL, the variables **y** and **b**^{1} are assumed to be conditionally independent given **z**. Using Bayes' formula, we have

$$f(y, z, b^1; K, \theta_K) = f(y, z; K, \theta_K)\, f(b^1 \mid y, z; K, \theta_K).$$

Note that since **y** and **b**^{1} are assumed to be independent given **z**, the conditional distribution of **b**^{1} given **z** does not depend on **y** or the mixture parameters. Thus, as *f*(**b**^{1}|**y**, **z**; *K*, *θ*_{K})=*f*(**b**^{1}|**z**; *K*), it follows that:

$$\log f(y, z, b^1; K) = \log f(b^1 \mid z; K) + \log \int_{\theta_K} f(y, z; K, \theta_K)\, \pi(\theta_K)\, d\theta_K. \qquad (8)$$

The last term in Equation (8) can be approximated with ICL(*K*) from Equation (3), and the first term may be approximated with

$$\log f(b^1 \mid \widehat{z}; K) = \sum_{k=1}^{K} n_k^1 \log \frac{n_k^1}{n^1},$$

where $$n^1 = \text{card}\{i : u_i^1 = 1\}$$ and $$n_k^1 = \text{card}\{i : \widehat{z}_{ik} = 1 \text{ and } u_i^1 = 1\}$$. Finally, an asymptotic approximation of the expression in (7) leads to the ICAL criterion:

$$\text{ICAL}(K) = \text{ICL}(K) + \sum_{k=1}^{K} n_k^1 \log \frac{n_k^1}{n^1}.$$
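To make the annotation term concrete, the following is a minimal Python sketch of how it might be computed for a single annotation from hard cluster assignments. The function name `ical_penalty_single`, the toy labels, and the toy annotation vector are illustrative assumptions and are not taken from the authors' R package; the convention $$0 \log 0 = 0$$ is also assumed for clusters that contain no annotated genes.

```python
import numpy as np

def ical_penalty_single(labels, u, n_clusters):
    """Sum over clusters of n_k * log(n_k / n), counting annotated genes only.

    labels: hard cluster assignment of each gene (integers in 0..n_clusters-1)
    u:      binary annotation indicator for each gene (1 = annotated)
    """
    labels = np.asarray(labels)
    u = np.asarray(u)
    # n_k: number of annotated genes (u_i = 1) assigned to cluster k
    n_k = np.bincount(labels[u == 1], minlength=n_clusters).astype(float)
    n = n_k.sum()  # total number of annotated genes
    nonzero = n_k > 0  # assumed convention: 0 * log(0) = 0
    return float(np.sum(n_k[nonzero] * np.log(n_k[nonzero] / n)))

# Toy data (hypothetical values, for illustration only)
labels = np.array([0, 0, 0, 1, 1, 2])  # MAP assignments for six genes
u = np.array([1, 1, 0, 1, 0, 0])       # three annotated genes
penalty = ical_penalty_single(labels, u, n_clusters=3)
print(penalty)  # 2*log(2/3) + 1*log(1/3)
```

The term is zero when all annotated genes fall in one cluster and becomes more negative as they are spread across clusters, so adding it to ICL(*K*) favors partitions that concentrate the annotated genes.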

The generalization of this criterion to the case where *G*>1 gene annotations are available is straightforward. The aim is now to maximize the logarithm of the integrated annotated likelihood:

$$\log f(y, z, b^1, \dots, b^G; K) = \log \int_{\theta_K} f(y, z, b^1, \dots, b^G; K, \theta_K)\, \pi(\theta_K)\, d\theta_K.$$

Assuming that **b**^{1}, …, **b**^{G} and **y** are conditionally independent given **z**, we have

$$\log f(y, z, b^1, \dots, b^G; K) = \log f(b^1, \dots, b^G; z, K) + \log \int_{\theta_K} f(y, z; K, \theta_K)\, \pi(\theta_K)\, d\theta_K.$$

Assuming in addition that **b**^{1}, …, **b**^{G} are independent and that gene annotations are missing at random, we can write

$$f(b^1, \dots, b^G; z, K) = \prod_{g=1}^{G} f(b^g \mid z, K), \qquad (9)$$

leading to the generalized ICAL criterion:

$$\text{ICAL}(K) = \text{ICL}(K) + \sum_{g=1}^{G} \sum_{k=1}^{K} n_k^g \log \frac{n_k^g}{n^g}. \qquad (10)$$
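The generalized criterion in Equation (10) can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function name `ical_penalty`, the candidate partitions, and the ICL values are all hypothetical toy inputs, and $$0 \log 0$$ is taken to be 0. In practice the ICL values and MAP partitions would come from the fitted mixture models.

```python
import numpy as np

def ical_penalty(labels, U, n_clusters):
    """Annotation term of Equation (10): sum_g sum_k n_k^g * log(n_k^g / n^g)."""
    labels = np.asarray(labels)
    U = np.asarray(U)                      # shape (n_genes, G), binary
    Z = np.eye(n_clusters)[labels]         # one-hot assignments, shape (n_genes, K)
    counts = U.T @ Z                       # counts[g, k] = n_k^g
    totals = counts.sum(axis=1, keepdims=True)  # totals[g] = n^g
    # Suppress warnings from log(0) and 0/0; those entries are replaced by 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(counts > 0, counts * np.log(counts / totals), 0.0)
    return float(terms.sum())

# Toy annotations for six genes and G = 2 terms (hypothetical values)
U = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [1, 1]])

# Hypothetical candidate models: K -> (MAP labels, ICL value)
candidates = {
    2: (np.array([0, 0, 1, 1, 1, 0]), -120.0),
    3: (np.array([0, 0, 1, 1, 2, 2]), -118.5),
}

ical = {K: icl + ical_penalty(lab, U, K) for K, (lab, icl) in candidates.items()}
best_K = max(ical, key=ical.get)
print(ical, best_K)
```

In this toy setting ICL alone would prefer *K* = 3, whereas the annotation term is more negative for the three-cluster partition (it splits the annotated genes of both terms), so the ICAL value is larger for *K* = 2.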

**Comparing ICAL and SICL** If we ignore the uncertainty associated with $$u_i^g = 0$$ (i.e. that gene *i* could either be unassociated with function *g* or that this information is missing), the SICL criterion could be used to choose the model dimension *K*. In this case, using the notation from Section 2 and defining *n*_{k} as the size of cluster *k*, the SICL may be written as follows:

$$\text{SICL}(K) = \text{ICL}(K) + \text{pen}_{\text{SICL}},$$

where

$$\begin{aligned} \text{pen}_{\text{SICL}} &= \sum_{g=1}^{G} \sum_{k=1}^{K} n_{k1}^g \log \frac{n_{k1}^g}{n_k} + \sum_{g=1}^{G} \sum_{k=1}^{K} n_{k0}^g \log \frac{n_{k0}^g}{n_k} \\ &= \sum_{g=1}^{G} \sum_{k=1}^{K} n_{k1}^g \log n_{k1}^g + \sum_{g=1}^{G} \sum_{k=1}^{K} n_{k0}^g \log n_{k0}^g - G \sum_{k=1}^{K} n_k \log n_k. \end{aligned}$$

On the other hand, using the notation from Section 2 and defining $$n_{.1}^g = \sum_{k=1}^{K} n_{k1}^g$$, the ICAL may be written as follows:

$$\text{ICAL}(K) = \text{ICL}(K) + \text{pen}_{\text{ICAL}},$$

where

$$\begin{aligned} \text{pen}_{\text{ICAL}} &= \sum_{g=1}^{G} \sum_{k=1}^{K} n_{k1}^g \log \frac{n_{k1}^g}{n_{.1}^g} \\ &= \sum_{g=1}^{G} \sum_{k=1}^{K} n_{k1}^g \log n_{k1}^g - \sum_{g=1}^{G} n_{.1}^g \log n_{.1}^g. \end{aligned}$$

We note that the last term in the equation above is a constant independent of *K*. Finally, we can rewrite ICAL as a function of SICL:

$$\text{ICAL}(K) = \text{SICL}(K) - \sum_{g=1}^{G} \sum_{k=1}^{K} n_{k0}^g \log n_{k0}^g + G \sum_{k=1}^{K} n_k \log n_k + \text{constant}. \qquad (11)$$

From Equation (11), we note that the SICL takes into account both modalities (0 and 1) of the external variables **u**, whereas the ICAL discards the null modality (the $$-\sum_{g=1}^{G} \sum_{k=1}^{K} n_{k0}^g \log n_{k0}^g$$ term). Moreover, the ICAL penalizes a large number of clusters, whereas the SICL does not (the $$G \sum_{k=1}^{K} n_k \log n_k$$ term). As such, the ICAL tends to select more parsimonious models, with relatively fewer clusters, than the SICL.
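The relationship in Equation (11) can be checked numerically. The sketch below, written under the assumption that $$0 \log 0 = 0$$ and using randomly generated toy labels and annotations (not data from the paper), computes both penalties from their ratio forms and compares them through Equation (11), with the constant taken to be $$-\sum_{g} n_{.1}^g \log n_{.1}^g$$. The helper name `xlog` is hypothetical.

```python
import numpy as np

def xlog(a, b):
    """Sum of a * log(a / b) over entries with a > 0 (assumed: 0 log 0 = 0)."""
    a, b = np.broadcast_arrays(np.asarray(a, dtype=float), np.asarray(b, dtype=float))
    pos = a > 0
    return float(np.sum(a[pos] * np.log(a[pos] / b[pos])))

rng = np.random.default_rng(0)
n, K, G = 40, 4, 3
labels = rng.integers(0, K, size=n)   # toy hard assignments
U = rng.integers(0, 2, size=(n, G))   # toy binary annotations

Z = np.eye(K)[labels]                 # one-hot assignments, shape (n, K)
n_k1 = U.T @ Z                        # n_{k1}^g, shape (G, K)
n_k = Z.sum(axis=0)                   # cluster sizes n_k, shape (K,)
n_k0 = n_k - n_k1                     # n_{k0}^g, shape (G, K)
n_dot1 = n_k1.sum(axis=1)             # n_{.1}^g, shape (G,)

# Penalties in their ratio forms
pen_sicl = xlog(n_k1, n_k) + xlog(n_k0, n_k)
pen_ical = xlog(n_k1, n_dot1[:, None])

# Right-hand side of Equation (11), with ICL(K) cancelled on both sides
constant = -xlog(n_dot1, 1.0)
rhs = pen_sicl - xlog(n_k0, 1.0) + G * xlog(n_k, 1.0) + constant

print(pen_ical, rhs, np.isclose(pen_ical, rhs))
```

Because ICL(*K*) appears on both sides of Equation (11), only the two penalty terms need to be compared.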

It is also helpful to consider the behavior of the ICAL and SICL criteria in extreme conditions. If the number of clusters *K* equals 1, the ICAL penalty pen_{ICAL} equals zero whereas the SICL penalty pen_{SICL} is non-zero ($$\sum_{g=1}^{G} n_1^g \log \frac{n_1^g}{n} + \sum_{g=1}^{G} n_0^g \log \frac{n_0^g}{n}$$, where $$n_1^g$$ and $$n_0^g$$ denote the numbers of genes with $$u_i^g = 1$$ and $$u_i^g = 0$$, respectively, and *n* is the total number of genes). In contrast, if the number of clusters *K* is equal to the number of observations, with one gene per cluster, the SICL penalty pen_{SICL} equals zero whereas the ICAL penalty is non-zero ($$-\sum_{g=1}^{G} n_1^g \log n_1^g$$). In general, ICAL tends to merge clusters in order to group genes annotated for the same function, reducing the selected number of clusters *K* relative to the number selected by ICL. SICL tends to split clusters in order to obtain clusters made up only of annotated genes, increasing the selected number of clusters relative to the number selected by ICL. In other words, SICL tends to select more complex models than ICL, while ICAL tends to favor more parsimonious models than ICL. Note that this behavior of ICAL and SICL is a general trend, not a rule: ICAL does not always merge clusters and SICL does not always split them, since the clusters for different solutions are not necessarily nested in each other.
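The two extreme cases can be illustrated with a short Python sketch. The helper names `xlog` and `penalties` and the toy annotation matrix are assumptions made for illustration only, and $$0 \log 0$$ is taken to be 0. With the ICAL penalty written as $$\sum_{g} \sum_{k} n_{k1}^g \log ( n_{k1}^g / n_{.1}^g )$$, its value with one gene per cluster is the negative quantity $$-\sum_{g} n_{.1}^g \log n_{.1}^g$$.

```python
import numpy as np

def xlog(a, b):
    """Sum of a * log(a / b) over entries with a > 0 (assumed: 0 log 0 = 0)."""
    a, b = np.broadcast_arrays(np.asarray(a, dtype=float), np.asarray(b, dtype=float))
    pos = a > 0
    return float(np.sum(a[pos] * np.log(a[pos] / b[pos])))

def penalties(labels, U, K):
    """Return (pen_SICL, pen_ICAL) for hard assignments and binary annotations."""
    Z = np.eye(K)[labels]        # one-hot assignments, shape (n, K)
    n_k1 = U.T @ Z               # annotated counts per term and cluster
    n_k = Z.sum(axis=0)          # cluster sizes
    n_k0 = n_k - n_k1            # unannotated counts per term and cluster
    pen_sicl = xlog(n_k1, n_k) + xlog(n_k0, n_k)
    pen_ical = xlog(n_k1, n_k1.sum(axis=1, keepdims=True))
    return pen_sicl, pen_ical

# Toy annotations: n = 5 genes, G = 2 terms (hypothetical values)
U = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0]])
n = U.shape[0]

# K = 1: every gene in the same cluster
sicl_one, ical_one = penalties(np.zeros(n, dtype=int), U, 1)
# K = n: one gene per cluster
sicl_all, ical_all = penalties(np.arange(n), U, n)

print(sicl_one, ical_one)  # SICL penalty negative, ICAL penalty zero
print(sicl_all, ical_all)  # SICL penalty zero, ICAL penalty negative
```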

Code to implement our method is available in the R package ICAL, which may be found at the following website: https://github.com/Gallopin/ICAL.
