Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics

The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel


Introduction
Bayesian nonparametric methods have become commonplace in the statistics and machine learning literature due to their flexibility and wide applicability. For model-based clustering, Dirichlet process (Ferguson, 1973(Ferguson, , 1974 mixture models have become particularly popular (Antoniak, 1974, Escobar, 1994, Escobar and West, 1995, Lo, 1984, partly because they allow the number of clusters supported by the data to be inferred. By introducing latent selection indicators, these models can be extended to perform variable selection for clustering (Kim et al., 2006), which is particularly relevant in highdimensional settings (Constantinopoulos et al., 2006, Law et al., 2004. There are now several approaches for model-based clustering and variable selection (see Fop and Murphy, 2017, for a recent review), but current Markov chain Monte Carlo (MCMC) algorithms for Bayesian inference in Dirichlet process (DP) mixture models (e.g. Neal, 2004, Neal, 2000) are computationally costly, and often infeasible for large datasets.
Algorithms for fast approximate inference in DP mixture models, such as the use of fast search algorithms (Daume III, 2007), Bayesian hierarchical clustering (Cooke et al., 2011, Darkins et al., 2013, Heller and Ghahramani, 2005, Savage et al., 2009, or the sequential updating and greedy search (SUGS) algorithm (Wang andDunson, 2011, Zhang et al., 2014), make possible the analysis of datasets with large numbers of observations. However, without variable selection such algorithms may be ill-suited to the high-dimensional setting. In the spirit of the original SUGS algorithm, here we pose clustering and variable selection as a Bayesian model selection (BMS) problem. We consider variable selection for clustering in terms of partitioning variables into those which are relevant and those which are irrelevant for defining the clustering structure, and thereby pose the problem as one of using BMS to select both a partition of the variables and a partition of the observations. We moreover consider the benefits of performing Bayesian model averaging (BMA) (Hoeting et al., 1999, Madigan andRaftery, 1994) for summarising the SUGS output. For ease of exposition, we focus on the case of DP Gaussian mixtures, but note that all of our methods extend straightforwardly to other distributions for which conjugate priors may be chosen.
We consider a range of simulation settings and well-studied examples from cancer transcriptomics to show that our methods perform competitively with the current state-of-the-art. Having established the utility of our approach, we consider an application to reverse-phase protein arrays (RPPA) datasets in order to characterise the pan-cancer functional proteome. Such datasets have the potential to provide a deeper understanding of the biomolecular processes at work in cancer cells, and have previously been shown to offer additional insights beyond what may be captured by genomics or transcriptomics datasets (Akbani et al., 2014). Here we consider RPPA data for 5,157 tumour samples obtained from The Cancer Genome Atlas (TCGA).
Section 2 recaps DP mixture models and the SUGS algorithm, then describes our extensions to SUGS including variable selection and BMA. Section 3 evaluates our method on simulated datasets and compares it with other approaches to clustering and variable selection. We then apply our method to a large proteomics dataset, highlighting its applicability. In the final section, we make some concluding remarks and discuss limitations and extensions. Our methods are implemented in an R package: https://github.com/ococrook/sugsvarsel.

Dirichlet process mixtures
We provide a very brief recap of DP mixture models, mainly to introduce notation, and refer to the overview provided in Section 3 of Teh et al. (2006) for further details. Let G ∼ DP(βP 0 ) where β > 0 is the DP concentration parameter, P 0 is the base probability measure, and G is a random probability measure. We consider a Pólya urn scheme in which we have independent and identically distributed (i.i.d.) random variables θ 1 , θ 2 , ... distributed according to G. Computing the sequential conditional distributions of θ i given θ 1 , ..., θ i−1 , upon marginalising out the random G, we obtain (Blackwell and MacQueen, 1973): where δ θ is a probability measure with mass concentrated at θ. It is clear from this equation that for any r = 1, 2, . . . , i − 1, the probability that θ i is equal to θ r is given by i−1 l=1 I(θ l = θ r )/(β + i − 1), where I(X) = 1 if X is true and I(X) = 0 otherwise. Thus θ i has non-zero probability to be equal to one of the previous draws, and it is this clustering property that makes the DP a suitable prior for mixture models.
The DP mixture model is obtained by introducing an additional parametric probability distribution, F. More precisely, let observations x i be modelled according to the following hierarchical model: where F denotes the conditional distribution of the observation x i given θ i . For example, when F is chosen to be a Gaussian distribution we arrive at the DP Gaussian mixture model (also referred to as the infinite Gaussian mixture model; Rasmussen, 2000).
When performing inference for such models, it is common to introduce a set of latent variables (cluster labels) z 1 , ..., z n associated with the observations, such that z i is the cluster label for observation x i . From the above specification of the DP mixture model, it follows that the conditional prior distribution of z i given z −i = (z 1 , ..., z i−1 ) is categorical with: where β > 0 is the DP concentration parameter, n k := i−1 l=1 I(z l = k) is the number of previous observations allocated to cluster k, and K = max{z −i } + 1. Larger values of β encourage observations to be allocated to new clusters, hence β plays a role in controlling the number of clusters.
Inference for DP mixture models can performed using computationally intensive MCMC methods Neal, 2004, Neal, 2000). However, as we discuss, here we are interested in the SUGS algorithm for approximate inference, proposed by Wang and Dunson (2011).

Sequential Updating and Greedy Search (SUGS)
SUGS is a sequential approach for allocating observations to clusters, which (greedily) allocates the i-th observation to a cluster, given the allocations of the previous i −1 observations. Suppose that observations x −i = (x 1 , ..., x i−1 ) have previously been allocated to clusters. As described in Wang and Dunson (2011), the posterior probability of allocating observation i to cluster k according to the DP mixture model formulation above is given by: where π ik is defined as in Equation (3), and is the conditional marginal likelihood associated with x i given allocation to cluster k and the cluster allocations for observations 1, ..., i − 1, with f (x i |θ k ) denoting the likelihood associated with x i as a function of θ k . If k is a cluster to which previous observations have already been allocated, then p(θ k |x −i , z −i ) is the posterior distribution of θ k given the observations previously allocated to cluster k; i.e.
is the prior on the cluster-specific parameters, θ k . For a new cluster, i.e. for k = K, we have p(θ k |x −i , z −i ) = p 0 (θ k ). If p 0 is taken to be conjugate for the likelihood f , then the posterior and conditional marginal likelihood are available analytically. Assuming that the concentration parameter β is given and that conjugate priors are taken, the above suggests a computationally efficient deterministic clustering algorithm (the SUGS algorithm). That is, z 1 is initialised as z 1 = 1, and then subsequent observations are sequentially allocated to clusters by setting z i = arg max k ∈ {1,...,K } P(z i = k |x i , x −i , z −i , β), where we recall that K = max{z −i } + 1 may change after each sequential allocation.

Dealing with unknown β
The DP concentration parameter β directly influences the number of clusters, thus we treat this as a random variable to be inferred, in the same way as in Wang and Dunson (2011). In particular, let β = (β 1 , ...,β L ) be a discrete grid of permissible values for β with a large range, and then define the prior for β to be discrete with the following form: where κ l = p(β =β l ). Further defining φ (i−1) , the β parameter may be marginalised in Equation (4) to obtain: where π ikl := p(z i = k | β =β l , z −i ) is given by Equation (3); φ (0) l = κ l ; and: may be calculated sequentially for i = 1, . . . , n. The SUGS algorithm for allocating observations to clusters when β is unknown is then as presented in Algorithm 1.
Algorithm 1: The SUGS algorithm, when the DP precision parameter β is allowed to be unknown.

Formulation of Bayesian Model Selection problem
A notable limitation of the (deterministic) SUGS algorithm as presented so far is that the clustering structure obtained is dependent upon the initial ordering of the observations. To remove this dependence, Wang and Dunson (2011) consider multiple permutations of this ordering, and pose SUGS as a Bayesian model selection (BMS) problem. More concretely, the algorithm is repeated for many random orderings of the data and a final partition of the observations is then chosen by optimising an appropriate objective function for BMS, such as the marginal likelihood (ML): In practice, Wang and Dunson (2011) advocate optimising the pseudo-marginal likelihood (PML), since they found that the marginal likelihood to often produce many small clusters. The PML is given by: where, defining X = {x 1 , ..., x n } and Z = {z 1 , ..., z n }, we have X n\−i = X\{x i } is the set of all observations except the i th , and similarly z n\−i = Z\{z i }. In addition, Wang and Dunson (2011) remark that that p(x i |X, Z) can be used to approximate p(x i |X n\−i , z n\−i ) to speed up computations and that this approximation is accurate for large sample sizes.

SUGS for variable selection
Irrelevant variables in high-dimensions can present a considerable challenge for clustering models and algorithms, because the number of variables with no clustering structure can overwhelm those where a clustering structure exists (Witten and Tibshirani, 2010). There have been many approaches to modelbased clustering and variable selection (e.g. Maugis et al., 2009, Raftery andDean, 2006), and we direct readers to Fop and Murphy (2017) for a recent review. However, many of these scale poorly with increasing dataset dimension, and/or require the number of clusters to be determined as a separate analysis step. To address these challenges, here we extend the SUGS algorithm to simultaneously perform clustering and variable selection, and refer to the resulting procedure as SUGSVarSel.
Since we are in the high-dimensional setting, we assume for simplicity that variables are independent given the cluster allocations (which, in the Gaussian case, is equivalent to assuming a diagonal structure for the covariance matrix). Let x i,d be the d th element of the i th observation vector, with d = 1, . . . , D, and D the number of variables. Introducing indicator variables γ d , which is 1 if the d th variable is relevant for the clustering structure and 0 if not, we follow a common approach from the literature (Kim et al., 2006, Law et al., 2004, Tadesse et al., 2005 and assume that the cluster conditional likelihood can be factorised as follows: where θ 0 are "global" (i.e. not cluster-specific) parameters. In other words, the variables for which γ d = 1 are modelled by a mixture distribution with cluster-specific parameters θ k,d , while the variables for which γ d = 0 are modelled by a single component with (global, not cluster-specific) parameters θ 0,d . Having introduced the D indicator variables γ d , we now extend the SUGS algorithm in order to estimate them.

The SUGSVarSel algorithm
Given a realisation of the indicator variables, Γ = {γ 1 , . . . , γ D }, we may plug the cluster conditional likelihood given in Equation (11) into Equation (5) and proceed as before in order to identify a clustering, Z.
Conversely, suppose we have a realisation, Z, of the set of component allocation variables, but that the indicator variables Γ are unknown. In this case, the posterior probabilities associated with the variable indicators are given by: where p 0 (γ d = q) indicates the prior probability that γ d = q, and B is a normalising constant that ensures that p(γ d = 0|X, Z) and p(γ d = 1|X, Z) sum to 1. Thus, given a realisation, Z, of the set of component allocation variables, a greedy approach to finding γ d is to set γ d = arg max q ∈ {0,1} P(γ d = q|X, Z).
Given an initial realisation of the indicator variables, Γ = Γ (0) , the above suggests an iterative strategy in which at each iteration we use the SUGS algorithm to find a partition Z (t) given Γ (t−1) , and then greedily update the indicator variables according to Equations (12) and (13) above in order to obtain Γ (t) given Z (t) . This algorithm, which we refer to as SUGSVarSel, is presented in Algorithm 2.

Initialisation strategies for SUGSVarSel
Like the SUGS algorithm, the output of SUGSVarSel depends upon the initial ordering of the observations. It moreover depends upon the initialisation of the variable selection switches, Γ (0) . To address this latter issue, we propose a random sub-sampling initialisation strategy. This is as follows: first randomly select p 1 variables (with 1 < p 1 ≤ D) and apply SUGSVarSel on this new datasetX of size n × p 1 with a small number of random orderings of the observations (we find 10 works in practice). The initial indicator for the variables ofX, which we write asΓ (0) , are set as all-on (γ d = 1 for these p 1 variables).Γ (0) is held the same for each of the random orderings. For each of the random orderings, this approach outputsZ for all observations butΓ for only a subset of size p 1 of the variables. To obtain Γ for all D variables, we use the cluster allocationsZ and the full data X to compute probabilities for the remaining variables using equations 12 and 13. We then greedily assign the indicator variables. A single best model generated by these random orderings is selected using the ML. This procedure returns a Γ 1 ∈ {0, 1} D ; that is, variable selection switches with some variables switched on and other variables switched off. We repeat this process for a total of M random sub-samples of the variables to produce a set of clusterings Z 1 , ..., Z M and a set of variables Γ 1 , .., Γ M . These variable sets are then used as initial inputs Γ (0) = Γ i for i = 1, ..., M for the SUGSVarSel algorithm (which is now run using all variables p = D) with Q new random orderings (again we find 10 is sufficient in practice). This SUGSVarSel with sub-sampling initialisation strategy returns Q models for each random sub-sample of the variables. Thus, we have QM models from which to choose.
For each model obtained in this way, we calculate the marginal likelihood (see Section 2.2.2). We can then perform BMS to obtain a single "best" model, or we can use Bayesian model averaging (BMA; see next section).

Bayesian model averaging
The output of our algorithm is a set of clusterings, associated variables and a marginal likelihood. One can select a single "best" model amongst these possible clustering, however we can also average over these models to capture the model uncertainty. The idea is called Bayesian model averaging (BMA) and we apply the method to clustering and variable selection (Hoeting et al., 1999, Madigan and Raftery, 1994, Russell et al., 2015. For each model we form a co-clustering matrix S. S is defined in the following way: That is the i j th entry of S is 1 if observation x i and x j are in the same cluster and 0 otherwise. We note that the S is invariant to relabelling and the number of clusters. Now, suppose we have M models M 1 , ..., M M , letting X be our observations and θ m be the parameters associated with model M m . The posterior probability for M m is given by where The marginal likelihood (16) is the key quantity for model comparison and can be interpreted as the weight given to each proposed model. Further note the two sources of averaging: the averaging over the parameters in the ML and the averaging over the models in equation (15). We suppose that a priori all models are equally likely, choosing the prior on each model to be p 0 (M m ) = 1/M. One computational challenge that (15) gives us is computing the summation, since it can involve evaluating possibly thousands of models. To overcome this, one can discount models that are poor at describing our observations comparatively to our best model. More precisely, let us form Occam's window (Hoeting et al., 1999): where K is a tuning parameter. Occam's window is the set of all possible models within a reasonable Bayes factor from the best model under consideration. The summation in (15) is then replaced with a summation over the set W.

Averaging the co-clustering matrices
We can form the Bayesian model-averaged co-clustering matrix (BMAC) by taking the set of co-clustering matrices S W and averaging, weighting by their ML: The BMA of the variable set can be found in the same way by averaging over the weighted variable sets for each model: where we denote by F m the variable set associated with model M m .
3 Comparisons with the state-of-the-art We compare SUGSVarSel to a number of alternative algorithms, and demonstrate the performance of our method in two situations. The first is the p > n paradigm, where the number of variables exceeds the number of observations. The second situation considers n > p for n = 1000, while simultaneously considering different proportions of variables being relevant. In both cases, we consider a variety of scenarios, for which different proportions of the variables are relevant.

Alternative methods for clustering and variable selection
We compare our method relative to the current state-of-the-art, including methods that do and do not peform variable selection. These include: mclust, a finite mixture model based clustering method (Fraley and Raftery, 2002, Fraley et al., 2012, Scrucca et al., 2016; clustvarsel, a finite mixture model method with variable selection (Maugis et al., 2009, Raftery and Dean, 2006, Scrucca and Raftery, 2014; the original sequential updating and greedy search algorithm (Wang and Dunson, 2011) as implemented in our sugsvarsel R package; and VarSelLCM, a model-based clustering and variable selection approach using the integrated complete-data likelihood (Marbac and Sedki, 2017).
When running SUGS and SUGSVarSel we use the same prior specification for both methods and 30 random orderings of the data. Throughout this article, we always perform 2 iterations of variable selection in the SUGSVarSel algorithm. To initialise variable selection in SUGSVarSel, we subsample 10% of the variables 20 times to produce an initial variable selection set. For SUGS we choose the partition with maximal PML (as advised in the original SUGS paper by Wang and Dunson 2011), while for SUGSVarSel we select the result with maximal ML. Prior choices for SUGS and SUGSVarSel can be found in the appendix. For mclust and clustvarsel, we find the appropriate number of clusters using a sequential search up to a maximum of 9 possible clusters. We then use then Bayesian Information Criterion (BIC) to select an appropriate model (Schwarz, 1978). For VarSelLCM we run the algorithm up to a maximum of 9 possible clusters and select an appropriate model using the Maximum Integrated Complete-data Likelihood (MICL) Sedki, 2017, 2018). All methods are run in serial for fair comparison.
Results are presented in Tables 1 -4. In all tables, we provide runtimes for each of the methods, indicate the proportion of relevant and irrelevant variables that each method correctly identified (for methods without variable selection this is reported as 1 for relevant and 0 for irrelevant variables), and report the adjusted Rand index (Hubert andArabie, 1985, Rand, 1971) between the clustering produced and the truth. We repeat all methods for 10 different random realisation of the datasets to produce a distribution of scores. We report the median scores, along with the upper and lower quartiles.   It is evident that methods that do not perform variable selection such as mclust and SUGS perform poorly when there are many irrelevant variables. The performance of clustvarsel here seems volatile and performs poorly at correctly selecting relevant features. VarSelLCM and SUGSVarSel are competitive in terms variable selection and clustering. However, VarSelLCM requires an exhaustive search over the number of clusters, which makes this method computationally costly to apply when the number of clusters is not known. SUGSVarSel outperforms all variable selection and clustering methods in terms of speed, while also automatically inferring the number of clusters in the data. We proceed to evaluate the performance of SUGSVarSel on large simulated datasets.

Increasing the number of observations
We simulate the same distribution as before, but instead sample 1000 observations and only 100 variables and the irrelevant variable are simulated from a standard Gaussian distribution. All priors are the same as in the previous analysis and we sub-sample 10% of the variables 10 times to produce an initial variable selection set. We repeat SUGS and SUGSVarSel for 10 random orderings of the data. We compare the scalable methods mclust, SUGS, SUGSVarSel and VarSelLCM, where 25%, 10%, 5% of the variable are relevant. For SUGS we choose the partition with maximal PML, while for SUGSVarSel we select the result with maximal ML. For VarSelLCM we run the algorithm for possible number of clusters 1 through 4 and select an appropriate model using the MICL, as previously. Results are presented in Tables 5-7.   SUGSVarSel and VarSelLCM produce high quality answers in all situations but SUGSVarSel is 2 orders of magnitude faster. However, to alleviate the computational burden we searched up to a maximum of 4 clusters in VarSelLCM, providing it with an easier opportunity to produce high quality clusterings. In applications to real data this would have to be much larger, adding considerably to computational time, whereas the inference of the number of clusters is automatic in SUGSVarSel.

Advantages of Bayesian model averaging
As an example, we simulate a dataset with 30 observations from a mixture of 3 Gaussians, where two of the Gaussians are isotropic and centred (2,2) and (-3,-3) respectively, each with mixing weights 0.4. The third component has mixture weight 0.2 and is centered at (-3,4) but the covariance matrix is 2 on the diagonals and 1 on the off diagonals, violating our independence assumption. We additionally include 2 components of irrelevant variables generated from standard Gaussians. Our prior specifications are set as in the previous section. Simply using the ML to pick a partition results in an ARI of 0.635 between the clustering produced and the truth. However, we can also perform BMA and then summarise our co-clustering. We applied hierarchical clustering with average linkage to compute a clustering, which has previously be applied to posterior similarity matrices (Fritsch and Ickstadt, 2009, Liverani et al., 2015, Medvedovic et al., 2004) (see appendix for complete details). This clustering then produces an ARI of 0.875. The heatmap of the co-clustering matrix is Figure 1, providing a visualisation of the uncertainty in the clustering. 4 Applications to cancer subtyping

Application to Leukaemia Dataset
In this section, we apply SUGSVarSel to real biological datasets. The first is a well-studied genomic clustering problem: the separation of acute myeloid leukaemia (AML) and the B/T-cell subtypes of acute lymphoblastic leukemia (ALL) samples on the basis of microarray transcriptomic data. We use the dataset described by Golub et al. (1999), which comprises 38 samples, 27 of which are ALL (8 T-cell and 19 B-cell related), and 11 of which are AML cases. Initial preprocessing is performed as in Dudoit et al. (2002), which reduces the dimension of the dataset from 6,817 to 3,051 genes. In Dudoit et al. (2002), a further dimension reduction step is performed that makes use of the AML and ALL class labels, so that only those genes that have a high ratio of their between-class to within-class sums of squares are retained. Here we instead wish to adopt a completely unsupervised approach, so that we may use the known ALL-AML class label in order to validate our results. We select the 200 most variable genes and then normalise, so the expression values for each gene are mean-centred at 0 with variance 1. 200 genes were chosen because this led to good predictive performance in previous analysis of these data (Dudoit et al., 2002, Golub et al., 1999. We then apply SUGSVarSel to the resultant dataset. We sub-sample 10% of the variables 20 times to produce an initial variable selection set, and run the algorithm for 100 random orderings. We adopt our default priors and summarise the output using BMA. A final summary clustering is obtained by performing hierarchical clustering with average linkage (Fritsch and Ickstadt, 2009). We use the ARI to compare our results to the truth (of 3 classes) and repeat the process 10 times and report the average results.
Results are illustrated in Figure 2. The final clustering result provides an ARI of 0.831, which is in line with previous analyses preformed on this dataset (Dudoit et al., 2002, Golub et al., 1999. The algorithm selects a total of 92 genes, including TCL1, TCRB, IL8, EPB72, IL7R, TCRG, NFIL6, which are all known to be associated with leukaemia (Chen et al., 2010, Kuett et al., 2015, Natsuka et al., 1992, Pekarsky et al., 2001, Shochat et al., 2011, Van der Velden et al., 2004. A full list of the selected genes (including their descriptions) can be found in the appendix. The advantage of our analysis over other methods is that we did not need to specify the number clusters -the algorithm automatically inferred 3 clusters in the data, which have excellent correspondence to the known classes of AML and ALL, as well as the 2 ALL subgroups. To assess the importance of variable selection, we also apply mclust and the original SUGS algorithm to the data. We run the mclust algorithm performing a systematic search to select the number of clusters, up to a maximum of 9, and select the number of cluster which maximises the BIC. This criterion selects 3 clusters and clustering produced gives an adjusted Rand index of 0.627 -the inclusion of irrelevant variables has led to reduced cluster quality. We run SUGS using our default prior choices and using the PML criterion to select a clustering. The algorithm was run for 100 random ordering and we repeated the process 10 times, reporting an average ARI of 0. The lack of variable selection renders SUGS unable to produce a meaningful clustering. In Figure 3, we visualise the BMA co-clustering matrix for these data when applying the SUGSVarSel algorithm.

Application to TCGA breast cancer dataset
We demonstrate SUGSVarSel on a further genomics dataset. We analyse an expression dataset for breast cancer tumour data from The Cancer Genome Atlas (TCGA) (Network, 2012), which we pre-process in the same way as in Lock and Dunson (2013). The processed expression dataset comprises 348 tumours with 645 genes, of which 14 belong to the PAM50 (Prediction Analysis of Microarray) group of genes (Parker et al., 2009). Analysis was performed in the following way. We first standardise our data so that each column is mean-centred with variance 1. We then subsample 10% of the variables 64 times to produce an initial variable set. We then apply the SUGSVarSel algorithm with default settings. We summarise our output by performing BMA and then hierarchical clustering with average linkage.
SUGSVarSel reveals two clusters in the dataset, the second of which is significantly associated with Basal-like tumours (Fisher test, p < 0.0001). The algorithm selects 245 variables to discriminate between the groups. We perform PCA before and after variable selection to demonstrate that the reduced variable set produces more separable and therefore more interpretable clusters (Figure 4).  Furthermore, the algorithm selected 13 out of a total of 14 of the PAM50 genes, which is significantly better than random (Fisher Test, p < 0.0001).
There is perhaps concern that variable selection could remove relevant genes for clustering, in the situation where we have a highly informative set of variables. We consider the following task to cluster the breast cancer genes using the PAM50 genes from the total unprocessed dataset (that is without the filtering of Lock and Dunson (2013)), of which there are 48. We apply the SUGSVarSel algorithm in identical fashion to before, sub-sampling 10% of the variables 4 time to produce an initial variable set. We obtain 5 clusters which correspond well to the different breast cancer subgroups. Cluster A is associated with Luminal A cancers, cluster B is associated with Luminal cancers, cluster C with basal-like tumours, cluster D contains mostly HER2 type breast cancers (chi-squared p < 0.0001). Thus, hardly surprisingly, the cluster produce on the PAM50 data coincide well with the PAM50 subgroups. Furthermore, 87.5% of the genes were selected which is more than we expect given our prior, telling us this was a highly informative set of genes.
The clusterings shown in Figures 4 and 5 demonstrate that the variables we use for clustering are critically important. The two different pre-filtering choices led to results of varying quality and biological meaning. This is strong evidence in support of model-based variable selection rather than ad-hoc preprocessing.

Pan-cancer proteomic characterisation
In this section we apply our method to The Cancer Proteome Atlas (TCPA) datasets (Akbani et al., 2014, Li et al., 2013, Städler et al., 2017. The dataset contains a large number of tumours and cell line samples with protein expression levels generated using reverse-phase protein arrays (RPPAs). Our method allows us to perform a number of tasks on this data; in particular, for each cancer we can detect possible subgroups and the relevant proteins which discriminate these subgroups. We can also perform a pan-cancer analysis to explore the differences and similarities between cancers. Pan-cancer studies can unravel inter-cancer relationships which are important for developing new clinical targets (Berger et al., 2018, Hoadley et al., 2018, Uhlen et al., 2017, Weinstein et al., 2013. Recent pan-cancer analyses have suggested that cancers should be classified based on their molecular signatures rather than tissue of origin (Berger et al., 2018, Hoadley et al., 2018 and this motivates our analysis.
As is usual with this data there are irrelevant variables so methods that do not perform variable se-lection such as mclust and SUGS are ill-suited. Furthermore, there is little a priori knowledge about the number of clusters and so methods such as VarSelLCM and clustvarsel which require an exhaustive search of the number of clusters are inappropriate. To perform the analysis on all cancer sets would be prohibitively slow for the slowest of analysis methods.
The TCPA datasets contain data on 19 cancer types and the description of these cancers can be found in the appendix. The total dataset consists of over 5000 tumour samples with only a few samples for some cancers and hundreds of samples for others and several hundreds of proteins. The merged PAN-Can 19 level 4 dataset is used in the following analysis, since it is appropriate for multiple disease analysis. More information about the data can be found here http://tcpaportal.org/tcpa/, where the data itself can also be downloaded. In addition, we standardise the expression levels for each protein so that they are zero-centred with unit variance. Table 8 demonstrates the number of cases for each cancer type: We only keep proteins which have been measured on all cancers, which total 217 and so our dataset has a total of 5157 tumour samples with 217 variables. We apply SUGSVarSel to this data by first subsampling 10% of the variables 43 (a fifth of the total number of variables) times. Using the same priors as in previous analysis we analyses this data using the SUGSVarSel algorithm, running the algorithm for 50 random orderings, thus exploring a total of 2150 models. We summarise the BMA clustering using hierarchical clustering with average linkage. The summarised clustering contains 60 clusters, however many of these clusters contain only a few observations. Reassuringly there are 18 clusters with more than 20 observations and we focus on these for our analysis. A table summarising the clusters can be found in the appendix. Figure 6 summarises the relationship between the cancer types and SUGSVarSel clusterings. In addition, in Figure 7 we plot a heatmap of the data with the clustering identifed by SUGSVarSel, using only the proteins selected by the algorithm. It is rare that a cancer associates with a single cluster, however there are evident relationships between cancers and clusters. Cluster A contains predominately womens' cancers (OV, UCEC, BRCA), while cluster B contains a large spread of cancers. Clusters C, E and F contain the cancers of the digestive tract (STAD, COAD and READ). Cluster D contains a subgroups of breast cancers (BRCA), while cluster G contains solely kidney cancer (KIRC). Clusters H and I contain cancers of the brain (LGG, GBM). Cluster J and P contain aero-digestive cancers (HNSC, LUAD LUSC). Thyroid cancer (THCA) is spread across clusters K, L and B, whilst KIRP is predominately found in cluster M. Pancreatic cancer (PAAD) is split across clusters N and B. Cluster O contains the majority of breast cancer patients. Prostate cancer (PRAD) is dominantly found in Q, while R forms a small cluster of stomach cancers. This is in line with other analyses performed on these data (Akbani et al., 2014, Hoadley et al., 2014, Şenbabaoglu et al., 2016. A total of 147 proteins were selected as relevant for clustering. We now consider an illustrative example. Figure 6 indicates that clusters K and L contain only thyroid cancers. It is of biological interest to see what drives the differences between these clusters, as they could define clinically relevant thyroid cancer subtypes. Considering only the 147 selected proteins, in Figure 8 we plot the expression profile for the 20 proteins, with smallest p-value, which are significantly different between clusters K and L (T-test (Welch, 1947), p < 0.00001, using Benjamini-Hochberg correction (Benjamini and Hochberg, 1995 We do not observe an over representation of known thyroid cancers subtypes within each of these clusters (see Table 9).

Conclusion
In this article we presented SUGSVarSel, an extension to the SUGS algorithm of Wang and Dunson (2011) to allow variable selection. We demonstrated that when irrelevant variables are present the quality of the clustering can be degraded and clusters become more challenging to interpret. SUGSVarSel allows the flexibility of a Bayesian nonparametric approach but inference is considerably faster than using MCMC. Indeed, the SUGSVarSel algorithm infers the number of clusters automatically and performs inference for the Dirichlet process hyperparameter. This is in contrast to most clustering with variable selection methods which require a systematic search over the number of clusters.
Whilst our method is approximate it performs competitively with other commonly used approaches. Furthermore, we take advantage of exploring many models by performing Bayesian model averaging, which is important for exploring uncertainty in our clustering. We remark that model uncertainty and the application of BMA is rarely explored in clustering tasks. We have provided an R package to facilitate dissemination of our method utilising C++ to accelerate intensive computations and parallel processing features to make further computational gains Application to two cancer transcriptomic datasets show the clear benefit of simultaneously performing variable selection and clustering. We demonstrate that variable selection improves interpretation of these datasets, providing the genes that drive the clustering structure of the data, as well as identifying those that are irrelevant for clustering. We further applied our method to a pan-cancer proteomic dataset for which none of the current model-based clustering and variable selection methods are suitable. SUGSVarSel is able to provide a characterisation of 5, 157 tumour samples, demonstrating clustering relationships across cancer types based on their molecular signature rather than tissue of origin.
There are a number of ways in which our proposed method could be extended. Firstly, our assumption that variables are conditionally independent given the cluster allocations might be unrealistic for some datasets. In such cases, more elaborate variable selection methods might be desirable, although this is likely to come at increased computational cost. Furthermore, we have assumed conjugacy throughout, so that the marginal likelihood in Equation (5) may be evaluated analytically. As noted in the original SUGS paper of Wang and Dunson (2011), one possible way to extend to non-conjugate cases would be to approximate this marginal likelihood, e.g. using a Laplace approximation.
where the parameter updates are obtained sequentially, through the following equations (dropping the subscript d for clarity) (Murphy, 2007): where in the case i = 0 the parameters are given by their specified prior values, except we set T (0) k = ν 0 S 0 +λ 0 µ 2 0 . The required conditional likelihood is given by a non-central T -distribution. Remembering at the i th iteration we compute with the updated parameters from the previous iteration; that is, at the (i − 1) th iteration, the required distribution is For a N I χ 2 prior the marginal likelihood is given by the following equation, (dropping the subscript d for clarity) ∫ θ k f (X k |θ k )p 0 (θ k )dθ k = 1 π n k /2 Γ(ν k /2) Γ(ν 0 /2) The other required equations have already been given and require simple substitutions.
Prior Settings for the SUGS and SUGSVarSel high-dimensional example Here we state the prior specification for the SUGS and SUGSVarSel algorithms. We let µ 0 be the mean of the observations' data for each variable, λ 0 = 0.01, ν 0 be the number of variables, S 0 = 0.2 for all variables. We letβ = (0.01, 0.1, 1, 5, 10, 15, 30, 50, 100) T and set the prior to be G(1, 1). In addition, for SUGSVarSel we suppose that a priori variables are equally likely to be relevant or irrelevant.
Summarising the Bayesian model averaged co-clustering matricies Fritsch and Ickstadt (2009) propose a method to summarise the posterior similarity matrix of a Bayesian clustering method. We apply their methodology to summarise our Bayesian model averaged co-clustering matrix. They present several method to obtain a clustering by maximising the posterior expected adjusted Rand index. We use the proposed method which obtains clusterings from applying hierarchical clustering with average linkage. An optimal clustering is then obtain by cutting the dendrogram at 0.5 (Fritsch and Ickstadt, 2009).