Statistical Applications in Genetics and Molecular Biology
Including Probe-Level Measurement Error in Robust Mixture Clustering of Replicated Microarray Gene Expression
Probabilistic mixture models provide a popular approach to cluster noisy gene expression data for exploring gene function. Since gene expression data obtained from microarray experiments are often associated with significant sources of technical and biological noise, replicated experiments are typically used to deal with data variability, and internal replication (e.g. from multiple probes per gene in an experiment) provides valuable information about technical sources of noise. However, current implementations of mixture models either do not consider the correlation between the replicated measurements for the same experimental condition, or ignore the probe-level measurement error, and thus overlook the rich information about technical noise. Moreover, most current methods use non-robust Gaussian components to describe the data, and these methods are therefore sensitive to non-Gaussian clusters and outliers. In many cases, this will lead to over-estimation of the number of model components as multiple Gaussian components are used to fit a non-Gaussian cluster. We propose a robust Student's t-mixture model, which explicitly handles replicated gene expression data, includes the consideration of probe-level measurement error when available and automatically selects the appropriate number of model components using a minimum message length criterion. We apply the model to gene expression data using probe-level measurements from an Affymetrix probe-level model, multi-mgMOS, which provides uncertainty estimates. The proposed Student's t-mixture model shows robust performance on synthetic data sets with realistic noise characteristics in comparison to a standard Gaussian mixture model and two other previously published methods. 
We also compare performance with these methods on two yeast time-course data sets and show that the new method obtains more biologically meaningful clusters in terms of enrichment statistics for GO categories and interactions between transcription factors and genes. Automatically selecting the number of components is more computationally efficient than using a model selection approach and allows the methods to be applied to larger data sets.
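As a rough illustration of why Student's t components are less sensitive to outliers than Gaussian ones, the following minimal Python sketch fits a one-dimensional t-mixture by EM. It is not the model proposed here: it omits the replicate structure, the probe-level measurement error and the minimum message length selection, and it fixes both the number of components and the degrees of freedom. All names (`t_logpdf`, `fit_t_mixture`, `nu`) and the toy data are illustrative assumptions.

```python
import math
import numpy as np


def t_logpdf(x, mu, sigma2, nu):
    """Log-density of a 1-D Student's t with location mu, scale sigma2, dof nu."""
    d2 = (x - mu) ** 2 / sigma2
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi * sigma2))
    return const - 0.5 * (nu + 1) * np.log1p(d2 / nu)


def fit_t_mixture(x, mu_init, nu=3.0, n_iter=200):
    """EM for a 1-D t-mixture with fixed dof (illustrative sketch only)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu_init, dtype=float).copy()
    k = mu.size
    sigma2 = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r and per-point robustness weights u.
        logp = np.stack(
            [np.log(pi[j]) + t_logpdf(x, mu[j], sigma2[j], nu) for j in range(k)],
            axis=1,
        )
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        d2 = (x[:, None] - mu[None, :]) ** 2 / sigma2[None, :]
        u = (nu + 1.0) / (nu + d2)  # small for points far from a component
        # M-step: weighted updates; outliers contribute little because u is small.
        w = r * u
        pi = r.mean(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        sigma2 = (w * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / r.sum(axis=0)
    return pi, mu, sigma2, u


# Toy data: two clusters plus two gross outliers (assumed values, not real data).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 100),
                    rng.normal(3.0, 0.5, 100),
                    [40.0, -35.0]])
pi, mu, sigma2, u = fit_t_mixture(x, mu_init=[-1.0, 1.0])
print("mixing proportions:", pi)
print("component locations:", mu)
print("weights of the two outliers:", u[-2:].max(axis=1))
```

In this sketch the weight `u` shrinks towards zero for points far from every component, so the two outliers barely influence the location and scale updates. A Gaussian mixture corresponds to the limit of large `nu`, where `u` tends to 1 for every point and outliers count fully.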