Supervised classification of combined copy number and gene expression data

Summary In this paper we apply a predictive profiling method to genome copy number aberrations (CNA) in combination with gene expression and clinical data to identify molecular patterns of cancer pathophysiology. Predictive models and optimal feature lists for the platforms are developed by a complete validation SVM-based machine learning system. Ranked list of genome CNA sites (assessed by comparative genomic hybridization arrays – aCGH) and of differentially expressed genes (assessed by microarray profiling with Affy HG-U133A chips) are computed and combined on a breast cancer dataset for the discrimination of Luminal/ ER+ (Lum/ER+) and Basal-like/ER- classes. Different encodings are developed and applied to the CNA data, and predictive variable selection is discussed. We analyze the combination of profiling information between the platforms, also considering the pathophysiological data. A specific subset of patients is identified that has a different response to classification by chromosomal gains and losses and by differentially expressed genes, corroborating the idea that genomic CNA can represent an independent source for tumor classification.


Introduction
Both transcriptome profiling, by gene expression microarray, and genomic copy number aberrations (CNA) detection, by comparative genomic hybridization arrays (aCGH), have been used to produce molecular portraits of breast cancer specimens.These are used to derive signatures of prognostic value for patients by means of unsupervised hierarchical clustering.Microarrayderived data allowed the identification of five breast cancer subtypes (basal-like, luminal A, luminal B, ERBB2, normal breast-like) two of which (basal-like and ERBB2) have been associated with strongly reduced survival [14,15,18,19].aCGH also consistently detected characteristic CNA in breast cancer, which allows the classification of tumor samples on the basis of their pattern of chromosomal gains and losses [10,11,16].Based on unsupervised clustering on aCGH data, the breast tumor genomes fall into one of three categories (called 1q/16q or simple, complex and amplifier) [5].The emergence of features distinguishing breast cancer subgroups on the basis of either their genomic and transcriptomic blueprints raises the possibility combining them to produce compound signatures potentially endowed with extended predictive power.A recent extensive study [4] explored the value of the combination of microarray-derived and aCGH-derived data obtained from the same collection of tumor samples, coming from patients undergoing aggressive adjuvant chemotherapy and endowed with adequate clinical description and follow-up.The conclusion was that the classes obtained by expression profiling have different recurrent CNA, and that the parallel use of these data can improve patient stratification according to the outcome.These results suggest that, for breast cancer, integration of genomic and transcriptomic abnormalities could provide a potential enrichment in predictive power, justifying further attempts to unravel the structure of the relationship between these two levels of observation of cancer genomic instability.
We introduce in this context a set of machine learning methods to investigate common and different structures within a possible integrative space of high-throughput features.In terms of machine learning, our main task is the predictive classification and feature ranking of gene expression and genome copy number with respect to the Luminal/ER+ and Basal/ER-classes.We first develop predictive models and optimal feature lists for the two platforms separately.Different encodings are applied to the CNA data.Then we filter data and variables and consider the combination of profiling information between the platforms, also considering the pathophysiological data.Finally, we analyze in detail a specific subset of patients in which differences are found by BAC aCGH and gene expression.We show that SVM-based classification of microarray-derived and aCGH-derived data on a common subset of 63 samples detects a subset of 8 cases with specific pathophysiological characteristics.
To our knowledge, this is one of the first studies applying supervised classification in a complete validation context to aCGH data and combining them with gene expression data with the same machine learning methodology.The structure of the rest of the paper is as follows: the dataset and preprocessing methods are discussed in Sec. 2, and Sec. 3 details the machine learning framework, whose application is described in Sec. 4. Results are finally discussed in Sec. 5.

Data description
This paper is based on the aCGH and gene expression data first presented in [4].In this section we summarize the main information on the original data and provide details on different preprocessing methods we applied for supervised classification tasks.
The array-CGH data were obtained from the Scanning and OncoBAC methodologies as described in [4].The data consists of 2149 BAC describing 149 samples.This dataset includes missing values (NAs).In this study, we consider the data either as BAC or by different encodings and we impute the missing values through kNN as in [20], we eliminate samples and/or genes with too many NAs, or we obtain imputation directly from another type of preprocessing which encodes BAC into segments.The segmented data are obtained by processing the original aCGH data as in [12,21]: circular binary segmentation is applied to all samples to obtain, for each chromosome of each sample, a piecewise constant function with the DNAcopy R/Bioconductor package.By intersecting all those values (Fig. S1 in Supplementary Material), we build a sparse matrix of BioDCV inputs.Note that NA imputation is automatically performed within the segmentation algorithm.Expression data for 118 samples described by 22215 probes were collected by using Affymetrix HG-U133A arrays and preprocessed as detailed in [4].Tumor subtype classes can be assigned on these data by clustering with the 70-genes molecular signature from [19].The samples were labelled as belonging to one out of the five classes (basal-like, ERBB2, luminal A, luminal B and normal-like) according to proximity to the molecular-signature cluster computed both by the Euclidean distance and by Pearson's correlation (in case of ambiguity we consider the latter) [4].Although an alternative classification in subtypes of biological relevance have been recently proposed [9], in this study we follow the original five class partition.
Clinical data are available and consist of 136 fields for 174 patients.In particular, the ER status and the disease-specific survival information are provided.

Methods
Our approach towards supervised classification relies on a set of algorithms aimed at achieving predictive classification together with stable molecular profiles, avoiding overfitting risks due to selection bias effects [1].This method has been used so far with satisfying results in different functional genomics and proteomics tasks (e.g.profiling microarrays, time series microarray, integration with clinical data, and mass spectrometry data) and also by using grid computing infrastructures [2,3,6,7,8,13].
The method's core is a complete validation procedure [8] realizing repeated training/test splits of the original dataset.A feature ranking algorithm is applied to the training portion, and classification models with increasing number of best ranked features are computed on the test part.
Accuracy performance is assessed by averaging classification errors on all the test splits (ATE, Average Test Error) both globally over all samples and on a samplewise basis (sample-tracking analysis).Given the population of ranked lists produced at each training/test split, a global ranked list is then computed by means of an aggregating algorithm (Borda count method).As a rule of thumb, the more the splits, the smoother the obtained results.Class label randomization is also applied to validate the entire method.The whole procedure is implemented in BioDCV, a Python/C environment freely available at http://biodcv.itc.it.
In this paper, Support Vector Machines (SVM) are used as the classification algorithm.They are well-known both in machine learning as well as in bioinformatics literature for their good performance on high-throughput data, characterized by the huge unbalance p >> n between the number n of samples and the number p of features [17].Regularization parameter and other kernel-specific parameters (such as bandwith for Gaussian kernels) are preliminarly tuned by training error minimization in bootstrap experiments.Finally, each feature in the data matrix is usually standardized across samples to zero mean and one standard deviation to avoid unwanted effects due to strong dishomogeneities in the features' ranges.
Entropy-based Recursive Feature Elimination (E-RFE) is used as the ranking algorithm [7].RFE-like algorithms are wrapping methods that recursively discard a bunch of the features which are less contributing to the classifier until all features have been eliminated.Different choices in the discarding rule lead to variants of the original RFE algorithm, which eliminates just the least important feature at each loop (stepwise backward selection).The E-RFE algorithm has been shown to achieve performances comparable with the original RFE with a consistent improvement in terms of computing time.Nevertheless, performing the complete validation procedure for a large number of splits (e.g.400 as in the experiments discussed in this paper) is still computationally heavy: reasonable computing times can thus be obtained by distributing loads on High Performance facilities such as clusters or grid infrastructures [3].

Results
In this study, we initially performed a 10-NN imputation of missing values on the aCGH data.BAC with more than 10% missing values (≥ 7 cases) across samples were discarded, leaving a total of 1590 BAC features.We also performed a circular binary segmentation on the same 65 samples, obtaining 1674 features.The BioDCV-based predictive profiling was applied to these datasets, following the steps detailed in Methods.The experiment on the gene expression data (BioDCV, linear SVM, E-RFE) reaches near perfect classification with very few genes (ATE < 1% with 3 genes, on average).Results are summarized in Tab. 1 (see also Fig. S2 and  S3): predictive average test errors (ATE) and their standard deviations are listed for increasing feature set sizes, globally and separately for the classes.This result, which extends the original study [4], is however predictable since tumor subtype labeling was set by clustering of the gene expression data and these two tumor subtypes are considered as well separated also in known breast cancer signatures [19].In particular, the most discriminating feature, for all the 400 test sets, is the probe 205225 at (ESR1).
Classification of the aCGH data as BAC (also with BioDCV, linear SVM and E-RFE) gives an ATE < 12% for all models with less than 300 features, reaching 9.8% as the minimum value with 900 features.Two samples (s0004 and s0138) are however consistently misclassified in all tests according to the BioDCV procedure.Note that the subtype class of s0004 is not univocally assigned by the Sorlie signature: the label is basal-like by Pearson correlation and ERRB2 by Euclidean distance.
We thus applied the shaving procedure detailed in [13] and removed the two samples from both the BAC and the expression data.The shaving lowers the no-information rate from the original 23 23+42 = 35.4% to 21 21+42 = 33.3%.The classification exercise was then repeated on both the datasets.For the gene expression data, the shaving procedure did not significantly affect the performances of the experiment (ATE <1% with four features).As above, the best discriminating gene was ESR1, typically the gene more directly associated with the ER status, introducing a first order effect that masks other potential descriptors of the underlying pathophysiology.We thus removed the ESR1 gene from the feature set and repeated the classification.We did not remove other ER-related probes that are poorly discriminating on this dataset.For gene expression, performances were only slightly affected (ATE <2% with 3 features and ATE <1% with 15 features), confirming the effect of other genes in this classification task.Moreover the ordering of features in the ranked list showed only minor rearrangements (discussed below).In summary, classification by gene expression is not modified by the removal of the two samples (problematic for the task on BAC) and the ESR1 probe.
Classification of the aCGH BAC data is instead improved in this setting.As shown in Tab. 2 and displayed in Fig. 2, now ATE is less than 10% for models with at least 15 features, with a minimum ATE 6.9% reached with 50 BAC.The shaving procedure has thus effectively reduced noise from the analysis.Note that performance with Gaussian kernel does not improve with respect to the simpler linear SVM model, also after parameter tuning by bootstrap-based procedures, while the resulting optimal lists are similar.We then analyzed the problem of combining genomic and trascriptomic information as derived from profiling.Alignment of BAC and corresponding HG-U133A probes for the the best features is detailed in Tab. 3 (from probes to BAC) and Tab.4 (from BAC to probes).The correspondances were seeked by using GenomeMap (NCBI: http://www.ncbi.nlm.nih.gov/mapview),NetAffix (https://www.affymetrix.com),UCSC Genome Browser (http://genome.ucsc.edu).Given the first 15 probes (ATE < 0.9% on the shaved dataset without ESR1), 11 BAC were mapped (3 corresponding to one single probe).However, we were able to map only 2 BAC of the 1590 conserved after imputation.In the other direction (Tab.4), when we considered the first 15 ranked BAC, 22 Affy probes were found for 5 different BAC, of which 11 for the HG-U133A platform used in [4].The differential expressions for these 11   in Fig. 3: best separation is shown for probe 201805 at, which is the best ranked for gene expression classifiers.It is worth noting that in both cases the mappings found low ranked features, suggesting independence between the optimal lists produced by the two experiments on the different datasets for the admissible mappings with current information.Note that greater insights may be gained by using higher resolution arrays such as SNP.
As an alternative way of comparing the best ranked lists from the two platforms, we first explored a binning technique discussed in [4], in which every chromosome was subdivided into non-overlapping sections (bins) of length 20Mb.Locations of top-20 probesets and BAC are listed in Tab. 5.Only 5 common bins were detected, supporting a possible independence between the feature sets.None of the top 20 BAC and genes lie on chromosomes 7, 8, 9, 11, 17.As an intermediate feature encoding between BAC and coarse binning, we performed the predictive profiling study on the aCGH data preprocessed by circular binary segmentation (see Methods).Classification with the segmented data gave poor performances: as shown in Tab.S5 and Fig. S6, ATE always remains above 13%, and above 20% with less than 70 segments used as features in the SVM classifiers.In order to enhance possible effects due to CNA highamplification, the classification exercise was repeated with both BAC and segments without standardizing data, after imputation.This variant worsened the performance with BAC (minimum ATE at 7.8% for 90 features), while a negligible higher accuracy (minimum ATE 12.8% at 600 features and 19.4% at 70 features) was obtained for the segmented data.In summary, seg- 13 at 3q24 ZIC1

Table 3: Mapping of the top-15 ranked probesets on the available BAC. Probeset ranking position for gene expression classification is listed for ESR1 probeset included (col. # [ER]) or excluded # [noER]). The BAC clones in parentheses are not present (missing or discarded by imputation).
Gene Expression Gene Expression mentation and non-standardized data gave worse results than standardized BAC on this dataset.Because differences were found by comparing and mapping the profiling results on the two different platforms, we consider the individual ATE error curves computed for each sample (i.e.sample-tracking curves, see Methods and [8]).The sample-tracking curves are displayed in Fig.     those eight samples.By comparing the frequency of gains and losses (as in [4]) of those eight samples vs. the remaining 55 samples, significant differences emerge for many chromosomes (see Fig. 5).Further differences can also be dectected by the box plots in Fig. 6 and Fig. S6.

Conclusions
In this paper we propose a combined approach to supervised classification of breast cancer specimens based on genomic lesion detection from aCGH and transcriptome analysis by microarray profiling.The CNA signature we derived discriminates the Lum/ER+ and Basal/ER-subgroups with a 12.5% predictive error rate (with 10 BAC features); this suggests that aCGH data could be used in classification tasks where transcriptome profiling data are unavailable or cannot be obtained for the high degree of sample degradation.Moreover, the absence of co-occurence between the genes located in the top ranked BAC and the location of the genes detected by the top ranked Affy probesets suggests a segregation of predictive power at the genome and the transcriptome level in these samples, further corroborating the idea that genomic CNA can represent an independent source for tumor classification.The 8 samples misclassified by the supervised aCGH-based analysis are characterized by a common overall worse prognosis and by statistically significant differences in the CNA profile, with substantial changes in the number of gains and losses in the 5, 6, 7, 10, 12 and 14 chromosomes.Therefore, this tumor subset could be endowed with other features emerging only on the basis of the CNA profile, and having no effect on the discriminating ability of the transcriptome profiles.This finding again reinforces the interest in aCGH-derived information for breast tumor supervised classification.

S3
Sample-tracking analysis of the Lum/ER+ vs. Basal/ER-task on the expression data.For each sample, the plot indicates percent error for BioDCV runs (indicated in parenthes) in which the sample is in test, averaged on models of the same feature set size.The horizontal grey line indicates the no-information error rate for baseline comparison.1: b0241 (94)

Figure 2 :
Figure 2: Average Test Error on the copy number data (BAC) for the Lum/ER+ and Basal/ERtask, with 95% student's bootstrap confidence intervals.

Figure 3 :
Figure 3: Expression level of the affy probesets mapped from top-ranked BAC.

Figure 4 :
Figure 4: Sample-tracking error analysis of Lum/ER+ and Basal/ER-task on the copy number data (BAC features).For each sample, the plot indicates percent error for BioDCV runs (indicated in parenthes) in which the sample is in test, averaged on models of the same feature set size.The horizontal grey line indicates the no-information error rate for baseline comparison.

Figure 5 :
Figure 5: Gains (upper red) and losses (lower green) frequencies of the eight misclassified samples are compared to the frequencies for the remaining 55 samples (gray dots).

Figure 6 :
Figure 6: Comparison of gene expression on optimal gene expression signature of the aCGH misclassified samples with the remaining data.All data are perfectly classified by the selected Affy probes.However, probes such as 212692 s at discriminate within the subset better than within the rest of the data.

Table 1 :
Average Test Error (M ) on the gene expression data for the Lum/ER+ vs. Basal/ER-task (global and classwise), with standard deviation (SD).

Table 2 :
Predictive errors (M ) with standard deviation (SD) on the copy number data (BAC) for the Lum/ER+ and Basal/ER-task (separately on the two classes and for all data), after shaving two samples.

Table 4 : Top-15 ranked BAC with the included genes. The Affy probesets in parentheses belongs to the Affymetrix U133B platform, and thus they are not present in the Chin06 dataset [4].
4.As shown in the figure, eight aCGH samples (4 Lum/ER+ and 4 Basal/ER-) are mostly misclassified.As they are altogether perfectly classified by gene expression, it is suggestive to analyze this subset in some detail.Some of their clinical features (SBR grade and survival time) are differently distributed w.r.t. to the rest of samples: SBRgrade=3 for 6/8 cases vs SBR-grade=3 for 22/55; median recurrence 3.3 years vs 6.9 years, suggesting a worse prognosis for

Table 5 : Chromosome portions of the top-20 affy probesets and top-20 BAC.
Comparison of gene expression on optimal gene expression signature of the 8 aCGH misclassified samples with the remaining 55 samples. S6