Integrating data from heterogeneous DNA microarray platforms

Summary DNA microarrays are one of the most used technologies for gene expression measurement. However, there are several distinct microarray platforms, from different manufacturers, each with its own measurement protocol, resulting in data that can hardly be compared or directly integrated. Data integration from multiple sources aims to improve the assertiveness of statistical tests, reducing the data dimensionality problem. The integration of heterogeneous DNA microarray platforms comprehends a set of tasks that range from the re-annotation of the features used on gene expression, to data normalization and batch effect elimination. In this work, a complete methodology for gene expression data integration and application is proposed, which comprehends a transcript-based re-annotation process and several methods for batch effect attenuation. The integrated data will be used to select the best feature set and learning algorithm for a brain tumor classification case study. The integration will consider data from heterogeneous Agilent and Affymetrix platforms, collected from public gene expression databases, such as The Cancer Genome Atlas and Gene Expression Omnibus.


Introduction
DNA microarrays have been widely used in the study of several organisms and phenotypes, with applications ranging from biological sciences to health care and biotechnology.One recent application has been in the study of different types of cancer [1].The correct classification of the pathology of a patient, in particular in cancer, is essential to decide which drugs or therapies may be applied [2, Chapter 2][3] [4].Often, tumor samples with an atypical morphology complicate the analysis.In addition, certain types or subtypes of tumors may have very little differentiation between them [5].
The analysis based on gene expression is extremely important, given the gaps that traditional diagnosis methods still present.These data are obtained by measuring the amounts of mRNA in a sample for the different genes in study.DNA microarrays can monitor simultaneously expression profiles from a large number of genes [6], as each microarray slide can carry a high amount of probes.The major issue for statistical microarray data analysis is dimensionality.In a typical experiment, a table with thousands of genes for a small number of samples is obtained.This leads to situations where it is difficult, or even impossible, to employ classical prediction algorithms leading to poor prediction accuracy in discriminant models [7].
While a great number of studies have been focused on data dimensionality reduction [8,Chapter 13], applying statistical methods like Principal Component Analysis or Linear Discriminant Analysis, other works turned their attention to the increase in the number of available samples [9].One way to increase sample size is the integration of microarray data from different studies over the same phenotypes.When the platforms have the same identifiers, this implies a data adjustment to minimize batch effect, understood as non-biological experimental variation across multiple microarray experiments [10].When the platforms have different feature identifiers (i.e.different probesets), a re-annotation method is needed to make a bridge between them.
Here, it will be presented a re-annotation process based on transcripts, aiming to achieve a universal process that allows the integration of data from distinct microarray platforms.The re-annotated data will then be integrated to form a richer dataset, using different approaches, focusing on reducing batch effect.The resulting integration will be applied on brain tumor grade classification, performing feature selection and classification through filter and wrapper methods.The final result is a methodology and a set of computational tools that allows researchers on this area to integrate several gene expression datasets and apply machine learning algorithms over them, obtaining a subset of the best features-algorithm set for a specific case study.With the re-annotation based on transcripts it is possible to cross statistical significance with biological relevance and find the best subset of biomarkers.

Microarray fundamentals
Gene expression values are obtained by measuring amounts of mRNA in samples.Early methods targeted one gene (or a small number of genes) at a time requiring an a priori hypothesis suggesting which gene could be of interest.This limitation has been surpassed with the rising of DNA microarrays.Microarrays are physical structures, which have coupled thousands of specific DNA probes in a plate with a diameter of less than 250 microns [4].The sequences of these DNA probes, when in contact with a sample, will hybridize with the complementary mRNA segments.

Affymetrix probesets
In single channel microarrays, the measurement corresponds to a certain amount of expression, which is the case for Affymetrix ® platforms.These probesets contain 11 or 16 25-mer probes, each probe measuring a specific zone of the gene.The small size of the sequence means that there is a great chance of cross-hybridization in these probesets.Each probe of Affymetrix is actually a pair of probes.One of these probes is the exact complementary sequence of the target to measure, called Perfect Match (PM), and the other has only a change of one base in the middle of the string, called Mismatch (MM) [11][12].This change in the basis of MM probes is considered sufficient to not hybridize with the target, so it is considered that this probe measures background noise.The specific intensity for a probe is thus given by Eq. 1.

P robe Signal
To calculate the probeset expression, Affymetrix statistical methods (Af f y A lg) are applied over the n probes of the probeset, which aim, among other purposes, to exclude from the calculation the pairs of probes that present levels outside the normal parameters [13].The final value of a probeset measurement is often a log 2 intensity (Eq.2).

Agilent probesets
When a microarray considers the existence of two samples on the same experiment, it is called a two channel microarray.That is the case with many Agilent ® platforms.An Agilent probeset is composed of several identical probes, with a target of 60-mers oligonucleotides.The measurement of the signal by spot (probe) is made using an image processing that measures the average intensity of pixels of this spot [14].The datasets used on this work use a cell line, established through reference samples of various tissues, which serves to ameliorate the errors inherent in the platform, like for example the background error.The cell line is marked with green dye (Cy3) and the target sample is marked with red dye (Cy5).The signal that is returned is a logarithm of the ratio (Eq.3).
A microarray experiment generally results in an extensive list of probesets and their expression intensities (single-channel) or intensity ratios (dual-channel) [15].With the current state of the art, it is not possible to cover all genome with probes.There are several zones of the genes that are not monitored on a microarray experiment.This means that the expression of some exons is not measured and, thus, the "gene expression" concept is actually flawed.The target of a probeset is, in fact, a transcript and not a gene.To meet this proposal, probe designers try to direct targets to exons that belong to a single transcript [16].Another goal when designing microarray probes is specificity.The genome areas that present high levels of repeated patterns are discarded, because these areas tend to repeat all over the genome and this would cause cross-hybridization.

Related work
There have been several studies targeting data integration between microarray platforms.A few are summarized in Table 1.The main distinction is at which level the interface between platforms is made.Since the majority of the datasets available are at the probeset level, the first approach could be to directly map probeset values.However, the data files generally do not have sufficient information to achieve this mapping.The most traditional method is to integrate at the gene level, where it is possible to resort to probeset-gene mapping files available from manufacturers.However, several inconsistencies were pointed out for this approach and recent studies are turning to the transcript level [17].
Culhane et al. [18] made a co-inertia analysis to relate datasets without the need of identifiers annotation, using a multivariate method that identifies co-relationships in multiple datasets.The method was able to visually cluster genes with similar expression patterns between platforms.
The authors claim that it is possible to assist in the selection of the strongest features from each dataset for subsequent analysis.This approach prevents the bottleneck caused by the annotation based methods for integration, that limit the genes on study to those identified in all the datasets involved.
Woo et al. [19] compared the variability in expression between three different types of microarrays (Affymetrix GeneChip, cDNA and oligonucleotide).The gene identifiers and probesetgene mapping were obtained from TIGR and GenBank.The expression value for each gene was obtained by averaging probesets linked to the gene, adding experimental effects like dye and array effects.The platforms were compared by concordance of the F-statistics using a 2x2 cross-classification of significant vs. non-significant genes.Affymetrix microarrays shown greater variation in the magnitude of expression between replicates.However, in terms of statistical significance, a greater concordance between Affymetrix GeneChips and oligonucleotide arrays was shown.The analysis was done at the gene level.
Jarvinen et al. [20] compared data between different expression microarrays, using Affymetrix, Agilent cDNA and a custom-made microarray from a cDNA library.The comparison was made at the gene-level, using the UniGene cluster ID as a common handle among platforms.The expression values of probesets with the same ID, called clones, were averaged.A total of 7936 identifiers for Affymetrix, 7117 for Agilent and 7273 in the cDNA microarray were identified, but only 2340 identifiers were in common among the three platforms.After processing information, the authors obtained 1147 identifiers in common, to which they applied the Pearson and Spearman correlations to compare the gene expression between platforms.The results indicate variability across the three platforms, being greater among commercial arrays.
Ballester et al. [11] established an annotation pipeline at the transcript-level.The probe sequences were collected from the Affymetrix site and were aligned with the genome using the Ensembl exonerate tool.An association probeset-transcript was made if at least 50% of their probes match the transcript, considering 1 bp as maximum mismatch.Probes with match in more than 100 different places on the genome were discarded.Aiming to improve annotation, the 3' untranslated regions (3' UTRs) were considered.
Bellis [21] pointed the problem of annotation dependency on the reliability of files provided by manufacturers with the association of probesets-transcripts-genes.The authors refer that the assumption that an exon read leads to a specific transcript is ambiguous, which implies the need for extra validation.To analyze if two probesets annotated to the same gene are measuring the same entity, they used Pearson correlations, complemented by networks of positive and negative correlations.This structure compares all the probesets and genes and assigns them to different classes.The result was a full textual description of the association of each probeset to genes, exons and transcripts.Integrating multi-platform microarray data raises many difficulties, and therefore it is often avoided due to the lack of good results.Other authors who have taken this challenge focus on specific integration issues, and the methods used widely diverge.Kostadinova [22] examined how the combination of several related microarray datasets affects different areas of preprocessing and analysis of gene expression data, such as missing value imputation, gene clustering and biomarkers detection.Meng et al. [23] used a top-level approach, with multiple co-inertia analysis to relate two datasets, by maximizing the covariance between eigenvectors in paired dataset analysis.With this method, they can integrate and compare multi-omics data, independent of data annotation.The majority of the authors also focus on a single vendor or on datasets that are only one-channel or two-channel based.Papiez et al. [24] combined data sets from two types of microarrays: oligonucleotide and cDNA.They extract a set of genes common for both platforms and remove batch effects obtaining combined p-values from the two experiments.Even being different platforms, both datasets considered only intensity values.
Although being difficult, data integration has been seen as an important achievement, even leading to proprietary solutions, like the ones described by Willis et al. [25].In this work, a combination of Thomson Reuters solutions was used, including the Data Annotation and Processing tool, MetaCore Pathway Analysis and the Biomarkers Module of Thomson Reuters Integrity for a simple workflow to process omics data and identify biomarkers of target engagement through pathway analysis and data integration.

Methodology
This work intends to establish a general methodology/architecture for microarray data integration and application.The process begins with data collection of gene expression datasets from different free available sources.These datasets must be normalized for expression values and phenotypes, to achieve a common structure across them.The re-annotation processes are then applied to bring heterogeneous annotation datasets into a common backbone of features.The datasets could then be integrated, taking into account the batch effect, and applied on a case study where it is possible to select the best features and classification algorithm.The system was named Integrated Gene Expression Information System (IGEIS) (Fig. 1).

Data collection
The first decision to be made is at what level to perform the data integration: at microarray spot level, probeset level, transcript level or at the gene level.An important factor is at which level the data are available.If data are on a higher level, then it is not possible to descend to a lower level.This stage also encompasses scaling of expression values, as some are on a logarithmic scale and others are on an absolute scale.As most differential expression tools use the logarithmic scale, this was chosen.A more intricate problem is the standardization of phenotypic meta-data.Clinical data are often provided in plain text.This leads to a need of manual intervention to achieve homogeneous and unambiguous information that associates gene expression levels to specific phenotype attributes.

Re-annotation
The consistent homogeneous feature annotation is a very important step to compare expression data from different platforms.It is only possible to combine values of expression of a probeset with another one if both have the same target.The manufacturers provide annotation to their probesets, but with the constant evolution of platforms, this is quickly outdated [11].The probeset targets are typically (sets of) exons that comprise transcripts.If two probesets have different transcripts as target from the same gene, it is not surprising that they present different expression values.
The re-annotation starts with probe sequence alignment with the genome.This stage is done by collecting the sequences of each probe and registering all the matches those sequences present along the genome.As only exons are expressed, the matches of interest are the ones that lie on these sections.Given the exon-transcript correspondence, it is then possible to establish a probe-transcript correspondence, having in mind that the same exon can belong to multiple transcripts due to alternative splicing.For the contribution ratio of a probeset to a transcript expression value, the quantity of matches that the probeset makes with the transcript will be used, divided by the total matches on the whole genome (Eq.4).
where: psetR -probeset ratio for this transcript; npm -number of probe matches on the transcript; npn -number of probe matches outside the transcript; i -probeset index; j -transcript index.
For the determination of npn, it is necessary to check if each hit is outside the bounds of the target transcript, because a probe may have a hit in another transcript that has overlapping sections and, therefore, be identified erroneously as an outside match.Where it is not possible to find the probe sequences, or when a probe does not have matches, it will be discarded.A transcript expression value will be the weighted average of corresponding probesets (Eq.5): where: te -transcript expression; psetV -probeset expression value; n -number of probesets that match transcript j.With Eq. 5 it is possible to restructure datasets on a transcript based manner.

Datasets integration
The method used for integration may be generic or adapted for a specific case study.On brain tumor grade determination, the problem to solve is a classification task, with 4 distinct classes (1 to 4), being 1 the less malignant grade and 4 the higher malignant grade, called glioblastoma.
To exemplify, the distribution of the expression values across samples of an Affymetrix dataset is shown in Fig. 2, specifically for the probeset 201292 at (gene TOP2A).This probeset was highlighted from a previous eBayes analysis, presenting a p-value of 8.02E-34 in this analysis, run to determine differential expression significance.In Fig. 2, a positive correlation between expression and tumor grade is clearly evident.
The platforms to be integrated are from Affymetrix and Agilent so, in this stage, batch effect needs to be considered, as well as ratio-intensity differences between them.Since prenormalization already done over datasets is not known, it is not possible to establish a concrete mathematical transformation to achieve this goal.It was then chosen to experiment six different ways to obtain a comparison: • Norm.Where the data is mean subtracted and divided by standard deviation.
• Linear.A set of reference features are selected, that present the lowest dataset and intergrade expression variation in all datasets.When joining two datasets, one is used as base and a linear model is built that represents the linear variation between expression values of references from the base and the references of the new dataset.That model is then used to adjust the entire values of the new dataset.This is applied to linear data, without log2 transformation.
• LinearLog2.The same as the previous, but applied to log2 transformed data.
• Grade.Since the grade is an ordered value, dividing all data by the mean of values of common grade brings the heterogeneous datasets to levels that are comparable.This is applied to linear data, without log2 transformation.
• GradeLog2.The same as the previous but applied to log2 transformed data.

Find candidate biomarkers
Four candidate machine learning algorithms were used, suitable for multi-level classification tasks: Ordinal Logistic Regression (OLR); Support Vector Machines (SVM); k-Nearest Neighbors (k-NN); and Linear Discriminant Analysis (LDA).
The process used to select the best pair of training algorithm -feature set is to firstly choose the possible algorithm candidates and then test each against possible combinations of feature sets.Obviously, this would result on a large processing burden.Then, the first step is to do a filtering stage, to reduce the initial set to a more reasonable subset.Two filtering methods will be used: Pearson's correlation of each feature with the tumor grade and differential expression analysis using empirical Bayes (eBayes), which will give a rank with p-values of the features that best differentiate the four tumor grades.Since some algorithms do not deal well with data collinearity, features that have a high Pearson's correlation among them will be eliminated.
With the reduced feature set (rfSet), a greedy wrapper algorithm is deployed to choose, for each algorithm, the best feature subset.The main steps of this algorithm are are shown in Fig. 3.The first iteration is made with only sets with a single feature, testing each algorithm with each of the filtered features.The features which present the best results will be kept and then used as base for the next set of tests, where an additional feature, from the rest of filtered features, is included to organize sets of 2-features combinations.This method is repeated until reaching the maxSetSize threshold.The test criterion was the lower classification error rate (number of misclassifications over total number of samples) by Leave One Out Cross Validation (LOOCV).
For OLR, the Akaike's Information Criterion (AIC) was calculated, due to lower computation burden.

Implementation and Results
Gene expression datasets and phenotypes where collected from Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA), targeting all datasets that present good quality and have samples with explicit brain tumor grade classification.For re-annotation, the probe sequences were obtained from the official sites of the respective vendors and genome annotation was gather from Ensembl.Computational tools were developed with R programming language, using Bioconductor project framework broadly.The database used for datastage and re-annotation processes was implemented over the MySQL Management System.

Data collection
The data was collected from GEO at the probeset level, while in TCGA the data was found in three levels.This restricted the integration to a level not lower than that of probesets.The Affymetrix and Agilent probeset identifiers were collected from the original datasets.Assuming a specific identifier of a manufacturer measures the same sequence, even if the platform is different, a global set of unique probeset identifiers were collected, for each manufacturer.The Agilent platforms were: • Human 1B Microarray: Design ID: 011871; Design Format: 1 X 22K; • Human 1A Microarray (V2): Design ID: 012097; Design Format: 1 X 22K; • Whole Human Genome Microarray: Design ID: 026652; Design Format: 4x44K v2; • Whole Human Genome Oligo Microarray: Design ID: 012391; Design Format:1x44K.
The Affymetrix platforms were: • Human Genome U133 Plus 2.0 Array • Human Genome U133A • Human Genome U133B • Human Genome U95Av2 The Agilent probe sequences were collected with the eArray tool, available in the following URL: https://earray.chem.agilent.com/earray,using the option Microarray → Browse Microarray Designs, selecting "H.sapiens".The Affymetrix sequences were taken from the Web site through the option Products → Microarray Solutions → 3' IVT Expression Analysis → select desired microarray → Technical Documentation → Sequence Files (Tabular).It was checked that all probesets of U133A and U133B are also in U133 Plus 2.0 and that some U95Av2 probesets are also in U133 Plus 2.0.So, the collection was started with the 54613 U133 Plus 2.0 probesets, to which 12286 U95Av2 specific probesets were added, making a total of 66899 unique probeset identifiers.On U133 Plus 2.0 array, 9711 probes were identified with duplicate sequences and in the U95Av2 array 1722 duplicate sequences were found.After joining the two sets, repeated sequences were noted in 22351 probes.

Re-annotation
Using Biostrings, with the human genome version BSgenome.Hsapiens.UCSC.hg19, the match of each probe sequence with the genome was done.Each probe could have zero or more matches along the genome.A maximum of one base pair (bp) mismatch [11] was allowed.
From the 799238 Affymetrix probes less than 5% do not have any matches in the genome.
Most probes present only one match, while a maximum of 366059 matches was registered for a single probe.The largest numbers of matches were recorded in 4 probes of the probeset 1008 f at.There are also cases where the same probe matches in several distinct regions of the same gene.For example, in the probeset 1557055 s at a probe was detected with 10 distinct matches in the gene RP11-206L10.11(ENSG00000228794).The distribution of number of matches per probe is given in Tab. 2.
For some probesets (0,6%), no match has been identified for any probe.This could be explained by the upgrade of the genome annotation.Other possibilities could be pointed, like the fact that approximately 200 Mbp of theTab human genome, mainly from the centromeres and the short arms of the acrocentric chromosomes, are missing from the human reference genome [26].The Genome Reference Consortium (GRC) builds the genome reference leaving these sequences in structures outside chromosomes.A GRC genome assembly, for H. Sapiens species, is composed by [27]: 24 "relatively complete" sequences for chromosomes 1 to 22, X and Y; a complete mitochondrial sequence; several "unlocalized sequences" (their exact location within the chromosome is not known); several "unplaced sequences" (their chromosomal association is not known); and several "alternate loci" (contain alternate representations of specific regions).Considering these 'extra' genome sections, 487 additional probes were identified.The rest of the probes were analyzed with BLAST tools from Ensembl and UCSC, but continued to return no matches.
Annotation through GeneAnnot [28] presented incomplete results (Table 3) and is gene centric, i.e., it is not possible to obtain the exact chromosome location.The information obtained from GeneAnnot can be found through Affymetrix annotation files.The fact that some probes do not match any region of the genome could be due to outdated Affymetrix probe design (Build 133, April 2001 for U133 Plus 2.0 Array) when compared with the genome assemblies that are frequently updated by GRC (February 2009 for GRCh37/hg19).Probes that showed no match were discarded.Among the probes with matches, 95.2% registered at least one match in the genes referenced by Ensembl, getting 5% of the probes without apparent connection to any gene.79.6% of the probes with matches in genes had at least one match in an exon.The remaining cover zones of introns, UTRs, or boundary regions that do not belong in their totality to exons.The annotation process was made by stages, crossing the probe matches obtained from Biostrings with the genome information contained on the Ensembl database.From Ensembl, for each exon, information was collected about the corresponding gene, sequence, location (start and end) on the respective chromosome, and strand.A filter was applied to return only the exons present in the 24 chromosomes from 1 to Y.The information extraction for exons was made using biomaRt R package, because it allows to associate the sequence of each exon to the rest of the information.The transcript annotation data obtained with the biomaRt package include non-normalized fields that would require a considerable processing time for their normalization.Thus, the Ensembl data for transcripts and genes was collected using the BioMart web tool in CSV format from http://www.ensembl.org/biomart/martview,with the options: Database: Ensembl Genes 72 → Dataset: Homo sapiens genes (GRCh37.p11)→ Attributes.It was possible to identify, for each transcript, the gene, the strand, start and end positions on the chromosome.In parallel, the transcript-exons map was built.

Find candidate biomarkers
Pearson's correlation and eBayes filters were applied on these datasets and the probesets were sorted by rank.Over this, a cleaning process was performed to remove collinear probesets, using a 0.9 Pearson's correlation cutoff.Finally, only a set of the 200 best ranked probesets were kept for further tests.The wrapper method was tuned to make combinations until a maximum of 20 features.Also, the quantity of the best combinations preserved between cycle iteration was limited to 20.For k-NN, k was always set to 3. The main results are provided in Tables 4, 5 and 6.The differences between results associated with the filtering method are not significant, so one will be used arbitrarily.It is also possible to see that the accuracy is always better for Agilent datasets, and SVMs have the best performance when considered both platforms.In the matters of probeset, gene or transcript level, the former presented slightly worse results.Between gene and transcript, it is not possible to distinguish a clear difference.Besides the identification of the best classification algorithm, it is important to compare the prediction features obtained for Affymetrix and Agilent.If a transcript/gene is related to the grade phenotype, then it should be relevant for both platforms.To check this premise, eBayes was applied to rank features separately, and then pairwise Kendall's coefficient of concordance (W) was calculated.For genes, W = 0.675 was obtained, and for transcripts W = 0.711.There is a considerable discordance about feature relevance between Affymetrix and Agilent.Deepening the analysis, the features selected as the best predictor with SVM for Affymetrix was used for Agilent tests and vice versa.With Affymetrix predictors, a LOOCV error of 0.10 for genes and 0.09 for transcripts, was obtained when these predictors were applied on Agilent.For Agilent predictors, an error of 0.29 for genes and 0.32 for transcripts was obtained, when applied on Affymetrix data.It is clear that predictors obtained from Agilent data tend to overfit more and loose generalization.
The next tests were made joining all datasets on a single one, using different methods for data integration (Table 7).SVMs are confirmed as the training algorithm that gives the best results and the division of the data by a reference set of values (grade) also improves accuracy.
As the common grade between all datasets is g3, this was the reference grade used.Joining heterogeneous datasets aims to produce a robust training set that allows to classify a new sample with a good rate of assertiveness.With these LOOCV tests it is possible to preliminary confirm that it is possible to do this with datasets from Agilent and Affymetrix.To take the tests a little further, the same joined dataset was used to build a model to predict a new Affymetrix dataset, collected from TCGA, not used before in any stage.This new dataset has 10 g0 and 277 g4 samples so, to apply the reference grade method, all data was divided by grade 0. The results again confirmed the good performance of the method (Table 8).

Figure 2 :
Figure 2: Expression/grade distribution for Affymetrix probeset 201292 at.Grade 0 corresponds to non-tumor samples and grade 1 was not represented on collected datasets

Figure 3 :
Figure 3: Algorithm for best feature set selection

Table 1 :
Related works

Table 2 :
Number of genome matches per probe

Table 6 :
Results for classifiers (columns), filtering methods (rows) at the transcript level.

Table 7 :
LOOCV error rate for Affymetrix and Agilent integrated data Raw Norm Linear LinearLog2 Grade GradeLog2