Co-expressed gene groups analysis (CGGA): An automatic tool for the interpretation of microarray experiments

Summary
Microarray technology produces vast amounts of data by simultaneously measuring the expression levels of thousands of genes under hundreds of biological conditions. One of the principal challenges in bioinformatics today is the interpretation of this large amount of data using different sources of information. We have developed a novel data analysis method named CGGA (Co-expressed Gene Groups Analysis) that automatically finds groups of genes that are both functionally enriched, i.e. share the same functional annotations, and co-expressed. CGGA automatically integrates the information from microarrays, i.e. gene expression profiles, with the functional annotations of the genes obtained from genome-wide information sources such as Gene Ontology. By applying CGGA to well-known microarray experiments, we have identified the principal functionally enriched and co-expressed gene groups, and we have shown that this approach enhances and accelerates the interpretation of DNA microarray experiments.


Introduction
One of the main challenges in microarray data analysis is to highlight the principal functional gene groups using different sources of genomic information. These sources, which keep growing with the ever-increasing volume of genomic data, include:
• Taxonomies, thesauri and ontologies providing semantic information about genes, for example: Gene Ontology (GO), Unified Medical Language System (UMLS), Medical Subject Headings (MeSH), Universal Protein Resource (UniProt), etc.
• Literature and bibliographic databases (articles, on-line libraries, etc.) covering the results of previous analyses: PubMed, Medline, etc.
A variety of statistical and data analysis approaches that identify groups of co-expressed genes based only on the expression profiles, i.e. without taking prior knowledge into account, have been reported: [4], [6], [8], [22]. A common characteristic of purely numerical approaches is that they determine gene groups (called clusters) of potential interest; however, they leave to the expert the task of discovering and interpreting the biological similarities hidden within these groups.
These methods are useful, because they guide the analysis of the co-expressed gene groups. Nevertheless, their results are often incomplete, because these approaches do not include biological considerations and because they reject heterogeneous groups, i.e. groups whose genes belong to several functional groups [21]. One of the major current goals in bioinformatics is therefore the automatic integration of biological knowledge from different sources of information with gene expression data [2]. A first assessment of the methods developed to answer this challenge was proposed by Chuaqui [5].
Nowadays, one of the richest sources of biological annotations is found in structured, controlled vocabularies such as ontologies. These annotations can carry functional, relational and syntactic information on genes. Here we build on two recently developed research axes, sequential and a priori, that exploit multiple sources of annotations such as Gene Ontology.
The sequential axis methods first build co-expressed gene clusters (groups of genes with similar expression profiles). They then detect co-annotated gene subsets (genes sharing the same annotation). Afterwards, the statistical significance of these co-annotated gene subsets is tested. Among the methods in this axis are Onto Express [7], Quality Tool [9], EASE [10], THEA [15] and Graph Modeling [21].
The a priori axis methods first find functionally enriched groups (FEG), i.e. groups of genes co-annotated by function. They then integrate the information contained in the expression profiles. Finally, the statistical significance of the FEG is tested by an enrichment score [14], a pc-value based on a hypergeometric distribution [3], or a z-score test [11].
Our approach, called CGGA (Co-expressed Gene Groups Analysis), is inspired by the a priori axis. The FEG are initially formed from the Gene Ontology. Next, a function that synthesizes the information contained in the expression data is applied in order to obtain an ordered gene list, in which the genes are sorted by decreasing expression variability. The statistical significance of the FEG obtained is then tested using a hypothesis test similar to the one presented in Onto Express. Finally, we obtain co-expressed and statistically significant FEG.
The IGA algorithm [3] is a method from the a priori axis that finds the FEG of the most expressed genes, but leaves out all the FEG made up of less expressed genes that nevertheless have a similar level of expression and may therefore be related. Our CGGA method is an extension of the IGA algorithm that finds all FEG subsets of significantly co-expressed genes with a similar level of expression.
This article is organized in the following way: in section 2 we describe the validation data as well as the tools used: databases, ontologies, statistical packages; our algorithm CGGA is described in section 3; the results obtained are presented in section 4 and the last section presents our conclusions.
2 Data and Methods

Dataset and Statistical Pretreatment
In order to evaluate our approach, the CGGA algorithm was applied to the DeRisi dataset, which is one of the most studied in this field [6]. This dataset measures the variations in gene expression profiles during the cellular process of diauxic shift in the yeast Saccharomyces cerevisiae. When inoculated into a glucose-rich medium, the budding yeast first converts the glucose to ethanol (anaerobic fermentation); once the glucose is exhausted, it switches to aerobic respiration of the ethanol. This shift from anaerobic fermentation of glucose to aerobic respiration of ethanol is the so-called diauxic shift.
The technique used is the two-channel microarray, in which samples are labelled with two fluorochromes with distinct emission spectra, Cy3 and Cy5. The DeRisi dataset contains the expression levels of 6199 ORFs (open reading frames) of the yeast (an entirely sequenced organism) for 7 time points that correspond to samples harvested at successive two-hour intervals after an initial nine hours of growth.
The dataset was pretreated by taking the log2 ratios (so that cellular inductions and repressions are treated in a numerically symmetric way) and by applying the k-nearest neighbors imputation algorithm [12] in order to treat the missing values (1.9% of the total).
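The two pretreatment steps can be illustrated with a minimal sketch. It is not the imputation procedure of [12]: it assumes a simplified k-nearest-neighbors rule (unweighted mean of the k closest genes, distance computed over the time points both genes have), and the function names `log2_ratios` and `knn_impute` are hypothetical.

```python
import math

def log2_ratios(cy5, cy3):
    """log2(Cy5/Cy3) per spot; None where an intensity is missing or non-positive."""
    out = []
    for r, g in zip(cy5, cy3):
        ok = r is not None and g is not None and r > 0 and g > 0
        out.append(math.log2(r / g) if ok else None)
    return out

def _distance(a, b):
    """Root mean squared difference over the time points observed in both profiles."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute(rows, k=2):
    """Replace each None by the mean of that column over the k nearest gene profiles."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other genes that have a value at time point j.
            candidates = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d != math.inf]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Toy data (invented): a four-fold induction and a four-fold repression
# become +2 and -2, i.e. symmetric around zero.
ratios = log2_ratios([4.0, 1.0], [1.0, 4.0])

profiles = [
    [1.0, 2.0, None],   # gene with a missing time point
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 100.0],  # distant profile, ignored when k=2
]
filled = knn_impute(profiles, k=2)
```

With these toy values, the missing entry is filled with the mean of the two closest profiles' third time point.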

Ontology and Functionally Enriched Groups (FEG)
In order to fully exploit data, knowledge discovery systems rely on a formal representation of information based on well-defined semantics [19]. These formal requirements have led to the use of the well-structured ontology Gene Ontology (GO) and the nomenclature database SGD. The structure of Gene Ontology (GO) and the annotations of the Saccharomyces cerevisiae genome with GO terms were retrieved from the GO database web site in May 2006. Automatic annotations not reviewed by curators (IEA evidence code) were discarded. For each gene product, we have stored all its functional annotations together with their parent terms, preserving the hierarchical structure of GO.

Gene Ontology (GO)
GO is a controlled vocabulary developed by a consortium of scientists to address the need for consistent descriptions of gene products in different databases. It can be used to annotate a gene or gene product by a GO-term, with regard to its molecular functions (GO:MF), cellular localizations (GO:CL) and biological processes (GO:BP).
GO-terms are organized in structures called directed acyclic graphs (DAGs), which differ from hierarchies in that a child, or more specialized, term can have many parent, or less specialized, terms. Annotators can assign properties of gene products at different levels, depending on how much is known about a gene [1].
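Storing each annotation together with its less specialized terms amounts to walking the DAG upwards from every directly assigned term. A minimal sketch follows; the term identifiers and the `parents` mapping are invented for illustration and do not reflect the actual GO file format.

```python
def ancestors(term, parents):
    """All less specialized terms reachable from `term` in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # a term may be reached through several parents
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(direct_annotations, parents):
    """Extend each gene's direct annotations with every ancestor term."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full

# Hypothetical toy DAG: term "C" has two parents, which a strict hierarchy would forbid.
parents = {"C": ["A", "B"], "A": ["ROOT"], "B": ["ROOT"]}
direct = {"gene1": ["C"], "gene2": ["A"]}
annotations = propagate(direct, parents)
```

Here `gene1` ends up annotated with its direct term and with both parent branches, which is the property that distinguishes a DAG from a tree.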

Genome Data
In order to be consistent with the GO annotation files and to reconcile the multiple yeast gene identifiers, we have used the Saccharomyces Genome Database (SGD). SGD is a scientific database of the molecular biology and genetics of the yeast [24].

Functionally Enriched Groups (FEG)
The whole set of FEG was built by querying the GO database: each FEG is a pair made up of a GO-term and the list of genes annotated with that term.
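In terms of data structures, building the FEG is an inversion of the gene-to-terms mapping into a term-to-genes mapping. The sketch below assumes the annotations are already available as an in-memory dictionary (the paper obtains them through database queries); `build_feg` and the identifiers are hypothetical.

```python
from collections import defaultdict

def build_feg(annotations):
    """Invert {gene: terms} into {term: sorted list of genes annotated with it}."""
    groups = defaultdict(set)
    for gene, terms in annotations.items():
        for term in terms:
            groups[term].add(gene)
    return {term: sorted(genes) for term, genes in groups.items()}

# Invented toy annotations.
toy_annotations = {
    "gene1": {"termA", "termB"},
    "gene2": {"termA"},
    "gene3": {"termB"},
}
feg = build_feg(toy_annotations)
```

Each key of `feg` is a term and each value is the gene list of the corresponding FEG.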

Expression Profile Measure of the Genes
In order to incorporate the expression profiles of the genes, we have used a measurement of their expression variability, the f-score, which is more robust than other measurements such as ANOVA, fold change or Student's t statistic [17].
This measurement enables us to build a list of genes, the g-rank, ordered by decreasing expression variability. We have used the SAM program [23] to calculate the f-score associated with each gene.
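Once a variability score is available for every gene, building the g-rank is a sort. The sketch below takes the scores as given (it does not reimplement SAM); the function names and the score values are invented, and breaking ties by gene identifier is an assumption made only to keep the order deterministic.

```python
def build_g_rank(f_scores):
    """Genes ordered by decreasing score; ties broken by gene identifier."""
    return sorted(f_scores, key=lambda gene: (-f_scores[gene], gene))

def rank_positions(g_rank):
    """Map each gene to its 1-based position in the g-rank list."""
    return {gene: position for position, gene in enumerate(g_rank, start=1)}

# Invented scores for four hypothetical genes.
scores = {"geneA": 2.5, "geneB": 7.1, "geneC": 0.3, "geneD": 7.1}
g_rank = build_g_rank(scores)
positions = rank_positions(g_rank)
```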

Co-expressed Gene Groups Analysis (CGGA)
CGGA is based on the idea that any similar change (co-expression) of a gene subset belonging to an FEG is physiologically relevant. We say that two genes are co-expressed if they are close in the sense of the metric given by the expression variability (f-score). The CGGA algorithm computes a pc-value for each FEG that estimates its coherence (according to the g-rank) and thus makes it possible to detect the statistically significant groups.

CGGA Algorithm
The CGGA algorithm first builds the g-rank list from the expression levels and the FEG from the GO database. For each FEG of n genes, the algorithm determines the n(n + 1)/2 gene subsets that we want to test for co-expression. For each subset, we compute the pc-value corresponding to the test described below in order to decide whether the genes of the subset are co-expressed.
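One reading consistent with the count n(n + 1)/2 is that the tested subsets are the runs of FEG genes that are consecutive once the FEG is sorted by g-rank position (single genes included). The sketch below enumerates the subsets under that assumption; `contiguous_subsets` and the toy ranks are hypothetical.

```python
def contiguous_subsets(feg_genes, rank):
    """All runs of consecutive FEG genes, after sorting the FEG by g-rank position."""
    ordered = sorted(feg_genes, key=lambda gene: rank[gene])
    n = len(ordered)
    return [ordered[i:j] for i in range(n) for j in range(i + 1, n + 1)]

# Invented FEG of n = 4 genes with their positions in the g-rank list.
toy_rank = {"g1": 3, "g2": 10, "g3": 50, "g4": 200}
subsets = contiguous_subsets(["g3", "g1", "g4", "g2"], toy_rank)
```

For n = 4 this yields 4 * 5 / 2 = 10 subsets, from the single genes up to the whole FEG.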
Let H0 be the hypothesis that x genes from one of these subsets were associated by chance, given their place on the g-rank list. If H0 is rejected, it is likely that the genes belonging to the subset are improbably close on the list because they have very similar expression profiles.
To compute the probability that H0 is true for a fixed FEG subset or class, we ask the following question: how likely is it to find x members of the class placed this way on the g-rank list? The answer to this question is given by a hypergeometric distribution.
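The exact expression used by CGGA is not reproduced in this text. As a generic illustration only, the sketch below computes a standard hypergeometric upper-tail probability: the chance of finding at least x class members among n genes drawn at random from N genes, K of which belong to the class. The parametrisation (N, K, n, x) and the name `hypergeom_tail` are assumptions and may differ from the statistic actually used by the authors.

```python
from math import comb

def hypergeom_tail(N, K, n, x):
    """P(X >= x) for X ~ Hypergeometric(population N, K class members, n draws)."""
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return favourable / total

# Toy numbers (invented): 10 genes, 4 in the class, 3 drawn, at least 2 class members.
p_toy = hypergeom_tail(10, 4, 3, 2)
```

A small tail probability means that such a concentration of class members would rarely arise by chance, which is the kind of evidence used to reject H0.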

Input:
List of annotations for each gene G: annotations(G).
Ordered list of N genes: g-rank.
Output:
Results set containing the FEG of co-expressed genes: results(FEG_A).
For example, let FEG_A be the annotated set, thus we have:
Then, all the subsets of {g_1, g_2, g_3} are deleted from results(FEG_A). Finally, the total result consists of all the groups of co-expressed and significant genes (stage 19).

Example
An example of CGGA applied to a group of co-annotated genes is presented in Table 1. The data used in the example come from the experiment carried out by DeRisi (see section 2.1), in which the diauxic shift process of the yeast Saccharomyces cerevisiae was analyzed.
The ordered g-rank list was computed using the f-score obtained with the SAM program (see section 2.3). The data of the FEG annotated "vacuolar protein catabolism" were obtained from the GO database (see section 2.2). This FEG contains 4 genes (n = 4) whose ranks in the total g-rank list vary from 6 to 424.
In Table 1 we show the values of the parameters needed to determine the significant gene subsets within the FEG. We have highlighted the subset of genes {1, 3} from the vacuolar protein catabolism FEG that CGGA found to be significantly co-expressed.
Columns of Table 1: List g-rank, x, Gene ID (SGD), GO Annotation, r g(x), R g(x).
CGGA tested H0 for the (4*5)/2 = 10 FEG subsets by computing their pc-value and comparing it to the p-value. For example, the pc-value corresponding to the subset {S000000490, S000001586}, of ranks 6 and 8 in the g-rank, is 2.63E-05 (cf. Table 2). Since this pc-value is lower than the p-value, fixed at 6.88E-04 (cf. section 3.1), CGGA rejected H0 and the group of genes {S000000490, S000001586} is labelled statistically significant and co-expressed. The genes of rank 6 and 8 are very close in the list and are therefore co-expressed. On the other hand, the genes of rank 69 and 424 are rather distant from their closest neighbours, i.e. the groups that contain them are not significantly co-expressed.

Results
In order to evaluate our method, we compared the results obtained by DeRisi [6], IGA [3] and CGGA. The results obtained using CGGA for the over-expressed and under-expressed genes are presented in Table 2 and Table 3 respectively. As expected, almost all groups identified as significantly co-expressed by the DeRisi method have also been identified by CGGA. The groups identified by both CGGA and DeRisi are in bold, the ones identified only by CGGA are in italics, and the only group also identified by IGA is in SMALL CAPS.
In the case of over-expressed genes (Table 2), CGGA found seven of the nine groups obtained manually by DeRisi [6]. The two groups annotated "glycogen metabolism" and "glycogen synthase" were not identified by CGGA because they are expressed only in the initial phase of the process. However, CGGA identified eight other statistically significant and coherent groups. Only one of these eight groups has also been identified by IGA, and none of them by DeRisi. In the case of under-expressed genes (Table 3), CGGA found seven of the eight gene groups selected manually by DeRisi. As with the over-expressed genes, one group, annotated "ribosome biogenesis", was not identified by CGGA because it was only expressed during the final phase of the process. CGGA also identified seven other statistically significant and coherent groups which were identified neither in the DeRisi analysis nor by IGA.
The three groups identified by DeRisi that CGGA did not identify, namely the over-expressed groups "glycogen metabolism" and "glycogen synthase" and the under-expressed group "ribosome biogenesis", share two important properties. First, they contain genes belonging to a heterogeneous structure, i.e. genes that belong to several functional groups. Second, these FEG are not expressed throughout the entire process but only during a specific phase. Detecting these groups will only be possible by integrating information from metabolic pathway ontologies such as KEGG, EMP, CFG, etc.

Conclusion
The CGGA algorithm presented in this article makes it possible to automatically identify groups of significantly co-expressed and functionally enriched genes without any prior knowledge of the expected outcome. CGGA can be used as a fast and efficient tool that can exploit any source of biological annotation and different measures of gene expression variability.
In contrast to sequential approaches such as [7], [9], [10], [15] and [21], CGGA analyzes all the possible subsets of each FEG and does not depend on the availability of fixed lists of expressed genes. Thus, it can be used to increase the sensitivity of gene detection, especially when dealing with very noisy datasets. CGGA can even produce statistically significant results without any experimental replication. It does not require all genes in a significant and co-expressed group to change, and is therefore robust against imperfect class assignments, which can arise from public sources (wrong annotations in ontologies) or automated processes (naming errors, spelling mistakes, etc.).
The automated functional annotation provided by our algorithm reduces the complexity of microarray analysis results and enables the integration of different sources of genomic information such as ontologies.
CGGA can be used as a tool for the platform-independent validation of a microarray experiment and for its comparison with the large number of existing experimental and documentation databases. Experimental results show the interest of our approach, which makes it possible to identify relevant information on the analyzed biological processes. In order to identify heterogeneous groups of genes expressed only in certain phases of the process, we plan to integrate information from metabolic pathway ontologies in future work.

Table 1: CGGA analysis for the FEG of genes annotated "vacuolar protein catabolism"