Gene Expression Studies to Identify Significant Genes in AR, MTOR, MAPK Pathways and their Overlapping Regulatory Role in Prostate Cancer

Abstract Gene expression studies revealed a large degree of variability in gene expression patterns particularly in tissues even in genetically identical individuals. It helps to reveal the components majorly fluctuating during the disease condition. With the advent of gene expression studies many microarray studies have been conducted in prostate cancer, but the results have varied across different studies. To better understand the genetic and biological regulatory mechanisms of prostate cancer, we conducted a meta-analysis of three major pathways i.e. androgen receptor (AR), mechanistic target of rapamycin (mTOR) and Mitogen-Activated Protein Kinase (MAPK) on prostate cancer. Meta-analysis has been performed for the gene expression data for the human species that are exposed to prostate cancer. Twelve datasets comprising AR, mTOR, and MAPK pathways were taken for analysis, out of which thirteen potential biomarkers were identified through meta-analysis. These findings were compiled based upon the quantitative data analysis by using different tools. Also, various interconnections were found amongst the pathways in study. Our study suggests that the microarray analysis of the gene expression data and their pathway level connections allows detection of the potential predictors that can prove to be putative therapeutic targets with biological and functional significance in progression of prostate cancer.


The Prostate Cancer
Prostate Cancer (PC) is the second most widespread disease among men in the world and most often identified cancer in developing countries [1]. An exocrine gland, prostate discharges its secretions outside the male reproductive system. The liquidity of semen in prostate gland is maintained by the epithelial cells which produce a protein named PSA (prostate-specific antigen), which when high i.e. above 4 ng/ml is an indication of prostate cancer [2]. In most cases, the cancer starts in the gland cells that make the prostate fluid; such types of cancer are called adenocarcinoma. In fact, nearly 60% of all prostate cancers are diagnosed in men over the age of 65. As indicated by the rates of diagnosis, age is the biggest but not the only risk factor for prostate cancer. The major risk factors include age, family history, genetic factors, race, lifestyle, and dietary habits. The slow advancement of PC is witnessed through the changes in shape and size of the gland cells named as prostatic intraepithelial neoplasia (PIN). The possible complications include metastasis, incontinence and impact of changes of certain metabolic factors. Hormone therapy, that maintains low testosterone level in the body is the procedure to slow down the growth of already, spread PC, but it is often less effective. When the cancer grows with this therapy it is termed as castrate-resistant prostate cancer (CRPC).

Key Pathways in PC
The high variability in the disease progression and the paucity of the biomarker research provides incomplete understanding for the treatment of this disease. The challenge is to target prostate cancer biomarkers to ad- dress the unmet clinical needs in prostate cancer management [3]. There is urgent need to translate biologically relevant information and generate concordant results from different methods and procedures. For the identification of biomarkers, many studies have been performed, but to achieve clinical utility, these biomarkers should have a prognostic significance [4], [5]. Experiments using microarray technology have identified the expression of the genes associated with the different stages of PC in semen, tissues and glands [6]. Prostate cancer cells, like normal prostate cells, necessitate androgens to grow and survive. To develop successful future therapies, it is necessary to understand the critical events and complexities of AR signaling in the progression to CRPC. Significant evidence supports the fact that the occurrence and development of CRPC is casually related to continued transactivation of AR [7]. Many attempts have been made to search for gland-specific molecules those might help as potent biomarkers or as therapeutic agents due to the limitations in the standard treatment procedures of PC. Although, AR is an essential player that controls different elements in all phases of prostate carcinogenesis, but many other signaling pathways along with their interactions with AR signaling, are also critically implicated especially in advanced stages of prostate cancer. We have studied the expression profiles of the PC patients regarding their role in AR signaling pathways.
The already recounted increase in AR levels in 65% of CRPC cases was not alone overly significant; but, when combined with stabilization through c-Jun and phosphorylation from the MAPK cascade, the small increase becomes highly significant with increased activity [8]. This explains the correlation that exists between MAPK levels in the nucleus and levels of AR found in tumours.

Related Works
Studies probing MAPK activity in PC material suggested MAPK activity relates to development of a progressively complex and hormone independent PC. Preclinical studies have suggested a direct connection between the AKT which is a target for many anti-cancer agents [9] and AR signaling cascade, showing a dynamic interplay between these pathways during the development of ADT resistance and the development of androgen insensitivity. AKT signaling pathway has revealed a very important node that directs ADT resistance and stimulates tumour growth by adjusting the castrate levels of testosterone [10]. In addition, this pathway is altered at the genomic as well as the transcriptional level in almost all advanced PCs. Recently, a mathematical model has been developed to examine the steady-state and dynamic characteristics of the major feedback loops that synchronize the cross-talk between insulin-AKT and MAPK/ERK signaling pathways [11].
Co-targeting strategies strive to improve cancer outcomes by combining therapies that inhibit driver genes in AR, MAPK, and mTOR pathways and are of great interest, since they are among the most frequently altered [10]. In our work we studied the interactions between three pathways AR, MAPK and mTOR ( Figure 1) to conduct a cross-pathway examination for meta-analysis of PC related microarray data. Figure 1 depicts the cross regulatory effect of three pathways that shows cell proliferation (in terms of cancer progression) either in ERK or non-ERK dependent manner. These pathways represent close association thus showing interdependent pathway regulation as PI3K, AKT has regulatory effect on mTOR and likewise AKT, AR, and PI3K shows cross regulating effect. We approached the problem of computational data analysis using tools to integrate genespecific expression changes in PC related samples and obtained several genes which can be targeted for detailed analysis and experimental procedures as therapeutic agents.

Data Collection
The gene expression data have been retrieved from the National Centre for Biotechnology Information (NCBI); Gene Expression Omnibus (GEO) [12] and ArrayExpress [13]. Summarized methodology for the analysis is shown in Figure 2. The information regarding the data sets that we incorporated in our microarray data analysis for all the three pathways is shown in Supplementary Table 1. We collected the gene expression data using the GEO series ID's obtained from PubMed using the keyword search "Prostate Cancer" AND "AR Pathway" AND "Homo Sapiens" which resulted in 27 entries. The number of series datasets selected for AR pathway studies in prostate cancer is five, respectively (Supplementary Table 1).

Data Preprocessing
The processed gene expression data corresponding to these ID's was downloaded from ArrayExpress to identify the differentially expressed genes within DNA repair pathways.

Comparative Analysis
Additionally, the same data was downloaded from GEO database in CEL format for further analysis. The Ar-rayExpress data was processed using WebMeV (Multiple Experiment Viewer) which is a free and open-source cloud service platform that supports analysis, visualization, and stratification of large genomic data, particularly for RNASeq and microarray data [14]. GeneSD (standard deviation) was done in the range 0-0.998 showing different colours in the heat map for top 20 probes/genes. The genes are classified based on the standard deviation of their expression values for all samples.
Similarly, GeneMAD (median of the absolute deviation) was performed; it is better at removing random clusters of multiple outliers of the expression values in the range 0-0.994, depicted by different colours in the heatmap. Principal component analysis (PCA) which is used for clustering large number of genes in complex biological networks was carried out on the same data in which each dot represents a PC sample plotted against its expression levels for the probes/genes. Different clustering method like k-means clustering was implicated using Euclidean distance measure methods. GEO2R was used for the expression analysis to compare two or more groups of samples, to distinguish genes that are expressed differentially throughout experimental states. Based on literature information, the samples are grouped into test and control for diseased and non-diseased samples, respectively. Both GEO2R and ArrayExpress uses same method i.e. Benjamini-Hochberg method. The method is selected by default because it is the most commonly used adjustment for microarray data and provides a good balance between discovery of statistically significant genes and limitation of false positives.
The BH threshold is defined for pre-specified 0 < α < 1 as: The GEO2R results provide information regarding probe ID, its t-value which is a measure of the difference in means taking into consideration of data spread and sample sizes, and smaller p-value than a predefined significance level (0.01), defines the gene as differentially expressed. A p-value (or probability value) is very helpful in determining the confidence in the result decision that is taken under the null hypothesis. If the p-value is small; it indicates that it is very unlikely that the data was generated according to the null hypothesis. The p-value can be calculated either by the t-test or normal distribution parametrically or by a permutation method. Since the gene expression data are generally not normally distributed, the permutation test method is recommended to compute the p-values. The p-value of a permutation test is the cumulative sum of the probability of the observed outcome and the probability of all more extreme outcomes. Where the test statistic is denoted by S i and the observed score for gene i by * , the p-value corresponding to the observed score * is.
For each gene, a test statistic and its corresponding p-value are calculated to determine the extent of differential expression. The B value i.e. the log-odd function for each gene which is differentially expressed between the two groups. After, then the fold change on a logarithmic scale (logFC) value is shown along with the gene symbol and its title. The F-statistic calculates overall test of significance for that gene by combining the t-statistics for all pairwise comparisons. Generally F-statistic is given by comparison of the two variances, s1 and s2, and dividing them. It tests the equality of the class means for a fixed gene as:

The Significant Genes
From the results top 20 and bottom 20 genes i.e. the over-expressed and under-expressed genes, respectively were identified by sorting the p-value column in descending order and discarding the value except <0.01 and <0.05 (as defined earlier as the significance level of 99% and 95%, respectively). Comparative analysis was performed for the pre-processing of data followed by statistical analysis and mapping of over expressed and under expressed genes for the involved pathways using WB-DEGS (Within and Between Group Comparisons for Differentially Expressed Gene Selection), a R based platform which performs pre-processing, visualization, and genes selection with an accuracy to minimize the false positive rates using some classical methods of gene selection [15].

Gene Mapping and Interactions
The corresponding gene mapping were performed using these probe ID's, first in NetAffx™ [16] which enumerates the probe sequences and the consensus sequence interrupted by the probe Ids also DAVID (based on another agglomeration approach, heuristic fuzzy multiple-linkage partitioning) [17] and GeneAnnot [18] (which links between Affymetrix arrays and the rich human gene annotations) were utilized for the purpose. Then, input was provided in STRING database for each pathway and the network was constructed [19]. The same analysis was obtained from GeneMANIA [20] which helps to generate genetic interactions and predict the function of the genes.

Gene Annotations
Finally the gene annotations are performed using Panther tool that follows statistical enrichment using Mann-Whitney test to determine whether any ontology class or pathway has numeric values that are nonrandomly distributed with respect to the entire list of values and is given as: It checks that the probability of the functional category distribution was drawn randomly from the reference distribution.

The Datasets
Analysis was performed for all series mentioned in Supplementary Table 1, as discussed above similar type of analysis has been performed for all the data; however here we are representing the detailed study for series GSE2443. The GEO2R analysis for GSE2443 gave a great number of genes but after applying the filters such as p-value cut-off we were left with only a few significant entities. The overlapping genes found may have some significance towards PC after analysing them through three different methods (i.e. GeneMAD, GeneSD, GEO2R). Through the analysis only two genes were identified to be the most common entities in all methods i.e. CEP57 (HGNC: 30794) and PDLIM5 (HGNC: 17468) (Figure 3).

Identification of Set of Common Genes
The genes which were found common between the microarray data analysis using GEO2R, WB-DEGS and TM4 (WebMeV) were tabulated and grouped differently. Then, input was provided in STRING database for each pathway and the network was constructed [19]. The interaction value between the two genes was considered greater than or equal to 0.9 which is considered significant and others are discarded. Then, using conditional formatting in excel, the genes were categorized into AR, MAPK and mTOR. The mapping of probeIDs to their respective genes was done using the supplementary file, containing the unabridged list of genes. The SAM analysis results at the delta value 1.5 gave 15 probes out of which 14 were mapped to the genes in NetAffx™ ( Figure 6). The Simple Statistical Test (paired t-test) gave 11 probes out of which 10 were successfully mapped to their respective genes (Figure 7). There was no significance in within group comparisons. The twilight statistical analysis at FC 1.5 gave 29 probes out of which 28 probes were mapped to genes ( Figure 8). There was no significance in within group comparisons. The linear model analysis gave a significant number of genes only in comparison between groups ( Figure  9) but, the expressed genes showed the p-values greater than 0.01, which were not significant and thus were discarded.  Figures 1 and 2). Twenty seven were found in GSE21887 ( Supplementary Figures 3 and 4). Again, in GSE33316, 13 genes were found based on the results (Supplementary Figures 5-8). In GSE67537 there were no common genes at any level. So, our analysis for Androgen receptor pathway (AR) ended up with 50 genes in total and only one gene in common across all the five series datasets. Now, for mitogen activated protein kinases pathway (MAPK), all four series datasets have given some common significant genes. In GSE20906 we found 11 genomic entities based on following results (Supplementary Figures 9-11). In GSE23038 we found 53 genes based on the following results (Supplementary Figures  12 and 13). In GSE29438 we found 4 significant genes based on the following results ( Supplementary Figures 14  and 15). In GSE39735 156 significant genes were found based on the following results (Supplementary Figures  16-20).
So, our analysis for mitogen activated protein kinases pathway (MAPK) ended up with 224 genes in total with 9 genes in common across all the four series datasets. Now, for mechanistic target of rapamycin (m-TOR), all three series datasets have given a few more common significant genes. In GSE26332, 52 genes were found based on the following results ( Supplementary Figures 21-23). In GSE49232 7 genes were found based on the following results ( Supplementary Figures 24-26). In GSE32875 there were no genes in common between all the methods. So, our analysis for mechanistic target of rapamycin (m-TOR) ended up with 59 genes in total with 3 genes in common across all the three series datasets. Now, the genes obtained through all these three analysis were identified and were processed for functional annotation so that we can obtain few relevant genes. A total of 349 genes were identified in all pathways with 229 genes in MAPK, 63 genes in mTOR and 51 genes in AR pathway. After then the genes were categorized into the three pathways to visualize the interactions between them (Supplementary Table 2). The input of STRING was provided as these genes and the networks created for each pathway and multiple-pathway (in the pair of two). This determined the interactions between the genes belonging to same or different pathway.

The Gene Interactions and Annotations
From the network results obtained from STRING database, only, significant interactions having combined score greater than or equal to 0.9 was taken into consideration. The intra-pathway interactions include: AR-AR, mTOR-mTOR, MAPK-MAPK (Supplementary Table 2) and inter-pathway interactions include: AR-mTOR, mTOR-MAPK, AR-MAPK (Supplementary Table 3). The same analysis was obtained from GeneMANIA [20] which helps to generate genetic interactions and predict the function of the genes. Same type of input was provided with genes belonging to different pathways and the results were generated in the same format but with the weight value between the two genes. The weight represented the strength of interactions between them and this was processed by discarding the weights greater than 0.1 which is a significant interaction. The common genes inter-connections obtained from STRING and GeneMANIA are the putative genes which are suspected to play a major role in mediating the progression of the prostate cancer between the AR, mTOR and MAPK pathways (Table 1 and Table 2).  RPS20  RPS6  SEC11A  SPCS1  VAMP8  STX2  RPS8  RPS6  BNIP3  HIF1A  IL6  IL6R  RPL29 RPL23 SMN1 SMN2 Table 2: Common inter-pathway connections between STRING and GeneMANIA for all three pathways.

AR-mTOR mTOR-MAPK
The final genes that were found interconnected in three pathways are: EDN1(mTOR), EIF5(MAPK), RPL23(AR), RPL29(AR), RPS6 (AR), RPS8(AR), RPS20(AR), PRKACB(mTOR), SERBP1(mTOR), ND-UFA12(mTOR), CDK1(mTOR), EIF5B(MAPK), and UCHL5(MAPK). Thus, the analysis finally leads to the 13 common genes. These genes are then further analysed based on their functional annotation to retrieve their importance in the regulation of prostate cancer in the AR, mTOR and MAPK pathway. Also, hubs were identified in the networks which are the genes which are connected to multiple genes; thereby influence the functions of each other (Table 3 and Table 4).