Four long noncoding RNAs act as biomarkers in lung adenocarcinoma

Abstract Introduction Lung adenocarcinoma (LUAD) is currently one of the most common malignant tumors worldwide. However, there is a lack of long noncoding RNA (lncRNA)-based effective markers for predicting the prognosis of LUAD patients. We identified four lncRNAs that can effectively predict the prognosis of LUAD patients. Methods We used data gene expression profile for 446 patients from The Cancer Genome Atlas database. The patients were randomly divided into a training set and a test set. Significant lncRNAs were identified by univariate regression. Then, multivariate regression was used to identify lncRNAs significantly associated with the survival rate. We constructed four-lncRNA risk formulas for LUAD patients and divided patients into high-risk and low-risk groups. Identified lncRNAs subsequently verified in the test set, and the clinical independence of the lncRNA model was evaluated by stratified analysis. Then mutated genes were identified in the high-risk and low-risk groups. Enrichment analysis was used to determine the relationships between lncRNAs and co-expressed genes. Finally, the accuracy of the model was verified using external database. Results A four-lncRNA signature (AC018629.1, AC122134.1, AC119424.1, and AL138789.1) has been verified in the training and test sets to be significantly associated with the overall survival of LUAD patients. Conclusions The present study demonstrated that identified four-lncRNA signature can be used as an independent prognostic biomarker for the prediction of survival of LUAD patients.


Introduction
Lung cancer is one of the most common cancers worldwide [1]. Adenocarcinoma accounts for 40% of all lung cancers [2]. Worsening of environmental pollution is associated with an increase in the number of patients with the disease. Many new methods are currently used to diagnose and treat this disease [3]. Epidermal growth factor receptor (EGFR) and anaplastic lymphoma kinase are two key oncogenes that trigger lung adenocarcinoma (LUAD) [4][5][6]. Despite great achievements, overall survival (OS) time for LUAD is less than 5 years because of aggressive properties of the tumor and a lack of effective early diagnostic and prognostic biomarkers [7]. Currently, there are only a few prognostic markers for LUAD, and thus identification of valuable prognostic indicators for clinical treatment is very important.
Long noncoding RNAs (lncRNAs) are RNAs that are longer than 200 bp [8]. lncRNAs have been recently shown to regulate large-scale gene expression programs; lncRNAs influence gene expression by interactions across multiple chromosomes and at various locations on a given chromosome. The mechanisms by which some lncRNAs target a distal binding site are not completely clear, and hence, spatial organization analysis of the genome may help to determine the function of lncRNAs. Recent studies suggest that lncRNAs may have extensive effects on the regulation of genes related to tumor formation and metastasis than previously thought [9][10][11]. Moreover, many lncRNAs can be used as predictors of prognosis and survival for tumor patients, providing a reliable basis for cancer diagnosis and treatment. LOXL1-AS1 can be used as a biomarker to assess the prognosis of glioma [12]. AK001796 is an independent predictor of poor prognosis in esophageal squamous cell carcinoma [13]. Prognostic lncRNA signatures have been detected in many cancer types [14][15][16][17]; however, the studies of lncRNAs with the prognosis of LUAD patients are in the early stages and this subject requires long-term efforts.
High-throughput sequencing data and basic clinical information used in the present study were provided by The Cancer Genome Atlas (TCGA). We used high-throughput data and clinical information of 466 LUAD patients in the TCGA database to identify potential lncRNA markers that can effectively predict the survival of LUAD patients. Eventually, we identified four lncRNAs, which were shown to be independent of clinical factors according to stratified analysis, and these four RNAs have not been reported previously. The findings of the present study demonstrated the important role of these four lncRNAs in predicting the survival of LUAD patients.

Basic information of the patients
The lncRNA sequencing profile and LUAD clinical information were obtained from the TCGA data set. This study included 466 LUAD patients, excluding the data that lacked complete survival information. Associated clinical information included OS, age, sex, AJCC tumor stage, etc. The patients were randomly divided into two groups, including 233 patients as a training set and 233 patients as a test set. Detailed information on the sets used in the present study is provided in Table 1.

lncRNA sequencing profile source
LUAD RNA sequencing data were downloaded from the TCGA database (https://tcga-data.nci.nih.gov/tcga/). The data were compared with the human genome, and the reads per thousand bases per million exon models (RPKM) determine the expression levels of lncRNAs and mRNAs. We selected lncRNAs in the TCGA database according to the following three criteria: (1) the transcript was not found in any protein coding regions; (2) the transcribed sequence was found in the GENCODE project [18]; and (3) it was expressed in at least more than half of the LUAD patients. The transcripts with an average RPKM > 0.1 were considered a part of the lncRNA profile of 466 LUAD samples. Finally, 3,881 lncRNAs were identified in the data set. Then, the "edgeR" package for R [19] was used for differential expression analysis based on adjusted P < 0.001 and |log 2 (F c ) | ≥ 2 threshold to determine differentially expressed RNAs.
Externally validated sequencing data, clinical information, and platform annotation information were obtained from the GSE11969 data set of the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih. gov/geo). A total of 148 patients were included in this part of the study.

Establishment of the risk formula
Univariate Cox regression was used to calculate the associations between the expression level of each lncRNA and the survival of patients in the training set. lncRNAs with P values less than 0.001 were considered to be significant. Then, random forest and multivariate Cox regression models were used to analyze the lncRNAs that had the best independence from the prognosis of the patients. The regression coefficient obtained using multivariate Cox regression model was multiplied by the linear combination of the expression levels of corresponding lncRNAs define the prognostic risk formula. Risk formula = ΣN (i = 1)(Expi × Coei), where N represents the total number of prognostic lncRNAs, Expi the expression of a certain lncRNA and Coei the regression coefficient obtained from the multivariate Cox regression analysis for a certain lncRNA numbered as i.

Statistical analysis
Kaplan-Meier survival analysis was used to determine the differences in the OS and progression-free survival (PFS) between the high-risk and low-risk groups. Multivariate Cox regression and stratified analysis were used to determine whether the prognosis prediction based on identified lncRNA is independent of clinical variables. Receiver operating characteristic (ROC) curves were used to compare the sensitivity and specificity of survival prediction based on the risk score. The R package "maftools" was used to analyze mutated genes in LUAD. All analyses were performed using R software (version 3.6.3).

Differentially expressed lncRNAs
A total of 381 differentially expressed lncRNAs were identified according to the criteria of |log 2 (F c ) | ≥ 2 and P adj < 0.001. These lncRNAs included 291 upregulated and 90 downregulated lncRNAs ( Figure 1). Subsequent prognostic analysis used lncRNAs with expression levels significantly different between the LUAD and control groups.

Identification of four lncRNAs that predicted prognosis in the training set
A total of 466 LUAD patients were randomly divided into a training set (n = 223) and a test set (n = 223). Univariate Cox regression analysis based on lncRNAs in the training set was used to identify a total of 43 genes (P < 0.05) for subsequent analysis. Multivariate Cox regression analysis based on age, sex, and AJCC stage identified four lncRNAs as independent prognostic biomarkers for LUAD patients. These four lncRNAs included AC018629.1, AC122134.1, AC119424.1, and AL138789.1. Three of these lncRNAs (AC018629.1, AC122134.1, and AL138789.1) were positively associated with survival, indicating that higher expression levels of these lncRNA are associated with shorter survival of the patients. The negative coefficient estimated for the remaining lncRNA (AC119424.1) suggested that high expression level was associated with longer survival. Detailed information about these four lncRNAs is presented in Table 2.

Evaluation of the prognostic model in the training set
The expression levels of these four lncRNAs and their prognostic values obtained by multivariate Cox regression analysis were used to design a risk score formula to predict the survival of patients with LUAD. The risk score formula was as follows: risk score = (0.22 × the expression level of AC018629.1) + (0.11 × the expression level of AC122134.1) + (−0.22 × the expression level of AC119424.1) + (0.14 × the expression level of AL138789.1). The risk score was calculated for each patient in the training set based on the lncRNA signature, then the median risk score was used as a threshold to divide LUAD patients into high-risk (n = 117) and low-risk groups (n = 116). Additionally, Kaplan-Meier analysis was used to assess the effects of these prognostic indicators on the survival of patients in the training group. The Kaplan-Meier analysis showed that patients in the high-risk group had worse outcome compared with that in the low-risk group (P = 6.91 × 10 −5 ; Figure 2a). The 3-year OS rate after diagnosis in the low-risk group was 87%, and the 5-year OS rate after diagnosis in the low-risk group was 77%; the 3-year and 5-year survival rates in the high-risk group were only 71 and 67%, respectively. Time-dependent ROC curves were used to evaluate the prognostic significance of four lncRNAs. The area under the curve (AUC) of the four-lncRNA signatures was 0.766, indicating good prognostic value for survival prediction of the patients (Figure 2b). Univariate and multivariable Cox regression analyses showed that risk score based on four lncRNAs was significantly associated with patient survival ( Table 3).
The basic information about the risk score, survival rate, and expression profiles of four lncRNAs in the samples of the training data set are shown in Figure 2c. The distributions of the risk score, survival rates, and expression profiles of four lncRNAs in the samples of the training set were ranked according to the risk scores. Patients with a high risk score had shorter survival compared with that of patients with a low risk score. In patients with high risk scores, the expression of three lncRNAs (AC018629.1, AC122134.1, and AL138789.1) was significantly upregulated, and the expression of one lncRNA (AC119424.1) was downregulated.

Prognostic value of four-lncRNA signatures and analysis of PFS
The test set and the entire set were used to validate the prognostic model based on four lncRNAs. The risk score for each patient in the test set was calculated according to risk score formula. LUAD patients in the test set were divided into high-risk (n = 111) and low-risk groups (n = 122) using the threshold similar to that for the division of the training set. Kaplan-Meier analysis was performed in the test set. The results indicated that the survival time of patients in the high-risk group was significantly shorter than that in the low-risk group (P = 0.0436). The AUC obtained by ROC curve analysis in the test set was 0.778 (Figure 3a). We analyzed the entire data set in a similar manner. A high risk score based on the four-lncRNA signature was associated with poor OS of LUAD patients (P = 2.65 × 10 −5 ). The AUC obtained by timedependent ROC curve analysis was 0.769 (Figure 3b). Then we analyzed the PFS rate in the training set, test set, and the entire set ( Figure 4a). The data indicated that the PFS of patients in the high-risk group was significantly lower than that in the low-risk group, and the areas under the ROC curves ( Figure 4b) were 0.664, 0.652, and 0.657 in the training set, test set, and the entire set, respectively. The results showed that the predictive value of the signature based on four lncRNAs had great potential for predicting prognosis.

Independence of the four-lncRNA-based biomarkers and clinical factors
The predictive ability of the four-lncRNA signature in combination with clinical factors was used to determine whether the signature is independent from other factors associated with survival. These factors included lncRNA risk score, sex, age, and AJCC tumor stages ( Table 3). The data showed that the four-lncRNA risk score remained closely associated with the survival rate. Then, entire data set was assessed by stratified analysis. The patients were divided based on the threshold age of 60 years into the younger (n = 147) and older groups (n = 319) according to a threshold age of 60. LUAD patients in each group were divided according to the risk score into the high-risk and low-risk subgroups. The results suggested that the four-lncRNA signature can predict the prognosis of patients regardless of age (Figure 5a-c). We also analyzed the AJCC staging of the tumors. Stages I and II were classified as early stage (n = 364), and stages III and IV were classified as late stage (n = 102). The results suggest that the four-lncRNA risk score can adequately predict the survival of patients with LUAD (Figure 5d-f).

Analyses of somatic mutation and function prediction of four prognostic lncRNAs
Tumor-specific mutations are one of the important factors affecting the prognosis of the patients, and thus we analyzed the gene mutations in the high-risk group and low-risk groups by using a waterfall diagram ( Figure 6). TP63 and TTN were the genes with a high mutation frequency. EGFR accounted for 12 and 14% of mutations in the high-risk and low-risk groups, respectively, and ALK accounted for 5 and 4% of the mutations. The functions of four lncRNAs were predicted using coexpression network to investigate the mechanism of action of these lncRNAs in LUAD in detail. Spearman coefficients were calculated based on the expression of lncRNAs and protein-coding genes, and genes with a correlation coefficients >0.4 were selected. A total of 232 genes were used for pathway enrichment analysis. The web-based tool Metascape (http://metascape.org/) was used to analyze the biological function of these coexpressed genes. Most of these genes were enriched in 20 pathways (Figure 7), including interleukin-7-mediated signaling, nucleosome positioning, and regulation of glycogen biosynthetic process. These results suggested that four prognostic lncRNAs may be involved in glycogen metabolism and tumor immunity.

Validation of external databases
The data of the GEO database were used to validate the four-lncRNA signature. Patients with LUAD in the GSE11969 data set were analyzed. Patients in the highrisk group had lower survival rate than those in the lowrisk group, and the AUC was 0.705 ( Figure 8). Therefore, four-lncRNA signature had predictive significance for the prognosis of patients with LUAD.

Discussion
LUAD has become one of the common killers of cancer patients, and its incidence has been recently increasing [20]. The main causes of lung cancer are smoking, air pollution, and gene variation, which have important effects on the development and clinical outcomes of LUAD. Effective therapies have been currently developed to target EGFR, KRAS, and MET; however, the prognosis of patients with advanced lung cancer remains unsatisfactory [21]. The lack of effective and reliable prognostic biomarkers or models is the major problem in enhancing the clinical prognosis of LUAD patients. Recent studies demonstrated that lncRNAs play an important role in the development of tumors. Relevant studies have detected abnormal lncRNA expression in many cancer types, which is closely related to cancer progression [22][23][24]. Therefore, specific lncRNAs can be used as independent diagnostic and prognostic markers in cancer. lncRNAs were rarely reported to predict the prognosis of patients with LUAD, and hence, the present study aimed to provide a more effective prognostic marker for LUAD. In this study, we analyzed the expression of lncRNAs in LUAD patients and identified four genes as prognostic markers. Subsequent analysis demonstrated that the four-lncRNA signature was an independent predictor of prognosis in patients. The performance of four-lncRNA signatures was evaluated by ROC analysis of the training, test, and entire sets, indicating a good prognostic value of the signature. Additionally, stratified analysis showed that the 4-lncRNA signature was more accurate than other factors in predicting the prognosis of LUAD patients. These four lncRNAs have not been reported previously. Therefore, the enrichment analysis of these lncRNAs was used to  understand the details of the regulatory mechanism of lncRNAs.
A large number of references demonstrated that microRNA can be used as diagnostic and prognostic markers [25,26]. For example, abnormally high expression of miR-21-5p is helpful for the diagnosis of nonsmall cell lung cancer [27]. Technological development resulted in the low costs of microchips and sequencing, and thus, aberrant lncRNA expression in tumors has received increasing attention and has been confirmed to influence the occurrence, development, and metastasis of the tumors. Moreover, lncRNAs have been reported to be more specific biomarkers than microRNAs in some cancer types [21,28,29]. Thus, lncRNAs have shown great potential as new molecular diagnostic biomarkers. For example, upregulation of LINC00963 can inhibit ubiquitination of PGK1 to activate the AKT/mTOR pathway, thus promoting the metastasis of lung cancer [30], and DLX6-AS1 promotes tumor proliferation by regulating the JAK/STAT pathway [31]. The present study demonstrated that multiple lncRNAs predicted the prognosis of LUAD. Kaplan-Meier analysis showed that the risk score model has good prognostic ability. Analysis of the test set and entire set demonstrated the reliability of these biomarkers.
Analysis of the gene mutations in patients with LUAD identified common mutated genes and the genes with the highest mutation frequency. EGFR and ALK are not the most mutated; however, patients with mutations in the EGFR and ALK genes have poor prognosis [32]. These mutations play a very crucial role in tumor surgery, chemotherapy, and precision treatments [33][34][35]. The functions of four lncRNAs identified in the present study are not completely clear; hence, we used enrichment analysis to search for possible biological functions of these lncRNAs. Most of the enriched functions were concentrated in the processes related to nucleosomes, regulation of glycogen synthesis, and interleukin-7 (IL-7)-mediated signaling.        Farman et al. demonstrated that the nucleosome position around the transcriptional start site of a tumor suppressor gene in breast cancer is different from that in normal individuals [36]. High levels of glycogen have been reported in most types of cancer cells. The genes involved in glycogen synthesis enable cancer cells to use glucose for anaerobic glycolysis to provide glycogen required during brief energy deficits [37][38][39][40]. IL-7 regulates tumor proliferation, apoptosis, and tumor lymphangiogenesis [41][42][43][44][45]. Therefore, we hypothesized that these four lncRNAs may play an important role in tumor immune regulation and glycogen metabolism in the tumor microenvironment of LUAD.
The present study has several limitations. First, the lncRNA sequencing data were from the TCGA database, and these samples were mainly from African Americans and Caucasians. Therefore, additional studies are needed to determine whether the lncRNA signature is effective in Asian populations. Second, we analyzed and verified four lncRNAs based only on prognostic ability, and hence, the diagnostic and therapeutic capabilities of the signature require more additional investigations. Third, lncRNAs identified in the present study have not been reported in the literature, and further experiments are needed to analyze their functions and mechanisms.
On the whole, based on a large volume of expression profile data, a four-lncRNA signature was shown to have a good prognostic value in LUAD patients. Our results indicate that these four lncRNAs may play an important role in tumor immune regulation and in tumor microenvironment and constitute a reliable biomarker for the prediction of prognosis in LUAD patients.
Funding information: This research was funded by the National Natural Science Foundation (81671905).

Conflict of interest:
The authors declare there are no competing interests.
Data availability statement: The data sets generated and analyzed in the present study are available in the TCGA data set LUAD project (https://tcga-data.nci.nih.gov/tcga/) and as a GEO data set.