RNA sequencing is a powerful tool to identify distinct types of previously un-annotated transcripts, including coding and non-coding RNAs, and new splice isoforms of known genes, in addition to measurement of gene and/or transcript abundances precisely , , . De novo transcriptome reconstruction is now becoming routine and substantially improving the transcriptome annotation of animal and plant species , . Moreover, the identification of such novel genetic sequences helps us to better understand complex gene regulatory networks that underlie biological processes, such as development and disease , , . Therefore, discovery of novel transcripts has recently begun to receive growing attention by the scientific community, and several computational methodologies have been developed so far to annotate types of transcripts , , , , , , .
MicroRNA (miRNA) as a class of non-coding regulatory RNAs is involved in the regulation of gene expression in eukaryotes . It is known that the expression of ∼60 % of the all coding transcripts in mammals are under the control of miRNA-based regulation . To date a total of 28,265 miRNAs have been deposited in a public database called miRBase (http://www.mirbase.org) . Post-transcriptional suppression activity of miRNAs can also be controlled via endogenous target mimicry (eTM) , . ETM molecules, also called miRNA sponges or competing endogenous RNAs (ceRNAs), a type of long non-coding RNA (lncRNA), specifically block the targeting activity of miRNAs to decoy the miRNA-based target transcript suppression . In target mimicry, eTM transcript binds to target miRNA with perfect base pairing from the 2nd to 8th nucleotides, and with a three-nucleotide bulge from nucleotides 9–11 at the 5′ end of target miRNA sequence. Thus, the eTM-mRNA pairing inhibits miRNA activity and prevents the binding of the miRNA to its target mRNAs, which results in increased levels of target mRNA expression.
A number of plant eTM sequences have been computationally predicted so far , , , , , , and some of them have been functionally characterized , , , . In a recent report by , several eTM transcripts were computationally identified in Glycine max (soybean) through RNA sequencing, and it was shown that nine miRNAs involved in lipid metabolism were potentially regulated by previously un-annotated eTMs. Besides, the expression level of miRX27 involved in nicotine biosynthesis was sponged by up-regulation of an eTM (nta)-eTMX27 in Nicotiana tabacum (tobacco) . Recently, PeTMbase  and PceRBase  online tools have been established to facilitate the study of putative eTM-miRNA interactions in plants. PeTMbase permits searching and browsing of computationally identified ∼2700 eTM transcripts in 11 plant species. The database annotates eTM transcripts through previously described binding rules by , and it allows retrieving eTM sequences and their genomic features by given target miRNA and species name. PceRBase is a comprehensive catalogue of target mimicry sequences that comprises more than 160,000 target-mimic pairs (as of March 2017) determined through Tapir tool  from 26 plant species. One of the unique features of PceRBase is to provide its users with detailed information regarding putative eTM-miRNA pairs, including their expression levels and associated Gene Ontology (GO) terms. MiRSponge  is another source of target mimicry molecules experimentally validated in diverse organisms. However, it has a very limited number of eTM data for only Arabidopsis thaliana.
Given the regulatory roles of eTM sequences in a wide range of biological processes in plants, an increasing need for discovery and annotation of previously un-annotated eTMs via computational techniques arose. However, bioinformatics strategies that take advantage of the full potential of public transcriptome data sets for discovering eTM transcripts are still lacking. Herein, I aimed to demonstrate a computational analysis pipeline for the prediction of putative eTM sequences from published public plant transcriptome libraries. Using two transcriptome libraries as an example case, I found previously un-annotated Prunus persica (peach) eTMs and proposed an eTM-miRNA-mRNA interaction network module for peach fruit organ development. The study predicted 44 novel eTM sequences from the libraries and a previously published set of lncRNAs, and I further found eTM related pathways by incorporating miRNA target information and annotation from previous studies. I also provided a step-by-step guideline and sample codes for predicting previously un-annotated eTM sequences from the raw transcriptome libraries of plant species.
2 Materials and Workflow
The eTM discovery pipeline was run on Linux distribution, CentOS (RedHat) 6.x. (https://www.centos.org/), with 16 × 2.6 GHz (Intel E5-2650v2) CPUs and 64 GB memory.
2.2 Data Analysis
In each data analysis step, I first provide necessary command to properly run individual tools, then specify my sample code(s) starting with $ symbol, which denotes a Linux shell prompt. Once the required tools are installed, users can simply write and execute the sample commands in Linux terminal screen to reproduce the same outputs obtained in the proposed analysis pipeline (Figure 1). All transcriptome analysis tools used in the pipeline are publicly available and they can be freely downloaded from their web sites. Users can also modify the arguments represented with < … > symbols to customize the analysis steps.
i) In the first step, I initially need Sequence Read Archive (SRA) Tool Kit (v2.6.3) to download peach transcriptome libraries from SRA database (http://www.ncbi.nlm.nih.gov/sra) , which is a public repository of unprocessed next generation sequencing data sets. SRA Tool Kit can be downloaded from https://github.com/ncbi/sra-tools and its user manual can be viewed at http://www.ncbi.nlm.nih.gov/books/NBK158900. Herein, I utilize two transcriptome libraries, as an example case, previously generated by , (SRA accession IDs: SRR1300787 and SRR1300788) to identify eTM sequences, and raw sequencing files of the libraries in FASTQ format can be obtained from SRA using the following commands:
SRA Tool Kit command:
fastq-dump --gzip --skip-technical --readids --dumpbase --clip <SRA accession ID>
$ fastq-dump --gzip --skip-technical --readids --dumpbase --clip SRR1300787
$ fastq-dump --gzip --skip-technical --readids --dumpbase --clip SRR1300788
ii) Here, I align short reads in the FASTQ files to peach reference genome with splice-aware mapper, STAR (v2.5.1b) . The source code and user manual of STAR aligner are freely available at https://github.com/alexdobin/STAR. I need to obtain peach DNA sequence as a zipped (.gz) multi-line fasta format (.fa) from the Ensembl ftp site. To perform download task, I first utilize the wget command line utility of Linux environment. Then, I make use of gunzip command to uncompress the downloaded file.
$ wget ftp://ftp.ensemblgenomes.org/pub/release-31/plants/fasta/prunus_persica/dna/Prunus_persica.Prupe1_0.31.dna.genome.fa.gz
$ gunzip Prunus_persica.Prupe1_0.31.dna.genome.fa.gz
iii) Build a STAR index (.index) that will be utilized during short read alignment to the reference genome. Herein, I use the multi-line DNA fasta file (downloaded in step ii) as an input for the STAR genomeGenerate command to generate STAR indexes.
STAR genomeGenerate command:
STAR --runMode genomeGenerate --genomeDir <output index folder name> --genomeFastaFiles <Assembly name .fa>
$ STAR --runMode genomeGenerate --genomeDir./ppersica.star.index/ --genomeFastaFiles Prunus_persica.Prupe1_0.31.dna.genome.fa
iv) Download and extract the gene transfer file (.GTF), including exon structures of known transcripts annotated in the Ensembl plant database using wget and gunzip commands, respectively (as in step ii). GTF file can be downloaded using the following command:
$ wget ftp://ftp.ensemblgenomes.org/pub/release-31/plants/gtf/prunus_persica/Prunus_persica.Prupe1_0.31.gtf.gz
$ gunzip Prunus_persica.Prupe1_0.31.gtf.gz
v) Align short reads to the reference genome using the command below. It will generate a Binary Sequence Alignment/Map output file (.BAM), which contains read alignment data. To run STAR aligner, you will utilize the FASTQ files (downloaded in step i), STAR index (generated in step iii), and GTF file (downloaded in step v).
STAR align command:
STAR --genomeDir <STAR index folder name> --sjdbGTFfile <Assembly name .gtf> --readFilesIn <Sample name_1.fastq> <Sample name_2.fastq> --outFileNamePrefix <output file prefix name>
$ STAR --genomeDir ./ppersica.star.index/ --genomeLoad NoSharedMemory -- --sjdbGTFfile Prunus_persica.Prupe1_0.31.gtf --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode Basic --readFilesIn SRR1300787_1.fastq.gz SRR1300787_2.fastq.gz --outFileNamePrefix SRR1300787. --outFilterIntronMotifs RemoveNoncanonical
$ STAR --genomeDir ./ppersica.star.index/ --genomeLoad NoSharedMemory -- --sjdbGTFfile Prunus_persica.Prupe1_0.31.gtf --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode Basic --readFilesIn SRR1300788_1.fastq.gz SRR1300788_2.fastq.gz --outFileNamePrefix SRR1300788. --outFilterIntronMotifs RemoveNoncanonical
vi) The previous step generates two bam files with Aligned.sortedByCoord.out.bam extension in the STAR output folders. I utilize these files to perform genome-guided assembly with the popular transcriptome analysis suite, Cufflinks (v2.2.1) , which is available at http://cole-trapnell-lab.github.io/cufflinks/. Cufflinks will output transcript information in a file called “transcripts.gtf” for each bam file.
cufflinks -g <Assembly name .gtf> <accepted_hits.bam>
$ cufflinks -g Prunus_persica.Prupe1_0.31.gtf --library-type fr-firststrand
$ cufflinks -g Prunus_persica.Prupe1_0.31.gtf --library-type fr-firststrand
vii) Create a text file named “gtfs.path.txt”, which includes the path and filename of each “transcripts.gtf” (generated in step vi), and then merge each “transcripts.gtf” files into a single one using cuffmerge utility of Cufflinks suite with the following command:
cuffmerge <Gtf file list.txt >
$ cuffmerge ppersica.gtfs.path.txt
viii) Query the transcripts being found in the previous step against the Ensembl plant database (http://plants.ensembl.org) with the Cuffcompare tool, which is an analysis module of Cufflinks tool. The “merged.gtf” file generated with Cuffmerge (in step vii) contains exon structures of all possible transcripts, and Cuffcompare will help us to classify those transcripts as novel or known. For additional information about transcript classification, I recommend reading the Cuffcompare user manual, accessible at http://cole-trapnell-lab.github.io/cufflinks/cuffcompare/. In my example, I am seeking to identify previously un-annotated intergenic transcripts; therefore I am only interested in the transcripts with class_code “u” within Cuffcompare output, “cuffcmp.combined.gtf”.
cuffcompare -r <Assembly name .gtf> <merged.gtf>
$ cuffcompare -r Prunus_persica.Prupe1_0.31.gtf merged.gtf
ix) Retrieve intergenic previously un-annotated transcript sequences with the gffread utility of Cufflinks. Parse “cuffcmp.combined.gtf”, generated in the previous step, and extract the exon structure information of the transcripts with annotated “u” class code with the grep command line in the Linux terminal window. This command helps to process the “cuffcmp.combined.gtf” file line by line, and extracts the lines that match a specified pattern. After extracting the exon structure information of previously un-annotated transcripts from “cuffcmp.combined.gtf”, I will create a separate .gtf file named “novel.transcripts.gtf” that will be utilized for retrieving transcript sequences.
$ grep ‘class_code “u”;’ cuffcmp.combined.gtf > novel.transcripts.gtf
gffread -w <gffread output name .fa> <Assembly name .fa> <novel.transcripts.gtf>
$ gffread -w novel.transcripts.fa -g Prunus_persica.Prupe1_0.31.dna.genome.fa novel.transcripts.gtf
x) Detect potential coding regions within the transcript sequences being extracted in the previous step. In order to complete this step, I will utilize a computational tool called TransDecoder (v2.0.1), which can be downloaded from https://transdecoder.github.io/. One of the features of TransDecoder is to allow users to define the minimum length of the open reading frame (ORF), which will be searched within each transcript. In this analysis pipeline, however, I will search only the ORFs at least 50 amino acids length.
TransDecoder.LongOrfs –m <the length of amino acids searched within transcripts> -t <fasta file name including transcript sequences>
$ TransDecoder -m 50 –t novel.transcripts.fa
xi) Download and extract mature miRNA sequences from miRBase (release 21) using the following commands (as in steps ii and iv):
$ wget ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz
$ gunzip mature.fa.gz
xii) Execute my custom eTM identification script written in R (v3.1.0) statistical computing environment . The script has been developed based on the binding rules previously defined by . It classifies a given sequence as eTM if: (i) the 2nd to 8th positions at the 5′ end of a miRNA sequence perfectly match to the given transcript sequence, (ii) there are three nucleotides bulges between the 9th to 11th positions at the 5′ end of the miRNA sequence, and (iii) maximum 3 nucleotide mismatch (excluding bulge region) exist miRNA and its target. To successfully run the script, I need “Biostrings” (https://bioconductor.org/packages/release/bioc/html/Biostrings.html) and “data.table” (https://cran.r-project.org/web/packages/data.table/index.html) R packages. Once the installation of the R packages is completed, download my custom script from http://tools.ibg.deu.edu.tr/ppersica/eTM.search.script.zip, and put the R script in same folder with “mature.fa” (downloaded in step xi), “novel.transcripts.fa” (generated in step ix) and “longest_orfs.pep” (generated in step x) files that have been generated in the previous steps then execute the following command:
$ Rscript eTM.search.R
xiii) The previous steps yield a .csv file including miRNA-eTM interaction pairs and the run time of the command above depends on the number of miRNA and transcript sequences in “mature.fa” and “novel.transcripts.fa” files.
3 Results and Discussion
One of the major advantages of RNA sequencing technique over microarray technology is to enable to discover novel transcript sequences previously remained un-annotated in public genomic databases. Utilizing diverse open-source bioinformatics tools and my custom script developed in R language, I found 11 putative eTM sequences from two paired-end peach fruit transcriptome libraries. In addition to these public data sets, I also collected all available peach lncRNA sequences from GReeNC database , and searched for putative eTMs with the custom R script. Consequently, I identified additional 33 putative eTMs from GreeNC data set. In total, the study predicted 44 novel eTM sequences from the libraries and a previously published set of peach fruit lncRNAs. All putative eTM sequences and their predicted miRNA targets are available at http://tools.ibg.deu.edu.tr/ppersica/eTM.sequences.and.features.xlsx.
Computational strategies to identify previously un-annotated eTMs have a great potential to improve our understanding of the miRNA-mediated regulatory programs in plants. Based on the results of my computational analysis, I observed that one of the most widely expressed miRNA families among diverse plant species, miR156 , might be potentially sponged by seven putative eTMs (Figure 2). Additionally, I superimposed the miR156 targets recently introduced by  with the predicted eTMs, and built an eTM-miRNA-mRNA regulatory network module (Figure 2) to demonstrate how predicted eTMs can be linked to GO terms. Per the proposed network module, two potential targets of miR156, ppa021582m and ppa007056m, were found to be involved in biological processes associated with organ development (GO:0048513) and system development (GO:0048731). Another target of miR156, ppa011968m, is predicted to be involved in multicellular organismal development (GO:0007275) . This regulatory network module suggests that eTMs potentially play roles in the regulation of development processes in peach fruit via targeting specific miRNAs. Additionally, the investigation of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways  of miRNA-targeted genes  revealed that the predicted eTMs might have regulatory roles in distinct metabolic processes such as amino acid and carbohydrate metabolism, and zeatin biosynthesis via targeting miRNAs involved in those pathways (Table 1).
Since from the first discovery of plant eTM, Induced by Phosphate Starvation 1 (IPS1) transcript binding to phosphate starvation-induced miRNA, ath-miR399 in Arabidopsis, as an endogenous lncRNA , common interest on eTM-based miRNA regulation in plants. Subsequently, several eTM sequences of some conserved miRNAs were computationally identified in rice, soybean and other plant species , , . Some of the plant eTMs were also utilized for functional characterization of miRNAs , . Overexpression of miR-156 eTM (MIM156) and miR-319 eTM (MIM319) resulted altered phenotypes in A. thaliana . Therefore, identification of eTM sequences as first step of functional characterization of eTM-miRNA-target transcript modules will greatly help to researchers in the field. By utilization of presented workflow, it will be beneficial to discover previously un-annotated plant eTMs. On the other hand, as one of the important conserved miRNAs, miR156, with its regulatory roles on SQUAMOSA-PROMOTER BINDING PROTEIN-LIKE (SPL) family of transcription factors  for a number of vital functions such as vegetative and reproductive phase changes , leaf morphology and development  , panicle architecture , stress responses , . Multiple eTM sequences for miRNA156 in peach found in this study suggest that miRNA with a number of SPL family transcript targets might be under control of several specific miR156 eTMs. The identified miR156 eTMs will then be utilized for functional characterization and enhancement of agronomical traits.
In conclusion, with the help of the presented pipeline in this manuscript, I believe that previously un-annotated plant eTMs can be predicted effectively from next generation sequencing transcriptome datasets, and these eTM sequences can be utilized for the establishment of the eTM-miRNA-mRNA regulatory networks. However, researchers should note that one of the purposes of this manuscript is to provide a general overview regarding novel eTM sequence discovery, and I strongly recommend researchers to examine the user manual of each tool being used to select appropriate parameters in line with their research goals.
I would like to thank ÜnverLab for its support in implementing the analysis pipeline.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511–5. PubMedCrossrefWeb of ScienceGoogle Scholar
Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010;28:503–10. Web of SciencePubMedCrossrefGoogle Scholar
Chaitankar V, Karakulah G, Ratnapriya R, Giuste FO, Brooks MJ, Swaroop A. Next generation sequencing technology and genomewide data analysis: perspectives for retinal research. Prog Retin Eye Res. 2016;55:1–31. CrossrefPubMedWeb of ScienceGoogle Scholar
Liu SJ, Nowakowski TJ, Pollen AA, Lui JH, Horlbeck MA, Attenello FJ, et al. Single-cell analysis of long non-coding RNAs in the developing human neocortex. Genome Biol. 2016;17:67. CrossrefWeb of SciencePubMedGoogle Scholar
Alvarez-Dominguez JR, Bai Z, Xu D, Yuan B, Lo KA, Yoon MJ, et al. De Novo reconstruction of adipose tissue transcriptomes reveals long non-coding RNA regulators of brown adipocyte development. Cell Metab. 2015;21:764–76. PubMedCrossrefWeb of ScienceGoogle Scholar
Parikshak NN, Swarup V, Belgard TG, Irimia M, Ramaswami G, Gandal MJ, et al. Genome-wide changes in lncRNA, splicing, and regional gene expression patterns in autism. Nature. 2016;540:423–7. PubMedWeb of ScienceCrossrefGoogle Scholar
Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–5. Web of ScienceCrossrefPubMedGoogle Scholar
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562–78. CrossrefPubMedWeb of ScienceGoogle Scholar
Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012;28:1086–92. PubMedCrossrefWeb of ScienceGoogle Scholar
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–52. CrossrefWeb of SciencePubMedGoogle Scholar
Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8:1494–512. CrossrefPubMedWeb of ScienceGoogle Scholar
Gupta PK. MicroRNAs and target mimics for crop improvement. Curr Sci India. 2015;108:1624–33. Google Scholar
Franco-Zorrilla JM, Valli A, Todesco M, Mateos I, Puga MI, Rubio-Somoza I, et al. Target mimicry provides a new mechanism for regulation of microRNA activity. Nat Genet. 2007;39:1033–7. Web of ScienceCrossrefPubMedGoogle Scholar
Zhang H, Chen X, Wang C, Xu Z, Wang Y, Liu X, et al. Long non-coding genes implicated in response to stripe rust pathogen stress in wheat (Triticum aestivum L.). Mol Biol Rep. 2013;40:6245–53. PubMedCrossrefWeb of ScienceGoogle Scholar
Gupta PK. MicroRNAs and target mimics for crop. Curr Sci. 2015;108:1624–33. Google Scholar
Todesco M, Rubio-Somoza I, Paz-Ares J, Weigel D. A collection of target mimics for comprehensive analysis of microRNA function in Arabidopsis thaliana. PLoS Genet. 2010;6:e1001031. CrossrefPubMedWeb of ScienceGoogle Scholar
Li F, Wang W, Zhao N, Xiao B, Cao P, Wu X, et al. Regulation of nicotine biosynthesis by an endogenous target mimicry of microRNA in tobacco. Plant Physiol. 2015;169:1062–71. CrossrefPubMedWeb of ScienceGoogle Scholar
Bonnet E, He Y, Billiau K, Van de Peer Y. TAPIR, a web server for the prediction of plant microRNA targets, including target mimics. Bioinformatics. 2010;26:1566–8. PubMedCrossrefWeb of ScienceGoogle Scholar
Wang P, Zhi H, Zhang Y, Liu Y, Zhang J, Gao Y, et al. miRSponge: a manually curated database for experimentally supported miRNA sponges and ceRNAs. Database (Oxford). 2015;2015:bav098. PubMedWeb of ScienceGoogle Scholar
Bakir Y, Eldem V, Zararsiz G, Unver T. Global transcriptome analysis reveals differences in gene expression patterns between nonhyperhydric and hyperhydric peach leaves. Plant Genome. 2016;9. PubMedWeb of ScienceGoogle Scholar
Team RC. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2013. ISBN 3-900051-07-0 2014. Google Scholar
Paytuvi Gallart A, Hermoso Pulido A, Anzar Martinez de Lagran I, Sanseverino W, Aiese Cigliano R. GREENC: a Wiki-based database of plant lncRNAs. Nucleic Acids Res. 2016;44:D1161–6. CrossrefPubMedWeb of ScienceGoogle Scholar
Zhang C, Zhang B, Ma R, Yu M, Guo S, Guo L, et al. Identification of known and aovel microRNAs and their targets in peach (Prunus persica) fruit by high-throughput sequencing. PLoS One. 2016;11:e0159253. CrossrefPubMedGoogle Scholar
Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 2007;35:W182–5. PubMedCrossrefWeb of ScienceGoogle Scholar
Eldem V, Akçay UÇ, Ozhuner E, Bakır Y, Uranbey S, Unver T. Genome-wide identification of miRNAs responsive to drought in peach (Prunus persica) by high-throughput deep sequencing. PLoS One. 2012;7:e50298. Web of ScienceCrossrefPubMedGoogle Scholar
About the article
Published Online: 2017-06-28
Conflict of interest statement: The author states no conflict of interest. The author has read the journal’s publication ethics and publication malpractice statement available at the journal’s website and hereby confirms that he complies with all its parts applicable to the present scientific work.
Citation Information: Journal of Integrative Bioinformatics, Volume 14, Issue 4, 20170009, ISSN (Online) 1613-4516, DOI: https://doi.org/10.1515/jib-2017-0009.
©2017, Gökhan Karakülah, published by De Gruyter. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License. BY-NC-ND 3.0