RNAcode_Web – Convenient identification of evolutionary conserved protein coding regions

Abstract The differentiation of regions with coding potential from non-coding regions remains a key task in computational biology. Methods such as RNAcode that exploit patterns of sequence conservation for this task have a substantial advantage in classification accuracy in particular for short coding sequences, compared to methods that rely on a single input sequence. However, they require sequence alignments as input. Frequently, suitable multiple sequence alignments are not readily available and are tedious, and sometimes difficult to construct. We therefore introduce here a new web service that provides access to the well-known coding sequence detector RNAcode with minimal user overhead. It requires as input only a single target nucleotide sequence. The service automates the collection, selection, and preparation of homologous sequences from the NCBI database, as well as the construction of the multiple sequence alignment that are needed as input for RNAcode. The service automatizes the entire pre- and postprocessing and thus makes the investigation of specific genomic regions for previously unannotated coding regions, such as small peptides or additional introns, a simple task that is easily accessible to non-expert users. RNAcode_Web is accessible online at rnacode.bioinf.uni-leipzig.de.


Introduction
The discrimination of coding and non-coding regions has remained an important task in genome annotation [1,2].Three fundamentally different classes of methods have been proposed for this task: comparison with protein databases [3], alignment-free feature-based methods [4,5] and methods that evaluate multiple sequence alignments (MSAs) [6,7].Multiple sequence alignments have also been used to boost the accuracy of gene predictors such as AUGUSTUS [8].MSA-based methods leverage the characteristic patterns of nucleotide substitutions of evolutionarily conserved coding sequences and thereby gain a sometimes very large advantage in accuracy over pattern-based methods in particular on short coding regions.
The detection of short conserved coding sequences is of practical importance in two quite different application scenarios.Small, independently encoded peptides have received increasing interest following the discovery that some of them carry important physiological functions in both eukaryotes and prokaryotes [9][10][11][12][13].Despite targeted studies, novel small peptides are still difficult to detect with proteogenomic methods [14].Many eukaryotic genes, furthermore, exhibit a diverse set of splice variants that are far from completely catalogued.Additional, potentially coding exons thus may have been overlooked if they are not sufficiently long to be detected by gene prediction tools or included in sufficiently abundant RNAs to be discovered in transcriptomics data.In both use cases, alignment-based computational methods are a useful complement to direct experimental data.
The MSAs required as input for comparative methods, however, are time-consuming and often difficult to construct despite recent, major progress [15].Genome-wide MSAs available through the ENSEMBL and UCSC genome browsers, on the other hand, are restricted to limited sets of species.While the effort of preparing genome-wide MSA containing the species of interest cannot really be spared in genome-scale surveys, the required computational and personnel resources is still prohibitive in case only moderate-size regions, e.g., particular introns or intergenic regions are of interests.Although standard sequence alignment tools can be utilized in such cases, retrieving and aligning the sequences is still a tedious task.The manual work-flow typically comprises the selection of genomes at appropriate phylogenetic distance, a series of blast searches and inspection of their results, retrieval and post processing of the sequences, construction of the MSA, and some quality control on the resulting alignment before the data are suitable as input for a comparative coding sequence detector.
The web service described here is designed to automatize the entire work-flow.Since the advantage of comparative methods is most pronounced for small peptides, special care was taken to ensure that the homology search also works reliably for short query sequences.To this end, an iterative scheme was adopted to identity homologous sequences in a range of sequence similarity at which RNAcode [6] (available from https:// viennarna.github.io/RNAcode/)operates with high accuracy.In applications with large queries, such as the scanning eukaryotic introns, the computational costs of the coding sequence detector can become infeasible since the optimization steps in RNAcode's core algorithm scale quadratically with input size [6].Large inputs are therefore subdivided into overlapping windows for which MSAs are constructed and scored independently before the results are combined.

Work-flow
The RNAcode_Web service implements three distinct processing steps: (1) retrieval for suitable sequences, (2) construction of MSAs, and (3) the scoring of MSAs with RNAcode and the collation of the results.

Sequence retrieval and selection
Depending on the size of user's query sequence, RNAcode_web either runs blastn searches of the query against the NCBI data base, or retrieves plausible candidates from NCBI and generates a temporary local database and runs blastn searches against this local data base.If the NCBI nt database is used, an entrez query is used to identify and discard non-genomic sequences.For long query sequences, a local database is created for efficiency.The query is subdivided into smaller, overlapping chunks (length = 1000 bp) that are searched independently against the local database.This step is parallelized on the server.All blastn hits are then extended to the full length of the query using estimated coordinates.The corresponding sequences are retrieved using blastdbcmd.
The accuracy and sensitivity of RNAcode depends crucially on the pairwise distances of the selected sequences.Therefore, a subset comprising a minimum of 2 and a maximum of 16 extended blastn hits is selected for further processing.Similarity s between sequences is computed from a simple pairwise alignment as the number of matches divided by the length of the shorter sequence.The corresponding distance (expressed in percent) is then given as 100(1 − s)%.The user may specify the minimal and maximal distance for pairwise alignments; default values are 10 %-60 %.The blastn search is iterated five times with different word lengths (see Table 2 in the Methods section for details) to ensure that distance homologs are included if necessary.The candidates are then subjected to a density-based clustering.Algorithm 1 operates similar to DBSCAN [16] and ensures that representative sequences have a minimum pairwise distance of at least the user-defined threshold (10 % by default) from each other.It determines the centroid of the sequences that best represents the largest number of sequences until the required minimum pairwise distance is reached.The corresponding function reduce_cand_min_dist, which is contained in the module SeqSelection of the Python implementation (see "Software versions and parameters" below), may also be of use in unrelated applications.This step is essential because the data base often contains large numbers of very similar sequences, e.g. from closely related strains or different isolates from the same species.Even large sets of nearly identical sequences, however, convey very little evolutionary information and this cannot be utilized by RNAcode.The final set of sequence is then selected in the order of increasing blastn E-value, i.e., decreasing similarity from the query.The final result is a set of sequences that is as similar as possible to the query given the constraint imposed on their minimal pairwise similarity.

Construction of a MSA and scoring
The selected homologs for each query (or sub-query in the case of long input sequences) are the aligned with Clustal Omega [17] and provided to RNAcode for scoring.RNAcode identifies maximal regions with conserved coding potential, which are referred to as "high scoring segments" (HSS) in the following.RNAcode also reports the reading frame (relative to the start position of the input sequence), a p-value for the conserved protein-coding segment, and a graphical representation of corresponding evidence, see Figure 1 for an example.The level of sequence conservation is indicated as a contour at the bottom.The example shows a novel ORF "nov_131" identified from mass spectrometry data in Blautia producta [18].The alignment contains a diverse set of Bacteriodes species.The image is cropped.The original image with a list of all species can be found in Supplementary Material at http://www.bioinf.uni-leipzig.de/publications/supplements/22-004.

Output
The RNAcode_Web service returns a web page with "Coding Regions" and "Alignment Info".It produces a graphical overview indicating the HSS a p-value <0.05 that summarizes position, reading and relative reading, Figure 2. the split multiple sub-queries, boundaries at the bottom of plot.
The output page is interactive and the to apply filters, e.g. a p-value threshold or removing all the best HSS if overlapping HSS in different readings frames are HSS with high scores tend to show "shadow" the opposite strand with a much lower, but still significant score.This is a quence of the structure of the genetic the original table of provides links to annotated alignments, allowing detailed inspection of the substitutions patters, see Figure 1.An option to all predicted HSS as protein sequences in fasta is also available.In case the query was split, overlapping HSS the reading frame can be merged into a single The output page also provides access to all alignments that were passed to RNAcode for scoring to facilitate the inspection of negative results.The "Alignment Info" pages also show phylogenetic trees constructed by ClustalOmega to identify possible outlying or misaligned sequences.

Example applications
To demonstrate the capability of RNAcode_Web to investigate putative small proteins in prokaryotes, we investigated six candidate peptides (nov_34, nov_102, nov_131, nov_174, nov_180, nov_215) Blautia producta, which were found in a metaproteomics survey of an artificial gut community [18].Figure 1 shows the RNAcode output for nov_131.For all five candidates, RNAcode_Web predicted a conserved coding region with p < 0.001 based automatically generated alignments.The corresponding output is available in the Electronic Supplementary Material at http://www.bioinf.uni-leipzig.de/publications/supplements/22-004.Small open reading frames can also be detected efficiently in large, eukaryotic genomes.The human APELA (Apelin Receptor Early Endogenous Ligand), which comprising aminoacids an intron approximately in middle of the ORF [19], serves as an example.Here both short coding regions are reliably identified.The full results are available in the Electronic Supplementary Material as well, including a figure of the predicted exons as seen in the UCSC genome browser.
RNAcode was originally designed to operate on whole genome alignments (WGA).We compared the results of RNAcode_Web with the results of RNAcode scoring on merged alignment blocks based on the 100 vertebrate WGA provided by the UCSC genome browser.The merging of consecutive alignment blocks is necessary because many blocks in 100 vertebrate WGA are too short for RNAcode to have sufficient statistical power.In order to illustrate the use RNAcode_Web with long inputs we used the DNA sequences of eukaryotic genes as input.The example of the human phosphodiesterase PLCH2 shows that most coding introns are correctly recovered.To compare the results visually, we used blat to map the coding sequences predicted by RNAcode_Web to the human genome.We displayed these data in the UCSC genome browser together with the annotation files generated by RNAcode for the genome wide alignments, see Figure 3. Overall, the RNAcode_Web predictions are more complete, although they have a tendency of extended predicted coding regions a few amino acids beyond the splice Figure 3: RNAcode_Web predictions for the genomic region harboring the human phosphodiesterase PLCH2.The image was generated by the UCSC browser.For better readability, the region is shown in five slightly overlapping segments.The black "RNAcode WGA" shows the predicted regions by RNAcode based on the whole genome alignment based on 100 vertebrate genomes.The blue track "RNAcode_Web" shows the predicted regions by RNAcode using the alignment constructed by the web service.The histogram track at the bottom shows the conservation level for each base pair across all species in the genome alignment.
sites and sometimes collate consequence exons if they are in frame, i.e., separated by short exons with a length divisible by 3.

Discussion
Comparative methods that evaluate substitution patterns by construction require MSAs as input and depend crucially on their quality.This limits their use in practice to application scenarios where suitable alignments are already available, usually in the form of genome-wide alignments, or where can be constructed quickly.The latter is mostly the case if only short, uninterrupted sequences with fairly homogeneous levels of sequences conservation are of interest; for these online blast services are usually sufficient to obtain the required sequences manually.Even in such simple cases, blastn hits often do not extend over the full length of the query and require manual intervention.In general, however, the investigation of a single query that potentially contains multiple coding blocks requires tedious manual data retrieval and multiple steps to prepare the data for scoring, as well as some post processing of the RNAcode results.
In order to alleviate these efforts, we developed a service, RNAcode_Web, that takes a single query sequence as input, which may be short and contain a single coding block or long, encompassing an entire genomic region in a eukaryotic genome.The pipeline then automatically retrieves suitable homologs with pairwise sequencesimilarities that both allow reliable alignments and provide enough variation for RNAcode to be a reliable classifier for conserved coding sequence.The pipeline collates the results from RNAcode into a convenient webbased report and provides the pertinent data for download to facilitate further processing.RNAcode_Web is intended to provide its users with fast overview of the conserved coding potential of a DNA sequence of interest, without the need to set up software and download data for comparative analysis locally.Several application examples (accessible in the Electronic Supplementary Material http://www.bioinf.uni-leipzig.de/publications/supplements/22-004) showcase the usefulness of the service.The RNAcode_Web service is designed to be completely independent of existing sequence annotation and makes only marginal use of external data.It only uses the NCBI taxonomy to restrict the scope of sequence searches in the iterative retrieval step.We argue that the use of taxonomy data is unproblematic since taxonomy information is by definition available for every sequence in the NCBI sequence databases.The taxonomic restriction speeds up the DB search but has no appreciable effect on the final results.The independence from other data sources makes RNAcode_Web applicable without restrictions to sequence data from any organism.
We have developed a rather general approach to automatize the retrieval of sequences with a prescribed minimal similarity to ensure high quality alignments and a prescribed maximal pairwise similarity to ensure that the alignment contains enough sequence variation to be informative.With present implementations of sequence similarity search tools such as blast, it is possible to enforce an upper bound on the similarity only by first retrieving all matches up to a specified minimal similarity.The first step incurs a substantial, but manageable, computational cost whenever the target database contains many highly similar sequences.The MSA is then constructed in subsequent step from the initial set of hits.We believe that this general approach can be transferred also to other applications in comparative genomics.However, the many details, including the bounds on the required sequence similarity, the construction of overlapping or non-overlapping sequence intervals and thus alignment blocks and the size of the aligned blocks will most likely need to be adjusted for specific applications.We have therefore focused here on a specific service rather than a generalist toolkit for sequence retrieval.

Figure 1 :
Figure 1: Example of the graphical output produced by RNAcode.The yellow bar on top indicates the predicted coding region, green codons indicate synonymous or similar amino acids, red codons indicate non-conservative substitutions.The level of sequence conservation is indicated as a contour at the bottom.The example shows a novel ORF "nov_131" identified from mass spectrometry data in Blautia producta[18].The alignment contains a diverse set of Bacteriodes species.The image is cropped.The original image with a list of all species can be found in Supplementary Material at http://www.bioinf.uni-leipzig.de/publications/supplements/22-004.

Figure 2 :
Figure 2: An example for a HSS-plot the genomic regions harboring the C. gene Y54E2A, a predicted 3 ′ −5 ′ helicase activity.Each arrow represents one HSS.Colors indicate relative reading frames.The query was subdivided into four sub-queries that were processed separately as indicated at the bottom the