iMLP, a predictor for internal matrix targeting-like sequences in mitochondrial proteins

Matrix targeting sequences (MTSs) direct proteins from the cytosol into mitochondria. Efficient targeting often relies on internal matrix targeting-like sequences (iMTS-Ls), which share structural features with MTSs. Until now, predicting iMTS-Ls was tedious and required multiple tools and webservices. We present iMLP, a deep learning approach for the prediction of iMTS-Ls in protein sequences. A recurrent neural network was trained to predict iMTS-L propensity profiles for protein sequences of interest. The iMLP predictor considerably exceeds the speed of existing approaches. Expanding on our previous work on iMTS-L prediction, we now provide an intuitive iMLP webservice available at http://iMLP.bio.uni-kl.de as well as a stand-alone command line tool for power users.


Introduction
Mitochondrial genomes code for only a handful of proteins. Most of the roughly 800 (yeast) to 1500 (human) mitochondrial proteins are encoded in the nucleus. These proteins are synthesized in the cytosol as precursor proteins that have to be imported into the organelle (Avendaño-Monsalve et al. 2020; Chacinska et al. 2008). Import is directed by targeting signals with characteristic features in their primary and secondary structures (Grevel et al. 2019; Heijne 1986; Vögtle et al. 2009). Most proteins of the mitochondrial matrix and the inner membrane contain matrix targeting sequences (MTSs). These amino-terminal presequences are necessary and sufficient for the import of the respective protein into mitochondria. They are recognized by translocases in the outer mitochondrial membrane (the TOM complex) and the inner mitochondrial membrane (the TIM23 complex), which facilitate their translocation into mitochondria (Heijne 1986). MTSs are often removed from these precursors during the translocation process by the mitochondrial processing peptidase (MPP) (Calvo et al. 2017; Friedl et al. 2020; Vögtle et al. 2009). As recently discovered, many matrix proteins contain additional internal sequences which support protein translocation into mitochondria. These regions were called internal MTS-like sequences (iMTS-Ls) owing to their MTS-like structural properties; however, in contrast to the amino-terminal MTSs, they are located within the mature region of the protein (Backes et al. 2018). While neither necessary nor sufficient on their own for translocation, iMTS-Ls can considerably increase the import efficiency of the precursors. This stimulating effect is mediated by the mitochondrial surface receptor Tom70 in conjunction with the cytosolic chaperone machinery (Backes et al. 2020; Kreimendahl and Rassow 2020; Young et al. 2003). The interaction with Tom70 is particularly relevant for aggregation-prone or hydrophobic proteins, and cooperative binding of several Tom70 receptors to one precursor might support the unfolding and import of precursor proteins (Backes et al. 2020; Backes and Herrmann 2017; Brix et al. 1999, 2000; Yamamoto et al. 2009). The iMTS-L-mediated protein targeting seems not to be restricted to MTS-containing proteins. For instance, an iMTS-L was found to target the ubiquitin ligase Ubx2 onto the mitochondrial surface (Mårtensson et al. 2019) or the cytosolic protein Orf9b of the SARS-CoV2 virus (Gordon et al. 2020; Samavarchi-Tehrani et al. 2020). The latter is particularly interesting as the COVID-19-causing coronavirus expresses an iMTS-L-containing peptide in high amounts in order to inhibit Tom70 function in infected cells (Jiang et al. 2020; Miserey-Lenkei et al. 2021; Shi et al. 2014). The physiological relevance of iMTS-L sequences still has to be explored more generally, but these examples already show that these Tom70-interaction regions are not necessarily restricted to mitochondrial proteins and can be highly relevant in the context of diseases.
MTSs and iMTS-Ls are helical regions of varying lengths, mostly between 10 and 70 residues. One face of these amphipathic helices is positively charged, the other is hydrophobic; negative charges are absent, and serine and threonine residues are frequent. MTSs can be detected by algorithms which either scan the sequences for these characteristic pre-defined features or which were trained on protein sequences of known localization. The TargetP algorithm is one of the machine-learning programs widely used in the community (Emanuelsson et al. 2000, 2007). It is well suited for the detection of MTSs and detects amino-terminal sequences with high accuracy.
Here we describe a user-friendly webservice implementation, coined iMLP, for iMTS-L prediction. This tool leverages the capabilities of sequence-based deep learning to offer a fast and reliable prediction of iMTS-L propensity profiles for any given input sequence through a convenient interface.

The iMLP prediction workflow
The iMLP application presented here provides a direct and fast approach to predict iMTS-L propensity profiles from protein sequences of interest. This is achieved by an algorithm based on an artificial recurrent neural network architecture named long short-term memory (LSTM). These architectures are specifically designed for feature detection in sequences and are therefore well suited for the recognition of iMTS-Ls. In this setup, we trained our model in a supervised fashion on a set of training sequences derived from the well-established TargetP algorithm. We exploited the previously shown fact that iMTS-L stretches, which support Tom70 binding, share the typical biochemical properties of mitochondrial targeting signals and can therefore be predicted classically using TargetP. However, to identify iMTS-L stretches, additional steps are necessary, comprising (i) preprocessing, (ii) profile generation, and (iii) normalization. During preprocessing, the restriction of analyzing only the N-terminus of a given protein is lifted. Due to the nature of the TargetP MTS prediction, a single TargetP score reflects the probability of the presence or absence of a transit peptide. This might lead to sudden changes in the positional score along a sequence that are unreasonable when expecting contiguous motifs in a sequence. Therefore, the vector of raw TargetP scores is smoothed using a Savitzky-Golay filter with a window size equal to the expected value of the distribution of known MTS lengths (Vögtle et al. 2009). To identify iMTS-Ls in the resulting iMTS-L profiles, the signal of the respective target sequence is normalized to the background signal, which is estimated from artificial signals generated by analyzing 5000 random sequences. In our framework, we applied this procedure to the combined reference proteomes of yeast, mouse, and human, creating the dataset used to train and evaluate our model. The model can be directly consumed via our webservice (http://imlp.bio.uni-kl.de/) or using a stand-alone version of the prediction tool (https://github.com/CSBiology/iMLP) and allows direct and fast identification of iMTS-Ls (Figure 1). In addition to the source code of the webservice (https://github.com/CSBiology/iMLP_Web), we provide the model in the standardized Open Neural Network Exchange (ONNX) export format for reproducibility and validation independent of the machine learning framework. Using a multi-organism proteome as the learning data set, we could split the data into training and test sets in a 9:1 ratio, which enabled us to test our model on about 2,500,000 iMTS-L propensity scores not encountered during the training process. Comparing model-based iMTS-L propensity scores with those assigned to the test data set, we reached an agreement of 99.10%, measured as Pearson correlation (r = 0.991). This seems sufficient considering a certain indistinctness of the indirect identification using TargetP followed by the processing procedure, and it indicates a successful direct mapping between protein sequence and iMTS-L propensity score (Figure 2A). This global quality assessment can be complemented by the in-depth analysis of well-studied candidates, such as the yeast ATP synthase alpha subunit Atp1. The comparative visualization of the iMTS-L propensity scores from the TargetP-based calculation and the LSTM-based model shows that both prediction approaches agree in predicting the respective experimentally verified iMTS-L positions (Figure 2B). The comparison shows a different behavior of the TargetP 2.0 algorithm, which is known to have a higher accuracy in predicting classical N-terminal MTSs (Almagro Armenteros et al. 2019). However, this prediction does not match the iMTS-L propensity score expected according to the experimental evidence (Backes et al. 2018).
Analogously, we provide a plant model that was trained and evaluated on the combined reference proteomes of Arabidopsis thaliana, Zea mays and Oryza sativa, with an accuracy of 99% as determined by Pearson correlation (Supplementary Figure 1A). While the TargetP 2.0-based prediction differs, iMLP is able to detect iMTS-Ls in SDHB-RPS14 from rice flanking the cleavage site (G247), as expected based on homology (Friedl et al. 2020) (Supplementary Figure 1B). Because this has not yet been confirmed experimentally, we offer the plant model as an experimental feature of our application that will need additional experimental validation.
Most recently, the interaction between Tom70 and the SARS-CoV2 protein Orf9b was shown to be of high importance (Jiang et al. 2020). Using iMLP, we could analyze the sequence of Orf9b and identify the interaction region, which is clearly separated from the rest of the sequence. The direct generation of the iMTS-L profile allows a more direct interpretation of the prediction result compared to the TargetP raw scores (Figure 3). In general, iMTS-L propensity scores above zero indicate iMTS-L stretches. It becomes apparent that the normalization against random sequences increases the signal-to-noise ratio of the score derived by our model when compared to the raw TargetP output. The same holds true for the inspection of the well-studied yeast protein Atp1 (Figure 2B).
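The rule that propensity scores above zero indicate iMTS-L stretches can be turned into a small post-processing step on an exported profile. The following is only a sketch: the function name, the `min_length` parameter and the example values are hypothetical and are not part of iMLP.

```python
from typing import List, Tuple


def imts_l_stretches(profile: List[float], threshold: float = 0.0,
                     min_length: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs (0-based, end exclusive) of runs with score > threshold.

    `min_length` is a hypothetical convenience filter for very short runs.
    """
    stretches = []
    start = None
    for a, score in enumerate(profile):
        if score > threshold and start is None:
            start = a                      # a run of positive scores begins
        elif score <= threshold and start is not None:
            if a - start >= min_length:    # the run ended at position a - 1
                stretches.append((start, a))
            start = None
    if start is not None and len(profile) - start >= min_length:
        stretches.append((start, len(profile)))  # run reaches the C-terminus
    return stretches


# Invented propensity values, for illustration only.
example_profile = [-0.8, -0.2, 0.4, 1.3, 0.9, -0.1, -0.5, 0.2, -0.3]
print(imts_l_stretches(example_profile))                # [(2, 5), (7, 8)]
print(imts_l_stretches(example_profile, min_length=2))  # [(2, 5)]
```

Raising `min_length` suppresses isolated positive positions, which may be useful given that MTS-like regions are described as mostly 10 to 70 residues long.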

Web interface for the iMLP service
In order to use iMLP to predict iMTS-Ls for their own research, users do not have to undergo installation procedures or provide computational resources but can retrieve the prediction via an easy-to-use web interface (Figure 4). The iMLP webservice is available at http://iMLP.bio.uni-kl.de. Here, users can submit either individual protein sequences or files with multiple entries. The output is a detailed prediction report including (i) a score heatmap of the input sequence and (ii) the predicted iMTS-L propensity profile ($P^{\prime\,\text{iMTS-L}}_{a}$). The profile plots (ii) are generated with plotly.js and are therefore well suited for exploratory analysis due to their interactivity. These plots can be easily downloaded using the functionality above the respective plots. If multiple sequences are provided in a file, each sequence report is displayed in a separate tab on the page. Additionally, iMLP results can be exported as tab-separated files.

Discussion
Predicting iMTS-Ls is of great interest to identify protein sequences that determine the intracellular distribution of proteins. Such iMTS-L sequences are frequently found in mitochondrial proteins but are also present in many cytosolic proteins, in some cases obviously to facilitate protein association with the mitochondrial surface (Backes et al. 2018). The relevance of iMTS-Ls has only been analyzed experimentally for a small set of model proteins, and a general, proteome-wide analysis is lacking. This motivates our machine-trains-machine learning approach to derive a predictive model for fast iMTS-L sequence detection.
In many cases, potential iMTS-Ls might be hidden in the three-dimensional structure of proteins and therefore not accessible once proteins are folded. However, these sequences might be relevant during the folding reaction of proteins and provide binding sites for chaperones, co-chaperones as well as proteins with tetratricopeptide repeat (TPR) domains, such as Tom70. Alternatively, in organisms lacking Tom70, other mitochondrial receptors with TPR domains could recognize iMTS-L sequences. Om64 in plants (Nickel et al. 2019) and ATOM69 in the mitochondrial outer membrane of kinetoplastids (Mani et al. 2015) have been proposed to perform a Tom70-like function.
In this description, we present a tool for simplified iMTS-L prediction. Protocols for iMTS-L prediction had been published before (Backes et al. 2018; Boos et al. 2018), and the latest version of the TargetP application (TargetP 2.0) can technically also be used to calculate propensity scores. It has been shown that TargetP 2.0 has an improved specificity for detecting classical N-terminal MTSs (Almagro Armenteros et al. 2019). However, due to the divergent results when predicting the iMTS-L profile of the well-studied Atp1, we believe that version 2.0 is not ideally suited to detect iMTS-Ls. This might be explained by slight differences in the physicochemical properties between iMTS-Ls and classical MTSs and the superior specificity of TargetP 2.0 for the latter.
In contrast, the iMLP predictor is easier to use and makes tedious manual pre- and postprocessing obsolete. Moreover, the prediction speed is highly improved due to the thin network architecture, which makes the iMLP algorithm ideal for a web service-based prediction.
In vivo, the intracellular distribution of proteins is achieved by a number of biological reactions, including the binding of targeting factors, the insertion of proteins into translocation pores, the specific interactions with chaperones that drive protein translocation and the specific degradation of missorted proteins by proteases. Thus, complex patterns presumably contribute to the intracellular distribution of proteins, and the iMLP predictor will provide an exciting tool to study the ability of iMTS-L sequences to modulate intracellular protein targeting and to elucidate their prevalence in the proteomes of different organisms.

Dataset and data processing
In order to prepare the test and training datasets, the first step is the generation of the $n$ suffixes $s_i$ of the protein sequence, where $n$ is the length of the original sequence and $i \in \{0, \dots, n-1\}$ is the number of amino acids truncated from the N-terminus of the original sequence to generate the suffix $s_i$. Subsequently, each suffix $s_i$ is subjected to TargetP MTS prediction (TargetP V1.1). The resulting prediction scores for each $s_i$ are then concatenated to a positional score vector $S$. Vector $S$ is then smoothed using a Savitzky-Golay filter (Savitzky and Golay 1964) with a window size equal to the expected value of the distribution of known MTS lengths. This results in the iMTS-L probability profile $P^{\text{iMTS-L}}_{a}$, where $a$ denotes the amino acid index in the sequence. To detect the actual iMTS-L regions and to separate them from noise, a normalization step is necessary, which is realized by extracting iMTS-L profiles from $n = 5000$ random sequences. The expected value $\mu^{\text{iMTS-L}}$ and the standard deviation $\sigma^{\text{iMTS-L}}$ of this distribution are then used to normalize the iMTS-L probability profile and estimate the iMTS-L propensity score $P^{\prime\,\text{iMTS-L}}_{a}$:

$$P^{\prime\,\text{iMTS-L}}_{a} = \frac{P^{\text{iMTS-L}}_{a} - \mu^{\text{iMTS-L}}}{\sigma^{\text{iMTS-L}}}$$

With this normalization, the $P^{\prime\,\text{iMTS-L}}_{a}$ score is 0 if the iMTS-L propensity at position $a$ along the sequence is equal to that of a random sequence and, for example, 1 if it is one standard deviation more MTS-like. The described processing procedure was then used to construct two learning data sets, named "non plant" and "plant" according to the TargetP model used. The "non plant" data set was based on the UniProt reference proteomes of Saccharomyces cerevisiae (UP000002311), Homo sapiens (UP000005640) and Mus musculus (UP000000589), yielding a set of 48654 protein sequences with about 25 million iMTS-L propensity scores. The "plant" data set was constructed using the reference proteomes of Arabidopsis thaliana (UP000006548), Zea mays (UP000007305) and Oryza sativa subsp. japonica (UP000059680), with a total size of 110539 protein sequences yielding ∼40 million iMTS-L propensity scores.
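The processing chain described above (suffix generation, per-suffix scoring, Savitzky-Golay smoothing, normalization against random sequences) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published pipeline: `toy_score` is a hypothetical stand-in for TargetP, the half-window `m = 2`, the quadratic filter order, the edge padding, and the length and uniform composition of the random sequences are all illustrative choices.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def suffixes(sequence: str) -> List[str]:
    """Return the n suffixes s_i, with i residues removed from the N-terminus."""
    return [sequence[i:] for i in range(len(sequence))]


def toy_score(suffix: str) -> float:
    """Hypothetical stand-in for TargetP: fraction of K/R in the first 10 residues."""
    head = suffix[:10]
    return sum(res in "KR" for res in head) / len(head)


def savgol_weights(m: int) -> List[float]:
    """Closed-form Savitzky-Golay smoothing weights (quadratic fit, window 2m + 1)."""
    denom = (2 * m - 1) * (2 * m + 1) * (2 * m + 3)
    return [(3 * (3 * m * m + 3 * m - 1) - 15 * j * j) / denom
            for j in range(-m, m + 1)]


def smooth(scores: List[float], m: int) -> List[float]:
    """Smooth the score vector S; ends are padded by repeating the edge values."""
    weights = savgol_weights(m)
    padded = [scores[0]] * m + list(scores) + [scores[-1]] * m
    return [sum(w * padded[a + k] for k, w in enumerate(weights))
            for a in range(len(scores))]


def raw_profile(sequence: str, score_fn: Callable[[str], float], m: int) -> List[float]:
    """Score every suffix and smooth the resulting positional score vector."""
    return smooth([score_fn(s) for s in suffixes(sequence)], m)


def background_stats(score_fn: Callable[[str], float], m: int, n_random: int = 5000,
                     length: int = 100, seed: int = 0) -> Tuple[float, float]:
    """Estimate mu and sigma from profiles of random sequences (assumed uniform)."""
    rng = random.Random(seed)
    values: List[float] = []
    for _ in range(n_random):
        seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        values.extend(raw_profile(seq, score_fn, m))
    return mean(values), pstdev(values)


def propensity_profile(sequence: str, score_fn: Callable[[str], float],
                       mu: float, sigma: float, m: int) -> List[float]:
    """Normalized propensity: (P_a - mu) / sigma for every position a."""
    return [(p - mu) / sigma for p in raw_profile(sequence, score_fn, m)]


# Small n_random keeps the illustration fast; the text above uses 5000.
mu, sigma = background_stats(toy_score, m=2, n_random=200)
profile = propensity_profile("MLSLRQSIRFFKPATRTLCSSRYLL", toy_score, mu, sigma, m=2)
print([round(p, 2) for p in profile])
```

In a real setting, `toy_score` would be replaced by a call to the actual MTS predictor and `m` would be derived from the expected MTS length; neither value is given here, so both remain placeholders.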

iMLP deep learning architecture
To set up a direct prediction of the iMTS-L propensity scores solely based on the protein sequence, we employ state-of-the-art deep learning techniques using the Microsoft Cognitive Toolkit (CNTK, available at https://github.com/Microsoft/CNTK). In order to recover sequential dependencies between amino acids of the protein sequence, we chose a recurrent neural network architecture based on long short-term memory (LSTM) cells. The model imposes no constraints on the lengths of the protein sequences used as input. Following an embedding layer (dimension = 24), which transforms each amino acid symbol into a vector representation, we used two bidirectional recurrent layers. Each bidirectional layer consists of two spliced LSTM layers (dimension = 100) to implement both a negative and a positive lookahead. The output is realized as a single-neuron dense layer.
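To give a feeling for how thin this architecture is, the layer stack can be walked through with a back-of-the-envelope parameter count. The sketch below is not the CNTK implementation: it assumes an alphabet of 20 amino acid symbols and the textbook LSTM cell (four gates, one bias vector per gate, no peephole connections), so the exact numbers of the released model may differ.

```python
from typing import List, Tuple


def lstm_params(input_dim: int, hidden_dim: int) -> int:
    """Parameters of one unidirectional LSTM layer under the textbook formulation."""
    # 4 gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def layer_summary(alphabet_size: int = 20, embed_dim: int = 24,
                  hidden_dim: int = 100) -> List[Tuple[str, int, int]]:
    """Return (layer name, output dimension per residue, parameter count)."""
    layers = [("embedding", embed_dim, alphabet_size * embed_dim)]
    in_dim = embed_dim
    for k in (1, 2):
        # Forward and backward LSTM; their outputs are spliced per position.
        layers.append((f"bilstm_{k}", 2 * hidden_dim,
                       2 * lstm_params(in_dim, hidden_dim)))
        in_dim = 2 * hidden_dim
    layers.append(("dense", 1, in_dim + 1))  # single output neuron plus bias
    return layers


for name, out_dim, n_params in layer_summary():
    print(f"{name:<10} out={out_dim:<4} params={n_params}")
print("total", sum(n for _, _, n in layer_summary()))
```

Under these assumptions the model has a few hundred thousand parameters, and each residue is mapped through dimensions 24, 200 and 200 to a single propensity value, independent of sequence length.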

Model training and evaluation
The model was optimized in a supervised fashion using the Adam optimizer (Kingma and Ba 2014), with the squared error between the calculated $P^{\prime\,\text{iMTS-L}}_{a}$ and the output of the final model layer as the loss function. Learning was carried out for 9000 epochs with a batch size of 2000; the learning rate was adjusted based on the performance in the last epoch. The optimization was carried out on a randomly sampled subset consisting of 90% of the constructed learning data set, leaving the remainder of the learning data set to validate the model on previously unseen data (Figure 2A).
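The evaluation logic (random 90/10 split, squared-error loss, Pearson correlation on held-out data) can be illustrated independently of any deep learning framework. The helper names below are hypothetical, and whether the original split was drawn over proteins or over individual scores is not stated, so the sketch simply splits a list of items.

```python
import math
import random
from typing import List, Sequence, Tuple


def train_test_split(items: Sequence, train_fraction: float = 0.9,
                     seed: int = 0) -> Tuple[list, list]:
    """Randomly split items into a training part and a held-out part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def squared_error(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Sum of squared differences between target and predicted propensities."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


train, test = train_test_split(range(100))
print(len(train), len(test))  # 90 10

# Invented target and predicted scores, for illustration only.
targets = [-0.5, 0.1, 1.2, 0.8, -0.3]
predicted = [-0.4, 0.0, 1.1, 0.9, -0.2]
print(squared_error(targets, predicted), pearson(targets, predicted))
```

A correlation close to 1 on the held-out part indicates that the network reproduces the TargetP-derived scores; it does not by itself say anything about agreement with experimental data.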

iMTS-L protein propensity estimation
To facilitate the comparison across sequences, an iMTS-L protein propensity is calculated that summarizes the iMTS-L profile information in a single score. This overall iMTS-L protein propensity score for a given sequence is calculated by averaging over all $P^{\prime\,\text{iMTS-L}}_{i}$ that are higher than the propensities of random sequences (that is, by summing up the non-negative normalized propensity scores of all amino acids of the sequence).
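The verbal description leaves open which denominator the average uses. The sketch below therefore offers both readings as an explicit assumption: dividing the sum of positive normalized scores by the sequence length, or by the number of positive positions. The function name and the `denominator` parameter are hypothetical.

```python
from typing import Sequence


def protein_propensity(normalized_profile: Sequence[float],
                       denominator: str = "length") -> float:
    """Summarize a normalized propensity profile in a single score.

    denominator="length":   sum of positive scores / sequence length
    denominator="positive": sum of positive scores / number of positive positions
    Both readings are assumptions about the verbal description.
    """
    positive = [p for p in normalized_profile if p > 0]
    if denominator == "length":
        return sum(positive) / len(normalized_profile)
    if denominator == "positive":
        return sum(positive) / len(positive) if positive else 0.0
    raise ValueError("denominator must be 'length' or 'positive'")


# Invented normalized scores, for illustration only.
example = [-1.0, 0.5, 1.5, -0.5]
print(protein_propensity(example))              # 0.5
print(protein_propensity(example, "positive"))  # 1.0
```

The length-normalized reading penalizes proteins in which only a small fraction of residues lies in iMTS-L stretches, whereas the second reading reports the mean height of the stretches themselves.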

Figure 1: Schematic overview illustrating model building and consumption. (A) Based on the protein sequences of the yeast, mouse and human proteomes, we used the previously published workflow to calculate LSTM scores utilizing the independent calculation of TargetP scores followed by downstream normalization. The resulting data set consisting of 25,000,000 iMTS-L propensity scores was subsequently split in a 9:1 ratio into a training and a test set used for training and evaluation, respectively. (B) The multilayer LSTM-based RNN was then trained to directly predict the iMTS-L propensity scores by minimizing the squared error between the model prediction and the scores assigned to the training set. The generalization of the model was verified by the correlation with the respective scores of the test set. (C) The resulting iMLP model can now be consumed using the iMLP webservice, directly mapping one or more user-defined amino acid sequences to the corresponding iMTS-L propensity score profiles.

Figure 2: Validation of the deep learning-based iMTS-L propensity score prediction. (A) The genome-wide model performance was validated by comparing model-based iMTS-L propensity score predictions with previously unseen validation data consisting of 10% of the total data set with ∼25 million entries. Considering the imperfect nature of the initial TargetP-based iMTS-L propensity score calculation, the computed Pearson correlation of 0.991 indicates that the model is capable of creating a direct mapping between protein sequence and iMTS-L propensity score. (B) The exemplary comparison of scores for the well-studied Atp1 protein sequence shows that the prediction approaches of iMLP and TargetP1 agree in predicting the respective iMTS-L positions (orange and blue), while TargetP2 shows a very different profile (yellow).

Figure 3: Normalization of the TargetP score leads to an improved signal-to-noise ratio. Analyzing the SARS-CoV2 protein Orf9b, a recently verified Tom70 interaction partner (Jiang et al. 2020), the iMTS-L propensity score clearly separates the predicted target sequence stretch (score > 0) from the background noise (score < 0).

Figure 4: The iMLP window allows simple prediction of iMTS-L propensity scores. The figure shows a screenshot of the iMLP window. The FASTA sequence of the Orf9b protein of SARS-CoV2 was used for iMTS-L prediction. The output window indicates the iMTS-L probability scores as a heatmap-colored sequence as well as a profile. The Tom70 binding region of the Orf9b sequence (Jiang et al. 2020) is predicted with a high probability score by iMLP.