Amino acid repeats play important roles in both the structure and function of proteins. They are found in all kingdoms of life, especially in eukaryotes, and a large fraction of human proteins is composed of repeats. Further, the abnormal expansion of shorter repeats causes various human diseases. Therefore, analysing the repeats of the entire human proteome together with functional, mutational and disease information would help to better understand their roles in proteins. To fulfill this need, we developed a web database, HPREP (http://bioinfo.bdu.ac.in/hprep), for human proteome repeats using Perl and HTML. We identified the different categories of well-characterized repeats and domain repeats present in the human proteome of UniProtKB/Swiss-Prot using in-house Perl programs, and novel repeats using the T-REKS repeat detection tool as well as the XSTREAM web server. These proteins are annotated with functional, mutational and disease information and grouped according to specific repeat types. The database enables users to search by specific repeat type in order to understand the involvement of repeats in proteins. Thus, the HPREP database is expected to be a useful resource for gaining better insight into the different repeats in the human proteome and their biological roles.
Amino acid repeats, which are commonly found in all kingdoms of life, play essential roles in both the structure and function of proteins. Well-characterized repeats such as leucine-rich repeats (LRR), ankyrin (ANK) and armadillo repeats have been extensively analysed with regard to protein structure and function. The roles played by the domain repeats of immunoglobulin, human matrix metalloproteinase and zinc finger type proteins in protein–protein interaction as well as in binding to DNA or RNA have been observed. Several web servers and tools for repeat detection, such as RADAR, TRUST, XSTREAM, T-REKS and PTRStalker, have been developed; they rely on different methods, including sub-optimal alignment, short seed expansion, K-means clustering and normalized BLOSUM-weighted edit distance. Further, the repeats in the sequences of the Protein Data Bank (PDB) and UniProtKB/Swiss-Prot identified by using RADAR have been analyzed at the structural and functional level. Several databases of amino acid repeats from different sets of protein sequences have been constructed for large-scale analysis. RepSeq is a database of repeats in lower eukaryotic pathogens obtained by searching for identical shorter amino acid repeats; PTRStalkerDB contains repeats of Swiss-Prot identified by using the PTRStalker algorithm; and the ProRepeat database contains repeats in UniProt as well as in some eukaryotic proteomes of the RefSeq collection, obtained with a suffix tree algorithm. PRDB includes repeats found in the sequences of (i) the NR (non-redundant) data bank of NCBI, (ii) PDB and (iii) Swiss-Prot, obtained by using the T-REKS program. The IR-PDB database has been developed for the different repeat patterns (tandem repeats, non-tandem repeats, shorter repeats and long repeats) identified in the sequences of PDB by using RADAR. The various repeat detection algorithms, web servers, tools and databases, together with their online availability, have been listed previously.
It has been observed that protein repeats not only carry out important biological functions but are also related to several diseases. A high incidence of tandem repeats has been found in the sequences of virulence factors of pathogenic agents, toxins and allergens, and in other disease-related sequences.
Earlier, the Human Genome Project (HGP) provided the basis for a draft map of the complete human proteome, with approximately 20,000 protein-coding genes. These proteins are expertly curated and annotated with amino acid sequence, protein name, protein family, repeats, domain, function, mutation and disease in UniProtKB/Swiss-Prot. The normal functions of single amino acid repeats and the role of their abnormal expansion in several human diseases have been studied. Further, the PolyQ 2.0 database of polyglutamine (polyQ) repeats in human proteins, with functional, domain and single point mutation information, has been developed. It has also been observed that 15–20% of human proteins contain repeats longer than 5 residues [32]. However, there is no dedicated repository for human proteins containing the different categories of longer repeats. Towards this goal, we developed HPREP, a web database that comprises the different categories of well-characterized, domain and novel repeats of human proteins together with functional, mutational and disease information.
The workflow for generating and accessing the HPREP database is shown as a flowchart (Figure 1). First, we detect the well-characterized repeats and domain repeats present in the human protein sequences of UniProtKB/Swiss-Prot. Then, the proteins with no repeat assignment are analysed for novel repeats using the T-REKS repeat detection tool as well as the XSTREAM web server. Next, the functions, mutations and diseases of the repeat proteins are assigned, and the results are assembled into the HPREP database. The database can be accessed either by entering a UniProt ID or protein name, or by selecting a specific well-characterized repeat name, domain name or novel-repeat-containing protein. The database was implemented with a set of Perl programs that create and serve it in a semi-automatic manner. The database is updated by manually launching a dedicated module that imports the sequences added to UniProtKB/Swiss-Prot since the last update and then extends the database with the well-characterized, domain and novel tandem repeats identified in those proteins.
The complete set of 20,385 human proteins in UniProtKB/Swiss-Prot as of 30 November 2019 was obtained with UniProt ID, amino acid sequence, sequence length, protein name, protein family and gene, together with the annotation sections function, family & domain, and pathology & biotech, and stored in a file. The functional annotation contains the general function, active site, binding site, motif, calcium binding and nucleotide binding of the protein. The family & domain section contains the details of the repeats, motifs, domains and zinc finger types of the protein. The repeat section lists the repeated regions, with a specific name such as LRR, Ankyrin, Kelch or HEAT where one exists and without a name otherwise. The motif section contains the motif regions and their functions. The domain section includes the functional domains as well as their regions. A specific annotation is available for zinc finger type proteins, which contains the zinc finger domains and their regions. The pathology & biotech section provides the mutated residues and diseases of the protein.
The 1955 proteins annotated with repeats were identified using an in-house Perl program. The protein name, protein family, sequence, length, function, mutation and disease of these repeat proteins were then retrieved from the stored information, and the proteins were grouped according to specific repeat names for further analysis.
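The grouping step can be illustrated with a short sketch. It is written in Python for brevity and is not the in-house Perl program; the record layout `(uniprot_id, label, start, end)`, the label format "LRR 1", the function names and the identifiers `HYPO_A` and `HYPO_B` are all assumptions made for illustration. Only the P16473 LRR regions are taken from the text.

```python
import re
from collections import defaultdict


def repeat_type(label):
    """Reduce a repeat label to its type, e.g. 'LRR 3' -> 'LRR'.

    Labels that are empty or purely numeric are treated as unnamed repeats.
    """
    name = re.sub(r"\s+\d+$", "", label.strip())
    return name if name and not name.isdigit() else "unnamed"


def group_by_repeat_type(records):
    """Map each repeat type to the sorted list of proteins that contain it."""
    groups = defaultdict(set)
    for uniprot_id, label, _start, _end in records:
        groups[repeat_type(label)].add(uniprot_id)
    return {name: sorted(ids) for name, ids in groups.items()}


# Assumed record layout; HYPO_A and HYPO_B are placeholder identifiers.
records = [
    ("P16473", "LRR 1", 100, 124),
    ("P16473", "LRR 2", 125, 150),
    ("HYPO_A", "WD 1", 10, 50),
    ("HYPO_A", "WD 2", 60, 100),
    ("HYPO_B", "1", 5, 30),
]
groups = group_by_repeat_type(records)
```

With these inputs, `groups` maps "LRR" to P16473, "WD" to the first placeholder protein and "unnamed" to the second, which mirrors the split between named repeat types and repeats with no specific name.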
The 8446 human proteins that have functional domain assignments were analyzed for the presence of repeated domains of the same family using an in-house Perl program. The 1786 zinc-finger type proteins were also analyzed for repeats in zinc finger domains. The protein name, family, sequence, length, function, mutation and disease of the 3018 proteins identified with domain repeats were then retrieved from the stored information, and these proteins were grouped according to domain names for further analysis.
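One simple way to express "domains repeated within the same family" is to count how often each domain name occurs in a protein and keep those that occur at least twice. The sketch below (Python, not the authors' Perl program) assumes domains are available as `(name, start, end)` tuples; the function name, the `min_copies` parameter and the "Example domain" entry are hypothetical. The two Ig-like C2-type regions of P30530 are the ones quoted later in the text.

```python
from collections import Counter


def repeated_domains(domains, min_copies=2):
    """Return {domain name: copy number} for domains present at least min_copies times."""
    counts = Counter(name for name, _start, _end in domains)
    return {name: n for name, n in counts.items() if n >= min_copies}


# Ig-like C2-type regions of P30530 as given in the text; the third entry
# is a made-up single-copy domain added only to show that it is filtered out.
p30530_domains = [
    ("Ig-like C2-type", 27, 128),
    ("Ig-like C2-type", 139, 222),
    ("Example domain", 300, 400),
]
domain_repeats = repeated_domains(p30530_domains)
```

Here `domain_repeats` contains only the Ig-like C2-type domain with a copy number of 2, so the protein would be counted as a domain-repeat protein under this assumed criterion.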
UniProtKB/Swiss-Prot proteins are annotated with repeats if they have repeated sequence motifs or repeated domains of a specific protein or protein family. However, repeat patterns in proteins that have not been studied remain uncharacterised; such repeats can be identified without prior knowledge by using novel repeat detection algorithms. In this study, we used the T-REKS tool, which uses a K-means clustering algorithm, and the XSTREAM web server, which uses the short seed extension method, since both methods have been observed to be more efficient in producing true repeats from large sequence databases. The novel tandem repeats of the proteins were identified by running T-REKS (http://bioinfo.montp.cnrs.fr/?r=t-reks) with a length variability of 20%, a similarity threshold of 0.7 and overlaps disallowed. The XSTREAM web server (https://amnewmanlab.stanford.edu/xstream) was run with moderate degeneracy, high significance, a minimum word match similarity of 0.7 and redundancy removal. For both programs we set the similarity threshold to 0.7 and the minimal total length of a tandem repeat to 14 residues, since repeats meeting these criteria are considered true repeats with potential biological meaning.
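The two acceptance criteria (similarity of at least 0.7 and a total length of at least 14 residues) can be sketched as a post-filter over detected repeats. This is a minimal Python illustration, not the way T-REKS or XSTREAM apply their own parameters internally; the `TandemRepeat` record, its field names and the example values are hypothetical.

```python
from dataclasses import dataclass

MIN_SIMILARITY = 0.7     # similarity threshold quoted in the text
MIN_TOTAL_LENGTH = 14    # minimal total repeat length in residues


@dataclass
class TandemRepeat:
    """Hypothetical record for one detected tandem repeat region."""
    protein_id: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    similarity: float   # similarity score reported by the detection tool

    @property
    def total_length(self):
        return self.end - self.start + 1


def is_accepted(repeat):
    """Apply both thresholds; a repeat must pass each of them."""
    return (repeat.similarity >= MIN_SIMILARITY
            and repeat.total_length >= MIN_TOTAL_LENGTH)


# Made-up candidates: one passes, one is too dissimilar, one is too short.
candidates = [
    TandemRepeat("HYPO_1", 10, 40, 0.85),
    TandemRepeat("HYPO_2", 5, 60, 0.55),
    TandemRepeat("HYPO_3", 100, 110, 0.95),
]
accepted = [r.protein_id for r in candidates if is_accepted(r)]
```

Only the first candidate survives: the second fails the similarity threshold and the third spans 11 residues, below the 14-residue minimum.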
The repeats obtained are categorized into (i) well-characterized repeats of 84 different repeat types and 320 other repeats with no specific name, (ii) domain repeats of 193 functional domains and 37 zinc finger domains and (iii) novel tandem repeats, and are made available as the web database HPREP, developed using Perl and HTML. The database displays the different categories of human repeats along with the functional, mutational and disease information of the proteins for further analysis. Figure 1 shows the generation and accession of the database in the form of a flowchart.
The HPREP database can be accessed at http://bioinfo.bdu.ac.in/hprep. Figure 2 shows the home page, which gives details about the database and offers a search option for repeats by either UniProt ID or protein name. Figure 2A shows the LRR repeats (100–124/125–150/152–174/176–199/200–223/227–248/250–271) of the Thyrotropin receptor protein together with their sequence region, function, motif, mutation and disease, obtained by entering the UniProt ID of the protein (P16473). A search by protein name (Coronin) (Figure 2B) displays a list of seven proteins whose names contain Coronin and that have WD repeats; further details can be obtained by clicking the corresponding UniProt ID link. The database can also be searched by specific repeat name within the categories of (i) well-characterized repeats, (ii) domain repeats and (iii) novel tandem repeats of human proteins. Further, the analysis of the number of human proteins with repeats shows that approximately 25% of them contain repeats.
The 84 repeat types and their numbers of occurrences show that LRR repeats (309 proteins), WD repeats (272 proteins), ANK repeats (250 proteins) and TPR repeats (155 proteins) are more abundant in human proteins than the other types (Figure 3). The functional and disease details of the proteins containing a specific repeat can be obtained by selecting the desired repeat type (e.g. LRR repeats). For example, Figure 3 lists, for the 309 LRR repeat proteins, the UniProt ID, protein name, sequence length and LRR repeat regions, and indicates whether each protein has general functions, active sites, binding sites, motifs, calcium binding, nucleotide binding, mutations and diseases. From Figure 3, we observed that Nucleotide-Binding Domain, Leucine-Rich Repeat Proteins (NLP), Preferentially Expressed Antigen in Melanoma (PRAME) and Toll-like receptor family proteins contain these repeats in numbers ranging from a single pair to as many as 30 copies, each 20–30 residues in length. The numbers of LRR proteins with functions (202 proteins), diseases (53 proteins), active sites (9 proteins), binding sites (8 proteins), motifs (17 proteins), calcium binding (1 protein), nucleotide binding (25 proteins) and mutations (62 proteins) are displayed with links on the right side of the page for viewing their details. The 212 functional details show the diverse functions of LRR proteins, including signal transduction, cell adhesion and DNA damage repair. While analysing the functional regions (active site, binding site, motif, calcium binding and nucleotide binding) of LRR proteins, we observed binding sites within the LRR region; for example, Toll-like receptor 9 (Q9NR96) binds cytidine-phosphate-guanosine (CpG) DNA through residues 132 and 208, which lie in the LRR regions 122–147 and 198–221, respectively. However, most of the functional regions are not in the LRR region.
Further, the diseases associated with LRR proteins, such as myopia, epilepsy, rheumatoid arthritis and sclerosis, are reported. Of the 62 mutations, 17 lie in the repeat regions, for example the mutations (358 C->A: 358 C->S) that cause loss of binding to CRY1 (Cryptochrome Circadian Regulator 1) in the LRR repeats (119–146/181–207/208–233/234–259/316–341/343–368/369–394) of F-box/LRR-repeat protein 3 (Q9UKT7). The involvement of repeats in disease is also observed; for example, mutations in the LRR repeats of Toll-like receptor 3 (O15455) (95 C->A: 122 C->A: 196 N->G: 247 N->R) cause a reduced response to double-stranded RNA, which leads to acute infection-induced (herpes-specific) encephalopathy-2 (IIAE2) (Figure 4). This suggests that LRR repeats contribute to general protein function and that mutations within them can sometimes cause disease. Likewise, the roles of other well-characterized repeats can be examined using the database.
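Deciding whether a mutated or binding residue falls inside a repeat amounts to an interval check against the listed repeat regions. The following Python sketch shows this for the two examples given in the text; the function names are hypothetical, and the region strings are copied from the text in its `start–end/start–end` notation.

```python
import re


def parse_regions(text):
    """Parse 'start–end/start–end/...' (en dash or hyphen) into (start, end) tuples."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\s*[–-]\s*(\d+)", text)]


def region_containing(position, regions):
    """Return the first region that contains the residue position, or None."""
    for start, end in regions:
        if start <= position <= end:
            return (start, end)
    return None


# LRR repeats of F-box/LRR-repeat protein 3 (Q9UKT7), as listed in the text.
fbxl3_lrr = parse_regions("119–146/181–207/208–233/234–259/316–341/343–368/369–394")
mutation_region = region_containing(358, fbxl3_lrr)

# LRR regions of Toll-like receptor 9 (Q9NR96) quoted for the CpG-DNA binding residues.
tlr9_lrr = parse_regions("122–147/198–221")
binding_regions = [region_containing(pos, tlr9_lrr) for pos in (132, 208)]
```

Residue 358 of Q9UKT7 falls in the repeat 343–368, and residues 132 and 208 of Q9NR96 fall in 122–147 and 198–221, in agreement with the observations described above. A position outside every region returns `None`.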
The functional domain repeats and their numbers of occurrences show that Ig-like C2-type is the most abundant in human proteins, followed by EF-hand, Fibronectin type-III and Cadherin. The details of the proteins containing a specific domain can be obtained by selecting the desired domain. The details of the 205 Ig-like C2-type proteins show that Carcinoembryonic antigen-related cell adhesion molecule, Leukocyte immunoglobulin-like receptor subfamily A member 3 and Vascular cell adhesion proteins generally contain these repeats in numbers ranging from a single pair to as many as 42 copies, each 72–100 residues in length. The number of Ig-like C2-type proteins with functions and diseases can also be obtained. The 172 functional details show the cell–cell adhesion and immune response modulation functions of the proteins. From the analysis of functional regions in these repeats, we observed motifs within the Ig-like C2-type region, such as the EWI motif (250–252) of Immunoglobulin superfamily member 3 (O75054), which lies in the Ig-like C2-type region 143–262. Further, associated diseases of the repeat proteins, including cardiomyopathy, hearing loss, and bladder and prostate cancer, are observed. Of the 52 mutations, 15 lie in the repeats, such as the mutations (63 E->R: 66 E->R: 84 T->R) related to binding of the growth factor GAS6 in the repeat region (27–128/139–222) of tyrosine-protein kinase receptor (P30530). Likewise, the functions and diseases of specific domain proteins, as well as the involvement of their repeats in functions and mutations, can be examined using the database.
In addition to the repeats annotated in UniProtKB, 132 tandem repeats of proteins detected using T-REKS and 202 proteins with repeats detected using XSTREAM can be obtained. On checking whether the two sets overlap or are completely different, we observed 92 protein repeats that cover nearly the same regions in both programs, because closely matching definitions of the similarity threshold and minimal total length were given to the two programs. The proteins with novel repeats, along with information on whether they have a general function, active site, binding site, motif, calcium binding, nucleotide binding, mutation and disease, can be obtained by clicking the corresponding links.
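The text does not define "nearly the same regions", so the comparison between the two programs can only be sketched under an assumed criterion. The Python example below uses the ratio of shared to combined residues (a Jaccard-style overlap) with a hypothetical cutoff of 0.8, and simplifies each protein to a single repeat region per program; the function names, the cutoff and the example identifiers and coordinates are all illustrative.

```python
def overlap_fraction(a, b):
    """Shared residues divided by combined residues of two inclusive (start, end) regions."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if shared <= 0:
        return 0.0
    combined = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - shared
    return shared / combined


def shared_repeats(treks, xstream, min_overlap=0.8):
    """Proteins whose regions from both programs overlap by at least min_overlap."""
    shared = []
    for protein_id, region in treks.items():
        other = xstream.get(protein_id)
        if other is not None and overlap_fraction(region, other) >= min_overlap:
            shared.append(protein_id)
    return sorted(shared)


# Made-up regions: HYPO_1 agrees closely, HYPO_2 does not, HYPO_3 is T-REKS only.
treks_regions = {"HYPO_1": (10, 50), "HYPO_2": (10, 50), "HYPO_3": (5, 40)}
xstream_regions = {"HYPO_1": (12, 50), "HYPO_2": (60, 90), "HYPO_4": (1, 30)}
common = shared_repeats(treks_regions, xstream_regions)
```

In this toy input only the first protein is reported by both programs with closely matching boundaries. A stricter or looser cutoff, or a comparison of every region pair per protein, would change the count.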
We compared our database with the PRDB database (http://bioinfo.montp.cnrs.fr/?r=repeatDB), which contains tandem repeats of UniProtKB/Swiss-Prot detected using the T-REKS program. We observed that PRDB contains 848 human tandem repeats of length >14, with an entry for each repeat of a protein, whereas our database includes 672 proteins with repeats covering the same regions, drawn from the UniProtKB annotated repeats as well as the novel tandem repeats. For example, for the Nuclear receptor subfamily 0 group B member 1 (P51843) protein, PRDB shows the repeats (8–273) with general function, Gene Ontology, subcellular localization and Pfam domain, whereas our database shows the well-characterised repeats (1–253), gene, domain, general function, disease, the LXXLL motifs (13–17/80–84/146–150) for transcription factor binding and the mutations (16–17 (ML->AA); 83–84 (ML->AA); 149–150 (LL->AA)) that inhibit the transcriptional activity of the protein (Figure 5). PRDB additionally shows the consensus patterns and structure-forming potential of the repeats and supports searching for similar repeats within PRDB, which are useful for functional analysis. Such details from PRDB could be used to enrich the HPREP database in the future for better analysis of human repeats.
The human proteome comprises repeats in nearly one third of its proteins, and a significant portion of repeat proteins have been observed to carry fundamental functional roles. Furthermore, the high incidence of repeats in virulence factors and in amyloidogenic, prion and other disease-related protein sequences suggests that repeats not only perform biological functions but are also related to a number of human diseases. The HPREP database of human proteome repeats has been developed with the aim of understanding the different categories of repeats in human proteins and their involvement in the function, mutation and disease of the proteins. This knowledge could be used to better understand the underlying roles of repeats in human proteins for drug development and to identify promising biomolecules for diagnostic and prognostic purposes.
2. Matsushima, N, Enkhbayar, P, Kamiya, M, Osaki, M, Kretsinger, RH. Leucine–Rich Repeats (LRRs): structure, function, evolution and interaction with ligands. Drug Des Rev 2005;2:305–22. https://doi.org/10.2174/1567269054087613.
3. Batrukova, MA, Betin, VL, Rubtsov, AM, Lopina, OD. Ankyrin: structure, properties, and functions. Biochemistry 2000;65:395–408.
5. Vetting, MW, Hegde, SS, Fajardo, JE, Fiser, A, Roderick, SL, Takiff, HE, et al. Pentapeptide repeat proteins. Biochemistry 2006;45:1–10. https://doi.org/10.1021/bi052130w.
6. Sawaya, MR, Wojtowicz, WM, Andre, I, Qian, B, Wu, W, Baker, D, et al. A double shape provides the structural basis for the extraordinary binding specificity of Dscam isoforms. Cell 2008;134:1007–18. https://doi.org/10.1016/j.cell.2008.07.042.
7. Elkins, PA, Ho, YS, Smith, WW, Janson, CA, D’Alessio, KJ, McQueney, MS, et al. Structure of the C-terminally truncated human ProMMP9, a gelatin-binding matrix metalloproteinase. Acta Crystallogr D Biol Crystallogr 2002;58:1182–92. https://doi.org/10.1107/s0907444902007849.
8. Lee, MS, Gippert, GP, Soman, KV, Case, DA, Wright, PE. Three-dimensional solution structure of a single zinc finger DNA-binding domain. Science 1989;245:635–7. https://doi.org/10.1126/science.2503871.
9. Heger, A, Holm, L. Rapid automatic detection and alignment of repeats in protein sequences. Proteins 2000;41:224–37. https://doi.org/10.1002/1097-0134(20001101)41.2<224::aid-prot70>3.0.co;2-z.
11. Jorda, J, Kajava, AV. T-REKS: identification of tandem repeats in sequences with a K-means based algorithm. Bioinformatics 2009;25:2632–8. https://doi.org/10.1093/bioinformatics/btp482.
12. Newman, A, Cooper, J. Xstream: a practical algorithm for identification and architecture modeling of tandem repeats in protein sequences. BMC Bioinf 2007;8:382. https://doi.org/10.1186/1471-2105-8-382.
13. Pellegrini, M, Renda, ME, Vecchio, A. Ab initio detection of fuzzy amino acid tandem repeats in protein sequences. BMC Bioinf 2012;13(3 Suppl):S8. https://doi.org/10.1186/1471-2105-13-S3-S8.
14. Mary Rajathei, D, Selvaraj, S. Analysis of sequence repeats of proteins in the PDB. Comput Biol Chem 2013;47:156–66. https://doi.org/10.1016/j.compbiolchem.2013.09.001.
15. Rajathei, DM, Parthasarathy, S, Selvaraj, S. Identification and analysis of long repeats of proteins at the domain level. Front Bioeng Biotechnol 2019;7:250. https://doi.org/10.3389/fbioe.2019.00250.
16. Depledge, DP, Lower, RP, Smith, DF. Repseq – a database of amino acid repeats present in lower eukaryotic pathogens. BMC Bioinf 2007;8:122. https://doi.org/10.1186/1471-2105-8-122.
17. Luo, H, Lin, K, David, A, Nijveen, H, Leunissen, JAM. Prorepeat: an integrated repository for studying amino acid tandem repeats in proteins. Nucleic Acids Res 2012;40:D394–9. https://doi.org/10.1093/nar/gkr1019.
19. Selvaraj, S, Rajathei, M. A web database IR-PDB for sequence repeats of proteins in the Protein Data Bank. Int J Knowl Discov Bioinf 2017;7:1–10. https://doi.org/10.4018/IJKDB.2017070101.
22. Baxa, U, Cassese, T, Kajava, AV, Steven, AC. Structure, function, and amyloidogenesis of fungal prions: filament polymorphism and prion variants. Adv Protein Chem 2006;73:125–80. https://doi.org/10.1016/S0065-3233(06)73005-4.
27. Orr, HT, Zoghbi, HY. Trinucleotide repeat disorders. Annu Rev Neurosci 2007;30:575–621. https://doi.org/10.1146/annurev.neuro.29.051605.113042.
29. Lieberman, AP, Shakkottai, VG, Albin, RL. Polyglutamine repeats in neurodegenerative diseases. Annu Rev Pathol Mech Dis 2019;14:1–27. https://doi.org/10.1146/annurev-pathmechdis-012418-012857.
30. Li, C, Nagel, J, Androulakis, S, Lupton, CJ, Song, J, Buckle, AM. PolyQ 2.0: an improved version of PolyQ, a database of human polyglutamine proteins. Database (Oxford) 2016;2016:1–8. https://doi.org/10.1093/database/baw050.
31. Karlin, S, Brocchieri, L, Bergman, A, Mrazek, J, Gentles, AJ. Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci USA 2002;99:333–8. https://doi.org/10.1073/pnas.012608599.