
Bulletin of the Polish Academy of Sciences Technical Sciences

The Journal of Polish Academy of Sciences

6 Issues per year

IMPACT FACTOR 2016: 1.156
5-year IMPACT FACTOR: 1.238

CiteScore 2016: 1.50

SCImago Journal Rank (SJR) 2016: 0.457
Source Normalized Impact per Paper (SNIP) 2016: 1.239

Open Access
Volume 60, Issue 3


Efficient alternatives to PSI-BLAST

M. Startek / S. Lasota / M. Sykulski / A. Bułak / L. Noé / G. Kucherov / A. Gambin
  • Institute of Informatics, University of Warsaw, 2 Banacha St., 02-097 Warszawa, Poland / Mossakowski Medical Research Centre Polish Academy of Sciences, 5 Pawińskiego St., 02-106 Warszawa, Poland
Published Online: 2012-12-22 | DOI: https://doi.org/10.2478/v10175-012-0063-0


In this paper we present two algorithms that may serve as efficient alternatives to the well-known PSI-BLAST tool: SeedBLAST and CTX-PSI-BLAST. Both can benefit from knowledge of the amino acid composition specific to a given protein family: SeedBLAST uses a purposely designed seed, while CTX-PSI-BLAST extends PSI-BLAST with a context-specific substitution model. The seeding technique has become central to the theory of sequence alignment. Several efficient tools apply seeds to DNA homology search, but not to protein homology search; in this paper we fill this gap. We advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method uses an evolutionary algorithm to compute seeds that are specifically designed for a given protein family. The seeds are represented by deterministic finite automata (DFAs) and built into the NCBI-BLAST software. This extended tool, named SeedBLAST, is compared to the original BLAST and PSI-BLAST on several protein families. Our results demonstrate the superiority of SeedBLAST in terms of efficiency, especially for twilight-zone hits. The contextual substitution model has been shown to increase the sensitivity of protein alignment. In this paper we take the next step in the contextual alignment program: we present a contextual version of PSI-BLAST, the iterative version of the NCBI-BLAST tool. An experimental evaluation demonstrates significantly higher sensitivity compared to the ordinary PSI-BLAST algorithm.

Keywords: PSI-BLAST tool; sequence alignment; seeding technique.
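The idea of a subset seed mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each seed letter stands for a partition of the amino acid alphabet, and that two residues match at a seed position when they fall into the same block of that partition. The seed letters `#`, `@` and `_`, the residue groupings, and the function names `seed_key` and `seed_hits` are hypothetical and chosen for illustration only; they are not the seeds, the tree, or the DFA-based implementation used in SeedBLAST.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical seed alphabet: each letter names a partition of the residues.
# "#" requires identity, "@" uses an illustrative coarse grouping,
# "_" accepts any pair of residues.
PARTITIONS = {
    "#": list(AMINO_ACIDS),
    "@": ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"],
    "_": [AMINO_ACIDS],
}

# For every seed letter, map each residue to the index of its block.
CLASS_OF = {
    letter: {aa: idx for idx, block in enumerate(blocks) for aa in block}
    for letter, blocks in PARTITIONS.items()
}


def seed_key(seed, word):
    """Reduce a word to the tuple of block indices selected by the seed."""
    return tuple(CLASS_OF[letter][aa] for letter, aa in zip(seed, word))


def seed_hits(seed, query, subject):
    """Return (query_pos, subject_pos) pairs where the seed matches."""
    span = len(seed)
    # Index every query window by its reduced key.
    index = defaultdict(list)
    for i in range(len(query) - span + 1):
        index[seed_key(seed, query[i:i + span])].append(i)
    # Scan the subject and look up windows with the same reduced key.
    hits = []
    for j in range(len(subject) - span + 1):
        for i in index.get(seed_key(seed, subject[j:j + span]), []):
            hits.append((i, j))
    return hits
```

For example, `seed_hits("#@_#", "MKVLA", "MRALA")` reports a hit at `(0, 0)`, because K and R share a block under `@` and the third position is unconstrained, whereas the exact seed `"####"` finds no hit in the same pair. The sketch only handles the 20 standard residues and raises `KeyError` on any other symbol.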


About the article


Published in Print: 2012-12-01

Citation Information: Bulletin of the Polish Academy of Sciences: Technical Sciences, Volume 60, Issue 3, Pages 495–505, ISSN (Print) 0239-7528, DOI: https://doi.org/10.2478/v10175-012-0063-0.


This content is open access.
