Bulletin of the Polish Academy of Sciences Technical Sciences

The Journal of Polish Academy of Sciences

6 Issues per year

IMPACT FACTOR 2016: 1.156
5-year IMPACT FACTOR: 1.238

CiteScore 2016: 1.50

SCImago Journal Rank (SJR) 2015: 0.526
Source Normalized Impact per Paper (SNIP) 2015: 1.208

Open Access
Volume 60, Issue 3 (Dec 2012)


Efficient alternatives to PSI-BLAST

M. Startek
  • Institute of Informatics, University of Warsaw, 2 Banacha St., 02-097 Warszawa, Poland
/ S. Lasota
  • Institute of Informatics, University of Warsaw, 2 Banacha St., 02-097 Warszawa, Poland
/ M. Sykulski
  • Institute of Informatics, University of Warsaw, 2 Banacha St., 02-097 Warszawa, Poland
/ A. Bułak
  • Institute of Informatics, University of Warsaw, 2 Banacha St., 02-097 Warszawa, Poland
/ L. Noé
  • LIFL/CNRS/INRIA, Bât. M3, Campus Scientifique, Villeneuve d’Ascq, France
/ G. Kucherov
  • Laboratoire d’Informatique Gaspard-Monge, Marne-la-Vallée, France
/ A. Gambin
  • Corresponding author
  • Institute of Informatics, University of Warsaw, 2 Banacha St., 02-097 Warszawa, Poland / Mossakowski Medical Research Centre Polish Academy of Sciences, 5 Pawińskiego St., 02-106 Warszawa, Poland
Published Online: 2012-12-22 | DOI: https://doi.org/10.2478/v10175-012-0063-0


In this paper we present two algorithms that may serve as efficient alternatives to the well-known PSI-BLAST tool: SeedBLAST and CTX-PSI BLAST. Both can exploit knowledge of the amino acid composition specific to a given protein family: SeedBLAST uses a purposely designed seed, while CTX-PSI BLAST extends PSI-BLAST with a context-specific substitution model.

The seeding technique has become central to the theory of sequence alignment. Several efficient tools apply seeds to DNA homology search, but not to protein homology search. In this paper we fill this gap. We advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method uses an evolutionary algorithm to compute seeds that are specifically designed for a given protein family. The seeds are represented by deterministic finite automata (DFAs) and built into the NCBI-BLAST software. This extended tool, named SeedBLAST, is compared to the original BLAST and PSI-BLAST on several protein families. Our results demonstrate the superiority of SeedBLAST in terms of efficiency, especially in the case of twilight zone hits.

The contextual substitution model has been shown to increase the sensitivity of protein alignment. In this paper we take the next step in the contextual alignment program: we introduce a contextual version of PSI-BLAST, the iterative version of the NCBI-BLAST tool. An experimental evaluation demonstrates significantly higher sensitivity compared to the ordinary PSI-BLAST algorithm.
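
To make the notion of a subset seed concrete, the following minimal Python sketch tests seed hits between two protein fragments. It assumes that each seed position is a partition of the amino acid alphabet into groups and that two residues match at a position when they fall into the same group. The groupings, the example seed and the names `EXACT`, `COARSE`, `ANY`, `SEED` and `seed_hits` are hypothetical illustrations, not the seeds or the DFA-based implementation used in SeedBLAST.

```python
from typing import List, Set, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical seed letters. Each one partitions the alphabet into groups;
# these groupings are illustrative only and are not taken from the paper.
EXACT: List[Set[str]] = [set(a) for a in AMINO_ACIDS]   # residues must be identical
COARSE: List[Set[str]] = [
    set("ILMV"), set("FWY"), set("KRH"), set("DE"),
    set("NQ"), set("ST"), set("AG"), set("C"), set("P"),
]                                                        # similar residues match
ANY: List[Set[str]] = [set(AMINO_ACIDS)]                 # every pair matches

# A hypothetical seed of span 4: exact, coarse, don't-care, exact.
SEED = [EXACT, COARSE, ANY, EXACT]


def same_group(partition: List[Set[str]], a: str, b: str) -> bool:
    """Return True if residues a and b belong to the same group."""
    return any(a in group and b in group for group in partition)


def seed_hits(seed, query: str, subject: str) -> List[Tuple[int, int]]:
    """List all (query_offset, subject_offset) pairs where the seed matches.

    This is a naive quadratic scan, written for clarity rather than speed.
    """
    span = len(seed)
    hits = []
    for i in range(len(query) - span + 1):
        for j in range(len(subject) - span + 1):
            if all(same_group(p, query[i + k], subject[j + k])
                   for k, p in enumerate(seed)):
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(seed_hits(SEED, "MKLV", "MRAV"))  # [(0, 0)]
```

In the example, K and R match at the second position because they share a group, and the third position accepts any pair, so the seed reports a hit that an exact-word match would miss. A practical tool would replace the quadratic scan with an index or an automaton.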

Keywords: PSI-BLAST tool; sequence alignment; seeding technique.

About the article

Published Online: 2012-12-22

Published in Print: 2012-12-01

Citation Information: Bulletin of the Polish Academy of Sciences: Technical Sciences, ISSN (Print) 0239-7528, DOI: https://doi.org/10.2478/v10175-012-0063-0.

This content is open access.