Published by De Gruyter, July 30, 2015

Node sampling for protein complex estimation in bait-prey graphs

  • Denise M. Scholtens and Bruce D. Spencer


In cellular biology, node-and-edge graph or “network” data collection often uses bait-prey technologies such as co-immunoprecipitation (CoIP). Bait-prey technologies assay relationships or “interactions” between protein pairs, with CoIP specifically measuring protein complex co-membership. Analyses of CoIP data frequently focus on estimating protein complex membership. Due to budgetary and other constraints, exhaustive assay of the entire network using CoIP is not always possible. We describe a stratified sampling scheme to select baits for CoIP experiments when protein complex estimation is the main goal. Expanding upon the classic framework in which nodes represent proteins and edges represent pairwise interactions, we define generalized nodes as sets of adjacent nodes with identical adjacency outside the set and use these as strata from which to select the next set of baits. Strata are redefined at each round of sampling to incorporate accumulating data. This scheme maintains user-specified quality thresholds for protein complex estimates and, relative to simple random sampling, leads to a marked increase in the number of correctly estimated complexes at each round of sampling. The R package seqSample contains all source code and is available at

Corresponding author: Denise M. Scholtens, PhD, Northwestern University Feinberg School of Medicine, Division of Biostatistics, Department of Preventive Medicine, 680 N. Lake Shore Drive Suite 1400, Chicago, IL 60611, USA, e-mail:

Appendix A

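According to equation (3), if we choose a number of baits $b_{h,\mathrm{SEQGN}}^{m+1}$ to be sampled from generalized node $h$ in round $m+1$ such that


where $b_h^m$ and $p_h^m$ are the number of baits and prey in $h$, respectively, then we know the following for any set $\mathcal{I}$ of generalized nodes:

$$
\begin{aligned}
p_h^m - b_{h,\mathrm{SEQGN}}^{m+1} &\le \sqrt{1-f}\times\left(b_h^m+p_h^m\right) \\
\sum_{h\in\mathcal{I}}\left(p_h^m - b_{h,\mathrm{SEQGN}}^{m+1}\right) &\le \sqrt{1-f}\,\sum_{h\in\mathcal{I}}\left(b_h^m+p_h^m\right) \\
\frac{p_{\mathcal{I}}^m - b_{\mathcal{I},\mathrm{SEQGN}}^{m+1}}{b_{\mathcal{I}}^m+p_{\mathcal{I}}^m} &\le \sqrt{1-f} \\
\frac{\left(p_{\mathcal{I}}^m - b_{\mathcal{I},\mathrm{SEQGN}}^{m+1}\right)^2}{\left(b_{\mathcal{I}}^m+p_{\mathcal{I}}^m\right)^2} &\le (1-f)
\end{aligned}
$$

where $p_{\mathcal{I}}^m=\sum_{h\in\mathcal{I}}p_h^m$, $b_{\mathcal{I}}^m=\sum_{h\in\mathcal{I}}b_h^m$ and $b_{\mathcal{I},\mathrm{SEQGN}}^{m+1}=\sum_{h\in\mathcal{I}}b_{h,\mathrm{SEQGN}}^{m+1}$. And since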


the desired bound in equation (4) from the manuscript is established.
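
As a numerical illustration of the aggregation step, the following Python sketch assumes the per-stratum condition has the form p_h − b_new,h ≤ sqrt(1 − f) × (b_h + p_h), which is one reading of the inequalities in this appendix. The function names, the example counts, and the rule of taking the smallest admissible integer allocation are hypothetical and are not taken from the seqSample package.

```python
import math

def min_new_baits(baits, prey, f):
    """Smallest integer b_new with prey - b_new <= sqrt(1 - f) * (baits + prey).

    Assumed form of the per-stratum condition; not the seqSample implementation.
    """
    bound = math.sqrt(1.0 - f) * (baits + prey)
    # A small tolerance guards against floating-point noise before rounding up.
    needed = math.ceil(prey - bound - 1e-12)
    # Cannot sample fewer than zero or more than the available prey.
    return min(prey, max(0, needed))

def aggregate_ratio_sq(strata, new_baits):
    """Squared aggregate ratio (p - b_new)^2 / (b + p)^2 over a set of strata."""
    p = sum(prey for _, prey in strata)
    b = sum(baits for baits, _ in strata)
    b_new = sum(new_baits)
    return (p - b_new) ** 2 / (b + p) ** 2

# Hypothetical strata: (current baits, current prey) for each generalized node.
strata = [(1, 9), (2, 3), (0, 6)]
f = 0.75  # hypothetical user-specified threshold

allocation = [min_new_baits(b, p, f) for b, p in strata]
ratio = aggregate_ratio_sq(strata, allocation)

print("new baits per stratum:", allocation)
print("aggregate squared ratio:", ratio, "<= 1 - f =", 1 - f)
```

Because each stratum satisfies the inequality on its own, summing over any subset of strata preserves it, so the aggregate squared ratio stays at or below 1 − f regardless of which strata are combined.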

Appendix B

Let $[(Y+I)C^m]$ be a binary vector with entries of 1 indicating use as bait or adjacency to at least one bait, i.e. detection as prey, at some point during sampling up to and including sampling round $m$. Consider two nodes $i$ and $j$ that are members of the same generalized node $h$ such that $X_{ih}=X_{jh}=1$. By the definition of generalized node, $Y_{\cdot i}+I_{\cdot i}=Y_{\cdot j}+I_{\cdot j}$. Suppose $C_i^m=1$, but $C_j^m=0$. Node $j$ would be eligible for selection as bait in the next round of sampling; let the vector $S^{*j}$ indicate its potential selection with $j$th value equal to 1 and all other values equal to 0. Since


it follows that selection of a bait from a generalized node for which another member has already been used as bait will not change the set of detected prey. New prey will only be incorporated into the assayed graph when baits are selected from previously unsampled generalized nodes.
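
The conclusion can be checked on a small example. The Python sketch below uses a hypothetical six-node undirected graph and treats nodes with identical closed neighborhoods (identical columns of Y + I) as one generalized node. The helper names and the graph are illustrative assumptions, and the bracket notation is read here as an elementwise indicator of a positive entry; none of this is drawn from seqSample.

```python
def closed_neighborhoods(adj):
    """Column i of Y + I as a frozenset: node i together with its neighbors."""
    n = len(adj)
    return [frozenset(j for j in range(n) if adj[i][j] or i == j)
            for i in range(n)]

def generalized_nodes(adj):
    """Group nodes whose closed neighborhoods coincide."""
    groups = {}
    for node, nbhd in enumerate(closed_neighborhoods(adj)):
        groups.setdefault(nbhd, []).append(node)
    return sorted(groups.values())

def detected(adj, baits):
    """Indicator of (Y + I) C > 0: node is a bait or adjacent to a bait."""
    n = len(adj)
    return [int(any(baits[k] and (adj[i][k] or i == k) for k in range(n)))
            for i in range(n)]

# Hypothetical graph: nodes 0, 1, 2 share the closed neighborhood {0, 1, 2, 3}.
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3), (3, 4), (4, 5)]
n = 6
Y = [[0] * n for _ in range(n)]
for i, j in edges:
    Y[i][j] = Y[j][i] = 1

strata = generalized_nodes(Y)

C = [1, 0, 0, 0, 0, 0]           # node 0 already used as bait
same_stratum = [1, 1, 0, 0, 0, 0]  # add node 1, same generalized node as 0
new_stratum = [1, 0, 0, 0, 1, 0]   # add node 4, a previously unsampled stratum

before = detected(Y, C)
after_same = detected(Y, same_stratum)
after_new = detected(Y, new_stratum)

print("generalized nodes:", strata)
print("detected before:", before)
print("after bait from same stratum:", after_same)
print("after bait from new stratum:", after_new)
```

In this example, adding node 1 as bait leaves the detected set unchanged, while adding node 4 brings node 5 into the assayed graph as new prey.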



Published Online: 2015-7-30
Published in Print: 2015-8-1

©2015 by De Gruyter
