Abstract
In cellular biology, node-and-edge graph or “network” data collection often uses bait-prey technologies such as co-immunoprecipitation (CoIP). Bait-prey technologies assay relationships or “interactions” between protein pairs, with CoIP specifically measuring protein complex co-membership. Analyses of CoIP data frequently focus on estimating protein complex membership. Due to budgetary and other constraints, exhaustive assay of the entire network using CoIP is not always possible. We describe a stratified sampling scheme to select baits for CoIP experiments when protein complex estimation is the main goal. Expanding upon the classic framework in which nodes represent proteins and edges represent pairwise interactions, we define generalized nodes as sets of adjacent nodes with identical adjacency outside the set and use these as strata from which to select the next set of baits. Strata are redefined at each round of sampling to incorporate accumulating data. This scheme maintains user-specified quality thresholds for protein complex estimates and, relative to simple random sampling, leads to a marked increase in the number of correctly estimated complexes at each round of sampling. The R package seqSample contains all source code and is available at http://vault.northwestern.edu/~dms877/Rpacks/.
Appendix A
According to equation (3), if we choose a number of baits
where
where
the desired bound in equation (4) from the manuscript is established.
Appendix B
Let $[(Y+I)C_m]$ be a binary vector with entries of 1 indicating use as bait or adjacency to at least one bait, i.e., detection as prey, at some point during sampling up to and including sampling round $m$. Consider two nodes $i$ and $j$ that are members of the same generalized node $h$, such that $X_{ih}=X_{jh}=1$. By the definition of a generalized node, $Y_{\cdot i}+I_{\cdot i}=Y_{\cdot j}+I_{\cdot j}$. Suppose
it follows that selection of a bait from a generalized node for which another member has already been used as bait will not change the set of detected prey. New prey will only be incorporated into the assayed graph when baits are selected from previously unsampled generalized nodes.
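The argument in Appendix B can be illustrated with a minimal sketch. This is not the seqSample implementation; it assumes an undirected, error-free graph that is fully known in advance, and the toy edge list and the function names `closed_neighborhoods`, `generalized_nodes` and `detected` are hypothetical. Nodes are grouped into generalized nodes by comparing closed neighborhoods (the columns of Y + I), and the set of detected proteins is compared across different bait choices.

```python
def closed_neighborhoods(edges, n):
    """Closed neighborhood of each node: the node itself plus its neighbors.
    For an undirected graph this plays the role of the columns of Y + I."""
    closed = [{i} for i in range(n)]
    for a, b in edges:
        closed[a].add(b)
        closed[b].add(a)
    return closed


def generalized_nodes(closed):
    """Group nodes that share an identical closed neighborhood."""
    groups = {}
    for node, nbrs in enumerate(closed):
        groups.setdefault(frozenset(nbrs), []).append(node)
    return list(groups.values())


def detected(closed, baits):
    """Nodes used as bait or adjacent to at least one bait."""
    found = set()
    for bait in baits:
        found |= closed[bait]
    return found


# Hypothetical toy graph: nodes 0, 1, 2 form a triangle,
# node 2 is also adjacent to node 3, and node 3 to node 4.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
closed = closed_neighborhoods(edges, 5)

strata = generalized_nodes(closed)       # nodes 0 and 1 share a stratum
one_bait = detected(closed, [0])         # bait from stratum {0, 1}
same_stratum = detected(closed, [0, 1])  # second bait from the same stratum
new_stratum = detected(closed, [0, 3])   # second bait from an unsampled stratum

print(strata)
print(one_bait == same_stratum)
print(sorted(new_stratum - one_bait))
```

In this toy graph, nodes 0 and 1 have the same closed neighborhood and therefore form one generalized node. Adding node 1 as a bait after node 0 detects nothing new, whereas adding node 3 from a previously unsampled generalized node brings nodes 3 and 4 into the assayed graph.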
©2015 by De Gruyter