Mingming Liu, Vanessa King, Wei Keat Lim
October 18, 2016
A growing body of evidence suggests that canonical pathways and standard molecular signature databases are incomplete and inadequate for modeling the complex behavior of cell physiology and pathology. Yet many Gene Set Analysis (GSA) studies still rely on these databases to identify disease biomarkers and molecular mechanisms within a specific cell context. While tremendous effort has been invested in developing GSA tools, only a limited number of studies have focused on de novo assembly of context-specific gene sets, as opposed to simply applying GSA with a standard gene set database. In this paper, we propose a pipeline to derive the entire collection of Cell context-Specific Gene Sets (CSGS) from a molecular interaction network, based on the hypothesis that molecular events linked to a specific phenotypic response should cluster within a subnet of interacting genes. Gene sets are assigned using both physical properties of the network and functional annotations of the neighboring nodes. The identified gene sets could provide a precise starting point, such that the downstream GSA covers all functional pathways in the particular cell context while avoiding the noise and excessive multiple-hypothesis testing caused by including irrelevant gene sets from the standard database. We applied the pipeline in the context of cardiomyopathy and demonstrated its superiority over the MSigDB gene set collection in terms of (i) reproducibility and robustness in GSA, (ii) effectiveness in uncovering molecular mechanisms associated with cardiomyopathy, and (iii) performance in distinguishing diseased from normal states.
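To make the idea of reading gene sets off a subnet of interacting genes concrete, the following is a minimal sketch and not the pipeline proposed in this paper. It assumes the simplest possible rule: each gene together with its direct interaction partners forms one candidate set, and sets below a minimum size are dropped. The function name `neighborhood_gene_sets`, the `min_size` parameter, and the placeholder gene labels `G1` to `G6` are all hypothetical. The sketch uses network topology only and omits the functional annotations of neighboring nodes.

```python
from collections import defaultdict


def neighborhood_gene_sets(edges, min_size=3):
    """Build one candidate gene set per gene: the gene plus its direct neighbors.

    edges    -- iterable of (gene_a, gene_b) pairs from an undirected network
    min_size -- assumed cutoff; smaller neighborhoods are discarded
    """
    # Undirected adjacency list built from the edge pairs.
    adjacency = defaultdict(set)
    for gene_a, gene_b in edges:
        adjacency[gene_a].add(gene_b)
        adjacency[gene_b].add(gene_a)

    gene_sets = {}
    for gene, neighbors in adjacency.items():
        members = frozenset(neighbors | {gene})
        if len(members) >= min_size:
            gene_sets[gene] = members
    return gene_sets


# Toy network with placeholder labels; these are not real interactions.
toy_edges = [
    ("G1", "G2"),
    ("G1", "G3"),
    ("G1", "G4"),
    ("G2", "G3"),
    ("G5", "G6"),
]

candidate_sets = neighborhood_gene_sets(toy_edges, min_size=3)
for center, members in sorted(candidate_sets.items()):
    print(center, sorted(members))
```

In this toy network, only the neighborhoods centered on `G1`, `G2`, and `G3` reach the size cutoff; the isolated pair `G5` and `G6` yields no set. Restricting the collection in this way is one route to the smaller number of hypotheses tested downstream.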