GenePEN identifies compact network activity alterations in functional omics datasets (illustrated here with microarray gene expression data) by casting the search as a convex optimization problem. We assume a set of labeled (cases vs. controls) gene expression samples $$\{\mathbf{x}_i, y_i\}_{i=1}^{n}$$ with features **x**_{i}∈ℝ^{p} (with *p*>>*n*) and class labels *y*_{i}∈{–1, 1}, and an undirected graph encoding pairwise functional associations between genes (e.g., a protein-protein interaction network). We are interested in finding a *sparse* set of discriminative genes that form a large *connected* subgraph of the input graph. This is a problem of learning under *structured sparsity* (Rapaport et al., 2007; Li and Li, 2008; Bach et al., 2012; Yang et al., 2012). We adopt a penalized logistic regression approach, which involves finding weights **w**∈ℝ^{p} and an intercept *ν*∈ℝ that solve the program

$$\min_{\mathbf{w},\,\nu}\; f(\mathbf{w}, \nu) + \lambda\,\Omega(\mathbf{w}), \qquad (1)$$

where *f*(**w**, *ν*) is the (smooth and convex) logistic loss averaged over the *n* samples

$$f(\mathbf{w}, \nu) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + \nu\right)\right)\right), \qquad (2)$$
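
As a minimal sketch of how the loss in (2) can be evaluated, the following Python snippet computes it on a small made-up dataset. The function name `logistic_loss` and the toy data are assumptions for illustration only and are not part of the GenePEN software.

```python
import math

def logistic_loss(X, y, w, nu):
    """Average logistic loss: (1/n) * sum_i log(1 + exp(-y_i * (w^T x_i + nu)))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + nu)
        # Numerically stable evaluation of log(1 + exp(-margin))
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(X)

# Toy data (made up): n = 4 samples, p = 2 features, labels in {-1, 1}
X = [[1.0, 0.5], [0.8, 1.2], [-1.0, -0.3], [-0.7, -1.1]]
y = [1, 1, -1, -1]

loss_zero = logistic_loss(X, y, [0.0, 0.0], 0.0)  # equals log(2) when w = 0 and nu = 0
loss_fit = logistic_loss(X, y, [1.0, 1.0], 0.0)   # lower, since this w separates the classes
```

With all weights at zero every sample contributes log 2, so any useful model must bring the loss below that value.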

and Ω(**w**) is a penalty function that regularizes **w**, where *λ*∈ℝ_{+} controls the tradeoff between the loss and the penalty. GenePEN implements a novel penalty function Ω(**w**) that penalizes the differences between the *absolute* values of the weights of neighboring features in the graph:

$$\Omega(\mathbf{w}) = \sum_{i=1}^{p}\left[\sum_{j=1}^{p} A_{ij}|w_i| - \sum_{j=1}^{p} A_{ij}|w_j|\right]^{2} + 2\Delta\,\|\mathbf{w}\|_{1}^{2}, \qquad (3)$$

where *A* is the (symmetric) adjacency matrix of the input graph, Δ is the maximum degree of the graph, and ||**w**||_{1} is the *L*_{1} norm of the weight vector.
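
The penalty in (3) can be evaluated directly from its definition. The sketch below assumes a 0/1 adjacency matrix stored as a list of lists. The function name `genepen_penalty` and the four-node path graph are hypothetical, chosen only to illustrate the formula. Since the inner sum of *A*_{ij} over *j* is the degree of node *i*, the bracketed term compares the degree-weighted |*w*_{i}| with the sum of the neighbors' absolute weights.

```python
def genepen_penalty(A, w):
    """Omega(w) from Eq. (3) for a symmetric 0/1 adjacency matrix A."""
    p = len(w)
    a = [abs(w_i) for w_i in w]
    max_degree = max(sum(row) for row in A)          # Delta
    smooth = 0.0
    for i in range(p):
        own = sum(A[i][j] for j in range(p)) * a[i]   # sum_j A_ij |w_i|
        nbr = sum(A[i][j] * a[j] for j in range(p))   # sum_j A_ij |w_j|
        smooth += (own - nbr) ** 2
    l1 = sum(a)                                       # ||w||_1
    return smooth + 2 * max_degree * l1 ** 2

# Toy path graph 0 - 1 - 2 - 3 (made up), maximum degree 2
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

omega_connected = genepen_penalty(A, [1.0, 1.0, 0.0, 0.0])   # adjacent support: 18.0
omega_scattered = genepen_penalty(A, [1.0, 0.0, 0.0, 1.0])   # non-adjacent support: 20.0
omega_flipped = genepen_penalty(A, [1.0, -1.0, 0.0, 0.0])    # sign flip leaves Omega at 18.0
```

In this toy case both weight vectors have the same *L*_{1} norm, yet the one whose non-zero entries are adjacent in the graph receives the smaller penalty, and flipping a sign changes nothing.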

The use of *absolute* weights in the penalty function is a key difference between our method and previous related methods for biological network alteration analysis, which also use a quadratic penalty function over the model weights but not over their absolute values (Rapaport et al., 2007; Li and Li, 2008). The motivation is that in linear logistic regression (as in any other linear model) the magnitude of a weight reflects the *relevance* of the corresponding feature in the solution, whereas the sign of the weight is irrelevant for feature selection. Hence, by penalizing the absolute value of a weight, we “push” all irrelevant weights to zero, thereby keeping only the relevant attributes in the final model.

This idea is implicit in the Lasso (Tibshirani, 1996), the Elastic Net (Hastie et al., 2009), and the Pairwise Elastic Net (Lorbert et al., 2010), a generalization of the Elastic Net in which the single parameter that trades off L1- against L2-regularization is replaced so that the trade-off can be adjusted using other information (e.g., from a feature similarity matrix). The importance of using absolute weights in penalty functions for classification has also been demonstrated in other recent work (Yang et al., 2012), but the penalty functions proposed there were nonconvex, which hampers the efficient discovery of globally optimal solutions.

Our main theoretical result, which is key to achieving computational efficiency, is that the penalty function Ω(**w**) in (3) is convex in **w** (see proof and implementation details in the Supplementary Material). GenePEN solves the convex program in the TFOCS optimization framework (Becker et al., 2011), resulting in highly efficient optimization.

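
The effect of taking absolute values can be seen on a single edge. The sketch below compares the first term of (3) with a generic signed quadratic penalty, the sum of (*w*_{i} − *w*_{j})^{2} over all edges. This signed penalty is an illustrative stand-in and not the exact form used in the cited methods; both function names are hypothetical.

```python
def signed_quadratic_penalty(A, w):
    """Sum over edges (i, j) of (w_i - w_j)^2; an illustrative stand-in only."""
    p = len(w)
    return sum(A[i][j] * (w[i] - w[j]) ** 2
               for i in range(p) for j in range(i + 1, p))

def absolute_smoothness(A, w):
    """First term of Eq. (3): sum_i [sum_j A_ij |w_i| - sum_j A_ij |w_j|]^2."""
    p = len(w)
    a = [abs(w_i) for w_i in w]
    return sum((sum(A[i][j] * (a[i] - a[j]) for j in range(p))) ** 2
               for i in range(p))

# Two genes joined by one edge (made-up example)
A = [[0, 1],
     [1, 0]]
w_same = [1.0, 1.0]       # equally relevant, same sign
w_opposite = [1.0, -1.0]  # equally relevant, opposite signs

signed_same = signed_quadratic_penalty(A, w_same)          # 0.0
signed_opposite = signed_quadratic_penalty(A, w_opposite)  # 4.0
abs_same = absolute_smoothness(A, w_same)                  # 0.0
abs_opposite = absolute_smoothness(A, w_opposite)          # 0.0
```

The signed penalty treats two neighboring genes with opposite weight signs as dissimilar, whereas the absolute-value term treats them as equally relevant and does not penalize the pair.
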
The Matlab software implementation of GenePEN provides an easy-to-use function to train a model, taking as input the features **x**_{i}, the class labels *y*_{i}, the symmetric adjacency matrix *A* of the association graph, and the regularization constant *λ*. As output, the learned weights *w*_{i} are provided, and the final feature selection amounts to choosing those features whose weights *w*_{i} are sufficiently far from zero. The software also includes example code, together with microarray and molecular interaction data, for cross-validating GenePEN and reporting different performance statistics.
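
The selection step described above can be sketched as follows. This is an illustrative Python snippet, not the Matlab interface of GenePEN: the cutoff `tol`, both function names, and the toy weights are assumptions. It also reports how the selected genes group into connected subgraphs of the input graph.

```python
def select_features(w, tol=1e-6):
    """Indices whose weight magnitude exceeds tol (tol is an assumed cutoff)."""
    return [i for i, w_i in enumerate(w) if abs(w_i) > tol]

def connected_components(A, nodes):
    """Connected components of the subgraph of A induced by `nodes`."""
    remaining = set(nodes)
    components = []
    while remaining:
        start = remaining.pop()
        component, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in list(remaining):
                if A[u][v]:
                    remaining.discard(v)
                    component.add(v)
                    stack.append(v)
        components.append(sorted(component))
    return sorted(components)

# Toy path graph 0 - 1 - 2 - 3 - 4 and made-up learned weights
p = 5
A = [[1 if abs(i - j) == 1 else 0 for j in range(p)] for i in range(p)]
w = [0.8, -0.5, 1e-9, 0.0, 0.3]

selected = select_features(w)                    # [0, 1, 4]
components = connected_components(A, selected)   # [[0, 1], [4]]
```

A suitable cutoff depends on the solver tolerance and the scale of the data, so the value used here should be read as a placeholder.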
