In this section, we evaluate the proposed method on simulated data sets for modeling gene-covariate interactions. We use two scenarios to generate the SNP genotype datasets. In the first scenario, the SNPs are simulated with independent binomial random generators. Because real SNPs have a complicated dependence structure, in the second scenario we instead randomly subsample the real Framingham Heart Study dataset to generate the SNP datasets.

In detail, for the first scenario, we simulate 1000 subjects with 1000 SNPs (SNP1 to SNP1000) in the model, i.e., *n*=*p*=1000. Among the 1000 SNPs, 20 are true SNPs with nonzero regression coefficients. The allele frequencies of the 20 true predictors alternate between 0.3 and 0.5, i.e., SNP1 is generated from *Binomial*(2, 0.3), SNP2 from *Binomial*(2, 0.5), etc. In addition to the 1000 SNPs, we include 2 covariates *E*_{1} and *E*_{2} in the models, where *E*_{1} is a binary random variable from *Bernoulli*(0.5) and *E*_{2} is a normally distributed random variable from *N*(0, 0.5^{2}).

In our setting, we treat each SNP as a random variable with a binomial distribution, not as a categorical variable with 3 levels. In some genetic studies, SNPs are treated as 3-level categorical variables, and 2 dummy variables are used to represent each SNP. That setting requires additional restrictions, e.g., the estimated coefficients of the 2 dummy variables must be either both zero or both nonzero, and the interaction terms need a similar restriction. The algorithm would be more complicated, the number of predictors (main effects and interactions) would double, and the computing load would be heavier. After treating SNPs as binomial random variables, we standardize the main and interaction terms before applying the coordinate descent algorithm, similar to Choi et al. (2010).
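The first data-generating scenario can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the function names `simulate_design` and `standardize` are hypothetical, and because the text gives allele frequencies only for the 20 true SNPs, the sketch assumes the remaining SNPs continue the same 0.3/0.5 alternation.

```python
import numpy as np

def simulate_design(n=1000, p=1000, seed=0):
    """Draw independent binomial SNPs and the two covariates E1, E2."""
    rng = np.random.default_rng(seed)
    # SNP1 ~ Binomial(2, 0.3), SNP2 ~ Binomial(2, 0.5), and so on.
    # ASSUMPTION: SNPs beyond the 20 true ones keep the same alternation.
    maf = np.where(np.arange(p) % 2 == 0, 0.3, 0.5)
    snps = rng.binomial(2, maf, size=(n, p))   # values in {0, 1, 2}
    e1 = rng.binomial(1, 0.5, size=n)          # Bernoulli(0.5)
    e2 = rng.normal(0.0, 0.5, size=n)          # N(0, 0.5^2): sd = 0.5
    return snps, e1, e2

def standardize(x):
    """Center each column and scale it to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    sd = x.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)            # leave constant columns unscaled
    return (x - x.mean(axis=0)) / sd

snps, e1, e2 = simulate_design()
z = standardize(snps)
```

Treating each SNP as a single numeric column keeps the design at one main-effect column per SNP, which is the saving over the two-dummy coding discussed above.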

For the interactions, we specify two gene-covariate interaction settings as follows:

Case I: We add the interactions (SNP1×*E*_{1}, SNP3×*E*_{1}, …, SNP9×*E*_{1}) and (SNP1×*E*_{2}, SNP3×*E*_{2}, …, SNP9×*E*_{2}) in the model.

Case II: We add the interactions (SNP1×*E*_{1}, SNP3×*E*_{1}, …, SNP9×*E*_{1}) and (SNP11×*E*_{2}, SNP13×*E*_{2}, …, SNP19×*E*_{2}) in the model.

In Case I, both covariates *E*_{1} and *E*_{2} interact with the same set of true active SNPs, whereas in Case II they interact with different sets of true active SNPs. The coefficients for both main and interaction effects are set to give 80% power at the 5% significance level under standard single-SNP GWAS models with an additive-trait structure. In detail, the true coefficients for SNPs with allele frequencies 0.3 and 0.5 are set at 0.15 and 0.13, respectively. The true coefficients for *E*_{1} and *E*_{2} are 0.21 and 0.15, respectively. The interaction coefficients are set at 0.24 between a SNP and *E*_{1}, and 0.20 between a SNP and *E*_{2}.
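To make the two interaction settings concrete, the noiseless linear predictor can be sketched as follows. The helper names `interaction_pairs` and `linear_signal` are hypothetical. The sketch reads "SNP1, SNP3, …, SNP9" as the odd-numbered SNPs 1, 3, 5, 7, 9 (and likewise 11 to 19), and it applies the stated coefficients to the raw, unstandardized variables. Both readings are assumptions, since the text does not spell them out.

```python
import numpy as np

def interaction_pairs(case):
    """Return (1-based SNP index, covariate name) pairs for Case 'I' or 'II'."""
    e1_snps = [1, 3, 5, 7, 9]
    # ASSUMPTION: the ellipsis in the text stands for the odd-numbered SNPs.
    e2_snps = [1, 3, 5, 7, 9] if case == "I" else [11, 13, 15, 17, 19]
    return [(j, "E1") for j in e1_snps] + [(j, "E2") for j in e2_snps]

def linear_signal(snps, e1, e2, case):
    """Noiseless response: main effects plus gene-covariate interactions."""
    snps = np.asarray(snps, dtype=float)
    beta = np.zeros(snps.shape[1])
    beta[0:20:2] = 0.15   # SNP1, SNP3, ..., SNP19 (allele frequency 0.3)
    beta[1:20:2] = 0.13   # SNP2, SNP4, ..., SNP20 (allele frequency 0.5)
    signal = snps @ beta + 0.21 * e1 + 0.15 * e2
    for j, cov in interaction_pairs(case):
        if cov == "E1":
            signal = signal + 0.24 * snps[:, j - 1] * e1
        else:
            signal = signal + 0.20 * snps[:, j - 1] * e2
    return signal
```

Under this reading each case has 10 true interaction terms, which matches the use of the top 10 interaction terms when computing FDR_{I}.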

For both cases, we add a high level of normal observation noise, with the signal-to-noise ratio (SNR) equal to 0.1, to mimic the weak signals typical of real genetic data. Since the simulated SNPs in the first scenario are mutually independent, we also want to show the efficiency of our proposed method on real genetic data. In the second scenario of SNP dataset generation, we therefore randomly take 100 subsamples of 1000 SNPs from our real data example. We assign the true SNPs the same coefficients and use the same interaction settings (Case I and Case II) as in the first scenario. Normal observation noise is again added so that the SNR equals 0.1 in the second scenario.
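One way to calibrate the noise level is sketched below. The text does not define SNR explicitly, so this sketch assumes SNR = Var(signal) / Var(noise); the function name `add_noise_for_snr` is hypothetical.

```python
import numpy as np

def add_noise_for_snr(signal, snr=0.1, seed=0):
    """Add zero-mean normal noise scaled so Var(signal)/Var(noise) = snr."""
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    # ASSUMPTION: SNR is the ratio of signal variance to noise variance.
    noise_sd = np.sqrt(signal.var() / snr)
    noise = rng.normal(0.0, noise_sd, size=signal.shape)
    return signal + noise, noise_sd
```

With SNR = 0.1 the noise variance is ten times the signal variance, so the noise standard deviation is about 3.16 times that of the signal.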

The group structures are simulated from the real KEGG biological pathways, one of the most popular pathway collections. Since about one-third of the genes in our real biological dataset are found in the KEGG pathways, we first randomly sample 300 genes from the KEGG pathways and treat the remaining 700 genes as not belonging to KEGG. These 700 genes can be considered as 700 groups of size 1. We then use our first 300 SNPs to represent the 300 selected genes, one SNP per gene. In our simulation, the KEGG pathway information groups the 300 genes into 159 pathways. The total number of pathways in KEGG is 186, so around 85% of the KEGG groups are represented in our simulation. Among the selected 159 pathways, only 16 do not overlap with others.

We design two strategies to assign the true SNPs to the formed groups. In the first strategy (Group I), the true SNPs are assigned so that the percentage of true active SNPs in each of their groups is lower than 10%. This is the more realistic scenario when compared with real genetic data. We put six true SNPs (SNP1 to SNP6) in the largest pathway group. Twelve SNPs (SNP7 to SNP18) are randomly put into 6 groups that each include at least 20 variables. SNP19 and SNP20 are randomly distributed into 2 additional groups that each include at least 10 variables. This setting also covers the situation in which a group contains only one true active SNP. In the second strategy (Group II), all of the true SNPs are put in the largest group and do not overlap with other groups. This is an extreme case. We use this special group assignment to examine how the performance of our method is affected by the groups containing true SNPs.
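The Group I constraint (fewer than 10% true active SNPs in any group) can be checked with a small helper. This is a hedged sketch: the function name `active_fraction_by_group` and the toy groups in the usage lines are hypothetical and are not the KEGG pathways used in the simulation.

```python
def active_fraction_by_group(groups, true_snps):
    """Fraction of true active SNPs in each (possibly overlapping) group.

    groups: dict mapping a group name to an iterable of SNP indices.
    """
    true_set = set(true_snps)
    fractions = {}
    for name, members in groups.items():
        members = set(members)
        fractions[name] = len(true_set & members) / len(members)
    return fractions

# Hypothetical toy groups, for illustration only.
toy_groups = {
    "largest": range(1, 71),                  # 70 SNPs, holds SNP1 to SNP6
    "medium": [200, 201] + list(range(100, 120)),
}
toy_true = [1, 2, 3, 4, 5, 6, 200, 201]
fractions = active_fraction_by_group(toy_groups, toy_true)
```

In this toy example both groups stay below the 10% threshold (6 of 70 and 2 of 22).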

The two covariates are not penalized and are forced into our model. We run the simulation 100 times for each simulation setting, and record the selection frequencies for main effects in Step 1 and the selection frequencies for both main effects and interaction terms in Step 2. For the competing SHIM model, we simply run Lasso without considering the group structure in Step 1. We also rank the selection frequencies of the main and interaction effect terms, and use the top 20 main SNPs to calculate the false discovery rate (FDR) for main effects, FDR_{M}, and the top 10 interaction terms to calculate the corresponding FDR for interaction terms, FDR_{I}.
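The ranking-based FDR described above can be sketched as follows, assuming the per-run selections are stored as a boolean matrix with one row per simulation run and one column per term. The names `selection_frequency` and `top_k_fdr` are hypothetical, and ties in the ranking are broken by column order, which is an arbitrary choice of this sketch.

```python
import numpy as np

def selection_frequency(selected):
    """Column-wise share of runs in which each term was selected.

    selected: (n_runs, n_terms) array of 0/1 or boolean indicators.
    """
    return np.asarray(selected, dtype=float).mean(axis=0)

def top_k_fdr(freq, true_idx, k):
    """Share of non-true terms among the k most frequently selected terms."""
    freq = np.asarray(freq, dtype=float)
    top = np.argsort(-freq, kind="stable")[:k]   # ties keep column order
    true_set = set(true_idx)
    false_hits = sum(1 for j in top if int(j) not in true_set)
    return false_hits / k

# Tiny example: 3 runs, 4 terms, true terms at columns 0 and 2.
runs = np.array([[1, 1, 0, 0],
                 [1, 0, 1, 0],
                 [1, 1, 0, 0]])
freq = selection_frequency(runs)
fdr = top_k_fdr(freq, true_idx=[0, 2], k=2)
```

In the setting of this section one would call `top_k_fdr` with `k=20` for FDR_{M} and `k=10` for FDR_{I}.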

The simulation results of our proposed method are shown and compared with the results of the SHIM model in Tables 1 and 2. Across the simulation settings, owing to the additional group Lasso penalty, our proposed method tends to have much higher selection frequencies for true active main effects and lower selection frequencies for non-active main effects compared with the SHIM method. This means that the power of our method is much better for main effects. The selection frequencies are also very consistent when the ratio *c*_{2} between *λ*_{2} and *λ*_{1} ranges from 0.1 to 0.9. The selection frequencies at *c*_{2}=0.9 are slightly higher than those at *c*_{2}=0.1, especially for the simulations using random subsamples of real SNPs. Within the same interaction case and under the same set of SNP variables, our method performs slightly better under the second true SNP assignment (Group II) than under Group I. This is due to the special group structure of the true SNPs: since all true SNPs are in the same group, the group Lasso penalty is more efficient at knocking out all nuisance groups. Because the SHIM method does not consider the group structure, its performance is similar for Group I and Group II. Also within the same interaction case, owing to the complicated correlation structure of real SNPs, the results for real SNPs are always worse than those for independently simulated SNPs. The selection frequencies for interaction terms are comparable between our proposed method and the SHIM method. In most situations, our method selects the true interaction terms with slightly higher frequencies and the non-true interaction terms with slightly lower frequencies. Moreover, because the interaction effects of Case II are much stronger than those of Case I, both the true main variables involved in interactions and the true interactions have higher selection frequencies in Case II.

In terms of FDR, our method tends to perform better for main effects. For interaction effects, most simulation results indicate that our method performs better than SHIM. This means that our method generally tends to have a smaller Type I error than SHIM.

Table 1 Simulation results for Case I with two different true SNP assignment strategies (Group I and Group II) and two different ways to generate SNP datasets (simulated SNPs and real SNPs).

Table 2 Simulation results for Case II with two different true SNP assignment strategies (Group I and Group II) and two different ways to generate SNP datasets (simulated SNPs and real SNPs). Notations have the same meanings as in Table 1.

In Figure 1, we plot the histograms of main effect selection frequencies for the two interaction cases (Case I and Case II) under the SHIM method and our GISP method at *c*_{2}=0.5. Since the performance patterns of our proposed method and SHIM are similar across the true SNP assignment strategies and the datasets used in the simulation, we only display the results for Group I with random subsamples of real SNPs. The histogram counts at low main effect selection frequencies are larger for our method than for the SHIM method. Also, the frequency bars of the true active main effects lie further from the histogram peak of the non-active main effects. The dotted line in Figure 1 represents the minimum selection frequency among the true SNPs (*f*_{d}), while the solid line represents the 20th value of the ordered selection frequencies for all SNPs (*f*_{s}). The smaller the relative distance, defined as (*f*_{s}–*f*_{d})/*f*_{d}, the better the performance of the estimation method. Our method has smaller relative distances than SHIM in all simulation situations. Moreover, if there are fewer SNPs between the solid line and the dotted line (*δ*_{N}), one can improve the FDR result by just lowering the cutoff number of chosen SNPs. Compared with SHIM, our method always has a smaller *δ*_{N} for both interaction cases.
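The two summaries read off Figure 1 can be computed directly from a vector of selection frequencies, as in the sketch below. The function name `separation_summary` is hypothetical. Counting *δ*_{N} as the SNPs strictly between the two lines is an assumption of this sketch, and the relative distance is undefined when *f*_{d} is zero.

```python
import numpy as np

def separation_summary(freq, true_idx, k=20):
    """Return (f_d, f_s, relative distance, delta_N) for one method.

    f_d: minimum selection frequency among the true SNPs (dotted line).
    f_s: k-th largest selection frequency over all SNPs (solid line).
    """
    freq = np.asarray(freq, dtype=float)
    f_d = freq[list(true_idx)].min()
    f_s = np.sort(freq)[::-1][k - 1]
    rel_dist = (f_s - f_d) / f_d          # undefined if f_d == 0
    # ASSUMPTION: delta_N counts SNPs strictly between the two lines.
    delta_n = int(np.sum((freq > f_d) & (freq < f_s)))
    return f_d, f_s, rel_dist, delta_n

# Toy example with 3 true SNPs (columns 0, 1, 2) and k = 3.
toy_freq = [0.9, 0.8, 0.5, 0.7, 0.6, 0.1]
f_d, f_s, rel_dist, delta_n = separation_summary(toy_freq, [0, 1, 2], k=3)
```

Here one non-true SNP (frequency 0.6) sits between the dotted line at 0.5 and the solid line at 0.7, so `delta_n` is 1 and the relative distance is 0.4.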

Figure 1: Histograms for SNP selection frequencies of two different interaction settings (Case I and Case II) with the same true SNP assignment strategy (Group I) using random subsamples of real SNPs dataset. Dotted lines: the minimum value of true SNP selection frequencies. Solid lines: the 20th value of the ordered selection frequencies for all SNPs.
