A weighted empirical Bayes risk prediction model using multiple traits

Abstract With rapid advances in high-throughput sequencing technology, millions of single-nucleotide variants (SNVs) can be simultaneously genotyped in a sequencing study. These SNVs residing in functional genomic regions such as exons may play a crucial role in biological process of the body. In particular, non-synonymous SNVs are closely related to the protein sequence and its function, which are important in understanding the biological mechanism of sequence evolution. Although statistically challenging, models incorporating such SNV annotation information can improve the estimation of genetic effects, and multiple responses may further strengthen the signals of these variants on the assessment of disease risk. In this work, we develop a new weighted empirical Bayes method to integrate SNV annotation information in a multi-trait design. The performance of this proposed model is evaluated in simulation as well as a real sequencing data; thus, the proposed method shows improved prediction accuracy compared to other approaches.


Introduction
The advances in the next-generation sequencing (NGS) technology are underway to directly sequence and identify genetic variants that associate with a variety of diseases (Bamshad et al. 2011;Renkema et al. 2014). A few applications of NGS techniques have been well-established for studies including but not limited to Mendelian disorders (Need et al. 2012;Yang et al. 2013), aging-realated diseases (Boyd-Kirkup et al. 2013), height (Marouli et al. 2017;Yang et al. 2010), and type 2 diabetes (Fuchsberger et al. 2016). As this new technique (NGS) emerges, it greatly accelerates the discovery of novel variants or new pharmacogenetic markers for precision medicine (Gonzalez-Garay 2014). Previous sequencing studies have revealed that hundreds of novel variants in 82 pharmacogenes (Bush et al. 2016) and tens of millions of novel variants in UK10K sequencing study (The UK10K Consortium 2015) have been successfully identified. Lacking of functional annotations of these novel variants, efficient statistical and computational methods are needed to predict the function of these variants (De Baets et al. 2012;Yourshaw et al. 2015).
To date, tens of thousands of whole-exome sequencing (WES) data and whole-genome sequencing (WGS) data are available at a low cost and low error rate. Scientists can easily access these data for examining the impact of genetic variants on a disease trait, especially for rare variants. In fact, each individual carries over three million genotyped variants (Ball et al. 2012) with respect to the reference genome, where the majority are rare variants with low minor allele frequencies (MAFs). Such a high density of rare variants launches a weak signal, and it is difficult to separate their poor signal from noises in a real sequencing data analysis. Recent studies suggest that the effect of rare variants can be assessed by adopting different prior probabilities (Cirulli and Goldstein 2010). A number of statistical methods have been developed based upon this by collapsing variants in each gene (Li and Leal 2008;Luo et al. 2011;Madsen and Browning 2009;Price et al. 2010;Wu et al. 2011). This strategy can be improved by clarifying the molecular function of genetic variants in biological process Wu et al. 2011). Because of the advancements in methodology, bioinformatic annotation is going to play a crucial role in a real sequencing analysis.
It increasingly recognizes that a few rare variants are likely responsible for the risk assessment of diseases (Golstein et al. 2013). There is evidence that roughly 85% of mutations occur in the protein-coding regions which account for 2% of the human genome, and this small region is very important in understanding the impact of SNV annotation on a disease (Majewski et al. 2011;Ng et al. 2010). When a mutation occurs in a protein-coding region, this change can be classified into a synonymous change (no amino acid change) or a non-synonymous change (an amino acid change). These two changes have quite different impact on a protein sequence, thus annotating the variants may better reveal their biological functions at the sequence level. A few strategies have been proposed to annotate each variant using PolyPhen2, SIFT, ANNOVAR or CADD (Adzhubei et al. 2013;Alexander et al. 2016;Mathur et al. 2018;Kircher et al. 2014;Krupp et al. 2017;Kumar et al. 2009). However, how to efficiently incorporate this information in a risk prediction model is still a challenging question.
Another frequently encountered problem in a sequencing data analysis is "Selection Bias" which may result in the overinflated estimate of the true genetic parameter (Efron 2009). It is caused by the imbalance between a small number of subjects (thousands of individuals) and an extremely large amount of features (millions of genetic variants or tens of thousands of genes). Under this imbalance, traditional least square estimates are unstable, and over-inflated estimates of genetic effects have been observed in the microarray data analysis (Tibshirani et al. 2002;Zhang, 2008). Compared to the microarray data, sequencing data has even larger number of features, and it makes the imbalance worse and results in more serious bias when estimating the genetic effects. To address this limitation, several approaches (e.g., Shrink Centroid (Tibshirani et al. 2002), Ridge Regression (Hoerl and Kennard 1970) and Bias-reduced Estimator (Zhang, 2008)) have been proposed to shrink the estimates to more realistic ones through correcting the bias. These commonly applied shrinkage methods are based on a frequentist perspective. A Bayesian approach may be a more natural choice over the frequentist method since it is immune to the bias through naturally producing smoother estimates. The development of empirical Bayes (EB) (Efron 2009) is motivated by a need to greatly simplify the calculation of traditional Bayesian estimates. Here, EB estimates may efficiently capture the true effects of a large amount of variants in a sequencing data analysis.
Recent studies have revealed that empirical Bayes prediction models have brought impressive achievements in the microarray data analysis (Efron 2009;Singh et al. 2002). Shortly afterward, we extended the empirical Bayes model to a mini exome sequencing data for predicting the risk of cardiovascular disease ). These earlier works were developed for a binary trait (disease status), and not applicable to other types of traits. It is known that a simple binary trait may not be adequate to explain the pathogenic mechanism of disease, especially some quantitative traits are more likely to be responsible for the assessment of complex diseases. Later, we developed a joint empirical Bayes model that successfully combines a binary trait with a quantitative trait to more precisely identify the disease-causing mutations in a whole-genome sequencing data (Li et al. 2015). However, this model did not concern the impact of SNV annotation and treated the synonymous and non-synonymous genetic effects equally. A few studies have revealed that an amino acid change may deeply affect the stability of the protein. The model should assign distinct prior probabilities to the synonymous change and non-synonymous change separately. Hence, a new weighted empirical Bayes (WEB m ) method is designed to incorporate the annotation information of variants in a multiple trait risk prediction model for maximizing the functional genetic effects on multi traits. In some aspects, this proposed method combines the annotation information with multiple traits design in one model, and we expect better prediction results. may, on average, strengthen the signals of genetic variants, especially for rare variants. Here, each gene consists of many variants, and these variants are classified into two categories, i.e., synonymous SNVs and non-synonymous SNVs, in terms of their functions. Because non-synonymous SNVs may change the protein sequence and are highly associated with the human diseases, we consider to analyze synonymous SNVs and non-synonymous SNVs separately. Thus, we calculate two sub gene scores for each gene. The sub gene made up of synonymous SNVs is called the synonymous gene, and the non-synonymous gene is made up of non-synonymous variants. Here, a weighted method (Madsen and Browning 2009) is applied to these synonymous and non-synonymous SNVs respectively to calculate two sub gene scores, where X s ij and X o ij (i = 1, … , N; j = 1, … , n) denote the ith synonymous and non-synonymous gene scores for the jth subject, respectively, and the subscripts s and o stand for the synonymous and non-synonymous, respectively. L s ijl denotes the number of rare alleles of the jth subject at the lth variant (l = 1, … , N s i ) within the ith synonymous gene, similarly, L o ijg is the number of rare allele of the jth subject at the gth variant (g = 1, … , N o i ) within the ith non-synonymous gene; Given the ith gene, there are N s i synonymous SNVs and N o i non-synonymous SNVs; p il and q ig denote the minor allele frequencies of the lth and gth variants in the ith synonymous gene and ith non-synonymous gene respectively. For this scoring scheme, the importance of synonymous and non-synonymous variants is counted by the inverse of its minor allele frequency.
We expect the significance of two sub genes to be different. Here, a logistic regression model is applied to the synonymous and non-synonymous gene-level data separately. The difference between two sub genes is measured by a weight a i through the following equation, where τ s i and τ o i denote the p values of the ith synonymous and non-synonymous genes when the logistic regression model is fit to the disease status. The above equation is a function of these two p values. Since the numerator is the minus logarithm of p value for the non-synonymous gene, this weight a i measures the proportion of the gene effect coming from the ith non-synonymous gene. In brief, a larger weight a i (>0.5) implies that the impact of ith non-synonymous gene on the disease trait is stronger than that of the ith synonymous gene. Hence, the weighted score of the ith gene and the jth subject is calculated by, where X ij is a weighted average score between the synonymous gene X s ij and the non-synonymous gene X o ij , and X i is a vector of the weighted average scores for all individuals, namely the ith weighted gene.
2.2 Estimates of each weighted gene impact on multiple traits 2.2.1 A binary trait: Suppose there are N weighted genes (X 1 , … , X N ) and n subjects consisting of n 1 cases and n 0 controls. The best prediction rule 2.4 (Efron 2009) will diagnose individuals to be sick if where G i Xi−μ i 1n σi is a standardized weighted gene vector where μ i and σ i are the true mean and the standard deviation of the ith weighted gene, and 1 n includes n ones. The parameter γ i, d n1n0 n1+n0 (μ i, 1 − μ i, 0 )/σ i measures the mean difference between the cases and controls for the ith weighted gene where the subscript d indicates a binary trait, and μ i, k (k = 1 or 0) denotes the mean score of the ith weighted gene in the case or control group. The difference between two mean scores reflects the genetic contribution of the ith weighted gene to a binary disease trait.
If σ i is unknown, the parameter γ i, d is estimated as follow: Note that σ i is the pooled standard deviation of the ith weighted gene, where s i,1 and s i,0 are the sample standard deviations of the cases and controls respectively. X i, k ∑ n j 1 XijI(Yj k) nk (k = 1 or 0) is the sample mean of the ith weighted gene in the case or control group where Y j (=1 or 0) is the jth individual's disease status. T i,d can be converted to a normal variable , where Φ −1 is an inverse standard normal and P(T ≤ T i,d ) is a cumulative distribution function of the above t statistic with n−2 degrees of freedom. Then Z i,d is an estimate of γ i, d for a binary trait, and it follows a normal distribution with mean γ i, d and the standard deviation close to 1 (Efron 2009, Remark F), In addition, Equation (5) illustrates that the expected distance between the case and control classes is estimated by Then the rule (2.1) (Efron 2009) is applied to the weighted genes (X 1 , … , 2w0 , 1 where + and − indicate the case (disease) and control (healthy) classes, respectively. Threshold b is the decision boundary I to distinguish healthy and disease groups. When equal numbers of cases and controls are collected, it is a hyperplane which is orthogonal to the line between , and the value of b is equal to 0. If there are unequal numbers of cases and controls, b is proportional to the logarithm ratio of sample sizes between cases and controls (8.2 Efron 2009).

A quantitative trait:
In addition to the disease status (binary trait), a quantitative trait that is related to the disease is often collected in most genetic studies. More importantly, the quantitative trait may offer valuable information for the disease diagnosis. Let q denote a quantitative trait. For convenience, we standardize q, A linear regression model is applied to see the impact of q on each standardized weighted gene G i , Suppose ϵ i has mean zero and identity covariance matrix, and is not correlated with q. The coefficient β i, v reflects the effect of the quantitative trait (q) on the ith weighted gene (G i ) where the subscript v indicates the quantitative trait. The joint distribution of G and q is used to derive the best predictor of q (Efron 2009, 7.3).
The Bayesian estimate is calculated from the posterior distribution of coefficients β i, v . According to the definition of G i (X i − μ i 1 n )/σ i where μ i and σ i are the mean and the standard deviation of the ith weighted gene, the best linear prediction model can be expressed as, where the least squares method is used to estimate parameters μ i , σ i and β i, v , and a vector 1 n includes n ones. When ϵ i is normally distributed (see Appendix 1.1), a t-test of hypothesis β i, v 0 is inferred from Equation (6), and it follows a noncentral t distribution, The noncentral t distribution has n−2 degrees of freedom with noncentral parameter γ i, v 2w 0 β i, v , which reflects the impact of the ith weighted gene on the quantitative trait. Then we can obtain a normal scale statistic Z i,v via the transformation equation (Z i, v Φ −1 (P(T ≤ T i, v ))), and this Z i,v is an estimator of the parameter γ i, v . It is normally distributed when sample size is large (Efron 2009),

Weighted EB prediction rule in a multiple trait design
Many medical studies increasingly recognize the importance of some quantitative traits (e.g. blood pressure, weight or blood testing) in diagnosis of complex diseases, especially given the growing understanding that some causal genes may have more immediate effects on a quantitative trait than those on a binary trait. Here, we define a new estimator Z i,c to combine two statistics Z i,d and Z i,v introduced above that correspond to a binary trait and a quantitative trait respectively, where sgn is a sign function which gives +1 or −1 for each non-zero real number (e.g., sgn(5) = 1 or sgn(−5) = −1). We know that both statistics Z i,d and Z i,v are normally distributed, and the new statistics Z i,c can capture the largest additive genetic effect on both traits if sgn(Z i,d ) > 0. In contrast, it can help us aggrandize the protective effects on two traits when sgn(Z i,d ) takes a negative value. In case the conflicting genetic effects on multiple traits are detected, the quantitative trait information will be useless at this moment. Directly taking the statistic of the binary trait Z i,d will guarantee the efficiency of the model. When we investigate the inference of this new statistic, the distribution of Z i,c is similar as that of max In practice, two normal random variables Z i,d and Z i,v might not be independent, then the multivariate normal distribution of a random vector Z i, d , Z i, v T is used to derive the exact distribution of this combined statistic Z i,c .
The exact distribution of Z i,c is not normally distributed (Ker 2001;Nadarajah and Kotz 2008). However, our previous work (Li et al. 2015) has illustrated that Z i,c approximately follows a skewed normal distribution, and the variance of statistics Z i,d and Z i,v close to 1 greatly reduces the extent of the skewness of the normal distribution (Ker 2001;Nadarajah and Kotz 2008). Then empirical Bayes method can shrink this combined statistic (Z i,c ) to a smoother estimate ( γ i, c ). In brief, this new statistic maximizes the effects of weighted genes on multiple traits, including genes that are directly acting on a quantitative response instead of the disease trait. While calculating empirical Bayes estimator, selection bias is inevitable when the number of features (weighted genes) is large. A Bayesian approach is known to offer smoother estimates, and empirical Bayes method (Efron 2009) can greatly simplify the calculation of Bayesian estimates while helping to ensure that the estimates reflect most causal weighted genes' effects. For any pair (Z, γ), the Bayesian estimate γ is obtained by the first derivative of the cumulant generating function of Z (Efron 2009, Remark E and F), Note that f (⋅) estimates the marginal density of Z; K(⋅) denotes the log of the marginal density function of Z; σ 2 is the variance estimate of the combined gene score; K ′ (⋅) is the numerical derivative of K(⋅). γ is a numerical approximation of the empirical Bayes estimates.
After ranking the estimates of all weighted genes, we select a subset of genes with large effects in the prediction classification, where genes with the largest weighted EB estimates ( γ i, c ) on multiple traits are collected in a set I g .
The covariate variables, such as Age and Smoking status, may play a significant role in predicting disease risk. Obviously, age is a continuous variable, then we can directly adopt z or t test to measure its effect on multiple phenotype traits. In particular, we calculate the smoking proportion among n subjects (Simulation data: p ≈ 0.25, np > 10). This property guarantees that z test is also appropriate to detect the smoking effect on multiple traits. Hence, Sections 2.2.1 and 2.2.2 can be successfully applied to these covariates. According to Equation (7), the combined statistic Z m,c is calculated to maximize the effect of the mth covariate (m = 1, … , M) on multiple traits. The relative weighted EB estimator ( γ m, c ) is achieved through shrinking the combined statistic to the smoother estimate. We define a matrix ( W 1 , …, W T ) including all standardized weighted genes ( G 1 , …, G N ) and the standardized covariates The final weighted empirical Bayes prediction model is defined by, where I is a set of most important genes and covariates selected from two traits. W t is the tth (t = 1, … ,T) standardized feature.

Simulation study
GAW 17 sequencing data (Almasy et al. 2011) is simulated from a complex model, where various deleterious genetic variants from some particular biological pathways (e.g., VEGF) are selected to capture the risk of cardiovascular disease. This data set is made by combining real and simulated data. 1000 Genomes Project (http://www.1000genomes.org) provides a real exome sequencing data (genotype data), and this real genotype data is used to simulate the disease trait (D) and the related quantitative traits (e.g., Q 1 ). In fact, the trait D is simulated by a liability threshold model (its top 30% is declared illness) which is controlled by the latent liability and multiple quantitative traits (Almasy et al. 2011, Equation 1), and nine genes are selected to determine the variation of Q 1 (Almasy et al. 2011, Table 1). Since the simulation model just includes the additive genetic effects, the mean level of a quantitative trait is positively associated with the variants. Moreover, GAW 17 data includes 697 unrelated individuals and 24,487 SNVs where a large number of rare SNVs are widely distributed across the genome. It also generates 200 replicates by changing individuals' smoke status in the simulation model (Almasy et al. 2011), and each replicate includes 697 individuals. In particular, around 87% of SNVs are rare SNVs with minor allele frequencies less than 0.05, and 74% of SNVs are extremely rare with minor allele frequencies less than 0.01. More importantly, the annotation information of SNVs is provided, so the Madsen and Browning method (Madsen and Browning 2009) is applied to the synonymous SNVs and nonsynonymous SNVs, separately. It results in 3205 synonymous genes and 3205 non-synonymous genes, and the final weighted gene score is calculated from Equation (3). In this study, our new weighted empirical Bayes (WEB m ) prediction model is developed to maximize the effects of weighted genes on a disease trait (D) and a quantitative trait (Q 1 ). The performance of this method is compared with those of the weighted EB (WEB) method based on a single binary trait, empirical Bayes (EB) method, Logistic Regression (LR) and Random Forests (RF: Breiman, 2001).
One key step in developing a risk prediction model is the feature selection, where genes are ranked by their importance to the disease. When we sort genes by their impact, the performances of various methods will be compared by their ranking lists of genes. We expect the genes selected by the simulation model to possess larger effects, and occupy top positions in the gene list. These selected genes are called "true genes" in this study. Table 1 summarizes top 10 features of five classifiers, including WEB m , WEB, EB, Logistic regression, and Random Forests. It has been demonstrated that one true gene (FLT1) is commonly inferred from four methods (WEB m , WEB, EB and LR). In addition, the proposed new method (WEB m ) could further identify two more true genes, KDR and PTK2B. As we discussed above, this new weighted empirical Bayes method treats synonymous genes and non-synonymous genes unequally, then more true genes can be detected to better predict the disease trait. Specifically, the effects of non-synonymous FLT1 and KDR are more important than those of synonymous genes. For example, the weight of non-synonymous FLT1 (a i = 66%) is around two times as large as that of synonymous FLT1 (1−a i = 34%). The weighted KDR gene also upweights the effect of nonsynonymous change (56%). According to the simulation model, KDR is strongly associated with the quantitative trait Q 1 . Since the new method provides a way to capture causal genes related to Q 1 in addition to the disease status, it is certainly seen that more true genes may be discovered. Besides, Table 1 summarizes the number of SNVs falling into different allele frequency categories, i.e. those with MAF ≥0.05, with MAF between 0.01 and 0.05, and those with MAF <0.01. It is clearly seen that SNVs within the validated genes have a larger proportion of rare variants (FLT1: 91%; KDR: 94%; PTK2B: 61%). Moreover, all methods demonstrate that two covariates (age and smoking status) play more important roles than genes.
When the top feature list is expanded to rank top 50 genes, the comparison of all methods is summarized in Table 2. WEB m and WEB are capable of detecting one more true gene individually (WEB m : PIK3C3; WEB: PIK3C2B). Both genes upweight the non-synonymous change (PIK3C3: a i = 64%; PIK3C2B: a i = 72%) to calculate the weighted gene score. Specifically, WEB m and WEB are able to detect the functional genetic impacts on multiple traits and a single trait respectively, then it may be the reason why these two methods can successfully identify more true genes. In contrast, the logistic regression only identifies one true gene, FLT1, among top 50 genes. It is clearly seen that WEB m combines the annotation information with the multiple trait design to discover the largest number of true genes.
Since GAW 17 data generates 200 replicates through changing individuals' smoke status (Almasy et al. 2011), we can run Cross Validation procedure to calculate the prediction accuracy. These 200 replicates are split into the training and test data, where 100 replicates are randomly selected as the training data to develop the prediction rule, with the remaining 100 replicates as the test data to estimate the prediction accuracy. We repeat this procedure five times to estimate the mean and standard deviation of the prediction accuracy for each method. Table 3 compares areas of under ROC curve (AUC) for different methods, and summarizes their prediction performances. The calculation of AUC is based on two models, where model I considers both genetic and non-genetic features to predict the disease risk, and model II only considers genes. It is clearly seen that WEB m significantly increases the prediction accuracy in two model settings (0.91(SD:0.020); 0.68(SD:0.016)), over other models (EB: 0.76(0.0102) and 0.6(0.0141); WEB: 0.80(0.015) and 0.64(0.0183)). The improvements can be more readily demonstrated through drawing the ROC curve for each method (Figures 1 and 2). Furthermore, the new model is compared to another classifier (Random Forests). Specifically, WEB m still performs better than random forest in terms of prediction accuracy (Table 4).

Real data analysis
The genetic analysis workshop (GAW) 19 (Blangero et al. 2016;Engelman et al. 2016) offers a whole-exome sequence data for the unrelated Mexican American samples. This study focuses on the identification of novel  genetic effect on a complex disease, hypertension disease. The data includes multiple phenotype traits, the hypertension diagnosis (HTN: yes = 1/no = 0) denoted as a binary trait and the systolic blood pressure (SBP) and diastolic blood pressure (DBP) thought as the quantitative traits. In addition, an important covariate, age, is provided for more accurately predicting the risk of disease. With the aid of the UCSC genome annotation database (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/), we classify around 412,781 SNVs into synonymous SNVs and non-synonymous SNVs. The approach (Madsen and Browning 2009) is applied to two category variants separately, then 6541 synonymous genes and 5727 non-synonymous genes are collected. In this real sequence data, 1851 unrelated individuals are involved in the analysis after removing the missing observations of multiple phenotypes and one covariate. Two prediction approaches, WEB m and WEB are applied to this gene-level sequence data to evaluate their performances. 10-fold cross-validation procedure is applied to achieve the prediction accuracy. In brief, the proposed model is to fit the training set (1666 subjects), then the remaining 185 subjects is the test set to assess the prediction accuracy. The same procedure repeats 10 times, and the final accuracy is calculated as the average of these 10 prediction accuracies. Table 5 summarizes top 10 important features selected by three methods (our proposed methods: WEB m , Random Forest and Logistic Regression). Generally, the covariate, age, is commonly selected by three methods, and its ranking position is number one. It indicates that age is strongly associated with the hypertension disease. More importantly, the gene PBX3 is successfully detected by the new method WEB m and logistic regression, and this gene may be important to the diagnosis of hypertension disease. When we extend the feature search to top 100 important factors (Table 6), another two genes, ERBB2IP and IGSF9B, are detected by three methods. If we compare the common genes between two methods, more common genes are detected. For example, Supplementary Table 1 Table 1). Table 7 and Figure 3 summarize the prediction accuracy results. Table 7 compares the prediction accuracy (AUC) between WEB m and WEB. It is clearly seen that WEB m can provide a larger prediction accuracy (AUC: 0.78 with SD: 0.029), followed by WEB (AUC: 0.75 with SD: 0.034). The improvement of WEB m in AUC is most likely due to the fact that WEB m can maximize the functional effects of genes from multiple traits (HTN and SBP). In the real experiment, WEB m works well when two category variants have different impacts on the same disease trait or when a quantitative trait is highly related to the binary disease trait. Figure 3 further illustrates the comparison of the ROC curves between WEB m and WEB under 10-fold cross validation procedure. This figure shows that WEB m outperforms WEB in terms of prediction accuracy. In summary, we believe our proposed method WEB m is an efficient prediction tool for the large sequencing data.

Discussion
With the development of the sequencing technology, scientists can directly identify the disease-causal variants; however, the majority of variants are rare variants with low frequency, which is difficult to be detected in a sequencing data analysis. A few approaches (Li and Leal 2008;Luo et al. 2011;Madsen and Browning 2009;Price et al. 2010;Wu et al. 2011) are applied to strengthen the signals through collapsing all variants in each gene, especially for rare variants. It is increasingly recognized that the bioinformatic annotation of variants plays a crucial role in interpreting the functional impact of a variant. Jointly analyzing multiple phenotype traits may better explore the complicated functions of genetic variants. Hence, incorporating the annotation information in a multi trait model may improve the statistical power to detect the causal genes. Furthermore, it is commonly observed that the estimates of genes are inflated in a sequencing analysis due to the imbalance between a small number of subjects and a large amount of genes. The empirical Bayes (Efron 2009) method provides a natural choice to correct this bias. In this study, we develop a new method to incorporate the SNV annotation in a multiple trait design. Our proposed prediction rule is capable of exploring the functional genetic effects on multiple traits. While annotating the genetic variants, the variants are classified into two classes, synonymous SNV and non-synonymous SNV. It is well-known that non-synonymous variants are more likely related to the pathogenesis of complex diseases (Sun and Yu 2019). Here, the distinctions of two classes are managed by a weighted gene score (Section 2.1). In reality, two types of variants have strong correlation, but this weighted gene method is based on the condition that two classes are independent. Therefore, developing a method to concern this correlation may be a new goal. Besides, SNV annotation consists of more than two classes. How to incorporate all annotation classes in a risk prediction model is still challenging.
The proposed method can be extended to more complex sequencing data with multiple quantitative traits and one binary trait. Suppose there are V number of quantitative traits (Q 1 , Q 2 , … , Q V ) and one binary trait (D) in a sequencing data. When the analysis focuses on the detection of the ith weighted gene effect on multiple traits, three steps are involved here. First, Section 2.2.1 is applied to a disease trait (D), and the estimator (Z i,d ) is  obtained to measure the ith weighted gene effect on a binary trait. Next, we utilize Section 2.2.2 method to multiple quantitative traits separately, and the corresponding gene effects on various quantitative traits are estimated by Z i, v1 , Z i, v2 , … , Z i, vV respectively. Finally, Equation (7) will be changed to measure the largest genetic effect as follows, It is clearly seen that the final combined statistic Z i,c captures the maximum genetic effect on multiple quantitative traits and one binary trait. However, not all quantitative traits can enrich the gene signal in this study. In fact, only the quantitative traits which are positively correlated with the binary trait can greatly strengthen this genetic signal. In our future study, we plan to develop a better way to incorporate various quantitative traits in a binary trait based risk prediction model for further improving the prediction accuracy.
In this approach, the Z statistic is derived from the transformed T statistic (Equation (5)). But this statistic is inferred from a simple probability model, where we assume a gene vector G = (G 1 , G 2 , … , G N ), with each G i approximated by a normal distribution. However, this strategy is based on a balance design where the number of subjects is the same between cases and controls. In real practice, many sequencing data are not balanced, which may largely affect the prediction performance. In the future study, some methods should be proposed to adapt this issue.
It is growingly recognized that the disease pathogenesis is a very complicated process, especially for the interaction between genetic factors and environmental factors. If both genetic and non-genetic predictors are available, a risk prediction may be improved by considering this interaction. Here, two sequencing data are The ranking positions of three common genes PBX, ERBBIP and IGSFB detected from three classifiers; WEB m : the Weighted EB method for two traits; RF: Random Forests; LR: Logistic Regression; "-": the position of one gene is outside of top  important features. analyzed, both of them have demographic and environmental covariates, which may provide more information about subjects, in addition to genetic information. Although our proposed approach incorporates both types of information, we have not considered potential interactions between genetic and non-genetic features, which may improve the statistical power to detect causal genetic variants. Further work is needed to include gene-environment and gene-gene interactions in the risk prediction model.