
# Open Physics

### formerly Central European Journal of Physics

Editor-in-Chief: Seidel, Sally

Managing Editor: Lesna-Szreter, Paulina

IMPACT FACTOR 2018: 1.005

CiteScore 2018: 1.01

SCImago Journal Rank (SJR) 2018: 0.237
Source Normalized Impact per Paper (SNIP) 2018: 0.541

ICV 2017: 162.45

Volume 15, Issue 1

# Multi-task feature learning by using trace norm regularization

Zhang Jiangmei (corresponding author)
• Department of Automation, University of Science and Technology of China, Hefei 230027, China
• School of Information Engineering, Southwest University of Science and Technology, Mianyang 621010, China

Yu Binfeng / Ji Haibo / Kunpeng Wang
• School of Information Engineering, Southwest University of Science and Technology, Mianyang 621010, China

Published Online: 2017-11-10 | DOI: https://doi.org/10.1515/phys-2017-0079

## Abstract

Multi-task learning can exploit the correlation among multiple related machine learning problems to improve performance. This paper considers applying the multi-task learning method to learn a single task. We propose a new learning approach, which employs the mixture of experts model to divide a learning task into several related sub-tasks, and then uses the trace norm regularization to extract a common feature representation of these sub-tasks. A nonlinear extension of this approach using kernels is also provided. Experiments conducted on both simulated and real data sets demonstrate the advantage of the proposed approach.

PACS: 02.50.Sk; 02.60.Pn; 02.70.Hm

## 1 Introduction

There are many methods to define the correlation among multiple tasks [7, 8, 9, 10, 11, 12]. One important way is to assume that the tasks share common features: the same subsets of features are chosen to represent the correlation of input and output in each task. L2,1 norm regularized minimization, a kind of group Lasso problem, is commonly used to find the shared features among different learning tasks [13, 14]. Another way to describe the correlation among multiple tasks is to assume that the linear predictors of different tasks lie in a low rank subspace. Argyriou et al. proposed a convex multi-task feature learning formulation to learn a common sparse representation across tasks [15]. Their formulation is essentially equivalent to employing the trace norm as a regularizer, which is introduced to replace the nonconvex rank function. The trace norm (also known as the nuclear norm) of a matrix is the sum of its singular values, so trace norm regularization applies an absolute shrinkage to the singular values of the coefficient matrix and drives many of them to zero [16, 17]. The trace norm regularization is a promising heuristic approach to find the low rank structure of the coefficients across different tasks [18].

Many machine learning methods use the divide and conquer strategy to deal with complex classification or regression problems: a complex problem is divided into multiple simpler subproblems. Motivated by the success of multi-task learning techniques in learning multiple tasks, in this paper we attempt to use the multi-task learning method to improve the performance of the divide and conquer strategy. The divide and conquer strategy often divides the input space into many local regions, and the training data in each region may be insufficient. All subproblems arise from studying the same target, so there exists some intrinsic relatedness among them. Therefore, we can utilize the multi-task learning approach to improve the generalization performance of the divide and conquer strategy.

In order to fulfill a single task through the multi-task learning method, we use the mixture of experts (MOE) [19] method to divide a complex machine learning problem into subproblems. MOE is a probabilistic tree-based model, using a mixture of conditional density models to approximate the conditional distribution of the output. The mixing coefficients of MOE depend on the inputs and are determined by the gating functions, and the local conditional density models are called experts. A comprehensive survey of the mixture of experts can be found in Ref. [20]. Recently, many novel MOE methods have been proposed to handle high dimensional data [21, 22, 23]. In this paper, a new trace norm regularized MOE model is proposed. We choose the trace norm regularization to extract the connection among expert models and gating functions. Trace norm regularization is a feature learning technique, and the trace norm regularized MOE model can uncover the shared underlying characteristics of the training input. Different from previous studies on the MOE model, which often aim to select a set of sparse features [22], the trace norm regularized MOE model learns a small set of underlying characteristics that can be represented as linear combinations of the original features. Moreover, the trace norm regularization allows us to work in a kernel space and handle high dimensional (or infinite dimensional) features. In this respect, the trace norm regularized MOE model is more flexible than the MOE with sparse feature selection [22].

This paper is organized as follows. Section 2 reviews the standard framework of the MOE model and presents the proposed trace norm regularized MOE model; the optimization procedure and the kernel extension are also provided there. In section 3, we demonstrate the performance of the proposed method through experiments conducted on both synthetic and real data sets. We discuss future research in section 4.

## 2 Mixture of experts model with trace norm regularization

In this section, we first briefly present the basic elements of MOE model. Then we present the new MOE model with the trace norm regularization and give its extension with kernel.

## 2.1 Mixture of experts model

Let (xi, yi), i = 1, 2,…, N denote the N pairs of input/output data (x, y), where yi is a response variable and xi ∈ Rp is a p-dimensional vector. The MOE model aims to estimate the conditional probability distribution: $p(y|x)$(1)

The MOE model approximates this probability distribution through a mixture of multiple local distributions, called experts. The MOE model with K experts can be expressed as: $p(y|x)=\sum_{k=1}^{K}\pi_k(x)p_k(y|x)$(2)

In the MOE model, the posterior conditional distribution is decomposed as a weighted combination of K expert models. The weight function πk(x), called the gating function, satisfies $\pi_k(x)>0,\ \sum_{k}\pi_k(x)=1$. The gating functions divide the input data into multiple regions, and pk(y|x) is the expert model, which models the data assigned by the gating functions. Different from the mixture of Gaussians or mixture of linear regressions model, MOE allows the weight coefficients of each sample to differ: the mixture proportions are allowed to depend on the input x.

The MOE model often uses the softmax function as the gating function: $\pi_k(x)=\frac{\exp(r_k^Tx)}{\sum_{j=1}^{K}\exp(r_j^Tx)}$(3)

where $r_i\in R^p$, i ∈ {1, 2, …, K}.

The expert models are also often linear, such as generalized linear models. The density function pk(y|x) can be expressed as: $p_k(y|x)=h(y,w_k^Tx)$(4)

where $w_k\in R^p$, k ∈ {1, 2, …, K} are the parameters of the expert models. For example, we can use the logistic function for two-class classification: $p_k(y|x)=h(y,w_k^Tx)=\frac{1}{1+\exp(-y\,w_k^Tx)}$(5)

where y ∈ {1, −1}.
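The conditional distribution (2) with softmax gates (3) and logistic experts (5) can be sketched in a few lines of NumPy. This is a minimal illustration; the function and variable names are ours, not from the paper:

```python
import numpy as np

def moe_predict_proba(x, W, R):
    """p(y=1|x) for an MOE with softmax gating (eq. 3) and
    logistic experts (eq. 5). W and R are p x K parameter matrices."""
    scores = R.T @ x                              # gating scores r_k^T x
    scores -= scores.max()                        # numerical stability
    pi = np.exp(scores) / np.exp(scores).sum()    # gates pi_k(x), sum to 1
    experts = 1.0 / (1.0 + np.exp(-(W.T @ x)))    # expert probs p_k(y=1|x)
    return float(pi @ experts)                    # eq. (2): sum_k pi_k p_k
```

With a single expert (K = 1) the gate weight is 1, so the output reduces to the expert's logistic probability.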

The expectation-maximization (EM) algorithm can be used to train an MOE model. Introducing a K-dimensional binary random variable $z=(z_1,z_2,\ldots,z_K)\in\{0,1\}^K$, the MOE model (2) can be expressed as: $p(y|x)=\sum_{z}p(z|x)p(y|z,x)$(6)

z has a 1-of-K representation in which one particular element zk equals 1 and all other elements equal 0. In (6), the distribution over z is specified in terms of the gating weights, such that $p(z_k=1|x)=\pi_k(x)$(7)

and $p(y|z,x)=\prod_{k=1}^{K}h(y,w_k^Tx)^{z_k}$(8)

Every latent variable zi corresponds to a training pair (xi, yi). We collect the training inputs in a matrix $X\in R^{p\times N}$ whose columns are xi, the training outputs in a vector $Y\in R^N$, and the latent variables in a matrix Z whose columns are zi.

Let θ denote {wk, rk}k=1,…,K. The EM algorithm alternately optimizes the following objective function to obtain the maximum likelihood estimate of θ: $F(\theta,q)=\sum_{Z}q(Z)\ln\frac{P(Y,Z|X,\theta)}{q(Z)}$(9)

In the E-step, θ is fixed and the posterior distribution of the latent variable zi is estimated by: $q(z_i)=\arg\max_{q}F(\theta,q)=P(z_i|x_i,y_i,\theta)$(10)

Specifically, we have $q(z_{ni}=1)=\alpha_{ni}=\frac{p(y_n|x_n,w_i,z_{ni})\,p(z_{ni}|x_n,r_i)}{\sum_{j=1}^{K}p(y_n|x_n,w_j,z_{nj})\,p(z_{nj}|x_n,r_j)}$(11)

In the M-step, we optimize θ to maximize the expected complete-data log likelihood over the posterior distribution of the latent variables estimated in the E-step: $\theta=\arg\max_{\theta}F(\theta,q)=\arg\max_{\theta}L(\theta)=\arg\max_{\theta}\sum_{n=1}^{N}\sum_{i=1}^{K}\alpha_{ni}\left[\log P(y_n|x_n,z_{ni}=1,\theta)+\log P(z_{ni}=1|x_n,\theta)\right]$(12)

The EM algorithm repeats the E-step and M-step until either the parameters θ or the log likelihood converges.
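The EM loop above can be sketched as follows. This is a toy NumPy illustration under stated simplifications: the exact M-step maximization (12) is replaced by a few gradient ascent steps for brevity, biases are omitted, and all names are ours:

```python
import numpy as np

def softmax(S):
    """Row-wise softmax with the usual max-shift for stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def em_moe(X, y, K, n_iter=50, lr=0.1):
    """Toy EM for an MOE with softmax gates and logistic experts.
    X: N x p inputs, y in {-1, +1}. Returns expert weights W (p x K),
    gating weights R (p x K), and the final responsibilities alpha."""
    N, p = X.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(p, K))      # expert parameters w_k
    R = rng.normal(scale=0.1, size=(p, K))      # gating parameters r_k
    for _ in range(n_iter):
        # E-step: responsibilities alpha_nk, eq. (11)
        pi = softmax(X @ R)                                   # N x K gates
        lik = 1.0 / (1.0 + np.exp(-y[:, None] * (X @ W)))     # p(y_n|x_n,w_k)
        alpha = pi * lik
        alpha /= alpha.sum(axis=1, keepdims=True)
        # M-step (approximate): ascend the expected log likelihood (12)
        grad_W = X.T @ (alpha * y[:, None] * (1.0 - lik))     # expert gradient
        grad_R = X.T @ (alpha - pi)                           # gating gradient
        W += lr * grad_W / N
        R += lr * grad_R / N
    return W, R, alpha
```

A real implementation would run an inner solver to convergence in each M-step rather than a fixed number of gradient steps.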

## 2.2 Mixture of experts model with the trace norm regularization

Feature extraction is commonly used in machine learning to improve model accuracy when the training samples are insufficient. Some feature extraction methods obtain the underlying characteristics by estimating a matrix that projects the original input features into a low dimensional subspace. Formally, the extracted input feature can be expressed as $F^Tx$, where $F\in R^{p\times s}$, $s\le p$. Subsequently, a regression or classification model is built on the extracted feature $F^Tx$. Combining feature extraction with the MOE model, and using projection matrices FG and FE for the gating functions and the expert models respectively, we obtain $\pi_k(x)=P(z_k=1|x,r_k)=p(z_k=1|F_G^Tx,g_k)$(13) $h(y,w_k^Tx)=P(y|x,w_k)=P(y|F_E^Tx,h_k)$(14)

where $g_k\in R^s$ and $h_k\in R^s$, k=1,…,K. The model coefficients wk, rk are substituted by wk = FEhk and rk = FGgk. Let $W=(w_1,\ldots,w_K)\in R^{p\times K}$, $R=(r_1,\ldots,r_K)\in R^{p\times K}$, $H=(h_1,\ldots,h_K)$ and $G=(g_1,\ldots,g_K)$. When the projection matrices FG and FE are known, we can use the new inputs $F_G^Tx$ and $F_E^Tx$ to train a new MOE model. In this paper, we attempt to learn the projection matrices and the MOE model built on the extracted features simultaneously. Following the work in [15, 24], we use the trace norm regularization for simultaneous feature extraction and model learning. As the predictions depend only on FEH and FGG, we can add Frobenius norm regularizers in the above EM algorithm to control the magnitudes of FE, FG, H and G. Adding the regularization terms does not change the posterior distribution of the latent variable z, so the E-step is unchanged. In the M-step, the optimization problem is reformulated as: $\max_{F_E,F_G,H,G}L(F_EH,F_GG,X,Y)-C_G\left(\tfrac{1}{2}\|F_G\|_F^2+\tfrac{1}{2}\|G\|_F^2\right)-C_E\left(\tfrac{1}{2}\|F_E\|_F^2+\tfrac{1}{2}\|H\|_F^2\right)$(15)

where CG > 0, CE > 0 are the regularization parameters. The optimization problem (15) is nonconvex. However, following Ref. [24], the non-convex optimization problem can be converted into a trace norm regularization problem.

The trace norm ∥⋅∥tr of a matrix is defined as the sum of its singular values: $\|W\|_{tr}=\sum_{i}\gamma_i$(16)

where γi is the i-th singular value of W.
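Definition (16) translates directly into code (an illustrative snippet; the function name is ours):

```python
import numpy as np

def trace_norm(W):
    """Trace (nuclear) norm of W: the sum of its singular values, eq. (16)."""
    return float(np.linalg.svd(W, compute_uv=False).sum())
```

For example, the trace norm of the 3 × 3 identity matrix is 3, since all three of its singular values equal 1.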

According to [25], the trace norm has the following property: $\|W\|_{tr}=\min_{W=FG}\frac{1}{2}\left(\|F\|_F^2+\|G\|_F^2\right)$(17)

The problem (15) can then be rewritten as: $\max_{W,R,d}\sum_{n=1}^{N}\sum_{i=1}^{K}\alpha_{ni}\left[\log P(y_n|x_n,z_{ni}=1,w_i)+\log P(z_{ni}=1|x_n,r_i)\right]-C_E\|W\|_{tr}-C_G\|R\|_{tr}$(18)

Trace norm is the convex envelop of matrix rank [26]. Therefore, the trace norm regularization is often used in the multi-task learning and matrix completion to obtain low rank solutions. The idea of using the trace norm to extract the features of the MOE model comes from multi-task learning. In multi-task learning, the trace norm regularizer is often used to gain a few features common across the tasks. The MOE model divides the data into multiple regions, and the data in each region may be insufficient to train the local expert model. Since the multi-task learning can improve the generalization performance when only limited training data for each task are available, we can use the multi-task learning to improve the performance of the expert model and the gating functions in MOE model. Consequently, with the aid of MOE model, we can apply the multi-task learning technique in a single task learning problem.

The optimization problem (18) can be divided into two independent trace norm regularization problems: $\min_{W}\sum_{n=1}^{N}\sum_{i=1}^{K}-\alpha_{ni}\log h(y_n,w_i^Tx_n)+C_E\|W\|_{tr}$(19)

and $\min_{R,d}\sum_{n=1}^{N}\sum_{i=1}^{K}-\alpha_{ni}\log\frac{\exp(r_i^Tx_n+d_i)}{\sum_{j=1}^{K}\exp(r_j^Tx_n+d_j)}+C_G\|R\|_{tr}$(20)

When $h(y_n,w_i^Tx_n)$ is log-concave, the optimization with trace norm regularization is a convex but non-smooth optimization problem. It can be formulated as a semi-definite program (SDP) and solved by existing SDP solvers such as SDPT3 [26]. Recently, many efficient algorithms, such as block coordinate descent [15], the accelerated proximal gradient method (APG) [16] and ADMM [27], have been developed to solve the trace norm minimization problem. In this paper, we use the APG algorithm due to its fast convergence rate.

In summary, the MOE model with trace norm regularization can be trained by iterating the following two steps:

E-step: evaluate the output of the gating functions and the expert models by using the current parameters, and then evaluate the posterior probability of latent variables using eq. (11).

M-step: update the values of the parameters wi and ri by solving the two trace norm minimization problems (19) and (20).
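The key computation inside the APG solver for (19) and (20) is the proximal operator of the trace norm, which soft-thresholds the singular values [16]. A minimal sketch (the function name is ours):

```python
import numpy as np

def prox_trace_norm(W, tau):
    """Proximal operator of tau * ||.||_tr: singular value soft-thresholding.
    Each singular value is shrunk by tau and clipped at zero, which is what
    drives the solution toward low rank."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Inside APG, this operator is applied after a gradient step on the smooth negative log likelihood term of (19) or (20).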

## 2.3 Nonlinear extension with kernel

The MOE model with linear experts can directly handle nonlinear data. However, when the feature space is very high dimensional (or infinite dimensional), working in the original feature space is computationally expensive. The above method can be extended to work on the kernel matrix. Following the representer theorem of trace norm regularization [15], we obtain the following theorem:

#### Theorem 2.1

If W and R are the optimal solutions of the MOE model, then wk, rk, k ∈ {1, 2, …, K} can be expressed as: $w_k=\sum_{i=1}^{N}u_{ki}x_i,\quad r_k=\sum_{i=1}^{N}v_{ki}x_i$(21)

where uki and vki are linear combination coefficients for the k-th expert and gating function. The proof of this theorem is similar to that in Ref. [15]. According to Theorem 2.1, the optimal W and R can be represented as: $W=XU_0,\quad R=XV_0$(22)

where $U_0(i,j)=u_{ji}$ and $V_0(i,j)=v_{ji}$. Let ζ = span{xi, i=1, 2,…, N}. We consider a matrix P whose columns form an orthonormal basis of ζ. According to (22), there exist matrices U1 and V1 such that $W=PU_1,\quad R=PV_1$(23)

As $P^TP=I$, we have $\|W\|_{tr}=\operatorname{trace}\big((PU_1U_1^TP^T)^{1/2}\big)=\operatorname{trace}\big((U_1U_1^T)^{1/2}\big)=\|U_1\|_{tr},\qquad \|R\|_{tr}=\|V_1\|_{tr}$(24)

Substituting (24) into the objectives of (19) and (20) yields the following objective functions: $\min_{U_1}\sum_{n=1}^{N}\sum_{i=1}^{K}-\alpha_{ni}\log h(y_n,u_{1i}^TP^Tx_n)+C_E\|U_1\|_{tr}$(25) $\min_{V_1,d}\sum_{n=1}^{N}\sum_{i=1}^{K}-\alpha_{ni}\log\frac{\exp(v_{1i}^TP^Tx_n+d_i)}{\sum_{j=1}^{K}\exp(v_{1j}^TP^Tx_n+d_j)}+C_G\|V_1\|_{tr}$(26)

where u1i and v1i are the i-th columns of U1 and V1. The problems (25) and (26) can be regarded as using the modified inputs $B=(\beta_1,\beta_2,\ldots,\beta_N)=P^TX$. As P consists of a basis of ζ, it can be expressed as: $P=XR$(27)

where this R denotes an N × N coefficient matrix (distinct from the gating parameter matrix). Thus $B=P^TX=R^TX^TX=R^TK$, where K is the Gram matrix: $K=\begin{pmatrix}\langle x_1,x_1\rangle&\cdots&\langle x_1,x_N\rangle\\\vdots&\ddots&\vdots\\\langle x_N,x_1\rangle&\cdots&\langle x_N,x_N\rangle\end{pmatrix}$(28)

If the matrix R is known, the above trace norm regularized MOE model depends only on inner products of samples. When the input feature x is mapped into a kernel feature space by a nonlinear map φ(x), we can use the kernel function to evaluate the inner products in that space. The matrix R is itself estimated from the Gram matrix K: compute the eigendecomposition of the N × N Gram matrix $K=UDU^T$, where D is the diagonal matrix containing the eigenvalues of K and U is the matrix whose columns are the corresponding eigenvectors; then $R=UD^{-1/2}$. R can also be computed by Gram–Schmidt orthogonalization [15].

To make a prediction for a new sample x, the expert models only need to evaluate $w_i^Tx=(XRu_{1i})^Tx=(Ru_{1i})^T(X^Tx)=(Ru_{1i})^T\kappa(x)$, and the gating functions only need to evaluate $r_i^Tx=(Rv_{1i})^T\kappa(x)$, where $\kappa(x)=(k(x_1,x),k(x_2,x),\ldots,k(x_N,x))^T$. In summary, using the above procedure, we can build and make predictions with a trace norm regularized MOE model without accessing the original features.
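The kernel-space preprocessing described above (eigendecompose the Gram matrix, form R = UD^(-1/2), and build the modified inputs B = R^T K) can be sketched as follows. Names are ours, and dropping near-zero eigenvalues is an implementation choice for numerical stability, not something discussed in the paper:

```python
import numpy as np

def kernel_inputs(K, eps=1e-10):
    """Eigendecompose the Gram matrix K = U D U^T, form R = U D^(-1/2),
    and return R together with the modified inputs B = R^T K, whose
    columns play the role of P^T x_n in problems (25)-(26)."""
    evals, U = np.linalg.eigh(K)
    keep = evals > eps                        # discard numerically null directions
    R = U[:, keep] / np.sqrt(evals[keep])     # R = U D^(-1/2) on the kept block
    return R, R.T @ K
```

Since P = XR has orthonormal columns, R^T K R should equal the identity; this makes an easy sanity check.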

## 3 Experiment

In this section, we present numerical experiments on a synthetic data set and several real data sets to demonstrate the performance of the proposed method. We studied the two-class classification problem in these experiments and used logistic regression models as the local expert models.

The proposed trace norm regularized MOE model is compared with the L1 norm regularized MOE model [22], support vector machines (SVM), linear logistic regression with L1 norm regularization, SVM ensemble with bagging, and AdaBoost using decision trees as weak classifiers. The parameters of these methods are selected with 3-fold cross-validation using the grid search.

## 3.1 Synthetic data

To generate the synthetic data set, we first construct 2-dimensional positive and negative samples as shown in Figure 1, where the positive samples are randomly drawn from a 2-dimensional Gaussian distribution with zero mean and covariance cov=diag(4,4). The i-th negative sample is generated as (6 cos(2πui) + vi, 6 sin(2πui) + wi), where ui, vi, and wi are drawn from the standard normal distribution. Then, we generate a 50 × 2 random orthogonal projection matrix to project the 2-dimensional samples into a 50-dimensional linear space. Finally, we append 50 noise features to the projected samples; the noises are zero-mean Gaussian with standard deviation 1. Consequently, we obtain 100-dimensional input data. The labels of these high dimensional data are determined by the labels in the original 2-dimensional feature space. We generate a total of 200 samples: 100 positive and 100 negative.
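For concreteness, the generation procedure above can be reproduced as follows (a sketch; the seed and function name are our own choices):

```python
import numpy as np

def make_synthetic(n_pos=100, n_neg=100, seed=0):
    rng = np.random.default_rng(seed)
    # positives: 2-D Gaussian, zero mean, covariance diag(4, 4)
    pos = rng.multivariate_normal([0.0, 0.0], np.diag([4.0, 4.0]), n_pos)
    # negatives: noisy ring (6 cos(2*pi*u) + v, 6 sin(2*pi*u) + w)
    u, v, w = rng.standard_normal((3, n_neg))
    neg = np.column_stack([6 * np.cos(2 * np.pi * u) + v,
                           6 * np.sin(2 * np.pi * u) + w])
    X2 = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    # random 50 x 2 projection with orthonormal columns, then 50 noise dims
    Q, _ = np.linalg.qr(rng.standard_normal((50, 2)))
    X = np.hstack([X2 @ Q.T, rng.standard_normal((n_pos + n_neg, 50))])
    return X, y
```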

Figure 1

2-dimensional positive and negative samples

Since the data are generated from 2-dimensional data, we can plot the separating hyperplane to illustrate the classification accuracy of the different methods. We use 75% of the samples as the training set to build the classifiers. Figure 2 shows the separating hyperplanes obtained by the L1 norm regularized logistic model, SVM with a Gaussian kernel, MOE with trace norm regularization, and the classic MOE model without regularization. In the MOE models, the number of experts is set to 10. In Figure 2, we use the following procedure to obtain the 2-dimensional separating hyperplane. First, we use the projection matrix that generated the 50-dimensional input features to project the points in the 2-dimensional plane into the 50-dimensional feature space. Then, we append 50 all-zero features to these projected points. Finally, we use the classifiers built by the different methods to decide the labels of the points in the 2-dimensional plane, and draw the separating hyperplane.

Figure 2

Separating hyperplane obtained by a) L1 norm regularized logistic regression, b) SVM, c) trace norm regularized MOE, d) classic MOE

Although we set the number of experts to 10, the trace norm regularized MOE model uses about four line segments to separate the samples successfully. As the synthetic data set is linearly inseparable, the logistic regression model cannot classify these data correctly. Meanwhile, SVM predicts many negative samples as positive due to the disturbance of the noise features. Figure 2 shows that the performance of the classic MOE on the high dimensional synthetic data is very poor; therefore, we do not evaluate the classic MOE model in the following experiments.

Next, we use 10 hold-out partitions to compute the average classification accuracy of these methods. For each partition, we first randomly select 50% of the samples as training samples and then use a 3-fold cross validation procedure on the training set to select suitable parameters for each method. We then evaluate the predictive accuracy on the remaining 50% of the samples. Table 1 shows the average classification error on the synthetic data set, where TMOE stands for the proposed trace norm regularized MOE model and RMOE stands for the L1 norm regularized MOE model.

Table 1

Average (std. deviation) classification error on synthetic dataset

The results in Table 1 show that AdaBoost obtains the best classification accuracy on the synthetic data set, and the results obtained by TMOE and RMOE are comparable with it. AdaBoost cannot extract linear combinations of features, but the MOE model can combine multiple linear models to describe the complex nonlinear relationship between input and output variables.

## 3.2 Real data sets

We test the performance of the proposed method on 4 real data sets: Ionosphere, Musk-1, LSVT, and Sonar. These data sets are taken from the UCI Machine Learning Repository and are 2-class classification problems. The main characteristics of each data set are described in Table 2.

Table 2

Detail of real datasets used for experiments

We again use 10 hold-out partitions to evaluate the average classification accuracy. For each partition, we select 50% of the samples as training samples and use the remaining samples as test samples. Table 3 shows the average classification errors on these real data sets.

Table 3

Average (std. deviation) classification error on real datasets

In the experiments, the SVM method obtains better results with the Gaussian kernel on the Ionosphere and Sonar data sets and with the polynomial kernel on the Musk-1 and LSVT data sets; therefore, SVM uses the Gaussian kernel and the polynomial kernel respectively on the four data sets. The kernel function used in TMOE is the same as that used in SVM. The comparison results in Table 3 show that the trace norm regularized MOE model generally performs better than the L1 norm regularized MOE model. Since the linear experts can preserve more common features among tasks, TMOE with linear experts obtains the best results on the higher dimensional data sets, such as Musk-1 (166-dimensional) and LSVT (309-dimensional); conversely, TMOE with a kernel obtains the best result on the lower dimensional, larger sample Ionosphere data set (34-dimensional, 351 samples). Regularized MOE models often perform better than linear models because they use a combination of multiple linear models to describe nonlinear relationships. The experiments on the real data sets demonstrate the good performance achieved by the proposed method.

## 4 Conclusions

In this paper, the trace norm regularization is introduced into the mixture of experts model to extract a common feature representation of the expert models and gating functions. The combination of MOE and trace norm regularization can improve the generalization performance of the MOE model. Moreover, the trace norm regularization allows us to handle high dimensional data more flexibly by working in the kernel space. The experiments on a synthetic data set and four real data sets demonstrate the superiority of the proposed method over the classic L1 norm regularized MOE model. However, the experimental results also show that the performance of the proposed method does not always match that of classic ensemble algorithms (such as Bagging-SVM and AdaBoost) in small sample or lower dimensional cases. In the future, the kernel selection approach will be optimized to improve the classification performance, and the combination of Bayesian multi-task learning techniques with the MOE model will be considered to avoid the cross-validation computation in the proposed method.

## Acknowledgement

This work is supported in part by National Natural Science Foundation of China (Grant No. 61501385), Science and Technology Planning Project of Sichuan Province, China (Grant Nos. 2016JY0242, 2016GZ0210), and Foundation of Southwest University of Science and Technology (Grant Nos. 15kftk02, 15kffk01).

## References

[1] Pan S.J., Yang Q., A survey on transfer learning, IEEE Trans. Knowl. Data Eng., 2010, 22, 1345–1359.

[2] Kshirsagar M., Carbonell J., Klein-Seetharaman J., Multitask learning for host–pathogen protein interactions, Bioinformatics, 2013, 29, 217–226.

[3] Bickel S., Bogojeska J., Multi-task learning for HIV therapy screening, Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008, 56–63.

[4] Yuan X.T., Yan S., Visual classification with multi-task joint sparse representation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, 2010, 3493–3500.

[5] Chapelle O., Shivaswamy P., Multi-task learning for boosting with application to web search ranking, Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, USA, 2010, 1189–1198.

[6] He J.R., Zhu Y.D., Hierarchical multi-task learning with application to wafer quality prediction, Proceedings of the 12th IEEE International Conference on Data Mining, Brussels, Belgium, 2012, 290–298.

[7] Caruana R., Multitask learning, Mach. Learn., 1997, 28, 41–75.

[8] Baxter J., A model of inductive bias learning, J. Artif. Intell. Res., 2000, 12, 149–198.

[9] Schwaighofer A., Tresp V., Yu K., Learning Gaussian process kernels via hierarchical Bayes, Neural Inf. Process. Syst., 2004, 1209–1216.

[10] Yu K., Tresp V., Schwaighofer A., Learning Gaussian processes from multiple tasks, Proceedings of the 22nd International Conference on Machine Learning, New York, USA, 2005, 1012–1019.

[11] Zhang J., Ghahramani Z., Yang Y., Learning multiple related tasks using latent independent component analysis, Neural Inf. Process. Syst., 2005, 1585–1592.

[12] Evgeniou T., Pontil M., Regularized multi-task learning, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA, 2004, 109–117.

[13] Liu J., Ji S., Ye J., Multi-task feature learning via efficient l2,1-norm minimization, Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 2009, 339–348.

[14] Nie F., Huang H., Cai X., Ding C.H., Efficient and robust feature selection via joint l2,1-norms minimization, Neural Inf. Process. Syst., 2010, 1813–1821.

[15] Argyriou A., Evgeniou T., Pontil M., Convex multi-task feature learning, Mach. Learn., 2008, 73, 243–272.

[16] Toh K.C., Yun S., An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems, Pac. J. Optim., 2010, 6(3), 615–640.

[17] Pong T.K., Tseng P., Ji S.W., Ye J.P., Trace norm regularization: reformulations, algorithms, and multi-task learning, SIAM J. Optim., 2010, 20, 3465–3489.

[18] Recht B., Fazel M., Parrilo P.A., Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Rev., 2010, 52, 471–501.

[19] Jacobs R.A., Jordan M.I., Nowlan S.J., Hinton G.E., Adaptive mixtures of local experts, Neural Comput., 1991, 3, 79–87.

[20] Yuksel S.E., Wilson J.N., Gader P.D., Twenty years of mixture of experts, IEEE Trans. Neural Netw. Learn. Syst., 2012, 23, 1177–1193.

[21] Bo L., Sminchisescu C., Kanaujia A., Metaxas D., Fast algorithms for large scale conditional 3D prediction, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, USA, 2008, 1–8.

[22] Peralta B., Soto A., Embedded local feature selection within mixture of experts, Inf. Sci., 2014, 269, 176–187.

[23] Khalili A., New estimation and feature selection methods in mixture-of-experts models, Can. J. Stat., 2010, 38, 519–539.

[24] Amit Y., Fink M., Srebro N., Ullman S., Uncovering shared structures in multiclass classification, Proceedings of the 24th International Conference on Machine Learning, New York, USA, 2007, 17–24.

[25] Srebro N., Rennie J.D.M., Jaakkola T.S., Maximum-margin matrix factorization, Neural Inf. Process. Syst., 2005, 1329–1336.

[26] Fazel M., Hindi H., Boyd S.P., A rank minimization heuristic with application to minimum order system approximation, Proceedings of the 2001 American Control Conference, Arlington, USA, 2001, 4734–4739.

[27] Yang J., Yuan X., Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization, Math. Comput., 2013, 82, 301–329.

Accepted: 2017-09-17

Published Online: 2017-11-10

Citation Information: Open Physics, Volume 15, Issue 1, Pages 674–681, ISSN (Online) 2391-5471.
