Feature extraction is a commonly used approach in machine learning to improve the accuracy of a model when the training samples are insufficient. Some feature extraction methods capture the underlying characteristics of the data by estimating a matrix that projects the original input feature into a low-dimensional subspace. Formally, the extracted input feature can be expressed as *F*^{T}x, where *F* ∈ *R*^{P × S}, *S* ≤ *P*. A regression or classification model is then built on the extracted feature *F*^{T}x. Combining feature extraction with the MOE model, and using the projection matrices *F*_{G} and *F*_{E} for the gating functions and the expert models respectively, we obtain
$$\pi_k(x)=P(z_k=1\mid x,r_k)=P(z_k=1\mid F_G^{T}x,g_k)$$(13)
$$h(y,w_k^{T}x)=P(y\mid x,w_k)=P(y\mid F_E^{T}x,h_k)$$(14)

where *g*_{k} ∈ *R*^{S} and *h*_{k} ∈ *R*^{S}, *k* = 1, …, *K*. The model coefficients *w*_{k} and *r*_{k} are substituted by *w*_{k} = *F*_{E}*h*_{k} and *r*_{k} = *F*_{G}*g*_{k}. Let *W* = (*w*_{1}, …, *w*_{K}) ∈ *R*^{P × K} and *R* = (*r*_{1}, …, *r*_{K}) ∈ *R*^{P × K}. When the projection matrices *F*_{G} and *F*_{E} are known, we can use the new inputs
$F_G^{T}x$ and $F_E^{T}x$ to train a new MOE model. In this paper, we attempt to learn the projection matrices and the MOE model built on the extracted features simultaneously. Following [15, 24], we use trace norm regularization for simultaneous feature extraction and model learning. As the predictions depend on *F*_{E}*H* and *F*_{G}*G*, we can add Frobenius norm regularizers to the above EM algorithm to control the magnitudes of *F*_{E}, *F*_{G}, *H* and *G*. Adding the regularization terms does not change the posterior distribution of the latent variable *z*, so the E-step is unchanged. In the M-step, the optimization problem is reformulated as:
$$\max_{F_E,F_G,H,G} L(F_E H,F_G G,X,Y)-C_G\left(\frac{1}{2}\|F_G\|_F^2+\frac{1}{2}\|G\|_F^2\right)-C_E\left(\frac{1}{2}\|F_E\|_F^2+\frac{1}{2}\|H\|_F^2\right)$$(15)

where *C*_{G} > 0, *C*_{E} > 0 are the regularization parameters. The optimization problem (15) is nonconvex. However, following Ref. [24], the non-convex optimization problem can be converted into a trace norm regularization problem.
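For concreteness, extracting features with a projection matrix is a single matrix product; the following minimal NumPy sketch (all dimensions and data are made up) shows the mapping $x \mapsto F^{T}x$ applied to a batch of inputs:

```python
import numpy as np

# Made-up dimensions: P original features, S extracted features, N samples.
rng = np.random.default_rng(0)
P, S, N = 10, 3, 5

F = rng.standard_normal((P, S))   # projection matrix F in R^{P x S}
X = rng.standard_normal((N, P))   # inputs, one P-dimensional sample per row

# Extracted features F^T x for every sample at once:
X_low = X @ F                     # shape (N, S)
```

In the model above, two such projections are used: $F_G^T x$ feeds the gating functions and $F_E^T x$ feeds the experts.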

The trace norm ∥⋅∥_{tr} of a matrix is defined as the sum of its singular values:
$$\|W\|_{tr}=\sum_i \gamma_i$$(16)

where *γ*_{i} is the *i*-th singular value of *W*.
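Equation (16) translates directly into code; a minimal NumPy sketch:

```python
import numpy as np

def trace_norm(W):
    """Trace (nuclear) norm of W: the sum of its singular values, eq. (16)."""
    return np.linalg.svd(W, compute_uv=False).sum()

# For a diagonal matrix the singular values are the absolute diagonal entries:
W = np.diag([3.0, 2.0, 1.0])
print(trace_norm(W))  # 6.0
```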

According to [25], the trace norm has the following property:
$$\|W\|_{tr}=\min_{FG=W}\frac{1}{2}\left(\|F\|_F^2+\|G\|_F^2\right)$$(17)
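Property (17) can be checked numerically: the minimum is attained by splitting the SVD $W=U\Sigma V^{T}$ as $F=U\Sigma^{1/2}$ and $G=\Sigma^{1/2}V^{T}$, since then $\|F\|_F^2=\|G\|_F^2=\mathrm{tr}(\Sigma)$. A small NumPy sketch (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4))

# Factorization attaining the minimum in eq. (17):
U, s, Vt = np.linalg.svd(W, full_matrices=False)
F = U * np.sqrt(s)               # F = U Sigma^{1/2}
G = np.sqrt(s)[:, None] * Vt     # G = Sigma^{1/2} V^T

lhs = s.sum()                               # ||W||_tr
rhs = 0.5 * ((F**2).sum() + (G**2).sum())   # (1/2)(||F||_F^2 + ||G||_F^2)
# F @ G reconstructs W, and lhs equals rhs up to floating point error.
```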

The problem (15) can be rewritten as:
$$\max_{W,R,C,D}\sum_{n=1}^{N}\sum_{i=1}^{K}\alpha_{ni}\left[\log P(y_n\mid x_n,z_{ni}=1,w_i)+\log P(z_{ni}=1\mid x_n,r_i)\right]-C_E\|W\|_{tr}-C_G\|R\|_{tr}$$(18)

The trace norm is the convex envelope of the matrix rank [26]. Therefore, trace norm regularization is often used in multi-task learning and matrix completion to obtain low-rank solutions. The idea of using the trace norm to extract features for the MOE model comes from multi-task learning, where the trace norm regularizer is often used to learn a small set of features shared across tasks. The MOE model divides the data into multiple regions, and the data in each region may be insufficient to train the local expert model. Since multi-task learning can improve generalization when only limited training data are available for each task, we can use it to improve the performance of the expert models and the gating functions in the MOE model. Consequently, with the aid of the MOE model, we can apply multi-task learning techniques to a single-task learning problem.

The optimization problem (18) can be divided into two independent trace norm regularization problems:
$$\min_{W,C}\;-\sum_{n=1}^{N}\sum_{i=1}^{K}\alpha_{ni}\log\left[h(y_n,w_i^{T}x_n)\right]+C_E\|W\|_{tr}$$(19)

and
$$\min_{R,D}\;-\sum_{n=1}^{N}\sum_{i=1}^{K}\alpha_{ni}\log\left(\frac{\exp(r_i^{T}x_n+d_i)}{\sum_{j=1}^{K}\exp(r_j^{T}x_n+d_j)}\right)+C_G\|R\|_{tr}$$(20)

When $h(y_n,w_i^{T}x_n)$ is log-concave in *w*_{i}, the optimization with trace norm regularization is a convex, but non-smooth, optimization problem. It can be formulated as a semi-definite program (SDP) and solved by existing SDP solvers such as SDPT3 [26]. Recently, many efficient algorithms, such as block coordinate descent [15], the accelerated proximal gradient method (APG) [16] and ADMM [27], have been developed for trace norm minimization. In this paper, we use the APG algorithm to solve the trace norm regularized problems because of its fast convergence rate.
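The key step of APG for a trace norm regularized problem is the proximal operator of the trace norm, known as singular value thresholding. A minimal NumPy sketch of this prox step (the loss gradient and step-size bookkeeping of full APG are omitted):

```python
import numpy as np

def svt(W, tau):
    """Singular value thresholding: the proximal operator of tau * ||.||_tr.
    Shrinks each singular value of W by tau and zeroes the small ones,
    which is what drives the iterates toward low rank."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt

W = np.diag([3.0, 1.0, 0.2])
W_low = svt(W, 0.5)
# singular values shrink from [3, 1, 0.2] to [2.5, 0.5, 0]: rank 3 -> 2
```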

In summary, the MOE model with trace norm regularization can be trained by iterating the following two steps:

E-step: evaluate the outputs of the gating functions and the expert models using the current parameters, and then evaluate the posterior probabilities of the latent variables using eq. (11).

M-step: update the values of the parameters *w*_{i} and *r*_{i} by solving the two trace norm minimization problems (19) and (20).
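The E-step can be sketched numerically. The following assumes Gaussian experts with a shared noise variance and softmax gating; these distributional details, and all dimensions, are illustrative choices not fixed by the text:

```python
import numpy as np

def e_step(X, y, W, R, d, sigma2=1.0):
    """Posterior responsibilities alpha_{ni} for a regression MOE.
    Assumes Gaussian experts with common variance sigma2 and softmax gating."""
    # Gating probabilities pi_i(x_n) via a numerically stable softmax.
    logits = X @ R + d                          # shape (N, K)
    logits -= logits.max(axis=1, keepdims=True)
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)

    # Expert likelihoods P(y_n | x_n, w_i) under a Gaussian noise model.
    resid = y[:, None] - X @ W                  # shape (N, K)
    lik = np.exp(-0.5 * resid**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

    # Posterior over experts: proportional to gate * likelihood.
    alpha = pi * lik
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha

rng = np.random.default_rng(2)
N, P, K = 8, 4, 3
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)
alpha = e_step(X, y, rng.standard_normal((P, K)),
               rng.standard_normal((P, K)), np.zeros(K))
# each row of alpha sums to 1: a distribution over the K experts
```

The M-step would then feed these responsibilities into problems (19) and (20) as the weights $\alpha_{ni}$.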
