A method is presented to modify and extend kernel-based ELM to tackle MIL problems. Throughout this section, the binary classification setting is considered first, where *t*_{i}∈{+1, −1}; the multiclass setting is then briefly discussed.

In this section, we discuss the so-called MI-ELM in detail. In the ELM learning algorithm with RBF kernels, the centers and impact widths of the RBF kernels are randomly generated. In the same way as ELM, we randomly choose the centers and impact widths. Unlike the usual RBF function or kernel-based ELM, a training sample in the MIL setting is a bag that contains multiple instances. Therefore, the input of the MI-ELM neural network corresponds to a bag composed of several instances (vectors), rather than the single vector of a traditional neural network. Another main difference between the MI-ELM and standard neural networks is that, for the MI-ELM, each node in the hidden layer corresponds to a bag, whereas for the standard RBF ELM network it corresponds to a vector. The standard Euclidean distance therefore cannot be used here; instead, the Hausdorff distance is applied to measure the distance between bags and centers.

Assume that the training set contains *M* bags and that the *i*-th bag is composed of *N*_{i} instances, all belonging to a *p*-dimensional space; for example, the *j*-th instance in the *i*-th bag is [*B*_{ij1}, *B*_{ij2}, …, *B*_{ijp}]. Each bag is attached to a label *Y*_{i}: if the bag is positive, then *Y*_{i}=1; otherwise, *Y*_{i}=−1.

Based on this, the RBF kernel can be modified. Set the Gaussian function as

$$G(B, C, \sigma) = \exp\left(-\frac{H^2(B, C)}{2\sigma}\right),$$(11)

where *B* is an input bag (modeled as a matrix) and *C* is a hidden layer center (also modeled as a matrix). The parameter *σ* is the standard deviation of the basis function *G*, controlling its smoothness. *H*(*B*, *C*) is the Hausdorff distance between bag *B* and center *C*:

$$\mathrm{maxH}(B, C) = \max\{h(B, C),\ h(C, B)\},$$(12)

where

$$h(B, C) = \max_{b \in B}\,\min_{c \in C} \|b - c\|,$$(13)

or

$$\mathrm{minH}(B, C) = \max\{h(B, C),\ h(C, B)\},$$(14)

where

$$h(B, C) = \min_{b \in B}\,\min_{c \in C} \|b - c\|.$$(15)
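Both bag-level distances above reduce to operations on the pairwise instance distance matrix. The following is a minimal NumPy sketch; the function names are ours, not from the original text:

```python
import numpy as np

def pairwise_dists(B, C):
    # Euclidean distance between every instance in bag B and every instance in bag C
    return np.linalg.norm(B[:, None, :] - C[None, :, :], axis=2)

def max_hausdorff(B, C):
    # maxH(B, C) = max{h(B, C), h(C, B)} with h the max-min distance [Eqs. (12)-(13)]
    D = pairwise_dists(B, C)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def min_hausdorff(B, C):
    # the min-min variant of Eq. (15); it is symmetric in B and C,
    # so the max in Eq. (14) reduces to this single value
    return pairwise_dists(B, C).min()
```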

The actual output of an RBF network with *K* kernels for an input bag *B*_{n} is given by

$$y(B_n) = \sum_{i=1}^{K} \beta_i \varphi_i(B_n),$$(16)

where **β**_{i}=[*β*_{i1}, *β*_{i2}, …, *β*_{im}]^{T} is the weight vector connecting the *i*-th kernel and the output neurons, and *φ*_{i}(*B*_{n}) is the output of the *i*-th kernel:

$$\varphi_i(B_n) = \exp\left(-\frac{H^2(B_n, C_i)}{2\sigma_i}\right),\quad 1 \le i \le K,$$(17)

where *K* is the number of neurons in the hidden layer.
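Eq. (17) can be evaluated for all bags and all hidden nodes at once. The sketch below is ours: it assumes the Hausdorff-based Gaussian above and a vector of per-node widths, and stacks the activations into an *N*×*K* matrix:

```python
import numpy as np

def _hausdorff(B, C):
    # maxH of Eqs. (12)-(13), from the pairwise instance distance matrix
    D = np.linalg.norm(B[:, None, :] - C[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def hidden_activations(bags, centers, sigmas):
    # phi_i(B_n) = exp(-H^2(B_n, C_i) / (2 sigma_i))   [Eq. (17)]
    H = np.array([[_hausdorff(B, C) for C in centers] for B in bags])
    return np.exp(-H ** 2 / (2.0 * sigmas))
```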

One important problem is how to determine the hidden layer centers *C*_{i} and impact widths *σ*_{i}. Instead of spending a long time searching for proper centers and impact widths, we can simply choose these parameters randomly, as RBF-based ELM does. Because our training samples are bags, each hidden layer center must itself be a bag, whereas a traditional RBF kernel’s center is a single value or vector. According to ELM theory, the hidden layer centers *C*_{i} and impact widths *σ*_{i} can simply be initialized at random. After that, the procedure is very similar to the one used to train a typical ELM: the hidden layer activations are computed, and the output weights are obtained analytically. Concretely, by minimizing the sum of squared errors and maximizing the margin, the output weight of an MI-ELM neural network is optimized:

$$\begin{array}{cc}\text{Minimize:} & L_{P_{MIELM}} = \frac{1}{2}\|\beta\|^2 + \lambda\frac{1}{2}\sum_{n=1}^{N}\|\xi_n\|^2 \\ \text{Subject to:} & h(B_n)\beta = y_n^T - \xi_n^T,\quad n = 1, \dots, N,\end{array}$$(18)

where *λ* is a predefined parameter that trades the training error off against the norm of the output weights, in order to obtain better generalization performance. Recall that ${\xi}_{n} = {y}_{n}^{T} - h(B_n)\beta$ is the training error of sample (bag) *B*_{n}, caused by the difference between the desired output *y*_{n} and the actual output *h*(*B*_{n})**β**. Here *h*(*B*_{n}) is the RBF mapping vector with respect to *B*_{n}, and **β** is the output weight vector connecting the hidden layer and the output neurons.

Based on the Karush-Kuhn-Tucker (KKT) theorem, the dual optimization problem with respect to Eq. (18) is

$$L_{D_{MIELM}} = \frac{1}{2}\|\beta\|^2 + \lambda\frac{1}{2}\sum_{n=1}^{N}\xi_n^2 - \sum_{n=1}^{N}\lambda_n\left(h(B_n)\beta - y_n + \xi_n\right),$$(19)

where *λ*_{n} is the Lagrange multiplier of sample *B*_{n}. The KKT conditions that the minimizer of Eq. (19) has to satisfy are

$$\begin{array}{l}\dfrac{\partial L_{D_{MIELM}}}{\partial \beta} = 0 \\ \dfrac{\partial L_{D_{MIELM}}}{\partial \xi_n} = 0 \\ \dfrac{\partial L_{D_{MIELM}}}{\partial \lambda_n} = 0.\end{array}$$(20)

Combining Eqs. (19) and (20) results in

$$\begin{array}{l}\beta = \displaystyle\sum_{n=1}^{N}\lambda_n h(B_n)^T \\ \lambda_n = \lambda\xi_n \\ h(B_n)\beta - y_n + \xi_n = 0,\quad n = 1, \dots, N.\end{array}$$(21)

The solution for **β** can be derived from Eq. (21) using the generalized Moore–Penrose inverse. Note that two forms of the pseudo-inverse are available: the left pseudo-inverse and the right pseudo-inverse:

$$\begin{array}{l}\text{When } N < L\text{:}\quad \beta = H^{\dagger}Y = H^T\left(\frac{I}{\lambda} + HH^T\right)^{-1}Y \\ \text{When } N > L\text{:}\quad \beta = H^{\dagger}Y = \left(\frac{I}{\lambda} + H^TH\right)^{-1}H^TY.\end{array}$$(22)

Notice that the right pseudo-inverse is recommended when the training data set is small, and the left pseudo-inverse is more efficient otherwise, because they require inverting an *N*×*N* matrix and an *L*×*L* matrix, respectively.
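The size-dependent choice in Eq. (22) is straightforward to implement. A sketch (our function name; `np.linalg.solve` is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

def solve_beta(H, Y, lam):
    # beta = H† Y per Eq. (22): invert an N x N or an L x L matrix,
    # whichever is smaller (N = number of bags, L = number of hidden nodes)
    N, L = H.shape
    if N < L:  # right pseudo-inverse
        return H.T @ np.linalg.solve(np.eye(N) / lam + H @ H.T, Y)
    return np.linalg.solve(np.eye(L) / lam + H.T @ H, H.T @ Y)  # left
```

With the ridge term present, the two forms are algebraically identical (by the push-through identity); only their computational cost differs.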

Given a new sample *B*, the output function of the MI-ELM is obtained from *f*(*x*)=sign(*h*(*x*)**β**):

$$\begin{array}{l}f(x)_{N \times N} = \text{sign}\left(h(x)H^T\left(\frac{I}{\lambda} + HH^T\right)^{-1}Y\right) \\ f(x)_{L \times L} = \text{sign}\left(h(x)\left(\frac{I}{\lambda} + H^TH\right)^{-1}H^TY\right).\end{array}$$(23)

If the feature mapping *h*(*x*) is unknown to users, one can apply Mercer’s conditions to ELM. We can define an RBF kernel [Eq. (11)] matrix for MI-ELM as follows:

$$\Omega_{\text{ELM}} = HH^T:\quad \Omega_{\text{ELM}\,i,j} = h(B_i)h(B_j)^T = K(B_i, B_j).$$(24)

Inspired by the work of Ref. [4] and the definition of a kernel, the output function in terms of the kernel is naturally derived from the *N*×*N* version. The output function of the MI-ELM classifier can then be written compactly as

$$\begin{array}{c}f(x) = h(x)H^T\left(\frac{I}{\lambda} + HH^T\right)^{-1}Y \\ = \left[\begin{array}{c}K(B, B_1) \\ \vdots \\ K(B, B_N)\end{array}\right]^T\left(\frac{I}{\lambda} + \Omega_{\text{ELM}}\right)^{-1}Y.\end{array}$$(25)
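Eq. (25) needs no explicit feature mapping: only the kernel values are required. The sketch below instantiates *K*(*B*_{i}, *B*_{j}) as the Hausdorff-based Gaussian of Eq. (11); the shared width `sigma` and the function names are our assumptions:

```python
import numpy as np

def _hausdorff(B, C):
    # maxH of Eqs. (12)-(13)
    D = np.linalg.norm(B[:, None, :] - C[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def kernel_predict(train_bags, Y, test_bag, lam=100.0, sigma=1.0):
    # f(B) = [K(B,B_1) ... K(B,B_N)] (I/lambda + Omega_ELM)^(-1) Y   [Eq. (25)]
    k = lambda A, B: np.exp(-_hausdorff(A, B) ** 2 / (2.0 * sigma))
    Omega = np.array([[k(a, b) for b in train_bags] for a in train_bags])
    kx = np.array([k(test_bag, b) for b in train_bags])
    return kx @ np.linalg.solve(np.eye(len(train_bags)) / lam + Omega, Y)
```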

Assuming that the training data set is {*B*_{i}, *t*_{i}|*i*=1, …, *M*}, that bag *B*_{i} contains *N*_{i} instances $\mathrm{\{}{B}_{i1}\mathrm{,}\text{\hspace{0.17em}}{B}_{i2}\mathrm{,}\text{\hspace{0.17em}}\dots \mathrm{,}\text{\hspace{0.17em}}{B}_{i{N}_{i}}\mathrm{\}},$ and that the number of hidden node output functions *G*(*a*, *b*, *B*_{i}) is *K*, we can now summarize the MI-ELM algorithm.

A big advantage over other methods (e.g. MI-SVM) is that the MI-ELM can easily be extended to deal with multiclass classification tasks. We can naturally make use of an SLFN with multiple output nodes. A one-against-all method is adopted to transform a multiclass classification application into multiple binary classifiers and to turn the discrete classification problem into a continuous output function regression problem. For a test sample, the predicted class label is the index of the output node with the largest output value. Suppose a set of multiclass training data ${\mathrm{\{}{B}_{i}\mathrm{,}\text{\hspace{0.17em}}{t}_{i}\mathrm{\}}}_{i\text{\hspace{0.17em}}=\text{\hspace{0.17em}}1}^{N}$ with *M* labels, where *t*_{i}∈{1, …, *M*}. For example, if *B*_{i} belongs to class one, the corresponding label vector is encoded as the *M*-dimensional vector *t*_{i}=[1, −1, −1, …, −1]. Additionally, the actual output for a sample *B*_{i} is *f*(*B*_{i})=[*f*_{1}(*B*_{i}), …, *f*_{M}(*B*_{i})]. Therefore, the predicted label of *B*_{i} can be inferred as follows:

$$\text{label}(B_i) = \underset{j \in \{1, \dots, M\}}{\arg\max}\ f_j(B_i).$$
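The one-against-all encoding and the argmax decision rule are simple to state in code. A sketch with hypothetical helper names:

```python
import numpy as np

def encode_label(t, M):
    # one-against-all coding: class t -> M-dim target with +1 at position t, -1 elsewhere
    v = -np.ones(M)
    v[t] = 1.0
    return v

def predict_label(f_out):
    # label(B_i) = arg max_j f_j(B_i)
    return int(np.argmax(f_out))
```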

Algorithm 1: MI-ELM algorithm.
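The original algorithm listing is not reproduced here; the steps described above can be collected into an end-to-end sketch under our assumptions (centers sampled from the training bags, widths drawn from an assumed range, all helper names ours):

```python
import numpy as np

def _hausdorff(B, C):
    # maxH of Eqs. (12)-(13)
    D = np.linalg.norm(B[:, None, :] - C[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def mi_elm_train(bags, y, K, lam=1000.0, seed=0):
    # Step 1: randomly choose K bag centers and impact widths (ELM-style)
    rng = np.random.default_rng(seed)
    centers = [bags[i] for i in rng.choice(len(bags), K, replace=len(bags) < K)]
    sigmas = rng.uniform(0.5, 2.0, K)  # assumed width range
    # Step 2: hidden layer matrix H via Eq. (17)
    H = np.exp(-np.array([[_hausdorff(B, C) for C in centers]
                          for B in bags]) ** 2 / (2.0 * sigmas))
    # Step 3: closed-form output weights via Eq. (22)
    N, L = H.shape
    if N < L:
        beta = H.T @ np.linalg.solve(np.eye(N) / lam + H @ H.T, y)
    else:
        beta = np.linalg.solve(np.eye(L) / lam + H.T @ H, H.T @ y)
    return centers, sigmas, beta

def mi_elm_predict(bag, centers, sigmas, beta):
    # f(B) = sign(h(B) beta)   [Eq. (23)]
    h = np.exp(-np.array([_hausdorff(bag, C) for C in centers]) ** 2
               / (2.0 * sigmas))
    return np.sign(h @ beta)
```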
