
# Journal of Intelligent Systems

Editor-in-Chief: Fleyeh, Hasan

CiteScore 2018: 1.03

SCImago Journal Rank (SJR) 2018: 0.188
Source Normalized Impact per Paper (SNIP) 2018: 0.533

Open Access | ISSN (Online) 2191-026X

Volume 26, Issue 1

# Multiple-Instance Learning via an RBF Kernel-Based Extreme Learning Machine

Jie Wang, Liangjian Cai and Xin Zhao
Published Online: 2016-03-17 | DOI: https://doi.org/10.1515/jisys-2015-0011

## Abstract

As we are usually confronted with a large instance space in real-world data sets, it is important to develop a useful and efficient multiple-instance learning (MIL) algorithm. MIL, where training data are prepared in the form of labeled bags rather than labeled instances, is a variant of supervised learning. This paper presents a novel MIL algorithm based on the extreme learning machine, called MI-ELM. A radial basis kernel extreme learning machine is adapted to the MIL problem by using the Hausdorff distance to measure the distance between bags. The clusters in the hidden layer are composed of randomly generated bags. Because the parameters of the hidden layer do not need to be tuned, MI-ELM can learn very fast. Experimental results on classification and multiple-instance regression data sets demonstrate that MI-ELM is useful and efficient compared with state-of-the-art algorithms.

## 1 Introduction

Multiple-instance learning (MIL) was developed by Dietterich et al. [13] to solve the problem of musk drug activity prediction. A number of problems are regarded as multiple-instance ones, such as drug activity prediction, spam filtering [21], human action recognition [1], computer-aided diagnosis [14], visual tracking [26], [27], [29], image categorization [10], [11], and text categorization [3], [22], [32]. In MIL, each example is called a bag and consists of many instances (feature vectors). Some of the instances could be responsible for the observed classification of the example, but the label is attached only to the bag, not to its instances. The key assumption in MIL is that a bag is classified as positive if at least one of its instances is positive; otherwise, the bag is labeled negative. Although conventional supervised classification algorithms can be applied to a multiple-instance problem, their performance is often unsatisfactory.

A large number of MIL methods have emerged in recent years, many of which are modifications of algorithms for supervised learning. Dietterich et al. created variant algorithms whose concepts are expressed by axis-parallel rectangles (APRs). Maron and Lozano-Perez proposed diverse density [24], which measures the intersection of the positive bags minus the union of the negative bags. Andrews et al. [3] presented two methods that extend support vector machines (SVMs) to tackle MIL problems. Wang and Zucker [25] trained two variants of a nearest-neighbor algorithm using the Hausdorff distance: Bayesian-kNN and Citation-kNN. Zhou and Zhang [31] modified and extended the neural network for MIL by employing a specific error function. Leistner et al. [23] derived MIForests for MIL from the random forest, where the hidden class labels inside target bags are treated as random variables. However, it commonly takes several minutes, several hours, or even days to train most MIL algorithms.

The extreme learning machine (ELM) is a new kind of single hidden layer feed-forward neural network (SLFN). Compared with conventional neural network learning algorithms, it partially overcomes the drawbacks of slow training speed and overfitting [8], [18], [20]. We believe that MIL problems can be addressed effectively by using ELM. Since ELM was first developed for SLFNs by Huang et al. [18], it has been widely used in many fields [6], [7], [8], [9]. In the SLFN architecture, the input weights are chosen randomly and the output weights are determined by a linear solution. Huang and Siew [17] proved that ELM can be extended with radial basis function (RBF) kernels in which the centers and impact widths are generated randomly.

This paper focuses on adapting ELM to multiple-instance problems by using an RBF kernel. Because a bag consists of multiple instances, it cannot be denoted by a single point in the feature space, so the distance between bags cannot be measured by the standard Euclidean distance. Fortunately, the Hausdorff distance gives us the ability to measure such a distance. Using the Hausdorff distance, a modified kernel-based ELM for multiple-instance problems is proposed. Briefly, instead of adjusting the centers and impact widths of the RBF kernels, the centers and impact widths are randomly generated, and the output weights are calculated analytically. Experimental results on benchmark data sets, text categorization, image categorization, and a multiple-instance regression task demonstrate that MI-ELM is effective and reliable.

The rest of this paper is organized as follows. In Section 2, we briefly discuss ELM and Hausdorff distance. In Section 3, we present the new ELM model for MIL. In Section 4, we present the experimental results on different MIL problems. In Section 5, we conclude the main idea of the method and discuss possible future work.

## 2.1 Extreme Learning Machine

ELM is a learning theory for generalized SLFNs in which the hidden layer does not need to be tuned. ELM achieves high generalization performance with much faster learning speed than conventional methods.

The main difference between ELM and other neural network learning methods is that the input weights in ELM are randomly generated and the output weights are obtained by analytical calculation. Concretely, the hidden layer output (with L nodes) is denoted as a row vector $h(x) = [h_1(x), \dots, h_L(x)]$, where x is the input sample. Given N training samples $\{(x_i, y_i)\}_{i=1}^{N}$, the model of the SLFN can be written as

$$f_L(x) = \sum_{j=1}^{L} \beta_j G(a_j, b_j, x), \tag{1}$$

where βj is the weight connecting the j-th hidden node to the output node; aj is the weight vector connecting the j-th hidden neuron to the input neurons; and bj is the bias of the j-th hidden neuron. G(aj, bj, x) is the output of the j-th hidden node. The ELM theory [16], [19] claims that the learning parameters aj and bj can be randomly generated. Two popular hidden layer functions are listed below:

1. Sigmoid function

$$G(a, b, x) = \frac{1}{1 + \exp(-(a \cdot x + b))}, \tag{2}$$

where a is the weight parameter connecting the input layer to the hidden node and b is the bias of the hidden node.

2. Gaussian function

$$G(a, b, x) = \exp(-b\,\|x - a\|^2), \tag{3}$$

where a and b are the center and impact width of the RBF node, respectively. For simplicity, an equivalent compact form of Eq. (1) can be written as

$$H\beta = O, \tag{4}$$

where H is the N × L hidden layer output matrix with $H_{ij} = G(a_j, b_j, x_i)$ denoting the entry in the i-th row and j-th column. Besides, $\beta = [\beta_1, \dots, \beta_L]^T$ and $O = [o_1, \dots, o_N]^T$.

Recall that the ELM theories claim that the input learning parameters $a_j$ and $b_j$ can be randomly generated a priori without considering the training data. With the goal of minimizing the network cost function $\|O - Y\|$ in mind, where $Y = [y_1, \dots, y_N]^T$ is the target output matrix, Eq. (1) becomes a linear model. The output weights can be analytically determined by finding a least-squares solution of this linear system as below:

$$\beta = H^{\dagger} Y, \tag{5}$$

where $H^{\dagger}$ is the Moore–Penrose generalized inverse of the hidden layer output matrix H.
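The procedure above, random hidden parameters followed by an analytic least-squares solve, can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the helper names `elm_train` and `elm_predict` are hypothetical.

```python
import numpy as np

def elm_train(X, Y, L=50, seed=0):
    # X: (N, d) inputs, Y: (N, m) targets; L hidden nodes with sigmoid activation
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A = rng.standard_normal((d, L))            # random input weights a_j
    b = rng.standard_normal(L)                 # random biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))     # hidden layer output matrix (Eq. 4)
    beta = np.linalg.pinv(H) @ Y               # beta = H^dagger Y (Eq. 5)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```

With L at least as large as the number of training samples, the least-squares solution generically interpolates the training targets, which is why ELM can fit small data sets exactly despite the random hidden layer.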

## 2.2 Hausdorff Distance

The Hausdorff distance has proved to be very useful in MIL. The classical Citation-kNN and Bayesian-kNN [25] were the first to use the Hausdorff distance to measure the distance between bags. Usually, every sample is denoted as a feature vector (point), and for two points a and b, their distance can be calculated as

$$\mathrm{Dist}(a, b) = \|a - b\|. \tag{6}$$

Equation (6) is often called the standard Euclidean distance, which is suitable for defining the distance between instances; however, if a sample (bag) contains many instances, Eq. (6) must be extended to measure the distance between bags. The Hausdorff distance gives us the ability to measure such a distance.

Concretely, given two sets of objects A = {a1, a2, …, am} and B = {b1, b2, …, bn}, the Hausdorff distance is defined as

$$H(A, B) = \max\{h(A, B), h(B, A)\}, \tag{7}$$

where

$$h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|, \tag{8}$$

and for one-dimensional points, $\|a - b\|$ is computed as $|a - b|$. For example, let A = {5, 8, 10} and B = {1, 4, 6}, where A and B contain several one-dimensional points. Then h(A, B) = max{|5 − 4|, |8 − 6|, |10 − 6|} = 4 and h(B, A) = max{|1 − 5|, |4 − 5|, |6 − 5|} = 4. Thus, H(A, B) = max{h(A, B), h(B, A)} = 4.
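The worked example above can be checked with a short sketch of Eqs. (7)–(8) for one-dimensional point sets; the function names are ours, not from the paper.

```python
def h(A, B):
    # directed Hausdorff distance (Eq. 8): worst-case nearest-neighbour
    # gap from A to B, with |a - b| as the ground distance
    return max(min(abs(a - b) for b in B) for a in A)

def hausdorff(A, B):
    # symmetric Hausdorff distance (Eq. 7)
    return max(h(A, B), h(B, A))

print(hausdorff([5, 8, 10], [1, 4, 6]))  # → 4, matching the example
```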

It can be seen from Eq. (8) that even a single outlying point can drastically change the Hausdorff distance. For instance, let A = {1, 2, 3} and B = {4, 5, 6}; then H(A, B) = max{h(A, B), h(B, A)} = 3. When B = {18, 5, 6}, however, H(A, B) = |18 − 3| = 15, where 18 is a single outlying point of B.

To improve the robustness of this distance to noise, the minimal Hausdorff distance, another variant of the Hausdorff distance, is also used in this paper. The minimal Hausdorff distance is defined as

$$\mathrm{minH}(A, B) = \max\{h(A, B), h(B, A)\}, \tag{9}$$

where

$$h(A, B) = \min_{a \in A} \min_{b \in B} \|a - b\|. \tag{10}$$

Figure 1 illustrates the two Hausdorff distances. Note that both the maximal and the minimal Hausdorff distances are tested in our experiments.

Figure 1:

Minimal and Maximal Hausdorff Distance.
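As a sketch (with a hypothetical helper name), the minimal Hausdorff distance of Eqs. (9)–(10) reduces to the single closest pair across the two sets, since both directed terms in Eq. (9) collapse to the same min–min value:

```python
def min_hausdorff(A, B):
    # minimal Hausdorff distance (Eqs. 9-10) for one-dimensional point sets:
    # the closest pair across the two sets, symmetric by construction
    return min(abs(a - b) for a in A for b in B)
```

On the outlier example above, `min_hausdorff([1, 2, 3], [18, 5, 6])` returns 2 (the pair 3 and 5), unaffected by the outlying point 18.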

## 3 MI-ELM

In this section, a method is presented that modifies and extends the kernel-based ELM to tackle MIL problems. Throughout this part, the binary classification setting is considered first, where ti ∈ {+1, −1}; the multiclass classification setting is then briefly discussed.

We now discuss the proposed MI-ELM in detail. In the ELM learning algorithm with RBF kernels, the centers and impact widths of the RBF kernels are randomly generated. In the same way, we randomly choose the clusters and impact widths. Different from the usual RBF or kernel-based ELM, a training sample in the MIL setting is a bag that contains multiple instances. Therefore, the input of the MI-ELM neural network corresponds to a bag composed of several instances (vectors) rather than the single vector of a traditional neural network. Another main difference between MI-ELM and standard neural networks is that, for MI-ELM, each node in the hidden layer corresponds to a bag, whereas in the standard RBF ELM network it is a vector. The standard Euclidean distance cannot be used here. Instead, the Hausdorff distance is applied to measure the distance between bags and clusters.

Assume that the training set contains M bags and that the i-th bag is composed of $N_i$ instances, all belonging to a p-dimensional space; for example, the j-th instance in the i-th bag is $[B_{ij1}, B_{ij2}, \dots, B_{ijp}]$. Each bag is attached to a label $Y_i$. If the bag is positive, then $Y_i = 1$; otherwise, $Y_i = -1$.

Based on this, the RBF kernel can be modified. Set the Gaussian function as

$$G(B, C, \sigma) = \exp\left(-\frac{H^2(B, C)}{2\sigma}\right), \tag{11}$$

where B is an input bag (modeled as a matrix) and C is the hidden layer center (also modeled as a matrix). σ is the standard deviation of the basis function G, controlling its smoothness. H(B, C) is the Hausdorff distance between bag B and center C:

$$\mathrm{maxH}(B, C) = \max\{h(B, C), h(C, B)\}, \tag{12}$$

where

$$h(B, C) = \max_{b \in B} \min_{c \in C} \|b - c\|, \tag{13}$$

or

$$\mathrm{minH}(B, C) = \max\{h(B, C), h(C, B)\}, \tag{14}$$

where

$$h(B, C) = \min_{b \in B} \min_{c \in C} \|b - c\|. \tag{15}$$

The actual output of an RBF network with K kernels for an input bag (a set of vectors) is given by

$$y(B_n) = \sum_{i=1}^{K} \beta_i \phi_i(B_n), \tag{16}$$

where $\beta_i = [\beta_{i1}, \beta_{i2}, \dots, \beta_{im}]^T$ is the weight vector connecting the i-th kernel and the output neurons, and $\phi_i(B_n)$ is the output of the i-th kernel:

$$\phi_i(B_n) = \exp\left(-\frac{H^2(B_n, C_i)}{2\sigma_i}\right), \quad 1 \le i \le K, \tag{17}$$

where K is the number of neurons in the hidden layer.
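Under the assumption that each bag is an array of d-dimensional instance rows, the hidden layer of Eqs. (11) and (17) can be sketched as follows. The names are illustrative, and `bag_hausdorff` implements the maximal variant of Eqs. (12)–(13).

```python
import numpy as np

def bag_hausdorff(B, C):
    # maximal Hausdorff distance between bags B (m, d) and C (k, d):
    # pairwise ||b - c||, then max of the two directed distances (Eqs. 12-13)
    D = np.linalg.norm(B[:, None, :] - C[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def hidden_activations(bags, centers, sigmas):
    # H[n, i] = exp(-H^2(B_n, C_i) / (2 sigma_i)), per Eqs. (11) and (17);
    # centers are themselves bags drawn at random from the training set
    return np.array([[np.exp(-bag_hausdorff(B, C) ** 2 / (2.0 * s))
                      for C, s in zip(centers, sigmas)]
                     for B in bags])
```

A bag that coincides with a center yields Hausdorff distance 0 and hence the maximal activation 1, mirroring the standard RBF behaviour at a kernel center.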

One important problem is how to determine the hidden layer centers $C_i$ and impact widths $\sigma_i$. Instead of taking a long time to search for proper centers and impact widths, we can simply choose the values of these parameters randomly, as RBF-based ELM does. Considering that our training samples are bags, each hidden layer node center must itself be a bag, whereas a traditional RBF kernel's center is a single value or vector. According to the ELM theory, the hidden layer centers $C_i$ and impact widths $\sigma_i$ can simply be initialized randomly. After that, the procedure is very similar to the one adopted to train the typical ELM: the hidden layer activations can be obtained, and the output weights are obtained by analytical calculation. Concretely, by minimizing the sum of squared errors and maximizing the margin, the output weight of an MI-ELM neural network is optimized:

$$\begin{aligned} \text{Minimize:}\quad & L_P^{\mathrm{MIELM}} = \frac{1}{2}\|\beta\|^2 + \frac{\lambda}{2}\sum_{n=1}^{N}\|\xi_n\|^2 \\ \text{Subject to:}\quad & h(B_n)\beta = y_n^T - \xi_n^T, \quad n = 1, \dots, N, \end{aligned} \tag{18}$$

where λ is a predefined trade-off parameter chosen to obtain better generalization performance. Recall that $\xi_n = y_n^T - h(B_n)\beta$ is the training error of sample (bag) $B_n$, caused by the dissimilarity between the desired output $y_n$ and the actual output $h(B_n)\beta$. $h(B_n)$ is the RBF mapping vector with respect to $B_n$, and β stands for the output weight vector connecting the hidden layer and the output neurons.

Based on the Karush-Kuhn-Tucker (KKT) theorem, the dual optimization problem with respect to Eq. (18) is

$$L_D^{\mathrm{MIELM}} = \frac{1}{2}\|\beta\|^2 + \frac{\lambda}{2}\sum_{n=1}^{N}\xi_n^2 - \sum_{n=1}^{N}\lambda_n\left(h(B_n)\beta - y_n + \xi_n\right), \tag{19}$$

where λn is the Lagrange multiplier of sample Bn. The KKT conditions that the minimizer of Eq. (19) has to satisfy are

$$\frac{\partial L_D^{\mathrm{MIELM}}}{\partial \beta} = 0, \quad \frac{\partial L_D^{\mathrm{MIELM}}}{\partial \xi_n} = 0, \quad \frac{\partial L_D^{\mathrm{MIELM}}}{\partial \lambda_n} = 0. \tag{20}$$

Combining Eqs. (19) and (20) results in

$$\beta = \sum_{n=1}^{N} \lambda_n h(B_n)^T, \quad \lambda_n = \lambda \xi_n, \quad h(B_n)\beta - y_n + \xi_n = 0, \quad n = 1, \dots, N. \tag{21}$$

The solution for β can be derived from Eq. (21) using the generalized Moore–Penrose inverse. Note that two forms of pseudo-inverse are available: the left pseudo-inverse and the right pseudo-inverse:

$$\begin{aligned} \text{When } N < L: \quad & \beta = H^{\dagger}Y = H^T\left(\frac{I}{\lambda} + HH^T\right)^{-1}Y \\ \text{When } N > L: \quad & \beta = H^{\dagger}Y = \left(\frac{I}{\lambda} + H^TH\right)^{-1}H^TY. \end{aligned} \tag{22}$$

Note that the right pseudo-inverse (the N < L form) is recommended if the training data set is small, whereas the left pseudo-inverse (the N > L form) is more efficient otherwise, because the two forms require inverting an N × N matrix and an L × L matrix, respectively.
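The two regularized forms of Eq. (22) can be sketched as a single dispatch on the shape of H; by the push-through identity both branches give the same β, differing only in the size of the matrix inverted. The function name is ours.

```python
import numpy as np

def solve_output_weights(H, Y, lam=1.0):
    # Eq. (22): invert an N x N matrix when N < L, an L x L matrix otherwise;
    # `lam` plays the role of lambda in the text
    N, L = H.shape
    if N < L:
        return H.T @ np.linalg.solve(np.eye(N) / lam + H @ H.T, Y)
    return np.linalg.solve(np.eye(L) / lam + H.T @ H, H.T @ Y)
```

Using `np.linalg.solve` on the regularized Gram matrix rather than forming an explicit inverse is the numerically preferable way to evaluate either expression.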

Given a new sample B, the output function of the MI-ELM is obtained from f(x) = sign(h(x)β):

$$\begin{aligned} f(x)_{N \times N} &= \operatorname{sign}\!\left(h(x)H^T\left(\frac{I}{\lambda} + HH^T\right)^{-1}Y\right), \\ f(x)_{L \times L} &= \operatorname{sign}\!\left(h(x)\left(\frac{I}{\lambda} + H^TH\right)^{-1}H^TY\right). \end{aligned} \tag{23}$$

If the feature mapping h(x) is unknown to users, one can apply Mercer's conditions to ELM. We can define an RBF kernel [Eq. (11)] matrix for MI-ELM as follows:

$$\Omega_{\mathrm{ELM}} = HH^T: \quad \Omega_{\mathrm{ELM}\,i,j} = h(B_i) \cdot h(B_j) = K(B_i, B_j). \tag{24}$$

Inspired by the work of Ref. [4] and the definition of a kernel, the output function in terms of the kernel is naturally derived from the N × N version. The output function of the MI-ELM classifier can then be written compactly as

$$f(x) = h(x)H^T\left(\frac{I}{\lambda} + HH^T\right)^{-1}Y = \begin{bmatrix} K(B, B_1) \\ \vdots \\ K(B, B_N) \end{bmatrix}^T \left(\frac{I}{\lambda} + \Omega_{\mathrm{ELM}}\right)^{-1}Y. \tag{25}$$

Assuming that the training data set is $\{B_i, t_i \mid i = 1, \dots, M\}$, that bag $B_i$ contains $N_i$ instances $\{B_{i1}, B_{i2}, \dots, B_{iN_i}\}$, and that the number of hidden nodes with output function $G(a, b, B_i)$ is K, we can now summarize the MI-ELM algorithm.

A big advantage over other methods (e.g. MI-SVM) is that MI-ELM can easily be extended to deal with multiclass classification tasks. We can naturally make use of an SLFN with multiple output nodes. A one-against-all method is adopted to transform a multiclass application into multiple binary classifiers and to turn the discrete classification problem into a continuous output function regression problem. For a testing sample, the class label is given by the output node with the largest output value. Suppose a set of multiclass training data $\{B_i, t_i\}_{i=1}^{N}$ with M labels, where $t_i \in \{1, \dots, M\}$. For example, for $B_i$ from class one, the corresponding label vector is encoded as an M-dimensional vector $t_i = [1, -1, -1, \dots, -1]$. Additionally, the actual output for a sample $B_i$ is $f(B_i) = [f_1(B_i), \dots, f_M(B_i)]$. Therefore, the predicted label of $B_i$ can be inferred as follows:

$$\mathrm{label}(B_i) = \arg\max_{j \in \{1, \dots, M\}} f_j(B_i).$$
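The one-against-all encoding and decoding described above can be sketched as follows (hypothetical helper names, with classes numbered 1 through M):

```python
import numpy as np

def encode_label(t, M):
    # class t -> M-vector with +1 at position t and -1 elsewhere,
    # matching t_i = [1, -1, -1, ..., -1] for class one
    v = -np.ones(M)
    v[t - 1] = 1.0
    return v

def decode_labels(F):
    # F: (num_bags, M) matrix of output-node values f_j(B_i);
    # predicted class is the node with the largest output (argmax rule)
    return np.argmax(F, axis=1) + 1
```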

Algorithm 1:

MI-ELM algorithm.

## 4.1 Benchmark Data Sets

We evaluated the proposed method on the five most popular benchmark MIL data sets, namely MUSK1, MUSK2, Fox, Tiger, and Elephant [5]. In the MUSK data sets, each molecule is represented by a set of conformations, each denoted by a 166-dimensional feature vector. MUSK1 consists of 47 positive bags and 45 negative bags, with 476 instances in all. MUSK2 consists of 39 positive and 63 negative bags, with 6598 instances in all. The Elephant, Fox, and Tiger data sets each consist of 100 positive and 100 negative bags, and each image is described by a 230-dimensional vector. The goal for Tiger, Elephant, and Fox is to classify whether an image contains the corresponding animal. More details of the data sets are available in Ref. [3].

We applied the MI-ELM algorithm to the MIL classification task. It involves two parameters: the trade-off factor λ and the Gaussian kernel impact width σ, both selected from $\{2^{-4}, 2^{-3}, \dots, 2^{10}\}$. MI-ELM was run on the MUSK data sets with the two Hausdorff distance variants: the minimal Hausdorff distance (minHD) and the maximal Hausdorff distance (maxHD). It should be noted that because the hidden layer learning parameters are randomly assigned, the results of MI-ELM vary across runs. Our reported result is the average prediction accuracy of 10-fold cross-validation repeated 10 times (with different random partitions). The experimental results are shown in Table 1, where the values in brackets represent the standard deviations. For both the MUSK1 and the MUSK2 data sets, the best performance (90.0% and 86.1%) was acquired using the minimal Hausdorff distance. It can also be seen from Table 1 that the standard deviation for MUSK1 and MUSK2 using minHD is lower than that of maxHD. The empirical results indicate that MI-ELM with minHD performs better and more stably than with maxHD. This is probably because, under the MIL framework, minHD reflects the assumption that a bag is classified as positive if at least one of its instances is positive. In theory, our method also runs faster with minHD than with maxHD. Thus, the remaining experimental results are based on the minimal Hausdorff distance.

Table 1

The Performance of MI-ELM on MUSK Data.

The MI-ELM network contains 166 input nodes, corresponding to the dimension of the feature vectors, and one output unit whose output falls in [0, 1] for positive bags and [−1, 0] for negative bags; networks were trained over a range of hidden node counts. We compared our method with several typical methods via 10 times 10-fold cross-validation (10CV), except Citation-kNN, which uses leave-one-out cross-validation. The results are listed in Table 2.

Table 2

MI-ELM Performance on Benchmark Data Sets.

As MIL algorithms are time consuming, we experimented with several typical algorithms and recorded their training times. All methods were run in MATLAB 2013b on a personal computer (Intel Core i5-3230M, 2.6 GHz CPU, 4 GB memory). The results obtained were based on the total training time of 10CV. The performance results are presented in Tables 3 and 4.

Table 3

Accuracy and Training Time on MUSK1.

Table 4

Accuracy and Training Time on MUSK2.

Table 2 indicates that MI-ELM is competitive with the state-of-the-art algorithms (e.g. Citation-kNN and DD). In particular, from Tables 2–4, we can see that the proposed MI-ELM performs much better than BP-MIP, which is also a neural-network-based MIL approach. The iterated-discrim APR was specially designed for the MUSK data sets, whereas MI-ELM is a general MIL algorithm; in terms of applicability, MI-ELM is therefore clearly superior to the iterated-discrim APR. Compared with Citation-kNN, MI-ELM is slightly worse in prediction accuracy. However, in terms of learning time (Tables 3 and 4), MI-ELM runs several times faster than Citation-kNN. Moreover, MI-ELM has advantages over other MIL algorithms such as DD, MI-kernel, EM-DD, MI-SVM, and C4.5: it runs extremely fast, and its performance places it among the best MIL approaches.

Compared with MI-kernel, which also uses a Gaussian RBF kernel, MI-ELM performs better on the MUSK1 and Tiger data sets, whereas MI-kernel has higher accuracy on the other three benchmark data sets. Because both use a Gaussian RBF kernel, roughly similar accuracy is to be expected. However, as Tables 3 and 4 show, training MI-ELM takes much less time than training MI-kernel, which suggests that MI-ELM is preferable when training time matters.

## 4.2 Image Categorization

Image categorization is probably the most typical application of MIL. COREL is a natural scene image database that has been widely used in MIL. In the experiments, we use the 1000-Image data set, which contains 10 categories of COREL images, each category with 100 images. Each image is treated as a bag, and the regions of interest in the image are treated as instances described by nine features. More information about these data sets can be found in Ref. [10]. The experimental routine used is the same as that described in Ref. [32]. We compare the related algorithms, including MIGraph, MI-kernel, MI-SVM, and MI-ELM, by their five times two-fold cross-validation results. In the last two methods, a one-against-all strategy is employed to deal with this multiclass task. The overall accuracy and 95% confidence intervals are shown in Table 5. The table suggests that, on average, MI-ELM is competitive with MI-kernel and outperforms MI-SVM and MissSVM by a small margin. The results demonstrate that MI-ELM can also tackle multiclass tasks and works well on image categorization.

Table 5

Test Accuracy on COREL.

## 4.3 Multiple-Instance Regression

The multiple-instance regression data sets we used are named LJ-r.f.s [2], where r is the number of relevant features, f is the number of features, and s is the number of scale factors indicating the importance of the features. The real-valued data were generated to mimic the MUSK data; data sets whose labels are not near 1/2 are indicated by the suffix -S. We performed 10CV tests, and the squared loss and training time are listed in Table 6. MI-ELM was compared with BP-MIP, DD, and MI-kernel on four real-valued data sets, i.e. LJ-160.166.1, LJ-160.166.1-S, LJ-80.166.1, and LJ-80.166.1-S. For reference, the performances of the other methods shown in the table were reported in Refs. [15], [24], [31]. It can be seen that MI-ELM is much more efficient in terms of computational complexity. In terms of squared loss, MI-ELM also performs best among the four methods except on LJ-80.166.1-S. Overall, this indicates that MI-ELM is a useful and efficient method for multiple-instance regression tasks.

Table 6

Square Loss and Training Time on Regression Data Sets.

## 5 Conclusions

Learning time is an important factor when designing any computational algorithm for classification and regression. In this paper, the MI-ELM algorithm was presented, an efficient and effective ELM-based MIL method for classification and regression. We tested MI-ELM on benchmark data sets taken from drug activity prediction, artificial data sets, and image categorization, and two Hausdorff distances were tested on the MUSK data sets. In comparison with other methods, training an MI-ELM model can be accomplished in a short time, and its classification accuracy is competitive with state-of-the-art multiple-instance algorithms. MI-ELM also performs well on multiple-instance regression tasks and runs fast.

Notice that ELM can work on very large data sets. Studying the performance of MI-ELM on large-scale data sets will be interesting. Furthermore, we will consider more effective ways to adapt the ELM algorithm for multiple-instance problems.

## Acknowledgments

This work was supported by the Specialized Research Fund for the Doctoral Program of Higher Education of China (no. 20124101120001), Key Project for Science and Technology of the Education Department of Henan Province (no. 14A413009), and China Postdoctoral Science Foundation (nos. 2014T70685 and 2013M541992).

## Bibliography

• [1]

S. Ali and M. Shah, Human action recognition in videos using kinematic features and multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010), 288–303. Google Scholar

• [2]

R. A. Amar, D. R. Dooly, S. A. Goldman and Q. Zhang, Multiple-instance learning of real-valued data, in: Proceedings of the 18th International Conference on Machine Learning, pp. 3–10, Williamstown, MA, 2001. Google Scholar

• [3]

S. Andrews, I. Tsochantaridis and T. Hofmann, Support vector machines for multiple instance learning, in: Advances in Neural Information Processing Systems 15, edited by S. Becker, S. Thrun and K. Obermayer, pp. 561–568, MIT Press, Cambridge, MA, 2003. Google Scholar

• [4]

B. Babenko, M. H. Yang and S. Belongie, Visual tracking with online multiple instance learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pp. 983–990, IEEE, 2009. Google Scholar

• [5]

C. Blake, E. Keogh and C. Merz, UCI repository of machine learning databases, Department of Information and Computer Science, University of California, Irvine, CA, 1998, http://www.ics.uci.edu/~mlearn/MLRepository.html, accessed February 2015.

• [6]

J. Cao and Z. Lin, Extreme learning machines on high dimensional and large data applications: a survey, Math. Prob. Eng. 501 (2015), 103796. Google Scholar

• [7]

J. Cao and L. Xiong, Protein sequence classification with improved extreme learning machine algorithms, BioMed Res. Int. 2014 (2014). Google Scholar

• [8]

J. Cao, T. Chen and J. Fan, Landmark recognition with compact BoW histogram and ensemble ELM, Multimed. Tools Appl. (2015), 1–19. Google Scholar

• [9]

J. Cao, Y. Zhao, X. Lai, M. E. H. Ong, C. Yin, Z. X. Koh and N. Liu, Landmark recognition with sparse representation classification and extreme learning machine, J. Franklin Inst. 352 (2015), 4528–4545. Google Scholar

• [10]

Y. Chen and J. Z. Wang, Image categorization by learning and reasoning with regions, J. Mach. Learn. Res. 5 (2004), 913–939. Google Scholar

• [11]

Y. Chen, J. Bi and J. Z. Wang, MILES: multiple-instance learning via embedded instance selection, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006), 1931–1947. Google Scholar

• [12]

Y. Chevaleyre and J.-D. Zucker, Solving multiple-instance and multiple-part learning problems with decision trees and decision rules: application to the mutagenesis problem, in: Lecture Notes in Artificial Intelligence 2056, edited by E. Stroulia and S. Matwin, pp. 204–214, Springer, Berlin, 2001.Google Scholar

• [13]

T. G. Dietterich, R. H. Lathrop and T. Lozano-Pérez, Solving the multiple-instance problem with axis-parallel rectangles, Artif. Intell. 89 (1997), 31–71. Google Scholar

• [14]

G. Fung, M. Dundar, B. Krishnapuram and R. B. Rao, Multiple instance learning for computer aided diagnosis, in: NIPS2007, p. 425, MIT Press, Cambridge, MA, 2007. Google Scholar

• [15]

T. Gärtner, P. A. Flach, A. Kowalczyk, et al., Multi-instance kernels, in: Proceedings of the 19th International Conference on Machine Learning (ICML), pp. 179–186, 2002. Google Scholar

• [16]

G.-B. Huang and L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (2007), 3056–3062. Google Scholar

• [17]

G.-B. Huang and C.-K. Siew, Extreme learning machine with randomly assigned RBF kernels, Int. J. Inf. Technol. 11 (2005), 16–24. Google Scholar

• [18]

G.-B. Huang, Q.-Y. Zhu and C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: Proceedings of International Joint Conference on Neural Networks (IJCNN2004), Budapest, Hungary, 25–29 July 2004. Google Scholar

• [19]

G.-B. Huang, L. Chen and C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (2006), 879–892. Google Scholar

• [20]

G. B. Huang, D. Wang and Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybern. 2 (2011), 107–122. Google Scholar

• [21]

Z. Jorgensen, Y. Zhou and M. Inge, A multiple instance learning strategy for combating good word attacks on spam filters, J. Mach. Learn. Res. 9 (2008), 1115–1146. Google Scholar

• [22]

M. Kim and F. De la Torre, Multiple instance learning via Gaussian processes, Data Mining Knowl. Disc. (2014), 1078–1106. Google Scholar

• [23]

C. Leistner, A. Saffari and H. Bischof, MIForests: multiple-instance learning with randomized trees, in: Computer Vision – ECCV 2010, pp. 29–42, Springer, Berlin/Heidelberg, 2010. Google Scholar

• [24]

O. Maron and T. Lozano-Perez, A framework for multiple-instance learning, in: Advances in Neural Information Processing Systems 10, edited by M. I. Jordan, M. J. Kearns and S. A. Solla, pp. 570–576, MIT Press, Cambridge, MA, 1998. Google Scholar

• [25]

J. Wang and J.-D. Zucker, Solving the multiple-instance problem: a lazy learning approach, in: Proceedings of the 17th International Conference on Machine Learning, pp. 1119–1125, San Francisco, CA, 2000. Google Scholar

• [26]

Y. Xie, Y. Qu, C. Li and W. Zhang, On line multiple instance gradient feature selection for robust visual tracking, Pattern Recognit. Lett. 33 (2012), 1075–1082. Google Scholar

• [27]

B. Zeisl, C. Leistner, A. Saffari and H. Bischof, On-line semi-supervised multiple-instance boosting, in: IEEE Conference on Computer Vision and Pattern Recognition, p. 1879, IEEE, Piscataway, NJ, 2010. Google Scholar

• [28]

Q. Zhang and S. A. Goldman, EM-DD: an improved multiple-instance learning technique, In: Advances in Neural Information Processing Systems 14, edited by T. G. Dietterich, S. Becker and Z. Ghahramani, pp. 1073–1080, MIT Press, Cambridge, MA, 2002.Google Scholar

• [29]

K. Zhang and H. Song, Real-time visual tracking via online weighted multiple instance learning, Pattern Recognit. (2013), 397–411. Google Scholar

• [30]

Z.-H. Zhou and J.-M. Xu, On the relation between multi-instance learning and semi-supervised learning, in: Proceedings of the 24th International Conference on Machine Learning, pp. 1167–1174, ACM, 2007. Google Scholar

• [31]

Z.-H. Zhou and M.-L. Zhang, Neural networks for multi-instance learning, Technical Report, AI Lab, Computer Science & Technology Department, Nanjing University, China, August 2002.

• [32]

Z. H. Zhou, Y. Y. Sun and Y. F. Li, Multi-instance learning by treating instances as non-I.I.D. samples, in: Proceedings of the 26th International Conference on Machine Learning, edited by L. Bottou and M. Littman, pp. 1249–1256, Omnipress, Montreal, June 2009. Google Scholar

Corresponding author: Liangjian Cai, School of Electrical Engineering, Zhengzhou University, Zhengzhou 450001, China, e-mail: .

Published Online: 2016-03-17

Published in Print: 2017-01-01

Citation Information: Journal of Intelligent Systems, Volume 26, Issue 1, Pages 185–195, ISSN (Online) 2191-026X, ISSN (Print) 0334-1860,


©2017 Walter de Gruyter GmbH, Berlin/Boston.