In this paper, simulations of composite data are used to evaluate the methods. Two-dimensional and three-dimensional data can visually reflect the characteristics of the data. In the following discussion, we apply six data generating processes, shown in Figure 2: simple small samples, mild hybrid and unbalanced simple triangle samples, multi-cluster samples, the Taiji diagram, superimposed curve samples, and 3D spiral samples. To avoid the particularity of any single set of data, we repeated the simulation 100 times for each data generating process to obtain 100 datasets. For each dataset, we performed random sampling 10 times, each time extracting 1/3 of the observations as the test sample and keeping the rest as the training sample. In this way, we have 1000 pairs of test and training samples for each data generating process.
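The sampling protocol above, together with one of the six data generating processes, can be sketched as follows. The exact radii, pitch, and noise level of the 3D spiral are illustrative assumptions; the paper does not specify them:

```python
import numpy as np

def make_spiral_3d(n_per_class=200, n_classes=3, noise=0.05, seed=0):
    """One data generating process: 3D spiral samples, three arms
    offset in phase (parameters are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(n_classes):
        t = rng.uniform(0, 4 * np.pi, n_per_class)
        phase = 2 * np.pi * c / n_classes          # offset each spiral arm
        x1 = t * np.cos(t + phase) + rng.normal(0, noise, n_per_class)
        x2 = t * np.sin(t + phase) + rng.normal(0, noise, n_per_class)
        x3 = t + rng.normal(0, noise, n_per_class)  # height grows along the arm
        X.append(np.column_stack([x1, x2, x3]))
        y.append(np.full(n_per_class, c))
    return np.vstack(X), np.concatenate(y)

def random_split(X, y, test_frac=1 / 3, seed=0):
    """Extract 1/3 of the observations as the test sample, rest as training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = round(len(y) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```

Repeating `make_spiral_3d` with 100 seeds and `random_split` with 10 seeds per dataset yields the 1000 train/test pairs described above.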

Figure 2 Six data generating processes for composite data

This section computes the accuracy of six classification models: the naive Bayesian classifier (NBC), the C5.0 decision tree classifier (C5.0), k-nearest neighbor (KNN), linear discriminant analysis (LDA), the support vector machine (SVM) and local kernel nonparametric discriminant analysis (LKNDA). To make the comparison credible, we use the same 1000 pairs of testing and training data for each classification model. We summarize the mean and standard deviation of prediction accuracy and list them in Table 2.
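The evaluation protocol (repeated random splits, mean and standard deviation of accuracy over all pairs) can be sketched as below. The `nearest_centroid` rule is only an illustrative stand-in for a classifier, not one of the paper's six models:

```python
import numpy as np

def evaluate(fit_predict, datasets, n_splits=10, seed=0):
    """Mean and std of prediction accuracy over repeated random splits.
    `fit_predict(Xtr, ytr, Xte) -> yhat` stands in for any classifier;
    the protocol (1/3 test, rest train, repeated per dataset) is what
    the section describes."""
    rng = np.random.default_rng(seed)
    accs = []
    for X, y in datasets:
        for _ in range(n_splits):
            idx = rng.permutation(len(y))
            n_test = len(y) // 3                 # 1/3 as the test sample
            te, tr = idx[:n_test], idx[n_test:]
            yhat = fit_predict(X[tr], y[tr], X[te])
            accs.append(np.mean(yhat == y[te]))
    return float(np.mean(accs)), float(np.std(accs))

def nearest_centroid(Xtr, ytr, Xte):
    """Toy classifier for demonstration: assign each test point to the
    class with the nearest training centroid."""
    classes = np.unique(ytr)
    cents = np.stack([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(Xte[:, None, :] - cents[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```

Passing all 100 datasets of one generating process with `n_splits=10` reproduces the 1000-pair averaging used for Table 2.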

Table 2 Prediction performance of six kinds of composite data

Figure 2(a) shows simple small samples with linear characteristics and presents a simple classification problem. LDA, SVM and LKNDA achieve nearly 100% classification accuracy. KNN and C5.0 are inefficient for small-sample classification and are not ideal in this case. NBC depends strictly on the independence assumption and thus yields a poor result.
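Why LDA excels on linearly separable small samples can be seen from a minimal two-class Fisher discriminant, which only needs class means and the pooled within-class scatter (a sketch of the standard method, not the paper's implementation):

```python
import numpy as np

def fisher_lda_fit(X0, X1):
    """Two-class Fisher discriminant: project onto w = Sw^{-1}(m1 - m0)
    and threshold at the projected midpoint of the class means."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter (np.cov is normalized by n-1, so rescale).
    Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
    w = np.linalg.solve(Sw, m1 - m0)
    thresh = w @ (m0 + m1) / 2
    return w, thresh

def fisher_lda_predict(X, w, thresh):
    """Class 1 if the projection exceeds the midpoint threshold."""
    return (X @ w > thresh).astype(int)
```

Because only two mean vectors and one covariance are estimated, the rule is stable even with few observations per class, matching the near-100% result in Figure 2(a).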

Figure 2(b) shows mild hybrid and unbalanced simple triangle samples. This case has a single structure, obviously linear boundaries and adequate samples; thus, all six methods predict with approximately 92% accuracy. By inference, for simple classification problems with adequate samples, the different methods show no significant difference.

Figure 2(c) shows multi-cluster samples. The clusters are intuitively clear, with obvious category boundaries. C5.0, KNN, SVM and LKNDA classify with accuracy close to 100%. NBC and LDA perform poorly in this case. NBC is based on the marginal probability distributions and the independence assumption; the green samples and blue samples in Figure 2(c) have the same marginal probability distribution, which makes NBC unable to distinguish them. LDA is a projection method based on the means and variances of the classes; the three types of samples have the same mean, causing LDA to fail.
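The failure mode can be reproduced numerically. The cluster layout below is a hypothetical stand-in for Figure 2(c): two classes whose clusters sit on opposite diagonals share the same per-feature marginals and the same mean, so a Gaussian naive Bayes sees identical densities and LDA sees identical class centers:

```python
import numpy as np

def two_cluster_class(centers, n=300, noise=0.3, seed=0):
    """Samples one class made of two Gaussian clusters (a hypothetical
    stand-in for the multi-cluster layout of Figure 2(c))."""
    rng = np.random.default_rng(seed)
    pts = [np.asarray(c) + rng.normal(0, noise, size=(n // 2, 2)) for c in centers]
    return np.vstack(pts)

# Clusters on opposite diagonals: both classes have the same per-feature
# marginal distribution (mean ~0, identical variance), which is exactly
# where NBC (marginals) and LDA (class means) lose discriminating power.
green = two_cluster_class([(-3, -3), (3, 3)], seed=1)
blue = two_cluster_class([(-3, 3), (3, -3)], seed=2)

print(green.mean(axis=0), blue.mean(axis=0))  # both close to (0, 0)
print(green.var(axis=0), blue.var(axis=0))    # per-feature variances match
```

Any method that looks at the joint geometry (KNN, SVM with a suitable kernel, C5.0, LKNDA) still separates the two classes easily.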

Figure 2(d) shows the Taiji diagram. It is an identification problem with a complex nonlinear structure and clear edge margins. LKNDA performs best, then KNN, followed by C5.0; these three nonparametric methods all achieve high accuracy. LDA, SVM and NBC do not perform well because they can only capture linear or simple nonlinear classification boundaries.
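A minimal k-nearest-neighbor rule illustrates why local, nonparametric methods handle this kind of curved boundary. The interleaved half-moons below are an assumed stand-in for the Taiji diagram's two-class curved structure, not the paper's generator:

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k=5):
    """Plain k-nearest-neighbour majority vote (Euclidean distance):
    the kind of nonparametric rule that copes with a Taiji-style boundary."""
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(ytr[row]).argmax() for row in nn])

def make_moons(n_per_class=200, noise=0.1, seed=0):
    """Two interleaved half-moons: an assumed stand-in for the curved,
    clearly margined two-class structure of the Taiji diagram."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0, np.pi, n_per_class)
    upper = np.column_stack([np.cos(t), np.sin(t)])
    lower = np.column_stack([1 - np.cos(t), 0.5 - np.sin(t)])
    X = np.vstack([upper, lower]) + rng.normal(0, noise, (2 * n_per_class, 2))
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y
```

Because the vote depends only on local neighborhoods, no global linear projection has to exist for the rule to succeed.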

Figure 2(e) shows superimposed curve samples. It is an identification problem with a complex nonlinear structure and linear regularity. LKNDA performs best. KNN often misclassifies points near the curve intersections. The other methods fail in this case.

Figure 2(f) shows 3D spiral samples. It is a three-class identification problem with a complex nonlinear structure and clear edge margins. LKNDA performs best, then C5.0. The other methods fail.

According to the above analysis, the conditions under which the six classifiers work can be summarized as in Table 3. NBC requires a large sample size and an independence assumption; it cannot handle dependence relationships, whether linear or nonlinear. C5.0 is a decision tree classifier based on information entropy. It requires a large sample size, and its dividing surfaces are limited to the *x* or *y* direction (axis-parallel splits), which leads to classification errors. KNN is a competent nonparametric classification algorithm, capable of solving problems with complex nonlinear characteristics given an adequate sample size; however, KNN cannot capture the regularity of points near the class interface. LDA is a linear projection classifier that performs well with a small sample size but is invalid in nonlinear settings. SVM solves simple nonlinear problems by establishing a classification hyperplane, but remains insufficient for complex nonlinear problems. LKNDA, proposed in this paper, strives to absorb the advantages of both nonparametric and parametric methods: it takes advantage of nonparametric methods in solving complex nonlinear problems, while drawing on the benefits of parametric ones for pattern recognition with small sample sizes.

Table 3 Conditions of classification algorithm

In small-sample classification tasks, if the data are linear, LDA is appropriate. If the data are simply nonlinear, SVM works well. If the data are complex nonlinear and samples are adequate, KNN is often applied. If KNN performs poorly, or further prediction accuracy is desired, one can turn to LKNDA.
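The guidance above can be condensed into a rule-of-thumb selector. The labels are informal and the "small"/"adequate" cut-off is not specified in the text, so this is only a mnemonic:

```python
def choose_classifier(boundary, samples="adequate", knn_ok=True):
    """Rule-of-thumb selector mirroring the guidance above.
    `boundary` is one of "linear", "simple nonlinear", "complex nonlinear";
    the sample-size threshold is left unspecified, as in the text."""
    if boundary == "linear":
        return "LDA"
    if boundary == "simple nonlinear":
        return "SVM"
    if boundary == "complex nonlinear" and samples == "adequate" and knn_ok:
        return "KNN"
    # Fall back when KNN behaves poorly or samples are small.
    return "LKNDA"
```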
