Application of data mining technology in detecting network intrusion and security maintenance

: In order to correct the de ﬁ ciencies of intrusion detection technology, the entire computer and network security system are needed to be more perfect. This work proposes an improved k - means algorithm and an improved Apriori algorithm applied in data mining technology to detect network intrusion and security maintenance. The classical KDDCUP99 dataset has been utilized in this work for performing the experimentation with the improved algorithms. The algorithm ’ s detection rate and false alarm rate are compared with the experimental data before the improvement. The outcomes of proposed algorithms are analyzed in terms of various simulation parameters like average time, false alarm rate, absolute error as well as accuracy value. The results show that the improved algorithm advances the detection e ﬃ ciency and accuracy using the designed detection model. The improved and tested detection model is then applied to a new intrusion detection system. After intrusion detection experiments, the experimental results show that the proposed system improves detection accuracy and reduces the false alarm rate. A signi ﬁ cant improve - ment of 90.57% can be seen in detecting new attack type intrusion detection using the proposed algorithm.


Introduction
With the development of the Internet, the Internet has become an important part of human work and life. Especially after the realization of globalization and informatization, many enterprises, government agencies, and individuals carry out various businesses and operations on the open Internet, such as Financial companies carry out online banking, etc. [1]. People use the Internet to facilitate work and life, and at the same time, change people's work lifestyle and meet the needs of modern people for being fast, efficiency, and convenience. People use the Internet to obtain the information they need, as we all know The Internet is open and shared. These two characteristics make the resources integrated into the Internet rich, but at the performing the experimentation with the improved algorithms. The contribution of improved algorithms has been tested experimentally in terms of detection rate and false alarm rate. Significant improvement has been noticed in the outcomes obtained and the improved algorithm advances the detection efficiency and accuracy using the designed detection model. When the improved and tested detection model is then applied to a new IDS, a significant improvement is seen in the detection accuracy while simultaneous reduction in the false alarm rate.
The remaining paper is organized as follows: Section 2 presents the literature review of the state of the art work in the field of intrusion detection. The dataset description and data preprocessing are detailed in Section 3 followed by the experimental results and analysis in Section 4. Section 5 presents the conclusion of the entire research work followed by the future research scope.

Literature review
Deng and others analyzed the characteristics of network security and security issues and discussed the system framework of Internet security and some key security technologies, including key management, authentication and access control, routing security, privacy protection, intrusion detection, and Fault tolerance and this article introduces the current problems of the Internet of Things in network security and points out the necessity of intrusion detection. Several intrusion detection technologies are discussed, and their applications in the Internet of Things architecture are analyzed. We compared the applications of different intrusion detection technologies and looked forward to the next phase of research. Using data mining and ML methods to study network intrusion technology has become a hot issue. It is difficult to improve the detection rate of network intrusion detection by a single type of feature or detection model. The performance of the proposed model is verified by a public database [11]. Zhang J., with IDS as the research object, established an IDS model based on data mining, obtained experimental results, and reached relevant experimental conclusions. At the same time, it was compared with traditional IDS, and six experiments were carried out. The output results of the detection rate, false negative rate, and false positive rate of two different methods in six experiments were obtained. The experiment draws a conclusion: the IDS using data mining has better network protection and security performance, and the detection ability of network vulnerability intrusion is stronger. This research provides a new way for the detection and research of network protection security loopholes [12]. Bagaa M. et al. proposed a novel ML-based security framework that can automatically deal with extended security aspects related to the IoT domain. The framework utilizes software-defined networking and network function virtualization enablers to mitigate different threats. The AI framework combines monitoring agents and ML-based AI-based reaction agents. These ML-Models are divided into network pattern analysis and anomaly-based intrusion detection in IoT systems. The framework uses supervised learning, distributed data mining systems, and neural networks to achieve its goals. The experimental results prove the effectiveness of the scheme. In particular, the use of data mining methods to distribute attacks is very successful in detecting attacks with high performance and low cost. Regarding an anomaly-based IDS for the Internet of Things, the experiment has been evaluated in a real smart building scenario using a type of SVM. The detection accuracy of anomalies reaches 99.71%. A feasibility study was conducted to determine the current potential solutions to be adopted and to promote the development of the study in the direction of open challenges [13].
Ektefa et al. [14] evaluated the IDSs using the machine leaning platforms, and among C4.5 and Support Vector Machine (SVM) methods, C4.5 performs better. A hybrid Particle Swarm Optimization (PSO) was proposed by Holden and Freitas [15] to deal with the nominal attributes for intrusion detection. Ardjani et al. [16] applied the combination of PSO and SVM to optimize the ML algorithm performance and 10-fold cross validation is done for accuracy estimation. Multidimensional dataset was dealt with to remove the redundancy and inconsistency in the feature vectors and provide better classification outcomes [17]. Twoclass classification method was proposed by Wael et al. [18] to classify the normal situation from the attack class. This method was effective in providing the least error and better detection rate comparative to its counterparts. New form of attack patterns was captured by Petrussenko [19] for the detection of attacks in the network packets and it was found that higher detection rate was achieved by utilizing a preprocessor module in between [20,21]. Improving IDS by ML needs large sets of labeled data for attack scenarios and normal scenarios. Acquiring such sets is costing time and effort. Also, it can be done by experts. A large set of pre-labelled training data was utilized by an unsupervised method to produce less accuracy; therefore, the authors dealt with the semi-supervised algorithm to provide better accuracy detection rates. A fuzzy clustering-based approach was proposed by the authors in ref. [22], while evaluating both the Euclidean distance and statistical cluster properties to identify the possibility of intrusion. Wang and Megalooikonomou [23] presented a co-training framework to improve the intrusion detection in case of unlabeled data. This method provides the lower error rate, while incorporating an active learning technique to improve the system performance. Higher detection rates were achieved by utilization of a semi-supervised learning mechanism which yields comparatively reduced false alarm rate [24]. In order to improve the accuracy rate of intrusion detection, a tri-training SVM algorithm has been used [25]. A tree-based clustering technique was proposed by Li et al. [26], which finds the clusters among the intrusion detection dataset. This method works efficiently faster comparative to other algorithms. A combination of misuse and anomaly detection was utilized by the authors in ref. [27] for determining the cost function of the semi-supervised strategy involving SVM classifiers. For addressing the distributed denial of service attacks, various advents have been offered [28]. Fu and Papatriantafilou [29] proposed a perspective proactive algorithm to divide the network into set of clusters and the intrusion is detected using these clusters. Fu et al. [30] presented a synchronous communication-based method for intrusion detection and reducing the false alarming rates.
The usage of data mining combined with the ML methods has been made for intrusion technology in various related works. All the proposed models in the literature try to fit into the prescribed dataset and reduce the administrator workload by improving the detection of attacks. However, from this literature survey, it was found that it is challenging to improve the detection rate of network intrusion detection by a single type of feature or detection model, as used in the current existing methods. These methods are still needed to be improved and there are manifolds possibilities of improvement in IDSs.

Introduction to KDD cup99 data set
KDDcup99 is a data set specially used to evaluate the situation of research IDSs. It is a data set collected by Lincoln Laboratory during intrusion detection and evaluation of DARPA networks in 1998. It contains various types of data containing different intrusion categories. The data set contains a total of 4.9 million data; here, only 10% of the data are taken from "kddcup.data_10. percent" [31]; that is, 396,473 normal behavior data and abnormal behavior data. Since all the data in the KDDcup99 data set are the original network data packets obtained by Tcpdump, the data packets are converted into ASCII format.
Intrusion detection evaluation KDDcup99 data set mainly has the following types of attacks: (2) Illegal remote host access R2L attack (Remote to Local); (3) Local unprivileged users conduct unauthorized access U2R attacks (User to Root) of local super users; (4) Reconnaissance and detection of probing attacks; (5) Data transmission data attack.
The data provide 41 characteristics, divided into three different categories: traffic characteristics, content characteristics, and other characteristics.

Basic features
The format conversion of each data packet captured by Tcpdump from the network generates a unified format record, which contains nine basic characteristics, as shown in Table 1.

Data preprocessing
Preprocessing is an important data preparation work for the data mining module before mining. According to the requirements of the system algorithm, the data are processed in advance to make the data more in line with the data required by the system mining. Data preprocessing must complete the following tasks: (1) Data extraction: It refers to the integration of operational databases oriented to transaction processing to databases oriented to data mining. (2) Data cleaning: Including some inconsistent data, empty data, data cleaning is to deal with these data, including inconsistent data transformation, vacant data supplement, and so on. In this work, the two algorithms are used, one is the FP-growth algorithm depicted in Figure 2 and the Apriori algorithm whose flowchart is shown in Figure 3.
The FP-growth algorithm undergoes a process of preparing the clusters and the number of k-clusters are computed followed by the centroid computation. The grouping of different data groups is done depending upon the grouping criteria and the model generation is done using the association rule. Using those association rules from the database, the intrusion scenario is detected.
For the Apriori algorithm, the supports of each item are calculated, and depending upon the support condition, the items are removed or inserted into the frequent item set. The confidence for every non-empty  subset is computed and the selection of the subset is dependent upon the confidence condition. The intrusion detection is accomplished in this algorithm on the basis of the addition of strong rules. In this experiment, the two algorithms depicted in Figures 2 and 3 based on association rules and clustering analysis are used for experimentation with the subset data kddcup.data_10. percent of KDDcup99. The data are divided into training data and detection data. The training data are used to train the data and build a detection model, and the detection data are used to detect the data and analyze the results of the detection performance [32,33]. When constructing the detection model, it is necessary to filter the abnormal data in the training data, filter out most of the intrusion data and leave only a little intrusion data, and extract the normal data to reach 95%. As long as the detection data are directly extracted as it is, 2,000 network data set records are extracted. The value of the holding degree is set from 0.35 to 8%. The data of 20 mining experiments based on the FP-growth algorithm and the Apriori algorithm are shown in Table 2.
In order to more intuitively reflect the mining effect of the experiment, the conversion is shown in Figure 4.
From the above experiment, it can be found that when the value of the minimum support is smaller, the mining efficiency of the FP-growth algorithm is higher than that of Apriori, and when the value of the minimum support is changed, the mining time is more stable [34]. This is because the FP-growth algorithm does not generate candidate item sets during mining, while the Apriori algorithm generates a large number of candidate item sets, which affects the mining efficiency.  The k-means algorithm first needs to input the number of clusters. This number has a greater impact on the results of clustering. Below, we will test the effect of clustering through different numbers of clusters. Select the cluster centers as 10, 12, 16, 18, 20, and 28, respectively, for testing. The training sample data are 8,000 records, and to make the proportion of normal data larger, only 400 intrusion data are left. Reduce 41 attributes to only 13 data field attributes. The two parameters of the improved algorithm are the minimum clustering radius r and exponential factor m. After experiment and observation, the values of the two parameters are r = 0.3 and m = 0.5, respectively. When r = 0.3 and m = 0.5, the detection rate and false alarm rate can be selected in a reasonable range. The data of the experimental results are shown in Table 3.
In order to understand the improved performance of the k-means algorithm more intuitively, the experimental results are plotted as shown in Figure 5.
From Figure 5, we can see that the improved k-means algorithm improves the detection rate and reduces the false alarm rate. In short, the improved detection performance of k-means algorithm has been improved.

Experimental process
• Data collection. Use libpcap tool to collect about 50MB data packet samples from the network, calculate the average value and average absolute error of each continuous attribute value, as shown in Table 4, and save it to the Nample.txt file.
Collect the transmitted data packets within 8 h on the network through snort and save the data as normal.log files, which are used to establish the normal behavior pattern of the network training data. IDS Informer is used to send simulated attacks and save the attack data to the attack.log file. There are about 1,000 attack packets in the attack.log file, which are used to test pre-detection and evaluate the effect of the system [35]. In order to reduce unnecessary abusive detection work, the system's pre-detection module will  discard packets that are considered normal. Therefore, the false detection rate can be obtained by calculating the number of packets discarded when detecting the attack.log file.

Experiment of selecting the parameters of the k-means algorithm
Both the cluster analysis module plug-in and the pre-detection module plug-in apply an improved k-means algorithm. Different settings of the clustering radius r exponential factor m will directly affect the effect of the cluster analysis and the judgment result of the pre-detection module.
(1) The influence of clustering radius r (set exponential factor m = 2) When clustering network data packets, the similarity between the network data packet and the center of all network behavior patterns is calculated, and the behavior pattern class to which the data packet belongs is divided. If there is the similarity and clustering radius of a certain network data packet when r is the same, the data packet is classified into the behavior pattern class; otherwise, that is to say, the similarity is greater than all the cluster radius r, indicating that this type of network data packet cannot be classified into any cluster. It is necessary to establish a new cluster for this data packet, and the cluster analysis module performs experiments on the data in the normal.log file. The results are shown in Table 5.
As can be seen from the above table: The clustering radius r will affect the effect of clustering. As the clustering radius r becomes smaller, the network behavior pattern classes obtained increase, and vice versa, the network behavior pattern classes obtained decrease.
Similarly, the clustering radius r also has a certain impact on the false detection rate of the predetection module. With the increase of the clustering radius r, the greater the probability of mistaking the attack data as normal data, the higher the false detection rate; otherwise, the lower the false detection rate. Experiment with the data in attack.log through the pre-detection module, and get the false detection rate, as shown in Table 6.
It can be seen from the above table that the larger the clustering radius r, the higher the false detection rate of the pre-detection module. The high false detection rate means that the pre-detection module mistakenly treats some attack packets as normal packets, which leads to detection.
(2) The influence of exponential factor (set cluster radius r = 4) Only when the number of data packets in the abnormal network behavior pattern exceeds the exponential factor, can it become the normal network behavior pattern. This shows that when the exponential factor is smaller, there are fewer abnormal behavior patterns in the network, and there are relatively more normal behavior patterns in the network; conversely, the more abnormal behavior  patterns in the network, the fewer normal behavior patterns. The cluster analysis module performs experiments on the data in the normal log, and the results are shown in Table 7.
As can be seen from the above table, when the amount of data packets in the abnormal network behavior pattern class reaches the exponential factor, the system will put this type in the normal network behavior pattern library and no longer consider it as an abnormal network behavior pattern. It can be seen that the smaller the exponential factor, the higher the false detection rate.

Result analysis
Based on the above experiments, it can be seen that the clustering radius and exponential factor will affect the conclusion of the pre-detection module and the effect of clustering. In practical applications, the clustering radius and exponential factor are appropriately adjusted according to the needs to make the detection result better. Usually, the exponential factor is set to a large point, which can avoid taking part of the intrusion operation as normal behavior, and the clustering radius should be adjusted according to the specific situation.
(1) Experiment with unknown intrusion detection capabilities The feature extraction module is used to generate and extract association rules. From the recorded logs, the improved Apriori algorithm is used to mine frequent item sets, generate association rules and convert them into intrusion detection rules conforming to the system association rule grammar, and add them to the rule base, so that the detection module has the ability to detect new intrusions.
(2) Experiment to verify the detection efficiency of the new system The new system adds cluster analysis module plug-in, pre-detection module plug-in, and feature extraction module plug-in based on the original system. The improved algorithm is applied in cluster analysis and feature extraction, which improves the detection efficiency of the system. However, since the operation of the added plug-in itself also consumes time, the detection efficiency of the detection system has a certain impact.
The following experiments verify the influence of the parameters of the k-means algorithm on the detection efficiency of the new system.
(1) The influence of clustering radius r (set exponential factor m = 8) Based on the normal network behavior patterns that have been established by the clustering analysis module, we conducted an experiment on the detection efficiency of the attack.log file, and the results are shown in Table 8.
It can be concluded from the above table that the detection time decreases with the increase of the cluster radius r, that is, inversely proportional; that is to say, with the increase of the cluster radius, the detection time becomes shorter. This is because as the cluster radius becomes smaller, the number of clusters generated will increase, and the cluster analysis module will take more time.
(2) The influence of exponential factor (set cluster radius r = 4) The normal network behavior pattern class is established in the module, and the attack.log file is used to test the efficiency of the experiment. The results are shown in Table 9.
It can be seen from the above table that the time required for system detection is directly proportional to the exponential factor, indicating that the system detection time increases as the exponential factor increases. Because the larger the exponential factor, the time required for the system to establish normal network behavior patterns will also increase, which affects the detection time of the system.
Using KDDCUP99, 10% data set includes a large number of different types of attack data. There are more DOS attack data and 537 data are randomly selected; PROBE attack data are randomly selected, 3,100 data; U2R attack data are less, a total of 60 attack data, 2,124 pieces of R2L data and 2,300 pieces of new attack data. The IDS using the k-means algorithm and the improved k-means algorithm was detected, and the intrusion attacks were run three times. The three detection results are averaged, and the comparison of detection accuracy results is shown in Table 10.
From Table 10, it is revealed that the improved k-means algorithm proposed in this work yields an increased accuracy for intrusion detection from all types of attack evaluated. 7.10% improvement is seen from the DOS type attack, and similarly, 4.63, 78.04, 20.11, and 90.57% improvement is seen for PRONE, R2L, UZR, and new attack types, respectively.
For the clustering analysis module plug-in, the main influencing parameters are the clustering radius r and the exponential factor. The setting of its value will have an impact on the clustering effect, and in the pre-detection module, the main impact is on its detection efficiency. In the feature extraction module, the influencing parameter is the support and confidence of the Apriori algorithm. Experiments show that the new system improves the detection efficiency and reduces the false detection rate. It can detect unknown attack behaviors, which shows that it has the ability to deal with new and unknown attacks and also shows its practical application value.
Improve on the basis of analyzing the commonly used k-means algorithm and Apriori algorithm, and apply them to the new data mining IDS. By implementing the proof algorithm, the detection efficiency of the   system is improved and unknown intrusion attacks can be detected, resulting in reduced false detection rate of the system.

Conclusion
k-means algorithm is an initial value-sensitive algorithm in which different k values produce different clustering results. In view of the shortcomings of k-means algorithm, an improved k-means algorithm is proposed in this work. The k value does not need to be determined in advance, and the data set automatically produces an optimal value through clustering. The clustering center does not need to be changed after determination and the whole data set only needs to be scanned once. The efficiency of both the improved k-means algorithm and the clustering effect has improved greatly.
(1) The improved k-means algorithm is verified through experiments. The experimental data show that the improved k-means algorithm has detection efficiency and detection ability. Compared with the previous algorithm, the algorithm improves the detection efficiency and detection ability of the system. A maximum improvement in the detection accuracy of 90.57% can be seen for new attack type intrusion using the proposed algorithm. (2) Apply the improved algorithm to the IDS and design an IDS based on data mining, which mainly includes pre-detection modules, feature extraction modules, etc. and conducts intrusion detection experiments on the new IDS. The experimental results show that the proposed system improves the detection efficiency and reduces the false detection rate.