Prediction of hot metal temperature based on data mining

Accurately and continuously monitoring the hot metal temperature of the blast furnace (BF) is a challenging task. To address this problem, we propose a hot metal temperature prediction model based on the AdaBoost ensemble algorithm using real BF production data. We cleaned the raw data using data analysis techniques combined with metallurgical process theory, mainly covering data integration, outlier elimination, and missing-value imputation. Redundant features were removed based on a Pearson correlation heat map analysis, and the input parameters of the model were preliminarily determined using the recursive feature elimination method. We built the hot metal temperature prediction model with the AdaBoost ensemble algorithm on a dataset containing the selected features as well as features derived from K-means clustering tags. The results show that the performance of the prediction model with K-means clustering tags is further improved, achieving accurate monitoring and forecasting of the molten iron temperature. The model reaches an accuracy of more than 90% within an error of ±5°C.


Introduction
The long service life of the blast furnace (BF) is the precondition of its high efficiency. One of the main factors affecting the lifespan of a large BF is the lifespan of the hearth. Many factors influence the lifespan and efficiency of the hearth, and they all reflect, to some extent, whether the hearth is in good working condition, which is an important sign of the stability and smoothness of furnace operation [1].
The temperature of hot metal can reflect the physical heat energy well [2] and it is also a symbol of the heat in the hearth. The high temperature of hot metal during discharging indicates that the heat inside the hearth is abundant and the hearth is active; otherwise, it indicates that the hearth is not hot enough and the activity is decreased. Therefore, the hot metal temperature can directly reflect the temperature status inside the hearth, which is a very important index to measure the thermal state of the hearth.
Many experts and scholars have carried out research to achieve accurate prediction and optimal control of BF smelting. Li et al. [3] proposed an extreme learning machine model based on the grey correlation degree. By using the grey correlation degree to analyze the correlation between BF parameters, the strong coupling of data caused by the complexity of BF smelting was effectively reduced; however, it was difficult to determine the optimal value of the index, so the method depended too heavily on the subjectivity of users. Sun et al. [4] proposed a support vector machine (SVM) based on principal component analysis (PCA) and the least squares (LS) method: PCA was used to reduce the dimension of the input parameters, and an LS-SVM built on them formed a time-series prediction model. This approach can effectively handle the strong coupling of BF data and the slow operation of the SVM caused by the complexity of multivariable training; however, when the training set was updated, the main factors fluctuated and degraded the prediction performance. Chong [5] introduced quantum theory into a genetic-algorithm neural network model to solve problems such as premature convergence and the limitations of the optimization process. This successfully combined the diversity of quantum theory with the accuracy of the neural network, weakened the dependence on initial conditions, and ensured the training effect of the algorithm on the data, but it suffered from a large data demand and poor network extensibility.
By collecting large amounts of raw data, big data technology can uncover potential relationships among data using complex analysis models [6]. This advanced technology can provide a more reliable basis for decision-making, effectively avoid risk accidents, and yield considerable returns. At present, big data technology has been successfully applied in many areas of industry, such as predicting the strength of rock materials [7], predicting the air overpressure caused by blasting [8], and soil classification [9], which have pushed industry into a new era of innovation and change.
The hot metal temperature is closely related to the working state of the hearth. A moderate molten iron temperature benefits the smooth flow of slag and iron inside the hearth of the BF; if the temperature is too high or too low, the BF will develop in an unfavorable direction, increasing the risk of production accidents. To avoid abnormal heating or cooling of the furnace caused by manual judgment errors, research on real-time monitoring of the hearth status was carried out. This article adopts the massive BF production data collected by the steel mill, performs standardized preprocessing on the original data, and divides the dataset into a training set and a test set. Because the K-means clustering algorithm converges quickly, has few tuning parameters, and is highly interpretable, it was used to cluster the historical hot metal temperature, and the clustering results were analyzed and verified against metallurgical process experience. We then used the AdaBoost ensemble algorithm, with its high accuracy and good generalization ability, to predict the hot metal temperature during production, and verified the accuracy of the predictions with actual production data. The model can provide effective guidance for on-site operators to stabilize the working state of the hearth and control it in real time.

Theoretical analysis of iron making process
The whole process of BF smelting can be summarized as follows. The burden enters the BF from the top of the furnace; during heat exchange between the burden and the gas flow, the reduction reaction takes place, and the iron oxide in the ore is gradually reduced to iron under the action of CO, H2, and other gases. Hot air and fuel are injected from the tuyeres in the upper part of the hearth and react with the solid coke to produce a high-temperature gas flow moving from the bottom to the top, which is finally discharged from the furnace top for recovery. As the burden moves down, the ore is heated and softened by the gradually rising temperature; the hot metal finally drops into the hearth for storage, while the ash and other impurities form the slag floating on the hot metal, realizing their separation. Finally, the hot metal and slag are tapped regularly to ensure enough space in the hearth to maintain the continuity of the smelting process [10][11][12].
Many factors affect the hot metal temperature: the nature of the raw materials, the operation system, the smelting equipment, and the production cycle all influence its variation trend to some extent. The BF operator changes the raw material ratio only when the furnace temperature changes greatly. Since the detection of raw material properties suffered from time asynchrony and seriously missing data, the influence of raw material and fuel properties on the prediction of the hot metal temperature is not considered here [13][14][15].
The operation system controls the hot metal temperature in two parts. In the upper part, the position distribution of ore and coke in the furnace is controlled by adjusting the burden distribution mode, so as to adjust the air permeability in the furnace and ensure the stable downward movement of the charge. In the lower part, the combustion reaction is controlled by adjusting the air supply and injection systems, so that the hearth area maintains a good thermal state [16]. In daily production, the operator mainly controls the hot metal temperature by adjusting the operation system. Therefore, this article selected smelting data of nearly 7,000 heats and 71 parameters covering the various operation systems of the smelting process (including the top state, air supply state, injection state, etc.).
Data processing

Data collation
The principle of data collation is to organize data into neat, clear, and easy-to-use data forms.

Data classification
BF smelting is a production process in which multiple operating systems cooperate. Because all the steel plant's data were stored in one database, they were messy and unclear. We organized the data by subject, putting parameters belonging to the same process together.

Data alignment
First, the collected data were integrated by heat. Since the selected input parameters were recorded at a one-minute frequency (there was no corresponding dataset at the per-heat frequency), all data had to be aggregated by heat. Taking the collection time of the hot metal temperature as the node, all values of each selected parameter between the previous tap and the current tap were averaged to reflect the mean level of the parameter over that period. A new dataset was then formed by aligning the processed parameters with the hot metal temperature according to the heat. The batch number had to be handled separately: because it was recorded as a cumulative count, the cumulative value at the previous tap was subtracted from that at the current tap to obtain the number of batches charged during the heat.
Second, repeated measurement points were integrated. Because there may be more than one detection point for a given position in the BF (Table 1), these monitoring points were combined into a new variable representing the multiple points.
After treatment, the number of parameters was reduced from 71 to 52.
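The per-heat aggregation and batch-number differencing described above can be sketched in pandas; the column names, timestamps, and values here are hypothetical stand-ins for the plant's tags:

```python
import pandas as pd

# Minute-level sensor log (hypothetical column names; the real tags differ).
minute = pd.DataFrame({
    "time": pd.date_range("2018-01-01 00:00", periods=12, freq="min"),
    "hot_blast_temp": [1180, 1182, 1181, 1183, 1185, 1184,
                       1186, 1185, 1187, 1188, 1186, 1187],
    "cum_batches": [100, 100, 101, 101, 102, 102,
                    103, 103, 104, 104, 105, 105],
})

# Tapping (heat) boundaries: each heat spans from the previous tap to its own.
taps = pd.to_datetime(["2018-01-01 00:05", "2018-01-01 00:11"])

heats = []
prev = minute["time"].iloc[0]
for i, tap in enumerate(taps):
    window = minute[(minute["time"] > prev) & (minute["time"] <= tap)]
    heats.append({
        "heat": i + 1,
        # continuous parameters: average over the inter-tap window
        "hot_blast_temp": window["hot_blast_temp"].mean(),
        # batch number: difference of the cumulative counter between two taps
        "batches": window["cum_batches"].iloc[-1] - window["cum_batches"].iloc[0],
    })
    prev = tap

per_heat = pd.DataFrame(heats)
print(per_heat)
```

The same loop would be applied to every minute-level parameter to build the per-heat dataset aligned with the measured hot metal temperature.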

Data cleaning
Data cleaning is an essential step in data preprocessing. It mainly uses mathematical methods to remove redundant data, modify abnormal data, and fill in vacancy data.

Deduplication
Part of the duplicate data was generated by repeated entry caused by mechanical failure or manual input; the rest was caused by duplicate selections and joins made by the software when tables were connected in the database. Duplicate data are generally deleted directly.

Handling of abnormal data
Abnormal data refer to data points that deviate obviously from the overall data distribution, also known as outliers. Such data were usually caused by equipment failure. Since the dataset used in the experiment was a large sample, data with large errors beyond the 3σ range could be marked using the Pauta criterion [17,18]. (The repeated measuring points listed in Table 1, e.g., the 1#–4# upper and lower differential pressures, were each combined into a single variable by taking the arithmetic mean.) The correspondence between the actual physical meaning and the parameter name is shown in Table 2. As seen in Figure 1, most parameters contain abnormal data in unequal amounts, so correction was needed; the specific correction method was combined with the handling of missing values.
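The 3σ marking under the Pauta criterion can be sketched as follows (synthetic data; the real series are the per-heat parameters):

```python
import numpy as np

def mark_outliers_3sigma(x):
    """Flag points outside mean ± 3σ (the Pauta criterion)."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > 3 * sigma

data = np.array([1475.0] * 30 + [1900.0])  # one obvious sensor spike
mask = mark_outliers_3sigma(data)
print(mask.sum())  # number of flagged points
```

Flagged points are then treated as missing and filled by the interpolation described in the next subsection.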

Processing of missing value
There are two reasons for the missing value: one is mechanical reason, which is caused by mechanical failure in data collection; the other is human reason, which is caused by human subjective error or intentional act. BF smelting is a continuous production process.
To ensure continuity in time, abnormal and missing data are usually filled in rather than deleted outright. Lagrange interpolation is a common filling method in regression: it connects two or more non-null points and fits the vacant value according to the constructed function. The most commonly used variants are linear and quadratic interpolation [19]. In this article, linear interpolation is selected: the two valid points adjacent to the missing value on the left and right are connected by a straight line, and the missing value is read off that line.
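A minimal sketch of the linear filling step, using pandas' built-in linear interpolation on an illustrative series:

```python
import numpy as np
import pandas as pd

# A per-heat series with gaps left by deleted outliers (values are illustrative).
temp = pd.Series([1470.0, np.nan, 1478.0, 1480.0, np.nan, 1484.0])

# Straight line through the nearest valid neighbours on each side of the gap.
filled = temp.interpolate(method="linear")
print(filled.tolist())
```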

Parameter selection
BF smelting process parameters consist of state parameters and material flow parameters. The state parameters mainly refer to the equipment state parameters and smelting state parameters of the whole BF. The material flow parameters mainly include the quantity and attribute parameters of all input and output materials of the BF. In the modeling process, a reasonable selection of input parameters can maximize the efficiency of the model, reduce the computation time, and improve the prediction accuracy.

Data correlation analysis
The Pearson correlation coefficient (R) is a linear correlation coefficient and the one most commonly used to describe the degree of correlation between parameters [20]. It ranges from −1 to 1: a value greater than 0 indicates positive correlation, a value less than 0 indicates negative correlation, and the larger the absolute value, the stronger the correlation. Figure 2 shows the heat map of the Pearson correlation coefficients between parameters; the darker the color, the stronger the positive correlation, and the lighter the color, the stronger the negative correlation.
When the Pearson correlation coefficient falls in the high-correlation region, that is, when 0.9 ≤ |r| ≤ 1, the degree of information redundancy between parameters is very high and needs to be processed.
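A minimal sketch of this redundancy screen: compute the Pearson matrix and flag off-diagonal pairs with |r| ≥ 0.9 (the columns are synthetic stand-ins; the real data has 52 parameters):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "cold_blast_pressure": base,
    "hot_blast_pressure": base * 0.98 + rng.normal(scale=0.05, size=200),  # near-duplicate
    "top_temperature": rng.normal(size=200),  # unrelated
})

corr = df.corr(method="pearson").abs()
# Flag any off-diagonal pair with |r| >= 0.9 as redundant.
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] >= 0.9]
print(redundant)
```

One member of each flagged pair would then be dropped before feature selection.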

Feature selection

Importance analysis
Importance analysis calculates the contribution of each parameter to the target parameter according to the algorithm and sorts the parameters by score. It is a method of selecting parameters directly through the internal algorithm of the model [21]. In this way, we obtain the sensitivity relationship between each variable and the target value; the target value is more affected by the top-ranked variables (i.e., the target value is more sensitive to them). Figure 3 shows the feature importance ranking obtained using a decision tree-based gradient boosting algorithm. The features, listed in ascending order of importance score, are: tuyere area, depth of taphole, number of iron notch, batch number, cooling water flow of furnace top, cold blast pressure, soft water supply loop pressure, hot blast pressure, pressure of high-pressure water main pipe, inlet water temperature, pressure of main water supply pipe of bottom cooling water, flow of high-pressure water in main pipeline (No. 1564), top pressure, high-pressure water pressure, valve seat temperature, nitrogen flow, top cooling water temperature, oxygen pipe temperature, pressure of air cooling main in front of furnace, gas main pipe pressure, nitrogen pressure, ventilating index, top temperature, hot blast temperature, oxygen pressure and hot blast pressure difference, flow of nitrogen in main pipe, upper differential pressure, blast kinetic energy, gas utilization rate, CO2 online analysis, flow of high-pressure water in main pipeline (No. 1563), online analysis of gas, top cooling water return temperature, production of BF gas, heat load, oxygen-enriched flow rate, lower differential pressure, BF load, total soft water flow of BF, and top cooling water pressure.
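The importance ranking can be reproduced in outline with scikit-learn's decision tree-based gradient boosting; the features and target here are synthetic stand-ins, not the plant's tags:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
# Target driven mostly by column 0, weakly by column 1 (synthetic stand-in).
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]
print(ranking)  # most important feature index first
```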

Determination of the number of optimum features
The forward selection method, based on recursive feature elimination (RFE) and cross-validation, is used to select parameters by combining the importance of the feature parameters with the model prediction accuracy. RFE selects the best-scoring feature through repeated modeling, then repeats the process on the remaining features until every feature has been tested at least once; the features are then sorted by score to select the optimal feature subset [22]. The number of optimum features can be found by combining RFE with cross-validation. Cross-validation is a cyclic iteration method based on splitting the sample set: the sample data are repeatedly divided into different training and test sets, with the training set used to train the model and the test set used to evaluate the quality of its predictions [23]. In this way, multiple different training and test sets are obtained, and a sample in one training set may appear in a test set in another round, hence the "cross." The sum of the squared errors of each sample after testing is collected.
It can be seen from Figure 4 that, as parameters are added, the score of the model gradually rises and then stabilizes. The cross-validation score was highest when the feature subset contained 33 features, so 33 features were retained. The classification and summary of the model input parameters are shown in Table 3.
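The procedure can be sketched with scikit-learn's RFECV on synthetic data, assuming a tree-based estimator as in the text:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeRegressor

# 10 candidate features, only 3 informative (synthetic stand-in for the 52 BF tags).
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

# Drop one feature per round, scoring each subset by 5-fold cross-validation.
selector = RFECV(DecisionTreeRegressor(random_state=0), step=1, cv=5)
selector.fit(X, y)
print(selector.n_features_)   # size of the best-scoring feature subset
print(selector.support_)      # boolean mask over the original features
```

On the real 52-parameter dataset, this is the step that identified the best-scoring subset of 33 features.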

Model construction
The most intuitive effect of a good model is improved prediction accuracy and reduced prediction time. The AdaBoost ensemble tree algorithm was used to construct the prediction model. The AdaBoost model has good adaptability: through continuous training and autonomous re-weighting, the error is minimized to achieve a good prediction effect. However, the AdaBoost ensemble tree model is sensitive to outlier samples. To compensate for this major defect, the K-means clustering method was introduced based on the idea of classification. K-means clustering was used to group the hot metal temperature into highly similar classes, eliminating the differences caused by outliers and further improving the prediction accuracy and hit rate.
In the process of model construction, random sampling was first used to divide the samples into a training set and a test set at a ratio of 9 to 1. Then, two different sets of input parameters (the 33 parameters obtained by parameter screening, and the 34 parameters derived by adding the K-means clustering result tag) were used to construct hot metal temperature prediction models based on the AdaBoost ensemble tree algorithm, with the hyperparameters optimized by a combination of grid search and cross-validation. Finally, the performance of the two models was verified and compared.
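The modelling pipeline above (9:1 split, AdaBoost ensemble trees, grid search with cross-validation) can be sketched as follows on synthetic data; the hyperparameter grid is illustrative, not the one in Table 4:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] * 10 + X[:, 1] ** 2 + rng.normal(scale=0.5, size=400)

# 9:1 random split, as in the modelling step described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

# Grid search + cross-validation over an illustrative hyperparameter grid.
grid = GridSearchCV(
    AdaBoostRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "learning_rate": [0.5, 1.0]},
    cv=3,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
print(round(grid.score(X_te, y_te), 3))  # R² on the held-out 10%
```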

Algorithm introduction
AdaBoost, an abbreviation of "adaptive boosting," was proposed in 1995 [24,25]. Its adaptivity lies in the following: when a sample is misclassified, its weight is increased, and the re-weighted samples are used to train the classifier again. New weak classifiers are added until the predetermined small error rate or the predetermined maximum number of iterations is reached.

Algorithm principle
The whole AdaBoost iterative algorithm is roughly divided into three steps: (1) initialize the weight distribution of the training data, giving each training sample the same weight at the beginning; (2) train a weak classifier on the weighted samples, compute its weighted error, and increase the weights of the misclassified samples for the next round; (3) combine the weak classifiers into a strong classifier. The core formula of the algorithm is

H_final(x) = sign(∑_t α_t H_t(x)),

where H_final is the final strong classifier, α_t is the weight of the t-th weak classifier, and H_t is the basic classifier.
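As a toy numerical illustration of the final combination H_final(x) = sign(∑_t α_t H_t(x)), with hypothetical stump classifiers and weights:

```python
import numpy as np

# Toy illustration of the final strong classifier H_final(x) = sign(sum_t a_t * H_t(x)).
x = np.array([0.2, -1.0, 3.0])

# Three hypothetical weak classifiers (decision stumps) and their weights a_t;
# in real AdaBoost both come out of the iterative training loop.
weak = [lambda v: np.sign(v - 0.5),
        lambda v: np.sign(v + 0.1),
        lambda v: np.sign(v - 1.0)]
alpha = [0.42, 0.65, 0.35]

H_final = np.sign(sum(a * h(x) for a, h in zip(alpha, weak)))
print(H_final)
```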

K-means algorithm
The AdaBoost ensemble tree algorithm cannot capture discrete values effectively. Considering that the hot metal temperature fluctuates greatly over the whole cycle, and in order to improve the prediction accuracy and capture its change trend more effectively, the temperature was first processed by K-means clustering, and mining and learning were then carried out according to the characteristics of the different categories of data.
K-means clustering is one of the most typical and commonly used clustering algorithms; it is effective, fast, and simple. Its main idea is to obtain the final clustering result by specifying the number of cluster centers in advance and iterating until the error value of the objective function converges. The specific steps [26,27] are as follows: (1) randomly select K objects, each representing the initial value of a cluster center; (2) calculate the distance between the remaining objects and the cluster centers and assign each object to its nearest center; (3) recalculate the mean of each cluster and update the cluster center; (4) repeat steps 2 and 3 until the cluster centers no longer change.
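Steps (1)-(4) can be sketched directly for one-dimensional temperature data (the sample points and initial centers below are illustrative):

```python
import numpy as np

def kmeans_1d(points, centers, max_iter=100):
    """Plain K-means following steps (1)-(4) above, for 1-D data."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # (2) assign each point to its nearest center
        labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
        # (3) recompute each center as the mean of its cluster
        new = np.array([points[labels == k].mean() for k in range(len(centers))])
        if np.allclose(new, centers):  # (4) stop when centers no longer move
            break
        centers = new
    return centers, labels

pts = [1448, 1451, 1452, 1473, 1476, 1474, 1489, 1491]
centers, labels = kmeans_1d(pts, [1450, 1475, 1490])
print(centers, labels)
```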

Hot metal temperature clustering
The hot metal temperature data of 6,702 heats from 2017 to 2018 were collected and clustered by K-means clustering method.
Three clusters were used, with initial cluster centers of 1,450, 1,475, and 1,490°C. The clustering results are shown in Figure 5: the numbers of samples in the clusters centered at 1,450, 1,475, and 1,490°C are 4,740, 1,761, and 205, respectively. By comparison, the cluster centered at 1,490°C contains relatively few samples. The K-means result is almost the same as that of the prior analysis: when the hot metal temperature is between 1,468 and 1,482°C, the state can be regarded as stable and normal; above 1,482°C it is considered high, and below 1,468°C it is considered low. The final cluster centers are 1,460, 1,472, and 1,480°C.
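A sketch of this clustering step with scikit-learn, seeding K-means at the prior centers named above; the temperature samples are synthetic, with cluster sizes mirroring those reported:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the per-heat hot metal temperatures (°C).
rng = np.random.default_rng(1)
temps = np.concatenate([rng.normal(1450, 6, 4740),
                        rng.normal(1475, 4, 1761),
                        rng.normal(1490, 3, 205)]).reshape(-1, 1)

# Three clusters, seeded at the prior centers named in the text.
km = KMeans(n_clusters=3, init=np.array([[1450.0], [1475.0], [1490.0]]),
            n_init=1).fit(temps)
labels = km.labels_  # this label becomes the derived 34th input feature
print(np.sort(km.cluster_centers_.ravel()).round(0))
```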

AdaBoost prediction model based on K-means clustering results
The 33 variables determined in the parameter screening process and the 34 input variables obtained by adding the clustering result tag were used as input parameters, and the hot metal temperature prediction model was established using the AdaBoost ensemble algorithm. Grid search combined with cross-validation was used to optimize the model hyperparameters and tune the forecast model to its best performance. Table 4 shows the optimal hyperparameter set obtained. Figure 6 compares the predicted results and actual values of the plain AdaBoost model; Figure 7 shows the prediction model built with the AdaBoost algorithm after the temperature range of the hot metal had been determined.
It can be seen from Figures 6 and 7 that the predictions of the AdaBoost ensemble tree model based on the K-means clustering algorithm are closer to the actual values than those of the model using AdaBoost alone: the former not only predicts the trend of the hot metal temperature but also captures the data points with large fluctuations. Figure 8 shows the change in the prediction error of the AdaBoost model after adding the K-means clustering results. After adding the clustering label, the performance of the model improves significantly and the error is basically stable at about ±5°C. Such performance can play a good guiding role in on-site hot metal temperature monitoring and advance prediction.

Model evaluation
Two models for predicting the hot metal temperature were constructed: one using the AdaBoost ensemble tree alone, and one using the AdaBoost ensemble tree based on the K-means clustering algorithm. This article summarized the prediction hit rate (Table 5) and model evaluation indices (Table 6) of the two models to evaluate them comprehensively. Tables 5 and 6 show that the K-means-based AdaBoost ensemble tree model achieves a hit rate of more than 90% at an accuracy of ±5°C and more than 70% at ±3°C; by comparison, the hit rate of the plain AdaBoost ensemble tree model is lower at the same accuracy. This shows that introducing the K-means clustering method successfully eliminated the accuracy loss of the AdaBoost prediction model caused by outliers.
In addition, we compared the i10-index of the two models: 0.8663 for the K-means-based AdaBoost ensemble tree model and 0.6775 for the plain AdaBoost ensemble tree model. The former has not yet reached the ideal range (0.90-1.00) [28]. However, in the iron and steel industry, where the environment is changeable and complex, more attention is paid to the hit rate of the model within a specific error range. Since our model achieves a hit rate of 92.40% at ±5°C, it can play a good guiding role in actual steel production.
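The hit rate used in Tables 5 and 6 (the share of predictions within a given tolerance of the measured temperature) can be computed as follows; the values are illustrative, not the paper's results:

```python
import numpy as np

def hit_rate(y_true, y_pred, tol):
    """Share of predictions within ±tol °C of the measured temperature."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_pred - y_true) <= tol)

y_true = np.array([1470.0, 1475.0, 1480.0, 1468.0, 1490.0])
y_pred = np.array([1472.0, 1479.0, 1481.0, 1461.0, 1488.0])
print(hit_rate(y_true, y_pred, 5))  # within ±5 °C
print(hit_rate(y_true, y_pred, 3))  # within ±3 °C
```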

Future work
With the deep integration of industrial big data, new ironmaking technology, and information automation, the establishment of a high-precision BF ironmaking prediction model and the realization of the white box inside the BF are necessary for the future development of the BF ironmaking industry.
Promoting the digital transformation of the ironmaking process and the intelligent upgrading of BF production requires a full combination of technologies such as big data platforms, artificial intelligence algorithms, and metallurgical process theory. The construction of a big data platform for the steel industry is the foundation: by integrating all the data of the iron and steel industry, we can fully supervise the coordinated operation of the various processes and conduct more accurate operation status analysis and decision optimization. With the help of advanced artificial intelligence algorithms, we can extract the valuable information hidden in the massive historical data of iron and steel enterprises, so as to better guide current production operation and parameter monitoring. In addition, metallurgical process theory and field operation experience are extremely valuable; embedding this knowledge in the big data platform and forecast models of the ironmaking industry will truly ground the digitalization of the iron and steel industry and provide long-term guidance for on-site production.
Due to the limitation of the current monitoring method, the hearth temperature cannot be detected directly. The changing trend of the hearth temperature can only be captured by constructing a high-precision prediction model of hot metal temperature. The future work should focus on the long period optimization of hot metal temperature prediction, in order to achieve stable production, extend the life of the BF, and improve the utilization efficiency of the BF.

Conclusion
(1) After using big data technology for cleaning, alignment, and analysis, 33 BF operation parameters were selected as input features of the model, including valve seat temperature, hot blast pressure, oxygen pressure and hot blast pressure difference, upper differential pressure, hot blast temperature, ventilating index, pressure of air cooling main in front of furnace, online analysis of gas, nitrogen flow, tuyere area, flow of nitrogen in main pipe, upper differential pressure, pressure of main water supply pipe of bottom cooling water, soft water supply loop pressure, flow of high-pressure water in main pipeline (No. 1563), top cooling water supply temperature, top cooling water pressure, oxygen pipe temperature, blast kinetic energy, lower differential pressure, gas utilization rate, CO2 online analysis, permeability index, nitrogen pressure, production of BF gas, gas main pipe pressure, total soft water flow of BF, flow of high-pressure water in main pipeline (No. 1564), heat load, top cooling water return temperature, iron notch, BF load, and oxygen-enriched flow rate.

(2) Comparing the prediction results of the AdaBoost model and the K-means classification-based AdaBoost model showed that the latter has higher accuracy and a better hit rate. At an accuracy of ±5°C, the former achieved only 63.64% accuracy while the latter achieved 92.40%; at ±3°C, the former achieved only 43.07% while the latter still reached a hit rate of 71.98%.

Funding information: Thanks are given for the financial support from the Key Program of the National Natural Science Foundation of China (U1360205) and the Science and Technology Project of the Hebei Education Department (BJ2021099).