Employment management system for universities based on improved decision tree

Abstract: With the popularization of higher education, the number of students in colleges and universities is increasing, and responding promptly to the problems students face in finding employment has become a major challenge for university teachers. Because traditional employment management of college graduates makes little use of student information, the quality of employment guidance services is low. To solve this problem, this study proposes a simplified, improved Iterative Dichotomizer 3 (ID3) algorithm based on the correlation coefficient; the algorithm improves the information gain function and simplifies the information entropy formula. The experimental results show that the simplified improved ID3 based on correlation coefficients converges faster than the other two algorithms, starting to converge after only 17 iterations; its loss value, at around 0.12, is also smaller than that of the other algorithms. Its minimum accuracy, precision, recall, and F1-measure for employment status prediction were 86.4%, 76.8%, 72.8%, and 0.82, respectively, all higher than those of the other algorithms. Its running time at a sample size of 80, reported as time complexity, is only 32 ms, which is lower than that of the other algorithms. The simplified and improved ID3 based on correlation coefficients can therefore predict and analyze graduates' employment status accurately and efficiently. The university employment management system proposed in the study achieves efficient, in-depth utilization of graduate information through ID3, assisting university employment decision-makers and providing a reference for the employment guidance of university graduates.


Introduction
With the increase in the number of college graduates, the devaluation of academic qualifications has become a certainty, which has led to huge employment pressure on college graduates. At the same time, due to the influence of economic globalization, the employment choices of college graduates are diversified and blind, so that most graduates lack a sound employment outlook when faced with massive amounts of employment information, which exacerbates employment difficulties [1,2]. To solve these problems, university teachers are required to analyze various types of information about graduates in order to provide employment guidance. However, because the number of graduates is large and the amount of relevant information is even larger, existing data analysis technologies have many shortcomings in big data analysis. Moreover, given the current employment characteristics of graduates, the traditional manual management method is gradually revealing various problems: the manual entry of information makes employment guidance work inefficient and employment information inaccurate. This, coupled with the low utilization of student information by traditional management methods, has led to poor quality of career guidance services. Data mining technology is widely used in various fields because it can rapidly mine implicit information from large-scale data. In university management, data mining techniques have also started to be applied frequently, effectively alleviating the difficulty of statistical queries and the heavy workload of information entry. However, most of these applications concern teaching research in educational administration systems and the employment analysis of college graduates. Common data mining techniques include decision trees, neural networks, regression analysis, clustering, association rules, and Bayesian classification. Among them, decision tree algorithms are widely used for processing discrete data because rules are easy to extract, they run quickly, and they can handle missing data and irrelevant features. As the various types of student data in higher education institutions are discrete, they are suitable for processing with decision trees [3,4]. Common decision tree algorithms include C4.5, Classification and Regression Tree (CART), and Iterative Dichotomizer 3 (ID3). Current management systems in the education field generally combine data mining and text mining methods, but they learn slowly, and when the text set is large, the rule base becomes very large. In addition, this approach is highly sensitive to data and prone to overfitting, making it unsuitable for managing the employment information of graduates. Therefore, to improve the utilization of graduate information and avoid overfitting, it is necessary to develop more suitable data mining methods and employment management systems.
As a foundational decision tree algorithm, ID3 has a clear theory, a simple method, and strong learning ability, and it is widely used in classification, prediction, and rule extraction. However, the traditional ID3 easily falls into local optima and overfits, which leads to poor classification results. Therefore, the study proposes a simplified and improved ID3 algorithm based on a correlation coefficient, which effectively remedies the shortcomings of the traditional ID3 algorithm, applies it to the mining of university graduate data, and establishes an employment management system for colleges and universities. The system applies the improved ID3 algorithm to mine graduates' academic performance, competition records, and other personal data in order to determine the degree of impact of each factor on employment, achieve accurate graduate employment prediction, and provide strong support for the employment guidance of college graduates. The innovation of this study lies, first, in improving the information gain function of the ID3 algorithm and simplifying the information entropy formula. Second, a college graduate employment management system is established to analyze the employment situation of college graduates in terms of gender, major, and competitive ability, providing a reference for employment guidance work. The university employment management system proposed in this study achieves in-depth utilization of graduate information through the improved ID3 algorithm, assisting university employment decision-makers and thus helping to increase the employment rate.
The article is divided into four parts. The first part gives a brief description of the informatization of university management systems and the application of the ID3 algorithm; the second part presents the simplified and improved ID3 based on correlation coefficients; the third part analyzes the experimental results; and the fourth part summarizes the research.

Review of the literature
With the advancement of information technology, universities have gradually started to implement information management, which has greatly reduced their management costs and relieved the pressure on management personnel; management systems for various aspects of university life have also emerged. Xu and Liu proposed a university student results management system based on cloud storage technology to address the long response time and low accuracy of university results query systems. The system reduces the cost of data storage and improves security through the cloud storage system. According to the findings, the data storage time is only 0.5 s, the query response time is only 0.3 μs, and the accuracy rate is over 80% [5]. Fan et al. proposed an information management system based on data mining technology to address the problem of predicting students' learning behavior. The system uses association rules to mine implicit information in students' educational data and predict their likely course choices. In testing, the system had a minimum support of 0.7 [6]. Muhamad and Darwesh proposed a library management system based on radio frequency identification (RFID) technology to improve the quality of, and satisfaction with, library services. It uses RFID technology to locate documents, which improves the speed and accuracy of document search; the borrowing process can also be handled quickly through RFID. The results indicated that the system can quickly locate books that have not been returned to their correct place [7]. Huang et al. proposed a load prediction method based on long short-term memory (LSTM) for the problem of load prediction in university public service management systems. The method predicts performance bottlenecks by mining the relationships between different modules. The results indicated that the method predicts load trends with high accuracy and is more efficient than the other methods [8]. Li proposed an intelligent campus management system based on Internet of Things (IoT) technology for the management of smart campuses. The system uses face recognition terminal hardware with IoT technology as a unified data collection source, manages it in the system backend, and calculates and analyzes the data to obtain valuable campus big data. The results indicated that the system can effectively help teachers and students develop teaching and learning plans, and the user satisfaction score is above 8 [9].
The ID3 algorithm is extensively applied in various fields because of its low computational complexity, its suitability for high-dimensional data, and the fact that constructing a decision tree classifier requires no domain knowledge or parameter settings. Wu et al. proposed an intelligent classification system based on the improved ID3 to address the difficulty distance education systems have in providing personalized instruction. The system can classify learners for personalized instruction [10]. De Guzman et al. proposed path-planning algorithms based on exhaustive data-driven energy models and evolutionary algorithms to address the difficulty traditional data-driven energy models and path-planning algorithms have in describing the motion trajectory of quadcopters. The experimental results show that the maximum difference in accuracy of the energy consumption model remains at 0.6% [11]. Harti et al. proposed a wave prediction method based on ID3 to address the problem of predicting wave patterns in sea areas. The method can predict the size and location of waves from historical data on the rise and fall of the sea surface. According to the findings, the method can classify the sea surface with an accuracy of 88% [12]. Nurkholis et al. proposed a land analysis method based on the ID3 algorithm to address the problem of analyzing land use status. The method enables the sustainability of the land to be analyzed in order to determine its impact on agriculture. According to the findings, its accuracy in analyzing land use status is superior to that of other methods [13]. Pathak et al. put forward a data-mining-based analysis method for the cost and supply chain management of small and medium-sized enterprises during COVID-19. The results show that the complexity of cost management, social and cultural impacts, and economic differences collectively hinder the development of small and medium-sized enterprises. In addition, the risk perception of small enterprises was found to be inaccurate, which led to ineffective cost management strategies and supply chain management during the COVID-19 epidemic [14].
As mentioned above, with the advent of the "Internet+" era, many universities have begun to move from traditional to information-based management, and numerous management systems have appeared, each with its own advantages. However, the development of systems for graduate employment management has lagged behind. Compared with academic management, the employment situation of graduates is difficult to predict accurately because many factors affect it. Most graduate employment management systems only record the employment situation of graduates and do not make full use of their information. The ID3 algorithm is easy to understand and interpret and handles discrete data well, so it can be used to process graduates' historical data. However, the inherent shortcomings of the traditional ID3 algorithm result in a low accuracy rate; therefore, the study proposes a simplified and improved ID3 algorithm based on correlation coefficients in order to predict graduates' employment status accurately. In addition, the study builds a university employment management system on the improved ID3 algorithm. Compared with traditional employment management systems, this system makes full use of the information of previous graduates, provides strong support for employment guidance work, and helps improve the employment rate and employment quality.

A university employment management system based on the enhanced ID3
Among data mining techniques, ID3 is widely used in various fields because of its complete search space, good robustness, and low sensitivity to noise. To fully understand the employment quality and employment rate of university students, the study proposes an employment management system based on ID3 as a way to mine the employment data of university students.

Research on improved decision tree ID3
ID3 is a classic decision tree algorithm that originated from concept learning systems. It uses the rate of decrease of information entropy as the criterion for selecting test attributes. That is, at each node, the attribute with the highest information gain among those not yet used for classification is chosen as the splitting criterion. The process ends when the resulting decision tree classifies the training samples perfectly. Figure 1 illustrates the ID3 flow.
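The attribute selection rule described above can be illustrated with a minimal sketch. It assumes the textbook ID3 definitions of entropy and information gain rather than the improved variant proposed in this study; the function names and the toy records are hypothetical and are not data from the experiments.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of the full set minus the weighted entropy after splitting on attr."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


def best_attribute(rows, labels, attrs):
    """Pick the not-yet-used attribute with the highest information gain."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy records (assumption: illustrative only).
rows = [
    {"major": "CS", "cet": "passed"},
    {"major": "CS", "cet": "failed"},
    {"major": "Math", "cet": "passed"},
    {"major": "Math", "cet": "failed"},
]
labels = ["employed", "unemployed", "employed", "unemployed"]

print(best_attribute(rows, labels, ["major", "cet"]))
```

In this toy set, splitting on `cet` separates the two classes completely while `major` does not, so `cet` is selected at the root.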
As can be seen from Figure 1, after initializing the threshold, the ID3 algorithm creates a single-node tree. If the samples all belong to the same class, the node is labeled with that class and the decision tree is returned; otherwise, the feature set is examined, and if it is empty, the decision tree is returned. Otherwise, the information gain of each feature is calculated; when the maximum gain is smaller than the threshold, the decision tree is returned, and otherwise the data are partitioned according to the values of that feature and the subtrees are built in the same way [15][16][17]. The entropy of the training sample set is given by equation (1). In equation (1), $s_i$ denotes the number of samples in class $i$, $m$ denotes the number of classes, and $p_i$ denotes the probability that a sample belongs to class $i$. The entropy of an attribute of the training sample set is given by equation (2). In equation (2), $A$ denotes an attribute of the training sample set, $v$ denotes the number of values of the attribute, and $d_j$ denotes the number of samples in the subset with $A = a_j$. The information gain of the attribute $A$ is given by equation (3). In equation (3), $\mathrm{Gain}(A)$ denotes the information gain of the attribute $A$, $E(s_1, s_2, \ldots, s_m)$ denotes the entropy of the sample set, and $E(A)$ denotes the information entropy of the attribute $A$. Although the ID3 algorithm has the advantages of fast search speed and a small number of nodes, it easily falls into local optimal solutions, and it suffers from multi-value bias and a weak ability to process continuous data. Therefore, an improved ID3 based on correlation coefficients is proposed. The correlation coefficient between discrete variables can be calculated using the tau-y ($\tau_y$) coefficient method, which is defined in equation (4). In equation (4), $n$ denotes the number of samples, $f$ denotes the conditional frequencies, $F_x$ denotes the marginal frequencies of the variable $x$, $F_y$ denotes the marginal frequencies of the variable $y$, $E_1$ denotes the error in predicting $y$ when the variable $x$ is not known, and $E_2$ denotes the error in predicting $y$ when $x$ is known. The improved formula for calculating the information gain is given in equation (5). In equation (5), $g(D, A)$ represents the improved information gain, $\rho_{ay}$ represents the correlation coefficient between the attribute $A$ and the category $Y$, and $n$ represents the number of values of $A$.
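For reference, the symbol definitions above are consistent with the standard ID3 quantities and with the Goodman-Kruskal tau-y coefficient, which can be written as in the following sketch. Two pieces of notation are assumptions introduced here for completeness: $s$ for the total number of training samples and $s_{ij}$ for the number of class-$i$ samples with $A = a_j$. The study-specific improved gain $g(D, A)$ of equation (5) is not restated, since its exact form cannot be inferred from the definitions alone.

```latex
\begin{align}
E(s_1, s_2, \ldots, s_m) &= -\sum_{i=1}^{m} p_i \log_2 p_i,
  \qquad p_i = \frac{s_i}{s} \\
E(A) &= \sum_{j=1}^{v} \frac{d_j}{s}\, E(s_{1j}, s_{2j}, \ldots, s_{mj}) \\
\mathrm{Gain}(A) &= E(s_1, s_2, \ldots, s_m) - E(A) \\
\tau_y &= \frac{E_1 - E_2}{E_1},
  \qquad E_1 = \sum_{y} \frac{(n - F_y)\, F_y}{n},
  \qquad E_2 = \sum_{x} \sum_{y} \frac{(F_x - f)\, f}{F_x}
\end{align}
```

Under this reading, $\tau_y$ is the proportional reduction in the error of predicting $y$ once $x$ is known, so it lies between 0 (no association) and 1 (perfect prediction).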
By introducing the correlation coefficient, the information gain of attributes with many values but low relevance is effectively reduced, which overcomes the multi-value bias of ID3. In addition, because computing the logarithm in the information entropy is costly, the study simplifies it using Taylor's theorem and the Maclaurin formula [18][19][20]. Taylor's theorem is given in equation (6). In equation (6), $\zeta$ lies between $x_0$ and $x$. Taking $x_0 = 0$ and $\zeta = \theta x$ yields the Maclaurin formula, which is given in equation (7). From it, the approximation in equation (8) is obtained. Equation (8) can be simplified to equation (9). When $x \in (0, 1)$, equation (9) can be rewritten as equation (10). From equation (10), it can be seen that this series of simplifications reduces logarithmic operations to non-logarithmic operations and thus improves the calculation speed. The rewritten formula for calculating the information entropy is shown in equation (11). In equation (11), $n$ represents the number of categories. Since $p_i \in (0, 1)$, substituting equation (10) into equation (11) yields equation (12). To further simplify the information entropy formula and its logarithmic operations, and to effectively mitigate the multi-value bias of the original ID3 algorithm, assume that the attribute $A$ takes $n$ values, that these $n$ values partition the set $D$ into $n$ subsets, and that each subset is further divided into $k$ subsets, where $k$ is the number of categories of $D$. The information entropy is then given by equation (13). In equation (13), $C_m$ and $C_i$ denote the $m$th and $i$th subsets of the set $D$, respectively, and $C_{ij}$ denotes the $j$th subset of the set $C_i$. From equation (13), equation (14) is obtained. Substituting equation (14) into equation (5) simplifies the information gain formula. The simplified formula for the improved information gain replaces the logarithmic operations in the information entropy with non-logarithmic operations, effectively reducing the time complexity.
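The effect of replacing logarithms with arithmetic can be illustrated with a minimal sketch. It assumes one common first-order simplification, $\ln x \approx x - 1$ (the first Maclaurin term of $\ln(1 + u)$ with $u = x - 1$), which gives $-p \log_2 p \approx p(1 - p)/\ln 2$. This is an assumption made for illustration and is not necessarily the exact form of equations (9) to (14); the function names are hypothetical.

```python
from math import log, log2


def exact_entropy(probs):
    """Shannon entropy in bits, computed with logarithms."""
    return sum(-p * log2(p) for p in probs if p > 0)


def approx_entropy(probs):
    """Log-free approximation using ln(p) ~ p - 1 (assumed first-order form).

    The constant 1/ln(2) only rescales the result and could be dropped
    when entropies are compared with each other.
    """
    return sum(p * (1 - p) for p in probs) / log(2)


# Hypothetical class distributions, from pure to perfectly mixed.
for probs in ([1.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, round(exact_entropy(probs), 4), round(approx_entropy(probs), 4))
```

The approximation underestimates the exact entropy for mixed distributions, but in these examples it preserves the ordering from pure to mixed, which is the property attribute selection relies on.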

Design of the university employment management system
As the number of university graduates increases, the amount of student information data grows ever larger, and it becomes difficult to use the potentially valuable information it contains to guide students' employment. To address this problem, the study proposes a university student employment analysis system based on the improved ID3 decision tree algorithm, which helps teachers in their employment guidance work by predicting the employment status of graduates from student learning data. The conceptual model design is the most important step in database design; it is carried out with the E-R model shown in Figure 2.
The E-R model reflects the basic information of the company, the personal information of the students, and the effectiveness of the table structure, as well as the efficiency and results of data mining. Based on the E-R model, a database can be designed by combining various types of information about the graduates. As the ID3 algorithm is applicable to discrete data, continuous data need to be discretized first. The table structure of the basic information of graduates is shown in Table 1.
As can be seen from Table 1, the basic information database records each student's name, place of origin, political affiliation, major, and whether he/she is a class officer. The table structure for course and grade information is given in Table 2, which shows that the course and grade database keeps a detailed record of graduates' subject courses and grades. The structure of the student competition information table is illustrated in Table 3.
As can be seen from Table 3, if a student participates in any competition, the database records data such as the competition information and the awards won. Before the employment information is processed, the employment data of previous years should first be analyzed and its important components mined. The specific process is as follows. The first step is to define the mining object and target; the second step is data preparation, i.e., collecting the various types of student information; the third step is data pre-processing, i.e., discretizing continuous data and removing or repairing dirty data; the fourth step is to establish the data mining model, i.e., constructing a prediction model of graduates' employment; the fifth step is to evaluate the classification rules, i.e., analyzing the prediction results; and the last step is to apply the classification model. Data pre-processing specifically includes three parts: data integration, data cleaning, and data normalization. Data integration brings all kinds of data together. The structure of the employment information summary table is illustrated in Table 4.
Data cleaning is the elimination of noise from valid data and the processing of missing data, as well as the removal of invalid data.Data normalization is mainly the removal of redundant data and the transformation of data with different attributes.
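The cleaning and discretization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the field names (`student_id`, `score`) and the cut-off scores of 75 and 85 are hypothetical, since the text does not specify the thresholds used to map scores to the grades "excellent," "good," and "fair."

```python
def discretize_score(score, good=75, excellent=85):
    """Map a numeric course score to one of three grades (cut-offs are assumed)."""
    if score >= excellent:
        return "excellent"
    if score >= good:
        return "good"
    return "fair"


def preprocess(records):
    """Drop incomplete records, remove duplicates, and discretize scores."""
    seen, cleaned = set(), []
    for rec in records:
        if rec.get("student_id") is None or rec.get("score") is None:
            continue  # data cleaning: skip records with missing fields
        if rec["student_id"] in seen:
            continue  # remove redundant (duplicate) records
        seen.add(rec["student_id"])
        cleaned.append({
            "student_id": rec["student_id"],
            "grade": discretize_score(rec["score"]),
        })
    return cleaned


# Hypothetical raw records (assumption: illustrative only).
raw = [
    {"student_id": 1, "score": 91},
    {"student_id": 2, "score": None},
    {"student_id": 1, "score": 91},
    {"student_id": 3, "score": 78},
    {"student_id": 4, "score": 60},
]
print(preprocess(raw))
```

Keeping the cut-offs as parameters makes it straightforward to test how different discretization choices affect the resulting decision tree.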

Results and analysis
To verify the performance of the employment management system based on the improved ID3, simulation experiments were conducted, and the proposed algorithm was compared with the improved ID3 based on attribute priority values and the improved ID3 based on correction functions. The data of one graduating class of a university served as the test set, and the employment status was categorized into five types: not employed, further education/going abroad, large enterprises, small and medium enterprises, and state-owned enterprises. In the experiment, the adjustment coefficients for advanced mathematics, English, and professional courses were 0.4, 0.3, and 0.35, respectively. With these adjustment coefficients, the information gains for gender, major, class cadre status, professional course grades, basic course grades, practical course grades, competition ability, and passing of CET-4 and CET-6 were 0.0189, 0.0283, 0.0193, 0.0436, 0.0328, 0.0222, 0.0211, and 0.0281, respectively. Taking the postgraduate entrance examination as an example, the influence of different adjustment coefficients on the prediction accuracy is shown in Figure 3.
As shown in Figure 3, as the adjustment coefficient increases, the prediction accuracy first increases and then decreases.When the adjustment coefficients for advanced mathematics, English, and professional courses are 0.4, 0.3, and 0.35, respectively, the prediction accuracy is the highest, at 61.6, 60.2, and 65.9%, respectively.Table 5 illustrates the sample data statistics of the test set.
As can be seen from Table 5, the test set consisted of 951 records covering three specialties. The results of each subject were converted into three grades, "excellent," "good," and "fair"; the CET results were divided into "failed," "passed Level 4," and "passed Level 6"; and competition ability was divided into three levels, "strong," "medium," and "weak," which effectively reduces the number of useless attributes in the data. The convergence of the simplified and improved ID3 based on correlation coefficients, the improved ID3 based on attribute priority values, and the improved ID3 based on correction functions is shown in Figure 4. As can be seen from Figure 4, the improved ID3 algorithm based on the correction function converges after about 25 iterations, with a loss value of about 0.18; the improved ID3 algorithm based on the attribute priority value starts to converge after about 21 iterations, with a loss value of about 0.16; and the simplified improved ID3 based on the correlation coefficient starts to converge after about 17 iterations, with a loss value of about 0.12. These findings demonstrate that the simplified and improved ID3 based on the correlation coefficient converges faster and has a smaller loss value. The prediction accuracy and misjudgment rates of the three improved ID3 algorithms for graduate employment are shown in Figure 5.
From Figure 5(a), the prediction accuracy of the improved ID3 algorithm based on the correction function for the five employment statuses (large enterprises, small and medium enterprises, state-owned enterprises, further studies/going abroad, and pending employment) is about 83.5, 82.1, 87.4, 85.2, and 81.6%, respectively; that of the improved ID3 based on the attribute priority value is about 85.1, 84.3, 87.7, 86.1, and 83.3%, respectively; and that of the simplified improved ID3 algorithm based on correlation coefficients is about 87.8, 86.5, 89.9, 88.7, and 86.4%, respectively. From Figure 5(b), the misjudgment rates of the improved ID3 based on the correction function for the five employment statuses are about 16.5, 17.9, 12.6, 14.8, and 18.4%, respectively; those of the improved ID3 based on the attribute priority values are about 14.9, 15.7, 12.3, 13.9, and 16.7%, respectively; and those of the simplified improved ID3 based on the correlation coefficient are about 12.2, 13.5, 9.1, 11.3, and 13.6%, respectively. The simplified and improved ID3 based on correlation coefficients has the highest prediction accuracy and the lowest misjudgment rate. The precision and recall of the three algorithms are shown in Figure 6. From Figure 6(a), the precision of the improved ID3 algorithm based on the correction function for the five employment statuses is about 76. […] From Figure 6(b), the recall rates of the improved ID3 based on the correction function for the five employment statuses are about 71.3, 70.6, 72.5, 71.3, and 70.7%, respectively; those of the improved ID3 based on the attribute priority value are about 71.8, 71.2, 73.1, 71.9, and 71.45%, respectively; and those of the simplified improved ID3 based on the correlation coefficient are about 73.1, 72.8, 74.2, 73.5, and 73.3%, respectively. These findings demonstrate that the precision and recall of the simplified and improved ID3 based on the correlation coefficient are better than those of the other two algorithms. The F1-measure and time complexity of the three algorithms are shown in Figure 7.
From Figure 7(a), it can be seen that the F1-measure values of the improved ID3 based on the correction function for the five employment statuses are about 0.78, 0.81, 0.79, 0.83, and 0.82, respectively; those of the improved ID3 based on the attribute priority values are about 0.8, 0.82, 0.84, 0.82, and 0.85, respectively; and those of the simplified improved ID3 based on the correlation coefficients are about 0.82, 0.84, 0.86, 0.85, and 0.88, respectively, which are the highest of the three. From Figure 7(b), the running time of all three algorithms, reported as time complexity, increases with the number of samples. When the number of samples is 80, the running times of the three algorithms are about 43, 38, and 32 ms, respectively, with the simplified and improved ID3 based on the correlation coefficient being the fastest. The P-R curves and ROC curves of the three improved ID3 algorithms are shown in Figure 8.
From Figure 8(a), it can be seen that the equilibrium point of the P-R curve of the upgraded ID3 based on the correction function is (0.75, 0.75); the equilibrium point of the P-R curve of the upgraded ID3 based on the attribute priority value is (0.77, 0.77); and the equilibrium point of the simplified upgraded ID3 based on the correlation coefficient is (0.8, 0.8).From Figure 8(b), it can be seen that the area under the ROC curve of the improved ID3 based on the correction function and the attribute priority value is about 0.79 and 0.82, respectively; the area under the ROC curve of the simplified upgraded ID3 based on the correlation coefficient is about 0.87.The above results show that the performance of the simplified upgraded ID3 based on the correlation coefficient is better than the remaining two algorithms.
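The evaluation metrics reported in this section can be computed per employment status from one-vs-rest confusion counts, as in the following minimal sketch. The standard definitions are assumed, with the misjudgment rate taken as one minus the accuracy, which matches the paired values reported for Figure 5; the function name and the counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, misjudgment rate, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "misjudgment_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical one-vs-rest counts for a single employment status.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which provides a quick sanity check when the three values are reported together.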

Conclusion
In recent years, as the number of university graduates has continued to increase, the employment problems of graduates have become more acute, which requires university teachers to provide proper employment guidance. However, the volume of student-related data is very large, so the useful information implied in the data is not fully utilized. To improve the quality and efficiency of employment guidance, the study proposes a university employment management system based on the improved ID3 with a correlation coefficient, which can improve the quality of employment management by making full use of students' historical data. The results indicated that the simplified improved ID3 based on the correlation coefficient starts to converge after about 17 iterations, faster than the other two improved ID3 algorithms; its loss value at this point is about 0.12, which is lower than that of the other algorithms. In the experiments on the prediction of employment status, the accuracy of the simplified and improved ID3 based on the correlation coefficient for the different employment statuses was about 87.8, 86.5, 89.9, 88.7, and 86.4%, respectively; the precision was about 78.4, 76.8, 80.1, 78.2, and 77.3%, respectively; the recall was about 73.1, 72.8, 74.2, 73.5, and 73.3%, respectively; and the F1-measure was around 0.82, 0.84, 0.86, 0.85, and 0.88, respectively. All of these metrics were higher than those of the other algorithms. The misjudgment rates were around 12.2, 13.5, 9.1, 11.3, and 13.6%, respectively, which were lower than those of the other algorithms. With a sample size of 80, the running time and the area under the ROC curve of the simplified and improved ID3 based on the correlation coefficient are 32 ms and 0.87, respectively, which shows that its time cost is small and its overall performance is good compared with the other algorithms. These results show that the simplified and improved ID3 based on correlation coefficients can process the relevant data efficiently and accurately. Although it achieves more accurate employment prediction analysis, its predictions still contain some errors, and data collection is limited by the need to protect students' privacy.

JL contributed to writing (original draft preparation), formal analysis, validation, software, and visualization. YM contributed to writing (review and editing), methodology, and data curation.

Figure 3: Effect of different adjustment coefficients on the prediction accuracy.

Figure 5: Prediction accuracy and misjudgment rate of different algorithms for graduates' employment situation: (a) prediction accuracy of three improved ID3 algorithms and (b) misjudgment rate of three improved ID3 algorithms.

Figure 6: Precision and recall of three algorithms: (a) precision of three improved ID3 algorithms and (b) recall rate of three improved ID3 algorithms.

Figure 7: F1-measure and time complexity of three algorithms: (a) F1-measure of three algorithms and (b) time complexity of three improved ID3 algorithms.

Figure 8: P-R curves and ROC curves of three improved ID3 algorithms: (a) P-R curves of three algorithms and (b) ROC curves of three algorithms.

Table 1: Basic information of graduates

Table 2: Course and grade information

Table 3: Student competition information

Table 4: Summary of employment information

Table 5: Sample data statistics of the test set