Analogy-Based Approaches to Improve Software Project Effort Estimation Accuracy

Abstract Effort estimation plays a pivotal role in software development: a realistic estimate is necessary for the successful completion of a project. However, no standard estimation method is applicable to all projects, so finding the best way to estimate effort is an indispensable task for the project manager. Purely mathematical models perform only moderately well at accurate estimation. We therefore adopt analogy-based effort estimation using soft computing techniques, which rely on historical effort data from successfully completed projects. To improve accuracy, models are generated for clusters of the datasets, on the premise that data within a cluster share similar properties. This paper focuses on the analysis of several techniques for improving effort prediction accuracy. The study begins by analyzing the correlation coefficients of the selected datasets, then examines classification accuracy, clustering accuracy, mean magnitude of relative error and prediction accuracy for several machine learning methods. Finally, a bio-inspired firefly algorithm combined with fuzzy analogy is applied to the datasets to produce good estimation accuracy.


Introduction
The need for software project effort prediction has been increasing for the last 20 years. The predicted effort is used to determine the overall cost and duration of a project. Prediction may err in either direction [5]: both underestimation and overestimation cause problems in a company's business plans, in particular budgeting problems and schedule slippage [24].
The first notion of software effort estimation came with the rule of thumb [13] during the 1950s. In the 1960s, a new approach emerged in the form of expert judgment, in which domain experts applied their prior experience to gauge the effort of a new project [22]. Models based on linear equations and regression analysis were proposed in 1965 [6]. The first automated tool for effort estimation, Interactive Productivity and Quality, was established by IBM researchers [13]. Subsequently, Barry Boehm put forward a new mathematical model based on regression analysis, named COCOMO (COnstructive COst MOdel), which predicts software project effort based on the type of project. He later proposed COCOMO II, an augmented version of COCOMO [5]. Other models, such as Putnam's Software Lifecycle Management [24], the Software Evaluation and Estimation of Resources-Software Estimating Model [6] and Albrecht's Function Point (FP) analysis [1], were also used for effort prediction. Analogy-based estimation (ABE) was introduced in 1997 [27] as a comparative method.
Estimation by analogy is one form of expert judgment; it is also known as top-down estimating and mainly determines the duration needed to finish a project. Analogous estimating uses historical data from similar past projects to estimate the duration or cost of a current project, hence the term analogy.
As a straightforward process, ABE compares the project being estimated with already completed projects on the basis of some measure, and it identifies the closest analogies according to similarity [3,16,20]. It deploys distance measures: a distance measure indicates how close one project is to another. The value of each attribute used for effort estimation is fed into the distance measure to determine how near one object is to another, and similar data objects are then aggregated. The software project effort is estimated from these similar data objects [10,25].
Machine learning methods of estimation have been popular for the last two decades, because machine-learning-based estimation gives more accurate results than the previous two methods [12]. Machine learning uses artificial-intelligence-based techniques to give better results.
Software project effort estimation is especially complicated during the early stages of software development. To provide more accurate results, attribute data from previous projects' effort estimates are taken into consideration, and mining techniques are applied to these attributes to obtain the effort prediction for the current project.
Data mining is predominantly a method of turning raw data into profitable and intelligent information. It offers numerous functionalities [10], one of which is clustering. Clustering pertains to the grouping of data objects. It follows unsupervised learning, where class labels are not used; instead, it generates labels for the data objects. Objects are grouped, or clustered, on the principle of minimizing interclass similarity and maximizing intraclass similarity: once a cluster is formed, all objects within it are similar, while objects from different clusters are dissimilar. Clustering is also known as data segmentation because it partitions large datasets into groups on the basis of similarity [25]. How the clusters produced by different methods improve the accuracy of effort estimation is the core question of this paper.

Related Work
Estimation based on analogy compares the project being estimated with already completed projects based on some measures, mostly distance measures. A distance measure is used to find how closely one project is related to other projects. In the initial stages of software development, software project effort estimation is very difficult. To obtain more accurate results, attribute data from previous projects' effort estimates are taken into consideration, and mining techniques are applied to these attributes to obtain the effort prediction for the current project.
There is scarcely any model that estimates software project effort for all domains and all kinds of applications; new models are proposed on the basis of existing ones. To derive the effort of a new project, analogy-based estimation compares it against completed projects. Khatibi et al. [18] proposed a novel framework combining analogy-based effort estimation and neural networks to improve the accuracy of effort prediction. Humayun and Gang [12] showed that machine learning methods give more accurate effort estimates than traditional estimation methods.
Malathi and Sridhar [21] proposed an approach based on fuzzy logic, linguistic quantifiers and analogy-based reasoning. Their main aim was to enhance the performance of effort estimation in software projects while dealing with numerical and categorical data. Azzeh and Nassif [4] proposed a new method to discover the most suitable set of analogies from dataset characteristics, supporting datasets of different sizes that contain many categorical features. Prabhakar and Dutta [23] presented a comparative study of artificial neural networks (ANN) and support vector machines for predicting software effort.
Araujo et al. [2] presented a multilayer dilation-erosion-linear perceptron (MDELP) model to solve problems in effort estimation, using a hybrid of morphological and linear operators. Kaushik et al. [15] combined a fuzzy inference system with cuckoo optimization (COA-FIS) to achieve improved accuracy in software cost estimation.
According to the studies of Kocaguneli et al. [19], cluster subtrees perform better than cluster supertrees, and the performance of analogy-based effort estimation can be improved by selecting project data from regions with small variance.
A hybrid method was proposed by Khatibi et al. [17] to filter out inconsistent projects and thereby attain higher accuracy in effort estimation. Similar projects were gathered into different clusters through the C-means clustering technique; these clusters comprise the reliable and appropriate projects for estimating development effort and are suitable for use by the ABE and ANN methods. The fuzzy class point (FCP) approach was introduced by Satapathy et al. [26] for evaluating the cost of different software projects. To attain better accuracy, the FCP approach employs various adaptive regression techniques for effort estimation.
Borandag et al. [7] prepared a case study on software size estimation using the MK II FPA (MK II Function Point Analysis) and FP methods to estimate the size of a software product. They had the same software implemented by different developers to study the size estimation process, and the sizes of the developed software were compared.
Yücalar et al. [30] developed a new effort estimation method based on multiple linear regression analysis. They used datasets from 10 software projects developed by 4 well-established software companies in Turkey. The results of the proposed method were compared with the standard Use Case Point method and a simple linear regression-based effort estimation method.

Proposed Work
Finding an accurate effort estimate for a new software project based on historical datasets is a burden for project managers, as there is no model that estimates the effort directly; they have to think in many ways to reach an appropriate estimate. Our work concentrates on how to improve accuracy, and on which techniques, datasets and approaches yield good results. The Cocomo81, Cocomonasa60, Cocomonasa93, DESHARNAIS, ALBRECHT, Kemerer, Miyazaki1 and MAXWELL datasets were selected for our analysis; among them, ALBRECHT and Kemerer are based on FPA. We propose four steps to reach good estimation accuracy: (a) select the classifier; (b) find the best clusters by applying the classifier selected in the first step; (c) perform analogy and optimization together on the best clusters to reach optimal solutions; (d) find the new effort with the help of the optimal solutions.
The diagram in Figure 1 shows the model of our proposed work and presents the four steps for arriving at better estimation accuracy.

Select the Classifier
Here we have applied two classifiers on the selected datasets: multivariate linear regression and deep structured multilayer perceptron.

Multivariate Linear Regression
Linear regression [10] follows the equation of a line, in which the slope and constant are replaced by a weight and a regression coefficient, the x-coordinate by one or more independent variables and the y-coordinate by the predictor variable y. It takes the form

y = b + wx, (1)

where b and w are regression coefficients: b is the y-intercept and w is the slope of the line. These coefficients can be thought of as weights, so expression (1) can be rewritten as y = w_0 + w_1 x. Let D be a training set consisting of |D| tuples of the form (x_1, y_1), (x_2, y_2), …, (x_{|D|}, y_{|D|}). The regression coefficients can then be estimated by

w = Σ_{i=1}^{|D|} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{|D|} (x_i − x̄)²,  b = ȳ − w x̄,

where x̄ is the mean of the x_i and ȳ is the mean of the y_i.
Simple linear regression is based on only one explanatory variable. Multiple linear regression [6] extends it to more than one explanatory variable, while multivariate linear regression further allows more than one predictor (response) variable. We used multivariate linear regression in our work in anticipation that future models may require more than one predictor variable.
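As a brief illustration, the coefficient formulas above can be written out directly, and the extension to several explanatory variables can be handled with ordinary least squares. This is a minimal sketch with made-up data, not the implementation used in our experiments.

```python
import numpy as np

# Simple linear regression y = b + w*x using the closed-form estimates above.
def fit_simple_linear(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xb, yb = x.mean(), y.mean()
    w = np.sum((x - xb) * (y - yb)) / np.sum((x - xb) ** 2)  # slope
    b = yb - w * xb                                          # y-intercept
    return b, w

# With more than one explanatory variable, ordinary least squares applies.
def fit_multiple_linear(X, y):
    X = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef                                # [b, w1, w2, ...]
```

For example, fitting the points (1, 3), (2, 5), (3, 7), (4, 9) with fit_simple_linear recovers b = 1 and w = 2.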

Deep Structured Multilayer Perceptron
A multilayer perceptron (MLP) is a feed-forward ANN model that maps given data to appropriate results. The model is represented as a directed graph in which each layer is a set of nodes and each layer is fully connected to the next. The nodes that receive the input data form the input layer; the one or more nodes that generate or predict the output form the output layer; and the nodes in between form the hidden layers. Except for the input nodes, which only accept input data and pass them on to the next layer, every node applies a nonlinear activation function [11] to produce its output. Training is based on a supervised learning technique called backpropagation, in which errors are propagated backward through the network until appropriate outputs are produced. In deep learning, each node is analyzed with respect to many more parameters.
There are three kinds of layers in this model: an input layer, hidden layers and an output layer. The data are supplied at the input layer, which has one node for each input attribute. The nodes where the output is produced form the output layer, with the number of output nodes representing the number of classes. The nodes between the input layer and the output layer lie in the hidden layers. Each link between nodes carries a weight w (a number), and each node computes a weighted sum of its inputs and thresholds the result with its activation function.
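The weighted-sum-plus-activation computation above can be sketched as a single forward pass. The layer sizes, sigmoid activation and random weights below are illustrative choices, not the configuration of our network.

```python
import numpy as np

def sigmoid(z):
    # Nonlinear activation applied by every non-input node
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)   # hidden layer: weighted sum, then activation
    return W2 @ h + b2         # linear output node (effort is a numeric target)

# Toy example: 4 input attributes, 3 hidden nodes, 1 output node
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
y = mlp_forward(x, W1, b1, W2, b2)   # predicted value, shape (1,)
```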
Figure 2 [9] shows the MLP neural network. Of the two classifiers, the deep structured multilayer perceptron yields better results, so it is selected for the next step, where it is applied to clusters. In our work, we modified each node so that it learns more and more properties for estimation on its own.

Select the Best Clustering Technique
Two types of clustering techniques are analyzed: vector quantized k-means clustering and probabilistic model-based expectation-maximization (EM) clustering.

Vector Quantized k-Means Clustering
The k-means clustering method is the simplest form of clustering [10,25]. Using a partitioning algorithm, it organizes the data into groups, splitting a dataset D of n objects into k partitions or clusters C_1, C_2, C_3, …, C_k, where C_i ⊂ D and C_i ∩ C_j = ∅ for 1 ≤ i, j ≤ k, i ≠ j. Each cluster is represented by its centroid, which can be interpreted as the mean of its objects; hence the name k-means. Initially, k objects are chosen at random as cluster centers (centroids). In each iteration, every object is compared with each centroid using a distance measure, and the object is assigned to the cluster of the centroid at the lowest distance. At the end of each iteration, the mean of each cluster is computed and becomes its new cluster center. The process repeats until there is no change in the cluster centers, that is, until the sum of squared errors

E = Σ_{i=1}^{k} Σ_{o ∈ C_i} |o − c_i|²

between all objects in C_i and the centroid c_i is minimized over all k partitions. Instead of simple k-means, we opted for vector quantized k-means to reach better results.
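The assign-then-recompute loop described above can be sketched in a few lines. This is a plain k-means sketch on toy data, not the vector quantized variant used in our experiments.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k objects at random as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Distance of every object to every centroid (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)          # assign each object to nearest centroid
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):    # stop when cluster centers no longer change
            break
        centroids = new
    return labels, centroids
```

On two well-separated groups of points, the loop converges in a couple of iterations and the returned centroids are the group means.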

Probabilistic Model-Based Expectation-Maximization Algorithm
The EM algorithm is a clustering technique that fits the given data to a mathematical model. It can be viewed as an extension of k-means clustering: in k-means, each object is assigned to exactly one cluster based on a distance measure, whereas in EM each object is assigned to a cluster with a probability of membership. The EM algorithm [10] proceeds as follows. (a) Expectation step: each object o_i is assigned to cluster C_k with probability

P(o_i ∈ C_k) = p(C_k | o_i) = p(C_k) p(o_i | C_k) / p(o_i),

where p(o_i | C_k) = N(m_k, E_k(o_i)) follows the normal distribution around mean m_k with expectation E_k. This step calculates the probability of cluster membership of object o_i, i.e. its expected cluster membership. (b) Maximization step: re-estimate the model parameters, e.g. the cluster means as the membership-weighted averages

m_k = Σ_i o_i P(o_i ∈ C_k) / Σ_i P(o_i ∈ C_k).

This step maximizes the likelihood of the distribution given the data.
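The two steps above can be sketched for the simplest case, a one-dimensional mixture of two Gaussians with equal priors held fixed; the data and initialization choices are illustrative, not the model configuration used in our experiments.

```python
import numpy as np

def em_two_gaussians(x, iters=50):
    # Initialize the two component means at the data extremes
    m = np.array([x.min(), x.max()], float)
    s = np.array([x.std(), x.std()]) + 1e-9
    for _ in range(iters):
        # E-step: soft membership P(o_i in C_k), normalized per object
        dens = np.exp(-(x[:, None] - m) ** 2 / (2 * s ** 2)) / s
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: membership-weighted mean and spread for each component
        m = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        s = np.sqrt((resp * (x[:, None] - m) ** 2).sum(axis=0)
                    / resp.sum(axis=0)) + 1e-9
    return m, resp.argmax(axis=1)   # fitted means and hard cluster labels
```

Run on two well-separated groups of values, the fitted means converge to the group means and the hard labels reproduce the grouping.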
Of the two clustering techniques, EM clustering improves effort estimation accuracy more, based on the selected deep structured multilayer perceptron classifier. So EM clusters are used in the next step, where optimization is performed to generate optimal solutions.

Perform Optimization
Now we have good clusters of a dataset which can improve estimation accuracy. These clusters of data are used for optimization. Here we use fuzzy analogy and firefly optimization.

Original Firefly Algorithm
The firefly algorithm [28,29] is a nature-inspired metaheuristic modeled on the flashing behavior of fireflies, in which less bright fireflies move toward brighter ones. The main feature of a firefly is its attractiveness β, which varies with the distance r between fireflies. It is defined as

β = β_0 e^{−γ r²},

where β_0 is the attractiveness at r = 0, γ is the light absorption coefficient and r is the distance between two fireflies x_i and x_j, defined as the Cartesian distance

r_{i,j} = ‖x_i − x_j‖ = √( Σ_{k=1}^{d} (x_{i,k} − x_{j,k})² ),
where d denotes the number of dimensions. The movement of a firefly is updated with the following equation:

x_i = x_i + β_0 e^{−γ r²_{i,j}} (x_j − x_i) + α (rand − 1/2),

where x_i is the current position of firefly i, β_0 e^{−γ r²_{i,j}} is the firefly's attractiveness, α is the randomization parameter and rand is a random number drawn uniformly between 0 and 1.
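The attractiveness and movement equations above can be sketched as a small minimizer. The sphere function here merely stands in for a real fitness such as MMRE, and the parameter values beta0, gamma and alpha are illustrative choices, not the settings used in our experiments.

```python
import numpy as np

def firefly(objective, dim=2, n=15, iters=60, beta0=1.0, gamma=1.0,
            alpha=0.2, seed=1):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5.0, 5.0, size=(n, dim))   # initial firefly positions
    best_x, best_val = None, np.inf
    for _ in range(iters):
        I = np.array([objective(x) for x in X])  # brightness: lower cost = brighter
        if I.min() < best_val:                   # remember best position seen so far
            best_val, best_x = I.min(), X[I.argmin()].copy()
        for i in range(n):
            for j in range(n):
                if I[j] < I[i]:                  # firefly i moves toward brighter j
                    r2 = float(np.sum((X[i] - X[j]) ** 2))
                    beta = beta0 * np.exp(-gamma * r2)          # attractiveness
                    X[i] = X[i] + beta * (X[j] - X[i]) \
                           + alpha * (rng.random(dim) - 0.5)    # random walk term
        alpha *= 0.97                            # damp the randomization over time
    return best_x, best_val
```

Minimizing the sphere function Σ x² from random starting points, the swarm contracts toward the origin and the best value found keeps improving.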

Fuzzy Analogy and Firefly Optimization
Fuzzy analogy is simply analogy based on fuzzy logic. In analogy-based effort estimation, similar projects are identified from the historical dataset, and these identified projects are used for effort estimation, either by collecting opinions from experts or by applying some mathematical model to the similar projects. The process consists of case identification, case retrieval and case adaptation.
There may be many instances in the historical data, so finding matching cases is difficult. In the fuzzy analogy approach, all data are converted into fuzzy sets by applying fuzzy logic: every variable is converted into a linguistic variable by means of membership functions, so categorical variables can be handled efficiently. Once the fuzzy datasets are ready, our proposed work generates fuzzy rules for the fuzzy dataset. From those rules, optimal rules are generated with the help of the firefly optimization algorithm. In our work, three initial sets of solutions are formed from sets of flies (fuzzy rules), i.e. each solution consists of a set of flies. For each fly, a fitness value is computed; here the fitness value is the mean magnitude of relative error (MMRE). For each solution, the fitness values of all its flies are summed, and the solutions are ranked by this sum, lower being better. The solution with the minimum MMRE value is set aside, and the remaining solutions are updated with other sets of rules. This process is repeated n times until we reach the best optimal solutions.
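The ranking step just described can be sketched as follows. The per-fly MMRE values are illustrative placeholders; generating the fuzzy rules themselves is outside the scope of this sketch.

```python
# Each candidate solution is a set of "flies" (rules); each fly carries an
# MMRE fitness. Solutions are ranked by the sum of their flies' MMRE values,
# and the lowest-sum solution is the one set aside as the current best.
def rank_solutions(solutions):
    # solutions: list of lists of per-fly MMRE values
    totals = [sum(flies) for flies in solutions]
    order = sorted(range(len(solutions)), key=totals.__getitem__)
    return order, totals

# Illustrative: three solutions of two flies each
order, totals = rank_solutions([[0.3, 0.4], [0.1, 0.2], [0.25, 0.25]])
# order[0] indexes the solution with the smallest summed MMRE
```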
Once we reach the optimal solutions, the next step in the fuzzy analogy is the identification of similar cases. This is achieved by finding the distance between projects p_1 and p_2, comparing each individual attribute of p_1 and p_2.
The final step in the fuzzy analogy is case adaptation, in which the estimate for the new project is derived from the effort values of the similar projects.

Results and Discussions
For assessing the performance of the k-means clustering and EM algorithms, eight datasets have been selected from the PROMISE data repository: the Cocomo81, Cocomonasa60, Cocomonasa93, DESHARNAIS, ALBRECHT, Kemerer, Miyazaki1 and MAXWELL datasets. Cocomo81 has 63 instances and 17 attributes (all numeric: 15 effort multipliers, one for Lines of Code (LOC) and one for actual development effort), with no missing attribute values. Cocomonasa60 has 60 instances and 17 attributes (15 discrete, in the range very low to extra high). Cocomonasa93 has 93 instances and 24 attributes, and DESHARNAIS has 81 instances and 12 attributes. ALBRECHT has 24 instances and 8 attributes, Kemerer has 15 instances and 8 attributes, Miyazaki1 has 48 instances and 9 attributes, and MAXWELL has 62 instances and 27 attributes. Among these datasets, ALBRECHT and Kemerer are FPA-based and Miyazaki1 is a COBOL dataset.
Parameters for validation:
- Correlation coefficient: correlation tells how strongly the actual and predicted values are related. It ranges from −1 to 1, where 0 indicates no relation, 1 a very strong linear relation and −1 an inverse linear relation.
- Mean magnitude of relative error (MMRE): many measures exist for the accuracy of effort prediction models, but the most commonly used is the MMRE:

MMRE = (1/n) Σ_{i=1}^{n} MRE_i,

where MRE_i = |actual effort_i − predicted effort_i| / actual effort_i is the magnitude of relative error. MMRE ≤ 0.25 is the acceptable range [8].
- Prediction (PRED): another measure of estimation accuracy [14]:

PRED(0.25) = k/n,

where k is the number of observations whose MRE is less than or equal to 0.25 and n is the total number of observations.
- Classification accuracy: the percentage of projects classified correctly out of the total number of projects in the dataset; a high value indicates good accuracy.
- Clustering accuracy: the percentage of projects grouped correctly out of the total number of projects in the dataset; a high value indicates good accuracy.

Table 1 shows the results of the four validation parameters for multivariate linear regression effort estimation.
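The MMRE and PRED measures used throughout the tables can be computed directly as follows; the effort values in the example are illustrative, not taken from the tables.

```python
import numpy as np

def mmre(actual, predicted):
    # Mean magnitude of relative error over all observations
    actual = np.asarray(actual, float)
    predicted = np.asarray(predicted, float)
    mre = np.abs(actual - predicted) / actual   # per-observation MRE
    return mre.mean()

def pred(actual, predicted, level=0.25):
    # Fraction of observations whose MRE is at most `level`
    actual = np.asarray(actual, float)
    predicted = np.asarray(predicted, float)
    mre = np.abs(actual - predicted) / actual
    return np.mean(mre <= level)
```

For instance, with actual efforts (100, 200, 400) and predictions (110, 150, 400), the MREs are 0.10, 0.25 and 0.00, giving MMRE ≈ 0.117 and PRED(0.25) = 1.0.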
From Table 1, it is noted that the higher the correlation coefficient, the better the prediction; with better correlation coefficient values, the classification accuracy also increases. Table 2 shows the results of the four validation parameters for deep structured multilayer perceptron effort estimation.
From Table 2, it is likewise noted that the higher the correlation coefficient, the better the prediction and classification accuracy. When the two techniques are compared on MMRE and prediction, however, deep structured multilayer perceptron effort estimation performs better. So in the next step we use the deep structured multilayer perceptron classifier as the estimation model on the clustered data. Table 3 tabulates the validation parameters for deep structured multilayer perceptron effort estimation using vector quantized k-means clusters, and Table 4 shows the same parameters using probabilistic model-based EM clusters.
Comparing the MMRE and prediction values in Tables 3 and 4, probabilistic model-based EM clusters give better prediction accuracy and lower MMRE values than vector quantized k-means clusters. The probabilistic model-based EM clusters are therefore used by firefly optimization and fuzzy analogy for effort estimation; the result of this approach is shown in Table 5. From Table 5 it is evident that the accuracy measures, MMRE and prediction, are much improved over the methods used in steps 1 and 2 of the proposed method. Table 6 tabulates the performance measures of our proposed method against the two existing methods (MDELP and COA-FIS). Figures 3 and 4 show graphically that the MMRE values of the proposed method on the four selected datasets are lower than those of the existing methods, while the prediction values increase. Hence, the proposed method improves the accuracy of effort estimation.

Conclusion
In accordance with the results of several researchers, it is incontestable that no single approach suits software project effort estimation for all domains and all kinds of applications. It is therefore indispensable to use prior project experience to estimate the effort of the current project, and analogy-based effort estimation is one such approach; it can be implemented with machine learning techniques to derive better analogies. In this paper, we emphasized two learning approaches, classification and clustering. Of the two clustering methods, EM clusters outperform vector quantized k-means clusters on MMRE and prediction values, so EM clusters were fed into fuzzy analogy and firefly optimization to obtain the optimal solutions, from which the effort of the new project is derived with good accuracy. Analyzing which optimization technique suits which type of domain dataset is left for future work, as is the consideration of other clustering and classification techniques. It would also be better if data pre-processing techniques were applied before the data are clustered and classified.