Opposition Intensity-Based Cuckoo Search Algorithm for Data Privacy Preservation

Abstract Privacy-preserving data mining (PPDM) is an approach that has emerged to address privacy concerns in data mining. The aim of PPDM is to develop data-mining techniques without increasing the risk of misuse of the data used to build those models. Most existing techniques apply some form of transformation to the original data to guarantee privacy preservation. However, these schemes are complex and memory intensive, which restricts their practical use. Hence, this paper develops a novel PPDM technique that involves two phases, namely, data sanitization and data restoration. The association rules are first extracted from the database before proceeding with the two phases. In both the sanitization and restoration processes, key extraction plays a major role, and the key is selected optimally using the Opposition Intensity-based Cuckoo Search Algorithm, a modified form of the Cuckoo Search Algorithm. Four research issues, namely, hiding failure rate, information preservation rate, false rule generation, and degree of modification, are minimized using the adopted sanitization and restoration processes.


Introduction
Nowadays, the quantity of information produced and exchanged among governments, institutions, and other organizations has risen sharply [1,13]. In addition, with the rapid improvement of data-mining (DM) tools, hidden relationships among the data can be exposed with ease to discover user preferences. Consequently, a concern has arisen that the knowledge extracted by DM schemes may expose private, sensitive, or confidential information (e.g. home addresses, permanent account numbers, credit card numbers, and social security numbers).
To deal with these issues, methods for privacy-preserving data mining (PPDM) [6,15,18,24] were introduced. They perturb the database to sanitize it; that is, PPDM approaches transform the data to maintain privacy. PPDM [12,25,30] offers privacy throughout the mining stage, and it must also address the privacy problems of data pre-processing and post-processing. It deals with the issues faced by a person or organization when sensitive data are misused or lost. Therefore, the data need to be modified so that other parties cannot infer the private information. At the same time, the utility of the data has to be preserved [7,26].
Several techniques have been implemented to hide sensitive patterns, such as frequent itemsets and association rules, which emerge in binary databases [8,19,22]. Generally, these techniques delete items or transactions from the original database to decrease the confidence or support of the sensitive patterns during sanitization. Numerous secure protocols have been presented so far for machine learning and data-mining schemes [14,29], covering clustering, decision tree classification, neural networks, association rule mining (ARM), and Bayesian networks [2,4]. The major concern of these schemes is to preserve the sensitive data [20,23] of the parties while they obtain valuable knowledge from the entire dataset. A central task in DM [5,16] is finding frequent itemsets, and the association rules derived from them [21] are exploited in numerous areas. The majority of PPDM [9,11] schemes apply a transformation that reduces the utility of the underlying data when data-mining algorithms are applied to it. However, these schemes cannot balance accuracy and privacy efficiently: "when one is preserved, the other appears to suffer" [10,28].
This paper contributes a secure PPDM scheme comprising two phases, namely, data sanitization and data restoration, which are carried out after extracting association rules from the original database. In both phases, key extraction plays a major role, and the key is selected optimally using the Opposition Intensity-based Cuckoo Search Algorithm (OI-CSA). Four research challenges, namely, hiding failure (HF) rate, information preservation (IP) rate, false rule generation (FR), and degree of modification (DM), are minimized using the proposed sanitization and restoration processes. In addition, the proposed OI-CSA model is compared with traditional algorithms such as Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Differential Evolution (DE), and Cuckoo Search Algorithm (CSA). The paper is organized as follows. Section 2 reviews the related works. Section 3 presents the objectives and the proposed model for data sanitization and restoration. Section 4 explains the key encoding and the improved optimization algorithm. Section 5 presents the results, and Section 6 concludes the paper.

Related Works
In 2018, Afzali and Mohammadi [1] suggested a data anonymization scheme suited to big data mining. The distinctive characteristics of big data make it essential to treat every rule as an association rule with an appropriate degree. In addition, the proposed method was compared with conventional methods and was shown to speed up the DM process.
In 2016, Li et al. [13] focused on privacy-preserving mining on vertically partitioned databases. To ensure privacy, they designed an efficient homomorphic encryption scheme and a secure comparison scheme. They also proposed a cloud-aided frequent itemset mining solution, which was used to build an association rule mining solution. These solutions are designed for outsourced databases and allow multiple data owners to share their data efficiently and securely without compromising data privacy. They leak less information about the raw data than most existing solutions. Finally, the proposed scheme was compared with other conventional schemes, and the resource utilization at the data owner end was found to be very low.
In 2016, Lin et al. [15] introduced two novel techniques based on the Maximum Sensitive Utility (MSU), which were designed to reduce the side effects of the sanitization procedure for hiding sensitive high-utility itemsets (SHUIs). The introduced schemes remove SHUIs efficiently or lessen their utilities by means of the concepts of minimum and maximum utility.
In 2018, Chamikara et al. [6] presented an efficient data stream perturbation technique, known as P2RoCAl, that provides improved data utility compared with its competitors. Moreover, the classification accuracy on P2RoCAl-perturbed data streams was found to be close to that on the original data streams. Finally, simulation results showed that P2RoCAl offers better resilience against data reconstruction attacks.
In 2016, Tripathi [24] adopted an antidiscrimination approach covering discrimination discovery and prevention in DM. Two kinds of discrimination were considered, namely, direct and indirect discrimination. The former exists when decisions are made depending on sensitive attributes, while the latter exists when decisions are made depending on non-sensitive attributes.
In 2016, Upadhyay et al. [25] implemented a geometric data perturbation (GDP) technique by means of data partitioning and 3D rotations. The attributes were divided into three groups, and every group of attributes was rotated about different pairs of axes. Finally, the experimental assessment showed that the implemented technique delivers good privacy preservation and data utility compared with traditional techniques.
In 2016, Komishani et al. [12] adopted a scheme known as preserving privacy in trajectory data (PPTD), a new methodology for data preservation based on the notion of privacy. In addition, it aims to strike a balance between the conflicting objectives of data privacy and data utility in accordance with the privacy requirements. From the simulation results, PPTD was found to be effective in preserving personalized privacy.
In 2015, Yun and Kim [30] proposed a fast perturbation process based on a tree structure that carries out database perturbation more rapidly to prevent sensitive information from being revealed. In addition, extensive experiments were performed for the proposed method and traditional schemes on both real and synthetic datasets, and the proposed method showed faster runtime than the conventional schemes.
In 2017, Mehta and Rao [18] considered techniques such as k-anonymity, l-diversity, and t-closeness, which are used to de-identify data; however, reidentification is always possible because the data are collected from multiple sources. Moreover, it is difficult to anonymize large data, so the MapReduce technique was adopted to handle large volumes of data. It distributes large data into smaller chunks across multiple nodes. Scalability of privacy-preserving techniques thus becomes a challenging area of research. The authors explored this area and introduced an algorithm named scalable k-anonymization using MapReduce for privacy-preserving big data publishing. Finally, they compared their technique with existing techniques in terms of running time and reported a remarkable improvement.
Table 1 shows the methods, features, and challenges of the conventional privacy-preserving techniques. Fuzzy logic was adopted in [1], which minimizes unwanted side effects and performs well on huge amounts of data. However, information loss was not considered. In addition, ARM was suggested in [13], which provides reduced complexity and achieves a high level of security; however, the utility of the scheme received little consideration. In [15], MSU-MAU and MSU-MIU were proposed, which speed up the evaluations and offer a balance between the generated side effects and preservation. However, they require a more flexible fitness function. Similarly, k-nearest neighbor (kNN) was adopted in [6], which provides better classification of privacy-preserved data and improved accuracy, but the effectiveness of sampling methods was not considered. In [24], ARM was suggested, which provides reduced complexity and achieves a high level of security. However, there was no description of how clustering is carried out.
In addition, GDP was adopted in [25], which offers enhanced privacy along with high variance, but a decrease in variance increases the chance of attacks. PPTD was proposed in [12], which eliminates critical moving points and minimizes attacks. However, PPTD was not considered with multiple sensitive attributes. Finally, the fast perturbation algorithm (FPA) was presented in [30], which avoids privacy breaches and provides better runtime and scalability. However, its combination with web mining was not considered. Therefore, these limitations have to be addressed to improve PPDM techniques effectively in the current research work.

Objective Function
The proposed OI-CSA model aims to attain the objective function for preserving data given by Eq. (1).
In Eq. (1), F_1, F_2, F_3, and F_4 are the objectives, each of which reflects the importance of the corresponding function defined below.
In Eq. (2), F_1 denotes the normalized HF rate and f_1 denotes the HF rate, where max(f_1) is the worst f_1 over all iterations. In Eq. (3), F_2 denotes the normalized IP rate, and f_2 denotes the IP rate. In Eq. (4), F_3 denotes the normalized FR rate, and f_3 denotes the FR rate. In Eq. (5), F_4 denotes the normalized DM, and f_4 denotes the DM. Here, the original database is denoted by O, and the sanitized database by O′.
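The normalization equations themselves are not reproduced in this text. Going only by the description of F_1, where max(f_1) is the worst value over all iterations, one plausible form is sketched below. Applying the same form to all four objectives is an assumption, not something the text confirms.

```latex
F_k = \frac{f_k}{\max(f_k)}, \qquad k = 1, 2, 3, 4
```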
The HF rate, denoted by f_1, is defined as the fraction of sensitive rules that still appear in O′, as given by Eq. (6). Here, the count of sensitive rules available in O′ is given by |B′ ∩ SRs|. In Eq. (6), B signifies the association rules produced prior to sanitization, B′ denotes the association rules obtained from O′, and SRs denotes the sensitive rules.
The IP rate, denoted by f_2, is described as "the rate of non-sensitive rules which are concealed in O′". It is the reciprocal of information loss, as given in Eq. (7).
FR, denoted by f_3, is described as "the rate of artificial rules produced in O′", as given by Eq. (8).
DM, denoted by f_4, is the count of modifications carried out in O′ relative to O, as given by Eq. (9), in which dist denotes the Euclidean distance between O and O′.
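The verbal definitions above can be illustrated with a small sketch over rule sets. Because Eqs. (6) to (9) are not reproduced here, the denominators (|SRs|, the number of non-sensitive rules, and |B′|) are assumptions, as are all function names. The second function computes only the quoted rate of concealed non-sensitive rules; how the paper maps that rate to f_2 in Eq. (7) is left open.

```python
import math

def hiding_failure(rules_after, sensitive):
    """Fraction of sensitive rules still present in B' (assumed |B' & SRs| / |SRs|)."""
    return len(rules_after & sensitive) / len(sensitive)

def concealed_nonsensitive_rate(rules_before, rules_after, sensitive):
    """Fraction of non-sensitive rules of B that no longer appear in B'."""
    non_sensitive = rules_before - sensitive
    return len(non_sensitive - rules_after) / len(non_sensitive)

def false_rule_rate(rules_before, rules_after):
    """Fraction of rules in B' that were not in B (assumed |B' - B| / |B'|)."""
    return len(rules_after - rules_before) / len(rules_after)

def degree_of_modification(original, sanitized):
    """Euclidean distance between O and O', both given as lists of rows."""
    return math.sqrt(sum((a - b) ** 2
                         for row_o, row_s in zip(original, sanitized)
                         for a, b in zip(row_o, row_s)))

# Hypothetical toy data: rules are plain strings, databases are 0/1 matrices.
B = {"a->b", "b->c", "a->c", "c->d"}        # rules before sanitization
SRs = {"a->b", "b->c"}                      # sensitive rules
B_prime = {"b->c", "a->c", "d->a"}          # rules after sanitization
O = [[1, 0, 1], [0, 1, 1]]
O_prime = [[1, 1, 1], [0, 1, 0]]

hf = hiding_failure(B_prime, SRs)                       # one of two sensitive rules survives
lost = concealed_nonsensitive_rate(B, B_prime, SRs)     # "c->d" was concealed
fr = false_rule_rate(B, B_prime)                        # "d->a" is artificial
dm = degree_of_modification(O, O_prime)                 # two cells differ
```

The sketch assumes the rule sets involved are non-empty; a production version would need to guard the divisions.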

Proposed Architecture
The overall framework of the adopted OI-CSA scheme is portrayed in Figure 1. The data preservation comprises two major processes, namely, data sanitization and data restoration. In the data sanitization process, a key is generated to protect the sensitive data. The key must hide the sensitive data effectively, and OI-CSA is deployed for this optimal key generation. The authorized person at the receiver side can then restore the sanitized data by using the same key. A symmetric key is used by the sender as well as the receiver so that the data can be sanitized and restored faster.

Sanitization Process
Binarization of O and of the pruned key matrix A_2 is performed during sanitization. The binarized key matrix is then passed to the rule hiding process, in which an XOR operation is performed between it and the binarized form of O, which has identical dimensions, and one is added to the result to generate O′, as specified by Eq. (10). Moreover, the sensitive rules (SRs) and the association rules after sanitization, B′, are obtained from the O′ produced by the sanitization process. Similarly, the association rules prior to sanitization, B, are extracted from O for evaluating the above-mentioned objectives. The structure of the proposed sanitization process is illustrated in Figure 2.
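The XOR-plus-one step described above can be sketched with NumPy. This is a minimal illustration assuming both the database and the key are already 0/1 integer matrices of the same shape. The function names and toy matrices are hypothetical, and the inverse step (subtract one, then XOR with the same key) is included only to show why a symmetric key suffices for restoration.

```python
import numpy as np

def sanitize(o_bin, key_bin):
    """XOR the binarized database with the binarized key, then add one."""
    return np.bitwise_xor(o_bin, key_bin) + 1

def restore(o_sanitized, key_bin):
    """Subtract one, then XOR with the same key to undo the masking."""
    return np.bitwise_xor(o_sanitized - 1, key_bin)

# Hypothetical binarized database (rows are transactions, columns are items).
o_bin = np.array([[1, 0, 1, 1],
                  [0, 1, 1, 0],
                  [1, 1, 0, 0]])
# Hypothetical binarized key matrix with identical dimensions.
key_bin = np.array([[0, 1, 1, 0],
                    [1, 1, 0, 0],
                    [0, 0, 0, 1]])

o_prime = sanitize(o_bin, key_bin)   # entries take the values 1 or 2
o_hat = restore(o_prime, key_bin)    # equals o_bin when the same key is used
```

Since XOR is its own inverse, the round trip is lossless for any key of matching shape; the optimization only decides which key hides the sensitive rules best.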
Figure 2: Sanitization process (blocks: original database, sanitization key, sanitized database).

Key Generation
The key generation includes a solution transformation process, in which the key representation A is converted with the aid of the Khatri-Rao product.
Thus, the key matrix is obtained by the Khatri-Rao product of two identically restructured A_1 matrices, represented as A_1 ⊗ A_1, in which ⊗ indicates the Kronecker product, and its dimensions are further reduced according to the size of the original database. The key generation process based on the Khatri-Rao product thus generates a matrix with dimensions identical to O, which produces A_2 [√M_O × O_max]. Finally, rule hiding is performed to obtain O′ by hiding the sensitive rules. The optimal key is generated by means of the improved Cuckoo Search (CS) algorithm called OI-CSA.
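The expansion of a short key into a database-sized matrix can be sketched as follows. The text does not say exactly how A is restructured into A_1, so this sketch makes several assumptions: A_1 repeats the key in every column, the Khatri-Rao product is taken as the column-wise Kronecker product, and the result is pruned to the first rows so that it matches the database. The names `khatri_rao` and `key_matrix` are hypothetical.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product of a (m x c) and b (n x c), giving (m*n x c)."""
    m, c = a.shape
    n, c2 = b.shape
    if c != c2:
        raise ValueError("both matrices need the same number of columns")
    # result[i, j, col] = a[i, col] * b[j, col]; flatten (i, j) into one row index
    return np.einsum("ic,jc->ijc", a, b).reshape(m * n, c)

def key_matrix(key, n_rows, n_cols):
    """Expand a key vector into an (n_rows x n_cols) matrix.

    Assumption: A_1 holds the key in every column, and the product
    A_1 (x) A_1 is pruned to the first n_rows rows.
    """
    key = np.asarray(key)
    if len(key) ** 2 < n_rows:
        raise ValueError("key too short: need len(key)**2 >= n_rows")
    a1 = np.tile(key.reshape(-1, 1), (1, n_cols))
    return khatri_rao(a1, a1)[:n_rows, :]

n_rows, n_cols = 7, 4              # hypothetical database shape
key = np.array([1, 0, 1])          # key length 3, since 3**2 >= 7
a2 = key_matrix(key, n_rows, n_cols)
```

Under these assumptions a key of length about √M, where M is the number of transactions, is enough to cover the whole database, which keeps the search space for the optimizer small.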

Restoration Process
During the restoration process, the O′ obtained from sanitization and the A_2 from the key generation technique are binarized. The binarized S_d from the binarization block is reduced by one (the unit step). The binarized database and key matrix then undergo an XOR operation following the subtraction, and the restored database is subsequently extracted. The sanitizing key, A_2, is reconstructed by exploiting Eqs. (11), (10), and (1) and the proposed OI-CSA update. It is deployed to generate O′, from which lossless restoration can be carried out by Eq. (12), in which Ô indicates the restored data. The design of the restoration process is given in Figure 3.

Key Encoding: an Improved Optimization Algorithm

Key Encoding
The keys (chromosomes) A used for the sanitization process are encoded for OI-CSA. The keys, ranging from A_1 to A_M, are optimized using OI-CSA, and the optimal key is identified. The solution-encoding process is illustrated in Figure 4. Here, the key length (chromosome length) is assigned as √M′′_O.

Cuckoo Search Algorithm
The CS algorithm is a method based on the reproduction behavior of cuckoos [17]. Cuckoos lay their eggs in the nests of other host birds with the expectation that their chicks will be raised by the foster parents. At times, the host birds find out that the eggs in their nests do not belong to them; in such circumstances, the unfamiliar eggs are pushed out of the nests and abandoned. This approach depends on the following three conditions:
1. Every cuckoo chooses a nest arbitrarily and lays an egg in it.
2. The desired nests with more eggs are considered for the subsequent generation.
3. The number of available host nests is fixed, and a host bird discovers a foreign egg with probability P ∈ [0, 1].
At this point, the representation followed is that every egg in a nest indicates a solution, and every cuckoo can lay only one egg, i.e. one solution. Moreover, no distinction is made among an egg, a cuckoo, and a nest. The objective is to use the new and potentially better cuckoo egg (i.e. solution) to replace a worse solution in the nest. The CS approach is very effective for global optimization problems because it maintains a balance between the global random walk and the local random walk. This balance is adjusted by a switching parameter P ∈ [0, 1]. The local and global random walks are given by Eqs. (13) and (14), respectively. In Eq. (13), X_i^t and X_k^t denote the present positions chosen by random permutation, β indicates the positive step size scaling factor, X_i^(t+1) denotes the subsequent position, s denotes the step size, ⊗ indicates the element-wise product of two vectors, F signifies the Heaviside function, P is the parameter used to switch between the global and local random walks, and ε denotes a random variable drawn from a uniform distribution. In Eq. (14), N(s, τ) indicates the Lévy distribution used to describe the step size of a random walk.
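The two walks can be sketched as follows. This is a minimal illustration of the cuckoo search moves as they are commonly formulated in the literature, not the paper's implementation. The function names, the Lévy exponent of 1.5, the use of Mantegna's algorithm to draw Lévy steps, and the choice of two independently permuted copies of the population for the difference vector are all assumptions.

```python
import math
import numpy as np

def levy_steps(shape, rng, exponent=1.5):
    """Draw heavy-tailed step lengths with Mantegna's algorithm."""
    num = math.gamma(1 + exponent) * math.sin(math.pi * exponent / 2)
    den = math.gamma((1 + exponent) / 2) * exponent * 2 ** ((exponent - 1) / 2)
    sigma_u = (num / den) ** (1 / exponent)
    u = rng.normal(0.0, sigma_u, shape)
    v = rng.normal(0.0, 1.0, shape)
    return u / np.abs(v) ** (1 / exponent)

def local_walk(nests, p_switch, beta, rng):
    """Move each nest along the difference of two randomly permuted nests,
    gated element-wise by the Heaviside function H(P - eps)."""
    n = len(nests)
    x_j = nests[rng.permutation(n)]
    x_k = nests[rng.permutation(n)]
    s = rng.random(nests.shape)               # step size
    eps = rng.random(nests.shape)             # uniform in [0, 1)
    gate = np.heaviside(p_switch - eps, 0.0)  # 1 where eps < P, else 0
    return nests + beta * s * gate * (x_j - x_k)

def global_walk(nests, beta, rng):
    """Perturb every nest with a Levy-distributed step."""
    return nests + beta * levy_steps(nests.shape, rng)

rng = np.random.default_rng(0)
nests = rng.random((5, 3))                    # 5 candidate solutions, 3 dimensions
moved_local = local_walk(nests, p_switch=0.25, beta=0.01, rng=rng)
moved_global = global_walk(nests, beta=0.01, rng=rng)
```

With P set to 0 the gate is closed everywhere and the local walk leaves the population unchanged, which is a convenient sanity check on the switching behavior.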

OI-CSA
The conventional CS algorithm is a simple and efficient global optimization algorithm; however, it cannot be applied directly to solve multimodal optimization problems. Hence, the traditional CS approach is improved by introducing the opposition intensity, denoted by γ, as given in Eq. (15). In Eq. (15), X_i^(w) indicates the worst solution, X_i^t denotes the current solution, and γ varies from 0 to 1.
The fundamental steps of the OI-CSA algorithm are given in Algorithm 1.

Simulation Procedure
The proposed OI-CSA method for privacy preservation of sensitive data was simulated in Java. The analysis was carried out using four datasets, namely, T10, Chess, Retail, and T40. The results were compared with conventional models, namely, the PSO [31], GA [27], DE [32], and CSA [3] algorithms. In addition, the results obtained by varying the γ value are described for the four adopted datasets.

Performance Analysis
The performance analysis of the proposed OI-CSA model for the four datasets is given in Figure 5. From Figure 5A, it can be noted that, for the chess dataset, the presented model is 17.36%, 14%, 12.24%, and 6.99% better in terms of F_1 than the PSO, GA, DE, and CSA algorithms. For the overall cost function F, the suggested scheme is 0.76%, 0.49%, 0.28%, and 0.23% superior to the PSO, GA, DE, and CSA algorithms. From Figure 5B, for the retail dataset, the implemented scheme is 1.89%, 3.02%, 2.64%, and 2.64% better in terms of F_2 than the PSO, GA, DE, and CSA algorithms. For F, the proposed OI-CSA model is likewise 1.89%, 3.02%, 2.64%, and 2.64% superior to the PSO, GA, DE, and CSA algorithms. Similarly, Figure 5C gives the performance analysis for the T40 dataset. From Figure 5D, for the T10 dataset, the F_4 value of the OI-CSA model is 52.38%, 38.98%, and 10.94% better than the GA, DE, and CSA approaches. Also from Figure 5D, for F on the T10 dataset, the implemented scheme is 0.32%, 0.31%, 0.29%, and 0.1% superior to the PSO, GA, DE, and CSA schemes. Therefore, the improvement offered by the proposed OI-CSA model has been substantiated.

Conclusion
The paper has presented a novel PPDM method comprising two phases, namely, data sanitization and data restoration, which are carried out after association rule generation. In both the sanitization and restoration processes, key extraction plays a major role, and the key was chosen optimally by means of OI-CSA.
Four research objectives, namely, HF rate, IP, FR, and DM, were minimized using the adopted sanitization and restoration processes. In addition, the proposed scheme was compared with conventional approaches such as GA, PSO, DE, and CSA, and the best results were attained by the proposed scheme. From the performance analysis, it can be noted that, on the overall cost function for the chess dataset, the proposed model was 0.76%, 0.49%, 0.28%, and 0.23% superior to the PSO, GA, DE, and CSA algorithms. For the retail dataset, the proposed OI-CSA model was 1.89%, 3.02%, 2.64%, and 2.64% superior to the PSO, GA, DE, and CSA algorithms. Thus, the improvement offered by the adopted OI-CSA technique has been confirmed. In future work, a key management technique will be considered in order to secure the keys. Key management is a significant step in protecting the generated key and transferring it to the authenticated receiver. Since this paper considers a general platform of data privacy protection, key management has not been considered. However, appropriate key management protocols and mechanisms need to be chosen based on the applications, communication link, communication protocols, and sender/receiver characteristics.