Histopathological Image Segmentation Using Modified Kernel-Based Fuzzy C-Means and Edge Bridge and Fill Technique

Abstract Histopathological lung cancer segmentation using a region of interest is an emerging research area in the field of health monitoring systems. In this paper, the histopathological images were collected from the Stanford Tissue Microarray Database (TMAD). After image collection, pre-processing was performed using a normalization technique, which enhances the quality of the histopathological image by reducing unwanted noise. After pre-processing, segmentation was carried out using the modified kernel-based fuzzy c-means clustering (KFCM) approach along with the edge bridge and fill technique (EBFT). This is a flexible high-level machine learning technique for localizing objects in a complex template. The experimental results show that the proposed approach segments the normal and abnormal cancer regions, as evaluated by precision, recall, specificity, accuracy, and Jaccard coefficient. The proposed methodology improved the accuracy of lung cancer segmentation by 2.5–5% compared to the existing methods: the deep convolutional neural network (DCNN) and the diffusion-weighted approach.


Introduction
Lung cancer is one of the predominant cancer types and causes more than 1.4 million deaths annually [19]. Generally, there are two types of lung cancer: non-small cell lung cancer (NSCLC), which accounts for 80%-85% of lung cancer patients, and small cell lung cancer (SCLC), which accounts for 10%-15% [17]. The evaluation of microscopic histopathology slides by experienced pathologists is indispensable for establishing the diagnosis and for defining the types and subtypes of lung cancer, including the two major types of NSCLC: adenocarcinoma and squamous cell carcinoma [5,10,23,26]. The distinction of squamous cell carcinoma from adenocarcinoma is essential for chemotherapeutic selection because certain antineoplastic agents are contraindicated for squamous cell carcinoma patients due to decreased efficacy or increased toxicity [11,12]. However, the qualitative evaluation of well-known histopathology patterns (cancer grade classification) is insufficient to predict the survival rate of patients with lung adenocarcinoma, and only the best-characterized histopathology features achieve modest agreement among experienced pathologists [20,24]. Poor cancer differentiation and slide quality are related to diagnostic agreement [6]. Recently, several studies were carried out to describe additional visual features for prognostic detection in patients with lung adenocarcinoma [3,15]. Still, the inter-rater agreement on these features needs considerable improvement [8]. Erroneous or subjective evaluation of the histopathology image leads to a poor therapeutic choice, which results in decreased survival and loss of life quality in several patients [16]. Unsupervised image processing technology shows high accuracy, consistency, and efficiency in histopathology evaluations and also provides decision support for ensuring diagnostic consistency. Automated histopathological analysis is valuable in prognostic determinations of various malignancies [9,22].
In this research, histopathology lung images were collected from the TMAD dataset. After the collection of histopathology images, a new segmentation methodology was implemented along with the edge bridge and fill technique (EBFT) for histopathological lung cancer segmentation. Currently, most segmentation algorithms rely on the Euclidean distance to measure the similarity between data objects. The Euclidean distance is simple and computationally inexpensive, but it is sensitive to outliers and perturbations. With the popularity of the support vector machine, a new direction has appeared: the use of kernel functions. The kernel function projects the data into a high-dimensional space, where the data can be separated more easily. To perform this operation, the kernel trick is adopted, which transforms a linear algorithm into a nonlinear algorithm using a dot product. The kernel-based FCM works well only if the clusters are round shaped, and it performs poorly with non-convex shapes, due to the hard assignments of points to the clusters. To address this concern, a new clustering methodology, the modified kernel-based fuzzy c-means clustering (KFCM), was proposed for histopathological lung cancer segmentation. In this experimental research, the Euclidean distance measure was replaced with the correlation distance measure because it offers several advantages. The distribution of the correlation between random vectors becomes narrowly focused around zero as the dimensionality increases, so the significance of a small correlation grows with increasing dimensionality. Also, it is very effective in capturing the similarity between the patterns of feature vectors. Finally, the segmented output was compared with the existing methodologies by means of precision, recall, specificity, accuracy, and Jaccard coefficient.
This research paper is organized as follows. Section 2 surveys several recent papers on histopathological lung cancer detection. In Section 3, an effective approach (modified KFCM with EBFT) for histopathological lung cancer segmentation is presented. Section 4 shows the quantitative and comparative analysis of proposed and existing segmentation approaches. The conclusion is made in Section 5.

Literature Review
Researchers have suggested several techniques for the detection of histopathological lung cancer. A brief evaluation of some important contributions to the existing literature is presented in this section.
Xing and Yang [21] presented a new segmentation methodology that combines an effective local repulsive balloon snake deformable approach (bottom-up) with a robust selection-based sparse shape approach (top-down) to tackle the segmentation concerns. The developed approach was extensively tested on 62 cases with 6000 tumor cells. The experimental outcome confirms that the developed methodology delivers better segmentation performance than the existing methods. However, with semi-supervised methodologies, the semantic gap between the feature values was maximized, which leads to a poor segmentation rate.
Coudray et al. [4] developed an effective classification methodology, a deep learning convolutional neural network, for classifying whole-slide histopathology images into squamous cell carcinoma and adenocarcinoma. The advantage of the developed methodology was that it was fully automated: no user intervention was needed to segment the normal and cancer tissues, so the time required to detect the cancer cells was very low. Extensive experiments were carried out on real-time histopathological images to demonstrate the robustness of the proposed scheme. The developed technique was less suitable for recognizing inclined histopathology images because of its high response time and limited identification rate.
Zhang et al. [25] developed an effective segmentation methodology, a Gaussian-based hierarchical voting and repulsive balloon approach, for classifying lung cancers as adenocarcinoma and squamous carcinoma. The developed methodology delivers effective histopathological lung cancer segmentation in terms of spatial localization and structural composition. The experiment was carried out on a publicly available database, the Cancer Genome Atlas (TCGA), to validate the accuracy, robustness, and segmentation speed of the developed method. In some cases, the training data required supervised evaluation or manual adjustment, which needs to be automated.
Khosravi et al. [13] presented an effective computational approach based on the convolutional neural network for classifying dissimilar histopathology images across dissimilar types of cancer. In that work, the developed machine learning approach separates the cancer cells from normal tissues and also handles variations in cell appearance. The developed algorithm was tested on different histopathology images with different variations in appearance and shape. The experimental outcome shows that the developed histopathology cancer cell detection methodology was very effective compared to the existing methodologies. The developed methodology would fail when the growth of the cancer cell boundaries was constrained.
Vu et al. [18] developed a new feature discovery methodology, the discriminative feature-oriented dictionary learning (DFDL) approach, for disease grading and classification in histopathology. In that work, the developed dictionary learning approach was tested on three challenging image datasets: the animal diagnostics lab dataset, the cancer genome atlas dataset, and intraductal breast lesion histopathological images. The experimental performance suggested that the developed approach was robust and effective in comparison with existing approaches. On a large dataset, DFDL failed to achieve better segmentation in terms of accuracy.
To overcome the above-mentioned drawbacks and to enhance the performance of histopathological lung cancer segmentation, an unsupervised algorithm (modified KFCM with EBFT) is implemented.

Proposed Methodology
The proposed methodology for segmenting the normal and abnormal regions from a lung histopathological image is divided into three major steps: image collection, image pre-processing, and segmentation (modified KFCM followed by the EBFT). The workflow of the proposed histopathological lung cancer detection system is represented in Figure 1. A brief description of the proposed technique is given below.

Image Collection
In the initial stage of histopathological lung cancer segmentation, histopathological images are taken from the standard benchmark dataset TMAD. This dataset archives 205,161 images for 349 distinct probes on 1488 tissue microarray slides. Of these, 31,306 histopathological images for 68 probes on 125 slides are released to the public. The TMAD incorporates the NCI Thesaurus ontology for probing tissues in the cancer domain. For lung cancer, the TMAD database contains one normal image, 220 adenocarcinoma images, and 68 squamous histopathological images. Sample histopathological images of the normal, adenocarcinoma, and squamous classes are represented in Figure 2A-C. After obtaining the histopathological images, an important step, region of interest (ROI) selection, is carried out on the collected images. The ROI is defined as a subset of a histopathological image or a database identified for a specific purpose. In this work, the size of the selected ROI is 256 × 256, which is one-fourth of the original 1024 × 1024 image along each side (one-sixteenth of its area). Sample ROI-applied histopathological images of the normal, adenocarcinoma, and squamous classes are denoted in Figure 3A-C.
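The ROI step can be illustrated with a short sketch. The snippet below is a minimal example in plain Python, assuming the image is stored as a list of pixel rows. The function name `crop_roi`, the synthetic image, and the ROI position are hypothetical and are not taken from the paper, which does not state how ROI locations were chosen.

```python
def crop_roi(image, top, left, size=256):
    """Return a size x size sub-image whose top-left corner is (top, left)."""
    height, width = len(image), len(image[0])
    if top < 0 or left < 0 or top + size > height or left + size > width:
        raise ValueError("ROI falls outside the image")
    return [row[left:left + size] for row in image[top:top + size]]


# Synthetic 1024 x 1024 grayscale image (pixel value = row index mod 256).
image = [[r % 256] * 1024 for r in range(1024)]

# Hypothetical ROI position near the image centre.
roi = crop_roi(image, top=384, left=384)

# Fraction of the original area covered by the ROI.
area_fraction = (len(roi) * len(roi[0])) / (len(image) * len(image[0]))
```

For a 256 × 256 ROI in a 1024 × 1024 image, `area_fraction` is 1/16, which matches one-fourth of the image along each side.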

Pre-processing of Histopathological Image
The ROI-applied histopathological images are used for pre-processing. In this research, normalization is performed on the histopathological images for enhancement and de-noising. Most histopathological images are captured by imaging equipment, and the collected images mainly contain two kinds of noise: impulse noise and machinery noise (electrical and mechanical noise) [1]. The normalization methodology is effective in reducing impulse and machinery noise and also helps to enhance the image quality significantly.
Normalization modifies the range of pixel intensity values to improve image quality by reducing the noise in histopathology images. Then, the deformations and alterations introduced in the histopathological image by inaccurate image capture are evaluated. Usually, image normalization addresses pixel intensity variations. Because it is a pixel-wise procedure, all pixels of the histopathological image are mapped into a pre-defined range of values. The general formula of the image normalization approach is denoted in equation (1).
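Equation (1) is not reproduced in this text, so the exact normalization used by the authors is not known. As an illustration only, the sketch below assumes the common linear (min-max) form, which rescales every pixel intensity into a pre-defined range; the function name `normalize` and the default range [0, 255] are assumptions.

```python
def normalize(image, new_min=0.0, new_max=255.0):
    """Linearly rescale pixel intensities into [new_min, new_max] (assumed form)."""
    flat = [p for row in image for p in row]
    old_min, old_max = min(flat), max(flat)
    if old_max == old_min:
        # Constant image: every pixel maps to the lower bound.
        return [[new_min for _ in row] for row in image]
    scale = (new_max - new_min) / (old_max - old_min)
    return [[(p - old_min) * scale + new_min for p in row] for row in image]


# Low-contrast 2 x 2 patch stretched to the full range.
patch = [[50, 100], [150, 200]]
stretched = normalize(patch)
```

The operation is pixel-wise, as described above: each output value depends only on the input value at the same position and on the global minimum and maximum.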

Histopathological Image Segmentation
The normalized histopathological images are used for segmentation, and the modified KFCM is applied to segment the normal and abnormal regions of the histopathological images. Generally, image segmentation is the procedure of sub-dividing an image into different regions that are homogeneous with respect to some image features. The aim of image segmentation is to detect and extract a particular region from an image [14]. In this scenario, consider $I$ as an input histopathological image that consists of color values $p_i$ at pixel $i$ ($i = 1, 2, \ldots, N$), with $P = \{p_1, p_2, p_3, \ldots, p_N\} \subset \mathbb{R}^k$ in the $k$-dimensional space. The cluster centers in the histopathological image are denoted as $Q = \{q_1, q_2, \ldots, q_c\}$, where $c$ is a positive integer ($2 < c \ll N$), and $u_{ij}$ is the membership value of pixel $i$ in the $j$-th cluster ($j = 1, 2, \ldots, c$). In the KFCM algorithm, the clusters in the image space are formed by assigning a separate membership value to every pixel for each cluster. The objective function or general equation of KFCM is written in equation (2).
where $m$ is the exponent that regulates the degree of fuzziness, $m > 1$, and $\lVert p_i - q_j \rVert^2$ is the grayscale Euclidean distance between $p_i$ and $q_j$, which is stated in equation (3).
Utilizing the membership function from the alternate optimization, the cluster centers are updated iteratively using equations (4) and (5).
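Equations (2), (4), and (5) are not reproduced in this text. As a point of reference only, the standard fuzzy c-means objective, membership update, and prototype update are written below in the same notation ($p_i$, $q_j$, $u_{ij}$, $m$). This assumes the paper's equations follow the usual FCM formulation and is not a transcription of the original equations.

```latex
% Assumed standard FCM objective, cf. equation (2)
J = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ij}^{m} \, \lVert p_i - q_j \rVert^{2},
\qquad \text{subject to } \sum_{j=1}^{c} u_{ij} = 1 \ \text{for every pixel } i

% Assumed membership update, cf. equation (4)
u_{ij} = \frac{1}{\displaystyle\sum_{k=1}^{c}
         \left( \frac{\lVert p_i - q_j \rVert^{2}}{\lVert p_i - q_k \rVert^{2}} \right)^{\frac{1}{m-1}}}

% Assumed prototype update, cf. equation (5)
q_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, p_i}{\sum_{i=1}^{N} u_{ij}^{m}}
```

Alternating the two updates decreases $J$ monotonically for the Euclidean case, which is why the procedure is described as an alternate optimization.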
The presence of noise is decreased by adding the spatial information of neighboring pixels that is denoted in equation (6).
where $\alpha$ controls the influence of the spatial information, $N_i$ is the set of neighboring pixels around pixel $i$, and $N_r$ is its cardinality. To avoid evaluating the neighborhood term at every iteration, it is replaced by a term computed on $\acute{p}$, a color scale-filtered image, and the Euclidean distance is replaced by the correlation distance measure.
The updated objective function is represented in equation (7). In this research, a modified KFCM is proposed, which calculates the parameter $\eta_j$ at every iteration to replace $\alpha$ for every cluster [7]. The correlation function is used to calculate the parameter value, which is represented in equation (8).
Here, $C$ is a correlation distance measure or correlation function. In general, the identification of $C$ requires a large number of patterns, and many cluster centers are required to find the optimal value of $\eta_j$. To overcome this problem, the spatial context and scale information are combined using a fuzzy factor. The fuzzy factor $F_{ij}$ is included in the objective function of the KFCM, which is stated in equation (9).
Then, the altered fuzzy factor $F'_{ij}$ is derived using equation (10).
This altered fuzzy factor controls the local neighbor relationship and replaces the distance with a correlation function, where $w_{ik}$ denotes the fuzzy factor for pixel $i$, and $1 - C(p_i, q_j)$ denotes the correlation metric function. The segmented histopathological images of the normal, adenocarcinoma, and squamous classes are given in Figure 5A-C.

Pseudo Code of Modified KFCM
In this section, the pseudo code for the modified KFCM is presented. Compared to the conventional KFCM, the modified KFCM changes only step 5. The iteration stops and returns after the missing values are calculated using equation (10).
-Step 1: Fix the number of clusters c and the fuzziness exponent m > 1.
-Step 2: Set p i = 0, if p i is a missing feature.
-Step 3: Update all membership values u ij using equation (4).
-Step 4: Update all prototypes q j using equation (5).
-Step 5: Replace the Euclidean distance with the correlation function C.
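The steps above can be sketched in plain Python. This is a simplified illustration and not the authors' implementation: it assumes the correlation distance is 1 minus the Pearson correlation, applies it in place of the squared Euclidean distance in the standard FCM membership update, keeps the weighted-mean prototype update, and omits the kernel, the spatial term, the fuzzy factor, and the missing-feature handling. The helper names and the farthest-first initialization are hypothetical.

```python
import math


def correlation_distance(a, b):
    """1 - Pearson correlation of two equal-length vectors (assumed form of 1 - C)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if norm == 0.0:
        return 1.0  # constant vector: treat the correlation as zero
    return 1.0 - sum(x * y for x, y in zip(da, db)) / norm


def memberships(points, centers, m):
    """Membership update with the correlation distance in place of the squared norm."""
    power = -1.0 / (m - 1.0)
    u = []
    for p in points:
        # Clamp at a tiny positive value so a zero distance cannot divide by zero.
        inv = [max(correlation_distance(p, q), 1e-12) ** power for q in centers]
        total = sum(inv)
        u.append([v / total for v in inv])
    return u


def farthest_first(points, c):
    """Deterministic initial prototypes (an assumption; the paper does not give one)."""
    centers = [points[0]]
    while len(centers) < c:
        centers.append(max(
            points,
            key=lambda p: min(correlation_distance(p, q) for q in centers)))
    return centers


def modified_fcm(points, c=2, m=2.0, max_iter=100, tol=1e-6):
    """Alternate membership and prototype updates until the prototypes stop moving."""
    centers = farthest_first(points, c)
    dim = len(points[0])
    for _ in range(max_iter):
        u = memberships(points, centers, m)
        new_centers = []
        for j in range(c):
            w = [row[j] ** m for row in u]
            total = sum(w)
            new_centers.append(
                [sum(wi * p[k] for wi, p in zip(w, points)) / total
                 for k in range(dim)])
        shift = max(abs(x - y)
                    for qa, qb in zip(centers, new_centers)
                    for x, y in zip(qa, qb))
        centers = new_centers
        if shift < tol:
            break
    return memberships(points, centers, m), centers


# Toy 3-channel "pixels": three with a rising pattern, three with a falling one.
points = [[1, 2, 3], [2, 4, 6.5], [10, 20, 31],
          [3, 2, 1], [6, 4, 2.2], [30, 21, 10]]
u, centers = modified_fcm(points)
labels = [row.index(max(row)) for row in u]
```

In this toy example the pixels differ strongly in magnitude but share one of two channel patterns, so the correlation distance groups them by pattern rather than by brightness, which is the property the text attributes to the correlation measure.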

Edge Bridge and Fill Technique
After histopathology cell segmentation, the EBFT is utilized to bridge the gaps in the edges of the histopathology lung image, so that the overlapped nuclei and non-nuclei cells can be separated. The step-by-step procedure of the EBFT is described below.
-Initially, the outer boundary line of the overlapped cell is kept, and all the inner portions are removed. The overlapped cell region is connected to the black background region, so the overlapped cell edge needs to be closed before the cell region can be discriminated from the background.
-To close the open-edge portion of the cell region, an iterative thinning (skeletonization) and thickening (dilation) morphology is utilized in this research work. Dilation is carried out with a square structural element or disk of size four. The dilation of histopathology image A by the structuring element B is described using equation (11).
-In this research work, the size of the structural element is 12. All overlapped cells segmented in the histopathology image are shown in Figure 6. To make the segmentation of the overlapped cell more reliable and robust for any cell size, it is better to use a small structural element rather than a large one. In this experimental research, the disk structural element with radius 4 is applied, followed by skeletonization, iteratively three times. This procedure depends on the number of iterations performed after the overlapped cell edge is closed; hence, the maximum limit of iterations does not affect the reliability of the overlapped cell segmentation process.
-The dilated thick line from equation (11) is skeletonized by eliminating the pixels from each side with respect to the central pixels of the line. This procedure makes the boundary lines sharper with fewer gaps. Then, the sharp boundary line is added to the overlapped cell without any gap.
-The holes in the overlapped cells are filled as shown in Figure 6. In the filled cell image, the overlapped nuclei and non-nuclei cells are effectively distinguished.
Labeling is utilized to better distinguish the overlapped nuclei cells and non-nuclei cells. The sample labeled image is denoted in Figure 7.
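The bridging and filling idea can be sketched on a toy binary mask. The example below is an illustration under stated assumptions and not the authors' pipeline: it uses the standard set definition of dilation, $A \oplus B = \{a + b \mid a \in A, b \in B\}$, assumes 4-connectivity for the background flood fill, uses a radius-1 disk instead of the radius-4 disk described above, and omits the skeletonization step. All function names are hypothetical.

```python
from collections import deque


def disk(radius):
    """Offsets (dy, dx) of a disk-shaped structuring element."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def dilate(mask, offsets):
    """Binary dilation: stamp the structuring element at every foreground pixel."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy, dx in offsets:
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        out[yy][xx] = 1
    return out


def fill_holes(mask):
    """Set every background pixel that is not reachable from the image border."""
    h, w = len(mask), len(mask[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            on_border = y in (0, h - 1) or x in (0, w - 1)
            if on_border and not mask[y][x]:
                outside[y][x] = True
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for yy, xx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if (0 <= yy < h and 0 <= xx < w
                    and not mask[yy][xx] and not outside[yy][xx]):
                outside[yy][xx] = True
                queue.append((yy, xx))
    return [[0 if outside[y][x] else 1 for x in range(w)] for y in range(h)]


# Toy cell boundary: a square outline with a one-pixel gap in its top edge.
ring = [[0] * 9 for _ in range(9)]
for i in range(2, 7):
    ring[2][i] = ring[6][i] = ring[i][2] = ring[i][6] = 1
ring[2][4] = 0

leaky = fill_holes(ring)                    # gap still open: interior is not filled
bridged = fill_holes(dilate(ring, disk(1)))  # gap bridged first: interior is filled
```

Without the bridging step the interior stays connected to the background through the gap, so hole filling has no effect on it. After dilation closes the gap, the same fill marks the whole cell interior.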

Experimental Analysis
For experimental simulation, MATLAB (version 2017a) was employed on a PC with a 3.2 GHz i5 processor [2]. To estimate the efficiency of the proposed algorithm, its performance was compared with the diffusion-weighted approach [22] and the DCNN [13] on the TMAD database. The comparison was made by means of precision, recall, specificity, accuracy, and Jaccard coefficient.

Performance Measure
Performance measurement is the regular measurement of outcomes and results, which yields reliable information about the effectiveness and efficiency of the proposed system. It is also the procedure of collecting, analyzing, and reporting information about the performance of a group or individual. The general formulas for calculating the precision, recall, specificity, and accuracy of lung cancer detection are represented in equations (12)-(15).

$$\text{Precision} = \frac{TP}{TP + FP} \times 100 \tag{12}$$

$$\text{Recall} = \frac{TP}{TP + FN} \times 100 \tag{13}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \times 100 \tag{14}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{15}$$

Additionally, for segmentation validation, the Jaccard coefficient is expressed in terms of the TP, FP, and FN counts, which are obtained by matching the segmented result to the ground truth image. The general formula utilized to calculate the Jaccard coefficient is represented in equation (16).

$$\text{Jaccard coefficient} = \frac{TP}{TP + FP + FN} \times 100 \tag{16}$$

where FP denotes false positives, TP denotes true positives, FN denotes false negatives, and TN denotes true negatives.
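As a worked illustration, the five measures can be computed from the four counts as sketched below. The sketch assumes the standard definitions of each measure expressed as percentages (for example, Jaccard = TP / (TP + FP + FN) × 100); the function names and the toy masks are hypothetical.

```python
def confusion_counts(predicted, truth):
    """Count TP, TN, FP, FN for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    tn = sum(1 for p, t in zip(predicted, truth) if not p and not t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    return tp, tn, fp, fn


def percent(numerator, denominator):
    """Ratio as a percentage; returns 0.0 when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else 0.0


def segmentation_metrics(tp, tn, fp, fn):
    """Precision, recall, specificity, accuracy, and Jaccard coefficient in percent."""
    return {
        "precision": percent(tp, tp + fp),
        "recall": percent(tp, tp + fn),
        "specificity": percent(tn, tn + fp),
        "accuracy": percent(tp + tn, tp + tn + fp + fn),
        "jaccard": percent(tp, tp + fp + fn),
    }


# Toy masks: 1 marks a cancer pixel, 0 a non-cancer pixel.
predicted = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
counts = confusion_counts(predicted, truth)
metrics = segmentation_metrics(*counts)
```

Because the Jaccard coefficient ignores true negatives, it is not inflated by large background regions, which is why it is commonly reported alongside accuracy for segmentation.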

Quantitative Analysis on TMAD Dataset
In this experimental analysis, the TMAD dataset is used to compare the performance of the proposed approach with the existing segmentation methods. In Table 1, the proposed method (modified KFCM) and the existing segmentation methods (FCM and KFCM) are evaluated by means of precision, recall, and Jaccard coefficient. The TMAD dataset contains three classes of histopathology lung images: normal, adenocarcinoma, and squamous. Here, the evaluation is performed on one sample image in each class with two random multi-region ROIs. The validation result shows that the proposed method outperformed the existing methodologies in terms of precision, recall, and Jaccard coefficient. For the collected histopathological images, the average precision of the modified KFCM is 68.32%, whereas the existing methodologies FCM and KFCM deliver 56.71% and 58.4%, respectively. The average recall of the proposed technique is 99.22%, and the existing methodologies deliver 96.38% and 97.33%. Similarly, the average Jaccard coefficient of the proposed technique is 67.96%, and the existing methodologies attain 55.55% and 57.82%. The graphical comparison of the average precision, recall, and Jaccard coefficient is denoted in Figures 8 and 9. The standard deviation of the precision is 5.2561 for the proposed technique, and 2.837 and 3.0480 for the existing techniques (FCM and KFCM). The standard deviation of the recall is 0.5178 for the proposed technique, and 1.0200 and 0.1892 for the existing techniques. In Table 2, the existing and proposed methods are evaluated in terms of accuracy and specificity.
In this section, performance evaluation is validated for one sample image in each class. Tables 1 and 2 confirm that the proposed approach performs effectively compared to the existing segmentation methods on the TMAD dataset. The modified KFCM encodes both the local and shape features of the wavelet-transformed histopathology image to improve the segmentation efficiency of lung cancer. The proposed modified KFCM algorithm fuses different pixel information, which represents a different correlation space, to produce a new correlation value. The graphical comparison of the average accuracy and specificity is denoted in Figure 10.
Tables 3 and 4 present the comparative study of the existing work and the proposed work. Yin et al. [22] developed an image analysis chain to study the correlation between tumor cellularity from serial histopathological slides of a resected NSCLC tumor and the diffusion coefficient (D value) calculated from diffusion-weighted magnetic resonance imaging (DWI). On the digitized histological images, color deconvolution along with cell nuclei segmentation was used to determine the cell type and local two-dimensional densities. Then, the DWI sequence information was overlaid with the resected histology using non-invasive imaging modality data and prominent anatomical hallmarks of histology tissue blocks. Finally, spatial tumor cell density and cell number information were determined on the basis of the DWI data. For a sample of 30 histopathological images, the average segmentation accuracy of the modified KFCM is 97.28 ± 0.014%, whereas the existing diffusion-weighted methodology [22] delivers 94.7 ± 0.016%. Similarly, the specificity of the proposed technique is 99.95 ± 0.0075%, and the existing methodology delivers 94.3 ± 0.029%. Compared to the existing method, the proposed methodology improves accuracy by about 2.6 percentage points and specificity by about 5.7 percentage points. The graphical comparison of accuracy and specificity is denoted in Figure 11.
Table 3: Performance comparison of the proposed and existing approaches by means of accuracy and specificity (columns: Method, Cancer stages, Accuracy (%), Specificity (%)).
In Table 4, the proposed method (modified KFCM) and the existing methodology (DCNN [13]) are evaluated by means of precision, recall, and Jaccard coefficient. Khosravi et al. [13] developed numerous computational methodologies on the basis of CNNs and also built a stand-alone pipeline for classifying histopathology images across dissimilar types of cancer. In that paper, the stand-alone pipeline discriminates between two subtypes of lung cancer, five biomarkers of breast cancer, and four biomarkers of bladder cancer. The classification phase includes an ensemble of two algorithms (ResNet and Inception), a basic CNN architecture, and Google's Inception with three training approaches. The average precision of the modified KFCM is 99.21%, whereas the existing DCNN attains a segmentation precision of 92%. Similarly, the Jaccard coefficient of the proposed technique is 93.77%, and the existing methodology achieves 92%.
Compared to the existing method, the proposed methodology shows better segmentation results in terms of precision and Jaccard coefficient. However, the recall of the proposed technique is 68.81%, whereas the existing methodology delivers a recall of 92%. The proposed approach shows a lower recall value because, when multiple ROIs are applied to the input histopathology image, the size of each ROI in pixels decreases, which reduces the performance of the proposed approach. The graphical representation of precision, recall, and Jaccard coefficient is denoted in Figure 12.

Discussion about Proposed Methodology
Histopathology image segmentation plays a major role in cancer diagnosis because it delivers the information needed to separate the non-cancer region from the cancer region. In this experimental research, histopathological image segmentation is carried out for lung cancer, which is one of the growing diseases in the medical field. Previously, several methodologies were developed to distinguish cancer cells from non-cancer cells, but existing research does not concentrate on overlapped cancer and non-cancer cells. It is essential to distinguish the overlapped cancer and non-cancer cells for the best recognition or segmentation. In this research, a new unsupervised machine learning approach was developed for segmenting the overlapped cancer cells from non-cancer cells. The effectiveness of the proposed methodology is shown in Tables 3 and 4. The performance analysis is verified using performance metrics such as precision, recall, specificity, accuracy, and Jaccard coefficient. The accuracy of the proposed methodology is about 2.6 percentage points higher than that of the existing diffusion-weighted approach. Additionally, the precision and the Jaccard coefficient of the proposed methodology are about 7 and 1.8 percentage points higher, respectively, than those of the existing DCNN approach. The proposed segmentation methodology offers several advantages: it assists doctors during surgery, it is cost efficient relative to other existing machine-learning approaches, and it supports earlier detection of lung diseases.

Conclusion
In this research paper, a new texture-based histopathological segmentation methodology is proposed, which is based on the modified KFCM with ROI. In the experiments reported here, the modified KFCM was the most effective of the compared methodologies for histopathological lung cancer segmentation. The modified KFCM, along with the EBFT, is utilized for segmenting the cancer and non-cancer regions in histopathological images, and the proposed methodology combines the advantages of both. The experimental investigation was carried out on a publicly available database (the TMAD dataset) and shows the superiority of the proposed methodology. The modified KFCM scheme delivered effective segmentation performance compared to the other available approaches in histopathological lung cancer detection. The proposed methodology showed an improvement of 2.5-5% in segmentation accuracy compared with the existing methods. In future work, an appropriate classification methodology will be used to classify the cancer and non-cancer regions in the segmented histopathology image.