Anal intraepithelial neoplasia (AIN) can present in various forms and usually precedes anal carcinoma. Assessing alterations of the epithelium is crucial to derive an accurate estimate of pre-cancerous stages and carcinoma in situ. Physicians inspect biopsy tissue qualitatively to diagnose anorectal disease, either using glass slides and light microscopy or digital whole slide images. AIN is typically classified into four classes that correspond to non-neoplastic tissue (AIN0) and increasing grades of dysplasia (AIN1-AIN3). Important cues for grading are the density, shape, and texture of cell nuclei, as well as the distribution of normal and abnormal cells within the tissue. However, qualitative analysis strongly depends on the observer’s experience and frequently leads to irreproducible results. Since defining clear transitions between grades is not always possible, the inter-observer variability can be considerable. Quantitative image analysis can tackle the drawbacks of subjective assessment by introducing reproducible, validated methods into biomedical diagnostics.
Previous work used fractal image analysis to grade AIN or cervical intraepithelial neoplasia. Fractal dimensions were computed globally for each image, and the value range was segmented by rigidly searching for statistically significant thresholds to separate the classes. In other work, a variety of data mining methods was explored to grade AIN images based on statistical texture features. In order to add significant value to a diagnostic process in terms of reducing time-consuming tasks, a classifier must be able to deal with a great input variance, generalize well, and reliably predict the grading of new images. Two crucial aspects have not been addressed in this previous work: firstly, none of the methods involved building a classifier on training data and evaluating it on novel test data. Secondly, since the authors reported errors on training sets, we cannot assess the ability of their methods to generalize to unseen data.
The objective of this work is to learn a discriminative classification model for predicting the grade of AIN. We show that a support vector machine (SVM) learns a robust classifier on a set of global image features. Our learning-based approach outperforms the previously proposed methods by a large margin on a challenging dataset. Furthermore, we assess the influence of common strategies for data augmentation and class label balancing on the generalization performance.
2 Material and methods
2.1 Histopathology dataset
We used a set of hematoxylin and eosin stained images (n = 136, 749 × 580 pixels) of different AIN grades, cf. Figure 1. An expert pathologist labeled images containing healthy tissue (n0 = 17) as AIN0, and low-grade neoplasia (n1 = 36) as AIN1. High-grade neoplasia was labeled as AIN2 (n2 = 55), and AIN3 (n3 = 28). We assigned 80% of the images in each class to the training, and 20% to a hold-out test set. Further, 20% of the training samples were used as validation set for parameter tuning, before the test set was predicted. This resulted in 84 images in the original training set (10 AIN0 / 22 AIN1 / 35 AIN2 / 17 AIN3), 23 in the validation set (3/6/9/5), and 29 in the test set (4/8/11/6).
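The class-stratified 80/20 split described above can be sketched in a few lines of Python. The ceiling-based per-class rounding is our assumption, chosen because it reproduces the reported 4/8/11/6 test counts; the function name `stratified_split` is our own.

```python
import math
import random
from fractions import Fraction

def stratified_split(labels, test_frac=Fraction(1, 5), seed=0):
    """Split sample indices into train/test so each class keeps its ratio.
    Rounding the per-class test count up (ceiling) is an assumption that
    happens to reproduce the paper's reported test counts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        n_test = math.ceil(len(idxs) * test_frac)  # exact with Fraction
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# Class sizes from the paper: AIN0=17, AIN1=36, AIN2=55, AIN3=28 (n=136)
labels = [0] * 17 + [1] * 36 + [2] * 55 + [3] * 28
train_idx, test_idx = stratified_split(labels)
```

Using an exact `Fraction` for the split ratio avoids floating-point rounding surprises (e.g. `55 * 0.2` is slightly above 11 in binary floating point, which would push the ceiling to 12).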
Image pre-processing and augmentation: We hypothesized that augmentation and balancing of a training set improves the ability of our classifier to generalize. Hence, two additional training sets were created by applying label-preserving elastic deformations and parameterized random intensity variations of the individual channels in HSV color space. The training sets are referred to as (a) for the original, (b) for the augmented, and (c) for the augmented and balanced training set. For (b), we added 30 versions of each image in (a), such that it comprised 2604 images. However, the class label distribution remained unbalanced. For (c), we oversampled the underrepresented classes in (b) to achieve a uniform distribution over all four grades of AIN (4304 images). All images were resized to 512 × 512 pixels by bilinear interpolation, prior to feature extraction.
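Two of the augmentation steps above can be sketched with the standard library alone, assuming RGB values normalized to [0, 1]. The elastic deformation is omitted here because it requires full image-grid warping; the helper names (`jitter_hsv`, `oversample`) are our own, not from the paper's pipeline.

```python
import colorsys
import random

def jitter_hsv(rgb_pixels, max_shift=0.05, seed=None):
    """Label-preserving intensity variation: draw one random offset per
    HSV channel and apply it to every pixel of the image."""
    rng = random.Random(seed)
    dh, ds, dv = (rng.uniform(-max_shift, max_shift) for _ in range(3))
    out = []
    for r, g, b in rgb_pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        h = (h + dh) % 1.0                  # hue wraps around
        s = min(max(s + ds, 0.0), 1.0)      # saturation is clamped
        v = min(max(v + dv, 0.0), 1.0)      # value is clamped
        out.append(colorsys.hsv_to_rgb(h, s, v))
    return out

def oversample(samples, labels, seed=0):
    """Balance class labels by duplicating random minority-class samples
    until every class matches the size of the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for s, lab in zip(samples, labels):
        by_class.setdefault(lab, []).append(s)
    target = max(len(v) for v in by_class.values())
    balanced, balanced_labels = [], []
    for lab in sorted(by_class):
        items = by_class[lab]
        items = items + [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend(items)
        balanced_labels.extend([lab] * target)
    return balanced, balanced_labels
```

Drawing one random offset per channel per image (rather than per pixel) keeps the perturbation global and label-preserving, in the spirit of the parameterized intensity variations described above.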
2.2 Learning AIN grading classifiers
Feature extraction: Since AIN images show randomly oriented tissue, we considered the rotational invariance of global texture features advantageous. Inspired by previous work, we extracted 22 statistical and 304 fractal features to represent the image content. Statistical features consisted of summed pixel values of individual channels in three different color models (RGB, L*a*b*, HSV) and the gray value image (mean RGB), as well as first and second order statistical parameters of the NTSC luminance gray value image (variance, energy (1st, 2nd), entropy (1st, 2nd), skewness, kurtosis, third moment, fourth moment, contrast, homogeneity, correlation). Fractal features included estimated values for the fractal dimension based on the Fourier method (DF) applied to the gray value image (first 216 distance values in frequency space) and the box-counting method applied to the nuclei-segmented, binary image (three scale-ranges: 2^0–2^4, 2^5–2^9, 2^0–2^9). In addition, the recently developed pyramidal gradient and pyramidal differences methods applied to gray value images (mean RGB, R, G, B, L*, a*, b*, H, S, V; combinations of the scale-range 2^0–2^7 with at least four consecutive scales) were used.
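The box-counting estimate used on the binary, nuclei-segmented images can be illustrated as follows. This is a simplified sketch, not the IQM implementation; it assumes power-of-two box sizes and estimates the dimension as the slope of log N(s) against log s.

```python
import math

def box_counting_dimension(binary, scales):
    """Estimate the box-counting fractal dimension of a 2D binary image.
    binary: list of rows of 0/1 values; scales: box side lengths in pixels.
    N(s) ~ s^(-D), so D is minus the slope of log N(s) vs. log s."""
    h, w = len(binary), len(binary[0])
    log_s, log_n = [], []
    for s in scales:
        count = 0
        for y in range(0, h, s):
            for x in range(0, w, s):
                # a box is counted if it covers at least one foreground pixel
                if any(binary[yy][xx]
                       for yy in range(y, min(y + s, h))
                       for xx in range(x, min(x + s, w))):
                    count += 1
        log_s.append(math.log(s))
        log_n.append(math.log(count))
    # least-squares slope of log N(s) against log s
    n = len(scales)
    mx = sum(log_s) / n
    my = sum(log_n) / n
    slope = sum((xs - mx) * (ys - my) for xs, ys in zip(log_s, log_n)) \
            / sum((xs - mx) ** 2 for xs in log_s)
    return -slope
```

A fully filled image yields a dimension of 2, and a single straight line yields 1, which serves as a basic sanity check for the estimator.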
Classifier training and inference: Given a set of labeled training samples, we employed a linear SVM to learn a maximum-margin classifier, characterized by an optimal hyper-plane that separates two classes. To solve our multi-class classification problem (AIN0-AIN3), multiple SVMs were trained in a one-versus-one scheme. The regularization parameter C, determining the trade-off between margin size and training error, was optimized to maximize the F1-score on a validation set. To infer the grading of a test sample, each SVM cast a vote for a grade. Then, majority voting determined the final grade.
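The one-versus-one voting stage can be sketched as follows. The pairwise decision functions below are toy stand-ins for the trained binary SVMs (which LIBSVM provides in practice), and breaking ties toward the lower grade is a convention we chose for this illustration.

```python
from collections import Counter
from itertools import combinations

def ovo_predict(x, pairwise_clf, classes):
    """Majority voting over one-versus-one binary classifiers.
    pairwise_clf maps an ordered class pair (a, b), a < b, to a function
    f(x) that returns either a or b -- a stand-in for one binary SVM."""
    votes = Counter()
    for a, b in combinations(sorted(classes), 2):
        votes[pairwise_clf[(a, b)](x)] += 1
    # break ties toward the lower grade (a convention for this sketch;
    # assumes integer class labels)
    return max(votes, key=lambda c: (votes[c], -c))

# Toy pairwise classifiers on a single scalar feature: the pair (a, b)
# votes for b whenever x lies above the midpoint of the two grades.
grades = [0, 1, 2, 3]
clf = {(a, b): (lambda x, a=a, b=b: b if x > (a + b) / 2 else a)
       for a, b in combinations(grades, 2)}
```

With four grades, six pairwise classifiers cast votes, matching the one-versus-one scheme described above.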
Experimental setup: Three different sets of features were examined on each training set (cf. Section 2.1): (I) all features, (II) only the statistical features, (III) only DF (DF showed statistically significant differences for AIN1-AIN3 in previous work). Hence, nine classifiers were evaluated, which are identified as Ia-IIIc.
Features in each training set were standardized across all samples to zero mean and unit variance. The validation and test sets were standardized using the values computed from the corresponding training set. IQM, LIBSVM, and WEKA were used in our experiments.
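Standardization with training-set statistics only might look like this minimal sketch (our own helper, not the actual IQM/WEKA pipeline). A constant feature falls back to unit scale to avoid division by zero.

```python
def standardize(train, *others):
    """Zero-mean, unit-variance scaling per feature. Statistics are
    computed on the training set only and reused for the other splits,
    so no information leaks from validation or test data."""
    n, d = len(train), len(train[0])
    means = [sum(row[j] for row in train) / n for j in range(d)]
    stds = []
    for j in range(d):
        var = sum((row[j] - means[j]) ** 2 for row in train) / n
        stds.append(var ** 0.5 or 1.0)  # guard against constant features

    def apply(split):
        return [[(row[j] - means[j]) / stds[j] for j in range(d)]
                for row in split]

    return (apply(train),) + tuple(apply(s) for s in others)
```

Reusing the training-set mean and standard deviation on the validation and test splits is what keeps the hold-out evaluation honest.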
Performance metrics: Let TPc be the number of true positives, FPc false positives, and FNc false negatives per class. Precision, recall and F1-score are computed class-wise as PRCc = TPc/(TPc + FPc); RECc = TPc/(TPc + FNc); F1c = 2 ⋅ PRCc ⋅ RECc/(PRCc + RECc). Overall performance was measured in terms of weighted average precision (PRC = ∑c ωc PRCc), recall (REC = ∑c ωc RECc), and F1-score (F1 = ∑c ωc F1c), where ωc = nc/n is the weight of each class.
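These definitions translate directly into code. The following sketch mirrors the formulas above, with class weights ωc = nc/n; treating undefined ratios (zero denominators) as zero is one common convention and an assumption on our part.

```python
def weighted_prf(y_true, y_pred):
    """Class-wise precision/recall/F1 and their support-weighted averages,
    following PRCc = TPc/(TPc+FPc), RECc = TPc/(TPc+FNc),
    F1c = 2*PRCc*RECc/(PRCc+RECc), with weights omega_c = n_c/n."""
    classes = sorted(set(y_true))
    n = len(y_true)
    P = R = F = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prc = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prc * rec / (prc + rec) if prc + rec else 0.0
        w = sum(1 for t in y_true if t == c) / n  # omega_c = n_c / n
        P += w * prc
        R += w * rec
        F += w * f1
    return P, R, F
```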
3 Results
In Table 1, overall PRC, REC, and F1 are reported for the validation and test set. Detailed results on the test set are presented as confusion matrices for all nine classifiers (Ia-IIIc), cf. Figure 2. Best validation results were obtained when all extracted features were used jointly (I). PRC, REC and F1 measures ranged from 0.81 to 0.91 on the validation set, and 0.79 to 0.90 on the test set. Using statistical features only (II), performance metrics dropped to 0.65-0.76 and 0.45-0.66, respectively. Classification based on DF yielded 0.38-0.69 for the validation and 0.27-0.45 for the test set (III).
Generally, results obtained by models that used all available features for classification improved slightly when an augmented training set was used. This behavior was not observed for classifiers trained on statistical features or DF alone. The results for all classifiers are comparable for unbalanced and balanced training sets.
Figure 3 illustrates qualitative results for four images by presenting their ground truth labels and the grades that were predicted by our trained classifiers. For this illustration, one image per class was chosen randomly from the test set.
4 Discussion and conclusion
We examined different strategies to learn a classification model using SVM for grading histopathological images of AIN. Our results indicate that a combination of multiple fractal and statistical features greatly improved the outcome. For models that used all available features, we obtained highly similar performance on the validation and test set, which emphasizes our system’s ability to generalize well to unseen samples without over-fitting. Balancing the training set did not generally result in increased performance. Nevertheless, we could verify our hypothesis that augmentation aids classifiers during the learning phase.
The authors of previous work claimed that DF reflected AIN grades well, but the mean recall actually was below 0.5, excluding AIN0. Here, we included AIN0 and could not confirm that DF properly represents AIN grades. A much larger feature set was required to achieve generalization rates acceptable for use in biomedical diagnostics. In practice, physicians frequently discriminate only low- and high-grade AIN. Our best performing system (Ib) can predict these two classes with an accuracy of 96.55%.
This much higher performance can be explained by the fact that, compared to previous work, not only a single value for the fractal dimension but a multitude of values derived from measurements over varying scale-ranges was used. Features from different scale-ranges reflect measures at different scales, which are important when grading AIN (density, shape, and texture of cell nuclei; distribution of cells).
Nevertheless, manually extracting image features remains subject to experience and is application-specific. End-to-end machine learning approaches, e.g. convolutional neural networks, are able to tackle this problem automatically. However, they usually require much larger training datasets, and large sets of labeled training data are scarce in biomedical imaging. Hence, an evaluation of our approach on more extensive AIN datasets is required as they become available. Depending on the size of the available dataset, a semi-supervised learning setting, i.e. making use of unlabeled instances, should also be considered.
PK and MMR equally contributed to this publication.
Research funding: The authors state that no funding was involved. Conflict of interest: The authors state no conflict of interest. Informed consent: Informed consent has been obtained from all individuals included in this study. Ethical approval: The research related to human use complies with all the relevant national regulations and institutional policies, was performed in accordance with the tenets of the Helsinki Declaration, and has been approved by the authors’ institutional review board or equivalent committee.
Simpson JAD, Scholefield JH. Diagnosis and management of anal intraepithelial neoplasia and anal cancer. Br Med J. 2011;343:d6818.
Bejarano PA, Boutros M, Berho M. Anal squamous intraepithelial neoplasia. Gastroenterol Clin North Am. 2013;42:893–912.
Ahammer H, Kroepfl JM, Hackl C, Sedivy R. Fractal dimension and image statistics of anal intraepithelial neoplasia. Chaos Soliton Fract. 2011;44:86–92.
Klonowski W, Pierzchalski M, Stepien P, Stepien R, Sedivy R, Ahammer H. Application of Higuchi’s fractal dimension in analysis of images of Anal Intraepithelial Neoplasia. Chaos Soliton Fract. 2013;48:54–60.
Fabrizii M, Moinfar F, Jelinek HF, Karperien A, Ahammer H. Fractal analysis of cervical intraepithelial neoplasia. PLoS One. 2014;9:1–9.
Ahammer H, Kroepfl JM, Hackl C, Sedivy R. Image statistics and data mining of anal intraepithelial neoplasia. Pattern Recogn Lett. 2008;29:2189–96.
Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artif Intell Med. 2001;23:89–109.
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.
Gonzalez RC, Woods RE. Digital image processing. Upper Saddle River, NJ: Prentice Hall International; 2008.
Mayrhofer-Reinhartshuber M, Ahammer H. Pyramidal fractal dimension for high resolution images. Chaos. 2016;26:073109.
Kainz P, Mayrhofer-Reinhartshuber M, Ahammer H. IQM: an extensible and portable open source application for image and signal analysis in Java. PLoS One. 2015;10:e0116329.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. SIGKDD Explor. 2009;11:10–8.
LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86:2278–324.
About the article
Published Online: 2016-09-30
Published in Print: 2016-09-01
Citation Information: Current Directions in Biomedical Engineering, Volume 2, Issue 1, Pages 419–422, ISSN (Online) 2364-5504, DOI: https://doi.org/10.1515/cdbme-2016-0093.
©2016 Philipp Kainz et al., licensee De Gruyter. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (CC BY-NC-ND 4.0).