Open Access (CC BY-NC-ND 4.0 license). Published by De Gruyter, September 30, 2016

Learning discriminative classification models for grading anal intraepithelial neoplasia

  • Philipp Kainz, Michael Mayrhofer-Reinhartshuber, Roland Sedivy and Helmut Ahammer


Grading intraepithelial neoplasia is crucial to derive an accurate estimate of pre-cancerous stages and is currently performed by pathologists assessing histopathological images. Inter- and intra-observer variability can be significantly reduced when reliable, quantitative image analysis is introduced into diagnostic processes. On a challenging dataset, we evaluated the potential of learning a classifier to grade anal intraepithelial neoplasia. Support vector machines were trained on images represented by fractal and statistical features. We show that pursuing a learning-based grading strategy yields highly reliable results: the proposed method outperformed existing methods by a significant margin.

1 Introduction

Anal intraepithelial neoplasia (AIN) can be present in various forms and usually precedes anal carcinoma [1]. Assessing alterations of epithelium is crucial to derive an accurate estimate of pre-cancerous stages and carcinoma in-situ. The inspection of biopsy tissue is done qualitatively by physicians to diagnose anorectal disease, either using glass slides and light microscopy, or digital whole slide images. AIN is typically classified into four classes that correspond to non-neoplastic tissue (AIN0), and increasing grades of dysplasia (AIN1-AIN3). Important cues for grading are the density, shape, and texture of cell nuclei, as well as the distribution of normal and abnormal cells within the tissue [2]. However, qualitative analysis strongly depends on the observer’s experience and frequently leads to irreproducible results. Since defining clear transitions between grades is not always possible, the inter-observer variability can be considerable. Using quantitative image analysis, drawbacks of subjective assessments can be tackled by introducing reproducible, validated methods in biomedical diagnostics.

Previous work used fractal image analysis [3], [4] to grade AIN, or cervical intraepithelial neoplasia [5]. Fractal dimensions were computed globally for each image, and the value range was segmented by rigidly searching for statistically significant thresholds to separate the classes. In other work [6], a variety of data mining methods was explored to grade AIN images based on statistical texture features. In order to add significant value to a diagnostic process in terms of reducing time-consuming tasks, a classifier must be able to deal with a great input variance [7], generalize well, and reliably predict the grading of new images. Two crucial aspects have not been addressed in [3], [4], [6]: Firstly, none of the previous methods involved building a classifier on training data and evaluating it on novel test data. Secondly, since authors reported errors on training sets, we cannot assess the ability of their methods to generalize to unseen data.

The objective of this work is to learn a discriminative classification model for predicting the grade of AIN. We show that a support vector machine (SVM [8]) learns a robust classifier on a set of global image features. Our learning-based approach outperforms the previously proposed methods by a large margin on a challenging dataset [6]. Furthermore, we assess the influence of common strategies for data augmentation and class label balancing on the generalization performance.

2 Material and methods

2.1 Histopathology dataset

We used a set of hematoxylin and eosin stained images (n = 136, 749 × 580 pixels) of different AIN grades [3], [6], cf. Figure 1. An expert pathologist labeled images containing healthy tissue (n0 = 17) as AIN0, and low-grade neoplasia (n1 = 36) as AIN1. High-grade neoplasia was labeled as AIN2 (n2 = 55), and AIN3 (n3 = 28). We assigned 80% of the images in each class to the training, and 20% to a hold-out test set. Further, 20% of the training samples were used as validation set for parameter tuning, before the test set was predicted. This resulted in 84 images in the original training set (10 AIN0 / 22 AIN1 / 35 AIN2 / 17 AIN3), 23 in the validation set (3/6/9/5), and 29 in the test set (4/8/11/6).
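A stratified split of this kind can be sketched with scikit-learn; this is an illustrative sketch, not the authors' tooling, and the exact per-class counts may differ slightly from those above due to rounding:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class labels matching the reported sizes: 0=AIN0, 1=AIN1, 2=AIN2, 3=AIN3
labels = np.array([0] * 17 + [1] * 36 + [2] * 55 + [3] * 28)
indices = np.arange(len(labels))

# 80/20 stratified split into training and hold-out test set
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=0)

# 20% of the training samples form the validation set for parameter tuning
train_idx, val_idx = train_test_split(
    train_idx, test_size=0.2, stratify=labels[train_idx], random_state=0)
```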

Figure 1 Expert gradings of anal intraepithelial neoplasia.

Image pre-processing and augmentation: We hypothesized that augmentation and balancing of a training set improves the ability of our classifier to generalize. Hence, two additional training sets were created by applying label-preserving elastic deformations and parameterized random intensity variations of the individual channels in HSV color space. The training sets are referred to as (a) for the original, (b) for the augmented, and (c) for the augmented and balanced training set. For (b), we added 30 versions of each image in (a), such that it comprised 2604 images. However, the class label distribution remained unbalanced. For (c), we oversampled the underrepresented classes in (b) to achieve a uniform distribution over all four grades of AIN (4304 images). All images were resized to 512 × 512 pixels by bilinear interpolation, prior to feature extraction.
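The intensity-variation half of this augmentation can be sketched as follows; `jitter_hsv` is a hypothetical helper, the elastic-deformation step is omitted, and the perturbation range is an assumption:

```python
import numpy as np

def jitter_hsv(img_hsv, rng, strength=0.1):
    """Randomly perturb each channel of an HSV image (values in [0, 1]).

    Label-preserving intensity variation: each channel is scaled by a
    random factor drawn around 1.0, then clipped back to [0, 1].
    """
    factors = 1.0 + rng.uniform(-strength, strength, size=3)
    return np.clip(img_hsv * factors, 0.0, 1.0)

rng = np.random.default_rng(42)
img = rng.random((512, 512, 3))          # stand-in for an HSV image
augmented = [jitter_hsv(img, rng) for _ in range(30)]  # 30 versions, as in (b)
```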

2.2 Learning AIN grading classifiers

Feature extraction: Since AIN images show randomly orientated tissue, we considered the rotational invariance of global texture features advantageous. Inspired by [3], [4], [5], [6], we extracted 22 statistical and 304 fractal features to represent the image content. Statistical features consisted of summed pixel values of individual channels in three different color models (RGB, L*a*b*, HSV) and the gray value image (mean RGB), as well as first and second order statistical parameters [9] of the NTSC luminance gray value image (variance, energy (1st, 2nd), entropy (1st, 2nd), skewness, kurtosis, third moment, fourth moment, contrast, homogeneity, correlation). Fractal features included estimates of the fractal dimension based on the Fourier method (DF) applied to the gray value image [3] (first 216 distance values in frequency space), and the box-counting method applied to the nuclei-segmented, binary image [5] (three scale-ranges: 2^0–2^4, 2^5–2^9, 2^0–2^9). In addition, the recently developed pyramidal gradient and pyramidal differences methods [10] applied to gray value images (mean RGB, R, G, B, L*, a*, b*, H, S, V; combinations of the scale-ranges 2^0–2^7 with at least four consecutive scales) were used.
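The box-counting estimate used on the nuclei-segmented binary image can be sketched in a few lines of NumPy; this is a generic textbook implementation under assumed power-of-two scales, not the authors' IQM code:

```python
import numpy as np

def box_counting_dimension(binary, scales):
    """Estimate the box-counting dimension of a 2-D binary image.

    For each box size s, count the boxes containing at least one
    foreground pixel; the dimension is the slope of log N(s) vs log(1/s).
    """
    counts = []
    for s in scales:
        h, w = binary.shape
        # trim so the image tiles evenly into s x s boxes
        trimmed = binary[:h - h % s, :w - w % s]
        boxes = trimmed.reshape(trimmed.shape[0] // s, s,
                                trimmed.shape[1] // s, s)
        counts.append(np.count_nonzero(boxes.any(axis=(1, 3))))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)),
                          np.log(counts), 1)
    return slope

# Sanity check: a filled square should yield a dimension close to 2
img = np.zeros((512, 512), dtype=bool)
img[100:400, 100:400] = True
d = box_counting_dimension(img, scales=[1, 2, 4, 8, 16])
```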

Classifier training and inference: Given a set of labeled training samples, we employed a linear SVM to learn a maximum-margin classifier, characterized by an optimal hyper-plane that separates two classes. To solve our multi-class classification problem (AIN0-AIN3), multiple SVMs were trained in a one-versus-one scheme. The regularization parameter C, determining the trade-off between margin size and training error, was optimized to maximize the F1-score on a validation set. To infer the grading of a test sample, each SVM cast a vote for a grade. Then, majority voting determined the final grade.
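The training and tuning procedure can be sketched with scikit-learn, whose `SVC` resolves multi-class problems by exactly this one-versus-one voting scheme; the random stand-in data and the C grid are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Stand-in feature matrices (326 features, 4 AIN grades); not the paper's data
X_tr, y_tr = rng.random((84, 326)), rng.integers(0, 4, 84)
X_va, y_va = rng.random((23, 326)), rng.integers(0, 4, 23)

# Standardize with statistics computed on the training set only
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_va_s = scaler.transform(X_tr), scaler.transform(X_va)

# Tune C to maximize the F1-score on the validation set;
# SVC trains one-vs-one SVMs and predicts by majority voting internally
best_C, best_f1 = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X_tr_s, y_tr)
    f1 = f1_score(y_va, clf.predict(X_va_s), average='weighted')
    if f1 > best_f1:
        best_C, best_f1 = C, f1
```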

Experimental setup: Three different sets of features were examined on each training set (cf. Section 2.1): (I) all features, (II) only the statistical features, (III) only DF (DF showed statistically significant differences for AIN1-AIN3 in [3]). Hence, nine classifiers were evaluated, which are identified as Ia-IIIc.

Features in each training set were standardized across all samples to zero mean and unit variance. The validation and test sets were standardized using the values computed from the corresponding training set. IQM [11], LIBSVM [12] and WEKA [13] were used in our experiments.

Performance metrics: Let TPc be the number of true positives, FPc false positives, and FNc false negatives per class. Precision, recall and F1-score are computed class-wise as PRCc = TPc/(TPc + FPc); RECc = TPc/(TPc + FNc); F1c = 2 ⋅ PRCc ⋅ RECc/(PRCc + RECc). Overall performance was measured in terms of weighted average precision (PRC = ∑cωc PRCc), recall (REC = ∑cωc RECc), and F1-score (F1 = ∑cωc F1c), where ωc = nc/n is the weight of each class.
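These definitions translate directly into code; the following sketch derives the class-wise counts from a confusion matrix (the toy matrix is illustrative, not the paper's results):

```python
import numpy as np

def weighted_metrics(conf):
    """Class-wise and weighted-average precision, recall, F1 from a
    confusion matrix (rows = true class, columns = predicted class)."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                  # TP_c
    fp = conf.sum(axis=0) - tp          # FP_c
    fn = conf.sum(axis=1) - tp          # FN_c
    prc = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prc * rec / (prc + rec)
    w = conf.sum(axis=1) / conf.sum()   # class weights omega_c = n_c / n
    return (w * prc).sum(), (w * rec).sum(), (w * f1).sum()

# Toy 4-class confusion matrix (AIN0-AIN3)
conf = [[3, 1, 0, 0],
        [0, 7, 1, 0],
        [0, 1, 9, 1],
        [0, 0, 1, 5]]
prc, rec, f1 = weighted_metrics(conf)
```

Note that the weighted recall equals overall accuracy (the diagonal sum divided by n), since the class sizes cancel.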

3 Results

In Table 1, overall PRC, REC, and F1 are reported for the validation and test set. Detailed results on the test set are presented as confusion matrices for all nine classifiers (Ia–IIIc), cf. Figure 2. Best validation results were obtained when all extracted features were used jointly (I). PRC, REC and F1 measures ranged from 0.81 to 0.91 on the validation, and 0.79–0.90 on the test set. Using statistical features only (II), performance metrics dropped to 0.65–0.76, and 0.45–0.66, respectively. Classification based on DF yielded 0.38–0.69 for the validation, and 0.27–0.45 for the test set (III).

Table 1

Quantitative results of grading methods (linear SVMs with parameters optimized on validation sets) for all four grades of AIN.

                      Validation set        Test set
Classifier (Feats.)   PRC   REC   F1      PRC   REC   F1
Ia    (326)           0.84  0.83  0.81    0.82  0.79  0.79
IIa   (22)            0.76  0.74  0.73    0.59  0.62  0.60
IIIa  (1)             0.40  0.61  0.48    0.29  0.45  0.36
Ib    (326)           0.87  0.87  0.87    0.90  0.86  0.87
IIb   (22)            0.68  0.70  0.68    0.56  0.55  0.54
IIIb  (1)             0.38  0.57  0.45    0.27  0.41  0.33
Ic    (326)           0.91  0.87  0.87    0.88  0.86  0.86
IIc   (22)            0.65  0.65  0.65    0.66  0.45  0.50
IIIc  (1)             0.69  0.52  0.52    0.36  0.41  0.32

The classifiers used different features (I: statistical and fractal features, II: statistical features, III: DF) and were trained with the original training set (a, upper panel), the 30-fold augmented training set (b, middle panel), and the augmented and balanced training set (c, lower panel).

Figure 2 Confusion matrices for the test set obtained with the nine different SVM classifiers Ia–IIIc. Rows: Statistical and fractal features (I), statistical features (II), DF (III). Columns: Original (A), augmented (B), augmented and balanced (C) training set. For each true class, colors towards magenta encode a higher tendency of a classifier to predict a particular class. Higher values along the main diagonal are desired.

Generally, results obtained by models that used all available features for classification slightly improved, when an augmented training set was used. This behavior was not observed for classifiers trained on statistical features or DF. The results for all classifiers are comparable for unbalanced and balanced training sets.

Figure 3 illustrates qualitative results for four images by presenting their ground truth labels and the grades that were predicted by our trained classifiers. For this illustration, one image per class was chosen randomly from the test set.

Figure 3 Qualitative comparison of evaluated AIN grading methods. Text in red color denotes classification errors, top right corner shows the ground truth label.

4 Discussion and conclusion

We examined different strategies to learn a classification model using SVM for grading histopathological images of AIN. Our results indicate that a combination of multiple fractal and statistical features greatly improved the outcome. For models that used all available features, we obtained highly similar performance on the validation and test set, which emphasizes our system’s ability to generalize well to unseen samples without over-fitting. Balancing the training set did not generally result in increased performance. Nevertheless, we could verify our hypothesis that augmentation aids classifiers during the learning phase.

The authors of [3] claimed that DF reflected AIN grades well, but the mean recall actually was <0.5, excluding AIN0. Here, we included AIN0 and could not confirm that DF properly represents AIN grades. A much larger feature set was required to achieve generalization rates acceptable for the use in biomedical diagnostics. In practice, physicians frequently discriminate only low- and high-grade AIN. Our best performing system (Ib) can predict these two classes with an accuracy of 96.55%.

This much higher performance can be explained by the fact that, compared to previous work [3], [4], not a single value for the fractal dimension but a multitude of values derived from measurements over varying scale-ranges was used. Features from different scale-ranges reflect measures at different scales, which are important when grading AIN (density, shape, texture of cell nuclei, distribution of cells).

Nevertheless, manually extracting image features remains subject to experience and is application-specific. End-to-end machine learning approaches, e.g. convolutional neural networks [14], are able to automatically tackle this problem. However, they usually require much larger training datasets, and large sets of labeled training data are typically scarce in biomedical imaging. Hence, an evaluation of our approach on more extensive AIN datasets is required as they become available. Depending on the size of the available dataset, a semi-supervised learning setting, i.e. making use of unlabeled instances, should also be considered.


PK and MMR contributed equally to this publication.

Author’s Statement

Research funding: The authors state no funding was involved. Conflict of interest: The authors state no conflict of interest. Informed consent: Informed consent has been obtained from all individuals included in this study. Ethical approval: The research related to human use complies with all the relevant national regulations and institutional policies, was performed in accordance with the tenets of the Helsinki Declaration, and has been approved by the authors’ institutional review board or equivalent committee.


[1] Simpson JAD, Scholefield JH. Diagnosis and management of anal intraepithelial neoplasia and anal cancer. Br Med J. 2011;343:d6818. doi:10.1136/bmj.d6818

[2] Bejarano PA, Boutros M, Berho M. Anal squamous intraepithelial neoplasia. Gastroenterol Clin North Am. 2013;42:893–912. doi:10.1016/j.gtc.2013.09.005

[3] Ahammer H, Kroepfl JM, Hackl C, Sedivy R. Fractal dimension and image statistics of anal intraepithelial neoplasia. Chaos Soliton Fract. 2011;44:86–92. doi:10.1016/j.chaos.2010.12.004

[4] Klonowski W, Pierzchalski M, Stepien P, Stepien R, Sedivy R, Ahammer H. Application of Higuchi’s fractal dimension in analysis of images of Anal Intraepithelial Neoplasia. Chaos Soliton Fract. 2013;48:54–60. doi:10.1016/j.chaos.2013.01.004

[5] Fabrizii M, Moinfar F, Jelinek HF, Karperien A, Ahammer H. Fractal analysis of cervical intraepithelial neoplasia. PLoS One. 2014;9:1–9. doi:10.1371/journal.pone.0108457

[6] Ahammer H, Kroepfl JM, Hackl C, Sedivy R. Image statistics and data mining of anal intraepithelial neoplasia. Pattern Recogn Lett. 2008;29:2189–96. doi:10.1016/j.patrec.2008.08.008

[7] Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artif Intell Med. 2001;23:89–109. doi:10.1016/S0933-3657(01)00077-X

[8] Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97. doi:10.1007/BF00994018

[9] Gonzalez RC, Woods RE. Digital image processing. Upper Saddle River, NJ: Prentice Hall International; 2008.

[10] Mayrhofer-Reinhartshuber M, Ahammer H. Pyramidal fractal dimension for high resolution images. Chaos. 2016;26:073109. doi:10.1063/1.4958709

[11] Kainz P, Mayrhofer-Reinhartshuber M, Ahammer H. IQM: an extensible and portable open source application for image and signal analysis in Java. PLoS One. 2015;10:e0116329. doi:10.1371/journal.pone.0116329

[12] Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2:27:1–27:27. doi:10.1145/1961189.1961199

[13] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. SIGKDD Explor. 2009;11:10–8. doi:10.1145/1656274.1656278

[14] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86:2278–324. doi:10.1109/5.726791

Published Online: 2016-9-30
Published in Print: 2016-9-1

©2016 Philipp Kainz et al., licensee De Gruyter.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
