BY-NC-ND 4.0 license Open Access Published by De Gruyter October 18, 2016

Evaluating the effect of unbalanced data in biomedical document classification

  • Rosalía Laza, Reyes Pavón, Miguel Reboiro-Jato and Florentino Fdez-Riverola


Document classification has become an active research field, partly because of the growing volume of biomedical information available in digital form that needs to be catalogued and organized. In this context, machine learning techniques are usually applied to text classification through a general inductive process that automatically builds a text classifier from a set of pre-classified documents. In this domain, imbalanced data is a well-known problem in many practical applications of knowledge discovery, and its effect on the performance of standard classifiers is considerable. In this paper, we investigate the application of a Bayesian Network (BN) model to the triage of documents, which are represented by the association of different MeSH terms. Our results show that BNs are adequate for describing conditional independencies between MeSH terms and that the MeSH ontology is a valuable resource for representing Medline documents at different abstraction levels. Moreover, we perform an extensive experimental evaluation to investigate whether the classification of Medline documents using a BN classifier poses additional challenges when dealing with class-imbalanced prediction. The evaluation involves two methods: under-sampling and cost-sensitive learning. We conclude that the BN classifier is sensitive to both balancing strategies and that existing techniques can improve its overall performance.
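
The two balancing strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the "relevant"/"irrelevant" labels, and the cost values are hypothetical, and the Bayesian Network classifier itself is not shown. The sketch assumes only that some classifier supplies an estimate of P(relevant | document). Under-sampling discards majority-class documents before training, whereas cost-sensitive learning, in one common form, keeps all documents and instead picks the label with the lowest expected misclassification cost.

```python
import random
from collections import defaultdict


def undersample(docs, labels, seed=0):
    """Randomly drop documents until every class matches the rarest class in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)
    # Size of the smallest class; every class is cut down to this size.
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for doc in rng.sample(group, target):
            balanced.append((doc, label))
    rng.shuffle(balanced)
    return balanced


def cost_sensitive_label(p_relevant, cost_fn=5.0, cost_fp=1.0):
    """Choose the label with the lower expected cost.

    p_relevant: assumed posterior P(relevant | document) from any classifier.
    cost_fn: hypothetical cost of missing a relevant document.
    cost_fp: hypothetical cost of flagging an irrelevant document.
    """
    cost_if_predict_irrelevant = p_relevant * cost_fn
    cost_if_predict_relevant = (1.0 - p_relevant) * cost_fp
    if cost_if_predict_relevant <= cost_if_predict_irrelevant:
        return "relevant"
    return "irrelevant"


if __name__ == "__main__":
    docs = list(range(10))
    labels = ["relevant"] * 2 + ["irrelevant"] * 8
    print(undersample(docs, labels))
    print(cost_sensitive_label(0.2))
```

With these assumed costs the decision threshold on P(relevant | document) drops from 0.5 to cost_fp / (cost_fp + cost_fn) = 1/6, so the minority class is predicted more readily without discarding any training documents.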

Published Online: 2016-10-18
Published in Print: 2011-12-1

© 2011 The Author(s). Published by Journal of Integrative Bioinformatics.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
