Nowadays, new biomedical entities such as proteins, genes, mutations, and diseases are constantly being discovered, leading to growth of the biomedical lexicon. However, a newly discovered biomedical entity is not always assigned a new term, so a single biomedical term may carry several possible senses. These ambiguous terms hinder the automatic extraction of biomedical information. Word sense disambiguation (WSD) addresses these ambiguities in textual documents by automatically assigning the correct meaning to an ambiguous word given its surrounding textual context. WSD is a challenging artificial intelligence problem that has been studied for many years. In the biomedical field in particular, the large number of ambiguous terms makes WSD even more difficult. Several datasets containing biomedical ambiguous terms have been proposed for assessing biomedical WSD systems, the most widely used currently being the MSH WSD dataset proposed by Jimeno-Yepes et al. This dataset was created by automatic means using the Unified Medical Language System (UMLS) Metathesaurus and the Medical Subject Headings (MeSH) indexing of MEDLINE articles.
In a previous work we applied supervised and knowledge-based WSD methods to a subset of the MSH WSD dataset, achieving top accuracies of 94.7 % and 85.1 %, respectively. In this paper we extend that work by (1) testing more supervised classifiers; (2) combining bag-of-words as local features and word embeddings as global features in the supervised approach; (3) using different word embedding averaging functions in the knowledge-based method to calculate the surrounding context vectors; and (4) extracting concept definitions for every ambiguous term in the MSH WSD dataset, making it possible to apply our knowledge-based method to all the ambiguous terms in the dataset.
2 Related Works
Schuemie et al. provide an overview of WSD in the biomedical domain up to 2005. Later work showed that metadata information and well-structured ontologies can play an important role in improving disambiguation. In 2011, Jimeno-Yepes et al. proposed the MSH WSD dataset, achieving a supervised accuracy of 93.9 % and a knowledge-based accuracy of 83.8 %. The results of the following works all refer to this same dataset. Some knowledge-based WSD approaches use semantic similarity measures from the UMLS, achieving accuracies of 80.7 % and 75.0 %. Another knowledge-based method uses the UMLS semantic network, achieving an accuracy of 60.3 %. McInnes and Stevenson explored supervised and knowledge-based WSD methods, achieving a supervised state-of-the-art accuracy around 97.0 % and a knowledge-based accuracy around 78.0 %. Their supervised system relies on a vector space model, calculating the cosine between the vector representing the ambiguous term and each of the vectors representing the possible senses. Another knowledge-based method, in which the authors generate word-concept probabilities from a knowledge base, achieves an accuracy of 89.1 %. The state-of-the-art accuracy by knowledge-based means is 92.2 %, obtained with a method proposed by Sabbir et al. that uses neural concept embeddings. Jimeno-Yepes achieved another supervised state-of-the-art accuracy of 96.0 % using a combination of unigrams and word embeddings with an SVM classifier.
As far as we know, Iacobacci et al. were the first to weight word embeddings according to the absolute word distance between a specific word and the ambiguous term in the disambiguation problem. Word embeddings are a recent technique that maps words to high-dimensional numeric vectors generated from unlabeled training data. These models have been shown to improve text mining tasks such as named entity recognition and word sense disambiguation.
In this work we applied supervised and knowledge-based methods to biomedical WSD. Bag-of-words features were used only in the supervised setting; in both approaches, however, we used word embedding models trained on unlabeled MEDLINE abstracts. These word embeddings were used to calculate embedding vectors of the surrounding contexts of the ambiguous terms, which we denominate context vectors or context embeddings. For the knowledge-based approach we extracted concept unique identifier (CUI) textual definitions from the UMLS Metathesaurus and calculated CUI embedding vectors, which we denominate concept vectors or concept embeddings. To find the most plausible meaning of a specific ambiguous term, our knowledge-based method calculates similarities between the context vector and the concept vectors, weighted by CUI-CUI association values. Each step is explained in more detail below.
3.1 MSH WSD Dataset
We evaluated our proposed methods on the MSH WSD dataset, which is the most widely adopted for assessing biomedical WSD methods. This dataset comprises a total of 203 ambiguous entities, of which 88 are regular terms, 106 are abbreviations, and 9 are a mix of both. Most of the ambiguous entities have only two possible senses; a minority of 14 terms have three to five senses. For each possible sense there is a maximum of 100 instances, each instance being a MEDLINE abstract in which the ambiguous term occurs. The dataset contains a total of 37,090 distinct MEDLINE abstracts.
3.2 Word Embeddings
The word embedding models were generated from PubMed articles, which are specific to the biomedical domain. MEDLINE abstracts from the years 1900 to 2015 were used, comprising around 15 million documents and a total of around 800 thousand unique words. Six word embedding models were trained, using windows of 5, 20, and 50 words and vector sizes of 100 and 300. To generate the word embedding vectors we used the continuous bag-of-words model proposed by Mikolov et al., implemented in the Gensim framework.
The word embedding models were used to calculate the context embedding vectors and the CUI embedding vectors, with the latter used only in the knowledge-based approach.
3.3 Context Embeddings
The context embeddings are vectors that represent the surrounding contexts of the ambiguous terms. Each surrounding context of an ambiguous term is composed of the words of the respective MSH WSD abstract, excluding the occurrences of the ambiguous term itself. All the context vectors were weighted using the inverse document frequency (IDF) scheme and normalized.
In the supervised setting, the term frequency (TF) component was added to the IDF weighting. Since cross-validation was used, these TF-IDF weights were then fitted with a linear regression on the labels of the current training fold, estimating a new weight for each word. These final weights were the ones used to weight the word embeddings in the test fold.
In the knowledge-based method we tested five different word embedding averaging functions: the TF-IDF weighting scheme, and four word-distance decay functions that also use the IDF scheme. The objective of using decay functions was to give greater importance to the words closest to the ambiguous term. The absolute word distance d between a specific word and the closest occurrence of the ambiguous term was defined as the input of the decay function. In summary, the weighting schemes used were (IDF weighting included):
No decay: f(d) = 1;
Fractional decay: f(d) = 1/d;
Exponential decay: f(d) = exp(−d);
Logarithmic decay: f(d) = 1/ln(1 + d).
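A decay-weighted context embedding following the scheme above can be sketched as below (fractional decay shown); the toy embeddings and IDF values are assumptions for illustration, not the real model:

```python
import numpy as np

def context_embedding(tokens, ambiguous, embeddings, idf,
                      decay=lambda d: 1.0 / d):  # fractional decay f(d) = 1/d
    """Average word embeddings weighted by IDF times a distance decay."""
    positions = [i for i, t in enumerate(tokens) if t == ambiguous]
    vecs, weights = [], []
    for i, t in enumerate(tokens):
        if t == ambiguous or t not in embeddings:
            continue  # skip the ambiguous term itself and unknown words
        d = min(abs(i - p) for p in positions)  # distance to closest occurrence
        vecs.append(embeddings[t])
        weights.append(idf.get(t, 1.0) * decay(d))
    v = np.average(vecs, axis=0, weights=weights)
    return v / np.linalg.norm(v)  # context vectors are normalized

tokens = ["the", "cold", "virus", "spreads", "fast"]
emb = {t: np.random.rand(4) + 0.1 for t in tokens}  # toy 4-d embeddings
idf = {"the": 0.1, "virus": 2.0, "spreads": 1.5, "fast": 1.2}
ctx_vec = context_embedding(tokens, "cold", emb, idf)
```

Swapping the `decay` argument for `lambda d: 1.0`, `math.exp(-d)`, or `1.0 / math.log(1 + d)` yields the other three schemes.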
3.4 Supervised Learning Classification
We tested five machine learning classifiers from the scikit-learn framework : decision tree (DT), k-nearest neighbor (k-NN, k = 5), logistic regression (LR), multi-layer perceptron (MLP), and support vector machine (SVM). To train the classifiers, bag-of-words features (unigrams, bigrams) and the context embeddings were used.
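Combining the local bag-of-words features with the global context embeddings amounts to concatenating the two feature blocks before training. A hedged scikit-learn sketch, with toy documents and random vectors standing in for the real context embeddings:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = [
    "cold virus infection detected",
    "cold front brings low temperature",
    "rhinovirus causes common cold",
    "temperature dropped in cold weather",
]
labels = [0, 1, 0, 1]  # toy sense labels for the ambiguous term "cold"

bow = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)  # uni+bigrams
ctx = csr_matrix(np.random.rand(len(docs), 100))  # stand-in context embeddings

X = hstack([bow, ctx])  # concatenate local (bag-of-words) and global features
clf = LinearSVC().fit(X, labels)
pred = clf.predict(X)
```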
3.5 Knowledge-Based Method
3.5.1 Concept Embeddings
CUI textual definitions were extracted from the UMLS knowledge sources to create the concept embedding vectors, which were weighted using the TF-IDF scheme. All the concept vectors were normalized.
3.5.2 CUI-CUI Association Values
We calculated CUI-CUI association values as normalized pointwise mutual information (nPMI) from the MeSH co-occurrence counts in MEDLINE articles. The nPMI values lie between 0 and 1, with 0 representing no association and 1 a perfect association; thus, a concept in relation to itself has a value of 1. Since there are many CUIs, and consequently many more CUI-CUI relations, we considered only nPMI values greater than or equal to 0.3.
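One plausible reading of this nPMI computation from raw co-occurrence counts is sketched below. Clipping negative values to zero is our assumption, made to keep the values in [0, 1] as described, and not necessarily the authors' exact choice:

```python
import math

def npmi(n_xy, n_x, n_y, n_total):
    """Normalized PMI from raw co-occurrence counts."""
    if n_xy == 0:
        return 0.0  # never co-occur: no association
    p_xy = n_xy / n_total
    p_x, p_y = n_x / n_total, n_y / n_total
    pmi = math.log(p_xy / (p_x * p_y))
    value = pmi / -math.log(p_xy)  # normalizes PMI into [-1, 1]
    return max(0.0, value)         # clip negatives so values stay in [0, 1]

# A concept that co-occurs only with itself is perfectly associated.
assert abs(npmi(50, 50, 50, 1000) - 1.0) < 1e-9
# Below-chance co-occurrence clips to 0 (no association).
assert npmi(5, 100, 100, 1000) == 0.0
```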
Our knowledge-based method stems from the idea of comparing the surrounding contexts of the ambiguous terms with the concept textual definitions, in order to find the most similar concept (meaning) given a specific context. With that in mind, we extended this baseline approach by calculating a score for each possible CUI (meaning) of an ambiguous term, as shown in equation (1):

score(CUI) = (1/N) · Σ_j nPMI(CUI, CUI_j) · CS(t, CUI_j)   (1)

In equation (1), CUI represents the target meaning, CUI_j any other related concept, t the context vector, and, inside CS, CUI_j denotes the concept vector of the related concept. Each context t is compared with the concept textual definitions through the cosine similarities CS(t, CUI_j), which are weighted by the nPMI(CUI, CUI_j) association values. The value N is the total number of relations considered, that is, the number of non-zero nPMI values, and is used to normalize the final score. A score is calculated for each possible CUI, and the one with the highest score is taken as the correct meaning.
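This scoring scheme can be sketched in a few lines; the vectors are toy data and the CUI identifiers are purely illustrative:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(context_vec, npmi_values, concept_vecs):
    """npmi_values: {related_cui: nPMI(CUI, related_cui)}, already thresholded."""
    related = list(npmi_values)
    total = sum(npmi_values[c] * cosine(context_vec, concept_vecs[c])
                for c in related)
    return total / len(related)  # N = number of relations considered

rng = np.random.default_rng(0)
t = rng.random(50)  # toy context vector
concept_vecs = {"C0000001": rng.random(50), "C0000002": rng.random(50)}
s = score(t, {"C0000001": 1.0, "C0000002": 0.4}, concept_vecs)
```

The candidate CUI with the highest score is then selected as the predicted meaning.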
Supervised learning results were obtained using five distinct classifiers, as described in Section 3.4. Table 1 shows the results using only bag-of-words features (unigrams, bigrams), with a best accuracy of 95.5 % using an SVM classifier. Table 2 shows the results using only word embeddings, with a best accuracy of 95.1 % using an MLP classifier. The combination of unigrams and word embeddings (Table 3) improved on the individual accuracies, achieving a best accuracy of 95.6 % with an MLP classifier. The differences between the word embedding models are not significant.
Knowledge-based results were obtained using the five distinct word embedding averaging functions described in Section 3.3 (Tables 4–8). Different thresholds (0.3, 0.5, 0.8, 1.0) on the nPMI values were imposed to keep only the strongest relations. The threshold 1.0 is the particular baseline case where only the cosine similarity between the context vector and the possible CUI (meaning) vector is computed. In all the word embedding averaging functions, the threshold 0.3 produced the best accuracies, indicating that the addition of more related concepts leads to a better score refinement. The fractional decay averaging function obtained the highest results (Table 6), while the exponential decay averaging function obtained the lowest results (Table 7), even compared with the baseline TF-IDF weighting (Table 4). Also, the word embedding models with larger windows achieved slightly higher accuracies. The top accuracy of 87.4 %, shown in Table 6, was achieved using the fractional decay averaging function, the nPMI threshold set to 0.3, and the word embedding model with vector size 100 and window 50.
In this paper we extended our previous work by applying more settings to the supervised and knowledge-based approaches. Furthermore, we extracted textual definitions for every CUI included in the MSH WSD dataset, making it possible to apply our knowledge-based method to the entire dataset.
As expected, the supervised classifiers obtained the highest results, with a top accuracy of 95.6 %, while our knowledge-based approach obtained a best accuracy of 87.4 %. Our supervised accuracy is very close to the state-of-the-art accuracy of 96.0 %, which was also obtained using a combination of unigrams and word embeddings with an SVM classifier.
Our knowledge-based method and results are comparable with other proposed knowledge-based approaches. Jimeno-Yepes et al., who proposed the MSH WSD dataset, tested four knowledge-based methods, of which the automatic extracted corpus (AEC) method obtained the best accuracy, around 84.5 %. McInnes and Pedersen developed a knowledge-based method based on semantic similarity measures between UMLS concepts and obtained an accuracy of 75 % on the same dataset. Other authors used word-concept probabilities, achieving a knowledge-based accuracy around 89 %. Our method is similar to the one proposed by Tulkens et al., who also compared concept representations with representations of the contexts of ambiguous terms, obtaining an accuracy of 84.0 % on the same dataset. As far as we know, the knowledge-based state-of-the-art accuracy on the MSH WSD dataset is 92.2 %, obtained with a method proposed by Sabbir et al. that uses neural word/concept embeddings.
Our work showed that word embeddings and their averaging function play a key role in the WSD problem.
This work was supported by Portuguese national funds through FCT - Foundation for Science and Technology, in the context of the project IF/01694/2013/CP1162/CT0018. Sérgio Matos is funded under the FCT Investigator Programme.
Weeber M, Mork JG, Aronson AR. Developing a test collection for biomedical word sense disambiguation. In: Proceedings of the AMIA Symposium. American Medical Informatics Association, 2001:746–50. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2243574/.
Moon S, Pakhomov S, Liu N, Ryan JO, Melton GB. A sense inventory for clinical abbreviations and acronyms created using clinical notes and medical dictionary resources. J Am Med Inform Assoc 2014;21:299–307.
Jimeno-Yepes A, McInnes BT, Aronson AR. Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation. BMC Bioinformatics 2011;12:223.
Antunes R, Matos S. Biomedical word sense disambiguation with word embeddings. In: 11th International Conference on Practical Applications of Computational Biology & Bioinformatics. Springer International Publishing, 2017:273–9.
Alexopoulou D, Andreopoulos B, Dietze H, Doms A, Gandon F, Hakenberg J, et al. Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy. BMC Bioinformatics 2009;10:28.
Garla VN, Brandt C. Knowledge-based biomedical word sense disambiguation: an evaluation and application to clinical document classification. J Am Med Inform Assoc 2013;20:882.
McInnes BT, Pedersen T. Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text. J Biomed Inform 2013;46:1116–24.
El-Rab WG, Zaïane OR, El-Hajj M. Biomedical text disambiguation using UMLS. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. Niagara, Ontario, Canada: ACM, 2013:943–7.
Sabbir AKM, Jimeno-Yepes A, Kavuluru R. Knowledge-based biomedical word sense disambiguation with neural concept embeddings. In: 17th IEEE International Conference on BioInformatics and BioEngineering (BIBE), 2017.
Jimeno-Yepes A. Word embeddings and recurrent neural networks based on long-short term memory nodes in supervised biomedical word sense disambiguation. J Biomed Inform 2017;73(Supplement C):137–47.
Iacobacci I, Pilehvar MT, Navigli R. Embeddings for word sense disambiguation: an evaluation study. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany: Association for Computational Linguistics, 2016:897–907.
Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv e-print, 2013. Available from: https://arxiv.org/abs/1301.3781.
Habibi M, Weber L, Neves M, Wiegandt DL, Leser U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 2017;33:i37–48.
Taghipour K, Ng HT. Semi-supervised word sense disambiguation using word embeddings in general and specific domains. In: Proceedings of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT 2015). Denver, Colorado, USA, 2015:314–23.
Wu Y, Xu J, Zhang Y, Xu H. Clinical abbreviation disambiguation using neural word embeddings. In: Proceedings of the 2015 Workshop on Biomedical Natural Language Processing (BioNLP 2015). Beijing, China: Association for Computational Linguistics, 2015:171–6.
Řehůřek R, Sojka P. Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta, 2010:45–50.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.
Tulkens S, Šuster S, Daelemans W. Using distributed representations to disambiguate biomedical and clinical concepts. In: Proceedings of the 15th Workshop on Biomedical Natural Language Processing. Berlin, Germany: Association for Computational Linguistics, 2016:77–82.
Published Online: 2017-12-13
Conflict of interest statement: Authors state no conflict of interest. All authors have read the journal’s publication ethics and publication malpractice statement available at the journal’s website and hereby confirm that they comply with all its parts applicable to the present scientific work.
Citation Information: Journal of Integrative Bioinformatics, Volume 14, Issue 4, 20170051, ISSN (Online) 1613-4516, DOI: https://doi.org/10.1515/jib-2017-0051.
©2017, Rui Antunes and Sérgio Matos, published by DeGruyter. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License. BY-NC-ND 3.0