
Data and Information Management

4 Issues per year

Open Access

Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database

Neil R. Smalheiser
  • Corresponding author
  • Department of Psychiatry and Psychiatric Institute, University of Illinois College of Medicine, 1601 West Taylor Street, MC912, Chicago, IL 60612, USA
Aaron M. Cohen
  • Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, Oregon, USA 97239
Published Online: 2018-05-22 | DOI: https://doi.org/10.2478/dim-2018-0004


Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names. A common approach is to define positive and negative training examples, extract features from article metadata, and use machine learning algorithms. At present, each research group tackles each problem from scratch, in isolation from other projects, which causes redundancy and a great waste of effort. Here, we propose and describe the design of a generic platform for biomedical text mining, which can serve as a shared resource for machine learning projects and as a public repository for their outputs. We initially focus on a specific goal, namely, classifying articles according to publication type, and emphasize how feature sets can be made more powerful and robust through the use of multiple, heterogeneous similarity measures as input to machine learning models. We then discuss how the generic platform can be extended to include a wide variety of other machine learning-based goals and projects and can be used as a public platform for disseminating the results of natural language processing (NLP) tools to end-users as well.

Keywords: Text mining; machine learning; semantic similarity; vector representation; community platforms; data sharing; open science

1 Introduction

Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names to automated summarization of articles. A common approach is to define positive and negative training examples, extract features from article metadata or full-text, and use machine learning algorithms. A search in PubMed for [text AND (“machine learning” OR automated)] yields more than 2650 articles. The text mining research community is active, relatively cohesive, and stands poised to participate in the revolution in medical care that is associated with the digitization of medical records, knowledge bases, and scientific publications (Simpson & Demner-Fushman 2012; Przybyła, Shardlow, Aubin, Bossy, de Castilho, Piperidis, et al. 2016).

The public resources provided by the National Library of Medicine (https://www.ncbi.nlm.nih.gov/pubmed), including (among others) MEDLINE, PubMed, Unified Medical Language System, SemRep, and PubMed Central’s full-text open access repository, are a boon to researchers. A number of important efforts have been made to streamline text mining workflows by providing a library of natural language processing (NLP) tools (e.g., stemmers, parsers, and named entity recognizers) that can be connected together in a pipeline (Manning, Surdeanu, Bauer, Finkel, Bethard, McClosky, D., 2014; Savova, Masanz, Ogren, Zheng, Sohn, Kipper-Schuler, Chute, 2010; Batista-Navarro, Carter, Ananiadou, 2016; Clarke, Srikumar, Sammons, Roth, 2012). In addition, there are valuable machine learning packages that provide machine learning algorithms in a user-friendly manner (Hall, Frank, Holmes, Pfahringer, Reutemann & Witten, 2009), and there are even emerging approaches to assist researchers in choosing the most appropriate machine learning algorithm, features, and parameter settings for a given problem (Zeng & Luo 2017).

Nevertheless, at present, most informatics investigators maintain independent databases, extract features on their own for each project, and generally carry out their research efforts on small subsets of the biomedical literature and largely in isolation from each other. Thus, they do not benefit fully from the savings and diversity (and possible reuse) associated with shared resources. In particular, we believe that additional savings can be achieved if the community of text mining researchers can reuse and contribute to a common pool of features extracted comprehensively from the metadata or full-text of PubMed articles.

Features are the results that are generated when an NLP tool or machine learning model processes a passage or corpus of text. These features can be rather simple – for example, here is the title of a PubMed article: Discovering foodborne illness in online restaurant reviews.

One can employ the Porter stemmer (Porter, 1980) to process the text, which will result in: Discov foodborn ill in onlin restaur reviews.

One can also employ a Bio tokenizer/stemmer (Torvik, Smalheiser, Weeber, 2007) instead, which will result in: discovering foodborne illness in online restaurant review.

The two stemmers produce quite different results; the Porter stemmer is designed to collapse word variants into a single form (largely by stripping word endings), whereas the Bio tokenizer/stemmer is designed to be very gentle, and it merely stripped the final –s of reviews. The resulting processed text can then be used as an input for further text mining and modeling, e.g., stopword removal and part of speech tagging.

In the case just discussed, the raw text is readily available as a shared resource in open access form (from MEDLINE, PubMed, and the publisher). The tools to process short text passages are also openly available in this case (from a query interface maintained at our project website http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/tokenizer.cgi). However, if one were to process long passages – say, all PubMed titles and abstracts – one would need to download the code for the stemmers, set up an input–output environment, and deal with possible bugs and complications such as software dependencies and incompatibilities, as well as the storage, retrieval, and updating of the massive amounts of generated data. In short, even processing text for a single simple task can be time-consuming and requires a lot of fiddling. Once the resulting processed text has been produced and validated, should it not be archived so that it can be reused by others without the need to process the text all over again?

The issue becomes even more compelling for features that require more sophisticated, large-scale modeling to produce and that do not merely process a piece of text based upon the text itself but draw upon external databases and knowledge bases, which may undergo incremental updating over time. In such cases, it may not be feasible for others to attempt to duplicate the modeling or tagging on their own. Rather than distributing the code and back-end databases to users, both of which are quite large and complex and cumbersome to distribute and get running at another site, it is much more efficient to simply provide users with the end results. Indeed, our laboratory has created a suite of such precomputed resources that are freely available online for viewing or download (http://arrowsmith.psych.uic.edu) (Torvik, Smalheiser, Weeber, 2007; Torvik, Weeber, Swanson & Smalheiser, 2005; Torvik & Smalheiser, 2009; Cohen, Smalheiser, McDonagh, Yu, Adams, Davis, Yu, 2015; D’Souza, & Smalheiser, 2014; Smalheiser & Bonifield 2016, 2018).

In this paper, we outline our vision for creating a shared platform that will allow the biomedical text mining community to download (and conversely, to donate) a suite of features that can be used for a variety of machine learning and modeling tasks. This effort is nontrivial, not because of computational issues, but because it is necessary to design the system in a manner that is flexible and scalable, will serve future needs, and will gain general acceptance. One of the factors that might limit acceptance of a shared platform is the fact that there are a plethora of different machine learning approaches, NLP pipelines, and feature sets that are used by different groups. It may not be easy to identify a single core set of features that are likely to be generally used across these various approaches. Therefore, a second component of our vision is to propose a generic approach to machine learning that can take advantage of many potential feature types and their similarity metrics that will have strong potential for reusability.

For simplicity and concreteness, we first describe the generic approach and show how it could support the indexing of articles according to one or more publication types (PTs) (https://www.nlm.nih.gov/mesh/pubtypes.html) such as biography, review, clinical trial, clinical case report and editorial. Then, we indicate how the system can be extended and generalized for other types of text mining projects.

2 Methods

2.1 Overview

The overall framework of our generic approach is to represent each PubMed article as a vector consisting of n multiple metadata features. Each training set is represented as a set, cloud, or cluster of these vectors in n-dimensional space. The distance between any two PubMed articles can be calculated as a weighted sum of the pairwise similarity scores of the underlying features between each PubMed article. Then, the overall distance between a PubMed article and a training set will be some function of the weighted pairwise similarity scores (for each of the articles that make up the training set). Finally, articles can be classified as belonging to one or more categories (depending on the relative distance of an article to the positive vs. negative training sets) or similar articles can be clustered together (the preferred clustering strategy and end point may vary depending on the project).
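One way to write the overview down, offered only as a sketch: the symbols below (the per-feature similarity scores s_k, the weights w_k, and the aggregation function f) are notation introduced here for illustration and do not appear in the original text. The combined score S is a similarity, so the "distance" referred to above would be any decreasing function of it.

```latex
% a, b : two PubMed articles, each described by n features
% s_k  : pairwise similarity score for feature k
% w_k  : weight learned for feature k in a given project
% T    : a training set; f : an aggregation function (e.g., mean, median, or maximum)
\[
  S(a,b) = \sum_{k=1}^{n} w_k \, s_k(a,b),
  \qquad
  S(a,T) = f\bigl(\{\, S(a,t) : t \in T \,\}\bigr)
\]
```

The choice of f is left open in the text; Section 2.5 discusses several candidates.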

2.2 Index each article in PubMed by representing it as a multidimensional set of article features

These features should cover all of the basic types of metadata that are likely to be relevant for text mining researchers. However, in any given project, only a subset of these is likely to be informative (these can be selected either manually or via automated feature selection strategies). Furthermore, the complete set of useful metadata features is likely to expand over time as new techniques are invented and released. The system presented here has no practical limit on the number of metadata feature types, and metadata features can be added to the system and made available for future use at any time.

As a matter of policy, each feature at a minimum should be represented in its most basic “raw” or unprocessed form; if these features are processed or further encoded for the purposes of a specific project, the processed form should be represented as a separate feature. In this manner, it is easy to customize and process features in new ways to meet the demands of new projects. For example, the title of the article (encoded as a string of raw text) would constitute one feature in the feature set. Note that different investigators and different projects call for preprocessing text differently, so that no single or uniform method of preprocessing is likely to satisfy everyone. Thus, the title of the article after processing (e.g., via a particular NLP pipeline of tokenizing, making lower case, stoplisting, and stemming) would be placed as a separate feature in the article feature set. Further processing this form of the title into a “bag of words” encoding with counts for each non-stopword token would form another feature. Each of the basic metadata fields of the PubMed or MEDLINE record (title, abstract, journal, publication date, affiliations, etc.) would be extracted and possibly further processed to give rise to additional components of the article feature set. Altogether, several dozen features may be represented, some representing the same fields but in different ways. The full list of Medical Subject Headings (MeSH) extracted from the record would be one feature; another feature would be the same list but extracting only the major headings (discarding the subheadings) and removing the most frequent MeSH terms via stoplisting (Smalheiser & Bonifield, 2016).
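The policy of storing the raw form and each processed form as separate features can be sketched as follows. This is a minimal illustration, not the platform's actual schema: the feature names (`title_raw`, `title_tokens`, `title_bow`), the tiny stoplist, and the tokenizer are all assumptions made for the example, and stemming is omitted for brevity.

```python
import re
from collections import Counter

# Hypothetical stoplist for illustration only; a real project would supply its own.
STOPWORDS = {"in", "of", "the", "a", "an", "and", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_title_features(raw_title):
    """Return the raw title plus two derived forms, each stored as its own feature."""
    tokens = [t for t in tokenize(raw_title) if t not in STOPWORDS]
    return {
        "title_raw": raw_title,        # the unprocessed form is always kept
        "title_tokens": tokens,        # lowercased, stoplisted tokens
        "title_bow": Counter(tokens),  # bag-of-words counts per token
    }

features = build_title_features(
    "Discovering foodborne illness in online restaurant reviews"
)
print(features["title_tokens"])
```

Because the raw title is never overwritten, a later project can add a differently processed form (for example, one using another stemmer) without disturbing existing features.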

Besides extracting information directly from the metadata as contained in the XML record downloaded from PubMed, some of the article features may be derived from external sources. For example, if one feature is the list of author names on the article, then another feature may be the list of disambiguated author IDs as assigned in the Author-ity author name disambiguation dataset (Torvik & Smalheiser, 2009). The raw list of author names must be kept so that it is possible to identify at least the first author, last author, and middle authors. The associated features may be the frequency of each author name within MEDLINE as a whole, the affiliations associated with each author, etc. Table 1 shows a simplified schema of the article feature vectors used in the Author-ity modeling project, which can be taken as a baseline set that can be extended with additional features that may be relevant for other modeling projects. In the case of classifying articles as randomized controlled trials, we found that the number of authors listed on a paper was a significant feature (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis & Yu, 2015), which is easily encoded from the raw author list feature and then stored as a processed feature set.
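As a small illustration of deriving processed features from the raw author list, the sketch below computes the author count and author positions. The function name, the output keys, and the placeholder author names are hypothetical; a single author is treated here as both first and last author, which is an assumption rather than a rule stated in the text.

```python
def author_list_features(authors):
    """Derive simple processed features from the raw, ordered author list."""
    n = len(authors)
    return {
        "num_authors": n,
        "first_author": authors[0] if n else None,
        "last_author": authors[-1] if n else None,
        "middle_authors": authors[1:-1],  # empty when there are fewer than three authors
    }

# Placeholder names, not real authors.
example = author_list_features(["Author A", "Author B", "Author C", "Author D"])
print(example["num_authors"], example["middle_authors"])
```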

Table 1

Article metadata used in the Author-ity author name disambiguation project feature set (simplified and updated from refs. 1 and 2).

Each article feature will be held in a central database where each one can potentially be called upon to create further processed and simplified, specific article features designed for a given purpose. Ideally, groups that use article feature information and customize it should donate the customized version(s) back to the database, so that these features can be used by others. For example, customized stoplists or text that has been processed by specific tokenizer/stemmers should be archived in the database so that they can be reused by others for processing text. This saves both time and effort, as well as contributing to reproducibility and allowing for very detailed specific comparison experiments to be performed easily and precisely.

2.3 Create pairwise similarity vectors that compare article features across any two PubMed articles

For any pair of articles, most, if not all, of the article features can be compared and scored for similarity. A collection of these similarity features can be represented as a vector and used to compute an overall paired article similarity score. Generally, similarity can be computed in more than one way. For example, the titles of two articles can be scored in terms of how many words they share using raw text; one might count shared words using stoplisted, stemmed text; or one might do a weighted counting, in which rare words are counted more heavily than frequent ones. Table 2 shows a simplified schema of the pairwise similarity measures used in the Author-ity modeling project. We anticipate that a few of the more popular pairwise similarity schemes will be implemented as part of the pairwise similarity vector for two articles. As other investigators utilize other similarity schemes for a given article feature, the scripts for processing them should be donated back so that the option can be implemented by others at will.
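The three ways of scoring title similarity mentioned above can be sketched as follows. This is an illustration under stated assumptions: the stoplist, the document-frequency table, and the corpus size are invented for the example, stemming is left out, and the logarithmic (IDF-style) weighting is one common way to count rare words more heavily, not necessarily the scheme used in Author-ity.

```python
import math
import re

STOPWORDS = {"in", "of", "the", "a", "an", "and", "for"}  # hypothetical stoplist

def tokens(text):
    """Return the set of lowercased alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def shared_raw(title_a, title_b):
    """Number of distinct words shared by the two raw titles."""
    return len(tokens(title_a) & tokens(title_b))

def shared_stoplisted(title_a, title_b):
    """Number of shared words after removing stopwords."""
    return len((tokens(title_a) & tokens(title_b)) - STOPWORDS)

def shared_weighted(title_a, title_b, doc_freq, n_docs):
    """Sum of log(n_docs / document frequency) over shared non-stopwords."""
    shared = (tokens(title_a) & tokens(title_b)) - STOPWORDS
    return sum(math.log(n_docs / doc_freq.get(w, 1)) for w in shared)

a = "Discovering foodborne illness in online restaurant reviews"
b = "Foodborne illness outbreaks in the restaurant industry"
doc_freq = {"foodborne": 10, "illness": 100, "restaurant": 50}  # invented counts
print(shared_raw(a, b), shared_stoplisted(a, b))
print(shared_weighted(a, b, doc_freq, n_docs=1000))
```

Each function would populate one slot of the pairwise similarity vector, so a project can pick whichever variants prove informative.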

Table 2

Pairwise similarity measures employed in the Author-ity author name disambiguation project similarity vector (simplified and updated from refs. 1 and 2).

In any specific project, each article will be represented by only a subset of features, and each article pair will be represented by only a subset of the possible pairwise similarity measures. For example, in the Author-ity author name disambiguation project, we consider the likelihood that two articles are written by the same individual – they may tend to share similar co-authors, journals, and affiliations, among other relevant features. However, if we consider two clinical case reports describing a similar condition, there is no reason to think that they will share co-authors, journals, or affiliations; rather, shared title terms, MeSH terms, and MeSH term pairs (Smalheiser & Bonifield, 2016) are likely to be important. The point here is to make it easy to create and compute pairwise similarity vectors for any given project, drawing from a larger pool both of individual article-based features and potential pairwise similarity schemes.

2.4 A machine learning algorithm is trained that optimally computes the similarity of the two articles in the context of a particular project (Figure 1)

Given a similarity vector representing multiple, heterogeneous measures corresponding to a pair of articles, one needs to “collapse” the vector to a single real number that represents the overall paired article similarity in the context of the given task, in order to have a single value that can be used for clustering similar articles together. This may be done in many ways, but perhaps the simplest method is to compute the similarity value as a weighted sum of the pairwise similarity scores. These weights can be determined using a machine learning algorithm, such as a support vector machine, logistic regression, or a neural network, which is trained on appropriately labeled data. Given a sufficient number of data samples, the labels can be somewhat noisy without degrading performance of the model (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis & Yu, 2015; Agarwal, Podchiyska, Banda, Goel, Leung, Minty,... & Shah, 2016; Aslam & Decatur, 1996).
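A minimal sketch of learning the weights, using logistic regression (one of the options named above) fitted by plain batch gradient descent with only the standard library. The three similarity features, the toy training pairs, and the learning-rate and epoch settings are all invented for illustration; a real project would more likely use an established machine learning package.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_weights(pairs, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights and a bias by batch gradient descent."""
    n, m = len(pairs[0]), len(pairs)
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(pairs, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for k in range(n):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

def pair_similarity(x, w, b):
    """Collapse a similarity vector into one value in (0, 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy similarity vectors: [title_sim, mesh_sim, journal_sim]; label 1 = same class.
pairs = [[0.9, 0.8, 1.0], [0.7, 0.9, 0.0], [0.8, 0.6, 1.0],
         [0.1, 0.2, 0.0], [0.2, 0.1, 1.0], [0.3, 0.0, 0.0]]
labels = [1, 1, 1, 0, 0, 0]
w, b = train_weights(pairs, labels)
print(pair_similarity([0.85, 0.9, 1.0], w, b))
print(pair_similarity([0.1, 0.1, 0.0], w, b))
```

In this toy data the journal feature is uninformative, so the fitted model leans on the title and MeSH scores; that is the sense in which the weights are learned per project.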

Figure 1

Overall process flow for the generic article tagger architecture.

Ideally, for each project, one should define sets of articles that comprise positive and/or negative training sets for machine learning. To define a positive training set, we pull a set of articles and their associated article feature vectors that have some desired property, that is, they are “similar” in a manner we are interested in. For example, for author name disambiguation, we might take a set of articles known to be authored by a particular individual; for training a model to identify randomized controlled trials, we may take as the positive set those articles that have been manually indexed by MEDLINE as randomized controlled trials (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis & Yu, 2015). The negative training set may consist of all articles not in the positive training set. Alternatively, in a multitask classification project, there may be a series of different positive training sets, each positive for a different class, so that each positive training set is contrasted against each of the others. Each PubMed article may then be assigned to one of a series of positive training sets based on its PT: the article belongs to the positive training set for its own PT and is contrasted against the training sets for the other PTs.

One way to visualize this scheme is shown in Figure 2. Each of the articles in PubMed is represented as a point in a multidimensional article feature space. Each positive set comprises a cloud of such points. The cloud may be more or less cohesive – its points may or may not cluster tightly around a single centroid, though a good positive training set ought to be relatively cohesive. The machine learning objective is to compute pairwise similarity measures for each pair of articles, such that any given article that is in a positive set will be, on average, “closer” to other articles that are in the same positive set than to articles that are in the negative set (or other positive sets). One chooses some machine learning framework (e.g., SVM) and trains the model to adjust the weightings on the similarity vectors so as to minimize the average distance between the members of a positive set and maximize the average distance between the members of the positive set and the members of the negative set (or other positive sets). In this manner, one learns the optimal weightings on the different similarity measures that make up the pairwise similarity vectors, and these weightings are used to compute a single similarity value for any pair of articles.

Figure 2

Assignment of publication types to an arbitrary article by optimized cluster distance. The unclassified article (red circle) may be assigned any combination of publication types A, B, and C or none of these, based on the distance in the similarity vector space from the unclassified article to each of the publication type clusters.

2.5 Using the learned similarity metric for article pairs, articles can be classified as belonging to one or more categories or similar articles can be clustered together (the preferred clustering strategy and end point may vary depending on the project)

Having trained the machine learning model as described earlier, to classify any new article (not in the training sets), one computes the similarity values pairwise between that article and all the articles in the positive set and between that article and all the articles in the negative set. This gives a distribution of similarity values for the positive set vs. the negative set (or each of the other positive sets). Then, one can ask, for this article, on average, which training set is it closest to? (see Figure 2). Depending on the nature of the classification task, one might assign the article to the closest positive set (or possibly to more than one positive set, if it is about equally close to more than one) or to the negative set, if an article is not sufficiently close to any of the positive set clouds. Rather than binary (yes/no) classification, it is also possible to assign a probability of belonging to a given class (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis, Yu, 2015; Niculescu-Mizil, Caruana, 2005).
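The assignment rule just described can be sketched as follows. Everything specific here is an assumption made for illustration: Jaccard overlap on word sets stands in for the learned pairwise similarity, and the thresholds `min_sim` and `tie_margin` are hypothetical parameters controlling "sufficiently close" and "about equally close".

```python
def jaccard(a, b):
    """Stand-in similarity: overlap of two word sets (not the learned metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_similarity(article, training_set, sim):
    """Average similarity of one article to every member of a training set."""
    return sum(sim(article, t) for t in training_set) / len(training_set)

def assign_sets(article, positive_sets, sim, min_sim=0.5, tie_margin=0.05):
    """Return the closest positive set(s), or an empty list if none is close enough."""
    means = {name: mean_similarity(article, members, sim)
             for name, members in positive_sets.items()}
    best = max(means.values())
    if best < min_sim:
        return [], means
    return sorted(n for n, m in means.items() if best - m <= tie_margin), means

positive_sets = {
    "review": [{"review", "literature", "survey"},
               {"review", "systematic", "literature"}],
    "case_report": [{"case", "report", "patient"},
                    {"case", "patient", "rare"}],
}
labels, means = assign_sets({"case", "report", "rare", "patient"},
                            positive_sets, jaccard)
print(labels, means)
```

Returning the per-set means alongside the labels keeps the full distribution available for the probabilistic assignment mentioned above.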

Furthermore, task-specific customization of the assignment algorithm is possible using a number of standard distance-to-cluster measures (Aggarwal & Reddy, 2013), such as closest cluster, closest cluster median, and average cluster member distance. These can be computed very efficiently and easily compared, and an optimal cluster selection method can be chosen for a given task. Furthermore, the cluster selection method can be extended to produce cluster assignment probabilities by incorporating distances to multiple members of each cluster. For example, a K-nearest neighbors strategy could be used here. Another potential approach is to use the final cluster distance to compute a probability of cluster membership directly using a nonlinear transformation such as isotonic regression (Torvik, Weeber, Swanson & Smalheiser, 2005). Alternatively, similarity values across article pairs can also be used for unsupervised clustering to identify groups of articles in a data-driven manner.
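A brief sketch of these distance-to-cluster summaries and of a K-nearest neighbors probability estimate is given below. It is one interpretation rather than a prescribed method: here "median" is taken as the median of the member distances, the distances themselves are invented, and the function names are hypothetical.

```python
import statistics
from collections import Counter

def cluster_distances(distances):
    """Summarize the distances from one article to the members of one cluster."""
    return {
        "closest": min(distances),
        "median": statistics.median(distances),
        "average": statistics.fmean(distances),
    }

def knn_probabilities(labeled_distances, k=3):
    """Estimate membership probabilities from the labels of the k nearest members."""
    nearest = sorted(labeled_distances)[:k]
    counts = Counter(label for _, label in nearest)
    return {label: c / len(nearest) for label, c in counts.items()}

# Invented (distance, cluster label) pairs for a single unclassified article.
labeled = [(0.2, "A"), (0.9, "B"), (0.3, "A"), (0.4, "B"), (0.8, "A")]
summary_a = cluster_distances([d for d, lab in labeled if lab == "A"])
probs = knn_probabilities(labeled, k=3)
print(summary_a, probs)
```

Because each summary is a cheap function of the same list of pairwise distances, several can be computed side by side and compared on held-out data for a given task.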

3 Results and Discussion

Let us first consider the concrete case of assigning PubMed articles automatically (and probabilistically) to one or more PTs. We show that the framework of encoding articles as multidimensional vectors, constructing pairwise similarity vectors for pairs of articles, and computing distances between articles and training sets (Figures 1 and 2) is well suited for this task. We further show the scheme benefits from having a core suite of precomputed pairwise similarity features, which are publicly available from our project website http://arrowsmith.psych.uic.edu.

3.1 Training sets

We take the list of MEDLINE PTs (https://www.nlm.nih.gov/mesh/pubtypes.html) plus additional MeSH that refer to study designs such as case–control studies or cohort studies (https://meshb-prev.nlm.nih.gov/search) as the universe of possible PTs. To ensure that training sets are likely to be adequate in size, we consider only those PTs that have greater than 10,000 articles manually indexed by MEDLINE reviewers as corresponding to that PT. Then, the set of manually indexed articles comprises the positive training set for that PT. There is no explicit negative training set; rather, the different PT positive training sets are contrasted against each other. The end goal is to determine which PT set(s) a given article is most similar to, as measured by its similarity distance to each PT cluster(s). Note that MEDLINE indexing decisions are manually curated but are not entirely consistent or totally accurate (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis, Yu, 2015); thus, this should be regarded as noisy learning rather than fully supervised learning. In addition, note that a single article can belong to more than one PT at the same time. For simplicity, articles that are indexed by more than one PT can be excluded from the training sets. (Alternatively, individual articles can be viewed as comprising a “bag” of multiple PTs (Law, Yu, Urtasun, Zemel, Xing, 2017)).
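The training set construction described above can be sketched as follows, assuming the indexing data is available as a mapping from article identifier to its list of PTs. The function name, the input layout, and the toy identifiers are hypothetical; the size threshold defaults to the 10,000 stated in the text and is lowered in the toy call so the example stays small.

```python
from collections import Counter, defaultdict

def build_pt_training_sets(article_pts, min_size=10000):
    """Keep PTs indexed on more than min_size articles; drop multi-PT articles."""
    # Count every article indexed with each PT, including multi-PT articles.
    counts = Counter(pt for pts in article_pts.values() for pt in set(pts))
    eligible = {pt for pt, c in counts.items() if c > min_size}
    training = defaultdict(list)
    for article_id, pts in article_pts.items():
        # For simplicity, only articles with exactly one PT enter a training set.
        if len(pts) == 1 and pts[0] in eligible:
            training[pts[0]].append(article_id)
    return dict(training)

# Toy data with made-up identifiers.
toy = {1: ["Review"], 2: ["Review"], 3: ["Review", "Editorial"],
       4: ["Editorial"], 5: ["Case Reports"]}
training_sets = build_pt_training_sets(toy, min_size=2)
print(training_sets)
```

Whether the size threshold should be applied before or after removing multi-PT articles is not specified in the text; this sketch applies it before.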

Feature sets. Next, for each article in PubMed, we assign a feature set that includes metadata features extracted from the PubMed XML record (or computed from information contained in the record), which we know (or suspect) may provide information that will help in assigning PTs. The feature set includes a variety of textual features – for example, words that appear in the title and/or in the abstract, as well as low-dimensional vector representations of these words (e.g., implicit term metrics (Smalheiser & Bonifield, 2018) or word2vec neural embeddings (Smalheiser & Bonifield, 2018; Mikolov, Sutskever, Chen, Corrado, Dean, 2013)). The feature set also includes journal name (since PTs are not distributed equally across journals), MeSH, and other features such as number of authors listed on the article (note that reviews are often single authored, whereas clinical trials generally have many author names on each paper). Feature selection may be performed on the basic set to select only those features that have the most utility for discriminating different PTs (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis, Yu, 2015) and to minimize collinearity (Aggarwal & Reddy, 2013; Witten, Frank, Hall, & Pal, 2016).

3.2 Pairwise similarity measures for each feature

Once each article has a feature set to describe it, the next step is to construct a pairwise similarity vector that contains multiple, heterogeneous similarity measures that contribute to the overall similarity of any two articles. Here, each feature is compared pairwise and a feature similarity score is assigned; this is done for each pairwise feature comparison within the similarity vector. The essential requirement is to have monotonicity; that is, for any given feature, a higher similarity score corresponds to a higher probability that the two articles share the same PT. We have precomputed pairwise similarity metrics for journals (D’Souza, & Smalheiser, 2014); MeSH (Smalheiser & Bonifield, 2016); biomedical terms including words, bigrams, trigrams, and abbreviations (Smalheiser & Bonifield 2018); and the title+abstract considered as a single text passage (Smalheiser & Bonifield, 2018). (Except for the title+abstract similarity measures, which are still in the process of being made available, all of these measures can be downloaded from the project website.) These measures cover almost all the pairwise features that are likely to be included in the multitask model, and we have shown that there is limited redundancy between term-based and title+abstract-based similarity measures (Smalheiser & Bonifield, 2018), so that including both types of features is likely to be warranted. We believe this should be a valuable resource for the biomedical text mining community (http://arrowsmith.psych.uic.edu) and should encourage others to use these features in their own modeling (Mohammadi, Kylasa, Kollias & Grama, 2016).
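The monotonicity requirement can be checked empirically on labeled article pairs. The sketch below bins a similarity score and tests whether the fraction of same-PT pairs is non-decreasing across bins; the binning scheme, the assumption that scores lie in [0, 1], and the toy data are all choices made for illustration.

```python
def same_pt_rate_by_bin(scored_pairs, n_bins=4):
    """Fraction of same-PT pairs per score bin; None marks an empty bin."""
    hits, totals = [0] * n_bins, [0] * n_bins
    for score, same_pt in scored_pairs:
        i = min(int(score * n_bins), n_bins - 1)  # scores assumed to lie in [0, 1]
        totals[i] += 1
        hits[i] += int(same_pt)
    return [h / t if t else None for h, t in zip(hits, totals)]

def is_monotone(rates):
    """True if the observed rates never decrease from one bin to the next."""
    observed = [r for r in rates if r is not None]
    return all(a <= b for a, b in zip(observed, observed[1:]))

# Invented (similarity score, same PT?) observations for one feature.
scored = [(0.1, False), (0.2, False), (0.3, False), (0.4, True),
          (0.6, True), (0.6, False), (0.9, True), (0.8, True)]
rates = same_pt_rate_by_bin(scored)
print(rates, is_monotone(rates))
```

A feature that fails such a check could be transformed or dropped before it is added to the similarity vector.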

3.3 Optimizing the weighted similarity metric for one article to another and for one article to a training set

The next goal is to learn how to estimate the weighting of the different similarity scores in the pairwise similarity vector, to estimate the overall similarity of any two articles. We examine each PT in turn, and for each article in the positive training set for this PT, we use machine learning to train a model that minimizes the pairwise distance of this article to the other articles in the same training set, while maximizing the distance of this article to the articles in the other PTs. A variety of machine learning methods could be explored, e.g., SVMs (linear or nonlinear), isotonic regression, random forests, or neural networks.

Note that in the abovementioned description, each article in PubMed is assigned a single feature set, yet it is possible that each PT may utilize a different optimized similarity vector and weighting for comparing any two articles; for example, the optimal weighting scheme for discriminating review articles from clinical case reports may be different from the optimal weighting scheme for discriminating review articles from editorials. Another alternative approach is to customize the feature set for each PT training set individually for making similarity comparisons. For example, the word “randomized” has a high discriminative value when assigning articles to randomized controlled trials, whereas the word “cohort” has a high value when assigning articles to cohort studies. Given any PubMed article, it might be selectively compared for similarity against discriminative terms such as “randomized” when comparing the article to the randomized controlled trial training set but compared for similarity against terms such as “cohort” when comparing the article to the cohort studies training set. This is a topic that will require further research. Similarly, as discussed in the Methods section, given a list of distances from one article to a given PT training set, it is an open question how best to compute an “overall” distance. A popular choice is to represent the entire training set by its centroid, but this may not be appropriate if the training set is not coherent or if one is using a nonlinear similarity metric instead of a simple weighted sum of feature similarity scores.

The similarity-based multitask framework described here differs from our previous method (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis, Yu, 2015) to estimate the probability that a given article is a randomized controlled trial. Our previous study of classifying randomized controlled trials used features derived directly from metadata – for example, title bigrams. In contrast, the present strategy formulates implicit features that measure similarity of the feature in a pair of articles. The use of implicit features increases the coverage and robustness of the modeling. For example, when using title bigrams as a direct feature, if an article does not have any title bigrams that are mentioned in the positive training set, the feature is scored as 0. However, using implicit term similarity metrics (Smalheiser & Bonifield, 2018), a title bigram that is not identical to that found in the positive training set will still receive a partial similarity score. Another difference in our new approach is that previously, we simply asked whether a given article was more similar to the positive training set vs. a single negative training set (consisting of all other human-related articles in PubMed), but here, we propose comparing an article’s similarity to multiple positive training sets, each one much smaller and more coherent than the previous (large, heterogeneous) negative training set. This should make the model more sensitive, while also allowing the model to assign one article to more than one PT if several PTs have similarly high similarity values.
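The contrast between a direct bigram feature and an implicit similarity feature can be illustrated with a toy example. The term-similarity table and its value below are invented; in practice the scores would come from a precomputed implicit term metric, and the function names are hypothetical.

```python
def direct_match(bigram, training_bigrams):
    """Direct feature: 1 only when the exact bigram occurs in the training set."""
    return 1.0 if bigram in training_bigrams else 0.0

def implicit_match(bigram, training_bigrams, term_sim):
    """Implicit feature: best similarity to any training bigram (partial credit)."""
    if bigram in training_bigrams:
        return 1.0
    return max(
        (term_sim.get(frozenset((bigram, t)), 0.0) for t in training_bigrams),
        default=0.0,
    )

training_bigrams = {"randomized trial", "placebo controlled"}
# Invented similarity score for one pair of related bigrams.
term_sim = {frozenset(("randomised trial", "randomized trial")): 0.9}

print(direct_match("randomised trial", training_bigrams))
print(implicit_match("randomised trial", training_bigrams, term_sim))
```

The spelling variant scores 0 as a direct feature but retains most of its value as an implicit one, which is the gain in coverage described above.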

3.4 Can this framework be generalized to other biomedical text mining tasks?

Although the use of implicit similarity metrics is common in certain machine learning applications (e.g., image analysis and bioinformatics), in our experience, the pairwise similarity-based approach we propose here has been less commonly used for biomedical text mining projects. Certainly, the list of our currently precomputed similarity features is not exhaustive. For example, a set of PubMed articles can be subjected to topic modeling and the articles can then be represented as a weighted vector of these topics (Hashimoto, Kontonatsios, Miwa & Ananiadou, 2016). In addition, two text passages can be assessed in terms of their string similarity (Mrabet, Kilicoglu & Demner-Fushman, 2017). Author name matches on first name (with partial matches given for nicknames), middle initial, and suffix are features important for author name disambiguation (Torvik, Weeber, Swanson & Smalheiser, 2005; Torvik & Smalheiser, 2009) and other tasks (Tables 1 and 2). Therefore, we envision our project website as comprising an open repository, wherein outside groups can not only utilize our existing resources but also donate their own processed features and similarity metrics (subject to evaluation and space limitations). We have added a Repository of Processed Text and Resources page to the project website that encourages others to donate their processed text and features back to us so that we can integrate them into our suite and host them publicly. This is not unlike the UCSC Genome Browser (https://genome.ucsc.edu/) where any group can donate tracks, download tracks, or juxtapose their own private custom tracks on public data.

Our approach is compatible with other text mining frameworks, such as PubRunner (Anekalla, Courneya, Fiorini, Lever, Muchow & Busby, 2017), for updating processed citations with the latest PubMed entries, and the many available text processing toolkits, which can be used to process raw article metadata into processed feature sets, e.g., the NLTK (http://www.nltk.org/), Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/), and Apache OpenNLP (http://opennlp.apache.org/). The approach is also amenable to implementation on large-scale parallel processing data analytic systems, such as Apache Spark (https://spark.apache.org/), which includes parallel implementations of several machine learning algorithms including SVM (Meng, Bradley, Yavuz, Sparks, Venkataraman, Liu, ... & Xin, 2016; Shanahan & Dai, 2015).

Having a central infrastructure repository of metadata features and similarity measures can benefit the broader biomedical community of investigators as well. One can envision that specialized (perhaps proprietary) NLP tools can be run on PubMed articles as a public service and the results can be stored publicly so that end-users can utilize the results without having to acquire or learn how to use the tools themselves. For example, RobotReviewer (Marshall, Kuiper & Wallace, 2015) processes clinical trial articles to identify the clinical populations and interventions studied in the trial (among other things). If one were to store the results as metadata attached to the articles, then teams writing systematic reviews could obtain the results of tools such as RobotReviewer without needing to process articles themselves.
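One way to picture tool outputs stored as metadata attached to articles is a simple store keyed by PubMed identifier and tool name. The sketch below is an assumption made for illustration: the class, the field names, and the example identifier and values are hypothetical and do not reflect the actual output format of RobotReviewer or the design of our repository.

```python
from collections import defaultdict

class AnnotationStore:
    """Toy in-memory store of precomputed tool outputs, keyed by PMID and tool name."""

    def __init__(self):
        self._by_pmid = defaultdict(dict)

    def add(self, pmid, tool, result):
        """Attach one tool's output to an article, replacing any earlier output."""
        self._by_pmid[pmid][tool] = result

    def get(self, pmid, tool):
        """Return the stored output, or None if the tool was never run on the article."""
        return self._by_pmid.get(pmid, {}).get(tool)

    def tools_for(self, pmid):
        """Names of all tools with stored output for an article, in sorted order."""
        return sorted(self._by_pmid.get(pmid, {}))

store = AnnotationStore()

# Hypothetical PMID and hypothetical output fields, for illustration only.
store.add("00000001", "example_trial_tool", {
    "population": "adults with type 2 diabetes",
    "interventions": ["metformin", "placebo"],
})

# An end-user retrieves the stored result without running the tool.
result = store.get("00000001", "example_trial_tool")
print(result["interventions"])
```

A production repository would need persistent storage, versioning of tool outputs, and provenance information; those concerns are outside the scope of this sketch.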

It may be argued that our emphasis on detailed feature engineering is old-fashioned, and even obsolete, in the face of recent advances in deep learning. Deep learning can theoretically learn the most relevant features and detect higher-order associations automatically. However, this depends on having enough data (billions of points) and enough underlying deep architecture, both of which lie beyond the scope of most deep learning frameworks being studied in biomedical text mining today. Moreover, deep learning is not certain to capture all the relevant implicit associations anyway, especially those that draw upon external reference data from the UMLS or other knowledge bases.

3.5 Extension to full-text features

We have emphasized the use of metadata as features for text mining, in part because the full text of biomedical articles has not been generally accessible. The PubMed Central Open Access dataset currently contains 1.8 million full-text articles available for download in the XML format. This can be augmented further by precomputing features that can be archived for use by others. For example, Europe PMC (https://europepmc.org/advancesearch) has delineated article section boundaries in full-text articles, so that searches can be restricted to (say) Introduction vs. Methods vs. Results. Certain additional features have been extracted and annotated for full-text articles by Europe PMC, including some named entities, methods, and software cited. In the long term, we are interested in extracting new features from full text and using them to create new similarity metrics. For example, a biomedical article is likely to have only one (or a few) main points; knowing the main points is much more specific and informative than knowing its general topic. Thus, we would like to carry out text mining analyses to identify the main point(s) of any given article. In turn, it would be desirable to compare any two articles to see how similar their main points are. The main points can be attached as metadata to the articles in our repository, and the similarity metrics can be precomputed as resources, both for text mining efforts and for end-users seeking to retrieve biomedical articles based on their main findings (and not just their topics).


Our studies are supported by NIH grants R01LM10817 and P01AG03934. We thank Sophia Ananiadou for discussions about ways to share NLP tools and their products with end-users.


  • Simpson, M. S., & Demner-Fushman, D. (2012). Biomedical text mining: a survey of recent progress. In Mining text data (pp. 465-517). Springer US.

  • Przybyła, P., Shardlow, M., Aubin, S., Bossy, R., Eckart de Castilho, R., Piperidis, S., … & Ananiadou, S. (2016). Text mining resources for the life sciences. Database, 2016(0), baw145.

  • Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014, June). The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations) (pp. 55-60).

  • Savova, G. K., Masanz, J. J., Ogren, P. V., Zheng, J., Sohn, S., Kipper-Schuler, K. C., & Chute, C. G. (2010). Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association: JAMIA, 17(5), 507–513. http://doi.org/10.1136/jamia.2009.001560

  • Batista-Navarro R., Carter J., & Ananiadou S. (2016) Argo: enabling the development of bespoke workflows and services for disease annotation. Database (Oxford). May 17;2016. pii: baw066.

  • Clarke, J., Srikumar, V., Sammons, M., & Roth, D. (2012). An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines). In LREC (pp. 3276-3283).

  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD explorations newsletter, 11(1), 10-18.

  • Zeng, X. & Luo, G. Progressive sampling-based Bayesian optimization for efficient and automatic machine learning model selection Health Inf Sci Syst (2017) 5: 2. https://doi.org/10.1007/s13755-017-0023-z

  • Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), pp. 130-137.

  • Torvik VI, Smalheiser NR, Weeber, M. 2007. A simple Perl tokenizer and stemmer for biomedical text. Unpublished technical report, accessed January 15, 2018 from http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/tokenizer.cgi

  • Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the Association for Information Science and Technology, 56(2), 140-158.

  • Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(3), 11.

  • Cohen, A. M., Smalheiser, N. R., McDonagh, M. S., Yu, C., Adams, C. E., Davis, J. M., & Yu, P. S. (2015). Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine. Journal of the American Medical Informatics Association, 22(3), 707-717.

  • D’Souza, J. L., & Smalheiser, N. R. (2014). Three journal similarity metrics and their application to biomedical journals. PloS one, 9(12), e115681.

  • Smalheiser, N. R., & Bonifield, G. (2016). Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. Journal of biomedical discovery and collaboration, 7.

  • Smalheiser, N. R., & Bonifield, G. (2018). Unsupervised Low-Dimensional Vector Representations for Words, Phrases and Text that are Transparent, Scalable, and produce Similarity Metrics that are Complementary to Neural Embeddings. arXiv preprint arXiv:1801.01884.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

  • Agarwal, V., Podchiyska, T., Banda, J. M., Goel, V., Leung, T. I., Minty, E. P., … & Shah, N. H. (2016). Learning statistical models of phenotypes using noisy labeled training data. Journal of the American Medical Informatics Association, 23(6), 1166-1173.

  • Aslam, J. A., & Decatur, S. E. (1996). On the sample complexity of noise-tolerant learning. Information Processing Letters, 57(4), 189-195.

  • Niculescu-Mizil, A., & Caruana, R. (2005, August). Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning (pp. 625-632). ACM.

  • Aggarwal, C. C., & Reddy, C. K. (Eds.). (2013). Data clustering: algorithms and applications. CRC press.

  • Law, M. T., Yu, Y., Urtasun, R., Zemel, R. S., & Xing, E. P. Efficient Multiple Instance Metric Learning using Weakly Supervised Data. http://www.cs.toronto.edu/~zemel/documents/mimlca_cvpr_2017.pdf 

  • Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.

  • Mohammadi, S., Kylasa, S., Kollias, G., & Grama, A. (2016, December). Context-Specific Recommendation System for Predicting Similar PubMed Articles. In Data Mining Workshops (ICDMW), 2016 IEEE 16th International Conference on (pp. 1007-1014). IEEE.

  • Hashimoto, K., Kontonatsios, G., Miwa, M., & Ananiadou, S. (2016). Topic detection using paragraph vectors to support active learning in systematic reviews. Journal of biomedical informatics, 62, 59-65.

  • Mrabet Y, Kilicoglu H, Demner-Fushman D. TextFlow: A Text Similarity Measure based on Continuous Sequences. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2017 (Vol. 1, pp. 763-772).

  • Anekalla KR, Courneya JP, Fiorini N, Lever J, Muchow M, Busby B. PubRunner: A light-weight framework for updating text mining results. F1000Res. 2017;6:612.

  • Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., … & Xin, D. (2016). Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1), 1235-1241.

  • Shanahan, J. G., & Dai, L. (2015, August). Large scale distributed data science using apache spark. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2323-2324). ACM.

  • Marshall, I. J., Kuiper, J., & Wallace, B. C. (2015). RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association, 23(1), 193-201.

About the article

Received: 2017-12-08

Accepted: 2018-02-11

Published Online: 2018-05-22

Citation Information: Data and Information Management, Volume 2, Issue 1, Pages 27–36, ISSN (Online) 2543-9251, DOI: https://doi.org/10.2478/dim-2018-0004.

© 2018 Neil R. Smalheiser, Aaron M. Cohen, published by Sciendo. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License. BY-NC-ND 3.0
