Owing to the huge amount of textual data available on the World Wide Web and in textual databases such as PubMed, the development of novel automatic keyphrase extraction methods has emerged as one of the key research problems in the recent past. Consequently, a number of machine learning techniques, mostly supervised, have been proposed to extract keyphrases from text documents. However, one of the main bottlenecks hindering the success of such systems is their requirement of annotated corpora for training. In this paper, we propose the design of a deep text mining system to identify keyphrases in text documents that are either unstructured or semi-structured in nature. The novelty of our system lies in its applicability to a single document: it identifies the keyphrases embedded within that document without demanding a collection of annotated texts for training. The proposed system applies parsing techniques to identify candidate phrases. After the original set of candidate phrases is mapped into a low-dimensional space using Singular Value Decomposition (SVD), the Markov Clustering (MCL) technique is applied to cluster related sentences together. Finally, considering each cluster as a document, Latent Dirichlet Allocation (LDA) is applied to identify feasible keyphrases, which are presented to users in non-increasing order of their relevance scores. The efficacy of the proposed system is established through experimentation on datasets from two different domains. In a comparative evaluation, we found that the proposed system outperforms KEA and KEA, which apply a supervised machine learning approach to automatic keyphrase extraction from text documents.
de Gruyter 2011
This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
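The Markov Clustering step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the related sentences are given as a symmetric 0/1 similarity graph, and the function names, the inflation value of 2.0, and the toy graph are all hypothetical choices made for illustration. The sketch alternates expansion (squaring the column-stochastic matrix) and inflation (element-wise powering followed by column re-normalization) until the matrix stops changing, then reads clusters off the non-zero rows.

```python
from typing import List

Matrix = List[List[float]]


def normalize_columns(m: Matrix) -> Matrix:
    """Rescale every column so that it sums to 1 (column-stochastic)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m: Matrix) -> Matrix:
    """Expansion step: multiply the matrix by itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m: Matrix, r: float) -> Matrix:
    """Inflation step: raise entries to the power r, then re-normalize."""
    return normalize_columns([[x ** r for x in row] for row in m])


def markov_cluster(adjacency: List[List[int]], inflation: float = 2.0,
                   max_iter: int = 50, tol: float = 1e-9) -> List[List[int]]:
    """Cluster graph nodes with a basic MCL loop (illustrative only)."""
    n = len(adjacency)
    # Self-loops are added so that every column has a non-zero sum.
    m = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Each non-zero row belongs to an attractor; its non-zero columns
    # are the members of that attractor's cluster.
    clusters = []
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [sorted(c) for c in clusters]


# Hypothetical sentence graph: two groups of three mutually related sentences.
toy_graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(toy_graph))  # prints [[0, 1, 2], [3, 4, 5]]
```

In a pipeline like the one described above, each resulting cluster of sentences would then be treated as a single document for the subsequent topic-modelling step. A higher inflation value generally yields finer-grained clusters, so it is a parameter that would need tuning on real data.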