
Acta Universitatis Sapientiae, Informatica

The Journal of "Sapientia" Hungarian University of Transylvania

2 Issues per year

Open Access
Online ISSN: 2066-7760

Finding sequential patterns with TF-IDF metrics in health-care databases

Zsolt T. Kardkovács / Gábor Kovács
Published Online: 2015-01-27 | DOI: https://doi.org/10.1515/ausi-2015-0008

Abstract

Finding frequent sequential patterns has been defined as finding ordered lists of items that occur more times in a database than a user-defined threshold. For big and dense databases that contain very long sequences and large itemsets, such as medical case histories, algorithms based on this idea of counting occurrences output an enormous number of highly redundant frequent sequences and are therefore impractical. Hence, there is a need for algorithms that perform frequent pattern search and prefiltering simultaneously. In this paper, we propose an algorithm that reinterprets the term support on a text mining basis. Experiments show that our method not only eliminates redundancy among the output sequences, but also scales much better with huge input data sizes. We apply our algorithm to mining medical databases, asking which diagnoses are likely to lead to a certain future health condition.
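The classical, count-based notion of support that the abstract starts from can be illustrated with a minimal sketch. It shows only the baseline definition (a pattern is frequent if it occurs as an ordered subsequence in more sequences than a threshold), not the TF-IDF-based reinterpretation proposed in the paper, whose details are not given here. The function names and the toy database of single-letter codes are hypothetical, chosen for illustration.

```python
from typing import Hashable, Sequence


def is_subsequence(pattern: Sequence[Hashable], sequence: Sequence[Hashable]) -> bool:
    """Return True if the items of `pattern` appear in `sequence` in the same order
    (not necessarily adjacent)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator until it finds `item`,
    # so each later pattern item must occur after the previous match.
    return all(item in remaining for item in pattern)


def support(pattern: Sequence[Hashable], database: Sequence[Sequence[Hashable]]) -> int:
    """Count the sequences in `database` that contain `pattern` as a subsequence."""
    return sum(1 for sequence in database if is_subsequence(pattern, sequence))


def is_frequent(pattern: Sequence[Hashable],
                database: Sequence[Sequence[Hashable]],
                threshold: int) -> bool:
    """Classical definition: frequent means occurring more times than `threshold`."""
    return support(pattern, database) > threshold


# Hypothetical toy database: each inner list is one ordered case history.
toy_database = [
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "A", "C"],
    ["C", "A"],
]

print(support(["A", "C"], toy_database))         # 3
print(is_frequent(["A", "C"], toy_database, 2))  # True
print(is_frequent(["B", "C"], toy_database, 2))  # False
```

In a dense database every subsequence of a frequent pattern is itself frequent under this definition, which is one way to see why pure occurrence counting yields the redundant output the abstract describes.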

Keywords: sequence mining; frequent sequential pattern; TF-IDF; health care database

About the article

Received: 2014-09-11

Revised: 2014-11-10

Published Online: 2015-01-27

Published in Print: 2014-12-01


Citation Information: Acta Universitatis Sapientiae, Informatica, ISSN (Online) 2066-7760, DOI: https://doi.org/10.1515/ausi-2015-0008.


© 2015. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License (CC BY-NC-ND 3.0).
