Open Computer Science

Editor-in-Chief: van den Broek, Egon

1 Issue per year

Open Access
Online ISSN: 2299-1093
Clustgrams: an extension to histogram densities based on the minimum description length principle

Panu Luosto
  • Department of Computer Science, University of Helsinki, Helsinki, Finland
Petri Kontkanen
  • Department of Computer Science, University of Helsinki, Helsinki, Finland
  • Helsinki Institute for Information Technology, HIIT, Helsinki, Finland
Published Online: 2011-12-27 | DOI: https://doi.org/10.2478/s13537-011-0033-x

Abstract

Density estimation is one of the most important problems in statistical inference and machine learning. A common approach to the problem is to use histograms, i.e., piecewise constant densities. Histograms are flexible and can adapt to any density given enough bins. However, due to the simplicity of histograms, a large number of parameters and a large sample size might be needed for learning an accurate density, especially in more complex problem instances. In this paper, we extend the histogram density estimation framework by introducing a model called clustgram, which uses arbitrary density functions as components of the density rather than just uniform components. The new model is based on finding a clustering of the sample points and determining the type of the density function for each cluster. We regard the problem of learning clustgrams as a model selection problem and use the theoretically appealing minimum description length principle for solving the task.

Keywords: density estimation; minimum description length (MDL) principle; clustering; histograms
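
The contrast drawn in the abstract between uniform components and other density functions can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names are hypothetical, the clusters are assumed to be given as intervals, the Gaussian component is not truncated to its interval, and a crude two-part code length (negative log-likelihood plus half a log of the cluster size per free parameter) stands in for the MDL criterion, whose exact form the abstract does not state.

```python
import math


def bin_index(x, edges):
    """Index of the bin containing x, or None if x lies outside the edges."""
    if x == edges[-1]:  # the right end point belongs to the last bin
        return len(edges) - 2
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            return i
    return None


def histogram_density(sample, edges):
    """Piecewise-constant density: bin height = count / (n * bin width)."""
    counts = [0] * (len(edges) - 1)
    for x in sample:
        i = bin_index(x, edges)
        if i is not None:
            counts[i] += 1
    n = sum(counts)  # points that fall inside the histogram range
    heights = [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]

    def density(x):
        i = bin_index(x, edges)
        return heights[i] if i is not None else 0.0

    return density


def uniform_code_length(points, lo, hi):
    """Negative log-likelihood (nats) under a uniform density on [lo, hi].

    The interval is taken as given, so no parameter cost is added.
    """
    return len(points) * math.log(hi - lo)


def gaussian_code_length(points):
    """Negative log-likelihood at the ML Gaussian plus a two-parameter penalty.

    Assumption: a BIC-style penalty of (k / 2) * log(m) with k = 2.
    """
    m = len(points)
    mean = sum(points) / m
    var = max(sum((x - mean) ** 2 for x in points) / m, 1e-12)
    nll = 0.5 * m * (math.log(2 * math.pi * var) + 1)
    return nll + 0.5 * 2 * math.log(m)


def choose_component(points, lo, hi):
    """Pick the component type with the shorter (approximate) code length."""
    if uniform_code_length(points, lo, hi) <= gaussian_code_length(points):
        return "uniform"
    return "gaussian"
```

In this sketch, evenly spread points in a cluster favour the uniform component, while points concentrated tightly around one value favour the Gaussian, because the gain in likelihood outweighs the cost of encoding two extra parameters.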

About the article

Published in Print: 2011-12-01


Citation Information: Open Computer Science, ISSN (Online) 2299-1093, DOI: https://doi.org/10.2478/s13537-011-0033-x.

© 2011 Versita Warsaw. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License. BY-NC-ND 3.0
