Skip to main content

A Fast Implementation of the EM Algorithm for Mixture of Multinomials

  • Conference paper
Advanced Data Mining and Applications (ADMA 2006)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4093))

Included in the following conference series:

  • 2942 Accesses

Abstract

We propose several simple techniques which dramatically reduce both the memory demand and computational effort in building multinomial mixture models using the EM algorithm. The reason of the dramatic improvement in performance is that the techniques make use of certain properties of the data. These properties are: the data is sparse and there are many repeating records. We claim that particular sources of data consistently satisfy these properties. Excellent examples are Clickstream and retail data which are very sparse and consist of many repititions. Using simple techniques huge speed-ups and compression rates, on real life clickstream data sets, are observed compared to the standard implementation of the EM.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Zhang, T., Ramakrishnan, R., Livny, M.: Birch: an efficient data clustering method for very large databases. In: SIGMOD 1996: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pp. 103–114. ACM Press, New York (1996)

    Google Scholar 

  2. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1–38 (1977)

    MATH  MathSciNet  Google Scholar 

  3. Moore, A.: Very fast EM-based mixture model clustering using multiresolution kd-trees. In: Kearns, M., Cohn, D. (eds.) Advances in Neural Information Processing Systems, pp. 543–549. Morgan Kaufman, San Francisco (1999)

    Google Scholar 

  4. Bradley, P., Fayyad, U., Reina, C.: Scaling EM clustering to large databases, Microsoft research report, msr-tr-98-35 (1998)

    Google Scholar 

  5. Domingos, P., Hulten, G.: Learning from infinite data in finite time. Advances in Neural Information Processing Systems 14, 673–680 (2002)

    Google Scholar 

  6. Verbeek, J., Nunnink, J., Vlassis, N.: Accelerated EM-based clustering of large datasets. In: Data Mining and Knowledge Discovery (in press, 2006)

    Google Scholar 

  7. Neal, R., Hinton, G.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan, M.I. (ed.) Learning in Graphical Models, Kluwer, Dordrecht (1998)

    Google Scholar 

  8. Cadez, I.V., Smyth, P., Mannila, H.: Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction. In: KDD 2001: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 37–46. ACM Press, New York (2001)

    Google Scholar 

  9. Thiesson, B., Meek, C., Heckerman, D.: Accelerating EM for large databases. Machine Learning 45, 279–299 (2001)

    Article  MATH  Google Scholar 

  10. Blekas, K., Likas, A.: Incremental mixture learning for clustering discrete data. In: Vouros, G.A., Panayiotopoulos, T. (eds.) SETN 2004. LNCS (LNAI), vol. 3025, pp. 210–219. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  11. Cadez, I.V., Heckerman, D., Meek, C., Smyth, P., White, S.: Visualization of navigation patterns on a web site using model-based clustering. In: Knowledge Discovery and Data Mining, pp. 280–284 (2000)

    Google Scholar 

  12. Vakali, A., Pokorný, J., Dalamagas, T.: An overview of web data clustering practices. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds.) EDBT 2004. LNCS, vol. 3268, pp. 597–606. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  13. Nasraoui, O., Zaiane, O.R., Spiliopoulou, M., Mobasher, B., Masand, B., Yu, P.S.: Webkdd 2005: web mining and web usage analysis post-workshop report. SIGKDD Explor. Newsl. 7(2), 139–142 (2005)

    Article  Google Scholar 

  14. Hofgesang, P., Kowalczyk, W.: Analysis clickstream data: From anomaly detection to visitor profiling. In: ECML/PKDD Discovery Challenge 2005 (2005)

    Google Scholar 

  15. Hofgesang, P.: Web usage mining - Structuring enriched clickstream data. Msc. Thesis, Free University Amsterdam, The Netherlands (2005)

    Google Scholar 

  16. UCI-KDD-archive: msnbc.com, University of California, Irvine http://kdd.ics.uci.edu/

  17. Patist, J.P.: Toolbox for multinomial mixture building, www.few.vu.nl/~jpp/multimix.rar

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Patist, J.P. (2006). A Fast Implementation of the EM Algorithm for Mixture of Multinomials. In: Li, X., Zaïane, O.R., Li, Z. (eds) Advanced Data Mining and Applications. ADMA 2006. Lecture Notes in Computer Science(), vol 4093. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11811305_57

Download citation

  • DOI: https://doi.org/10.1007/11811305_57

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37025-3

  • Online ISBN: 978-3-540-37026-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics