High Performance Data Mining

  • Conference paper
High Performance Computing for Computational Science — VECPAR 2002 (VECPAR 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2565)

Abstract

Recent years have seen explosive growth in the availability of many kinds of data, creating an unprecedented opportunity to develop automated, data-driven techniques for extracting useful knowledge. Data mining, an important step in this process of knowledge discovery, consists of methods that discover interesting, non-trivial, and useful patterns hidden in the data [SAD+93, CHY96]. The field builds on ideas from diverse areas such as machine learning, pattern recognition, statistics, database systems, and data visualization. However, techniques developed in these traditional disciplines are often unsuitable because of unique characteristics of today’s data sets, such as their enormous size, high dimensionality, and heterogeneity. Effective parallel algorithms are therefore needed for various data mining techniques. Designing such algorithms is challenging, and the main focus of this paper is a description of the parallel formulations of two important data mining algorithms: discovery of association rules and induction of decision trees for classification. We also briefly discuss an application of data mining to the analysis of large data sets collected by Earth-observing satellites, which must be processed to better understand global-scale changes in biosphere processes and patterns.

This work was supported by NSF CCR-9972519, by NASA grant # NCC 2 1231, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE grant LLNL/DOE B347714, and by Army High Performance Computing Research Center cooperative agreement number DAAD19-01-2-0014. Access to computing facilities was provided by AHPCRC and the Minnesota Supercomputer Institute. Related papers are available via WWW at URL: http://www.cs.umn.edu/~Rkumar.

References

  1. R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Eng., 5(6):914–925, December 1993.

  2. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of 1993 ACM-SIGMOD Int. Conf. on Management of Data, Washington, D.C., 1993.

  3. R. Agrawal and J.C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Eng., 8(6):962–969, December 1996.

  4. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487–499, Santiago, Chile, 1994.

  5. J. Chattratichat, J. Darlington, M. Ghanem, Y. Guo, H. Huning, M. Kohler, J. Sutiwaraphun, H.W. To, and D. Yang. Large scale data mining: Challenges and responses. In Proc. of the Third Int’l Conference on Knowledge Discovery and Data Mining, 1997.

  6. M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Eng., 8(6):866–883, December 1996.

  7. D. Michie, D. J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.

  8. S. Goil, S. Aluru, and S. Ranka. Concatenated parallelism: A technique for efficient parallel divide and conquer. In Proc. of the Symposium of Parallel and Distributed Computing (SPDP’96), 1996.

  9. D.E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.

  10. R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu. Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.

  11. E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In Proc. of 1997 ACM-SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, 1997.

  12. E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Eng., 12(3), May/June 2000.

  13. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.

  14. D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.

  15. M.V. Joshi, E.-H. Han, G. Karypis, and V. Kumar. Efficient parallel algorithms for mining associations. In M. J. Zaki and C.-T. Ho, editors, Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence (LNCS/LNAI), volume 1759. Springer-Verlag, 2000.

  16. M.V. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proc. of the International Parallel Processing Symposium, 1998.

  17. M.V. Joshi, G. Karypis, and V. Kumar. Universal formulation of sequential patterns. Technical Report TR 99-021, Department of Computer Science, University of Minnesota, Minneapolis, 1999.

  18. R. Kufrin. Decision trees on parallel processors. In J. Geller, H. Kitano, and C. B. Suttner, editors, Parallel Processing for Artificial Intelligence 3. Elsevier Science, 1997.

  19. R. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, 4(2):4–22, April 1987.

  20. M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. of the Fifth Int’l Conference on Extending Database Technology, Avignon, France, 1996.

  21. R. A. Pearson. A coarse grained parallel induction heuristic. In H. Kitano, V. Kumar, and C.B. Suttner, editors, Parallel Processing for Artificial Intelligence 2, pages 207–226. Elsevier Science, 1994.

  22. J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

  23. J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of the 22nd VLDB Conference, 1996.

  24. A. Srivastava, E.-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237–261, September 1999.

  25. M. Steinbach, P. Tan, V. Kumar, S. Klooster, and C. Potter. Temporal data mining for the discovery and analysis of ocean climate indices. In KDD Workshop on Temporal Data Mining (KDD’2002), Edmonton, Alberta, Canada, 2002.

  26. M. Stonebraker, R. Agrawal, U. Dayal, E. J. Neuhold, and A. Reuter. DBMS research at a crossroads: The Vienna update. In Proc. of the 19th VLDB Conference, pages 688–692, Dublin, Ireland, 1993.

  27. P. Tan, M. Steinbach, V. Kumar, S. Klooster, C. Potter, and A. Torregrosa. Finding spatio-temporal patterns in earth science data. In KDD Workshop on Temporal Data Mining (KDD’2001), San Francisco, California, 2001.

  28. M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency (Special Issue on Data Mining), December 1999.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kumar, V., Joshi, M.V., Han, EH., Tan, PN., Steinbach, M. (2003). High Performance Data Mining. In: Palma, J.M.L.M., Sousa, A.A., Dongarra, J., Hernández, V. (eds) High Performance Computing for Computational Science — VECPAR 2002. VECPAR 2002. Lecture Notes in Computer Science, vol 2565. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36569-9_8

  • DOI: https://doi.org/10.1007/3-540-36569-9_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00852-1

  • Online ISBN: 978-3-540-36569-3

  • eBook Packages: Springer Book Archive
