High Performance Data Mining

Kumar, Vipin; Joshi, Mahesh V.; Han, Eui-Hong (Sam); Tan, Pang-Ning; Steinbach, Michael

doi:10.1007/3-540-36569-9_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2565)

Included in the following conference series:

International Conference on High Performance Computing for Computational Science

Abstract

Recent years have seen explosive growth in the availability of many kinds of data. This growth has created an unprecedented opportunity to develop automated, data-driven techniques for extracting useful knowledge. Data mining, an important step in this process of knowledge discovery, consists of methods that discover interesting, non-trivial, and useful patterns hidden in the data [SAD+93, CHY96]. The field builds upon ideas from diverse areas such as machine learning, pattern recognition, statistics, database systems, and data visualization. However, techniques developed in these traditional disciplines are often unsuitable because of the unique characteristics of today’s data sets, such as their enormous size, high dimensionality, and heterogeneity. Effective parallel algorithms are therefore needed for various data mining techniques. Designing such algorithms is challenging, and this paper focuses on describing the parallel formulations of two important data mining algorithms: discovery of association rules and induction of decision trees for classification. We also briefly discuss an application of data mining to the analysis of large data sets collected by Earth-observing satellites, which must be processed to better understand global-scale changes in biosphere processes and patterns.
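As a rough illustration of why association rule discovery lends itself to data parallelism, the sketch below counts itemset support over separate partitions of the transactions and then sums the per-partition counts. It is not the parallel formulation described in the paper: the partitions are processed in a plain sequential loop that stands in for processors, every k-item subset is enumerated instead of generating Apriori-style candidates, and all names (`local_counts`, `frequent_itemsets`, the toy transactions) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(partition, k):
    """Count every k-item subset that appears in one partition of transactions."""
    counts = Counter()
    for transaction in partition:
        # Sort the items so the same itemset always yields the same tuple key.
        for itemset in combinations(sorted(set(transaction)), k):
            counts[itemset] += 1
    return counts


def frequent_itemsets(partitions, k, min_support):
    """Sum the per-partition counts and keep itemsets seen at least min_support times."""
    total = Counter()
    for partition in partitions:  # each iteration stands in for one processor
        total.update(local_counts(partition, k))
    return {itemset: n for itemset, n in total.items() if n >= min_support}


# Toy data (an assumption for illustration): four transactions split in two partitions.
partitions = [
    [["bread", "milk"], ["bread", "diapers", "beer"]],
    [["milk", "diapers", "beer"], ["bread", "milk", "diapers"]],
]
pairs = frequent_itemsets(partitions, k=2, min_support=2)
```

Because the only shared step is summing the counts, each partition can be scanned independently. The cost of that final exchange grows with the number of distinct itemsets being counted.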

This work was supported by NSF CCR-9972519, by NASA grant # NCC 2 1231, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE grant LLNL/DOE B347714, and by Army High Performance Computing Research Center cooperative agreement number DAAD19-01-2-0014. Access to computing facilities was provided by AHPCRC and the Minnesota Supercomputer Institute. Related papers are available via WWW at URL: http://www.cs.umn.edu/~Rkumar.

References

R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Eng., 5(6):914–925, December 1993.
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of 1993 ACM-SIGMOD Int. Conf. on Management of Data, Washington, D.C., 1993.
R. Agrawal and J. C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Eng., 8(6):962–969, December 1996.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487–499, Santiago, Chile, 1994.
J. Chattratichat, J. Darlington, M. Ghanem, Y. Guo, H. Huning, M. Kohler, J. Sutiwaraphun, H. W. To, and D. Yang. Large scale data mining: Challenges and responses. In Proc. of the Third Int’l Conference on Knowledge Discovery and Data Mining, 1997.
M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Eng., 8(6):866–883, December 1996.
D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
S. Goil, S. Aluru, and S. Ranka. Concatenated parallelism: A technique for efficient parallel divide and conquer. In Proc. of the Symposium of Parallel and Distributed Computing (SPDP’96), 1996.
D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.
R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu. Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.
E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In Proc. of 1997 ACM-SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, 1997.
E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Eng., 12(3), May/June 2000.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
M. V. Joshi, E.-H. Han, G. Karypis, and V. Kumar. Efficient parallel algorithms for mining associations. In M. J. Zaki and C.-T. Ho, editors, Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence (LNCS/LNAI), volume 1759. Springer-Verlag, 2000.
M. V. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proc. of the International Parallel Processing Symposium, 1998.
M. V. Joshi, G. Karypis, and V. Kumar. Universal formulation of sequential patterns. Technical Report TR 99-021, Department of Computer Science, University of Minnesota, Minneapolis, 1999.
R. Kufrin. Decision trees on parallel processors. In J. Geller, H. Kitano, and C. B. Suttner, editors, Parallel Processing for Artificial Intelligence 3. Elsevier Science, 1997.
R. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, 4(22), April 1987.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. of the Fifth Int’l Conference on Extending Database Technology, Avignon, France, 1996.
R. A. Pearson. A coarse grained parallel induction heuristic. In H. Kitano, V. Kumar, and C. B. Suttner, editors, Parallel Processing for Artificial Intelligence 2, pages 207–226. Elsevier Science, 1994.
J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of the 22nd VLDB Conference, 1996.
A. Srivastava, E.-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237–261, September 1999.
M. Steinbach, P. Tan, V. Kumar, S. Klooster, and C. Potter. Temporal data mining for the discovery and analysis of ocean climate indices. In KDD Workshop on Temporal Data Mining (KDD’2002), Edmonton, Alberta, Canada, 2001.
M. Stonebraker, R. Agrawal, U. Dayal, E. J. Neuhold, and A. Reuter. DBMS research at a crossroads: The Vienna update. In Proc. of the 19th VLDB Conference, pages 688–692, Dublin, Ireland, 1993.
P. Tan, M. Steinbach, V. Kumar, S. Klooster, C. Potter, and A. Torregrosa. Finding spatio-temporal patterns in earth science data. In KDD Workshop on Temporal Data Mining (KDD’2001), San Francisco, California, 2001.
M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency (Special Issue on Data Mining), December 1999.

Author information

Authors and Affiliations

University of Minnesota, 4-192 EE/CSci Building, 200 Union Street SE, Minneapolis, MN 55455, USA
Vipin Kumar, Mahesh V. Joshi, Eui-Hong (Sam) Han, Pang-Ning Tan & Michael Steinbach

Editor information

Editors and Affiliations

Faculdade de Engenharia da Universidade do Porto, Rua Dr. Roberto Frias, 4200-465, Porto, Portugal
José M. L. M. Palma & A. Augusto Sousa
Department of Computer Science, University of Tennessee, 37996-1301, Knoxville, TN, USA
Jack Dongarra
Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camino de Vera, s/n, Apartado 22012, 46020, Valencia, Spain
Vicente Hernández

About this paper

Cite this paper

Kumar, V., Joshi, M.V., Han, E.-H., Tan, P.-N., Steinbach, M. (2003). High Performance Data Mining. In: Palma, J.M.L.M., Sousa, A.A., Dongarra, J., Hernández, V. (eds) High Performance Computing for Computational Science - VECPAR 2002. Lecture Notes in Computer Science, vol 2565. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36569-9_8

DOI: https://doi.org/10.1007/3-540-36569-9_8
Published: 15 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00852-1
Online ISBN: 978-3-540-36569-3
eBook Packages: Springer Book Archive
