A Requirements Analysis for Parallel KDD Systems

  • Conference paper
Parallel and Distributed Processing (IPDPS 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1800)

Abstract

The current generation of data mining tools has limited capacity and performance because these tools tend to be sequential. This paper explores a migration path out of this bottleneck by considering an integrated hardware and software approach to parallelizing data mining. Our analysis shows that parallel data mining solutions require the following components: parallel data mining algorithms, parallel and distributed databases, parallel file systems, parallel I/O, tertiary storage, management of online data, support for heterogeneous data representations, security, quality of service, and pricing metrics. State-of-the-art technology in these areas is surveyed with an eye towards an integration strategy leading to a complete solution.

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Maniatty, W.A., Zaki, M.J. (2000). A Requirements Analysis for Parallel KDD Systems. In: Rolim, J. (eds) Parallel and Distributed Processing. IPDPS 2000. Lecture Notes in Computer Science, vol 1800. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45591-4_47

  • DOI: https://doi.org/10.1007/3-540-45591-4_47

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67442-9

  • Online ISBN: 978-3-540-45591-2

  • eBook Packages: Springer Book Archive
