
Running Data Mining Applications on the Grid: A Bag-of-Tasks Approach

  • Conference paper
Computational Science and Its Applications – ICCSA 2004 (ICCSA 2004)

Abstract

Data mining (DM) applications are composed of computing-intensive processing tasks working on huge datasets. Because of their computing-intensive nature, these applications are natural candidates for execution on high-performance, high-throughput platforms such as PC clusters and computational grids. Many data mining algorithms can be implemented as bag-of-tasks (BoT) applications, i.e., parallel applications composed of independent tasks. This paper discusses the use of computing grids for the execution of DM algorithms as BoT applications, investigates the scalability of the execution of an application, and proposes an approach to improve its scalability.
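To make the bag-of-tasks idea concrete, the following minimal sketch runs a set of independent tasks in parallel. It is an illustration only and not the paper's implementation: the toy transaction data, the function names `count_support` and `run_bag_of_tasks`, and the use of a local thread pool as a stand-in for grid workers are all assumptions introduced here.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy dataset: each transaction is a set of items.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]


def count_support(itemset):
    """One independent task: count the transactions that contain `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def run_bag_of_tasks(tasks, workers=4):
    """Run independent tasks in parallel; no task communicates with another.

    A local thread pool stands in for grid workers (an assumption of this sketch).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with the task list.
        return list(pool.map(count_support, tasks))


# Hypothetical candidate itemsets; each one becomes a separate task.
CANDIDATES = [
    frozenset({"bread"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk", "butter"}),
]
SUPPORTS = run_bag_of_tasks(CANDIDATES)
```

Because the tasks share no state and exchange no messages, they can be dispatched in any order and to any worker, which is the property that makes a BoT application a good fit for loosely coupled platforms.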

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

da Silva, F.A.B., Carvalho, S., Senger, H., Hruschka, E.R., de Farias, C.R.G. (2004). Running Data Mining Applications on the Grid: A Bag-of-Tasks Approach. In: Laganá, A., Gavrilova, M.L., Kumar, V., Mun, Y., Tan, C.J.K., Gervasi, O. (eds) Computational Science and Its Applications – ICCSA 2004. ICCSA 2004. Lecture Notes in Computer Science, vol 3044. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24709-8_18

  • DOI: https://doi.org/10.1007/978-3-540-24709-8_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22056-5

  • Online ISBN: 978-3-540-24709-8

  • eBook Packages: Springer Book Archive
