DDC: Distributed Data Collection Framework for Failure Prediction in Tianhe Supercomputers

  • Conference paper
Advanced Parallel Processing Technologies (APPT 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9231)

Abstract

Reliability has become a pressing issue for the Tianhe supercomputer series as the systems scale up. Proactive fault tolerance based on failure prediction is an effective way to improve a system’s fault-tolerance ability. Data collection is the basis of failure prediction and strongly affects prediction accuracy, yet current data collection methods for failure prediction obtain only limited data and incur large overhead. This paper presents DDC, a data collection framework for failure prediction in Tianhe supercomputers. DDC adopts a distributed data collection architecture that comprehensively and efficiently collects data related to the health of the compute nodes. Tests of DDC running on TH-1A indicate that it has low cost and good scalability.

Acknowledgments

This paper is supported by the National Natural Science Foundation of China (NSFC) No. 61272141, No. 61120106005 and the National High Technology Research and Development Program of China (863 Program) No. 2012AA01A301.

Author information

Corresponding author

Correspondence to Yanhuang Jiang.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Hu, W., Jiang, Y., Liu, G., Dong, W., Cai, G. (2015). DDC: Distributed Data Collection Framework for Failure Prediction in Tianhe Supercomputers. In: Chen, Y., Ienne, P., Ji, Q. (eds) Advanced Parallel Processing Technologies. APPT 2015. Lecture Notes in Computer Science, vol 9231. Springer, Cham. https://doi.org/10.1007/978-3-319-23216-4_2

  • DOI: https://doi.org/10.1007/978-3-319-23216-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23215-7

  • Online ISBN: 978-3-319-23216-4

  • eBook Packages: Computer Science, Computer Science (R0)
