DDC: Distributed Data Collection Framework for Failure Prediction in Tianhe Supercomputers

Hu, Wei; Jiang, Yanhuang; Liu, Guangming; Dong, Wenrui; Cai, Guilin

doi:10.1007/978-3-319-23216-4_2

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9231)

Included in the following conference series:

International Workshop on Advanced Parallel Processing Technologies

Abstract

Reliability has become a growing concern for the Tianhe supercomputer series as the systems scale up. Proactive fault tolerance based on failure prediction is an effective way to improve a system's fault tolerance. Data collection is the basis of failure prediction and strongly affects prediction accuracy, yet current data collection methods for failure prediction obtain only limited data at high overhead. This paper presents DDC, a data collection framework for failure prediction in Tianhe supercomputers. DDC adopts a distributed data collection architecture that comprehensively and efficiently collects data related to compute-node health. Tests of DDC running on TH-1A indicate that it has low cost and good scalability.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 61272141 and No. 61120106005, and by the National High Technology Research and Development Program of China (863 Program) under Grant No. 2012AA01A301.

Author information

Authors and Affiliations

College of Computer, National University of Defense Technology, Changsha, China
Wei Hu, Yanhuang Jiang, Guangming Liu, Wenrui Dong & Guilin Cai
National Supercomputer Centre of Tianjin, Tianjin, China
Wei Hu, Guangming Liu & Wenrui Dong

Corresponding author

Correspondence to Yanhuang Jiang.

Editor information

Editors and Affiliations

Chinese Academy of Sciences, Beijing, China
Yunji Chen
EPFL IC ISIM LAP, Lausanne, Switzerland
Paolo Ienne
Inspur, Shandong, China
Qing Ji

About this paper

Cite this paper

Hu, W., Jiang, Y., Liu, G., Dong, W., Cai, G. (2015). DDC: Distributed Data Collection Framework for Failure Prediction in Tianhe Supercomputers. In: Chen, Y., Ienne, P., Ji, Q. (eds) Advanced Parallel Processing Technologies. APPT 2015. Lecture Notes in Computer Science, vol 9231. Springer, Cham. https://doi.org/10.1007/978-3-319-23216-4_2

DOI: https://doi.org/10.1007/978-3-319-23216-4_2
Published: 15 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23215-7
Online ISBN: 978-3-319-23216-4
eBook Packages: Computer Science; Computer Science (R0)
