Fault-Tolerant LU Factorization Is Low Cost

  • Conference paper
Euro-Par 2021: Parallel Processing (Euro-Par 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12820)

Abstract

At large scale, failures are statistically frequent and need to be taken into account. Tolerating failures has become a major challenge in parallel computing: as systems grow in size, failures become more common, and some computation units are expected to fail during the execution of a program. Algorithms used in these programs must be scalable while being resilient to the hardware failures that will happen during execution. In this paper, we present an algorithm that takes advantage of intrinsic properties of the scalable communication-avoiding LU algorithms in order to make them fault-tolerant and proceed with the computation in spite of failures. We evaluate the overhead of the fault tolerance mechanisms with respect to failure-free execution on both tall-and-skinny matrices (TSLU) and square matrices (CALU), as well as the cost of a failure during the execution.

Notes

  1. https://www.top500.org/lists/2020/11/

Acknowledgement

Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr).

The authors would like to thank Julien Langou for the discussions on the numerical stability of the computation of the L matrix.

Author information

Corresponding author

Correspondence to Daniel Alberto Torres González.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Coti, C., Petrucci, L., Torres González, D.A. (2021). Fault-Tolerant LU Factorization Is Low Cost. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_33

  • DOI: https://doi.org/10.1007/978-3-030-85665-6_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85664-9

  • Online ISBN: 978-3-030-85665-6

  • eBook Packages: Computer Science, Computer Science (R0)
