Hard Faults and Soft-Errors: Possible Numerical Remedies in Linear Algebra Solvers

  • Conference paper
High Performance Computing for Computational Science – VECPAR 2016 (VECPAR 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10150)

Abstract

On future large-scale systems, the mean time between failures (MTBF) is expected to decrease to the point where many faults could occur during the solution of a single large problem. Consequently, it becomes critical to design parallel numerical linear algebra kernels that can survive faults. In that framework, we investigate the relevance of approaches relying on numerical techniques, which might be combined with more classical techniques in real large-scale parallel implementations. Our main objective is to provide robust resilient schemes so that the solver can keep converging in the presence of hard faults without restarting the calculation from scratch. For this purpose, we study interpolation-restart (IR) strategies. For a given numerical scheme, an IR strategy first extracts relevant information from the data still available after a fault. A well-selected part of the missing data is then regenerated through interpolation to form a meaningful input from which the numerical algorithm is restarted. In this paper, we revisit a few state-of-the-art methods in numerical linear algebra in the light of our IR strategies. Through a few numerical experiments, we give qualitative illustrations of the robustness of the resulting resilient schemes with respect to the MTBF.
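
The interpolation step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which interpolation is used, so the code below assumes one policy discussed in the resilient Krylov solver literature, often called linear interpolation: for a linear system A x = b in which a block of the iterate x is lost, the lost block is regenerated by solving the local system A[lost, lost] x[lost] = b[lost] - A[lost, kept] x[kept]. The function names (`solve_dense`, `linear_interpolation`), the 1D Laplacian test matrix, and the choice of lost indices are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        # Pick the row with the largest pivot candidate and swap it into place.
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    # Back substitution on the upper-triangular system.
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y


def linear_interpolation(A, b, x, lost):
    """Regenerate the entries x[i], i in lost, from the surviving entries.

    Assumed policy: solve A[lost, lost] x_lost = b[lost] - A[lost, kept] x[kept].
    The values stored in x at the lost positions are never read.
    """
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    A_ll = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_new = list(x)
    for i, value in zip(lost, solve_dense(A_ll, rhs)):
        x_new[i] = value
    return x_new


# Toy problem (assumption): 1D Laplacian, a symmetric positive definite matrix.
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
x_true = [float(i + 1) for i in range(n)]
b = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(n)]

# Simulate a hard fault: the entries owned by the failed process are gone.
lost = [2, 3, 4]
x_faulty = list(x_true)
for i in lost:
    x_faulty[i] = float("nan")

# Interpolate the lost block; x_restart would seed the restarted solver.
x_restart = linear_interpolation(A, b, x_faulty, lost)
max_error = max(abs(u - v) for u, v in zip(x_restart, x_true))
```

In this toy case the surviving entries are exact, so the lost block is recovered up to rounding error. In a running solver the survivors belong to an unconverged iterate, and the regenerated block is only an approximation whose role is to provide a better restart point than starting from scratch.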

Author information

Corresponding author

Correspondence to L. Giraud.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Agullo, E. et al. (2017). Hard Faults and Soft-Errors: Possible Numerical Remedies in Linear Algebra Solvers. In: Dutra, I., Camacho, R., Barbosa, J., Marques, O. (eds) High Performance Computing for Computational Science – VECPAR 2016. VECPAR 2016. Lecture Notes in Computer Science, vol 10150. Springer, Cham. https://doi.org/10.1007/978-3-319-61982-8_3

  • DOI: https://doi.org/10.1007/978-3-319-61982-8_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61981-1

  • Online ISBN: 978-3-319-61982-8

  • eBook Packages: Computer Science, Computer Science (R0)
