Hard Faults and Soft-Errors: Possible Numerical Remedies in Linear Algebra Solvers

Agullo, E.; Cools, S.; Giraud, L.; Moreau, A.; Salas, P.; Vanroose, W.; Yetkin, E. F.; Zounon, M.

doi:10.1007/978-3-319-61982-8_3

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10150)

Included in the following conference series:

International Conference on Vector and Parallel Processing

Abstract

On future large-scale systems, the mean time between failures (MTBF) of the system is expected to decrease, so that many faults could occur during the solution of large problems. Consequently, it becomes critical to design parallel numerical linear algebra kernels that can survive faults. In that framework, we investigate the relevance of approaches relying on numerical techniques, which might be combined with more classical techniques for real large-scale parallel implementations. Our main objective is to provide robust resilient schemes so that the solver may keep converging in the presence of hard faults without restarting the calculation from scratch. For this purpose, we study interpolation-restart (IR) strategies. For a given numerical scheme, the IR strategies consist of extracting relevant information from the data still available after a fault. After data extraction, a well-selected part of the missing data is regenerated through interpolation strategies to constitute a meaningful input to restart the numerical algorithm. In this paper, we revisit a few state-of-the-art methods in numerical linear algebra in the light of our IR strategies. Through a few numerical experiments, we illustrate qualitatively the respective robustness of the resulting resilient schemes with respect to the MTBF.
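The interpolation-restart idea described in the abstract can be sketched on a small linear system Ax = b. The example below is only an illustration under stated assumptions, not the authors' implementation: it assumes a symmetric positive definite matrix, a conjugate gradient (CG) solver, and a fault that wipes the entries of the current iterate with indices I while A, b and the remaining entries (indices J) survive. The lost block is regenerated by solving the local system A[I,I] x_I = b_I - A[I,J] x_J, and CG is restarted from the result. All function names, the test matrix and the fault location are hypothetical.

```python
# Minimal interpolation-restart sketch (illustrative assumptions throughout).

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(A, b, x):
    return [bi - ai for bi, ai in zip(b, matvec(A, x))]

def cg(A, b, x0, iters):
    """Plain conjugate gradient, run for at most `iters` iterations."""
    x = list(x0)
    r = residual(A, b, x)
    p = list(r)
    rr = dot(r, r)
    for _ in range(iters):
        if rr < 1e-28:  # already converged; avoids dividing by zero
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    sol = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = s / aug[i][i]
    return sol

def interpolate_lost(A, b, x, lost):
    """Regenerate x on the lost indices I from the surviving entries J
    by solving A[I,I] x_I = b_I - A[I,J] x_J. Values of x at the lost
    indices are never read."""
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    A_II = [[A[i][k] for k in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_new = list(x)
    for i, v in zip(lost, solve_dense(A_II, rhs)):
        x_new[i] = v
    return x_new

# Hypothetical test problem: 1D Laplacian (SPD, tridiagonal), right-hand side of ones.
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n

x_partial = cg(A, b, [0.0] * n, 3)        # a few iterations before the fault
lost = [2, 3]                             # indices wiped by the simulated fault
x_faulty = [float("nan") if i in lost else v for i, v in enumerate(x_partial)]

x_restart = interpolate_lost(A, b, x_faulty, lost)  # interpolation step
x_final = cg(A, b, x_restart, n)                    # restart step
```

By construction, the residual of the restarted iterate vanishes on the regenerated rows, and the surviving entries are reused unchanged, so the work done before the fault is not thrown away. For a symmetric positive definite matrix this local solve is an orthogonal projection in the A-inner product, which is one reason to prefer it over resetting the lost entries to zero.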

Author information

Authors and Affiliations

Inria, Bordeaux, France
E. Agullo, L. Giraud & A. Moreau
University of Antwerp, Antwerp, Belgium
S. Cools & W. Vanroose
Sherbrooke University, Sherbrooke, Canada
P. Salas
Istanbul Kemerburgaz University, Istanbul, Turkey
E. F. Yetkin
The University of Manchester, Manchester, UK
M. Zounon

Corresponding author

Correspondence to L. Giraud.

Editor information

Editors and Affiliations

University of Porto, Porto, Portugal
Inês Dutra
University of Porto, Porto, Portugal
Rui Camacho
University of Porto, Porto, Portugal
Jorge Barbosa
Lawrence Berkeley National Laboratory, Berkeley, California, USA
Osni Marques

About this paper

Cite this paper

Agullo, E. et al. (2017). Hard Faults and Soft-Errors: Possible Numerical Remedies in Linear Algebra Solvers. In: Dutra, I., Camacho, R., Barbosa, J., Marques, O. (eds) High Performance Computing for Computational Science – VECPAR 2016. VECPAR 2016. Lecture Notes in Computer Science, vol 10150. Springer, Cham. https://doi.org/10.1007/978-3-319-61982-8_3

DOI: https://doi.org/10.1007/978-3-319-61982-8_3
Published: 14 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61981-1
Online ISBN: 978-3-319-61982-8
eBook Packages: Computer Science, Computer Science (R0)
