Checkpointing Systems

Wolter, Katinka

doi:10.1007/978-3-642-11257-7_8

Abstract

Checkpointing applies to large software systems that are subject to failures. In the absence of failures, the software system continuously serves requests, performs transactions, or executes long-running batch processes. If the execution time of a task and the time at which processing starts are known, then the moment of completion of the task is known as well. If failures can happen, the completion time of a task depends heavily on the underlying fault model. The typical fault model employed in checkpointing assumes that faults are detected immediately as they happen. This implies that only crash faults are considered, and not transient or Byzantine faults, which would require fault-detection mechanisms. Some checkpointing models assume that faults are detected only at the end of the software module [152].
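
As a rough illustration of how the completion time depends on the fault model, the following sketch simulates a single task under a crash-fault model with immediate detection. It is not taken from the chapter. The function name `completion_time`, its parameters, the exponentially distributed times between failures, the equally spaced checkpoints, and the failure-free restart are all assumptions made for this example.

```python
import random


def completion_time(work, n_segments, checkpoint_cost, restart_cost,
                    failure_rate, rng):
    """Simulate one run and return the elapsed time until the task completes.

    Assumptions (illustrative, not from the chapter): exponentially
    distributed times between crash faults, immediate fault detection,
    rollback to the most recent checkpoint, and no failures during restart.
    """
    segment = work / n_segments
    elapsed = 0.0
    for i in range(n_segments):
        # No checkpoint is taken after the final segment.
        cost = checkpoint_cost if i < n_segments - 1 else 0.0
        needed = segment + cost
        while True:
            if failure_rate > 0:
                time_to_failure = rng.expovariate(failure_rate)
            else:
                time_to_failure = float("inf")
            if time_to_failure >= needed:
                # Segment and its checkpoint finish before the next crash.
                elapsed += needed
                break
            # Crash detected at once: the partial segment is lost and the
            # segment is retried from the last checkpoint after a restart.
            elapsed += time_to_failure + restart_cost
    return elapsed


rng = random.Random(42)
# Without failures the completion time is deterministic:
# 100 units of work plus three checkpoints of cost 1.
print(completion_time(100.0, 4, 1.0, 5.0, 0.0, rng))  # 103.0

# With failures it becomes a random variable; estimate its mean.
runs = [completion_time(100.0, 4, 1.0, 5.0, 0.01, rng) for _ in range(2000)]
print(sum(runs) / len(runs))
```

With a failure rate of zero the result is simply the work plus the checkpoint overhead, which matches the observation that the completion moment is known when no failures occur. With a positive failure rate, each run yields a different value, and the number of checkpoints trades overhead against lost work.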

References

M. Elnozahy, L. Alvisi, Y.-M. Wang, D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)
S. Kalaiselvi, V. Rajaraman, A survey of checkpointing algorithms for parallel and distributed computers. Sadhana 25(5), 489–510 (2000)
S. Thanawastien, R.S. Pamula, Y.L. Varol, Evaluation of Global Checkpoint Rollback Strategies for Error Recovery in Concurrent Processing Systems. In Proceedings of the 16th International Symposium on Fault-Tolerant Computing, New York (IEEE Computer Society, Washington, DC, 1986), pp. 246–251
J. Hong, S. Kim, Y. Cho, Cost analysis of optimistic recovery model for forked checkpointing. IEICE Trans. Inform. Syst. E-85A(1), 1534–1541 (2002)
S. Toueg, Ö. Babaoglu, On the optimum checkpoint selection problem. SIAM J. Comput. 13(3), 630–649 (1984)
A.J. Oliner, R. Sahoo, Evaluating Cooperative Checkpointing for Supercomputing Systems. In SMTPS’06: Proceedings of the 2nd Workshop on System Management Tools for Large-Scale Parallel Systems at IPDPS, Greece, January 2006 (IEEE Computer Society, Los Alamitos, CA/ACM Press, 2006)
N.H. Vaidya, Impact of checkpoint latency on overhead ratio of a checkpointing scheme. IEEE Trans. Comput. 46(8), 942–947 (1997)
F. Quaglia, A cost model for selecting checkpoint positions in time warp parallel simulation. IEEE Trans. Parall. Distr. Syst. 12(4), 346–362 (2001)
S. Sankaran, J.M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, E. Roman, The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. Int. J. High Perform. Comput. Appl. 19(4), 479–493 (2005)
E.N. Elnozahy, J.S. Plank, Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Trans. Dependable Sec. Comput. 1(2), 97–108 (2004)
H.G. Naik, R. Gupta, P. Beckman, Analyzing Checkpointing Trends for Applications on Peta-Scale Systems. In P2S2’09: Proceedings of the 2nd International Workshop on Parallel Programming Models and Systems Software (P2S2) for High-End Computing (IEEE Computer Society, Vienna, Austria, 2009)
A.J. Oliner, L. Rudolph, R. Sahoo, Cooperative Checkpointing Theory. In Proceedings of the Parallel and Distributed Processing Symposium, Rhodes Island, Greece, January 2006 (IEEE Computer Society, Los Alamitos, CA/ACM Press, 2006)

Author information

Authors and Affiliations

Institute of Computer Science, Freie Universität Berlin, Takustr. 9, D-14195, Berlin, Germany
Katinka Wolter

Corresponding author

Correspondence to Katinka Wolter.

About this chapter

Cite this chapter

Wolter, K. (2010). Checkpointing Systems. In: Stochastic Models for Fault Tolerance. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11257-7_8

DOI: https://doi.org/10.1007/978-3-642-11257-7_8
Published: 19 February 2010
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11256-0
Online ISBN: 978-3-642-11257-7
eBook Packages: Computer Science (R0)