Abstract
Checkpointing applies to large software systems subject to failures. In the absence of failures, the software system continuously serves requests, performs transactions, or executes long-running batch processes. If the execution time of a task and the time at which processing starts are known, then the moment of completion of the task is known as well. If failures can happen, the completion time of a task depends strongly on the underlying fault model. The typical fault model employed in checkpointing assumes that faults are detected immediately as they happen. This implies that only crash faults are considered, and not transient or Byzantine faults, which would require fault-detection mechanisms. Some checkpointing models, however, assume that faults are detected only at the end of the software module [152].
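The effect of the fault model on completion time can be illustrated with a small simulation. The following is a minimal sketch, not the chapter's model: it assumes Poisson failures, equidistant checkpoints, and immediate fault detection, and the function name and all parameters (`work`, `interval`, `checkpoint_cost`, `restart_cost`, `failure_rate`) are hypothetical names chosen for illustration.

```python
import random


def simulate_completion_time(work, interval, checkpoint_cost, restart_cost,
                             failure_rate, rng):
    """Return the wall-clock completion time of one task run.

    Assumptions (illustrative, not taken from the chapter): failures form a
    Poisson process with rate `failure_rate`, are detected immediately
    (crash faults), may also strike during checkpointing or restart, and a
    checkpoint is saved after every `interval` units of work except after
    the final segment.
    """
    clock = 0.0
    saved = 0.0  # work secured by the most recent checkpoint
    if failure_rate > 0:
        next_failure = rng.expovariate(failure_rate)
    else:
        next_failure = float("inf")  # failure-free case
    pending_restart = False
    while saved < work:
        segment = min(interval, work - saved)
        is_last = saved + segment >= work
        # Time needed to finish this segment without interruption.
        needed = segment + (0.0 if is_last else checkpoint_cost)
        if pending_restart:
            needed += restart_cost
        if clock + needed <= next_failure:
            # Segment (and its checkpoint) completes before the next failure.
            clock += needed
            saved += segment
            pending_restart = False
        else:
            # Crash: unsaved work is lost, roll back to the last checkpoint.
            clock = next_failure
            next_failure = clock + rng.expovariate(failure_rate)
            pending_restart = True
    return clock


# Without failures the completion time is deterministic:
# 10 units of work plus 4 checkpoints of cost 0.5 each.
print(simulate_completion_time(10.0, 2.0, 0.5, 1.0, 0.0, random.Random(0)))  # 12.0

# With failures it becomes a random variable; average over many runs.
rng = random.Random(1)
runs = [simulate_completion_time(10.0, 2.0, 0.5, 1.0, 0.2, rng) for _ in range(1000)]
print(sum(runs) / len(runs))
```

In the failure-free case the completion time follows directly from the start time and the execution time, as stated above. Once failures are possible, it depends on the assumed failure process and on when faults are detected.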
References
M. Elnozahy, L. Alvisi, Y.-M. Wang, D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)
S. Kalaiselvi, V. Rajaraman, A survey of checkpointing algorithms for parallel and distributed computers. Sadhana 25(5), 489–510 (2000)
S. Thanawastien, R.S. Pamula, Y.L. Varol, Evaluation of Global Checkpoint Rollback Strategies for Error Recovery in Concurrent Processing Systems. In Proceedings of the 16th International Symposium on Fault-Tolerant Computing, New York (IEEE Computer Society, Washington, DC, 1986), pp. 246–251
J. Hong, S. Kim, Y. Cho, Cost analysis of optimistic recovery model for forked checkpointing. IEICE Trans. Inform. Syst. E-85A(1), 1534–1541 (2002)
S. Toueg, Ö. Babaoglu, On the optimum checkpoint selection problem. SIAM J. Comput. 13(3), 630–649 (1984)
A.J. Oliner, R. Sahoo, Evaluating Cooperative Checkpointing for Supercomputing Systems. In SMTPS’06: Proceedings of the 2nd Workshop on System Management Tools for Large-Scale Parallel Systems at IPDPS, Greece, January 2006 (IEEE Computer Society, Los Alamitos, CA/ACM Press, 2006)
N.H. Vaidya, Impact of checkpoint latency on overhead ratio of a checkpointing scheme. IEEE Trans. Comput. 46(8), 942–947 (1997)
F. Quaglia, A cost model for selecting checkpoint positions in time warp parallel simulation. IEEE Trans. Parall. Distr. Syst. 12(4), 346–362 (2001)
S. Sankaran, J.M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, E. Roman, The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. Int. J. High Perform. Comput. Appl. 19(4), 479–493 (2005)
E.N. Elnozahy, J.S. Plank, Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Trans. Dependable Sec. Comput. 1(2), 97–108 (2004)
H.G. Naik, R. Gupta, P. Beckman, Analyzing Checkpointing Trends for Applications on Peta-Scale Systems. In P2S2’09: Proceedings of the 2nd International Workshop on Parallel Programming Models and Systems Software (P2S2) for High-End Computing (IEEE Computer Society, Vienna, Austria, 2009)
A.J. Oliner, L. Rudolph, R. Sahoo, Cooperative Checkpointing Theory. In Proceedings of the Parallel and Distributed Processing Symposium, Rhodes Island, Greece, January 2006 (IEEE Computer Society, Los Alamitos, CA/ACM Press, 2006)
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Wolter, K. (2010). Checkpointing Systems. In: Stochastic Models for Fault Tolerance. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11257-7_8
DOI: https://doi.org/10.1007/978-3-642-11257-7_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11256-0
Online ISBN: 978-3-642-11257-7
eBook Packages: Computer Science (R0)