Synonyms
Definition
In the most general sense, Checkpointing refers to the ability to store the state of a computation in a way that allows it to be continued at a later time without changing the computation’s behavior. The preserved state is called the Checkpoint and the continuation is typically referred to as a Restart.
Checkpointing is most typically used to provide fault tolerance to applications. In this case, the state of the entire application is periodically saved to some kind of stable storage, e.g., disk, and can be retrieved in case the original application crashes due to a failure in the underlying system. The application is then restarted (or recovered) from the checkpoint that was created last and continued from that point on, thereby minimizing the time lost due to the failure.
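The periodic save and restart cycle described above can be illustrated with a minimal sketch. It is not taken from the entry: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` hook used to simulate a crash, and the use of a local file as "stable storage" are all assumptions made for illustration. The state here is a small dictionary; a real application would have to capture everything needed to resume.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before renaming
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the most recently saved state, or None if none exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a failure at that step.
    """
    # Restart: resume from the last checkpoint if there is one.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

In this sketch, a run that fails at step 57 with an interval of 10 leaves a checkpoint taken at step 50. Calling `run` again with the same path resumes from there, so only the work done after the last checkpoint is repeated, and the final result matches that of an uninterrupted run.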
Discussion
Checkpointing is a mechanism to store the state of a computation so that it can be retrieved at a later point in time and continued. The process of...
Copyright information
© 2011 Springer Science+Business Media, LLC
About this entry
Cite this entry
Schulz, M. (2011). Checkpointing. In: Padua, D. (eds) Encyclopedia of Parallel Computing. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09766-4_62
DOI: https://doi.org/10.1007/978-0-387-09766-4_62
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09765-7
Online ISBN: 978-0-387-09766-4
eBook Packages: Computer Science, Reference Module Computer Science and Engineering