Checkpointing

Schulz, Martin

doi:10.1007/978-0-387-09766-4_62

Synonyms

Checkpoint-recovery; Checkpoint/Restart

Definition

In the most general sense, checkpointing refers to the ability to store the state of a computation in a way that allows it to be continued at a later time without changing the computation’s behavior. The preserved state is called the checkpoint, and the continuation is typically referred to as a restart.

Checkpointing is most commonly used to provide fault tolerance to applications. In this case, the state of the entire application is periodically saved to stable storage, e.g., disk, and can be retrieved if the original application crashes due to a failure in the underlying system. The application is then restarted (or recovered) from the most recent checkpoint and continues from that point on, so that the time lost to the failure is limited to the work done since that checkpoint.
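The periodic save and restart cycle described above can be illustrated with a minimal application-level sketch. It is not taken from the entry: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a crash, and the choice of `pickle` with a write-then-rename scheme are all assumptions made for illustration. A local file stands in for stable storage.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` so that a crash mid-write keeps the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the storage device
    os.replace(tmp_path, path)  # swap the new checkpoint in with a single rename


def load_checkpoint(path):
    """Return the most recently saved state, or None if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run(path, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a crash at a given step.
    """
    # Restart: resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If a first call such as `run("ckpt.pkl", 100, 10, fail_at=57)` is interrupted, a second call `run("ckpt.pkl", 100, 10)` resumes from step 50 rather than step 0 and returns the same result as an uninterrupted run. Only the work done between the last checkpoint and the failure is repeated.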

Discussion

Checkpointing is a mechanism to store the state of a computation so that it can be retrieved at a later point in time and continued. The process of...

Author information

Authors and Affiliations

Center for Advanced Scientific Computing, Lawrence Livermore National Laboratory, P.O. Box 808, L-560, Livermore, CA, 94551, USA
Martin Schulz Dr.

Editor information

Editors and Affiliations

University of Illinois at Urbana-Champaign, Urbana, IL, USA
David Padua

Cite this entry

Schulz, M. (2011). Checkpointing. In: Padua, D. (eds) Encyclopedia of Parallel Computing. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09766-4_62

DOI: https://doi.org/10.1007/978-0-387-09766-4_62
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09765-7
Online ISBN: 978-0-387-09766-4
eBook Packages: Computer Science; Reference Module Computer Science and Engineering