Scalable Distributed Consensus to Support MPI Fault Tolerance

Buntinas, Darius

doi:10.1007/978-3-642-24449-0_39

Darius Buntinas

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6960)

Included in the following conference series:

European MPI Users' Group Meeting


Abstract

As system sizes increase, the amount of time in which an application can run without experiencing a failure decreases. Exascale applications will need to address fault tolerance. In order to support algorithm-based fault tolerance, communication libraries will need to provide fault-tolerance features to the application. One important fault-tolerance operation is distributed consensus. This is used, for example, to collectively decide on a set of failed processes. This paper describes a scalable, distributed consensus algorithm that is used to support new MPI fault-tolerance features proposed by the MPI 3 Forum’s fault-tolerance working group. The algorithm was implemented and evaluated on a 4,096-core Blue Gene/P. The implementation was able to perform a full-scale distributed consensus in 305 μs and scaled logarithmically.

This work was supported in part by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357.



Author information

Authors and Affiliations

Argonne National Laboratory, USA
Darius Buntinas


Editor information

Editors and Affiliations

Department of Informatics and Telecommunications, University of Athens, 15784, Athens, Greece
Yiannis Cotronis
University of Tennessee, 1122 Volunteer Blvd, 37996-3450, Knoxville, TN, USA
Anthony Danalis & Jack Dongarra
University of Crete, Heraklion, Greece
Dimitrios S. Nikolopoulos

About this paper

Cite this paper

Buntinas, D. (2011). Scalable Distributed Consensus to Support MPI Fault Tolerance. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2011. Lecture Notes in Computer Science, vol 6960. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24449-0_39

DOI: https://doi.org/10.1007/978-3-642-24449-0_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24448-3
Online ISBN: 978-3-642-24449-0
eBook Packages: Computer Science, Computer Science (R0)
