Reconfiguration and checkpointing in massively parallel systems

Bieker, Bernd; Deconinck, Geert; Maehle, Erik; Vounckx, Johan

doi:10.1007/3-540-58426-9_141


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 852)

Included in the following conference series:

European Dependable Computing Conference


Abstract

Despite improvements in hardware design, massively parallel systems lack dependability because of the large number of components they consist of. One way to introduce fault tolerance into such systems is backward error recovery, in which failed modules are replaced by spares. The ESPRIT Project 6731 “A Practical Approach to Fault-Tolerant Massively Parallel Systems” follows this approach and covers error detection, diagnosis, checkpointing and reconfiguration. The target systems are multicomputers consisting of modules connected in a grid and communicating by message passing. A first implementation will be made for the Parsytec GCel under PARIX. This paper focuses on recovery by reconfiguration and checkpointing. The project relies on switching in spares and routing around failed components via virtual links (interval routing). For recovery, both a user-driven and a user-transparent approach are provided, based on the new recovery-line-manager.

This research was partially supported by the EC as ESPRIT project 6731 (FTMPS).



Author information

Authors and Affiliations

FG Datentechnik, Universität-GH Paderborn, Germany
Bernd Bieker & Erik Maehle
Dept. Elektrotechniek-ESAT, Katholieke Universiteit Leuven, Belgium
Geert Deconinck & Johan Vounckx


Editor information

Klaus Echtle, Dieter Hammer, David Powell


About this paper

Cite this paper

Bieker, B., Deconinck, G., Maehle, E., Vounckx, J. (1994). Reconfiguration and checkpointing in massively parallel systems. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_141


DOI: https://doi.org/10.1007/3-540-58426-9_141
Published: 07 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58426-1
Online ISBN: 978-3-540-48785-2
eBook Packages: Springer Book Archive
