Abstract
Despite improvements in hardware design, massively parallel systems lack dependability because of the large number of components they consist of. One way to introduce fault tolerance into such systems is backward error recovery, in which failed modules are replaced by spares. The ESPRIT Project 6731 “A Practical Approach to Fault-Tolerant Massively Parallel Systems” follows this approach and covers error detection, diagnosis, checkpointing and reconfiguration. The target systems are multicomputers consisting of modules connected in a grid and communicating by message passing. A first implementation will be made for the Parsytec GCel under PARIX. This paper focuses on recovery by reconfiguration and checkpointing. The project relies on switching in spares and routing around failed components via virtual links (interval routing). For recovery, both a user-driven and a user-transparent approach are provided, based on the new recovery-line manager.
This research was partially supported by the EC as ESPRIT project 6731 (FTMPS).
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bieker, B., Deconinck, G., Maehle, E., Vounckx, J. (1994). Reconfiguration and checkpointing in massively parallel systems. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_141
DOI: https://doi.org/10.1007/3-540-58426-9_141
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58426-1
Online ISBN: 978-3-540-48785-2
eBook Packages: Springer Book Archive