Abstract
This paper presents a forward recovery method for the fault-tolerant execution of parallel software systems on multicomputers such that faults are neither detected nor diagnosed until the fault prevents progress in the computation of the system. The method minimizes the communication and synchronization overhead required to verify the reliability of the system and consequently minimizes the impact of fault-tolerance on the throughput of the computation. We say the system is credible provided that the system is diagnosable and complete, where complete means that at least one copy of each process exists on a fault-free processor. We apply the method to the process structure deriving from parallel, bounded-time decision systems and show through an exact Markov analysis that the method will yield a very credible system. We then introduce a much simpler but approximate Markov model that facilitates credibility analysis over a larger range of parameters and applications.
Zusammenfassung
Die Arbeit stellt eine Recovery-Methode zur fehlertoleranten Ausführung paralleler Softwaresysteme auf Multicomputern vor, welche Fehler solange weder entdeckt noch diagnostiziert bis der Fehler die weitere Ausführung verhindert. Die Methode minimiert den Kommunikations- und Synchronisationsaufwand der zur Feststellung der Zuverlässigkeit des Systems notwendig ist und minimiert dementsprechend die Auswirkungen der Fehlertoleranz auf den Durchsatz. Wir sagen, das System ist glaubwürdig, falls es diagnostizierbar und vollständig ist, wobei vollständig heißt, daß mindestens eine Prozeßkopie auf einem fehlerfreien Prozessor läuft. Die Methode wird auf die Prozeßstruktur angewandt, die sich von parallelen, zeitbeschränkten Entscheidungssystemen herleitet und wir zeigen durch eine Makro-Analysis, daß die Methode ein glaubwürdiges System liefert. Dann wird ein einfaches, approximatives Markov-Modell eingeführt, das eine Glaubwürdigkeitsanalyse für einen weiteren Bereich von Parametern und Anwendungen liefert.
Similar content being viewed by others
References
[Agh86] Agha, G.: Actors: A model of computation. Cambridge, Mass.: MIT Press 1986.
[Agr85] Agrawal, P.: RAFT. A recursive algorithm for fault tolerance. In: Pro., International Conference on Parallel Processing, August 1985, pp. 814–821.
[AS82] Adams, G. B., Siegel, H. J.: The extra stage cube: A fault-tolerant interconnection network for supercomputers. IEEE Transactions on Computers1982, 443–454.
[BEG+90] Browne, J. C., Emerson, A., Gouda, M., Miranker, D. P., Mok, A., Rosier, L.: Bounded-time, fault-tolerance, rule-based systems. In: Proc Goddard Conference on Space Applications of Artificial Intelligence, 1990.
[BFKM85] Brownston L., Farrell, R., Kant, E., Martin, N.: Programming expert systems in OPS5. Reading Mass.: Addison-Wesley 1985.
[CKWC90] Mok, A., Wang, C. K., Cheng A.: MRL: A real-time rule-based production system. In: Proc. 11th Real-Time Systems Symposium, December 1990, pp. 267–276.
[CM88] Chandy, K. M., Misra, J.: Parallel program design: A foundation Reading, Mass.: Addison-Wesley 1988.
[Dij76] Dijkstra, E. W.: A discipline of programming: Englewood Cliffs, N.J.: Prentice-Hall 1976.
[DM84] Dabhura, A. T., Masson, G. M.: AnO(n2.5) Fault identification algorithm for diagnosable systems. IEEE Transactions on Computers1984, 486–492.
[ER89] Fussel, D., Rangarajan, S.: Probabilistic diagnosis of multiprocessor systems. In: Fault-Tolerant Conference Symposium, 1989.
[FWA87] Fuchs, W. K., Wu, K. L., Abraham, J. A.: Comparison and diagnosis of large replicated files. IEEE Transactions on Software Engineering1987, 15.
[Kle75] Kleinrock, L.: Queueing systems, Vol. 1: Theory: Wiley 1975.
[LFA90] Long, J., Fuchs, W. K., Abraham, J. A.: Forward recovery using checkpointing in parallel systems. In: Proc. IEEE International Conference on Parallel Processing, Vol. 1, pp. 272–275.
[Mal80] Malek, M.: A comparison connection assignment for diagnosis of multiprocessor systems, pp. 31–35, 1980.
[Mir88] Miranker, D. P.: A high level language approach to the Fault tolerant exectuation of AI expert systems. Artificial Intelligence Laboratory, University of Texas at Austin, March 1988, Technical Report AI88-71.
[MKB90] Miranker, D. P., Kuo, C. M., Browne, J. C.: Parallelizing compilation of rule-based programs. In: Proc. IEEE International Conference on Parallel Processing, 1990, Vol. 2, pp. 247–251.
[ML90] Miranker, D. P., Lofaso, B. J.: The organization and performance of a treat-based production system compiler. IEEE Trans. on Knowledge and Data Engineering1991, 3–9.
[Mok89] Mok, A.: Formal analysis of real-time equational rule-based. In: Proc., 10th Real-Time Systems Symposium, December 1989, pp. 308–318.
[PMC67] Preperate, F. P., Metze, G., Chien, R. T.: On the connection assignment problem of diagnosable systems. IEEE Transactions on Elec. ComputersEC-16: 848–854 (1967).
[Pra82] Pradhan, D. K.: On a class of Fault-tolerant multiprocessor network architecture. 1982.
[Sha90] Shankar, R.: Credible execution of parallel systems with delayed diagnosis. Master's thesis, Department of Electrical and Computer Engineering, University of Texas at Austin, 1990.
[Sto84] Stolfo, S. J.: Five parallel algorithms for production system execution on the DADO machine. In: Proc. National Conference on Artificial Intelligence, Austin, Texas, August 1984.
Author information
Authors and Affiliations
Additional information
Dedicated to Professor Willard L. Miranker on the occasion of his 60th birthday
This research was supported in part by the Office of Naval Research under contracts N00014-86-K-0763 and N00014-90-J-1366 and by a grant from Texas Instruments.
Rights and permissions
About this article
Cite this article
Shankar, R., Miranker, D.P. Credible execution of bounded-time parallel systems with delayed diagnosis. Computing 48, 21–37 (1992). https://doi.org/10.1007/BF02241704
Received:
Revised:
Issue Date:
DOI: https://doi.org/10.1007/BF02241704