Skip to main content
Log in

Credible execution of bounded-time parallel systems with delayed diagnosis

Glaubwürdige Ausführung von zeitbeschränkten parallelen Systemen mit verzögerter Diagnose

  • Published:
Computing Aims and scope Submit manuscript

Abstract

This paper presents a forward recovery method for the fault-tolerant execution of parallel software systems on multicomputers such that faults are neither detected nor diagnosed until the fault prevents progress in the computation of the system. The method minimizes the communication and synchronization overhead required to verify the reliability of the system and consequently minimizes the impact of fault-tolerance on the throughput of the computation. We say the system is credible provided that the system is diagnosable and complete, where complete means that at least one copy of each process exists on a fault-free processor. We apply the method to the process structure deriving from parallel, bounded-time decision systems and show through an exact Markov analysis that the method will yield a very credible system. We then introduce a much simpler but approximate Markov model that facilitates credibility analysis over a larger range of parameters and applications.

Zusammenfassung

Die Arbeit stellt eine Recovery-Methode zur fehlertoleranten Ausführung paralleler Softwaresysteme auf Multicomputern vor, welche Fehler solange weder entdeckt noch diagnostiziert bis der Fehler die weitere Ausführung verhindert. Die Methode minimiert den Kommunikations- und Synchronisationsaufwand der zur Feststellung der Zuverlässigkeit des Systems notwendig ist und minimiert dementsprechend die Auswirkungen der Fehlertoleranz auf den Durchsatz. Wir sagen, das System ist glaubwürdig, falls es diagnostizierbar und vollständig ist, wobei vollständig heißt, daß mindestens eine Prozeßkopie auf einem fehlerfreien Prozessor läuft. Die Methode wird auf die Prozeßstruktur angewandt, die sich von parallelen, zeitbeschränkten Entscheidungssystemen herleitet und wir zeigen durch eine Makro-Analysis, daß die Methode ein glaubwürdiges System liefert. Dann wird ein einfaches, approximatives Markov-Modell eingeführt, das eine Glaubwürdigkeitsanalyse für einen weiteren Bereich von Parametern und Anwendungen liefert.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • [Agh86] Agha, G.: Actors: A model of computation. Cambridge, Mass.: MIT Press 1986.

    Google Scholar 

  • [Agr85] Agrawal, P.: RAFT. A recursive algorithm for fault tolerance. In: Pro., International Conference on Parallel Processing, August 1985, pp. 814–821.

  • [AS82] Adams, G. B., Siegel, H. J.: The extra stage cube: A fault-tolerant interconnection network for supercomputers. IEEE Transactions on Computers1982, 443–454.

  • [BEG+90] Browne, J. C., Emerson, A., Gouda, M., Miranker, D. P., Mok, A., Rosier, L.: Bounded-time, fault-tolerance, rule-based systems. In: Proc Goddard Conference on Space Applications of Artificial Intelligence, 1990.

  • [BFKM85] Brownston L., Farrell, R., Kant, E., Martin, N.: Programming expert systems in OPS5. Reading Mass.: Addison-Wesley 1985.

    Google Scholar 

  • [CKWC90] Mok, A., Wang, C. K., Cheng A.: MRL: A real-time rule-based production system. In: Proc. 11th Real-Time Systems Symposium, December 1990, pp. 267–276.

  • [CM88] Chandy, K. M., Misra, J.: Parallel program design: A foundation Reading, Mass.: Addison-Wesley 1988.

    Google Scholar 

  • [Dij76] Dijkstra, E. W.: A discipline of programming: Englewood Cliffs, N.J.: Prentice-Hall 1976.

    Google Scholar 

  • [DM84] Dabhura, A. T., Masson, G. M.: AnO(n2.5) Fault identification algorithm for diagnosable systems. IEEE Transactions on Computers1984, 486–492.

  • [ER89] Fussel, D., Rangarajan, S.: Probabilistic diagnosis of multiprocessor systems. In: Fault-Tolerant Conference Symposium, 1989.

  • [FWA87] Fuchs, W. K., Wu, K. L., Abraham, J. A.: Comparison and diagnosis of large replicated files. IEEE Transactions on Software Engineering1987, 15.

  • [Kle75] Kleinrock, L.: Queueing systems, Vol. 1: Theory: Wiley 1975.

  • [LFA90] Long, J., Fuchs, W. K., Abraham, J. A.: Forward recovery using checkpointing in parallel systems. In: Proc. IEEE International Conference on Parallel Processing, Vol. 1, pp. 272–275.

  • [Mal80] Malek, M.: A comparison connection assignment for diagnosis of multiprocessor systems, pp. 31–35, 1980.

  • [Mir88] Miranker, D. P.: A high level language approach to the Fault tolerant exectuation of AI expert systems. Artificial Intelligence Laboratory, University of Texas at Austin, March 1988, Technical Report AI88-71.

  • [MKB90] Miranker, D. P., Kuo, C. M., Browne, J. C.: Parallelizing compilation of rule-based programs. In: Proc. IEEE International Conference on Parallel Processing, 1990, Vol. 2, pp. 247–251.

  • [ML90] Miranker, D. P., Lofaso, B. J.: The organization and performance of a treat-based production system compiler. IEEE Trans. on Knowledge and Data Engineering1991, 3–9.

  • [Mok89] Mok, A.: Formal analysis of real-time equational rule-based. In: Proc., 10th Real-Time Systems Symposium, December 1989, pp. 308–318.

  • [PMC67] Preperate, F. P., Metze, G., Chien, R. T.: On the connection assignment problem of diagnosable systems. IEEE Transactions on Elec. ComputersEC-16: 848–854 (1967).

    Google Scholar 

  • [Pra82] Pradhan, D. K.: On a class of Fault-tolerant multiprocessor network architecture. 1982.

  • [Sha90] Shankar, R.: Credible execution of parallel systems with delayed diagnosis. Master's thesis, Department of Electrical and Computer Engineering, University of Texas at Austin, 1990.

  • [Sto84] Stolfo, S. J.: Five parallel algorithms for production system execution on the DADO machine. In: Proc. National Conference on Artificial Intelligence, Austin, Texas, August 1984.

Download references

Author information

Authors and Affiliations

Authors

Additional information

Dedicated to Professor Willard L. Miranker on the occasion of his 60th birthday

This research was supported in part by the Office of Naval Research under contracts N00014-86-K-0763 and N00014-90-J-1366 and by a grant from Texas Instruments.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Shankar, R., Miranker, D.P. Credible execution of bounded-time parallel systems with delayed diagnosis. Computing 48, 21–37 (1992). https://doi.org/10.1007/BF02241704

Download citation

  • Received:

  • Revised:

  • Issue Date:

  • DOI: https://doi.org/10.1007/BF02241704

AMS Subject Classifications

Key words

Navigation