Credible execution of bounded-time parallel systems with delayed diagnosis

Shankar, R.; Miranker, D. P.

doi:10.1007/BF02241704

Glaubwürdige Ausführung von zeitbeschränkten parallelen Systemen mit verzögerter Diagnose

Published: March 1992

Computing, Volume 48, pages 21–37 (1992)

R. Shankar¹ &
D. P. Miranker²

Abstract

This paper presents a forward recovery method for the fault-tolerant execution of parallel software systems on multicomputers in which faults are neither detected nor diagnosed until a fault prevents the computation from making progress. The method minimizes the communication and synchronization overhead required to verify the reliability of the system and, consequently, minimizes the impact of fault tolerance on the throughput of the computation. We say the system is credible provided that it is diagnosable and complete, where complete means that at least one copy of each process exists on a fault-free processor. We apply the method to the process structure derived from parallel, bounded-time decision systems and show through an exact Markov analysis that the method yields a highly credible system. We then introduce a much simpler but approximate Markov model that facilitates credibility analysis over a larger range of parameters and applications.
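
The completeness condition defined above can be illustrated with a minimal sketch. The replica placement map, the processor numbering, and the function name `is_complete` are assumptions made for illustration and do not come from the paper; the diagnosability half of credibility is not modeled here because the abstract does not define it.

```python
from typing import Dict, Set


def is_complete(placement: Dict[str, Set[int]], faulty: Set[int]) -> bool:
    """Return True if every process keeps at least one copy on a fault-free processor.

    placement maps a (hypothetical) process name to the set of processor ids
    that hold a copy of it; faulty is the set of failed processor ids.
    """
    # A process survives if its replica set minus the faulty set is non-empty.
    return all(replicas - faulty for replicas in placement.values())


# Hypothetical layout: three processes, each duplicated on two of three processors.
placement = {"p1": {0, 1}, "p2": {1, 2}, "p3": {2, 0}}

print(is_complete(placement, {1}))     # True: every process has a surviving copy
print(is_complete(placement, {1, 2}))  # False: both copies of p2 are on faulty processors
```

With two copies per process, this layout tolerates any single processor fault but not every pair of faults, which is the kind of event whose probability a credibility analysis would need to quantify.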

Zusammenfassung

Die Arbeit stellt eine Recovery-Methode zur fehlertoleranten Ausführung paralleler Softwaresysteme auf Multicomputern vor, welche Fehler solange weder entdeckt noch diagnostiziert bis der Fehler die weitere Ausführung verhindert. Die Methode minimiert den Kommunikations- und Synchronisationsaufwand der zur Feststellung der Zuverlässigkeit des Systems notwendig ist und minimiert dementsprechend die Auswirkungen der Fehlertoleranz auf den Durchsatz. Wir sagen, das System ist glaubwürdig, falls es diagnostizierbar und vollständig ist, wobei vollständig heißt, daß mindestens eine Prozeßkopie auf einem fehlerfreien Prozessor läuft. Die Methode wird auf die Prozeßstruktur angewandt, die sich von parallelen, zeitbeschränkten Entscheidungssystemen herleitet und wir zeigen durch eine Makro-Analysis, daß die Methode ein glaubwürdiges System liefert. Dann wird ein einfaches, approximatives Markov-Modell eingeführt, das eine Glaubwürdigkeitsanalyse für einen weiteren Bereich von Parametern und Anwendungen liefert.



Author information

Authors and Affiliations

Department of Electrical Engineering, University of Texas at Austin, 78712, Austin, U.S.A.
R. Shankar
Department of Computer Sciences, University of Texas at Austin, 78712, Austin, U.S.A.
D. P. Miranker

Additional information

Dedicated to Professor Willard L. Miranker on the occasion of his 60th birthday

This research was supported in part by the Office of Naval Research under contracts N00014-86-K-0763 and N00014-90-J-1366 and by a grant from Texas Instruments.

About this article

Cite this article

Shankar, R., Miranker, D.P. Credible execution of bounded-time parallel systems with delayed diagnosis. Computing 48, 21–37 (1992). https://doi.org/10.1007/BF02241704

Received: 04 September 1991
Revised: 18 December 1991
Issue Date: March 1992
DOI: https://doi.org/10.1007/BF02241704
