Fault Tolerance

  • Chapter
Real-Time Systems

Part of the book series: The International Series in Engineering and Computer Science ((SECS,volume 395))

Overview

Fault tolerance is important in safety-critical real-time systems because, without it, a single component failure may lead to a catastrophic system failure. This chapter begins by explaining the concepts of failure, error, and fault, and then turns to error detection. Error detection requires knowledge about the intended behavior of a system. This knowledge can stem either from a priori established regularity constraints and known properties of the correct behavior of a computation, or from a comparison of the results computed by two redundant channels. Techniques for detecting timing errors and value errors are discussed.
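
The two sources of knowledge named above can be illustrated with a minimal Python sketch. It is not taken from the chapter: the names `ChannelResult` and `detect_error`, the deadline check used as the a priori timing constraint, and the `tolerance` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChannelResult:
    """Result of one redundant channel (hypothetical structure)."""
    value: float
    completion_time_ms: float


def detect_error(a: ChannelResult, b: ChannelResult,
                 deadline_ms: float, tolerance: float = 0.0) -> Optional[str]:
    """Return the kind of error detected, or None if no error is detected.

    Timing errors are detected from a priori knowledge (an assumed deadline);
    value errors are detected by comparing the two redundant channels.
    """
    # A priori knowledge: a correct computation finishes by the deadline.
    if a.completion_time_ms > deadline_ms or b.completion_time_ms > deadline_ms:
        return "timing error"
    # Redundancy: two correct channels must agree within the tolerance.
    if abs(a.value - b.value) > tolerance:
        return "value error"
    return None
```

Note that the comparison only reveals that the channels disagree, not which one is wrong; identifying the faulty channel would need further information, such as a third channel.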

In a distributed system, a node is an appropriate unit of failure. A node implements a self-contained function so that the established architectural principle “form follows function” can be maintained even in a failure scenario. The node implementation must map all internal node failures into simple external failure modes. The problem of node failure detection and membership in event-triggered and time-triggered architectures is elaborated. A set of replica-determinate nodes is grouped together to form a fault-tolerant unit (FTU) that masks a failure of one of its nodes. Two different types of fault-tolerant units are introduced, and the problem of the reintegration of a node into an operating cluster is taken up. The key issue is to find a reintegration point where the h-state of the node is minimal. Different techniques for h-state minimization are discussed.
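
One common way to mask the failure of a single node is majority voting over three replicas. The sketch below assumes that arrangement; the abstract does not say which two FTU types the chapter introduces, so this is an illustration rather than the chapter's design. The function name `vote` and the use of `None` for a silent node are hypothetical. Exact-match comparison is only meaningful because the replicas are assumed to be replica determinate.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(outputs: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of the replicas.

    `None` in `outputs` stands for a replica that produced no output
    (an assumed convention). Returns None when no majority exists.
    """
    present = [o for o in outputs if o is not None]
    if not present:
        return None
    value, count = Counter(present).most_common(1)[0]
    # A strict majority of all replicas, silent ones included, must agree.
    return value if count > len(outputs) // 2 else None
```

With three replicas, `vote([7, 7, 9])` and `vote([7, None, 7])` both yield `7`, so one faulty or silent node is masked, while `vote([1, 2, 3])` yields `None` because no majority exists.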

The final section discusses the utility of design diversity in the implementation of safety-critical systems and describes an industrial example of a fail-safe system that uses design diversity to increase the safety of the application.

Copyright information

© 2002 Kluwer Academic Publishers

About this chapter

Cite this chapter

(2002). Fault Tolerance. In: Real-Time Systems. The International Series in Engineering and Computer Science, vol 395. Springer, Boston, MA. https://doi.org/10.1007/0-306-47055-1_6

  • DOI: https://doi.org/10.1007/0-306-47055-1_6

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-7923-9894-3

  • Online ISBN: 978-0-306-47055-4

  • eBook Packages: Springer Book Archive
