Software Fault Tolerance: An Overview

  • Conference paper
Reliable Software Technologies — Ada-Europe 2003 (Ada-Europe 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2655)

Abstract

This paper presents an overview of the techniques that developers can use to produce software that tolerates design faults and faults of the surrounding environment. After reviewing the basic terms and concepts of fault tolerance, it presents the best-known fault-tolerance techniques that exploit software, information, and time redundancy, classified according to the kind of concurrency they support.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kienzle, J. (2003). Software Fault Tolerance: An Overview. In: Rosen, JP., Strohmeier, A. (eds) Reliable Software Technologies — Ada-Europe 2003. Ada-Europe 2003. Lecture Notes in Computer Science, vol 2655. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44947-7_3

  • DOI: https://doi.org/10.1007/3-540-44947-7_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40376-0

  • Online ISBN: 978-3-540-44947-8

  • eBook Packages: Springer Book Archive
