
Failure Characteristics and Soft Error Behavior in a Large Storage System

  • Chapter
Dependable Network Computing

Abstract

This chapter analyzes the error behavior of a 3.2 TB disk storage system. We report reliability data for 18 months of the prototype’s operation and analyze 6 months of error logs from nodes in the prototype. We found that the disk drives were among the most reliable components in the system. We were able to divide errors into eleven categories, comprising disk errors, network errors, and SCSI errors, that appeared repeatedly across all nodes. We also gained insight into the types of error messages reported by devices under various conditions and the effects of these events on the operating system. Finally, we present data from four cases of disk drive failure. These results and insights should be useful to any designer of a fault-tolerant storage system.




Copyright information

© 2000 Springer Science+Business Media New York

About this chapter

Cite this chapter

Talagala, N., Patterson, D. (2000). Failure Characteristics and Soft Error Behavior in a Large Storage System. In: Avresky, D.R. (ed.) Dependable Network Computing. The Springer International Series in Engineering and Computer Science, vol 538. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-4549-1_2


  • DOI: https://doi.org/10.1007/978-1-4615-4549-1_2

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-7053-6

  • Online ISBN: 978-1-4615-4549-1

  • eBook Packages: Springer Book Archive
