Failure Characteristics and Soft Error Behavior in a Large Storage System

Talagala, Nisha; Patterson, David

doi:10.1007/978-1-4615-4549-1_2

Part of the book series: The Springer International Series in Engineering and Computer Science (SECS, volume 538)

Abstract

This chapter analyzes the error behavior of a 3.2 TB disk storage system. We report reliability data for 18 months of the prototype’s operation and analyze 6 months of error logs from nodes in the prototype. We found that the disk drives were among the most reliable components in the system. We divided the errors into eleven categories, comprising disk errors, network errors, and SCSI errors, that appeared repeatedly across all nodes. We also gained insight into the types of error messages that devices report under various conditions and into the effects of these events on the operating system. Finally, we present data from four cases of disk drive failure. These results and insights should be useful to any designer of a fault-tolerant storage system.

Author information

Authors and Affiliations

Computer Science Division, University of California at Berkeley, Berkeley, CA, 94720, USA
Nisha Talagala & David Patterson

Editor information

Editors and Affiliations

Boston University, Boston, MA, USA
Dimiter R. Avresky

About this chapter

Cite this chapter

Talagala, N., Patterson, D. (2000). Failure Characteristics and Soft Error Behavior in a Large Storage System. In: Avresky, D.R. (eds) Dependable Network Computing. The Springer International Series in Engineering and Computer Science, vol 538. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-4549-1_2

DOI: https://doi.org/10.1007/978-1-4615-4549-1_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7053-6
Online ISBN: 978-1-4615-4549-1
eBook Packages: Springer Book Archive
