Fault-Tolerance Issues of Local Area Multiprocessor (LAMP) Storage Subsystem

Li, Qiang; Hong, Edward; Tsukerman, Alex

doi:10.1007/978-1-4615-5449-3_8

Abstract

This paper discusses the fault tolerance issues of the Local Area Multiprocessor (LAMP) storage subsystem and presents its architecture design, error detection and recovery algorithms, and logical volume reconstruction procedure. LAMP is a network of workstations with shared physical memory; its basic communication protocol is load and store. The LAMP storage subsystem is designed for a class of distributed computing systems that (1) have distributed shared memory, (2) use a low-latency, high-bandwidth interconnect, and (3) provide remote DMA support. The LAMP storage subsystem stripes data across multiple nodes for higher I/O performance and availability. It organizes logical volumes (virtual disks) to store files according to file size and data access pattern, as well as other criteria such as performance, availability, and security requirements. The LAMP storage subsystem implements RAID technology, with RAID levels 0, 1, and 5 available for each logical volume. Write-ahead logging is used to log the data, metadata, and parity updates of a recovery unit, which allows the LAMP storage subsystem to perform fast error recovery. For rapid reconstruction of a failed logical volume, the LAMP logical volume reconstruction algorithm is implemented. Three main fault tolerance issues of the LAMP storage subsystem are discussed in this paper: system configurability for fault tolerance and performance, fast error detection and recovery, and fast logical volume reconstruction.
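As a rough illustration of the RAID-5 idea mentioned in the abstract, the sketch below stripes data blocks across nodes with a rotating XOR parity block and rebuilds the block held by a failed node from the survivors. It is a generic textbook sketch, not the LAMP reconstruction algorithm itself; the function names (`xor_blocks`, `make_stripe`, `reconstruct`) and the parity rotation rule are assumptions made for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks, stripe_index):
    """Lay out one stripe over len(data_blocks) + 1 nodes.

    The parity block is placed on node (stripe_index mod node count), so
    parity rotates across nodes from stripe to stripe (an assumed rule).
    Returns the per-node block list and the index of the parity node.
    """
    node_count = len(data_blocks) + 1
    parity_node = stripe_index % node_count
    stripe = list(data_blocks)
    stripe.insert(parity_node, xor_blocks(data_blocks))
    return stripe, parity_node


def reconstruct(stripe, failed_node):
    """Rebuild the block of one failed node by XOR-ing all surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_node]
    return xor_blocks(survivors)


# Usage: three data blocks, stripe number 1, then lose node 2 and rebuild it.
stripe, parity_node = make_stripe([b"LAMP", b"node", b"data"], stripe_index=1)
rebuilt = reconstruct(stripe, failed_node=2)
```

Because XOR is its own inverse, the same `reconstruct` call recovers either a data block or the parity block, which is why a RAID-5 volume tolerates the loss of any single node in a stripe.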

This work is sponsored in part by a grant from the National Science Foundation, CCR-941006.

Author information

Authors and Affiliations

Department of Computer Engineering, Santa Clara University, Santa Clara, California, 95053, USA
Qiang Li, Edward Hong & Alex Tsukerman

About this chapter

Cite this chapter

Li, Q., Hong, E., Tsukerman, A. (1998). Fault-Tolerance Issues of Local Area Multiprocessor (LAMP) Storage Subsystem. In: Fault-Tolerant Parallel and Distributed Systems. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-5449-3_8

DOI: https://doi.org/10.1007/978-1-4615-5449-3_8
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7488-6
Online ISBN: 978-1-4615-5449-3
eBook Packages: Springer Book Archive