Abstract
The main objective of this chapter is to provide information and guidance for building a Hadoop distributed file system to address the big data classification problem. Such a system lets readers implement, test, and evaluate, for learning purposes, the machine-learning techniques presented in this book. The chapter explains the Hadoop framework and the Hadoop system in detail, presents Internet resources that can help you build a virtual machine-based Hadoop distributed file system with the R programming platform, and gives easy-to-follow, step-by-step instructions for building RevolutionAnalytics’ RHadoop system for your big data computing environment. It also presents simple examples for testing the system to confirm that Hadoop works, and briefly discusses setting up a multi-node Hadoop system.
Acknowledgements
I would like to thank my graduate student Sumanth Reddy Yanala for helping to produce the drawing in Fig. 4.1. The information and discussions on “wrapletters” available at http://www.latex-community.org/forum/viewtopic.php?f=44&t=3798 helped with formatting several long continuous strings, such as Uniform Resource Locators (URLs), in this book.
Copyright information
© 2016 Springer Science+Business Media New York
About this chapter
Cite this chapter
Suthaharan, S. (2016). Distributed File System. In: Machine Learning Models and Algorithms for Big Data Classification. Integrated Series in Information Systems, vol 36. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7641-3_4
DOI: https://doi.org/10.1007/978-1-4899-7641-3_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4899-7640-6
Online ISBN: 978-1-4899-7641-3
eBook Packages: Business and Management (R0)