Intelligent Data Placement in Heterogeneous Hadoop Cluster

Paik, Subhendu Sekhar; Goswami, Rajat Subhra; Roy, D. S.; Reddy, K. Hemant

doi:10.1007/978-981-10-8657-1_43

Part of the book series: Communications in Computer and Information Science (CCIS, volume 827)

Included in the following conference series:

International Conference on Next Generation Computing Technologies

Abstract

The MapReduce programming model and Hadoop have become the de facto standard for data-intensive applications. Hadoop maps tasks to the cluster nodes that hold the data those tasks require. Such a strategy is intuitively appealing for a cluster that is homogeneous in both computation and storage capabilities. However, most commonplace clusters are heterogeneous, since nodes are added over a prolonged period. This necessitates an intelligent data placement strategy that accounts for the inherent heterogeneity among cluster nodes; ignoring it creates performance bottlenecks. In this paper, we propose a performance-based clustering of Hadoop nodes, followed by data placement among the nodes according to that clustering. Performance-based profiling of nodes can be achieved by running multiple benchmarks offline and dividing the cluster nodes into two subsets, namely low-performance and high-performance nodes. Additionally, the execution of Hadoop tasks is monitored using Hadoop's task speculation mechanism, and computations for slow-running tasks are dynamically migrated based on prior knowledge of the data blocks associated with each task. Experiments demonstrate that the proposed intelligent data placement improves network utilization and cluster performance.
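The two-subset profiling and performance-aware placement described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the split or the placement is computed, so everything here is an assumption made for illustration: the node names, the integer benchmark scores, the median cutoff, and the proportional (largest-remainder) allocation of blocks are all hypothetical and are not taken from the paper.

```python
from statistics import median


def split_nodes(scores):
    """Split nodes into (low, high) performance subsets.

    Assumption: the cutoff is the median benchmark score; the paper's
    actual criterion is not stated in the abstract.
    """
    cutoff = median(scores.values())
    low = sorted(n for n, s in scores.items() if s < cutoff)
    high = sorted(n for n, s in scores.items() if s >= cutoff)
    return low, high


def place_blocks(num_blocks, scores):
    """Give each node a block count proportional to its benchmark score.

    Assumption: proportional allocation with largest-remainder rounding,
    so the counts always sum to num_blocks.
    """
    total = sum(scores.values())
    exact = {n: num_blocks * s / total for n, s in scores.items()}
    counts = {n: int(x) for n, x in exact.items()}
    leftover = num_blocks - sum(counts.values())
    # Hand the remaining blocks to the nodes with the largest fractional part.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    return counts


# Hypothetical offline benchmark scores (higher means faster).
scores = {"node-a": 10, "node-b": 12, "node-c": 30, "node-d": 28}
low, high = split_nodes(scores)
plan = place_blocks(40, scores)
print(low, high, plan)
```

In this sketch the faster nodes receive more blocks, so more map tasks can run on local data there; a real deployment would also have to respect HDFS replication and rack-awareness constraints, which the sketch ignores.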

Author information

Authors and Affiliations

National Institute of Technology, Meghalaya, Laitumkhrah, Shillong, 793003, India
Subhendu Sekhar Paik & D. S. Roy
National Institute of Technology, Arunachal Pradesh, Yupia, 791112, India
Rajat Subhra Goswami
National Institute of Science and Technology, Berhampur, 761008, Odisha, India
K. Hemant Reddy

Corresponding author

Correspondence to D. S. Roy.

Editor information

Editors and Affiliations

Indian Institute of Technology Patna, Patna, Bihar, India
Pushpak Bhattacharyya
University of Petroleum and Energy Studies, Dehradun, India
Hanumat G. Sastry
University of Petroleum and Energy Studies, Dehradun, India
Venkatadri Marriboyina
University of Petroleum and Energy Studies, Dehradun, India
Rashmi Sharma

About this paper

Cite this paper

Paik, S.S., Goswami, R.S., Roy, D.S., Reddy, K.H. (2018). Intelligent Data Placement in Heterogeneous Hadoop Cluster. In: Bhattacharyya, P., Sastry, H., Marriboyina, V., Sharma, R. (eds) Smart and Innovative Trends in Next Generation Computing Technologies. NGCT 2017. Communications in Computer and Information Science, vol 827. Springer, Singapore. https://doi.org/10.1007/978-981-10-8657-1_43

DOI: https://doi.org/10.1007/978-981-10-8657-1_43
Published: 09 June 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8656-4
Online ISBN: 978-981-10-8657-1
eBook Packages: Computer Science, Computer Science (R0)
