Intelligent Data Placement in Heterogeneous Hadoop Cluster

  • Conference paper
Smart and Innovative Trends in Next Generation Computing Technologies (NGCT 2017)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 827)

Abstract

The MapReduce programming model and Hadoop have become the de facto standard for data-intensive applications. Hadoop maps tasks to the nodes within the cluster that hold the data those tasks require. Such a strategy is intuitively appealing for a cluster that is homogeneous in both computation and storage capabilities. However, most clusters in practice are heterogeneous, since nodes are added over a prolonged period. This necessitates an intelligent data placement strategy that accounts for the inherent heterogeneity of the cluster nodes; ignoring it leads to performance bottlenecks. In this paper, we propose a performance-based clustering of Hadoop nodes, followed by data placement among the nodes according to that clustering. Performance-based profiling of nodes can be achieved by running multiple benchmarks offline and dividing the cluster nodes into two subsets, namely low- and high-performance nodes. Additionally, the execution of Hadoop tasks is monitored using Hadoop’s task speculation mechanism, and computations for slow-running tasks are dynamically migrated based on prior knowledge of the data blocks associated with each task. Experiments demonstrate that the proposed intelligent data placement improves network utilization and cluster performance.
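
The profiling and placement idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how benchmark results are combined, how the two subsets are separated, or how blocks are distributed. The sketch assumes a single numeric benchmark score per node (higher means faster), a median threshold for the low/high split, and block counts proportional to score. The function names `split_by_performance` and `place_blocks` and the node names are hypothetical.

```python
from statistics import median


def split_by_performance(scores):
    """Divide nodes into low- and high-performance subsets.

    `scores` maps node name -> benchmark score (higher means faster).
    Using the median as the threshold is an assumption of this sketch.
    """
    threshold = median(scores.values())
    high = sorted(n for n, s in scores.items() if s >= threshold)
    low = sorted(n for n, s in scores.items() if s < threshold)
    return low, high


def place_blocks(scores, num_blocks):
    """Assign a block count to each node in proportion to its score.

    Proportional placement is an assumption of this sketch, not a
    policy stated in the abstract. Scores must be positive integers.
    """
    total = sum(scores.values())
    # Start from the floor of each node's proportional share.
    shares = {n: (s * num_blocks) // total for n, s in scores.items()}
    # Hand out the blocks lost to rounding, fastest nodes first.
    leftover = num_blocks - sum(shares.values())
    for node in sorted(scores, key=scores.get, reverse=True)[:leftover]:
        shares[node] += 1
    return shares


# Hypothetical benchmark scores for a four-node cluster.
scores = {"node-a": 10, "node-b": 10, "node-c": 30, "node-d": 50}
low, high = split_by_performance(scores)
placement = place_blocks(scores, 100)
```

With these hypothetical scores, `node-a` and `node-b` fall in the low-performance subset, and the faster nodes receive correspondingly more of the 100 blocks, so that tasks scheduled for data locality land mostly on nodes that can finish them quickly.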



Corresponding author

Correspondence to D. S. Roy.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Paik, S.S., Goswami, R.S., Roy, D.S., Reddy, K.H. (2018). Intelligent Data Placement in Heterogeneous Hadoop Cluster. In: Bhattacharyya, P., Sastry, H., Marriboyina, V., Sharma, R. (eds) Smart and Innovative Trends in Next Generation Computing Technologies. NGCT 2017. Communications in Computer and Information Science, vol 827. Springer, Singapore. https://doi.org/10.1007/978-981-10-8657-1_43

  • DOI: https://doi.org/10.1007/978-981-10-8657-1_43

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-8656-4

  • Online ISBN: 978-981-10-8657-1

  • eBook Packages: Computer Science, Computer Science (R0)
