HcBench: Methodology, Development, and Full-System Characterization of a Customer Usage Representative Big Data/Hadoop Benchmark

  • Conference paper
Advancing Big Data Benchmarks (WBDB 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8585)

Abstract

The Hadoop platform for Map-Reduce is used extensively for Big Data batch analytics as well as for interactive applications in e-commerce, telecom, media, retail, social networking, and other areas. However, to date no industry-standard benchmark exists to evaluate the true performance of a Hadoop cluster.

Current open-source Hadoop benchmarks, such as HiBench and Terasort, fail to capture the real usage and performance of a Hadoop cluster in a datacenter. Typical Hadoop deployments run diverse analytics jobs concurrently under strict Service Level Agreement requirements, so benchmarks are needed that evaluate the effects of such concurrent execution, both for performance comparison and for cluster configuration.

In this paper, we present the methodology and development of a customer usage representative Hadoop benchmark that includes a mix of job types, a variety of data sizes, and inter-job arrival times typical of a datacenter. We present the details of this benchmark and discuss application-level, micro-architectural, and cluster-level performance characterization on an Intel Sandy Bridge Xeon processor Hadoop cluster.
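
The workload structure named in the abstract (a mix of job types, a range of input sizes, and randomized inter-job arrival times) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from HcBench: the job-type names, the mix weights, the input sizes, and the use of a gamma distribution with these parameters for the gaps between job submissions.

```python
import random
from dataclasses import dataclass


@dataclass
class Job:
    submit_time_s: float   # seconds since the start of the run
    job_type: str
    input_size_gb: int


# Hypothetical mix: names, weights, and sizes are illustrative only.
JOB_TYPES = ["sort", "aggregation", "join"]
TYPE_WEIGHTS = [0.5, 0.3, 0.2]
INPUT_SIZES_GB = [1, 10, 100]
SIZE_WEIGHTS = [0.7, 0.2, 0.1]


def generate_schedule(num_jobs, seed=0, shape=2.0, scale_s=15.0):
    """Return a list of Jobs with gamma-distributed gaps between submissions.

    shape and scale_s are assumed values; the mean gap is shape * scale_s.
    """
    rng = random.Random(seed)  # seeded so the same schedule can be replayed
    clock = 0.0
    schedule = []
    for _ in range(num_jobs):
        # Advance the clock by a random inter-arrival gap.
        clock += rng.gammavariate(shape, scale_s)
        # Draw the job type and input size independently from the weighted mix.
        job_type = rng.choices(JOB_TYPES, weights=TYPE_WEIGHTS, k=1)[0]
        size_gb = rng.choices(INPUT_SIZES_GB, weights=SIZE_WEIGHTS, k=1)[0]
        schedule.append(Job(clock, job_type, size_gb))
    return schedule


if __name__ == "__main__":
    for job in generate_schedule(5):
        print(f"{job.submit_time_s:8.1f}s  {job.job_type:<12} {job.input_size_gb} GB")
```

A driver built on such a schedule could wait until each `submit_time_s` and then launch the corresponding job, so that jobs overlap on the cluster instead of running back to back. Fixing the seed keeps the schedule identical across runs, which matters when comparing cluster configurations.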

Acknowledgements

Karthik Krishnan, now with Amazon Web Services at Amazon.com Inc., contributed significantly to the development of this benchmark while he was at Intel. We would also like to thank our manager, Dr. Faye Briggs, Intel Fellow and Chief Server Architect of the Data Center Group, for encouraging us to develop this benchmark for platform performance architecture projections.

Author information

Corresponding author

Correspondence to Vikram A. Saletore.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Saletore, V.A., Krishnan, K., Viswanathan, V., Tolentino, M.E. (2014). HcBench: Methodology, Development, and Full-System Characterization of a Customer Usage Representative Big Data/Hadoop Benchmark. In: Rabl, T., Raghunath, N., Poess, M., Bhandarkar, M., Jacobsen, HA., Baru, C. (eds) Advancing Big Data Benchmarks. WBDB 2013. Lecture Notes in Computer Science, vol 8585. Springer, Cham. https://doi.org/10.1007/978-3-319-10596-3_7

  • DOI: https://doi.org/10.1007/978-3-319-10596-3_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10595-6

  • Online ISBN: 978-3-319-10596-3

  • eBook Packages: Computer Science, Computer Science (R0)
