A Lightweight Evaluation Framework for Table Layouts in MapReduce Based Query Systems

  • Conference paper
Web Technologies and Applications (APWeb 2015)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9313))

Abstract

A table layout determines how relational row-column data values are organized and stored. In recent years, a considerable number of candidate layouts have been developed for MapReduce-based query systems; they differ in storage space utilization, data loading time, query performance, and so on. Users are often confronted with the problem of choosing the best overall table layout for a given workload and table schema. The straightforward approach, running queries on generated data and comparing the results, is time-consuming and inaccurate because of MapReduce's nondeterministic execution time. In this paper, we propose a lightweight framework that evaluates table layouts without running the queries. The framework adopts a black-box method to test critical metrics and a query-aware strategy that extracts table-layout-related operations from each query. Based on these metrics and operations, the framework makes suggestions to users. We conduct extensive experiments to empirically study popular table layouts. The results show that column projection and compression are the two most prominent factors in general cases. Moreover, we discuss optimization opportunities for the intermediate tables produced in high-level language systems.
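To illustrate why column projection and compression can dominate the comparison, the following is a minimal sketch of a back-of-the-envelope scan-cost estimate. It is not the framework described in the paper: the `Column` type, the two estimator functions, and the sizes and compression ratios are all hypothetical, and the model assumes an uncompressed row layout that must read every column, against a columnar layout that reads only the projected columns in compressed form.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """Hypothetical per-column statistics used only for this illustration."""
    name: str
    raw_bytes: int            # uncompressed size of the column across the table
    compression_ratio: float  # compressed size / raw size, in (0, 1]


def bytes_scanned_row_layout(columns):
    # Assumption: a plain row layout stores all columns of a row together,
    # so a full scan reads every column regardless of the projection.
    return sum(c.raw_bytes for c in columns)


def bytes_scanned_column_layout(columns, projected):
    # Assumption: a columnar layout reads only the projected columns,
    # each stored in compressed form.
    return sum(c.raw_bytes * c.compression_ratio
               for c in columns if c.name in projected)


# Made-up example table with three columns.
table = [
    Column("a", 100, 0.5),
    Column("b", 300, 0.25),
    Column("c", 600, 1.0),
]

row_cost = bytes_scanned_row_layout(table)
col_cost = bytes_scanned_column_layout(table, projected={"a", "b"})
print(row_cost, col_cost)
```

Under these assumed numbers, a query that projects only `a` and `b` reads a small fraction of the bytes that the row layout reads, which is the kind of effect an estimate made without running the query would need to capture.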

Author information

Corresponding author

Correspondence to Feng Zhu.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhu, F., Liu, J., Xu, L., Ye, D., Wei, J., Huang, T. (2015). A Lightweight Evaluation Framework for Table Layouts in MapReduce Based Query Systems. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds) Web Technologies and Applications. APWeb 2015. Lecture Notes in Computer Science, vol 9313. Springer, Cham. https://doi.org/10.1007/978-3-319-25255-1_39

  • DOI: https://doi.org/10.1007/978-3-319-25255-1_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25254-4

  • Online ISBN: 978-3-319-25255-1

  • eBook Packages: Computer Science, Computer Science (R0)
