Efficient Big Data Modelling and Organization for Hadoop Hive-Based Data Warehouses

  • Conference paper
Information Systems (EMCIS 2017)

Abstract

The amount of data has increased exponentially as a consequence of the availability of new data sources and the advances in data collection and storage. This data explosion was accompanied by the popularization of the Big Data term, addressing large volumes of data, with several degrees of complexity, often without structure and organization, which cannot be processed or analyzed using traditional processes or tools. Moving towards Big Data Warehouses (BDWs) brings new problems and implies the adoption of new logical data models and tools to query them. Hive is a DW system for Big Data contexts that organizes the data into tables, partitions and buckets. Several studies have been conducted to understand ways of optimizing its performance in data storage and processing, but few of them explore whether the way data is structured has any influence on how quickly Hive responds to queries. This paper investigates the role of data organization and modelling in the processing times of BDWs implemented in Hive, benchmarking multidimensional star schemas and fully denormalized tables with different Scale Factors (SFs), and analyzing the impact of adequate data partitioning in these two data modelling strategies.
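As an illustration of why data organization may affect query time, the following minimal sketch models the table, partition and bucket hierarchy described above in plain Python. It is not Hive code and does not reproduce the paper's benchmark: the column names (`order_year`, `customer`, `revenue`), the bucket count and the CRC32-based hash are all assumptions chosen for the example, and Hive's real bucketing hash differs.

```python
from collections import defaultdict
from zlib import crc32

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this sketch


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable stand-in for a bucketing hash (not Hive's actual hash function)."""
    return crc32(key.encode("utf-8")) % num_buckets


def organize(rows, partition_col, bucket_col):
    """Group rows as: partition value -> bucket id -> list of rows."""
    layout = defaultdict(lambda: defaultdict(list))
    for row in rows:
        bucket_id = bucket_of(str(row[bucket_col]))
        layout[row[partition_col]][bucket_id].append(row)
    return layout


def scan(layout, partition_value=None):
    """Return (rows, rows_read).

    Without a partition filter every partition is read. With one, only the
    matching partition is touched, which is the idea behind partition pruning.
    """
    if partition_value is None:
        parts = list(layout.values())
    elif partition_value in layout:
        parts = [layout[partition_value]]
    else:
        parts = []
    rows = [r for part in parts for bucket in part.values() for r in bucket]
    return rows, len(rows)


# Hypothetical sample data: 12 rows spread over 3 years and 5 customers.
rows = [
    {"order_year": 1992 + i % 3, "customer": f"c{i % 5}", "revenue": 10 * i}
    for i in range(12)
]

layout = organize(rows, "order_year", "customer")
all_rows, read_all = scan(layout)            # full scan reads every partition
pruned, read_pruned = scan(layout, 1993)     # filter on the partition column

print(read_all, read_pruned)
```

In this toy layout a filter on the partition column reads only one of the three partitions, whereas a filter on any other column would still require a full scan. Whether and by how much this carries over to star schemas and denormalized tables in Hive is the question the paper's benchmark addresses.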

Acknowledgements

This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013, and funded by the SusCity project, MITP-TB/CS/0026/2013, and by the Portugal Incentive System for Research and Technological Development, Project in co-promotion nº. 002814/2015 (iFACTORY 2015–2018).

Author information

Corresponding author

Correspondence to Carlos Costa.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Costa, E., Costa, C., Santos, M.Y. (2017). Efficient Big Data Modelling and Organization for Hadoop Hive-Based Data Warehouses. In: Themistocleous, M., Morabito, V. (eds) Information Systems. EMCIS 2017. Lecture Notes in Business Information Processing, vol 299. Springer, Cham. https://doi.org/10.1007/978-3-319-65930-5_1

  • DOI: https://doi.org/10.1007/978-3-319-65930-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-65929-9

  • Online ISBN: 978-3-319-65930-5

  • eBook Packages: Computer Science, Computer Science (R0)