Efficient Big Data Modelling and Organization for Hadoop Hive-Based Data Warehouses

Costa, Eduarda; Costa, Carlos; Santos, Maribel Yasmina

doi:10.1007/978-3-319-65930-5_1

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 299)

Included in the following conference series:

European, Mediterranean, and Middle Eastern Conference on Information Systems

Abstract

The amount of data has increased exponentially as a consequence of the availability of new data sources and the advances in data collection and storage. This data explosion was accompanied by the popularization of the Big Data term, addressing large volumes of data, with several degrees of complexity, often without structure and organization, which cannot be processed or analyzed using traditional processes or tools. Moving towards Big Data Warehouses (BDWs) brings new problems and implies the adoption of new logical data models and tools to query them. Hive is a DW system for Big Data contexts that organizes the data into tables, partitions and buckets. Several studies have been conducted to understand ways of optimizing its performance in data storage and processing, but few of them explore whether the way data is structured has any influence on how quickly Hive responds to queries. This paper investigates the role of data organization and modelling in the processing times of BDWs implemented in Hive, benchmarking multidimensional star schemas and fully denormalized tables with different Scale Factors (SFs), and analyzing the impact of adequate data partitioning in these two data modelling strategies.
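
The role of partitions and buckets described above can be illustrated with a minimal in-memory sketch. It is not Hive code and does not reproduce the benchmark in this paper: the rows, the column names (order year, customer id, revenue) and the bucket count are hypothetical, and bucket assignment is simplified to a modulo on the customer id. The sketch only shows why a filter on the partitioning column lets a query read fewer rows than a full scan.

```python
from collections import defaultdict

# Hypothetical fact rows: (order_year, customer_id, revenue).
ROWS = [
    (1996, 1, 100),
    (1996, 2, 250),
    (1997, 1, 300),
    (1997, 3, 50),
    (1998, 2, 75),
]

NUM_BUCKETS = 2  # assumed bucket count, chosen only for illustration


def organize(rows, num_buckets=NUM_BUCKETS):
    """Group rows by partition key (year), then by bucket (customer_id mod n)."""
    layout = defaultdict(lambda: defaultdict(list))
    for year, customer_id, revenue in rows:
        layout[year][customer_id % num_buckets].append((customer_id, revenue))
    return layout


def revenue_for_year(layout, year):
    """Sum revenue reading only the partition for `year`.

    Returns (total_revenue, rows_read).
    """
    total = 0
    rows_read = 0
    # .get avoids creating an empty partition for a year that has no data.
    for bucket in layout.get(year, {}).values():
        for _customer_id, revenue in bucket:
            total += revenue
            rows_read += 1
    return total, rows_read


def revenue_full_scan(rows, year):
    """Same aggregate without partitioning: every row has to be inspected."""
    total = sum(revenue for row_year, _cid, revenue in rows if row_year == year)
    return total, len(rows)


if __name__ == "__main__":
    layout = organize(ROWS)
    print(revenue_for_year(layout, 1997))
    print(revenue_full_scan(ROWS, 1997))
```

Both functions return the same total, but the partitioned version reads only the rows stored under the requested year. In Hive itself, this kind of layout is declared with the `PARTITIONED BY` and `CLUSTERED BY ... INTO n BUCKETS` clauses of `CREATE TABLE`; the column choices here are assumptions for illustration only.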

Acknowledgements

This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013, and funded by the SusCity project, MITP-TB/CS/0026/2013, and by the Portugal Incentive System for Research and Technological Development, Project in co-promotion nº. 002814/2015 (iFACTORY 2015–2018).

Author information

Authors and Affiliations

ALGORITMI Research Centre, University of Minho, 4800 058, Guimarães, Portugal
Eduarda Costa, Carlos Costa & Maribel Yasmina Santos

Corresponding author

Correspondence to Carlos Costa.

Editor information

Editors and Affiliations

Department of Digital Systems, University of Piraeus, Piraeus, Greece
Marinos Themistocleous
Department of Management and Technology, Bocconi University, Milan, Italy
Vincenzo Morabito

About this paper

Cite this paper

Costa, E., Costa, C., Santos, M.Y. (2017). Efficient Big Data Modelling and Organization for Hadoop Hive-Based Data Warehouses. In: Themistocleous, M., Morabito, V. (eds) Information Systems. EMCIS 2017. Lecture Notes in Business Information Processing, vol 299. Springer, Cham. https://doi.org/10.1007/978-3-319-65930-5_1

DOI: https://doi.org/10.1007/978-3-319-65930-5_1
Published: 15 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65929-9
Online ISBN: 978-3-319-65930-5
eBook Packages: Computer Science (R0)