Abstract
The amount of data has increased exponentially as a consequence of new data sources and advances in data collection and storage. This data explosion was accompanied by the popularization of the term Big Data, which refers to large volumes of data with varying degrees of complexity, often without structure or organization, that cannot be processed or analyzed using traditional processes or tools. Moving towards Big Data Warehouses (BDWs) brings new problems and implies the adoption of new logical data models and of tools to query them. Hive is a data warehouse (DW) system for Big Data contexts that organizes data into tables, partitions and buckets. Several studies have examined ways of optimizing its performance in data storage and processing, but few of them explore whether the way data is structured influences how quickly Hive responds to queries. This paper investigates the role of data organization and modelling in the processing times of BDWs implemented in Hive, benchmarking multidimensional star schemas and fully denormalized tables at different Scale Factors (SFs), and analyzing the impact of adequate data partitioning on these two data modelling strategies.
Acknowledgements
This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013, and funded by the SusCity project, MITP-TB/CS/0026/2013, and by the Portugal Incentive System for Research and Technological Development, Project in co-promotion no. 002814/2015 (iFACTORY 2015–2018).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Costa, E., Costa, C., Santos, M.Y. (2017). Efficient Big Data Modelling and Organization for Hadoop Hive-Based Data Warehouses. In: Themistocleous, M., Morabito, V. (eds) Information Systems. EMCIS 2017. Lecture Notes in Business Information Processing, vol 299. Springer, Cham. https://doi.org/10.1007/978-3-319-65930-5_1
DOI: https://doi.org/10.1007/978-3-319-65930-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65929-9
Online ISBN: 978-3-319-65930-5
eBook Packages: Computer Science, Computer Science (R0)