Towards an Adaptive Multidimensional Partitioning for Accelerating Spark SQL

Benkrid, Soumia; Bellatreche, Ladjel; Mestoui, Yacine; Ordonez, Carlos

doi:10.1007/978-3-030-86534-4_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12925)

Included in the following conference series:

International Conference on Big Data Analytics and Knowledge Discovery

Abstract

Nowadays, parallel DBMSs and Spark SQL compete with each other to query big data. Parallel DBMSs draw on extensive experience, embodied in powerful data partitioning and data allocation algorithms, but they struggle to handle dynamic changes in the query workload. Spark SQL, on the other hand, has become a solution for processing query workloads on big data outside the DBMS realm. Unfortunately, Spark SQL incurs significant random disk I/O cost because it detects no correlation between Spark jobs and the data blocks read from disk. Consequently, Spark fails to provide high performance in a dynamic analytic environment. To address this limitation, we propose an adaptive, query-aware framework for partitioning big data tables for query processing, based on a genetic optimization problem formulation. Our approach intensively rewrites queries by exploiting the hierarchies that may exist among dimension attributes, skipping irrelevant data to improve I/O performance. We present an experimental validation on a Spark SQL parallel cluster, showing promising results.
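
The idea of rewriting a query along a dimension hierarchy so that irrelevant data can be skipped may be easier to see in a toy example. The sketch below is not the authors' framework and does not use Spark. It is a minimal pure-Python illustration under stated assumptions: a hypothetical city-to-country hierarchy (CITY_TO_COUNTRY), a handful of made-up fact rows (ROWS), and a table partitioned by country. A filter on city is rewritten into the implied country predicate, so only one partition has to be scanned. All names and values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical dimension hierarchy: each city rolls up to exactly one country.
CITY_TO_COUNTRY = {
    "Poitiers": "France",
    "Paris": "France",
    "Houston": "USA",
    "Algiers": "Algeria",
}

# Toy fact rows (city, sales amount); values are made up for illustration.
ROWS = [
    ("Poitiers", 10),
    ("Paris", 25),
    ("Houston", 40),
    ("Algiers", 15),
    ("Poitiers", 5),
]


def partition_by_country(rows):
    """Group rows into partitions keyed by the coarser hierarchy level."""
    partitions = defaultdict(list)
    for city, amount in rows:
        partitions[CITY_TO_COUNTRY[city]].append((city, amount))
    return dict(partitions)


def rewrite_city_filter(city):
    """Derive the partition-level predicate implied by a city-level predicate."""
    return CITY_TO_COUNTRY[city]


def total_sales_for_city(partitions, city):
    """Answer SUM(amount) WHERE city = ?, reading only the matching partition.

    Returns the total and the number of rows actually scanned.
    """
    country = rewrite_city_filter(city)
    scanned = partitions.get(country, [])
    total = sum(amount for c, amount in scanned if c == city)
    return total, len(scanned)


partitions = partition_by_country(ROWS)
total, rows_scanned = total_sales_for_city(partitions, "Poitiers")
print(total, rows_scanned, len(ROWS))
```

In this toy setting the rewritten query scans 3 of the 5 rows. Which attributes to partition on for a given workload is the question the abstract frames as a genetic optimization problem, and the sketch does not attempt to model it.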

Author information

Authors and Affiliations

Ecole nationale Supérieure d’Informatique (ESI), Oued Smar, Algeria
Soumia Benkrid & Yacine Mestoui
LIAS/ISAE-ENSMA, Poitiers, France
Ladjel Bellatreche
University of Houston, Houston, TX, USA
Carlos Ordonez

Corresponding author

Correspondence to Soumia Benkrid.

Editor information

Editors and Affiliations

University of Bologna, Bologna, Forli/Cesena, Italy
Matteo Golfarelli
Poznań University of Technology, Poznan, Poland
Robert Wrembel
Johannes Kepler University Linz, Linz, Austria
Gabriele Kotsis
TU Wien, Vienna, Austria
A Min Tjoa
Johannes Kepler University Linz, Linz, Austria
Ismail Khalil

About this paper

Cite this paper

Benkrid, S., Bellatreche, L., Mestoui, Y., Ordonez, C. (2021). Towards an Adaptive Multidimensional Partitioning for Accelerating Spark SQL. In: Golfarelli, M., Wrembel, R., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2021. Lecture Notes in Computer Science, vol 12925. Springer, Cham. https://doi.org/10.1007/978-3-030-86534-4_3

DOI: https://doi.org/10.1007/978-3-030-86534-4_3
Published: 05 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86533-7
Online ISBN: 978-3-030-86534-4
eBook Packages: Computer Science, Computer Science (R0)
