Towards an Adaptive Multidimensional Partitioning for Accelerating Spark SQL

  • Conference paper
Big Data Analytics and Knowledge Discovery (DaWaK 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12925)

Abstract

Nowadays, parallel DBMSs and Spark SQL compete to query big data. Parallel DBMSs benefit from extensive experience embodied in powerful data partitioning and data allocation algorithms, but they struggle to handle dynamic changes in query workload. Spark SQL, on the other hand, has become a solution for processing query workloads on big data outside the DBMS realm. Unfortunately, Spark SQL incurs significant random disk I/O cost, because no correlation is detected between Spark jobs and the data blocks read from disk. As a consequence, Spark fails to provide high performance in a dynamic analytic environment. To address this limitation, we propose an adaptive query-aware framework for partitioning big data tables for query processing, based on a genetic optimization problem formulation. Our approach intensively rewrites queries by exploiting the dimension hierarchies that may exist among dimension attributes, skipping irrelevant data to improve I/O performance. We present an experimental validation on a Spark SQL parallel cluster, showing promising results.
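
The idea of skipping irrelevant data through a dimension hierarchy can be illustrated with a minimal sketch. It is not the authors' framework and involves no genetic optimization or Spark code; the city-to-country hierarchy, the sample rows, and the helper names (`partition_rows`, `query_by_city`) are all hypothetical. Assuming a table partitioned on the coarser level (country), a predicate on the finer level (city) is rewritten to select a single partition, so the remaining partitions are never read.

```python
from collections import defaultdict

# Hypothetical dimension hierarchy (city -> country); not taken from the paper.
CITY_TO_COUNTRY = {"Paris": "FR", "Lyon": "FR", "Austin": "US", "Boston": "US"}


def partition_rows(rows, key):
    """Group rows into partitions identified by key(row)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key(row)].append(row)
    return dict(partitions)


def query_by_city(partitions, city):
    """Answer a city predicate against country-level partitions.

    The city predicate is rewritten into a country predicate to choose
    one partition; the original city filter is then applied inside it.
    Returns the matching rows and the number of rows actually scanned.
    """
    country = CITY_TO_COUNTRY[city]
    scanned = partitions.get(country, [])
    matches = [row for row in scanned if row["city"] == city]
    return matches, len(scanned)


# Hypothetical sample data.
rows = [
    {"city": "Paris", "sales": 10},
    {"city": "Lyon", "sales": 5},
    {"city": "Austin", "sales": 7},
    {"city": "Boston", "sales": 3},
    {"city": "Paris", "sales": 2},
]

partitions = partition_rows(rows, key=lambda r: CITY_TO_COUNTRY[r["city"]])
matches, scanned = query_by_city(partitions, "Paris")
total_sales = sum(row["sales"] for row in matches)
```

In this toy setting the query touches only the rows of the FR partition rather than the whole table. Which attributes to partition on for a changing workload is the question the paper frames as a genetic optimization problem; the sketch fixes that choice by hand.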

Author information

Corresponding author

Correspondence to Soumia Benkrid.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Benkrid, S., Bellatreche, L., Mestoui, Y., Ordonez, C. (2021). Towards an Adaptive Multidimensional Partitioning for Accelerating Spark SQL. In: Golfarelli, M., Wrembel, R., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2021. Lecture Notes in Computer Science, vol 12925. Springer, Cham. https://doi.org/10.1007/978-3-030-86534-4_3

  • DOI: https://doi.org/10.1007/978-3-030-86534-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86533-7

  • Online ISBN: 978-3-030-86534-4

  • eBook Packages: Computer Science, Computer Science (R0)
