Optimizing Execution Plans in a Multistore

Forresi, Chiara; Francia, Matteo; Gallinucci, Enrico; Golfarelli, Matteo

doi:10.1007/978-3-030-82472-3_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12843)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

Multistores are data management systems that enable query processing across different database management systems (DBMSs); besides the distribution of data, complexity factors like schema heterogeneity and data replication must be resolved through integration and data fusion activities. In a recent work [2], we have proposed a multistore solution that relies on a dataspace to provide the user with an integrated view of the available data and enables the formulation and execution of GPSJ (generalized projection, selection and join) queries. In this paper, we propose a technique to optimize the execution of GPSJ queries by finding the most efficient execution plan on the multistore. In particular, we devise three different strategies to carry out joins and data fusion, and we build a cost model to enable the evaluation of different execution plans. Through the experimental evaluation, we are able to profile the suitability of each strategy to different multistore configurations, thus validating our multi-strategy approach and motivating further research on this topic.
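
To make the query class concrete, the following is a minimal sketch of what a GPSJ query computes on a single store: a join along a many-to-one relationship, a selection, and a generalized projection (grouping with aggregation). It is an illustration only, not the authors' implementation, and it leaves out the multistore concerns discussed above (distribution, schema heterogeneity, data fusion). The tables, field names, and the `gpsj` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: orders reference customers through a
# many-to-one relationship on customer_id.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 10.0},
    {"order_id": 2, "customer_id": "c1", "amount": 5.0},
    {"order_id": 3, "customer_id": "c2", "amount": 7.5},
    {"order_id": 4, "customer_id": "c3", "amount": 20.0},
]
customers = [
    {"customer_id": "c1", "country": "IT"},
    {"customer_id": "c2", "country": "IT"},
    {"customer_id": "c3", "country": "FR"},
]


def gpsj(facts, dims, key, predicate, group_by, measure):
    """Join facts to dims on key, filter rows, then group and sum a measure."""
    dim_index = {d[key]: d for d in dims}  # lookup for the many-to-one side
    totals = defaultdict(float)
    for fact in facts:
        dim = dim_index.get(fact[key])
        if dim is None:  # inner join: skip facts without a matching dim row
            continue
        row = {**fact, **dim}  # join
        if predicate(row):  # selection
            totals[row[group_by]] += row[measure]  # generalized projection
    return dict(totals)


# Total amount per country, restricted to orders of at least 7.0.
result = gpsj(
    orders,
    customers,
    key="customer_id",
    predicate=lambda r: r["amount"] >= 7.0,
    group_by="country",
    measure="amount",
)
print(result)
```

In a multistore the two inputs could reside in different DBMSs and overlap in content, which is where the choice among join and data fusion strategies, and a cost model to compare the resulting plans, comes into play.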

Notes

1. Remarkably, many-to-one relationships are at the base of the multidimensional model and GPSJ queries [12], as well as our dataspace-based approach [2].
2. Although the level of parallelism in Spark is given in terms of CPU cores, we consider the number of machines because the cost model is focused on disk IO rather than on CPU computation.

References

1. Baldacci, L., Golfarelli, M.: A cost model for SPARK SQL. IEEE Trans. Knowl. Data Eng. 31(5), 819–832 (2019)
2. Ben Hamadou, H., Gallinucci, E., Golfarelli, M.: Answering GPSJ queries in a polystore: a dataspace-based approach. In: Laender, A.H.F., Pernici, B., Lim, E.-P., de Oliveira, J.P.M. (eds.) ER 2019. LNCS, vol. 11788, pp. 189–203. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-33223-5_16
3. Bimonte, S., Gallinucci, E., Marcel, P., Rizzi, S.: Data variety, come as you are in multi-model data warehouses. Inf. Syst. 101734 (2021)
4. Bleiholder, J., Naumann, F.: Declarative data fusion – syntax, semantics, and implementation. In: Eder, J., Haav, H.-M., Kalja, A., Penjam, J. (eds.) ADBIS 2005. LNCS, vol. 3631, pp. 58–73. Springer, Heidelberg (2005). https://doi.org/10.1007/11547686_5
5. Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. (CSUR) 41(1), 1–41 (2009)
6. Bonaque, R., et al.: Mixed-instance querying: a lightweight integration architecture for data journalism. Proc. VLDB Endow. 9(13), 1513–1516 (2016)
7. DeWitt, D.J., et al.: Implementation techniques for main memory database systems. In: Proceedings of the 1984 SIGMOD Annual Meeting, pp. 1–8 (1984)
8. DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: 2016 ACM SIGMOD International Conference on Management of Data, pp. 295–310. ACM (2016)
9. Franklin, M.J., Halevy, A.Y., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Rec. 34(4), 27–33 (2005)
10. Gadepally, V., et al.: The BIGDAWG polystore system and architecture. In: 2016 IEEE High Performance Extreme Computing Conference, pp. 1–6. IEEE (2016)
11. Gallinucci, E., Golfarelli, M., Rizzi, S.: Approximate OLAP of document-oriented databases: a variety-aware approach. Inf. Syst. 85, 114–130 (2019)
12. Golfarelli, M., Maio, D., Rizzi, S.: The dimensional fact model: a conceptual model for data warehouses. Int. J. Coop. Inf. Syst. 7(2–3), 215–247 (1998)
13. Jeffery, S.R., Franklin, M.J., Halevy, A.Y.: Pay-as-you-go user feedback for dataspace systems. In: 2008 ACM SIGMOD International Conference on Management of Data, pp. 847–860. ACM (2008)
14. Kolev, B., et al.: CloudMDSQL: querying heterogeneous cloud data stores with a common language. Distrib. Parallel Databases 34(4), 463–503 (2016)
15. Maccioni, A., Torlone, R.: Augmented access for querying and exploring a polystore. In: 34th IEEE International Conference on Data Engineering, ICDE 2018, pp. 77–88. IEEE Computer Society (2018)
16. Mandreoli, F., Montangero, M.: Dealing with data heterogeneity in a data fusion perspective: models, methodologies, and algorithms. In: Data Handling in Science and Technology, vol. 31, pp. 235–270. Elsevier (2019)
17. Mishra, P., Eich, M.H.: Join processing in relational databases. ACM Comput. Surv. 24(1), 63–113 (1992)
18. Naumann, F., Freytag, J.C., Leser, U.: Completeness of integrated information sources. Inf. Syst. 29(7), 583–615 (2004)
19. Sadalage, P.J., Fowler, M.: NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Education, London (2013)
20. Shi, J., et al.: Clash of the titans: mapreduce vs. spark for large scale data analytics. Proc. VLDB Endow. 8(13), 2110–2121 (2015)
21. Tan, R., Chirkova, R., Gadepally, V., Mattson, T.G.: Enabling query processing across heterogeneous data models: a survey. In: 2017 IEEE International Conference on Big Data, pp. 3211–3220. IEEE Computer Society (2017)
22. Zhang, C., Lu, J., Xu, P., Chen, Y.: UniBench: a benchmark for multi-model database management systems. In: Nambiar, R., Poess, M. (eds.) TPCTC 2018. LNCS, vol. 11135, pp. 7–23. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11404-6_2

Author information

Authors and Affiliations

University of Bologna, Cesena, Italy
Chiara Forresi, Matteo Francia, Enrico Gallinucci & Matteo Golfarelli

Corresponding author

Correspondence to Enrico Gallinucci.

Editor information

Editors and Affiliations

LIAS/ISAE-ENSMA, Futuroscope Chasseneuil Cedex, France
Ladjel Bellatreche
University of Tartu, Tartu, Estonia
Marlon Dumas
Aarhus University, Aarhus, Denmark
Panagiotis Karras
University of Tartu, Tartu, Estonia
Raimundas Matulevičius

About this paper

Cite this paper

Forresi, C., Francia, M., Gallinucci, E., Golfarelli, M. (2021). Optimizing Execution Plans in a Multistore. In: Bellatreche, L., Dumas, M., Karras, P., Matulevičius, R. (eds) Advances in Databases and Information Systems. ADBIS 2021. Lecture Notes in Computer Science, vol 12843. Springer, Cham. https://doi.org/10.1007/978-3-030-82472-3_11

DOI: https://doi.org/10.1007/978-3-030-82472-3_11
Published: 16 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82471-6
Online ISBN: 978-3-030-82472-3
eBook Packages: Computer Science, Computer Science (R0)
