Parallel query processing in a polystore

Kranas, Pavlos; Kolev, Boyan; Levchenko, Oleksandra; Pacitti, Esther; Valduriez, Patrick; Jiménez-Peris, Ricardo; Patiño-Martinez, Marta

doi:10.1007/s10619-021-07322-5

Published: 03 February 2021

Volume 39, pages 939–977, (2021)

Distributed and Parallel Databases

Pavlos Kranas¹,²,
Boyan Kolev¹,³ (ORCID: orcid.org/0000-0003-4871-0434),
Oleksandra Levchenko³,
Esther Pacitti³,
Patrick Valduriez³,
Ricardo Jiménez-Peris¹ &
Marta Patiño-Martinez²


Abstract

The proliferation of different data stores has made polystores a major topic in the cloud and big data landscape. As the amount of data grows rapidly, it becomes critical to exploit the inherent parallel processing capabilities of the underlying data stores and data processing platforms. To fully achieve this, a polystore should: (i) preserve the expressivity of each data store’s native query or scripting language and (ii) leverage a distributed architecture to enable parallel data integration, i.e., joins, on top of parallel retrieval of the underlying partitioned datasets. In this paper, we address these points by: (i) using the polyglot approach of the CloudMdsQL query language, which allows native queries to be expressed as inline scripts and combined with SQL statements for ad-hoc integration, and (ii) incorporating the approach within the LeanXcale distributed query engine, thus allowing native scripts to be processed in parallel at data store shards. In addition, (iii) efficient optimization techniques, such as bind join, can be applied to improve the performance of selective joins. Our experimental validation evaluates the performance benefits of exploiting parallelism in combination with high expressivity and optimization.
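
The bind join named in the abstract can be illustrated with a small sketch. This is not the CloudMdsQL or LeanXcale implementation; it only shows the general idea, assuming the left side is small and selective and that the right-side store accepts a set of join keys as a filter. The names bind_join, fetch_orders, and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set

Row = Dict[str, object]


def bind_join(left_rows: List[Row],
              fetch_right: Callable[[Set[object]], Iterable[Row]],
              left_key: str,
              right_key: str) -> List[Row]:
    """Join left_rows with a remote right side, fetching only matching rows."""
    # 1. Collect the distinct join-key values of the (already filtered) left side.
    keys = {row[left_key] for row in left_rows}
    if not keys:
        return []
    # 2. Pass the keys to the right-side query so the remote store returns
    #    only rows that can take part in the join, then index them by key.
    index = defaultdict(list)
    for right in fetch_right(keys):
        index[right[right_key]].append(right)
    # 3. Combine each left row with its matching right rows.
    return [{**left, **right}
            for left in left_rows
            for right in index.get(left[left_key], [])]


# Hypothetical right-side data store, stood in for by an in-memory list.
ORDERS = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 2},
    {"order_id": 12, "cust_id": 2},
    {"order_id": 13, "cust_id": 3},
]


def fetch_orders(keys: Set[object]) -> Iterable[Row]:
    # Stands in for a native query with an IN-list filter on cust_id.
    return (o for o in ORDERS if o["cust_id"] in keys)


customers = [{"cust_id": 2, "name": "Bo"}]  # result of a selective left query
result = bind_join(customers, fetch_orders, "cust_id", "cust_id")
```

Only the orders whose cust_id appears on the left side are retrieved, which is why the technique pays off mainly when the join is selective.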

Acknowledgements

This research has been partially funded by the European Union's Horizon 2020 Programme, project BigDataStack (Grant 779747), project INFINITECH (Grant 856632), project PolicyCLOUD (Grant 870675), by the Madrid Regional Council, FSE and FEDER, project EDGEDATA (P2018/TCS-4499), CLOUDDB project TIN2016-80350-P (MINECO/FEDER, UE), and industrial doctorate grant for Pavlos Kranas (IND2017/TIC-7829). Prof. Jose Pereira, Ricardo Vilaça, and Rui Gonçalves contributed to this work when they were with LeanXcale.

Author information

Authors and Affiliations

LeanXcale, Madrid, Spain
Pavlos Kranas, Boyan Kolev & Ricardo Jiménez-Peris
Distributed Systems Lab at Universidad Politécnica de Madrid, Madrid, Spain
Pavlos Kranas & Marta Patiño-Martinez
Inria, University of Montpellier, CNRS, LIRMM, Montpellier, France
Boyan Kolev, Oleksandra Levchenko, Esther Pacitti & Patrick Valduriez

Corresponding author

Correspondence to Boyan Kolev.

About this article

Cite this article

Kranas, P., Kolev, B., Levchenko, O. et al. Parallel query processing in a polystore. Distrib Parallel Databases 39, 939–977 (2021). https://doi.org/10.1007/s10619-021-07322-5

Accepted: 09 January 2021
Published: 03 February 2021
Issue Date: December 2021
DOI: https://doi.org/10.1007/s10619-021-07322-5
