Quarry: A User-centered Big Data Integration Platform

Jovanovic, Petar; Nadal, Sergi; Romero, Oscar; Abelló, Alberto; Bilalli, Besim

doi:10.1007/s10796-020-10001-y

Quarry: A User-centered Big Data Integration Platform

Published: 18 April 2020

Volume 23, pages 9–33, (2021)
Cite this article

Information Systems Frontiers Aims and scope Submit manuscript

Petar Jovanovic ORCID: orcid.org/0000-0003-4635-6646¹,
Sergi Nadal¹,
Oscar Romero¹,
Alberto Abelló¹ &
…
Besim Bilalli¹

1141 Accesses
12 Citations
3 Altmetric
Explore all metrics

Abstract

Obtaining valuable insights and actionable knowledge from data requires cross-analysis of domain data typically coming from various sources. Doing so, inevitably imposes burdensome processes of unifying different data formats, discovering integration paths, and all this given specific analytical needs of a data analyst. Along with large volumes of data, the variety of formats, data models, and semantics drastically contribute to the complexity of such processes. Although there have been many attempts to automate various processes along the Big Data pipeline, no unified platforms accessible by users without technical skills (like statisticians or business analysts) have been proposed. In this paper, we present a Big Data integration platform (Quarry) that uses hypergraph-based metadata to facilitate (and largely automate) the integration of domain data coming from a variety of sources, and provides an intuitive interface to assist end users both in: (1) data exploration with the goal of discovering potentially relevant analysis facets, and (2) consolidation and deployment of data flows which integrate the data, and prepare them for further analysis (descriptive or predictive), visualization, and/or publishing. We validate Quarry’s functionalities with the use case of World Health Organization (WHO) epidemiologists and data analysts in their fight against Neglected Tropical Diseases (NTDs).

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Trends and Future Perspective Challenges in Big Data

Big Data Analytics: A Literature Review Paper

Big Data Analytics in Healthcare

Notes

References

Abiteboul, S., André, B., & Kaplan, D. (2015). Managing your digital life. Communications of the ACM, 58(5), 32–35.
Google Scholar
Angles, R., & Gutiérrez, C. (2008). Survey of graph database models. ACM Computing Surveys, 40(1), 1:1–1:39.
Google Scholar
Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J. L., & Vrgoc, D. (2017). Foundations of modern query languages for graph databases. ACM Computing Surveys, 50(5), 68:1–68:40.
Google Scholar
Bean, R. (2016). Variety, not volume, is driving big data initiatives. URL https://sloanreview.mit.edu/article/variety-not-volume-is-driving-big-data-initiatives
Bilalli, B., Abelló, A., Aluja-Banet, T., Munir, R. F., & Wrembel, R. (2018a). PRESISTANT: data pre-processing assistant. In CAiSE Forum (pp. 57–65).
Google Scholar
Bilalli, B., Abelló, A., Aluja-Banet, T., & Wrembel, R. (2018b). Intelligent assistance for data pre-processing. Computer Standards & Interfaces, 57, 101–109.
Google Scholar
Bilalli, B., Abelló, A., Aluja-Banet, T., & Wrembel, R. (2019). PRESISTANT: Learning based assistant for data pre-processing. Data & Knowledge Engineering, 123, 100–122.
Google Scholar
Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., Rodriguez-Muro, M., & Xiao, G. (2017). Ontop: Answering SPARQL queries over relational databases. Semantic Web, 8(3), 471–487.
Google Scholar
Ceravolo, P., Azzini, A., Angelini, M., Catarci, T., Cudré-Mauroux, P., Damiani, E., Mazak, A., van Keulen, M., Jarrar, M., Santucci, G., Sattler, K., Scannapieco, M., Wimmer, M., Wrembel, R., & Zaraket, F. A. (2018). Big data semantics. J. Data Semantics, 7(2), 65–85.
Google Scholar
Chen, Y., Alspaugh, S., & Katz, R. H. (2012). Interactive analytical processing in big data systems: A cross-industry study of mapreduce workloads. PVLDB, 5(12), 1802–1813.
Google Scholar
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I. F., Ouzzani, M., & Tang, N. (2013). Nadeef: A commodity data cleaning system. In SIGMOD (pp. 541–552).
Google Scholar
Deng, D., Fernandez, R. C., Abedjan, Z., Wang, S., Stonebraker, M., Elmagarmid, A. K., Ilyas, I. F., Madden, S., Ouzzani, M., & Tang, N. (2017). The data civilizer system. In CIDR.
Google Scholar
Doan, A., Halevy, A. Y., & Ives, Z. G. (2012). Principles of Data Integration. Morgan Kaufmann.
Duggan, J., Elmore, A. J., Stonebraker, M., Balazinska, M., Howe, B., Kepner, J., Madden, S., Maier, D., Mattson, T., & Zdonik, S. B. (2015). The BigDAWG Polystore System. SIGMOD Record, 44(2), 11–16.
Google Scholar
Fernandez, R. C., & Madden, S. (2019). Termite: a system for tunneling through heterogeneous data. In aiDM@SIGMOD (p. 7:1–7:8).
Google Scholar
Fletcher, G. H. L., & Mandreoli, F. (2016). No users no dataspaces! query-driven dataspace orchestration? In SEBD (pp. 150–157).
Google Scholar
Franklin, M. J., Halevy, A. Y., & Maier, D. (2005). From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4), 27–33.
Google Scholar
Friedman, M., Levy, A. Y., & Millstein, T. D. (1999). Navigational plans for data integration. IJCAI: In.
Google Scholar
Garcia-Molina, H., Papakonstantinou, Y., Quass, D., Rajaraman, A., Sagiv, Y., Ullman, J. D., Vassalos, V., & Widom, J. (1997). The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information System, 8(2), 117–132.
Google Scholar
Golshan, B., Halevy, A. Y., Mihaila, G. A., & Tan, W. (2017). Data integration: After the teenage years. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017 (pp. 101–106).
Google Scholar
Gorawski, M., & Lorek, M. (2017). Efficient storage, retrieval and analysis of poker hands: An adaptive data framework. Applied Mathematics and Computer Science, 27(4), 713–726.
Google Scholar
Gorton, I., & Klein, J. (2015). Distribution, data, deployment: Software architecture convergence in big data systems. IEEE Software, 32(3), 78–85.
Google Scholar
Hai, R., Geisler, S., & Quix, C. (2016). Constance: An intelligent data lake system. In SIGMOD (pp. 2097–2100).
Google Scholar
Halevy, A. Y., Rajaraman, A., & Ordille, J. J. (2006). Data integration: The teenage years. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006 (pp. 9–16).
Google Scholar
Halevy, A. Y., Korn, F., Noy, N. F., Olston, C., Polyzotis, N., Roy, S., & Whang, S. E. (2016). Managing google’s data lake: an overview of the goods system. IEEE Data Eng. Bull., 39(3), 5–14.
Google Scholar
Hewasinghage, M., Varga, J., Abelló, A., & Zimányi, E. (2018). Managing polyglot systems metadata with hypergraphs. In ER (pp. 463–478).
Google Scholar
Jovanovic, P., Romero, O., Simitsis, A., Abelló, A., & Mayorova, D. (2014a). A requirement-driven approach to the design and evolution of data warehouses. Information Systems, 44, 94–119.
Google Scholar
Jovanovic, P., Simitsis, A., & Wilkinson, K. (2014b). Engine independence for logical analytic flows. In ICDE (pp. 1060–1071).
Google Scholar
Jovanovic, P., Romero, O., Simitsis, A., & Abelló, A. (2016). Incremental consolidation of data-intensive multi-flows. IEEE Transactions on Knowledge and Data Engineering, 28(5), 1203–1216.
Google Scholar
Konstantinou, N., Koehler, M., Abel, E., Civili, C., Neumayr, B., Sallinger, E., Fernandes, A. A. A., Gottlob, G., Keane, J. A., Libkin, L., & Paton, N. W. (2017). The VADA architecture for cost-effective data wrangling. In SIGMOD (pp. 1599–1602).
Google Scholar
Lenzerini, M. (2002). Data integration: A theoretical perspective. In PODS (pp. 233–246).
Google Scholar
Lerman, K., Minton, S., & Knoblock, C. A. (2003). Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18, 149–181.
Google Scholar
Luján-Mora, S., & Trujillo, J. (2006). Applying the UML and the unified process to the design of data warehouses. JCIS, 46(5), 30–58.
Google Scholar
Munir, R. F., Nadal, S., Romero, O., Abelló, A., Jovanovic, P., Thiele, M., & Lehner, W. (2018). Intermediate results materialization selection and format for data-intensive flows. Fundam. Inform., 163(2), 111–138.
Google Scholar
Nadal, S., Rabbani, K., Romero, O., & Tadesse, S. (2019a). ODIN: A dataspace management system. In ISWC (pp. 185–188).
Google Scholar
Nadal, S., Romero, O., Abelló, A., Vassiliadis, P., & Vansummeren, S. (2019b). An integration-oriented ontology to govern evolution in big data ecosystems. Information Systems, 79, 3–19.
Google Scholar
Popovic, A., Hackney, R., Tassabehji, R., & Castelli, M. (2018). The impact of big data analytics on firms’ high value business performance. Information Systems Frontiers, 20(2), 209–222.
Google Scholar
Priyatna, F., Corcho, Ó., & Sequeda, J. F. (2014). Formalisation and experiences of r2rml-based SPARQL to SQL query translation using morph. In WWW (pp. 479–490).
Google Scholar
Quix, C., & Hai, R. (2019). Data lake. Encyclopedia of Big Data Technologies: In.
Google Scholar
Rabbani, K. (2019). Supporting the Semi-Automatic Creation of the Target Schema in Data Integration Systems. Master’s thesis, Technische Univesitat Berlin - Universitat Politècnica de Catalunya, BarcelonaTech.
Saltor, F., Castellanos, M., & García-Solaco, M. (1991). Suitability of data models as canonical models for federated databases. SIGMOD Record, 20(4), 44–48.
Google Scholar
Sarma, A. D., Dong, X. L., & Halevy, A. Y. (2011). Uncertainty in data integration and dataspace support platforms. In Schema Matching and Mapping (pp. 75–108).
Google Scholar
Simitsis, A., Vassiliadis, P., & Sellis, T. K. (2005). State-space optimization of ETL workflows. IEEE Transactions on Knowledge and Data Engineering, 17(10), 1404–1419.
Google Scholar
Simitsis, A., Wilkinson, K., Dayal, U., & Hsu, M. (2013). HFMS: managing the lifecycle and complexity of hybrid analytic data flows. In ICDE (pp. 1174–1185).
Google Scholar
Skoutas, D., & Simitsis, A. (2007). Ontology-based conceptual design of ETL processes for both structured and semi-structured data. Int. J. Semantic Web Inf. Syst., 3(4), 1–24.
Google Scholar
Stonebraker, M. (2019). The Case for Polystores – ACM SIGMOD Blog. [Online; accessed 27. Jun. 2019].
Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., Pagan, A., & Xu, S. (2013). Data Curation at Scale: The Data Tamer System. In CIDR.
Google Scholar
Tadesse, S., Gómez, C., Romero, O., Hose, K., & Rabbani, K. (2019). ARDI: Automatic Generation of RDFS Models from Heterogeneous Data Sources. In: EDOC.
Google Scholar
Terrizzano, I. G., Schwarz, P. M., Roth, M., & Colino, J. E. (2015). Data wrangling: The challenging yourney from the wild to the lake. In CIDR.
Google Scholar
Touma, R., Romero, O., & Jovanovic, P. (2015). Supporting data integration tasks with semi-automatic ontology construction. In DOLAP (pp. 89–98).
Google Scholar
Varga, J., Romero, O., Pedersen, T. B., & Thomsen, C. (2014). Towards next generation BI systems: The analytical metadata challenge. In DaWaK (pp. 89–101).
Google Scholar
Wojciechowski, A. (2018). ETL workflow reparation by means of case-based reasoning. Information Systems Frontiers, 20(1), 21–43.
Google Scholar

Download references

Acknowledgements

We thank Dr. Lise Grout and Dr. Pedro Albajar-Viñas from the Neglected Tropical Diseases (NTD) department at WHO, for providing the use case. This work is partially supported by GENESIS project, funded by the Spanish Ministerio de Ciencia, Innovación y Universidades under project TIN2016-79269-R.

Author information

Authors and Affiliations

Universitat Politècnica de Catalunya (BarcelonaTech), Campus Nord Omega-125, UPC - dept ESSI, C/Jordi Girona 1-3, E-08034, Barcelona, Spain
Petar Jovanovic, Sergi Nadal, Oscar Romero, Alberto Abelló & Besim Bilalli

Authors

Petar Jovanovic
View author publications
You can also search for this author in PubMed Google Scholar
Sergi Nadal
View author publications
You can also search for this author in PubMed Google Scholar
Oscar Romero
View author publications
You can also search for this author in PubMed Google Scholar
Alberto Abelló
View author publications
You can also search for this author in PubMed Google Scholar
Besim Bilalli
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Petar Jovanovic.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Jovanovic, P., Nadal, S., Romero, O. et al. Quarry: A User-centered Big Data Integration Platform. Inf Syst Front 23, 9–33 (2021). https://doi.org/10.1007/s10796-020-10001-y

Download citation

Published: 18 April 2020
Issue Date: February 2021
DOI: https://doi.org/10.1007/s10796-020-10001-y

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Quarry: A User-centered Big Data Integration Platform

Abstract

Access this article

Similar content being viewed by others

Trends and Future Perspective Challenges in Big Data

Big Data Analytics: A Literature Review Paper

Big Data Analytics in Healthcare

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation