Skip to main content

Advertisement

Log in

Quarry: A User-centered Big Data Integration Platform

  • Published:
Information Systems Frontiers Aims and scope Submit manuscript

Abstract

Obtaining valuable insights and actionable knowledge from data requires cross-analysis of domain data typically coming from various sources. Doing so, inevitably imposes burdensome processes of unifying different data formats, discovering integration paths, and all this given specific analytical needs of a data analyst. Along with large volumes of data, the variety of formats, data models, and semantics drastically contribute to the complexity of such processes. Although there have been many attempts to automate various processes along the Big Data pipeline, no unified platforms accessible by users without technical skills (like statisticians or business analysts) have been proposed. In this paper, we present a Big Data integration platform (Quarry) that uses hypergraph-based metadata to facilitate (and largely automate) the integration of domain data coming from a variety of sources, and provides an intuitive interface to assist end users both in: (1) data exploration with the goal of discovering potentially relevant analysis facets, and (2) consolidation and deployment of data flows which integrate the data, and prepare them for further analysis (descriptive or predictive), visualization, and/or publishing. We validate Quarry’s functionalities with the use case of World Health Organization (WHO) epidemiologists and data analysts in their fight against Neglected Tropical Diseases (NTDs).

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18

Similar content being viewed by others

Notes

  1. https://www.who.int/neglected_diseases/disease_management/wiscentds

  2. https://www.who-umc.org

  3. https://www.promedmail.org

  4. https://www.who.int/chagas/en

  5. http://mss4ntd.essi.upc.edu/wiki/index.php?title=WHO_Integrated_Data_Platform_(WIDP)

  6. https://www.dhis2.org

  7. http://mss4ntd.essi.upc.edu/wiki/index.php?title=WHO_Integrated_Medical_Supplies_System_(WIMEDS)

  8. https://www.bonitasoft.com

  9. http://data.un.org

  10. https://spark.apache.org

  11. https://flink.apache.org

  12. https://hadoop.apache.org

  13. https://neo4j.com

  14. https://www.postgresql.org

  15. WebVOWL:Web-based Visualization of Ontologies - http://vowl.visualdataweb.org/webvowl.html

References

  • Abiteboul, S., André, B., & Kaplan, D. (2015). Managing your digital life. Communications of the ACM, 58(5), 32–35.

    Google Scholar 

  • Angles, R., & Gutiérrez, C. (2008). Survey of graph database models. ACM Computing Surveys, 40(1), 1:1–1:39.

    Google Scholar 

  • Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J. L., & Vrgoc, D. (2017). Foundations of modern query languages for graph databases. ACM Computing Surveys, 50(5), 68:1–68:40.

    Google Scholar 

  • Bean, R. (2016). Variety, not volume, is driving big data initiatives. URL https://sloanreview.mit.edu/article/variety-not-volume-is-driving-big-data-initiatives

  • Bilalli, B., Abelló, A., Aluja-Banet, T., Munir, R. F., & Wrembel, R. (2018a). PRESISTANT: data pre-processing assistant. In CAiSE Forum (pp. 57–65).

    Google Scholar 

  • Bilalli, B., Abelló, A., Aluja-Banet, T., & Wrembel, R. (2018b). Intelligent assistance for data pre-processing. Computer Standards & Interfaces, 57, 101–109.

    Google Scholar 

  • Bilalli, B., Abelló, A., Aluja-Banet, T., & Wrembel, R. (2019). PRESISTANT: Learning based assistant for data pre-processing. Data & Knowledge Engineering, 123, 100–122.

    Google Scholar 

  • Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., Rodriguez-Muro, M., & Xiao, G. (2017). Ontop: Answering SPARQL queries over relational databases. Semantic Web, 8(3), 471–487.

    Google Scholar 

  • Ceravolo, P., Azzini, A., Angelini, M., Catarci, T., Cudré-Mauroux, P., Damiani, E., Mazak, A., van Keulen, M., Jarrar, M., Santucci, G., Sattler, K., Scannapieco, M., Wimmer, M., Wrembel, R., & Zaraket, F. A. (2018). Big data semantics. J. Data Semantics, 7(2), 65–85.

    Google Scholar 

  • Chen, Y., Alspaugh, S., & Katz, R. H. (2012). Interactive analytical processing in big data systems: A cross-industry study of mapreduce workloads. PVLDB, 5(12), 1802–1813.

    Google Scholar 

  • Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I. F., Ouzzani, M., & Tang, N. (2013). Nadeef: A commodity data cleaning system. In SIGMOD (pp. 541–552).

    Google Scholar 

  • Deng, D., Fernandez, R. C., Abedjan, Z., Wang, S., Stonebraker, M., Elmagarmid, A. K., Ilyas, I. F., Madden, S., Ouzzani, M., & Tang, N. (2017). The data civilizer system. In CIDR.

    Google Scholar 

  • Doan, A., Halevy, A. Y., & Ives, Z. G. (2012). Principles of Data Integration. Morgan Kaufmann.

  • Duggan, J., Elmore, A. J., Stonebraker, M., Balazinska, M., Howe, B., Kepner, J., Madden, S., Maier, D., Mattson, T., & Zdonik, S. B. (2015). The BigDAWG Polystore System. SIGMOD Record, 44(2), 11–16.

    Google Scholar 

  • Fernandez, R. C., & Madden, S. (2019). Termite: a system for tunneling through heterogeneous data. In aiDM@SIGMOD (p. 7:1–7:8).

    Google Scholar 

  • Fletcher, G. H. L., & Mandreoli, F. (2016). No users no dataspaces! query-driven dataspace orchestration? In SEBD (pp. 150–157).

    Google Scholar 

  • Franklin, M. J., Halevy, A. Y., & Maier, D. (2005). From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4), 27–33.

    Google Scholar 

  • Friedman, M., Levy, A. Y., & Millstein, T. D. (1999). Navigational plans for data integration. IJCAI: In.

    Google Scholar 

  • Garcia-Molina, H., Papakonstantinou, Y., Quass, D., Rajaraman, A., Sagiv, Y., Ullman, J. D., Vassalos, V., & Widom, J. (1997). The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information System, 8(2), 117–132.

    Google Scholar 

  • Golshan, B., Halevy, A. Y., Mihaila, G. A., & Tan, W. (2017). Data integration: After the teenage years. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017 (pp. 101–106).

    Google Scholar 

  • Gorawski, M., & Lorek, M. (2017). Efficient storage, retrieval and analysis of poker hands: An adaptive data framework. Applied Mathematics and Computer Science, 27(4), 713–726.

    Google Scholar 

  • Gorton, I., & Klein, J. (2015). Distribution, data, deployment: Software architecture convergence in big data systems. IEEE Software, 32(3), 78–85.

    Google Scholar 

  • Hai, R., Geisler, S., & Quix, C. (2016). Constance: An intelligent data lake system. In SIGMOD (pp. 2097–2100).

    Google Scholar 

  • Halevy, A. Y., Rajaraman, A., & Ordille, J. J. (2006). Data integration: The teenage years. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006 (pp. 9–16).

    Google Scholar 

  • Halevy, A. Y., Korn, F., Noy, N. F., Olston, C., Polyzotis, N., Roy, S., & Whang, S. E. (2016). Managing google’s data lake: an overview of the goods system. IEEE Data Eng. Bull., 39(3), 5–14.

    Google Scholar 

  • Hewasinghage, M., Varga, J., Abelló, A., & Zimányi, E. (2018). Managing polyglot systems metadata with hypergraphs. In ER (pp. 463–478).

    Google Scholar 

  • Jovanovic, P., Romero, O., Simitsis, A., Abelló, A., & Mayorova, D. (2014a). A requirement-driven approach to the design and evolution of data warehouses. Information Systems, 44, 94–119.

    Google Scholar 

  • Jovanovic, P., Simitsis, A., & Wilkinson, K. (2014b). Engine independence for logical analytic flows. In ICDE (pp. 1060–1071).

    Google Scholar 

  • Jovanovic, P., Romero, O., Simitsis, A., & Abelló, A. (2016). Incremental consolidation of data-intensive multi-flows. IEEE Transactions on Knowledge and Data Engineering, 28(5), 1203–1216.

    Google Scholar 

  • Konstantinou, N., Koehler, M., Abel, E., Civili, C., Neumayr, B., Sallinger, E., Fernandes, A. A. A., Gottlob, G., Keane, J. A., Libkin, L., & Paton, N. W. (2017). The VADA architecture for cost-effective data wrangling. In SIGMOD (pp. 1599–1602).

    Google Scholar 

  • Lenzerini, M. (2002). Data integration: A theoretical perspective. In PODS (pp. 233–246).

    Google Scholar 

  • Lerman, K., Minton, S., & Knoblock, C. A. (2003). Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18, 149–181.

    Google Scholar 

  • Luján-Mora, S., & Trujillo, J. (2006). Applying the UML and the unified process to the design of data warehouses. JCIS, 46(5), 30–58.

    Google Scholar 

  • Munir, R. F., Nadal, S., Romero, O., Abelló, A., Jovanovic, P., Thiele, M., & Lehner, W. (2018). Intermediate results materialization selection and format for data-intensive flows. Fundam. Inform., 163(2), 111–138.

    Google Scholar 

  • Nadal, S., Rabbani, K., Romero, O., & Tadesse, S. (2019a). ODIN: A dataspace management system. In ISWC (pp. 185–188).

    Google Scholar 

  • Nadal, S., Romero, O., Abelló, A., Vassiliadis, P., & Vansummeren, S. (2019b). An integration-oriented ontology to govern evolution in big data ecosystems. Information Systems, 79, 3–19.

    Google Scholar 

  • Popovic, A., Hackney, R., Tassabehji, R., & Castelli, M. (2018). The impact of big data analytics on firms’ high value business performance. Information Systems Frontiers, 20(2), 209–222.

    Google Scholar 

  • Priyatna, F., Corcho, Ó., & Sequeda, J. F. (2014). Formalisation and experiences of r2rml-based SPARQL to SQL query translation using morph. In WWW (pp. 479–490).

    Google Scholar 

  • Quix, C., & Hai, R. (2019). Data lake. Encyclopedia of Big Data Technologies: In.

    Google Scholar 

  • Rabbani, K. (2019). Supporting the Semi-Automatic Creation of the Target Schema in Data Integration Systems. Master’s thesis, Technische Univesitat Berlin - Universitat Politècnica de Catalunya, BarcelonaTech.

  • Saltor, F., Castellanos, M., & García-Solaco, M. (1991). Suitability of data models as canonical models for federated databases. SIGMOD Record, 20(4), 44–48.

    Google Scholar 

  • Sarma, A. D., Dong, X. L., & Halevy, A. Y. (2011). Uncertainty in data integration and dataspace support platforms. In Schema Matching and Mapping (pp. 75–108).

    Google Scholar 

  • Simitsis, A., Vassiliadis, P., & Sellis, T. K. (2005). State-space optimization of ETL workflows. IEEE Transactions on Knowledge and Data Engineering, 17(10), 1404–1419.

    Google Scholar 

  • Simitsis, A., Wilkinson, K., Dayal, U., & Hsu, M. (2013). HFMS: managing the lifecycle and complexity of hybrid analytic data flows. In ICDE (pp. 1174–1185).

    Google Scholar 

  • Skoutas, D., & Simitsis, A. (2007). Ontology-based conceptual design of ETL processes for both structured and semi-structured data. Int. J. Semantic Web Inf. Syst., 3(4), 1–24.

    Google Scholar 

  • Stonebraker, M. (2019). The Case for Polystores – ACM SIGMOD Blog. [Online; accessed 27. Jun. 2019].

  • Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., Pagan, A., & Xu, S. (2013). Data Curation at Scale: The Data Tamer System. In CIDR.

    Google Scholar 

  • Tadesse, S., Gómez, C., Romero, O., Hose, K., & Rabbani, K. (2019). ARDI: Automatic Generation of RDFS Models from Heterogeneous Data Sources. In: EDOC.

    Google Scholar 

  • Terrizzano, I. G., Schwarz, P. M., Roth, M., & Colino, J. E. (2015). Data wrangling: The challenging yourney from the wild to the lake. In CIDR.

    Google Scholar 

  • Touma, R., Romero, O., & Jovanovic, P. (2015). Supporting data integration tasks with semi-automatic ontology construction. In DOLAP (pp. 89–98).

    Google Scholar 

  • Varga, J., Romero, O., Pedersen, T. B., & Thomsen, C. (2014). Towards next generation BI systems: The analytical metadata challenge. In DaWaK (pp. 89–101).

    Google Scholar 

  • Wojciechowski, A. (2018). ETL workflow reparation by means of case-based reasoning. Information Systems Frontiers, 20(1), 21–43.

    Google Scholar 

Download references

Acknowledgements

We thank Dr. Lise Grout and Dr. Pedro Albajar-Viñas from the Neglected Tropical Diseases (NTD) department at WHO, for providing the use case. This work is partially supported by GENESIS project, funded by the Spanish Ministerio de Ciencia, Innovación y Universidades under project TIN2016-79269-R.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Petar Jovanovic.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Jovanovic, P., Nadal, S., Romero, O. et al. Quarry: A User-centered Big Data Integration Platform. Inf Syst Front 23, 9–33 (2021). https://doi.org/10.1007/s10796-020-10001-y

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10796-020-10001-y

Keywords

Navigation