Towards Data Discovery by Example

Rezig, El Kindi; Vanterpool, Allan; Gadepally, Vijay; Price, Benjamin; Cafarella, Michael; Stonebraker, Michael

doi:10.1007/978-3-030-71055-2_6

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12633)

Included in the following conference series:

VLDB Workshop on Data Management and Analytics for Medicine and Healthcare
VLDB Workshop on Polystore Systems for Heterogeneous Data in Multiple Databases with Privacy and Security Assurances

Abstract

Data scientists today have to query an avalanche of multi-source data (e.g., data lakes, company databases) for diverse analytical tasks. Data discovery is labor-intensive because users have to find the right tables, and the right combination of them, to answer their queries. Data discovery systems automatically find and link (e.g., join) tables across various sources to help users find the data they need. In this paper, we outline our ongoing efforts to build DICE, a data-discovery-by-example system that iteratively searches for new tables guided by user-provided data examples. DICE also asks users to validate results to improve the discovery process over multiple iterations.
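
The iterative, example-guided loop described in the abstract can be illustrated with a minimal sketch. This is not DICE's actual algorithm, which is not shown on this page. It assumes tables are in-memory lists of rows, uses plain value overlap as a stand-in for real linkage discovery, and every name in it (`coverage`, `discover`, `validate`, the sample tables) is hypothetical.

```python
from typing import Callable, Dict, List, Set

# Assumption: a table is a list of rows, each row mapping column name -> value.
Table = List[Dict[str, str]]


def coverage(table: Table, examples: Set[str]) -> float:
    """Fraction of the example values that appear anywhere in the table."""
    if not examples:
        return 0.0
    values = {v for row in table for v in row.values()}
    return len(examples & values) / len(examples)


def discover(
    tables: Dict[str, Table],
    examples: Set[str],
    validate: Callable[[str], bool],
    max_iterations: int = 3,
) -> List[str]:
    """Iteratively propose the best-covering table and ask the user to validate it."""
    accepted: List[str] = []
    rejected: Set[str] = set()
    for _ in range(max_iterations):
        # Score every table that has not been judged yet.
        candidates = [
            (coverage(table, examples), name)
            for name, table in tables.items()
            if name not in accepted and name not in rejected
        ]
        candidates = [c for c in candidates if c[0] > 0]
        if not candidates:
            break
        _, best = max(candidates)  # ties are broken by table name
        if validate(best):
            accepted.append(best)
            # Values of an accepted table become examples for the next
            # iteration, so tables that share values with it can surface.
            examples = examples | {v for row in tables[best] for v in row.values()}
        else:
            rejected.add(best)
    return accepted


# Hypothetical sample data, for illustration only.
tables = {
    "patients": [{"id": "p1", "name": "Ada"}, {"id": "p2", "name": "Bob"}],
    "visits": [{"patient_id": "p1", "ward": "ICU"}],
    "inventory": [{"sku": "x9", "item": "gloves"}],
}
found = discover(tables, {"Ada", "ICU"}, validate=lambda name: True)
found_without_visits = discover(tables, {"Ada", "ICU"}, validate=lambda name: name != "visits")
```

Here `validate` plays the role of the user feedback step: a rejected table is never proposed again, while an accepted table widens the example set used in the next iteration.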

Acknowledgement

Research was sponsored by the United States Air Force Research Laboratory and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Author information

Authors and Affiliations

MIT, Cambridge, USA
El Kindi Rezig, Allan Vanterpool, Michael Cafarella & Michael Stonebraker
United States Air Force, Washington, D.C., USA
Allan Vanterpool
MIT Lincoln Laboratory, Lexington, USA
Vijay Gadepally & Benjamin Price

Corresponding author

Correspondence to El Kindi Rezig.

Editor information

Editors and Affiliations

Massachusetts Institute of Technology, Lexington, MA, USA
Vijay Gadepally
Intel Corporation, Portland, OR, USA
Timothy Mattson
Massachusetts Institute of Technology, Cambridge, MA, USA
Michael Stonebraker
Massachusetts Institute of Technology, Cambridge, MA, USA
Tim Kraska
Stony Brook University, Stony Brook, NY, USA
Fusheng Wang
University of Washington, Seattle, WA, USA
Gang Luo
Georgia State University, Atlanta, GA, USA
Jun Kong
Lucerne University of Applied Sciences, Rotkreuz, Switzerland
Alevtina Dubovitskaya

About this paper

Cite this paper

Rezig, E.K., Vanterpool, A., Gadepally, V., Price, B., Cafarella, M., Stonebraker, M. (2021). Towards Data Discovery by Example. In: Gadepally, V., et al. Heterogeneous Data Management, Polystores, and Analytics for Healthcare. DMAH Poly 2020. Lecture Notes in Computer Science, vol 12633. Springer, Cham. https://doi.org/10.1007/978-3-030-71055-2_6

DOI: https://doi.org/10.1007/978-3-030-71055-2_6
Published: 04 March 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71054-5
Online ISBN: 978-3-030-71055-2
eBook Packages: Computer Science, Computer Science (R0)
