Abstract
Data scientists today must query an avalanche of multi-source data (e.g., data lakes, company databases) for diverse analytical tasks. Data discovery is labor-intensive because users have to find the right tables, and the right combination of them, to answer their queries. Data discovery systems automatically find and link (e.g., join) tables across various sources to help users find the data they need. In this paper, we outline our ongoing efforts to build a data-discovery-by-example system, DICE, that iteratively searches for new tables guided by user-provided data examples. Additionally, DICE asks users to validate results to improve the discovery process over multiple iterations.
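The iterative loop described in the abstract can be illustrated with a toy sketch. This is not DICE's actual algorithm, which the abstract does not specify. It assumes tables are small in-memory lists of rows, uses plain value overlap as a stand-in for linking tables, and models user validation as a callback. All names here (`table_values`, `rank_tables`, `discover`, `validate`) are hypothetical.

```python
from typing import Callable, Dict, List, Set

# Assumption: a table is a list of rows; each row maps column name -> value.
Table = List[dict]


def table_values(table: Table) -> Set[str]:
    """Collect every cell value in a table as a string."""
    return {str(v) for row in table for v in row.values()}


def rank_tables(tables: Dict[str, Table], examples: Set[str]) -> List[str]:
    """Return names of tables containing at least one example value,
    ordered by number of matching values (ties broken by name)."""
    scores = {name: len(examples & table_values(t)) for name, t in tables.items()}
    hits = [name for name, score in scores.items() if score > 0]
    return sorted(hits, key=lambda n: (-scores[n], n))


def discover(
    tables: Dict[str, Table],
    examples: Set[str],
    validate: Callable[[str], bool],
    max_iters: int = 3,
) -> List[str]:
    """Iteratively search for tables, letting the user accept or reject each one."""
    accepted: List[str] = []
    rejected: Set[str] = set()
    wanted = set(examples)
    for _ in range(max_iters):
        candidates = [
            n for n in rank_tables(tables, wanted)
            if n not in accepted and n not in rejected
        ]
        if not candidates:
            break  # nothing new to show the user
        for name in candidates:
            if validate(name):
                accepted.append(name)
                # Values of accepted tables guide the next iteration's search.
                wanted |= table_values(tables[name])
            else:
                rejected.add(name)
    return accepted


# Made-up data for illustration only.
tables = {
    "customers": [{"cust_id": "c1", "name": "Ada"}, {"cust_id": "c2", "name": "Bob"}],
    "orders": [{"order_id": "o1", "cust_id": "c1", "item": "laptop"}],
    "shipping": [{"order_id": "o1", "carrier": "DHL"}],
    "weather": [{"city": "Oslo", "temp": "3"}],
}

# The simulated user rejects the shipping table and accepts everything else.
found = discover(tables, {"Ada"}, validate=lambda name: name != "shipping")
print(found)
```

Starting from the single example value "Ada", the sketch first surfaces `customers`, then reaches `orders` through the shared `cust_id` value, and drops `shipping` once the simulated user rejects it. A real system would replace exact value overlap with more robust linking, such as join discovery over indexed columns.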
Acknowledgement
Research was sponsored by the United States Air Force Research Laboratory and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Rezig, E.K., Vanterpool, A., Gadepally, V., Price, B., Cafarella, M., Stonebraker, M. (2021). Towards Data Discovery by Example. In: Gadepally, V., et al. Heterogeneous Data Management, Polystores, and Analytics for Healthcare. DMAH Poly 2020. Lecture Notes in Computer Science, vol 12633. Springer, Cham. https://doi.org/10.1007/978-3-030-71055-2_6
DOI: https://doi.org/10.1007/978-3-030-71055-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71054-5
Online ISBN: 978-3-030-71055-2
eBook Packages: Computer Science, Computer Science (R0)