Towards Data Discovery by Example

  • Conference paper
Heterogeneous Data Management, Polystores, and Analytics for Healthcare (DMAH 2020, Poly 2020)

Abstract

Data scientists today have to query an avalanche of multi-source data (e.g., data lakes, company databases) for diverse analytical tasks. Data discovery is labor-intensive because users have to find the right tables, and the right combination of them, to answer their queries. Data discovery systems automatically find and link (e.g., via joins) tables across various sources to help users find the data they need. In this paper, we outline our ongoing efforts to build DICE, a data-discovery-by-example system that iteratively searches for new tables guided by user-provided data examples. Additionally, DICE asks users to validate results to improve the discovery process over multiple iterations.

Acknowledgement

Research was sponsored by the United States Air Force Research Laboratory and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Author information

Corresponding author

Correspondence to El Kindi Rezig.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Rezig, E.K., Vanterpool, A., Gadepally, V., Price, B., Cafarella, M., Stonebraker, M. (2021). Towards Data Discovery by Example. In: Gadepally, V., et al. Heterogeneous Data Management, Polystores, and Analytics for Healthcare. DMAH 2020, Poly 2020. Lecture Notes in Computer Science, vol 12633. Springer, Cham. https://doi.org/10.1007/978-3-030-71055-2_6

  • DOI: https://doi.org/10.1007/978-3-030-71055-2_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-71054-5

  • Online ISBN: 978-3-030-71055-2

  • eBook Packages: Computer Science, Computer Science (R0)
