Semantic of Data Dependencies to Improve the Data Quality

  • Conference paper
Model and Data Engineering

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9344)

Abstract

Data quality is a critical challenge in databases because anomalies can be very costly, especially in large databases. Correcting these anomalies has therefore become increasingly important in both industry and academia. In this work, we address intra-column and inter-column anomalies in big data. We propose a new data cleaning approach that takes into account the semantic dependencies between the columns of a data source. The novelty of our proposal is that it uses data semantics to reduce the size of the search space during functional dependency discovery. In this paper, we present the first steps of our work, which recognize the semantics of the data and correct intra-column anomalies.
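
To make the idea of search-space reduction concrete, the following is a minimal sketch and not the authors' algorithm. It assumes single-column functional dependencies and a hand-written set of semantically plausible column pairs; the names `fd_holds`, `candidate_pairs`, and `plausible`, as well as the sample rows, are hypothetical and chosen only for illustration.

```python
from itertools import permutations

def fd_holds(rows, lhs, rhs):
    """Return True if column `lhs` functionally determines column `rhs`."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        # The first value seen for a key is stored; a later mismatch violates the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True

def candidate_pairs(columns, plausible=None):
    """List single-column candidates lhs -> rhs, optionally filtered."""
    pairs = list(permutations(columns, 2))
    if plausible is None:
        return pairs
    return [pair for pair in pairs if pair in plausible]

# Hypothetical sample data.
rows = [
    {"city": "Paris", "country": "France", "zip": "75001"},
    {"city": "Paris", "country": "France", "zip": "75002"},
    {"city": "Lyon", "country": "France", "zip": "69001"},
]
columns = ["city", "country", "zip"]

# Assumed semantic knowledge: only these pairs are worth testing.
plausible = {("zip", "city"), ("city", "country"), ("zip", "country")}

all_candidates = candidate_pairs(columns)       # every ordered pair of columns
pruned = candidate_pairs(columns, plausible)    # semantically plausible pairs only
valid = [(lhs, rhs) for lhs, rhs in pruned if fd_holds(rows, lhs, rhs)]
```

In this toy setting the semantic filter halves the number of candidates that must be checked against the data; with more columns and multi-column left-hand sides the unpruned space grows much faster, which is where such a filter would matter most.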

Author information

Corresponding author

Correspondence to Houda Zaidi.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zaidi, H., Pollet, Y., Boufarès, F., Kraiem, N. (2015). Semantic of Data Dependencies to Improve the Data Quality. In: Bellatreche, L., Manolopoulos, Y. (eds) Model and Data Engineering. Lecture Notes in Computer Science, vol 9344. Springer, Cham. https://doi.org/10.1007/978-3-319-23781-7_5

  • DOI: https://doi.org/10.1007/978-3-319-23781-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23780-0

  • Online ISBN: 978-3-319-23781-7

  • eBook Packages: Computer Science (R0)
