Semantic of Data Dependencies to Improve the Data Quality

Zaidi, Houda; Pollet, Yann; Boufarès, Faouzi; Kraiem, Naoufel

doi:10.1007/978-3-319-23781-7_5


Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9344)


Abstract

Data quality in databases is a critical challenge because the cost of anomalies can be very high, especially in large databases. Correcting these anomalies has therefore become increasingly important in both enterprises and academia. In this work, we address intra-column and inter-column anomalies in big data. We propose a new data-cleaning approach that takes into account the semantic dependencies between the columns of a data source. The novelty of our proposal is that it uses data semantics to reduce the size of the search space in functional dependency discovery. In this paper, we present the first steps of our work, which recognize the semantics of the data and correct intra-column anomalies.
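To make the idea of a reduced search space concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that semantic recognition has already produced a set of plausible column pairs, and it tests only single-column functional dependencies. The helper names (`holds`, `candidate_pairs`), the toy rows, and the `plausible` set are all hypothetical.

```python
from itertools import permutations


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = row[lhs]
        # A dependency is violated when one lhs value maps to two rhs values.
        if key in seen and seen[key] != row[rhs]:
            return False
        seen[key] = row[rhs]
    return True


def candidate_pairs(columns, plausible=None):
    """Enumerate single-column candidates (lhs, rhs), optionally pruned."""
    pairs = list(permutations(columns, 2))
    if plausible is None:
        return pairs
    # Keep only the pairs that the (assumed) semantic step marked as plausible.
    return [(a, b) for a, b in pairs if (a, b) in plausible]


# Hypothetical toy data, for illustration only.
rows = [
    {"zip": "75003", "city": "Paris", "name": "Ana"},
    {"zip": "75003", "city": "Paris", "name": "Bob"},
    {"zip": "93430", "city": "Villetaneuse", "name": "Ana"},
]
columns = ["zip", "city", "name"]

# Assumption: semantics suggest that a postal code may determine a city.
plausible = {("zip", "city")}

all_pairs = candidate_pairs(columns)          # unpruned search space
pruned = candidate_pairs(columns, plausible)  # search space after pruning
found = [(a, b) for a, b in pruned if holds(rows, a, b)]
```

In this sketch the number of candidates to test against the data drops from every ordered pair of columns to only the semantically plausible ones. A real discovery procedure would also consider multi-column left-hand sides, where the unpruned candidate space grows much faster.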



Author information

Authors and Affiliations

Laboratory CEDRIC, CNAM, 75003, Paris, France
Houda Zaidi & Yann Pollet
Laboratory LIPN, University Sorbonne Paris Cité, 93430, Villetaneuse, France
Faouzi Boufarès
Laboratory RIADI, University Manouba, Manouba, Tunisia
Houda Zaidi & Naoufel Kraiem


Corresponding author

Correspondence to Houda Zaidi.

Editor information

Editors and Affiliations

National Engineering School for Mechanics and Aerotechnics, Poitiers, France
Ladjel Bellatreche
University of Thessaloniki, Thessaloniki, Greece
Yannis Manolopoulos


About this paper

Cite this paper

Zaidi, H., Pollet, Y., Boufarès, F., Kraiem, N. (2015). Semantic of Data Dependencies to Improve the Data Quality. In: Bellatreche, L., Manolopoulos, Y. (eds) Model and Data Engineering. Lecture Notes in Computer Science, vol 9344. Springer, Cham. https://doi.org/10.1007/978-3-319-23781-7_5


DOI: https://doi.org/10.1007/978-3-319-23781-7_5
Published: 29 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23780-0
Online ISBN: 978-3-319-23781-7
eBook Packages: Computer Science, Computer Science (R0)
