Abstract
Truth discovery and rule-based data repairing are two classic lines of approaches for improving data quality in the database field. Truth discovery methods resolve multi-source conflicts about the same entity by estimating the reliability of each source, while rule-based data repairing methods resolve inconsistencies among different entities using integrity constraints. However, both lines of methods perform unsatisfactorily when they lack sufficient evidence. In this paper, we propose AutoRepair, a novel automatic multi-source data repairing approach that enriches the evidence by combining the advantages of truth discovery and data repairing. We use functional dependencies, one of the most common types of constraints, to detect violations, and use source reliability as evidence to discover and repair the errors among these violations. At the same time, the repaired results are used to estimate source reliability. Because source reliability is unknown in advance, we model the process as an iterative framework to ensure better performance. Extensive experiments are conducted on both simulated and real-world datasets. The results clearly demonstrate the advantages of our approach, which outperforms both recent truth discovery and rule-based data repairing methods.
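The iterative idea described above can be illustrated with a minimal Python sketch. It is not the AutoRepair algorithm itself: the weighted-vote truth estimate, the pooled-vote repair rule, the smoothed-accuracy weight update, and all names (`weighted_vote`, `auto_repair_sketch`, the toy `claims` data, the functional dependency `zip -> city`) are assumptions made only for illustration.

```python
from collections import defaultdict


def weighted_vote(votes):
    """Return the value with the largest total weight (ties: smallest value)."""
    totals = defaultdict(float)
    for value, weight in votes:
        totals[value] += weight
    return max(sorted(totals), key=totals.get)


def auto_repair_sketch(claims, fd, iterations=10):
    """Alternate truth estimation, FD-based repair, and source re-weighting.

    claims: {source: {entity: {attribute: value}}}
    fd: (lhs, rhs), a single-attribute functional dependency lhs -> rhs.
    """
    lhs, rhs = fd
    sources = sorted(claims)
    weights = {s: 1.0 for s in sources}  # reliability is unknown at the start
    truths = {}
    for _ in range(iterations):
        # Step 1: estimate each cell by reliability-weighted voting.
        cell_votes = defaultdict(list)
        for s in sources:
            for entity, record in claims[s].items():
                for attr, value in record.items():
                    cell_votes[(entity, attr)].append((value, weights[s]))
        truths = {cell: weighted_vote(v) for cell, v in cell_votes.items()}

        # Step 2: detect FD violations (same lhs value, different rhs values)
        # and repair them with the rhs value that has the most pooled support.
        groups = defaultdict(list)
        for (entity, attr), value in truths.items():
            if attr == lhs:
                groups[value].append(entity)
        for entities in groups.values():
            members = [e for e in entities if (e, rhs) in truths]
            if len({truths[(e, rhs)] for e in members}) > 1:
                pooled = [v for e in members for v in cell_votes[(e, rhs)]]
                winner = weighted_vote(pooled)
                for e in members:
                    truths[(e, rhs)] = winner

        # Step 3: re-estimate reliability as smoothed agreement with the repairs.
        new_weights = {}
        for s in sources:
            total = correct = 0
            for entity, record in claims[s].items():
                for attr, value in record.items():
                    total += 1
                    correct += truths[(entity, attr)] == value
            new_weights[s] = (correct + 1) / (total + 2)
        if new_weights == weights:  # fixed point reached
            break
        weights = new_weights
    return truths, weights


# Toy input: sources B and C disagree with A on the city of entity e2.
claims = {
    "A": {"e1": {"zip": "10001", "city": "New York"},
          "e2": {"zip": "10001", "city": "New York"}},
    "B": {"e1": {"zip": "10001", "city": "New York"},
          "e2": {"zip": "10001", "city": "Newark"}},
    "C": {"e1": {"zip": "10001", "city": "New York"},
          "e2": {"zip": "10001", "city": "Newark"}},
}
truths, weights = auto_repair_sketch(claims, fd=("zip", "city"))
print(truths[("e2", "city")], weights)
```

In this toy input, plain voting would pick "Newark" for e2, but the two entities share a zip code, so the dependency flags a violation and the pooled votes repair e2 to "New York". Source A then ends up with a higher estimated reliability than B and C, which shows how constraints and source reliability can supply evidence for each other.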
Notes
We use functional dependencies as the constraint type because of their importance in improving data quality. However, other types of constraints can also be adopted.
To reduce the influence of incomplete tuples on the results, attributes with a large number of missing values are removed.
We set the number of simulated sources to 5, as it is relatively easy to acquire information from 5 sources in real-world settings, and this information is sufficient for our multi-source data repairing.
Acknowledgements
This paper was partially supported by NSFC Grants U1509216, U1866602, the National Key Research and Development Program of China 2016YFB1000703, NSFC Grants 61472099, 61602129, NSF IIS 1553411, and the Chinese Scholarship Council Funding (No. 201606120227).
Cite this article
Ye, C., Li, Q., Zhang, H. et al. AutoRepair: an automatic repairing approach over multi-source data. Knowl Inf Syst 61, 227–257 (2019). https://doi.org/10.1007/s10115-018-1284-9
DOI: https://doi.org/10.1007/s10115-018-1284-9