Abstract
Bayesian classifiers have proven effective in many practical applications. To train a Bayesian classifier, key parameters such as prior and class-conditional probabilities must be learned from data. In practice, datasets often contain dirty (missing, erroneous or duplicated) values, which severely degrade model accuracy if no data cleaning is performed. However, cleaning the whole dataset is prohibitively laborious and thus infeasible even for medium-sized datasets. To this end, we propose to induce Bayes models by cleaning only small samples of the dataset. We derive confidence intervals as a function of the sample size after data cleaning, so that the posterior probability is guaranteed to fall within the estimated confidence interval with constant probability. We then design two strategies for comparing posterior probability intervals when they overlap. We also address an extension to the semi-naive Bayes method. Experimental results suggest that cleaning only a small number of samples suffices to train satisfactory Bayesian models, offering a significant reduction in cost compared with cleaning all of the data and a significant improvement in precision, recall and F-measure compared with cleaning none of it.
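To make the idea of a confidence interval that tightens with the cleaned sample size concrete, the following is a minimal sketch and not the derivation used in the paper. It assumes a standard two-sided Hoeffding bound for a single probability parameter (for example a class prior) estimated from n cleaned records. The function names `hoeffding_interval`, `intervals_overlap` and `prefer_by_midpoint` are hypothetical, and the midpoint rule is only a simple stand-in for the two comparison strategies mentioned in the abstract, which are not specified here.

```python
import math


def hoeffding_interval(successes, n, delta=0.05):
    """Interval for a proportion estimated from n cleaned records.

    Assumption: Hoeffding's inequality, P(|p_hat - p| >= t) <= 2*exp(-2*n*t^2),
    so the interval holds with probability at least 1 - delta.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    # Clip to [0, 1] because the parameter is a probability.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def prefer_by_midpoint(a, b):
    """Hypothetical tie-break: return 0 if a has the larger midpoint, else 1."""
    mid_a = (a[0] + a[1]) / 2.0
    mid_b = (b[0] + b[1]) / 2.0
    return 0 if mid_a >= mid_b else 1


if __name__ == "__main__":
    # The interval narrows as more records are cleaned.
    for n in (100, 1000, 10000):
        lo, hi = hoeffding_interval(int(0.6 * n), n)
        print(n, round(hi - lo, 4))
```

Under this bound the interval width shrinks in proportion to 1/sqrt(n), which illustrates why a modest cleaned sample can already separate classes whose probabilities are not close. Propagating such parameter intervals to an interval on the posterior is the part addressed by the paper and is not reproduced in this sketch.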
Acknowledgment
The work reported in this paper is partially supported by NSFC under grant number 61370205 and NSF of Xinjiang Key Laboratory under grant number 2019D04024.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, H., Cheng, W., Guo, K., Xiao, Y., Liu, Z. (2019). Bayesian Classifier Modeling for Dirty Data. In: Nayak, A., Sharma, A. (eds) PRICAI 2019: Trends in Artificial Intelligence. PRICAI 2019. Lecture Notes in Computer Science, vol 11672. Springer, Cham. https://doi.org/10.1007/978-3-030-29894-4_6
DOI: https://doi.org/10.1007/978-3-030-29894-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29893-7
Online ISBN: 978-3-030-29894-4
eBook Packages: Computer Science, Computer Science (R0)