Accurate Data Cleansing through Model Checking and Machine Learning Techniques

  • Conference paper
Data Management Technologies and Applications (DATA 2014)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 178)

Abstract

Most researchers agree that the quality of real-life data archives is often very poor, which makes the definition and realisation of automatic data cleansing techniques a relevant issue. In this scenario, the Universal Cleansing framework has recently been proposed to automatically identify the most accurate cleansing alternatives among those synthesised through model-checking techniques. However, identifying some values of the cleansed instances still relies on rules defined by domain experts and on common practice, because such values are difficult to derive automatically (e.g. the date value of an event to be added).

In this paper we extend this framework with well-known machine learning algorithms, trained on the data recognised as consistent, to identify the information that the model-based cleanser could not produce. The proposed framework has been implemented and successfully evaluated on a real dataset describing the working careers of a population.
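The division of labour described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the event layout, the alternation rule, and the function names `is_consistent`, `learn_durations` and `cleanse` are all hypothetical, and a per-contract-type median stands in for the machine learning algorithms the paper refers to. A rule-based check decides which careers are consistent and where an event is missing; a model learned from the consistent careers alone supplies the date of the event to be added.

```python
from datetime import date, timedelta
from statistics import median

# Hypothetical event layout: (worker_id, event_type, contract_type, day).
# Toy consistency rule (an assumption): "start" and "cease" events must
# alternate, beginning with "start".

def is_consistent(events):
    """Return True if start and cease events strictly alternate."""
    employed = False
    for _, kind, _, _ in events:
        if kind == "start" and not employed:
            employed = True
        elif kind == "cease" and employed:
            employed = False
        else:
            return False
    return True

def learn_durations(careers):
    """Learn the median contract length in days, per contract type,
    using only the careers recognised as consistent."""
    samples = {}
    for events in careers:
        if not is_consistent(events):
            continue
        # In a consistent career, starts sit at even and ceases at odd positions.
        for (_, _, ctype, begin), (_, _, _, end) in zip(events[0::2], events[1::2]):
            samples.setdefault(ctype, []).append((end - begin).days)
    return {ctype: round(median(days)) for ctype, days in samples.items()}

def cleanse(events, durations):
    """Insert a missing cease between two consecutive starts.
    The rule says *that* a cease is missing; the learned model says *when*.
    Other kinds of inconsistency are left untouched in this sketch."""
    fixed = []
    open_start = None
    for event in events:
        worker, kind, ctype, day = event
        if kind == "start" and open_start is not None:
            _, _, open_type, open_day = open_start
            guess = open_day + timedelta(days=durations[open_type])
            # Never date the inserted cease after the next start.
            fixed.append((worker, "cease", open_type, min(guess, day)))
        fixed.append(event)
        open_start = event if kind == "start" else None
    return fixed

# Invented sample data, for illustration only.
careers = [
    [("w1", "start", "fixed-term", date(2010, 1, 1)),
     ("w1", "cease", "fixed-term", date(2010, 4, 11))],
    [("w2", "start", "fixed-term", date(2011, 3, 1)),
     ("w2", "cease", "fixed-term", date(2011, 6, 29))],
    [("w3", "start", "fixed-term", date(2012, 1, 1)),
     ("w3", "start", "fixed-term", date(2013, 1, 1))],  # missing cease
]

durations = learn_durations(careers)
cleansed = cleanse(careers[2], durations)
```

In a fuller setting the median could be replaced by any trained regressor or classifier; the point of the sketch is only that the learner sees consistent data and is asked solely for the values the rules cannot determine.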

Notes

  1. Here the term believability is intended as “the extent to which data are accepted or regarded as true, real and credible” [48].

  2. The term “state” here is considered in terms of a value assignment to a set of finite-domain state variables.
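As a small illustration of the second note, the sketch below enumerates every state of a toy model. The variable names and domains are hypothetical and not taken from the paper; they only show that a state is one value per variable, and that finite domains yield a finite state space.

```python
from itertools import product

# Hypothetical finite-domain state variables for a working-career model.
DOMAINS = {
    "employed": (False, True),
    "contract": ("none", "full-time", "part-time"),
}

def all_states(domains):
    """Enumerate every state, i.e. every assignment of one value per variable."""
    names = list(domains)
    value_tuples = product(*(domains[name] for name in names))
    return [dict(zip(names, values)) for values in value_tuples]

states = all_states(DOMAINS)
```

Because each domain is finite, the number of states is the product of the domain sizes, which is what allows the state space to be explored exhaustively.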

References

  1. Abello, J., Pardalos, P.M., Resende, M.G.: Handbook of Massive Data Sets, vol. 4. Springer, US (2002)

  2. Bertossi, L.: Consistent query answering in databases. ACM Sigmod Rec. 35(2), 68–76 (2006)

  3. Bishop, C.M., et al.: Pattern Recognition and Machine Learning, vol. 1. Springer, New York (2006)

  4. Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artif. Intell. 97(1), 245–271 (1997)

  5. Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M.: Inconsistency knowledge discovery for longitudinal data management: a model-based approach. In: Holzinger, A., Pasi, G. (eds.) HCI-KDD 2013. LNCS, vol. 7947, pp. 183–194. Springer, Heidelberg (2013)

  6. Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M.: Planning meets data cleansing. In: The 24th International Conference on Automated Planning and Scheduling (ICAPS), pp. 439–443. AAAI (2014)

  7. Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M.: A policy-based cleansing and integration framework for labour and healthcare data. In: Holzinger, A., Jurisica, I. (eds.) Knowledge Discovery and Data Mining. LNCS, vol. 8401, pp. 141–168. Springer, Heidelberg (2014)

  8. Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M.: Towards data cleansing via planning. Intelligenza Artificiale 8(1), 57–69 (2014)

  9. Chomicki, J., Marcinkowski, J.: Minimal-change integrity maintenance using tuple deletions. Inf. Comput. 197(1), 90–121 (2005)

  10. Chomicki, J., Marcinkowski, J.: On the computational complexity of minimal-change integrity maintenance in relational databases. In: Bertossi, L., Hunter, A., Schaub, T. (eds.) Inconsistency Tolerance. LNCS, vol. 3300, pp. 119–150. Springer, Heidelberg (2005)

  11. Clemente, P., Kaba, B., Rouzaud-Cornabas, J., Alexandre, M., Aujay, G.: SPTrack: visual analysis of information flows within SELinux policies and attack logs. In: Huang, R., Ghorbani, A.A., Pasi, G., Yamaguchi, T., Yen, N.Y., Jin, B. (eds.) AMT 2012. LNCS, vol. 7669, pp. 596–605. Springer, Heidelberg (2012)

  12. Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 315–326. VLDB Endowment (2007)

  13. Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A.K., Ilyas, I.F., Ouzzani, M., Tang, N.: Nadeef: a commodity data cleaning system. In: Ross, K.A., Srivastava, D., Papadias, D. (eds.) SIGMOD Conference, pp. 541–552. ACM (2013)

  14. De Silva, V., Carlsson, G.: Topological estimation using witness complexes. In: Proceedings of the First Eurographics Conference on Point-Based Graphics, pp. 157–166. Eurographics Association (2004)

  15. Devaraj, S., Kohli, R.: Information technology payoff in the health-care industry: a longitudinal study. J. Manag. Inf. Syst. 16(4), 41–68 (2000)

  16. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

  17. Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. In: Proceedings of the VLDB Endowment, vol. 3(1–2), pp. 173–184 (2010)

  18. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: The KDD process for extracting useful knowledge from volumes of data. Commun. ACM 39(11), 27–34 (1996)

  19. Fellegi, I.P., Holt, D.: A systematic approach to automatic edit and imputation. J. Am. Stat. Assoc. 71(353), 17–35 (1976)

  20. Fisher, C., Lauría, E., Chengalur-Smith, S., Wang, R.: Introduction to Information Quality. AuthorHouse, USA (2012)

  21. Freitag, D.: Machine learning for information extraction in informal domains. Mach. Learn. 39(2–3), 169–202 (2000)

  22. Hansen, P., Järvelin, K.: Collaborative information retrieval in an information-intensive domain. Inf. Process. Manag. 41(5), 1101–1119 (2005)

  23. Holzinger, A.: On knowledge discovery and interactive intelligent visualization of biomedical data - challenges in human-computer interaction & biomedical informatics. In: Helfert, M., Francalanci, C., Filipe, J. (eds.) DATA. SciTePress (2012)

  24. Holzinger, A., Bruschi, M., Eder, W.: On interactive data visualization of physiological low-cost-sensor data with focus on mental stress. In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES 2013. LNCS, vol. 8127, pp. 469–480. Springer, Heidelberg (2013)

  25. Holzinger, A., Yildirim, P., Geier, M., Simonic, K.M.: Quality-based knowledge discovery from medical text on the web. In: Pasi et al. [41], pp. 145–158

  26. Holzinger, A., Zupan, M.: Knodwat: a scientific framework application for testing knowledge discovery methods for the biomedical domain. BMC Bioinf. 14, 191 (2013)

  27. Kapovich, I., Myasnikov, A., Schupp, P., Shpilrain, V.: Generic-case complexity, decision problems in group theory, and random walks. J. Algebra 264(2), 665–694 (2003)

  28. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, IJCAI 1995, vol. 2, pp. 1137–1143. Morgan Kaufmann Publishers Inc., San Francisco (1995). http://dl.acm.org/citation.cfm?id=1643031.1643047

  29. Kolahi, S., Lakshmanan, L.V.: On approximating optimum repairs for functional dependency violations. In: Proceedings of the 12th International Conference on Database Theory, pp. 53–62. ACM (2009)

  30. Lovaglio, P.G., Mezzanzanica, M.: Classification of longitudinal career paths. Qual. Quant. 47(2), 989–1008 (2013)

  31. Madnick, S.E., Wang, R.Y., Lee, Y.W., Zhu, H.: Overview and framework for data and information quality research. J. Data Inf. Qual. 1(1), 2:1–2:22 (2009)

  32. Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: Data Quality through Model Checking Techniques. In: Gama, J., Bradley, E., Hollmén, J. (eds.) IDA 2011. LNCS, vol. 7014, pp. 270–281. Springer, Heidelberg (2011)

  33. Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: Data quality sensitivity analysis on aggregate indicators. In: Helfert, M., Francalanci, C., Filipe, J. (eds.) DATA 2012-The International Conference on Data Technologies and Applications, pp. 97–108. SciTePress (2012). doi:10.5220/0004040300970108

  34. Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: Automatic synthesis of data cleansing activities. In: Helfert, M., Francalanci, C. (eds.) The 2nd International Conference on Data Management Technologies and Applications (DATA), pp. 138–149. Scitepress (2013)

  35. Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: Improving data cleansing accuracy: a model-based approach. In: The 3rd International Conference on Data Technologies and Applications, pp. 189–201. Insticc (2014)

  36. Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: A model-based evaluation of data quality activities in KDD. Inf. Process. Manag. 51(2), 144–166 (2015). doi:10.1016/j.ipm.2014.07.007

  37. Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: A model-based approach for developing data cleansing solutions. ACM J. Data Inf. Qual. 5(4), 1–28 (2015). doi:10.1145/2641575

  38. Ng, A.Y.: Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, p. 78. ACM (2004)

  39. de Oliveira, M.C.F., Levkowitz, H.: From visual data exploration to visual data mining: a survey. IEEE Trans. Vis. Comput. Graph. 9(3), 378–394 (2003)

  40. Pasi, G., Bordogna, G., Jain, L.C.: An introduction to quality issues in the management of web information. In: Quality Issues in the Management of Web Information [41], pp. 1–3

  41. Pasi, G., Bordogna, G., Jain, L.C. (eds.): Quality Issues in the Management of Web Information. Intelligent Systems Reference Library, vol. 50. Springer, Heidelberg (2013)

  42. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  43. Penna, G.D., Intrigila, B., Magazzeni, D., Mercorio, F.: UPMurphi: a tool for universal planning on pddl+ problems. In: Proceedings of the 19th International Conference on Automated Planning and Scheduling (ICAPS 2009), pp. 106–113. AAAI Press, Thessaloniki, Greece (2009). http://aaai.org/ocs/index.php/ICAPS/ICAPS09/paper/view/707

  44. Penna, G.D., Magazzeni, D., Mercorio, F.: A universal planning system for hybrid domains. Appl. Intell. 36(4), 932–959 (2012). doi:10.1007/s10489-011-0306-z

  45. Prinzie, A., Van den Poel, D.: Modeling complex longitudinal consumer behavior with dynamic bayesian networks: an acquisition pattern analysis application. J. Intell. Inf. Syst. 36(3), 283–304 (2011)

  46. Rahm, E., Do, H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)

  47. Vardi, M.: Fundamentals of dependency theory. In: Borger, E. (ed.) Trends in Theoretical Computer Science, pp. 171–224. Computer Science Press, Rockville (1987)

  48. Wang, R.Y., Strong, D.M.: Beyond accuracy: what data quality means to data consumers. J. Manag. Inf. Syst. 12(4), 5–33 (1996)

  49. Yakout, M., Berti-Équille, L., Elmagarmid, A.K.: Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In: Proceedings of the 2013 International Conference on Management of Data, pp. 553–564. ACM (2013)

Author information

Correspondence to Mirko Cesarini or Fabio Mercorio.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M. (2015). Accurate Data Cleansing through Model Checking and Machine Learning Techniques. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds) Data Management Technologies and Applications. DATA 2014. Communications in Computer and Information Science, vol 178. Springer, Cham. https://doi.org/10.1007/978-3-319-25936-9_5

  • DOI: https://doi.org/10.1007/978-3-319-25936-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25935-2

  • Online ISBN: 978-3-319-25936-9

  • eBook Packages: Computer Science, Computer Science (R0)
