Data Quality Visualization for Preprocessing

  • Conference paper
Advances in Data Mining. Applications and Theoretical Aspects (ICDM 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9728)

Abstract

Preprocessing is often the most time-consuming phase in data analysis, and interdependent data quality issues are a cause of suboptimal modelling results. The design problem addressed in this paper is: what kind of framework can support visualization of data quality issue interdependencies for faster and more effective preprocessing? An object framework was designed that uses constructed features as the basis of visualizations. Six real datasets from the business performance measurement system domain were acquired to demonstrate the implementation. The framework was found to be a viable preprocessing analysis supplement to both the industry practice of exploratory data analysis and the research benchmark of preprocessing combinations.
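The idea of constructed features can be illustrated with a minimal sketch. This is not the paper's implementation (the abstract does not describe the framework's internals); it only assumes that a constructed feature is a per-row score for one data quality issue. The function names, the two example issues (missing values and extreme values), and the toy data are all hypothetical.

```python
from statistics import mean, pstdev


def missing_share(row):
    """Fraction of fields in a row that are missing (None)."""
    return sum(v is None for v in row) / len(row)


def max_abs_zscore(row, col_means, col_sds):
    """Largest absolute z-score among the row's observed values."""
    scores = [
        abs(v - m) / s
        for v, m, s in zip(row, col_means, col_sds)
        if v is not None and s > 0
    ]
    return max(scores, default=0.0)


def construct_quality_features(rows):
    """Return one dict of constructed data quality features per row.

    Hypothetical helper: each feature scores one data quality issue,
    so that the issues can later be plotted against each other.
    """
    columns = list(zip(*rows))
    observed = [[v for v in col if v is not None] for col in columns]
    col_means = [mean(vals) if vals else 0.0 for vals in observed]
    col_sds = [pstdev(vals) if len(vals) > 1 else 0.0 for vals in observed]
    return [
        {
            "missing_share": missing_share(row),
            "max_abs_z": max_abs_zscore(row, col_means, col_sds),
        }
        for row in rows
    ]


# Hypothetical toy data: the last row has both missing and extreme values.
rows = [
    [1.0, 10.0, 5.0],
    [2.0, None, 5.5],
    [1.5, 11.0, None],
    [50.0, None, None],
]
features = construct_quality_features(rows)
for f in features:
    print(f)
```

Plotting such features against each other (for example as a scatter plot or heatmap) is one way to see whether issues co-occur in the same rows, which is the kind of interdependency the abstract refers to.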

Author information

Correspondence to Markus Vattulainen.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Vattulainen, M. (2016). Data Quality Visualization for Preprocessing. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2016. Lecture Notes in Computer Science, vol 9728. Springer, Cham. https://doi.org/10.1007/978-3-319-41561-1_32

  • DOI: https://doi.org/10.1007/978-3-319-41561-1_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41560-4

  • Online ISBN: 978-3-319-41561-1

  • eBook Packages: Computer Science, Computer Science (R0)
