Data Quality Visualization for Preprocessing

Vattulainen, Markus

doi:10.1007/978-3-319-41561-1_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9728)

Included in the following conference series:

Industrial Conference on Data Mining

Abstract

Preprocessing is often the most time-consuming phase in data analysis, and interdependent data quality issues are a cause of suboptimal modelling results. The design problem addressed in this paper is: what kind of framework can support visualization of data quality issue interdependencies for faster and more effective preprocessing? An object framework was designed that uses constructed features as the basis of visualizations. Six real datasets from the business performance measurement system domain were acquired to demonstrate the implementation. The framework was found to be a viable preprocessing analysis supplement to both the industry practice of exploratory data analysis and the research benchmark of preprocessing combinations.
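
The abstract does not say which features the framework constructs, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes two hypothetical per-row data quality features, the share of missing values (`missing_share`) and the largest absolute z-score among observed values (`max_abs_z`); the function name and the sample data are likewise illustrative assumptions.

```python
from statistics import mean, pstdev


def construct_quality_features(rows):
    """Return one dict of data quality features per row.

    rows: list of equal-length lists of floats, with None marking a
    missing value. Both feature definitions are illustrative assumptions.
    """
    n_cols = len(rows[0])

    # Column-wise mean and population standard deviation, computed over
    # observed (non-missing) values only.
    stats = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        if observed:
            stats.append((mean(observed), pstdev(observed)))
        else:
            stats.append((0.0, 0.0))  # fully missing column: no z-scores

    features = []
    for r in rows:
        # Feature 1: fraction of cells in this row that are missing.
        missing_share = sum(v is None for v in r) / n_cols
        # Feature 2: largest absolute z-score among the observed cells;
        # constant columns (standard deviation 0) are skipped.
        z_scores = [
            abs(v - m) / s
            for v, (m, s) in zip(r, stats)
            if v is not None and s > 0
        ]
        features.append({
            "missing_share": missing_share,
            "max_abs_z": max(z_scores, default=0.0),
        })
    return features


# Hypothetical sample: the last row holds an extreme value in column 0.
rows = [
    [1.0, 10.0, None],
    [2.0, 11.0, 5.0],
    [1.5, None, 5.5],
    [50.0, 10.5, 5.2],
]
features = construct_quality_features(rows)
```

Because each constructed feature is an ordinary numeric column with one value per row, the features can be plotted against one another (for example in a scatter plot or heat map) to see whether issues such as missingness and outlyingness occur in the same rows.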

Author information

Authors and Affiliations

University of Tampere, Tampere, Finland
Markus Vattulainen

Corresponding author

Correspondence to Markus Vattulainen.

Editor information

Editors and Affiliations

IBaI, Inst of Comp Vision and applied Comp Sci, Leipzig, Sachsen, Germany
Petra Perner

About this paper

Cite this paper

Vattulainen, M. (2016). Data Quality Visualization for Preprocessing. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2016. Lecture Notes in Computer Science (LNAI), vol 9728. Springer, Cham. https://doi.org/10.1007/978-3-319-41561-1_32

DOI: https://doi.org/10.1007/978-3-319-41561-1_32
Published: 28 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41560-4
Online ISBN: 978-3-319-41561-1
eBook Packages: Computer Science, Computer Science (R0)
