A Novel Algorithm for the Integration of the Imputation of Missing Values and Clustering

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9166)

Abstract

In this article we present a new and efficient algorithm for handling missing values in databases used for data mining (DM). Missing values may distort the calculations of a clustering algorithm and lead to misleading results, so they must be treated before the DM step. Methods for handling missing values are commonly implemented as a process separate from the DM itself, which may cause long runtimes and redundant I/O accesses and make the entire DM process inefficient. We present a new algorithm, km-Impute, which integrates clustering and the imputation of missing values in a unified process. The algorithm was tested on real Red wine quality measures from the UCI Machine Learning Repository. km-Impute succeeded both in imputing missing values and in building clusters within a single integrated process. The structure and quality of the clusters produced by km-Impute were similar to those of clusters produced by k-means. In addition, the clusters were analyzed by a wine expert, and they represented different types of Red wine quality. The success and accuracy of the imputation were validated on two further UCI datasets, White wine and Page blocks. The results were consistent with the tests on Red wine: the imputation success ratio was similar across all three datasets. Although the complexity of km-Impute is the same as that of k-means, in practice it was more efficient when applied to medium-sized databases: its runtime was significantly shorter than that of k-means, and it required fewer iterations to converge. km-Impute also performed far fewer I/O accesses than k-means.
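The abstract does not spell out the steps of km-Impute, so the following is only a minimal sketch of the general idea of folding imputation into a k-means style loop. It is not the authors' algorithm. The function name `cluster_and_impute`, the use of `None` for a missing value, the distance over observed features only, and the choice to fill gaps from the assigned centroid are all assumptions made for illustration.

```python
import math


def cluster_and_impute(rows, init_centroids, max_iter=100):
    """Cluster rows and fill their missing values (None) in one loop.

    Hypothetical illustration only: assumes numeric features, None for a
    missing value, and caller-supplied starting centroids.
    """
    centroids = [list(c) for c in init_centroids]  # do not mutate the input
    k = len(centroids)
    n_features = len(rows[0])
    labels = [-1] * len(rows)

    for _ in range(max_iter):
        # Assignment step: compare a row to each centroid using only the
        # features that are observed in that row (mean squared difference).
        new_labels = []
        for row in rows:
            dists = []
            for c in centroids:
                sq = [(x - cj) ** 2 for x, cj in zip(row, c) if x is not None]
                dists.append(sum(sq) / len(sq) if sq else math.inf)
            new_labels.append(dists.index(min(dists)))

        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels

        # Update step: each centroid coordinate becomes the mean of the
        # observed values of its members; missing entries are skipped.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            for f in range(n_features):
                observed = [r[f] for r in members if r[f] is not None]
                if observed:
                    centroids[j][f] = sum(observed) / len(observed)

    # Imputation: a missing entry takes the value of the row's own centroid.
    imputed = [
        [c if x is None else x for x, c in zip(row, centroids[lab])]
        for row, lab in zip(rows, labels)
    ]
    return labels, centroids, imputed


rows = [
    [1.0, 1.0], [1.2, 0.8], [0.9, None],
    [8.0, 8.0], [8.2, 7.9], [None, 8.1],
]
labels, centroids, imputed = cluster_and_impute(rows, [[1.0, 1.0], [8.0, 8.0]])
print(labels)
print(imputed)
```

Because the missing entries are handled inside the same pass that assigns rows to clusters, no separate preprocessing scan of the data is needed. This is the kind of saving in runtime and I/O that the abstract attributes to a unified process.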

Author information

Corresponding author

Correspondence to Roni Ben Ishay.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Ben Ishay, R., Herman, M. (2015). A Novel Algorithm for the Integration of the Imputation of Missing Values and Clustering. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_8

  • DOI: https://doi.org/10.1007/978-3-319-21024-7_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21023-0

  • Online ISBN: 978-3-319-21024-7

  • eBook Packages: Computer Science, Computer Science (R0)
