A Novel Algorithm for the Integration of the Imputation of Missing Values and Clustering

Ben Ishay, Roni; Herman, Maya

doi:10.1007/978-3-319-21024-7_8


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9166)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition


Abstract

In this article we present a new and efficient algorithm for handling missing values in databases used for data mining (DM). Missing values can distort the calculations of a clustering algorithm and lead to misleading results, so they must be treated before the DM. Methods for handling missing values are commonly implemented as a process separate from the DM itself, which may lengthen the runtime and cause redundant I/O accesses, making the entire DM process inefficient. We present a new algorithm, km-Impute, which integrates clustering and the imputation of missing values into a unified process. The algorithm was tested on real Red wine quality measures from the UCI Machine Learning Repository. km-Impute succeeded in imputing missing values and building clusters in a single integrated process. The structure and quality of the clusters it produced were similar to those of k-means. In addition, the clusters were analyzed by a wine expert and were found to represent different types of Red wine quality. The success and accuracy of the imputation were validated on two further datasets from the UCI repository: White wine and Page blocks. The results were consistent with the Red wine tests: the imputation success ratio was similar across all three datasets. Although km-Impute has the same complexity as k-means, in practice it was more efficient on medium-sized databases: its runtime was significantly shorter than that of k-means, and it required fewer iterations to converge. km-Impute also performed far fewer I/O accesses than k-means.
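The abstract does not spell out the steps of km-Impute, so the following is only a minimal sketch of the general idea of imputing while clustering, not the authors' algorithm. It assumes a k-means variant in which distances are computed over the observed attributes only and each missing value is filled from the centroid of the cluster its row is assigned to, within the same pass. The function name `kmeans_impute_sketch`, its parameters, the column-mean starting fill, and the use of `None` to mark a missing value are all illustrative assumptions.

```python
import random


def kmeans_impute_sketch(rows, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster rows and fill missing entries (None) in one loop.

    Illustrative only: assumes every column has at least one observed value.
    Returns (imputed_rows, labels, centroids); the input is not modified.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    missing = [[v is None for v in r] for r in rows]

    # Starting fill: column means over observed values (an assumption).
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(observed) / len(observed))
    data = [[col_means[j] if v is None else float(v) for j, v in enumerate(r)]
            for r in rows]

    # Initial centroids: k distinct rows chosen at random.
    centroids = [list(data[i]) for i in rng.sample(range(len(data)), k)]
    labels = [-1] * len(data)

    for _ in range(max_iter):
        new_labels = []
        shift = 0.0  # largest change in any imputed value this pass
        for r, m in zip(data, missing):
            # Assignment: squared distance over observed attributes only.
            dists = [sum((r[j] - c[j]) ** 2 for j in range(n_cols) if not m[j])
                     for c in centroids]
            lab = dists.index(min(dists))
            new_labels.append(lab)
            # Imputation in the same pass: copy the centroid's coordinates.
            for j in range(n_cols):
                if m[j]:
                    shift = max(shift, abs(r[j] - centroids[lab][j]))
                    r[j] = centroids[lab][j]
        # Update: recompute each non-empty cluster's centroid.
        for c in range(k):
            members = [r for r, lab in zip(data, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        stable = new_labels == labels and shift < tol
        labels = new_labels
        if stable:
            break
    return data, labels, centroids
```

A call such as `kmeans_impute_sketch([[1.0, 1.0], [1.2, None], [10.0, 10.0], [None, 10.1]], k=2)` returns the filled rows together with the cluster labels and centroids. Because the fill happens inside the assignment loop, the data is scanned once per iteration instead of once for a separate imputation stage and again for clustering, which is the kind of redundancy the abstract attributes to the two-stage approach.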

Author information

Authors and Affiliations

The Open University of Israel, Ra’anana, Israel
Roni Ben Ishay & Maya Herman
The Ashkelon Academic College, Ashkelon, Israel
Roni Ben Ishay

Corresponding author

Correspondence to Roni Ben Ishay.

Editor information

Editors and Affiliations

IBaI, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Ben Ishay, R., Herman, M. (2015). A Novel Algorithm for the Integration of the Imputation of Missing Values and Clustering. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_8

DOI: https://doi.org/10.1007/978-3-319-21024-7_8
Published: 01 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21023-0
Online ISBN: 978-3-319-21024-7
eBook Packages: Computer Science, Computer Science (R0)
