Abstract
Clustering is a widely used technique in data mining. Many clustering algorithms exist, but most either handle only a single attribute type or handle both attribute types but are inefficient on large data sets. Few algorithms do both well. In this paper, we propose a clustering algorithm, CFIKP, that can handle large datasets with mixed-type attributes. We first use a CF*-tree to pre-cluster the dataset. Once the dense regions are stored in the leaf nodes, we treat each dense region as a single point and use an improved k-prototype algorithm to cluster these dense regions. Experiments show that the CFIKP algorithm is very efficient in clustering large datasets with mixed-type attributes.
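The two-stage idea described above can be illustrated with a minimal sketch. It is not the CFIKP implementation: the abstract does not specify the CF*-tree layout or the improvements to k-prototype, so the sketch assumes a single-pass greedy summarizer (`pre_cluster`) as a stand-in for the tree, the standard k-prototypes dissimilarity (squared Euclidean distance on numeric attributes plus a weight `gamma` times the number of categorical mismatches), and naive seeding from the first `k` regions. All names (`Region`, `pre_cluster`, `cluster_regions`, `threshold`, `gamma`) are hypothetical.

```python
from collections import Counter

def mixed_distance(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Standard k-prototypes dissimilarity: squared Euclidean distance on
    numeric attributes plus gamma times the number of categorical mismatches."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    mismatches = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric + gamma * mismatches

class Region:
    """Summary of one dense region (assumed stand-in for a tree leaf entry)."""
    def __init__(self, num, cat):
        self.n = 0
        self.num_sum = [0.0] * len(num)
        self.cat_counts = [Counter() for _ in cat]
        self.add(num, cat)

    def add(self, num, cat):
        self.n += 1
        self.num_sum = [s + x for s, x in zip(self.num_sum, num)]
        for counter, value in zip(self.cat_counts, cat):
            counter[value] += 1

    def center(self):
        # Numeric mean and per-attribute categorical mode of the region.
        num = [s / self.n for s in self.num_sum]
        cat = [c.most_common(1)[0][0] for c in self.cat_counts]
        return num, cat

def pre_cluster(records, threshold, gamma=1.0):
    """Single pass: absorb each record into the nearest region if it lies
    within `threshold`, otherwise open a new region."""
    regions = []
    for num, cat in records:
        best, best_d = None, None
        for r in regions:
            d = mixed_distance(num, cat, *r.center(), gamma=gamma)
            if best_d is None or d < best_d:
                best, best_d = r, d
        if best is not None and best_d <= threshold:
            best.add(num, cat)
        else:
            regions.append(Region(num, cat))
    return regions

def cluster_regions(regions, k, gamma=1.0, iters=10):
    """Weighted k-prototypes over region summaries: each region acts as one
    point carrying weight n. Returns one cluster label per region."""
    protos = [r.center() for r in regions[:k]]  # naive seeding (assumption)
    labels = [0] * len(regions)
    for _ in range(iters):
        labels = [
            min(range(len(protos)),
                key=lambda j: mixed_distance(*r.center(), *protos[j], gamma=gamma))
            for r in regions
        ]
        for j in range(len(protos)):
            members = [r for r, lab in zip(regions, labels) if lab == j]
            if not members:
                continue  # keep the old prototype for an empty cluster
            total = sum(r.n for r in members)
            dims = len(members[0].num_sum)
            num = [sum(r.num_sum[d] for r in members) / total for d in range(dims)]
            cat = []
            for a in range(len(members[0].cat_counts)):
                merged = Counter()
                for r in members:
                    merged.update(r.cat_counts[a])
                cat.append(merged.most_common(1)[0][0])
            protos[j] = (num, cat)
    return labels
```

Each record is a pair `(numeric_values, categorical_values)`. Calling `pre_cluster(records, threshold)` and then `cluster_regions(regions, k)` runs the second stage over the region summaries only, which is the intended source of the speed-up on large inputs: the iterative step never revisits the raw records.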
This work is supported by the National Natural Science Foundation of China (60205007), the Natural Science Foundation of Guangdong Province (031558, 04300462), the Research Foundation of the National Science and Technology Plan Project (2004BA721A02), the Research Foundation of the Science and Technology Plan Project in Guangdong Province (2003C50118), and the Research Foundation of the Science and Technology Plan Project in Guangzhou City (2002Z3-E0017).
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yin, J., Tan, Z. (2005). Clustering Mixed Type Attributes in Large Dataset. In: Pan, Y., Chen, D., Guo, M., Cao, J., Dongarra, J. (eds) Parallel and Distributed Processing and Applications. ISPA 2005. Lecture Notes in Computer Science, vol 3758. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11576235_66
DOI: https://doi.org/10.1007/11576235_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29769-7
Online ISBN: 978-3-540-32100-2
eBook Packages: Computer Science (R0)