PROFIT: A Projected Clustering Technique

Chapter in: Real World Data Mining Applications

Part of the book series: Annals of Information Systems (AOIS, volume 17)
Abstract

Clustering high dimensional datasets is a major research area because of its widespread applications in many domains. However, meaningful clustering of high dimensional data is challenging because (i) such data usually contain many irrelevant dimensions that mask the clusters, (ii) distance, the similarity measure used by most methods, loses its meaning in high dimensions, and (iii) different clusters may exist in different subsets of dimensions. Feature selection based clustering methods address the problem of clustering high dimensional data, but searching for all clusters in a single subset of a few selected relevant dimensions is not justified, since different clusters may exist in different subsets of dimensions. In this article we propose PROFIT (PROjective clustering algorithm based on FIsher score and Trimmed mean), which extends feature selection based clustering to projective clustering and works well with high dimensional datasets whose attributes lie in a continuous domain. PROFIT works in four phases: a sampling phase, an initialization phase, a dimension selection phase and a refinement phase. We experiment on five real datasets with different input parameters and compare against three well-known top-down subspace clustering methods, PROCLUS, ORCLUS and PCKA, as well as our feature selection based non-subspace clustering method FAMCA. The results are evaluated with two well-known subspace clustering quality measures (the Jagota index and the sum of squared error) and with Student's t-test to determine whether the differences between clustering results are significant. The obtained results and quality measures show the effectiveness and superiority of the proposed method PROFIT over its competitors.
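The abstract names Fisher score and trimmed mean as the two ingredients of the dimension selection phase, but the chapter's exact formulation is not reproduced here. The sketch below is only a rough illustration of how these two statistics are commonly defined and how they could rank dimensions by cluster relevance; the function names and the way they would be combined inside PROFIT are assumptions, not the authors' algorithm:

```python
import numpy as np

def trimmed_mean(values, trim=0.1):
    # Discard the lowest and highest `trim` fraction of the sorted values
    # before averaging, giving a centre estimate robust to outliers.
    v = np.sort(np.asarray(values, dtype=float))
    k = int(len(v) * trim)
    core = v[k:len(v) - k]
    return core.mean() if core.size else v.mean()

def fisher_scores(X, labels):
    # Classical Fisher score per dimension: between-cluster scatter of the
    # cluster means divided by within-cluster scatter. A high score means
    # the dimension separates the clusters well.
    X = np.asarray(X, dtype=float)
    overall = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / np.maximum(den, 1e-12)  # guard against zero within-scatter
```

A projective clustering step could then, for each tentative cluster, keep only the dimensions whose score exceeds a threshold derived from the trimmed mean of all scores, so that a few extreme dimensions do not distort the cut-off.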


Notes

  1. http://www.uni-koeln.de/themen/statistik/data/cluster/milk.dat
  2. http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.dat
  3. http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.dat
  4. http://archive.ics.uci.edu/ml/machine-learning-databases/image/segmentation.dat
  5. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
  6. http://www.fromzerotoseo.com/stopwords-remove/
  7. http://tartarus.org/martin/PorterStemmer/

References

  1. Aggarwal, C., Yu, P.: Finding generalized projected clusters in high dimensional spaces. In: ACM SIGMOD International Conference on Management of Data, pp. 70–81. ACM (2000)
  2. Aggarwal, C., Wolf, J., Yu, P., Procopiuc, C., Park, J.: Fast algorithms for projected clustering. ACM SIGMOD Record 28(2), 61–72 (1999)
  3. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: ACM SIGMOD International Conference on Management of Data, pp. 94–105. ACM Press (1998)
  4. Andrews, H., Patterson, C.: Singular value decompositions and digital image processing. IEEE Trans. Acoust. Speech Signal Process. 24(1), 26–53 (1976)
  5. Apolloni, B., Bassis, S., Brega, A.: Feature selection via Boolean independent component analysis. Inf. Sci. 179(22), 3815–3831 (2009)
  6. Arai, K., Barakbah, A.: Hierarchical k-means: an algorithm for centroids initialization for k-means. Rep. Fac. Sci. Eng. 36(1), 25–31 (2007)
  7. Barakbah, A., Kiyoki, Y.: A pillar algorithm for k-means optimization by distance maximization for initial centroid designation. In: IEEE Symposium on Computational Intelligence and Data Mining (CIDM '09), pp. 61–68. IEEE (2009)
  8. Berkhin, P.: A survey of clustering data mining techniques. Technical Report (2002)
  9. Bouguessa, M., Wang, S.: Mining projected clusters in high-dimensional spaces. IEEE Trans. Knowl. Data Eng. 21(4), 507–522 (2009)
  10. Celebi, M.: Effective initialization of k-means for color quantization. In: 16th IEEE International Conference on Image Processing (ICIP), pp. 1649–1652. IEEE (2009)
  11. Cheng, C., Fu, A., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 84–93. ACM (1999)
  12. Chu, Y., Huang, J., Chuang, K., Yang, D., Chen, M.: Density conscious subspace clustering for high-dimensional data. IEEE Trans. Knowl. Data Eng. 22(1), 16–30 (2010)
  13. Ding, C., He, X.: K-means clustering via principal component analysis. In: Proceedings of the Twenty-First International Conference on Machine Learning, pp. 225–232. ACM (2004)
  14. Gheyas, I., Smith, L.: Feature subset selection in large dimensionality domains. Pattern Recognit. 43(1), 5–13 (2010)
  15. Goil, S., Nagesh, H., Choudhary, A.: MAFIA: efficient and scalable subspace clustering for very large data sets. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 443–452 (1999)
  16. Günnemann, S., Färber, I., Müller, E., Seidl, T.: ASCLU: alternative subspace clustering. In: MultiClust at KDD. Citeseer (2010)
  17. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann (2001)
  18. Hu, Q., Che, X., Zhang, L., Yu, D.: Feature evaluation and selection based on neighborhood soft margin. Neurocomputing 73(10), 2114–2124 (2010)
  19. Jagota, A.: Novelty detection on a very large number of memories stored in a Hopfield-style network. In: IJCNN-91-Seattle International Joint Conference on Neural Networks, vol. 2, p. 905. IEEE (1991)
  20. Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall (1988)
  21. Jain, A., Murty, M., Flynn, P.: Data clustering: a review. ACM Computing Surveys (CSUR) 31(3), 264–323 (1999)
  22. Kabir, M., Islam, M., et al.: A new wrapper feature selection approach using neural network. Neurocomputing 73(16), 3273–3283 (2010)
  23. Khan, S., Ahmad, A.: Cluster center initialization algorithm for k-means clustering. Pattern Recognit. Lett. 25(11), 1293–1302 (2004)
  24. Kriegel, H., Kröger, P., Zimek, A.: Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data (TKDD) 3(1), 1–58 (2009)
  25. Kruskal, J., Wish, M.: Multidimensional Scaling. Quantitative Applications in the Social Sciences. Sage, Beverly Hills (1978)
  26. Liu, Y., Liu, Y., Chan, K.: Dimensionality reduction for heterogeneous dataset in rushes editing. Pattern Recognit. 42(2), 229–242 (2009)
  27. Moise, G., Zimek, A., Kröger, P., Kriegel, H., Sander, J.: Subspace and projected clustering: experimental evaluation and analysis. Knowl. Inf. Syst. 21(3), 299–326 (2009)
  28. Ng, R., Han, J.: CLARANS: a method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 14(5), 1003–1016 (2002)
  29. Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter 6(1), 90–105 (2004)
  30. Parsons, L., Haque, E., Liu, H., et al.: Evaluating subspace clustering algorithms. In: Workshop on Clustering High Dimensional Data and its Applications, SIAM International Conference on Data Mining, pp. 48–56. Citeseer (2004)
  31. Pearson, E.: Studies in the history of probability and statistics. XX: Some early correspondence between W.S. Gosset, R.A. Fisher and Karl Pearson, with notes and comments. Biometrika 55(3), 445–457 (1968)
  32. Puri, C., Kumar, N.: Projected Gustafson-Kessel clustering algorithm and its convergence. Trans. Rough Sets XIV, 159–182 (2011)
  33. Rajput, D., Singh, P., Bhattacharya, M.: An efficient technique for clustering high dimensional data set. In: 10th International Conference on Information and Knowledge Engineering, pp. 434–440. WASET, USA (2011)
  34. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
  35. Sugiyama, M., Kawanabe, M., Chui, P.: Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Netw. 23(1), 44–59 (2010)
  36. Tenenbaum, J., De Silva, V., Langford, J.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
  37. Veenman, C., Reinders, M., Backer, E.: A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)
  38. Wang, D., Ding, C., Li, T.: K-subspace clustering. In: Machine Learning and Knowledge Discovery in Databases, pp. 506–521 (2009)

Author information

Corresponding author

Correspondence to Dharmveer Singh Rajput.

Copyright information

© 2015 Springer International Publishing Switzerland

Cite this chapter

Rajput, D., Singh, P., Bhattacharya, M. (2015). PROFIT: A Projected Clustering Technique. In: Abou-Nasr, M., Lessmann, S., Stahlbock, R., Weiss, G. (eds) Real World Data Mining Applications. Annals of Information Systems, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-07812-0_4
