PROFIT: A Projected Clustering Technique

Chapter in: Real World Data Mining Applications

Part of the book series: Annals of Information Systems (AOIS, volume 17)
Abstract

Clustering high dimensional datasets is a major research area because of its widespread applications in many domains. However, meaningful clustering of high dimensional data is challenging because (i) such data usually contain many irrelevant dimensions that mask the clusters, (ii) distance, the similarity measure used by most methods, loses its meaning in high dimensions, and (iii) different clusters may exist in different subsets of dimensions. Feature selection based clustering methods address the problem of clustering high dimensional data, but searching for all clusters in a single subset of a few selected relevant dimensions is not justified, since different clusters may exist in different subsets of dimensions. In this article we propose PROFIT (PROjective clustering algorithm based on FIsher score and Trimmed mean), which extends feature selection based clustering to projective clustering and works well with high dimensional datasets whose attributes lie in a continuous domain. PROFIT works in four phases: a sampling phase, an initialization phase, a dimension selection phase and a refinement phase. We experiment on five real datasets with different input parameters and compare against three well-known top-down subspace clustering methods, PROCLUS, ORCLUS and PCKA, as well as our feature selection based non-subspace clustering method FAMCA. The results are evaluated with two well-known subspace clustering quality measures (the Jagota index and the sum of squared error) and with Student's t-test to determine whether the differences between clustering results are significant. The obtained results and quality measures show the effectiveness and superiority of the proposed method PROFIT over its competitors.
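The abstract names Fisher score and trimmed mean as the two ingredients of the dimension selection phase, but the chapter's exact formulation is not reproduced here. The sketch below is only a rough illustration of how these two statistics are commonly defined and how they could rank dimensions by cluster relevance; the function names and the way they would be combined inside PROFIT are assumptions, not the authors' algorithm:

```python
import numpy as np

def trimmed_mean(values, trim=0.1):
    # Discard the lowest and highest `trim` fraction of the sorted values
    # before averaging, giving a centre estimate robust to outliers.
    v = np.sort(np.asarray(values, dtype=float))
    k = int(len(v) * trim)
    core = v[k:len(v) - k]
    return core.mean() if core.size else v.mean()

def fisher_scores(X, labels):
    # Classical Fisher score per dimension: between-cluster scatter of the
    # cluster means divided by within-cluster scatter. A high score means
    # the dimension separates the clusters well.
    X = np.asarray(X, dtype=float)
    overall = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / np.maximum(den, 1e-12)  # guard against zero within-scatter
```

A projective clustering step could then, for each tentative cluster, keep only the dimensions whose score exceeds a threshold derived from the trimmed mean of all scores, so that a few extreme dimensions do not distort the cut-off.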


Notes

  1. http://www.uni-koeln.de/themen/statistik/data/cluster/milk.dat
  2. http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.dat
  3. http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.dat
  4. http://archive.ics.uci.edu/ml/machine-learning-databases/image/segmentation.dat
  5. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
  6. http://www.fromzerotoseo.com/stopwords-remove/
  7. http://tartarus.org/martin/PorterStemmer/

References

  1. Aggarwal, C., Yu, P.: Finding generalized projected clusters in high dimensional spaces. In: ACM SIGMOD International Conference on Management of Data, pp. 70–81. ACM (2000)
  2. Aggarwal, C., Wolf, J., Yu, P., Procopiuc, C., Park, J.: Fast algorithms for projected clustering. ACM SIGMOD Record 28(2), 61–72 (1999)
  3. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: ACM SIGMOD International Conference on Management of Data, pp. 94–105. ACM Press (1998)
  4. Andrews, H., Patterson, C.: Singular value decompositions and digital image processing. IEEE Trans. Acoust. Speech Signal Process. 24(1), 26–53 (1976)
  5. Apolloni, B., Bassis, S., Brega, A.: Feature selection via Boolean independent component analysis. Inf. Sci. 179(22), 3815–3831 (2009)
  6. Arai, K., Barakbah, A.: Hierarchical k-means: an algorithm for centroids initialization for k-means. Rep. Fac. Sci. Eng. 36(1), 25–31 (2007)
  7. Barakbah, A., Kiyoki, Y.: A pillar algorithm for k-means optimization by distance maximization for initial centroid designation. In: IEEE Symposium on Computational Intelligence and Data Mining (CIDM '09), pp. 61–68. IEEE (2009)
  8. Berkhin, P.: A survey of clustering data mining techniques. Technical Report (2002)
  9. Bouguessa, M., Wang, S.: Mining projected clusters in high-dimensional spaces. IEEE Trans. Knowl. Data Eng. 21(4), 507–522 (2009)
  10. Celebi, M.: Effective initialization of k-means for color quantization. In: 16th IEEE International Conference on Image Processing (ICIP), pp. 1649–1652. IEEE (2009)
  11. Cheng, C., Fu, A., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 84–93. ACM (1999)
  12. Chu, Y., Huang, J., Chuang, K., Yang, D., Chen, M.: Density conscious subspace clustering for high-dimensional data. IEEE Trans. Knowl. Data Eng. 22(1), 16–30 (2010)
  13. Ding, C., He, X.: K-means clustering via principal component analysis. In: Proceedings of the Twenty-First International Conference on Machine Learning, pp. 225–232. ACM (2004)
  14. Gheyas, I., Smith, L.: Feature subset selection in large dimensionality domains. Pattern Recognit. 43(1), 5–13 (2010)
  15. Goil, S., Nagesh, H., Choudhary, A.: MAFIA: efficient and scalable subspace clustering for very large data sets. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 443–452 (1999)
  16. Günnemann, S., Färber, I., Müller, E., Seidl, T.: ASCLU: alternative subspace clustering. In: MultiClust at KDD. Citeseer (2010)
  17. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann (2001)
  18. Hu, Q., Che, X., Zhang, L., Yu, D.: Feature evaluation and selection based on neighborhood soft margin. Neurocomputing 73(10), 2114–2124 (2010)
  19. Jagota, A.: Novelty detection on a very large number of memories stored in a Hopfield-style network. In: IJCNN-91-Seattle International Joint Conference on Neural Networks, vol. 2, p. 905. IEEE (1991)
  20. Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall (1988)
  21. Jain, A., Murty, M., Flynn, P.: Data clustering: a review. ACM Computing Surveys (CSUR) 31(3), 264–323 (1999)
  22. Kabir, M., Islam, M., et al.: A new wrapper feature selection approach using neural network. Neurocomputing 73(16), 3273–3283 (2010)
  23. Khan, S., Ahmad, A.: Cluster center initialization algorithm for k-means clustering. Pattern Recognit. Lett. 25(11), 1293–1302 (2004)
  24. Kriegel, H., Kröger, P., Zimek, A.: Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data (TKDD) 3(1), 1–58 (2009)
  25. Kruskal, J., Wish, M.: Multidimensional Scaling. Quantitative Applications in the Social Sciences. Sage, Beverly Hills (1978)
  26. Liu, Y., Liu, Y., Chan, K.: Dimensionality reduction for heterogeneous dataset in rushes editing. Pattern Recognit. 42(2), 229–242 (2009)
  27. Moise, G., Zimek, A., Kröger, P., Kriegel, H., Sander, J.: Subspace and projected clustering: experimental evaluation and analysis. Knowl. Inf. Syst. 21(3), 299–326 (2009)
  28. Ng, R., Han, J.: CLARANS: a method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 14(5), 1003–1016 (2002)
  29. Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter 6(1), 90–105 (2004)
  30. Parsons, L., Haque, E., Liu, H., et al.: Evaluating subspace clustering algorithms. In: Workshop on Clustering High Dimensional Data and its Applications, SIAM International Conference on Data Mining, pp. 48–56. Citeseer (2004)
  31. Pearson, E.: Studies in the history of probability and statistics. XX: Some early correspondence between W.S. Gosset, R.A. Fisher and Karl Pearson, with notes and comments. Biometrika 55(3), 445–457 (1968)
  32. Puri, C., Kumar, N.: Projected Gustafson-Kessel clustering algorithm and its convergence. Trans. Rough Sets XIV, 159–182 (2011)
  33. Rajput, D., Singh, P., Bhattacharya, M.: An efficient technique for clustering high dimensional data set. In: 10th International Conference on Information and Knowledge Engineering, pp. 434–440. WASET, USA (2011)
  34. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
  35. Sugiyama, M., Kawanabe, M., Chui, P.: Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Netw. 23(1), 44–59 (2010)
  36. Tenenbaum, J., De Silva, V., Langford, J.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
  37. Veenman, C., Reinders, M., Backer, E.: A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)
  38. Wang, D., Ding, C., Li, T.: K-subspace clustering. In: Machine Learning and Knowledge Discovery in Databases, pp. 506–521 (2009)

Author information

Corresponding author

Correspondence to Dharmveer Singh Rajput.

Copyright information

© 2015 Springer International Publishing Switzerland

Cite this chapter

Rajput, D., Singh, P., Bhattacharya, M. (2015). PROFIT: A Projected Clustering Technique. In: Abou-Nasr, M., Lessmann, S., Stahlbock, R., Weiss, G. (eds) Real World Data Mining Applications. Annals of Information Systems, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-07812-0_4
