
Maxmin distance sort heuristic-based initial centroid method of partitional clustering for big data mining

  • Theoretical Advances
  • Published in: Pattern Analysis and Applications

Abstract

The revolution in digital and communication technologies is producing enormous amounts of data. Classical data is therefore turning into big data, and mining techniques face challenges of high computational cost, performance and scalability. The K-means (KM) algorithm is the most widely used partitional clustering approach; it depends on the number of clusters K, the initial centroids, the distance measure and central tendency statistics. Because of the gradient-descent nature of the KM algorithm, the initial centroids determine computational effectiveness, efficiency and susceptibility to local optima in big data clustering. Existing centroid initialization algorithms achieve low cluster quality at high computational complexity because of their iterations, distance computations, and data and result comparisons. To overcome these deficiencies, this paper presents the Maxmin Distance Sort Heuristic (MDSH) algorithm for big data clustering through a stratified sampling process. The performance of the resulting MDSH-seeded KM (MDSHKM) algorithm is compared with the KM and KM++ algorithms using the R-square, Root-Mean-Square Standard Deviation, Davies–Bouldin score, Calinski–Harabasz score, Silhouette Coefficient, number of iterations and CPU time validation indices on eight real datasets. The experimental evaluation shows that the MDSHKM algorithm achieves better cluster quality, computing cost, efficiency and convergence stability than the KM and KM++ algorithms.
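The paper's exact MDSH procedure (including its stratified sampling step) is described in the full text, not in this abstract. For orientation, the classic maxmin (farthest-first) seeding rule that the heuristic builds on can be sketched as follows; the function name and interface here are illustrative, not the authors' implementation:

```python
import numpy as np

def maxmin_init(X, k, seed=None):
    """Pick k initial centroids by the maxmin (farthest-first) rule:
    start from one random point, then repeatedly add the data point
    whose distance to its nearest already-chosen centroid is largest."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    # d[i] holds the distance from point i to its nearest chosen centroid
    d = np.linalg.norm(X - centroids[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(d))  # the farthest point becomes the next centroid
        centroids.append(X[idx])
        d = np.minimum(d, np.linalg.norm(X - X[idx], axis=1))
    return np.array(centroids)
```

Seeds chosen this way are well spread out, which is what lets KM start closer to a good optimum and converge in fewer iterations; KM++ softens the same idea by sampling the next seed with probability proportional to squared distance rather than always taking the maximum.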



Author information

Corresponding author

Correspondence to Kamlesh Kumar Pandey.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Pandey, K.K., Shukla, D. Maxmin distance sort heuristic-based initial centroid method of partitional clustering for big data mining. Pattern Anal Applic 25, 139–156 (2022). https://doi.org/10.1007/s10044-021-01045-0

