Abstract
Risk analysis is an essential business activity because it uncovers previously unknown risks, such as financial, recovery, investment, operational, credit, and debit risk. Clustering is a data mining technique that uses the behavior and nature of data to discover unexpected risks in business data. In a big data setting, clustering algorithms face challenges in execution time and cluster quality because of the primary attributes of big data. This study proposes a Stratified Systematic Sampling Extension (SSE) approach for clustering-based risk analysis in big data mining on a single machine. Sampling is a data reduction technique that saves computation time and improves the cluster quality, scalability, and speed of a clustering algorithm. The proposed sampling plan first forms the strata by selecting the minimum-variance dimension and then draws samples from each stratum using random linear systematic sampling. The clustering algorithm builds robust clusters, in terms of risk and non-risk groups, from the sample data and then extends the sample-based clustering to the final clustering of the full dataset using Euclidean distance. The SSE-based clustering algorithm is compared with the existing K-means and K-means++ algorithms on financial risk datasets, using the Davies-Bouldin score, the Silhouette coefficient, Scattering Density between clusters Validity, Scattering Distance Validity, and CPU time as validation metrics. The experimental results show that the SSE-based clustering algorithm achieves better cluster compaction, separation, density, and variance while reducing the number of iterations, distance computations, data comparisons, and computational time. A statistical analysis using the Friedman test indicates that the results of the proposed sampling plan are statistically significant.
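The pipeline summarized in the abstract (stratify on the minimum-variance dimension, draw a linear systematic sample with a random start from each stratum, cluster the sample, then extend the result to all records by Euclidean distance) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation. The function names, the equal-size strata, the parameters `n_strata` and `step`, the plain Lloyd-style K-means with random initial centroids, and the synthetic two-group data are all assumptions made here for illustration only.

```python
import math
import random
from statistics import pvariance


def stratified_systematic_sample(data, n_strata, step, rng):
    """Return indices of sampled records (hypothetical helper)."""
    # Assumption: strata are built on the dimension with the smallest variance.
    dim = min(range(len(data[0])),
              key=lambda d: pvariance([row[d] for row in data]))
    # Assumption: sort on that dimension and cut into equal-size strata.
    order = sorted(range(len(data)), key=lambda i: data[i][dim])
    size = math.ceil(len(order) / n_strata)
    sample = []
    for s in range(0, len(order), size):
        stratum = order[s:s + size]
        # Linear systematic sampling: random start, then every step-th record.
        start = rng.randrange(min(step, len(stratum)))
        sample.extend(stratum[start::step])
    return sample


def nearest(point, centroids):
    """Index of the centroid closest to point in Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd-style K-means with random initial centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as its group mean; keep it if the group is empty.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def sse_cluster(data, k=2, n_strata=4, step=5, seed=0):
    """Cluster a sample, then extend the labels to every record."""
    rng = random.Random(seed)
    idx = stratified_systematic_sample(data, n_strata, step, rng)
    centroids = kmeans([data[i] for i in idx], k, rng)
    # Extension step: each record joins the cluster of its nearest centroid.
    labels = [nearest(p, centroids) for p in data]
    return labels, centroids, idx


# Synthetic toy data (not the paper's datasets): two well-separated groups
# on the first dimension, with a low-variance second dimension.
data = [((i % 2) * 10 + (i // 2) * 0.01, i * 0.00025) for i in range(200)]
labels, centroids, sample_idx = sse_cluster(data)
print(len(sample_idx), "of", len(data), "records sampled")
```

In this sketch only the sampled records take part in the iterative K-means step, and the remaining records are assigned in a single pass over the data. That is one plausible reading of how sampling reduces iterations and distance computations; the actual savings reported in the paper depend on its own sampling sizes and datasets.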
Funding
This study received no external funding.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Pandey, K.K., Shukla, D. Stratified linear systematic sampling based clustering approach for detection of financial risk group by mining of big data. Int J Syst Assur Eng Manag 13, 1239–1253 (2022). https://doi.org/10.1007/s13198-021-01424-0
DOI: https://doi.org/10.1007/s13198-021-01424-0