Stratified linear systematic sampling based clustering approach for detection of financial risk group by mining of big data

Pandey, Kamlesh Kumar; Shukla, Diwakar

doi:10.1007/s13198-021-01424-0

Original article
Published: 19 October 2021

Volume 13, pages 1239–1253 (2022)

International Journal of System Assurance Engineering and Management

Abstract

Risk analysis is one of the most essential business activities because it uncovers unknown risks such as financial, recovery, investment, operational, credit, and debit risk, among others. Clustering is a data mining technique that uses the behavior and nature of data to discover unexpected risks in business data. In a big data setting, clustering algorithms face execution-time and cluster-quality challenges because of the primary attributes of big data. This study proposes a Stratified Systematic Sampling Extension (SSE) approach for clustering-based risk analysis in big data mining on a single machine. Sampling is a data reduction technique that saves computation time and improves the cluster quality, scalability, and speed of the clustering algorithm. The proposed sampling plan first forms the strata by selecting the minimum-variance dimension and then selects samples from each stratum using random linear systematic sampling. The clustering algorithm produces robust clusters, in terms of risk and non-risk groups, from the sample data and then extends the sample-based clustering results to the final clustering results using Euclidean distance. The performance of the SSE-based clustering algorithm is compared with the existing K-means and K-means++ algorithms on financial risk datasets using the Davies-Bouldin score, Silhouette coefficient, Scattering Density between clusters Validity, Scattering Distance Validity, and CPU time as validation metrics. The experimental results demonstrate that the SSE-based clustering algorithm achieves better clustering objectives in terms of cluster compaction, separation, density, and variance while minimizing iterations, distance computations, data comparisons, and computational time. Statistical analysis with the Friedman test indicates that the results obtained with the proposed sampling plan are statistically significant.
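
The pipeline outlined in the abstract (stratify on the minimum-variance dimension, draw a random-start linear systematic sample from each stratum, cluster the sample, then extend the result to all records by Euclidean distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how stratum boundaries are chosen, so the sketch assumes equal-size strata obtained by sorting on the selected dimension; it also uses a plain k-means with random seeding as a stand-in clustering step, and all function names, parameters, and the synthetic two-group data are hypothetical.

```python
import math
import random
from statistics import pvariance


def min_variance_dimension(points):
    """Index of the coordinate with the smallest population variance."""
    dims = range(len(points[0]))
    return min(dims, key=lambda d: pvariance([p[d] for p in points]))


def linear_systematic_sample(items, interval, rng):
    """Pick a random start in [0, interval), then every interval-th item."""
    start = rng.randrange(interval)
    return items[start::interval]


def stratified_systematic_sample(points, n_strata, interval, rng):
    """ASSUMPTION: strata are equal-size slices of the data sorted on the
    minimum-variance dimension; the abstract does not specify boundaries."""
    d = min_variance_dimension(points)
    ordered = sorted(points, key=lambda p: p[d])
    size = math.ceil(len(ordered) / n_strata)
    sample = []
    for i in range(0, len(ordered), size):
        stratum = ordered[i:i + size]
        sample.extend(linear_systematic_sample(stratum, interval, rng))
    return sample


def nearest(point, centroids):
    """Index of the centroid closest to point in Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda j: math.dist(point, centroids[j]))


def kmeans(sample, k, rng, max_iter=100):
    """Plain Lloyd-style k-means on the sample (stand-in clustering step)."""
    centroids = rng.sample(sample, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in sample:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep the old
        # centroid if a group happens to be empty.
        updated = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def extend_to_full_data(points, centroids):
    """Label every record with its nearest sample-based centroid."""
    return [nearest(p, centroids) for p in points]


# Synthetic two-group data (hypothetical, for illustration only).
rng = random.Random(0)
group_a = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(500)]
group_b = [(rng.uniform(9, 10), rng.uniform(9, 10)) for _ in range(500)]
data = group_a + group_b

sample = stratified_systematic_sample(data, n_strata=4, interval=10, rng=rng)
centroids = kmeans(sample, k=2, rng=rng)
labels = extend_to_full_data(data, centroids)
print(len(sample), "sampled records used to label", len(labels), "records")
```

In this sketch only the sampled records take part in the iterative clustering step; the remaining records each need a single pass of distance computations against the final centroids, which is the kind of saving in iterations and distance computations that the abstract attributes to the sampling-based approach.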

Funding

This study received no external funding.

Author information

Authors and Affiliations

Department of Computer Science & Applications, Dr. Harisingh Gour Vishwavidyalaya, Sagar, M.P., India
Kamlesh Kumar Pandey & Diwakar Shukla

Corresponding author

Correspondence to Kamlesh Kumar Pandey.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pandey, K.K., Shukla, D. Stratified linear systematic sampling based clustering approach for detection of financial risk group by mining of big data. Int J Syst Assur Eng Manag 13, 1239–1253 (2022). https://doi.org/10.1007/s13198-021-01424-0

Received: 30 July 2021
Revised: 12 September 2021
Accepted: 20 September 2021
Published: 19 October 2021
Issue Date: June 2022
DOI: https://doi.org/10.1007/s13198-021-01424-0
