
Feature selection via minimizing global redundancy for imbalanced data

Published in Applied Intelligence

Abstract

Mining knowledge from imbalanced data is challenging due to the uneven distribution of classes and the increasing dimensionality of data accumulated from real-life applications. Selecting informative features from imbalanced data is therefore especially important for building an effective learning method, and the global redundancy among features and the effect of the imbalanced class distribution need to be considered simultaneously. In this study, a feature selection method that accounts for the imbalanced distribution of classes is investigated by embedding a weighted constraint on the majority class into the global redundancy minimization (GRM) framework. Global redundancy minimization is achieved through an objective function that combines a feature redundancy matrix with feature scores. A new form of regularization of the within-class scatter matrix is first presented, which emphasizes the minority class and replaces the redundancy measurement approach. Then, by employing this regularized within-class scatter matrix in GRM and taking the between-class distance as the GRM input score, a GRM-based discriminant feature selection algorithm (GRM-DFS) is proposed. Comparison studies of within-class scatter matrices under different forms of regularization indicate that the proposed form is effective when dealing with imbalanced data. Experiments on public imbalanced datasets confirm the effectiveness of GRM-DFS.
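To make the construction concrete, one plausible instance of a within-class scatter matrix regularized to emphasize the minority class (the paper's exact form may differ) weights each class inversely to its size:

$$S_w = \sum_{c=1}^{C} w_c \sum_{x_i \in \mathcal{X}_c} (x_i - \mu_c)(x_i - \mu_c)^\top, \qquad w_c \propto \frac{1}{|\mathcal{X}_c|},$$

so the minority class contributes as much to $S_w$ as the majority class despite having fewer samples.

The sketch below illustrates the general GRM recipe the abstract describes: per-feature relevance scores from a class-weighted between-class distance, a feature redundancy matrix, and a quadratic objective minimized over the probability simplex. The inverse-frequency class weighting, the correlation-based redundancy measure, and the projected-gradient solver are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def simplex_projection(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def grm_dfs_sketch(X, y, gamma=1.0, n_iter=500, lr=1e-3):
    """Hypothetical GRM-style feature scoring: minimize
    f(x) = x^T A x - gamma * s^T x over the simplex, where A measures
    pairwise feature redundancy and s measures per-feature relevance."""
    n, d = X.shape
    classes, counts = np.unique(y, return_counts=True)
    # Inverse-frequency class weights: the minority class gets the largest weight.
    weights = {c: n / (len(classes) * cnt) for c, cnt in zip(classes, counts)}

    # Relevance s: class-weighted between-class distance of each feature.
    mu = X.mean(axis=0)
    s = np.zeros(d)
    for c, cnt in zip(classes, counts):
        mu_c = X[y == c].mean(axis=0)
        s += weights[c] * cnt * (mu_c - mu) ** 2
    s /= s.max() + 1e-12  # normalize relevance for a stable step size

    # Redundancy matrix A: absolute Pearson correlations between features.
    A = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))

    # Projected gradient descent on f(x), keeping x on the simplex.
    x = np.full(d, 1.0 / d)
    for _ in range(n_iter):
        x = simplex_projection(x - lr * (2.0 * A @ x - gamma * s))
    return x  # larger entries = relevant, non-redundant features

# Toy usage: feature 0 separates a ~10% minority class and should rank first.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (rng.random(200) < 0.1).astype(int)
X[y == 1, 0] += 3.0
print(np.argsort(grm_dfs_sketch(X, y))[::-1][:3])
```

Ranking features by the resulting vector and keeping the top-k gives the selected subset; the actual GRM-DFS additionally carries the weighted constraint on the majority class and its own regularized scatter matrix within the optimization.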



Author information

Correspondence to Hongmei Chen.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is supported by the National Natural Science Foundation of China (Nos. 61976182, 62076171, 61876157), the Key Program for International S&T Cooperation of Sichuan Province (2019YFH0097), and the Sichuan Key R&D Project (2020YFG0035).


About this article


Cite this article

Huang, S., Chen, H., Li, T. et al. Feature selection via minimizing global redundancy for imbalanced data. Appl Intell 52, 8685–8707 (2022). https://doi.org/10.1007/s10489-021-02855-9

