Abstract
Genetic programming (GP) has been successfully applied to classification. However, GP may evolve biased classifiers when the classes are imbalanced, and such classifiers are often too unreliable for real-world applications. High dimensionality makes it even harder for classifiers to separate the majority class from the minority class. The use of GP to handle the joint effect of high dimensionality and class imbalance has not been widely investigated. In this paper, we propose a GP approach to high-dimensional imbalanced classification, with the goals of improving classification performance and reducing training time. To achieve these goals, a new fitness function is developed to address class imbalance, and a strategy is proposed to reuse previously evolved good GP individuals to improve efficiency. The proposed method is evaluated on ten high-dimensional imbalanced datasets. Experimental results show that, for high-dimensional imbalanced classification, the proposed method generally outperforms other GP methods and traditional classification algorithms that use sampling methods to address class imbalance.
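To illustrate why a fitness function based on overall accuracy can favour biased classifiers under class imbalance, the following minimal sketch compares plain accuracy with balanced accuracy (the mean of the true positive rate and the true negative rate). It is not the fitness function proposed in this paper, whose definition is not given in the abstract. Balanced accuracy is used here only as a common imbalance-aware measure, and the function names, the class labels and the 95/5 class split are assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances classified correctly (overall accuracy)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred, minority=1, majority=0):
    """Mean of the per-class accuracies (illustrative measure only,
    not the fitness function proposed in the paper)."""
    pairs = list(zip(y_true, y_pred))
    n_min = sum(1 for t in y_true if t == minority)
    n_maj = sum(1 for t in y_true if t == majority)
    # True positive rate: minority instances predicted as minority.
    tpr = sum(1 for t, p in pairs if t == minority and p == minority) / n_min
    # True negative rate: majority instances predicted as majority.
    tnr = sum(1 for t, p in pairs if t == majority and p == majority) / n_maj
    return 0.5 * (tpr + tnr)


# Hypothetical imbalanced data: 95 majority (0) and 5 minority (1) instances.
y_true = [0] * 95 + [1] * 5
# A biased classifier that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # high, despite missing every minority instance
print(balanced_accuracy(y_true, y_pred))  # no better than random guessing
```

In this example the always-majority classifier scores well on overall accuracy yet detects no minority instances, which is the kind of bias an imbalance-aware fitness function is meant to penalise.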
Acknowledgements
This work was supported in part by the Marsden Fund of the New Zealand Government under contracts VUW1509 (Mengjie Zhang) and VUW1615 (Bing Xue and Mengjie Zhang), the Science for Technological Innovation Challenge (SfTI) fund under grant E3603/2903 (Mengjie Zhang and Bing Xue), the University Research Fund at Victoria University of Wellington (grant numbers 216378/3764 and 223805/3986, Bing Xue and Mengjie Zhang), the MBIE Data Science SSIF Fund under contract RTVU1914 (Mengjie Zhang and Bing Xue), and the National Natural Science Foundation of China (NSFC) under grants 61876169 (Bing Xue) and 61672276 (Lin Shang). Wenbin Pei was supported by a China Scholarship Council/Victoria University Scholarship.
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Pei, W., Xue, B., Shang, L. et al. Genetic programming for high-dimensional imbalanced classification with a new fitness function and program reuse mechanism. Soft Comput 24, 18021–18038 (2020). https://doi.org/10.1007/s00500-020-05056-7
DOI: https://doi.org/10.1007/s00500-020-05056-7