Tackling class overlap and imbalance problems in software defect prediction

Chen, Lin; Fang, Bin; Shang, Zhaowei; Tang, Yuanyan

doi:10.1007/s11219-016-9342-6

Published: 25 September 2016

Software Quality Journal, Volume 26, pages 97–125 (2018)

Lin Chen¹, Bin Fang¹, Zhaowei Shang¹ & Yuanyan Tang²

Abstract

Software defect prediction (SDP) is a promising solution to save time and cost in the software testing phase for improving software quality. Numerous machine learning approaches have proven effective in SDP. However, the unbalanced class distribution in SDP datasets could be a problem for some conventional learning methods. In addition, class overlap increases the difficulty for the predictors to learn the defective class accurately. In this study, we propose a new SDP model which combines class overlap reduction and ensemble imbalance learning to improve defect prediction. First, the neighbor cleaning method is applied to remove the overlapping non-defective samples. The whole dataset is then randomly under-sampled several times to generate balanced subsets so that multiple classifiers can be trained on these data. Finally, these individual classifiers are assembled with the AdaBoost mechanism to build the final prediction model. In the experiments, we investigated nine highly unbalanced datasets selected from a public software repository and confirmed that the high rate of overlap between classes existed in SDP data. We assessed the performance of our proposed model by comparing it with other state-of-the-art methods including conventional SDP models, imbalance learning and data cleaning methods. Test results and statistical analysis show that the proposed model provides more reasonable defect prediction results and performs best in terms of G-mean and AUC among all tested models.
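The data preparation steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`neighbor_clean`, `balanced_subsets`, `g_mean`), the choice of k = 3, Euclidean distance, and the label convention (1 = defective, 0 = non-defective) are all assumptions made for the example. The cleaning rule is a simplified form of neighborhood cleaning, and G-mean is computed using its usual definition as the geometric mean of the true positive rate and the true negative rate. Training one classifier per subset and combining them with AdaBoost is omitted.

```python
import math
import random
from collections import Counter


def knn_indices(X, i, k):
    """Indices of the k nearest neighbours of sample i (Euclidean, excluding i)."""
    dists = [(math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i]
    return [j for _, j in sorted(dists)[:k]]


def neighbor_clean(X, y, k=3, minority=1):
    """Simplified neighbourhood cleaning: remove overlapping majority samples only."""
    drop = set()
    for i in range(len(X)):
        nbrs = knn_indices(X, i, k)
        vote = Counter(y[j] for j in nbrs).most_common(1)[0][0]
        if vote == y[i]:
            continue  # sample agrees with its neighbourhood, keep everything
        if y[i] != minority:
            drop.add(i)  # majority sample lying inside a minority region
        else:
            # misclassified minority sample: remove its majority neighbours
            drop.update(j for j in nbrs if y[j] != minority)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def balanced_subsets(X, y, n_subsets=5, minority=1, seed=0):
    """Randomly under-sample the majority class several times.

    Each subset holds every minority sample plus an equally sized random
    draw (without replacement) from the majority class.
    """
    rng = random.Random(seed)
    min_idx = [i for i, c in enumerate(y) if c == minority]
    maj_idx = [i for i, c in enumerate(y) if c != minority]
    size = min(len(min_idx), len(maj_idx))
    subsets = []
    for _ in range(n_subsets):
        chosen = min_idx + rng.sample(maj_idx, size)
        subsets.append(([X[i] for i in chosen], [y[i] for i in chosen]))
    return subsets


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of true positive rate and true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)
```

Each subset returned by `balanced_subsets` would then serve as the training set for one base classifier. Because G-mean multiplies the two per-class rates, a model that ignores the defective class scores zero however high its overall accuracy, which is why it suits highly unbalanced data.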

Acknowledgments

This paper is supported by the National Key Basic Research Program of China (973 program 2013CB329103 of 2013CB329100), the Program for Natural Science Foundation of China (No. 61672120, No. 61472053, No. 91118005), the Doctoral Program of Higher Education (20120191110027), and the Natural Science Foundation of Chongqing (No. CSTC2010BB2217, No. cstc2012jjA40017).

Author information

Authors and Affiliations

Department of Computer Science, Chongqing University, Chongqing, 400030, China
Lin Chen, Bin Fang & Zhaowei Shang
Faculty of Science and Technology, University of Macau, Macau, China
Yuanyan Tang

Corresponding author

Correspondence to Lin Chen.

About this article

Cite this article

Chen, L., Fang, B., Shang, Z. et al. Tackling class overlap and imbalance problems in software defect prediction. Software Qual J 26, 97–125 (2018). https://doi.org/10.1007/s11219-016-9342-6

Published: 25 September 2016
Issue Date: March 2018
DOI: https://doi.org/10.1007/s11219-016-9342-6
