
Rare Event Prediction Using Similarity Majority Under-Sampling Technique

  • Conference paper
Soft Computing in Data Science (SCDS 2017)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 788)

Abstract

In data mining, it is not uncommon to face an imbalanced classification problem in which the samples of interest are rare. Training a classifier on too many ordinary samples but too few rare samples misleads it: the classifier over-fits by learning too much from the majority class and under-fits the minority class, lacking the power to recognise its samples. This work studies a novel rebalancing technique that under-samples (reduces by sampling) the majority class to subside the imbalanced class distribution without synthesizing extra training samples. This simple method is called the Similarity Majority Under-Sampling Technique (SMUTE). By measuring the similarity between each majority class sample and its surrounding minority class samples, SMUTE effectively discriminates the majority from the minority class samples while largely preserving the underlying non-linear mapping between the input variables and the target classes. Two experiments are conducted and reported in this paper: one is an extensive performance comparison of SMUTE against the state of the art on generated imbalanced data; the other uses real data from a natural disaster prevention case in which accident samples are rare. SMUTE is found to work favourably well against the other methods in both cases.
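The abstract gives no pseudocode, but the core idea it describes (score each majority class sample by its similarity to nearby minority class samples, then under-sample the majority class accordingly) can be sketched as follows. This is a minimal illustrative sketch, not the authors' exact formulation: the function name, the choice of cosine similarity, the `k` neighbourhood size, and the `keep_ratio` parameter are all assumptions made for the example.

```python
import numpy as np

def smute_undersample(X_maj, X_min, keep_ratio=0.5, k=5):
    """Illustrative similarity-based majority under-sampling sketch.

    Scores each majority-class sample by its mean cosine similarity to
    its k most similar minority-class samples, then discards the
    majority samples most similar to the minority class until only
    keep_ratio of the majority class remains.
    """
    def normalise(A):
        # Normalise rows so plain dot products become cosine similarities.
        norms = np.linalg.norm(A, axis=1, keepdims=True)
        return A / np.where(norms == 0, 1, norms)

    sim = normalise(X_maj) @ normalise(X_min).T   # shape (n_maj, n_min)
    k = min(k, X_min.shape[0])
    # Mean similarity to the k most similar minority samples.
    topk = np.sort(sim, axis=1)[:, -k:]
    scores = topk.mean(axis=1)

    n_keep = max(1, int(keep_ratio * X_maj.shape[0]))
    # Retain the majority samples least similar to the minority class,
    # so the class boundary region is thinned out rather than the bulk.
    keep_idx = np.argsort(scores)[:n_keep]
    return X_maj[keep_idx]
```

The retained subset can then be concatenated with the minority class to train any standard classifier; because no synthetic samples are generated, the feature distributions of the surviving samples are untouched, which is the property the abstract emphasises.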


Notes

  1. http://archive.ics.uci.edu/ml/datasets/seismic-bumps.


Acknowledgement

The authors are thankful to the financial support from the research grants, (1) MYRG2016-00069, titled ‘Nature-Inspired Computing and Metaheuristics Algorithms for Optimizing Data Mining Performance’ offered by RDAO/FST, University of Macau and Macau SAR government. (2) FDCT/126/2014/A3, titled ‘A Scalable Data Stream Mining Methodology: Stream-based Holistic Analytics and Reasoning in Parallel’ offered by FDCT of Macau SAR government. Special thanks go to a Master student, Jin Zhen, for her kind assistance in programming and experimentation.

Author information

Corresponding author

Correspondence to Simon Fong.


Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Li, J. et al. (2017). Rare Event Prediction Using Similarity Majority Under-Sampling Technique. In: Mohamed, A., Berry, M., Yap, B. (eds) Soft Computing in Data Science. SCDS 2017. Communications in Computer and Information Science, vol 788. Springer, Singapore. https://doi.org/10.1007/978-981-10-7242-0_3

  • DOI: https://doi.org/10.1007/978-981-10-7242-0_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-7241-3

  • Online ISBN: 978-981-10-7242-0

