Effects of Oversampling Versus Cost-Sensitive Learning for Bayesian and SVM Classifiers

Liu, Alexander; Martin, Cheryl; La Cour, Brian; Ghosh, Joydeep

doi:10.1007/978-1-4419-1280-0_8

Part of the book series: Annals of Information Systems (AOIS, volume 8)

Abstract

In this chapter, we examine the relationship between cost-sensitive learning and resampling. We first introduce these concepts, including a new resampling method called “generative oversampling,” which creates new data points by learning parameters for an assumed probability distribution. We then examine theoretically and empirically the effects of different forms of resampling and their relationship to cost-sensitive learning on different classifiers and different data characteristics. For example, we show that generative oversampling used with linear SVMs provides the best results for a variety of text data sets. In contrast, no significant performance difference is observed for low-dimensional data sets when using Gaussians to model distributions in a naive Bayes classifier. Our theoretical and empirical results in these and other cases support the conclusion that the relative performance of cost-sensitive learning and resampling is dependent on both the classifier and the data characteristics.
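
As a rough illustration of the generative oversampling idea described in the abstract, the sketch below fits a distribution to the minority class and draws new points from it. The abstract does not say which distribution the chapter assumes for any given data set, so the independent per-feature Gaussian used here is an assumption made purely for illustration, and the function name `generative_oversample` and its parameters are hypothetical.

```python
import random
import statistics


def generative_oversample(X, y, minority_label, n_new, seed=None):
    """Append n_new synthetic minority points sampled from a fitted model.

    Assumption (not from the chapter): each feature of the minority class
    is modelled as an independent Gaussian.
    """
    rng = random.Random(seed)

    # Collect the minority-class rows.
    minority = [row for row, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority points to estimate a variance")

    # Learn the parameters of the assumed distribution: one mean and one
    # sample standard deviation per feature.
    columns = list(zip(*minority))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]

    # Sample new points from the learned distribution.
    synthetic = [
        [rng.gauss(m, s) for m, s in zip(means, stdevs)]
        for _ in range(n_new)
    ]

    # Return the original data followed by the synthetic minority points.
    return list(X) + synthetic, list(y) + [minority_label] * n_new


# Toy data: four majority points (label 0) and two minority points (label 1).
X = [[0.1, 0.2], [0.0, -0.3], [0.4, 0.1], [-0.2, 0.0], [3.0, 3.1], [2.8, 3.3]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = generative_oversample(X, y, minority_label=1, n_new=2, seed=0)
```

Unlike oversampling by replication, the added points are not copies of existing minority examples. For text data a different assumed distribution would likely be more appropriate than a Gaussian; that choice is left open here.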

Author information

Authors and Affiliations

Applied Research Labs, University of Texas at Austin, Austin, TX, USA
Alexander Liu, Cheryl Martin & Brian La Cour
Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA
Alexander Liu & Joydeep Ghosh

Corresponding authors

Correspondence to Alexander Liu, Cheryl Martin, Brian La Cour or Joydeep Ghosh.

Editor information

Editors and Affiliations

Inst. Wirtschaftsinformatik, Universität Hamburg, Von-Melle-Park 5, Hamburg, 20146, Germany
Robert Stahlbock
Management School, Dept. Management Science, Lancaster University, Lancaster, LA1 4YX, United Kingdom
Sven F. Crone
Inst. Wirtschaftsinformatik, Universität Hamburg, Von-Melle-Park 5, Hamburg, 20146, Germany
Stefan Lessmann

About this chapter

Cite this chapter

Liu, A., Martin, C., La Cour, B., Ghosh, J. (2010). Effects of Oversampling Versus Cost-Sensitive Learning for Bayesian and SVM Classifiers. In: Stahlbock, R., Crone, S., Lessmann, S. (eds) Data Mining. Annals of Information Systems, vol 8. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-1280-0_8

DOI: https://doi.org/10.1007/978-1-4419-1280-0_8
Published: 15 October 2009
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-1279-4
Online ISBN: 978-1-4419-1280-0
eBook Packages: Computer Science, Computer Science (R0)
