Effects of Oversampling Versus Cost-Sensitive Learning for Bayesian and SVM Classifiers

Data Mining

Part of the book series: Annals of Information Systems ((AOIS,volume 8))

Abstract

In this chapter, we examine the relationship between cost-sensitive learning and resampling. We first introduce these concepts, including a new resampling method called “generative oversampling,” which creates new data points by learning parameters for an assumed probability distribution. We then examine, theoretically and empirically, the effects of different forms of resampling and their relationship to cost-sensitive learning across different classifiers and data characteristics. For example, we show that generative oversampling used with linear SVMs provides the best results for a variety of text data sets. In contrast, no significant performance difference is observed for low-dimensional data sets when using Gaussians to model distributions in a naive Bayes classifier. Our theoretical and empirical results in these and other cases support the conclusion that the relative performance of cost-sensitive learning and resampling depends on both the classifier and the data characteristics.
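The idea behind generative oversampling, as the abstract describes it, can be illustrated with a minimal sketch: fit the parameters of an assumed distribution to the minority class, then draw synthetic points from it until the classes are balanced. The abstract does not say which distributions the chapter uses for oversampling, so the independent per-feature Gaussian below is an assumption chosen for illustration. The function name `generative_oversample`, its parameters, and the toy data are likewise hypothetical, and the sketch handles only the binary case.

```python
import random
import statistics


def generative_oversample(X, y, minority_label, seed=None):
    """Balance a binary data set with synthetic minority points.

    Assumption (not from the chapter): each minority feature is modelled
    as an independent Gaussian whose mean and standard deviation are
    estimated from the observed minority points.
    """
    rng = random.Random(seed)
    minority = [row for row, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority points to estimate a variance")

    # Number of synthetic points required to match the majority class size.
    n_needed = (len(y) - len(minority)) - len(minority)
    if n_needed <= 0:
        return list(X), list(y)

    # Learn the parameters of the assumed distribution, one feature at a time.
    columns = list(zip(*minority))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]

    # Draw new points from the fitted distribution (not copies of old ones).
    synthetic = [
        [rng.gauss(m, s) for m, s in zip(means, stdevs)]
        for _ in range(n_needed)
    ]
    return list(X) + synthetic, list(y) + [minority_label] * n_needed


# Toy data: four majority points (label 0) and two minority points (label 1).
X = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0], [-0.1, 0.2], [3.0, 3.1], [3.2, 2.9]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = generative_oversample(X, y, minority_label=1, seed=0)
```

Unlike oversampling by replication, the added rows here are new draws from the fitted model rather than duplicates of existing minority points. For text data a different assumed distribution would likely be more appropriate than a Gaussian; that choice is left open in this sketch.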

Author information

Correspondence to Alexander Liu, Cheryl Martin, Brian La Cour or Joydeep Ghosh.

Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Liu, A., Martin, C., La Cour, B., Ghosh, J. (2010). Effects of Oversampling Versus Cost-Sensitive Learning for Bayesian and SVM Classifiers. In: Stahlbock, R., Crone, S., Lessmann, S. (eds) Data Mining. Annals of Information Systems, vol 8. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-1280-0_8

  • DOI: https://doi.org/10.1007/978-1-4419-1280-0_8

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-1279-4

  • Online ISBN: 978-1-4419-1280-0

  • eBook Packages: Computer Science, Computer Science (R0)
