
Predicting Rare Classes of Primary Tumors with Over-Sampling Techniques

  • Conference paper
Database Theory and Application, Bio-Science and Bio-Technology (BSBT 2011, DTA 2011)

Abstract

The discovery of hidden biomedical patterns in large clinical databases can uncover knowledge that supports prognostic and diagnostic decision making. Researchers and healthcare professionals have applied data mining technology to obtain descriptive patterns and predictive models from biomedical and healthcare databases. However, the clinical application of data mining algorithms suffers from low predictive accuracy, which hampers their wide use in the clinical environment. We therefore focus our study on improving the predictive accuracy of the models created by data mining algorithms. Our main research interest concerns the problem of learning a tree-based classifier from a multiclass data set in which some minority classes have a low prevalence rate. We apply random over-sampling and the synthetic minority over-sampling technique (SMOTE) to increase the predictive performance of the learned model. In our study, we treat specific kinds of primary tumors that occur at a frequency of less than one percent as rare classes. In the experimental results, the SMOTE technique gave a high-specificity model, whereas random over-sampling produced a high-sensitivity classifier. The precision of the tree-based model obtained with random over-sampling is, on average, much better than that of the model learned from the original imbalanced data set.
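The two over-sampling techniques and the three evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`random_oversample`, `smote`, `per_class_metrics`) are hypothetical, the features are assumed to be numeric vectors, and the SMOTE step follows only the basic idea of interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random
from collections import defaultdict


def random_oversample(X, y, rng):
    """Duplicate randomly chosen rows until every class is as large as the biggest one."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            X_out.append(rng.choice(rows))  # exact copy of an existing row
            y_out.append(label)
    return X_out, y_out


def smote(rows, n_new, k, rng):
    """Create n_new synthetic rows for one minority class (needs at least 2 rows).

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        others = [r for r in rows if r is not base]
        # Sort the remaining minority rows by squared Euclidean distance.
        others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def per_class_metrics(y_true, y_pred, label):
    """One-vs-rest sensitivity, specificity, and precision for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)

    def ratio(num, den):
        return num / den if den else 0.0  # convention assumed here for empty denominators

    return ratio(tp, tp + fn), ratio(tn, tn + fp), ratio(tp, tp + fp)
```

Random over-sampling only repeats existing rare cases, while SMOTE adds new points between them, which is one plausible reason the two techniques can trade sensitivity against specificity differently. For data with categorical attributes, the interpolation step would need a variant designed for nominal features.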




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kerdprasop, N., Kerdprasop, K. (2011). Predicting Rare Classes of Primary Tumors with Over-Sampling Techniques. In: Kim, Th., et al. Database Theory and Application, Bio-Science and Bio-Technology. BSBT 2011, DTA 2011. Communications in Computer and Information Science, vol 258. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27157-1_17


  • DOI: https://doi.org/10.1007/978-3-642-27157-1_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27156-4

  • Online ISBN: 978-3-642-27157-1

  • eBook Packages: Computer Science, Computer Science (R0)
