Predicting Rare Classes of Primary Tumors with Over-Sampling Techniques

Kerdprasop, Nittaya; Kerdprasop, Kittisak

doi:10.1007/978-3-642-27157-1_17

Part of the book series: Communications in Computer and Information Science (CCIS, volume 258)

Abstract

The discovery of hidden biomedical patterns in large clinical databases can uncover knowledge that supports prognostic and diagnostic decision making. Researchers and healthcare professionals have applied data mining technology to obtain descriptive patterns and predictive models from biomedical and healthcare databases. However, clinical applications of data mining algorithms suffer from low predictive accuracy, which hampers their wide use in the clinical environment. We therefore focus our study on improving the predictive accuracy of models created by data mining algorithms. Our main research interest is the problem of learning a tree-based classifier from a multiclass data set in which some minority classes have a low prevalence rate. We apply random over-sampling and the synthetic minority over-sampling technique (SMOTE) to increase the predictive performance of the learned model. In our study, we treat specific kinds of primary tumors that occur at a frequency of less than one percent as rare classes. In the experimental results, the SMOTE technique yielded a model with high specificity, whereas random over-sampling produced a classifier with high sensitivity. The precision of a tree-based model obtained with random over-sampling is, on average, much better than that of the model learned from the original imbalanced data set.
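The two over-sampling ideas and the per-class measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy data, and the choice of `k` are hypothetical, and the SMOTE-style interpolation shown here assumes purely numeric features.

```python
import random
from collections import Counter


def random_oversample(X, y, target_count, rng):
    """Duplicate randomly chosen rows of each class until it has target_count rows."""
    X_out, y_out = list(X), list(y)
    for label, n in Counter(y).items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(max(0, target_count - n)):
            X_out.append(list(rng.choice(rows)))  # exact copy of an existing row
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, label, n_new, k, rng):
    """Create n_new synthetic rows for `label` by interpolating between a
    minority row and one of its k nearest same-class neighbours.
    Assumption: all features are numeric."""
    rows = [x for x, lab in zip(X, y) if lab == label]
    if len(rows) < 2:
        raise ValueError("need at least two rows of the class to interpolate")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        others = [r for r in rows if r is not base]
        # Sort candidates by squared Euclidean distance to the base row.
        others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def one_vs_rest_metrics(y_true, y_pred, label):
    """Sensitivity, specificity and precision for one class against all others."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Toy data (hypothetical): four rows of a common class, two of a rare class.
X = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]]
y = ["common"] * 4 + ["rare"] * 2
rng = random.Random(0)

X_ros, y_ros = random_oversample(X, y, target_count=4, rng=rng)
rare_synthetic = smote_like(X, y, label="rare", n_new=3, k=1, rng=rng)
```

Random over-sampling only repeats existing rare cases, whereas the SMOTE-style step places new points between neighbouring rare cases. In a multiclass setting, the metrics would be computed once per tumor class in this one-vs-rest fashion.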

Author information

Authors and Affiliations

Data Engineering Research Unit, School of Computer Engineering, Suranaree University of Technology, 111 University Avenue, Nakhon Ratchasima 30000, Thailand
Nittaya Kerdprasop & Kittisak Kerdprasop

Editor information

Editors and Affiliations

Multimedia Engineering Department, Hannam University, 133 Ojeong-dong, Daeduk-gu, Daejeon, Korea
Tai-hoon Kim
The Ohio State University, 470 Hitchcock Hall, 2070 Neil Avenue, 43210-1275, Columbus, OH, USA
Hojjat Adeli
University of Calabria, Via P. Bucci, 41C, I-87036, Rende, Cosenza, Italy
Alfredo Cuzzocrea
Engineering and Electronics, Edinburgh University, King’s Buildings, Faraday, rm 3.101, Mayfield Road, EH9 3JL, Edinburgh, UK
Tughrul Arslan
Victoria University, 8001, Melbourne, VIC, Australia
Yanchun Zhang
Hosei University, 184-8584, Tokyo, Japan
Jianhua Ma
Information Security Infrastructure Research Group, Electronics and Telecommunications Research Institute (ETRI), 161, Gajeong-Dong, Yuseong-Gu, Daejeon, Korea
Kyo-il Chung
University of Technology Malaysia, 81310, Johor, Malaysia
Siti Mariyam
Department of Biomedical Engineering, Nanjing University of Aeronautics & Astronautics, 210016, Nanjing, P.R. China
Xiaofeng Song

About this paper

Cite this paper

Kerdprasop, N., Kerdprasop, K. (2011). Predicting Rare Classes of Primary Tumors with Over-Sampling Techniques. In: Kim, Th., et al. (eds.) Database Theory and Application, Bio-Science and Bio-Technology. DTA and BSBT 2011. Communications in Computer and Information Science, vol 258. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27157-1_17

DOI: https://doi.org/10.1007/978-3-642-27157-1_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27156-4
Online ISBN: 978-3-642-27157-1
eBook Packages: Computer Science, Computer Science (R0)
