Identifying Rare Classes with Sparse Training Data

  • Conference paper
Database and Expert Systems Applications (DEXA 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4653)

Abstract

Building models and learning patterns from a collection of data are essential tasks for decision making and the dissemination of knowledge. A common way to extract knowledge is to build a classifier. However, when the training dataset is sparse, it is difficult to build an accurate one. This is especially true in the biological sciences, where data are hard to produce and error-prone. Through empirical results, this paper demonstrates the challenges of building an accurate classifier from a sparse biological training dataset. Our findings reveal inadequacies in well-known classification techniques. Although certain clustering techniques, such as seeded k-Means, show some promise, there is still room for improvement. In addition, we propose a novel idea that could be used to produce a more balanced classifier when training samples are very limited.
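To make the seeded k-Means idea mentioned in the abstract concrete, the following is a minimal sketch of the general technique, not the paper's implementation. It assumes that the few labeled samples (the "seeds") are used only to initialise one centroid per class, after which ordinary k-Means iterations run over all points. The function names, the Euclidean distance, and the toy data are illustrative assumptions.

```python
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def seeded_kmeans(points, seeds, max_iter=100):
    """Cluster `points`, initialising centroids from labeled `seeds`.

    points: list of numeric tuples (all data to be clustered).
    seeds:  dict mapping a class label to a list of labeled points.
    Returns (centroids, assignments): a dict label -> centroid and a
    list holding the label assigned to each point.
    """
    # Assumption: one cluster per class, started at the mean of its seeds.
    centroids = {label: mean(pts) for label, pts in seeds.items()}
    assignments = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_assignments = [
            min(centroids, key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # assignments are stable, so the centroids are too
        assignments = new_assignments
        # Recompute each centroid from its current members.
        for label in centroids:
            members = [p for p, a in zip(points, assignments) if a == label]
            if members:  # keep the old centroid if a cluster is empty
                centroids[label] = mean(members)
    return centroids, assignments


# Toy data: one labeled seed per class, including a hypothetical "rare" class.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.1, 4.9)]
seeds = {"common": [(0.0, 0.0)], "rare": [(5.0, 5.0)]}
centroids, labels = seeded_kmeans(points, seeds)
print(labels)
```

Because the seeds fix where each cluster starts, even a class with a single labeled sample gets its own centroid, which is one reason such methods may help when training data are sparse.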

Editor information

Roland Wagner, Norman Revell, Günther Pernul

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, M., Jiang, W., Clifton, C., Prabhakar, S. (2007). Identifying Rare Classes with Sparse Training Data. In: Wagner, R., Revell, N., Pernul, G. (eds) Database and Expert Systems Applications. DEXA 2007. Lecture Notes in Computer Science, vol 4653. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74469-6_73

  • DOI: https://doi.org/10.1007/978-3-540-74469-6_73

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74467-2

  • Online ISBN: 978-3-540-74469-6

  • eBook Packages: Computer Science, Computer Science (R0)
