Identifying Rare Classes with Sparse Training Data

Zhang, Mingwu; Jiang, Wei; Clifton, Chris; Prabhakar, Sunil

doi:10.1007/978-3-540-74469-6_73

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4653)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Building models and learning patterns from a collection of data are essential tasks for decision making and knowledge dissemination. A common way to extract knowledge is to build a classifier. However, when the training dataset is sparse, it is difficult to build an accurate classifier. This is especially true in the biological sciences, where data are hard to produce and error-prone. Through empirical results, this paper demonstrates the challenges of building an accurate classifier from a sparse biological training dataset. Our findings indicate inadequacies in well-known classification techniques. Although certain clustering techniques, such as seeded k-Means, show some promise, there is still room for improvement. In addition, we propose a novel idea that could be used to produce a more balanced classifier when training samples are very limited.
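For readers unfamiliar with seeded k-Means, the following is a minimal sketch of the general idea: the few labeled samples (the seeds) are used only to initialise one centroid per class, after which ordinary k-Means iterations run over all samples. It is not the configuration evaluated in the paper; the function name `seeded_kmeans`, the Euclidean distance, and the toy data are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def seeded_kmeans(points, seeds, max_iter=100):
    """Illustrative seeded k-Means (hypothetical helper, not the paper's code).

    points: list of feature tuples (labeled and unlabeled samples together).
    seeds:  dict mapping class label -> list of labeled feature tuples.
    Returns (centroids, assignments); assignments[i] is the label for points[i].
    """
    def mean(vectors):
        n = len(vectors)
        return tuple(sum(column) / n for column in zip(*vectors))

    # Seeding step: each class centroid starts at the mean of its labeled seeds.
    centroids = {label: mean(vectors) for label, vectors in seeds.items()}
    assignments = None

    for _ in range(max_iter):
        # Assignment step: every point goes to its nearest centroid.
        new_assignments = [
            min(centroids, key=lambda label: dist(p, centroids[label]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments

        # Update step: recompute each centroid from its current members.
        for label in centroids:
            members = [p for p, a in zip(points, assignments) if a == label]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[label] = mean(members)

    return centroids, assignments


# Toy data (assumed): three samples near the origin, two far away,
# with a single labeled seed per class.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
seeds = {"common": [(0, 0)], "rare": [(10, 10)]}
centroids, labels = seeded_kmeans(points, seeds)
```

Because the seeds fix which cluster corresponds to which class, the resulting clusters can be read directly as class predictions, which is what makes this family of methods attractive when only a handful of labeled samples per class are available.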

Author information

Authors and Affiliations

Department of Computer Science, Purdue University, West Lafayette, IN 47907-2107, USA
Mingwu Zhang, Wei Jiang, Chris Clifton & Sunil Prabhakar

Editor information

Roland Wagner, Norman Revell, Günther Pernul

Cite this paper

Zhang, M., Jiang, W., Clifton, C., Prabhakar, S. (2007). Identifying Rare Classes with Sparse Training Data. In: Wagner, R., Revell, N., Pernul, G. (eds) Database and Expert Systems Applications. DEXA 2007. Lecture Notes in Computer Science, vol 4653. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74469-6_73

DOI: https://doi.org/10.1007/978-3-540-74469-6_73
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74467-2
Online ISBN: 978-3-540-74469-6
eBook Packages: Computer Science, Computer Science (R0)
