SPSO: Synthetic Protein Sequence Oversampling for Imbalanced Protein Data and Remote Homology Detection

Beigi, Majid; Zell, Andreas

doi:10.1007/11946465_10

Majid Beigi²² &
Andreas Zell²²

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 4345))

Included in the following conference series:

International Symposium on Biological and Medical Data Analysis

939 Accesses
2 Citations

Abstract

Many classifiers are designed with the assumption of well-balanced datasets. But in real problems, like protein classification and remote homology detection, when using binary classifiers like support vector machine (SVM) and kernel methods, we are facing imbalanced data in which we have a low number of protein sequences as positive data (minor class) compared with negative data (major class). A widely used solution to that issue in protein classification is using a different error cost or decision threshold for positive and negative data to control the sensitivity of the classifiers. Our experiments show that when the datasets are highly imbalanced, and especially with overlapped datasets, the efficiency and stability of that method decreases. This paper shows that a combination of the above method and our suggested oversampling method for protein sequences can increase the sensitivity and also stability of the classifier. Our method of oversampling involves creating synthetic protein sequences of the minor class, considering the distribution of that class and also of the major class, and it operates in data space instead of feature space. This method is very useful in remote homology detection, and we used real and artificial data with different distributions and overlappings of minor and major classes to measure the efficiency of our method. The method was evaluated by the area under the Receiver Operating Curve (ROC).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Leslie, C., Eskin, E., Cohen, A., Weston, J., Noble, W.S.: Mismatch string kernel for svm protein classification. Advances in Neural Information Processing System, 1441–1448 (2003)
Google Scholar
Al-Shahib, A., Breitling, R., Gilbert, D.: Feature selection and the class imbalance problem in predicting protein function from sequence. Appl. Bioinformatics 4(3), 195–203 (2005)
Google Scholar
Japkowicz, N.: Learning from imbalanved data sets: A comparison of various strategies. In: Proceedings of Learning from Imbalanced Data, pp. 10–15 (2000)
Google Scholar
Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on AI, pp. 55–60 (1999)
Google Scholar
Wu, G., Chang, E.: Class-boundary alignment for imbalanced dataset learning. In: ICML 2003 Workshop on Learning from Imbalanced Data Sets II, Washington, DC (2003)
Google Scholar
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence and Research 16, 321–357 (2002)
MATH Google Scholar
Leslie, C., Eskin, E., Noble, W.S.: The spectrum kernel: A string kernel for svm protein classification. In: Proceedings of the Pacific Symposium on Biocomputing, pp. 564–575 (2002)
Google Scholar
saigo, H., Vert, J.P., Ueda, N., akustu, T.: Protein homology detection using string alignment kernels. Bioinformatics 20(11), 1682–1689 (2004)
Article Google Scholar
Thompson, J.D., Higgins, D.G., Gibson, T.J.: Clustalw: improving the sesitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680 (1994)
Article Google Scholar
Attwood, T.K., Croning, M.D.R., Gaulton, A.: Deriving structural and functional insights from a ligand-based hierarchical classification of g-protein coupled receptors. Protein Eng. 15, 7–12 (2002)
Article Google Scholar
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohhen, F.E., Vriend, G.: Gpcrdb information system for g protein-coupled receptors. Nucleic Acids Res. 31(1), 294–297 (2003)
Article Google Scholar
Bairoch, A., Apweiler, R.: The swiss-prot protein sequence data bank and its supplement trembl. Nucleic Acids Res. 29, 346–349 (2001)
Article Google Scholar
Vert, J.-P., Saigo, H., Akustu, T.: Convolution and local alignment kernel. In: Schoelkopf, B., Tsuda, K., Vert, J.-P. (eds.) Kernel Methods in Compuatational Biology. The MIT Press, Cambridge
Google Scholar
Joachims, T.: Macking large scale svm learning practical. Technical Report LS8-24, Universitat Dortmond (1998)
Google Scholar
Provost, F., Fawcett, T.: Robust classification for imprecise environments. Machine Learning 423, 203–231 (2001)
Article Google Scholar
Swet, J.: Measuring the accuracy of diagnostic systems. Science 240, 1285–1293 (1988)
Article MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Center for Bioinformatics Tübingen (ZBIT), University of Tübingen, Sand 1, D-72076, Tübingen, Germany
Majid Beigi & Andreas Zell

Authors

Majid Beigi
View author publications
You can also search for this author in PubMed Google Scholar
Andreas Zell
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

The Medical School, Lab of Medical Informatics, Aristotle University, Box 323, 54124, Thessaloniki, Greece
Nicos Maglaveras & Ioanna Chouvarda &
Laboratory of Medical Informatics, Faculty of Medicine, Aristotle University of Thessaloniki, P.O.Box 323, 54124, Thessaloniki, Greece
Vassilis Koutkias
Institute of Informatics, Johann Wolfgang Goethe-Universität, 60054, Frankfurt a.M., Germany
Rüdiger Brause

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Beigi, M., Zell, A. (2006). SPSO: Synthetic Protein Sequence Oversampling for Imbalanced Protein Data and Remote Homology Detection. In: Maglaveras, N., Chouvarda, I., Koutkias, V., Brause, R. (eds) Biological and Medical Data Analysis. ISBMDA 2006. Lecture Notes in Computer Science(), vol 4345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946465_10

Download citation

DOI: https://doi.org/10.1007/11946465_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68063-5
Online ISBN: 978-3-540-68065-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics