Abstract
Classifying protein sequences has important applications in areas such as disease diagnosis, treatment development and drug design. In this paper we present a highly accurate classifier called the g-MARS (gapped Markov Chain with Support Vector Machine) protein classifier. It models the structure of a protein sequence by measuring the transition probabilities between pairs of amino acids. This results in a Markov chain style model for each protein sequence. Then, to capture the similarity among non-exactly matching protein sequences, we show that this model can be generalized to incorporate gaps in the Markov chain. We perform a thorough experimental study and compare g-MARS to several other state-of-the-art protein classifiers. Overall, we demonstrate that g-MARS has superior accuracy and operates efficiently on a diverse range of protein families.
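The abstract does not give the exact feature construction, so the following is only a minimal sketch of the general idea it describes: estimating transition probabilities between pairs of amino acids, with an optional gap between the two positions. The function names, the `max_gap` parameter, the row-wise normalization and the flattening into a single vector are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gapped_transition_probs(seq, gap=0):
    """Estimate P(b | a) for residues a, b separated by `gap` positions.

    gap=0 gives the ordinary first-order Markov chain (adjacent residues).
    The normalization used here is an assumption, not the paper's definition.
    """
    step = gap + 1
    pair_counts = Counter()
    source_counts = Counter()
    for a, b in zip(seq, seq[step:]):
        # Skip non-standard symbols such as X or B.
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            pair_counts[(a, b)] += 1
            source_counts[a] += 1
    return {(a, b): n / source_counts[a] for (a, b), n in pair_counts.items()}


def feature_vector(seq, max_gap=2):
    """Concatenate the 20 x 20 transition tables for gaps 0..max_gap."""
    features = []
    for gap in range(max_gap + 1):
        probs = gapped_transition_probs(seq, gap)
        features.extend(
            probs.get((a, b), 0.0) for a in AMINO_ACIDS for b in AMINO_ACIDS
        )
    return features
```

Under these assumptions each sequence maps to a fixed-length vector of 400 × (`max_gap` + 1) values regardless of its length, which is the kind of input a support vector machine can be trained on.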
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ji, X., Bailey, J., Ramamohanarao, K. (2008). g-MARS: Protein Classification Using Gapped Markov Chains and Support Vector Machines. In: Chetty, M., Ngom, A., Ahmad, S. (eds) Pattern Recognition in Bioinformatics. PRIB 2008. Lecture Notes in Computer Science, vol 5265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88436-1_15
DOI: https://doi.org/10.1007/978-3-540-88436-1_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88434-7
Online ISBN: 978-3-540-88436-1
eBook Packages: Computer Science (R0)