Abstract
Prediction of novel genes associated to a disease is an important issue in biomedical research. At early days, annotation-based methods were proposed for this problem. In next stage, with high-throughput technologies, data of interaction between genes/proteins has grown quickly and covered almost genome and proteome, and therefore network-based methods for the issue is becoming prominent. Besides those two methods, the prediction problem can be also approached using machine learning techniques because it can be formulated as a classification task of machine learning. To date, a number of supervised learning techniques and various types of gene/protein annotation data have been used to solve the disease gene classification/ prediction problem. However, to the best of our knowledge, there has been no study on the comparison of these methods that work on comprehensive biomedical annotation data. In addition, it is generally true that no classifier is better than others for all classification problems. Therefore, in this study, we compare the performance of disease gene prediction of several supervised learning techniques that have been used in the literature such as Decision Tree Learning, k-Nearest Neighbor, Naive Bayesian, Artificial Neural Networks and Support Vector Machines. We additionally assess Random Forest, a relatively new decision-tree-based ensemble learning method. The simulation results indicate that Random Forest obtained the best performance of all. Also, all methods are stable with the change of known disease genes used as positive training samples.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Kann, M.G.: Advances in translational bioinformatics: computational approaches for the hunting of disease genes. Briefings in Bioinformatics 11, 96–110 (2009)
Tranchevent, L.-C., et al.: A guide to web tools to prioritize candidate genes. Briefings in Bioinformatics 12, 22–32 (2010)
Turner, F., et al.: POCUS: mining genomic sequence annotation to predict disease genes. Genome Biology 4, R75 (2003)
Adie, E.A., et al.: SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 22, 773–774 (2006)
Aerts, S., et al.: Gene prioritization through genomic data fusion. Nature Biotechnology 24, 537–544 (2006)
Chen, J., et al.: Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics 8, 392 (2007)
Wang, X., et al.: Network-based methods for human disease gene prediction. Briefings in Functional Genomics 10, 280–293 (2011)
Tarca, A.L., et al.: Machine learning and its applications to biology. PLoS Computational Biology 3, e116 (2007)
Larrañaga, P., et al.: Machine learning in bioinformatics. Briefings in Bioinformatics 7, 86–112 (2006)
Yip, K.Y., et al.: Machine learning and genome annotation: a match meant to be? Genome Biology 14, 205 (2013)
de Ridder, D., et al.: Pattern recognition in bioinformatics. Briefings in Bioinformatics 14, 633–647 (2013)
Basford, K.E., et al.: On the classification of microarray gene-expression data. Briefings in Bioinformatics 14, 402–410 (2013)
Maetschke, S.R., et al.: Supervised, semi-supervised and unsupervised inference of gene regulatory networks. Briefings in Bioinformatics (2013)
Ding, H., et al.: Similarity-based machine learning methods for predicting drug-target interactions: a brief review. Briefings in Bioinformatics (2013)
Upstill-Goddard, R., et al.: Machine learning approaches for the discovery of gene-gene interactions in disease data. Briefings in Bioinformatics 14, 251–260 (2012)
Okser, S., et al.: Genetic variants and their interactions in disease risk prediction - machine learning and network perspectives. BioData Mining (2013)
Lospez-Bigas, N., Ouzounis, C.A.: Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Research 32, 3108–3114 (2004)
Adie, E., et al.: Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics 6, 55 (2005)
Xu, J., Li, Y.: Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 22, 2800–2805 (2006)
Calvo, S., et al.: Systematic identification of human mitochondrial disease genes through integrative genomics. Nat. Genet. 38, 576–582 (2006)
Smalter, A., et al.: Human disease-gene classification with integrative sequence-based and topological features of protein-protein interaction networks. In: IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2007, pp. 209–216 (2007)
Sun, J., et al.: Functional link artificial neural network-based disease gene prediction. In: Neural Networks, IJCNN 2009, pp. 3003–3010 (2009)
Breiman, L., et al.: Classification and regression trees. Wadsworth & Brooks, Monterey (1984)
Schapire, R.E.: A brief introduction to boosting. Ijcai 99, 1401–1406 (1999)
Radivojac, P., et al.: An integrated approach to inferring gene-disease associations in humans. Proteins: Structure, Function, and Bioinformatics 72, 1030–1037 (2008)
Keerthikumar, S., et al.: Prediction of candidate primary immunodeficiency disease genes using a support vector machine learning approach. DNA Research 16, 345–351 (2009)
Amberger, J., et al.: McKusick’s Online Mendelian Inheritance in Man (OMIM®). Nucleic Acids Research 37, D793–D796 (2009)
Safran, M., et al.: GeneCards TM 2002: towards a complete, object-oriented, human gene compendium. Bioinformatics, 1542–1543 (2002)
Lage, K., et al.: A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat. Biotech. 25, 309–316 (2007)
Tu, Z., et al.: Further understanding human disease genes by comparing with housekeeping genes and other genes. BMC Genomics 7, 31 (2006)
Brown, K.R., Jurisica, I.: Online Predicted Human Interaction Database. Bioinformatics 21, 2076–2082 (2005)
Freudenberg, J., Propping, P.: A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics 18, S110–S115 (2002)
The UniProt, C.: The Universal Protein Resource (UniProt) in 2010. Nucl. Acids Res. 38, D142–D148 (2010)
Jonsson, P.F., Bates, P.A.: Global topological features of cancer proteins in the human interactome. Bioinformatics 22, 2291–2297 (2006)
Apweiler, R., et al.: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research 29, 37–40 (2001)
Hunter, S., et al.: InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Research 40, D306–D312 (2011)
Smedley, D., et al.: BioMart - biological queries made easy. BMC Genomics 10, 22 (2009)
Sayers, E.W., et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 39, D38–D51 (2011)
Luo, H., et al.: DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements. Nucleic Acids Research 42, D574–D580 (2014)
Dennis, G., et al.: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology 4, R60 (2003)
Quinlan, J.R.: Induction of decision trees. Machine Learning 1, 81–106 (1986)
Olshen, L.B.J.H.F.R.A., Stone, C.J.: Classification and regression trees. Wadsworth International Group (1984)
Altman, N.S.: An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46, 175–185 (1992)
Rish, I.: An empirical study of the naive Bayes classifier. In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, pp. 41–46 (2001)
Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20, 273–297 (1995)
Breiman, L.: Random Forests. Machine Learning 45, 5–32 (2001)
Hall, M., et al.: The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11, 10–18 (2009)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST)Â 2, 27 (2011)
Bollmann, P., Cherniavsky, V.S.: Restricted evaluation in information retrieval. ACM SIGIR Forum 16, 15–21 (1981)
Mordelet, F., Vert, J.-P.: ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples. BMC Bioinformatics 12, 389 (2011)
Yang, P., et al.: Positive-unlabeled learning for disease gene identification. Bioinformatics 28, 2640–2647 (2012)
Yu, S., et al.: Gene prioritization and clustering by multi-view text mining. BMC Bioinformatics 11, 28 (2010)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Le, DH., Xuan Hoai, N., Kwon, YK. (2015). A Comparative Study of Classification-Based Machine Learning Methods for Novel Disease Gene Prediction. In: Nguyen, VH., Le, AC., Huynh, VN. (eds) Knowledge and Systems Engineering. Advances in Intelligent Systems and Computing, vol 326. Springer, Cham. https://doi.org/10.1007/978-3-319-11680-8_46
Download citation
DOI: https://doi.org/10.1007/978-3-319-11680-8_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11679-2
Online ISBN: 978-3-319-11680-8
eBook Packages: EngineeringEngineering (R0)