Phenotype Prediction with Semi-supervised Classification Trees

Levatić, Jurica; Brbić, Maria; Perdih, Tomaž Stepišnik; Kocev, Dragi; Vidulin, Vedrana; Šmuc, Tomislav; Supek, Fran; Džeroski, Sašo

doi:10.1007/978-3-319-78680-3_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10785)

Included in the following conference series:

International Workshop on New Frontiers in Mining Complex Patterns

Abstract

In this work, we address the task of predicting phenotypic traits using methods for semi-supervised learning. More specifically, we propose to use supervised and semi-supervised classification trees, as well as supervised and semi-supervised random forests of classification trees. We consider 114 datasets for different phenotypic traits referring to 997 microbial species. These datasets present a challenge for existing machine learning methods: they are not entirely labeled/annotated, and their class distribution is typically imbalanced. We investigate whether approaching phenotype prediction as a semi-supervised learning task can yield improved predictive performance. The results suggest that the semi-supervised methodology considered here is most helpful when using single trees, particularly when the amount of labeled data ranges from 20 to 40%. Similar improvements can be seen when the presence of the phenotype is very imbalanced.
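
To make the idea of a semi-supervised classification tree more concrete, the following is a minimal sketch of one way a split criterion can draw on unlabeled examples: it blends the Gini impurity of the known labels with the variance of the descriptive features over all examples. It is written in the spirit of semi-supervised trees [8] but is not the authors' implementation. The function names, the weight `w`, the toy gene presence/absence data, and the lack of any normalization between the two terms are all assumptions made for illustration.

```python
from collections import Counter
from statistics import pvariance


def gini(labels):
    """Gini impurity of the known labels; None marks an unlabeled example."""
    known = [y for y in labels if y is not None]
    if not known:
        return 0.0
    n = len(known)
    return 1.0 - sum((count / n) ** 2 for count in Counter(known).values())


def feature_variance(rows):
    """Mean per-feature variance over all rows, labeled or not."""
    if len(rows) < 2:
        return 0.0
    columns = list(zip(*rows))
    return sum(pvariance(col) for col in columns) / len(columns)


def ssl_impurity(rows, labels, w=0.5):
    """Blend label impurity (weight w) with feature variance (weight 1 - w).

    w = 1.0 ignores unlabeled structure (purely supervised);
    w = 0.0 ignores the labels (purely unsupervised).
    """
    return w * gini(labels) + (1.0 - w) * feature_variance(rows)


def split_score(rows, labels, feature, threshold, w=0.5):
    """Impurity reduction from splitting on rows[i][feature] <= threshold."""
    left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right = [i for i, row in enumerate(rows) if row[feature] > threshold]
    if not left or not right:
        return 0.0

    def part(indices):
        return ssl_impurity(
            [rows[i] for i in indices], [labels[i] for i in indices], w
        )

    n = len(rows)
    return (
        ssl_impurity(rows, labels, w)
        - len(left) / n * part(left)
        - len(right) / n * part(right)
    )


# Hypothetical toy data: six species, three binary features (e.g. gene
# presence/absence), and only two species with a known phenotype label.
rows = [(1, 1, 1), (1, 0, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0), (0, 1, 0)]
labels = [1, None, None, 0, None, None]

for w in (1.0, 0.5):
    scores = [split_score(rows, labels, f, 0.5, w) for f in (0, 1)]
    print(f"w={w}: feature 0 -> {scores[0]:.3f}, feature 1 -> {scores[1]:.3f}")
```

On this toy data, features 0 and 1 both separate the two labeled species, so a purely supervised criterion (`w = 1.0`) cannot tell them apart. With `w = 0.5`, feature 0 scores higher because it also follows the grouping visible in the unlabeled species.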

Notes

1. Phenotype predictions from [7] are available at protraits.irb.hr.

References

1. Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised Learning, vol. 2. MIT Press, Cambridge (2006)
2. MacDonald, N.J., Beiko, R.G.: Efficient learning of microbial genotype-phenotype association rules. Bioinformatics 26(15), 1834 (2010)
3. Smole, Z., Nikolic, N., Supek, F., Šmuc, T., Sbalzarini, I.F., Krisko, A.: Proteome sequence features carry signatures of the environmental niche of prokaryotes. BMC Evol. Biol. 11(1), 26 (2011)
4. Feldbauer, R., Schulz, F., Horn, M., Rattei, T.: Prediction of microbial phenotypes based on comparative genomics. BMC Bioinform. 16(14), S1 (2015)
5. Brbić, M., Warnecke, T., Kriško, A., Supek, F.: Global shifts in genome and proteome composition are very tightly coupled. Genome Biol. Evol. 7(6), 1519 (2015)
6. Chaffron, S., Rehrauer, H., Pernthaler, J., von Mering, C.: A global network of coexisting microbes from environmental and whole-genome sequence data. Genome Res. 20(7), 947–959 (2010)
7. Brbić, M., Piškorec, M., Vidulin, V., Kriško, A., Šmuc, T., Supek, F.: The landscape of microbial phenotypic traits and associated genes. Nucleic Acids Res. 44(21), 10074 (2016)
8. Levatić, J., Ceci, M., Kocev, D., Džeroski, S.: Semi-supervised classification trees. J. Intell. Inf. Syst. 49(3), 461–486 (2017)
9. Blockeel, H., De Raedt, L., Ramon, J.: Top-down induction of clustering trees. In: Proceedings of the 15th International Conference on Machine Learning, pp. 55–63 (1998)
10. Kocev, D., Vens, C., Struyf, J., Džeroski, S.: Tree ensembles for predicting structured outputs. Pattern Recogn. 46(3), 817–833 (2013)
11. Blockeel, H., Struyf, J.: Efficient algorithms for decision tree cross-validation. J. Mach. Learn. Res. 3, 621–650 (2002)
12. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39(2–3), 103–134 (2000)
13. Cozman, F., Cohen, I., Cirelo, M.: Unlabeled data can degrade classification performance of generative classifiers. In: Proceedings of the 15th International Florida Artificial Intelligence Research Society Conference, pp. 327–331 (2002)
14. Guo, Y., Niu, X., Zhang, H.: An extensive empirical study on semi-supervised learning. In: Proceedings of the 10th International Conference on Data Mining, pp. 186–195 (2010)
15. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
16. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Cambridge (2005)
17. Powell, S., Szklarczyk, D., Trachana, K., Roth, A., Kuhn, M., Muller, J., Arnold, R., Rattei, T., Letunic, I., Doerks, T., Jensen, L.J., von Mering, C., Bork, P.: eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 40(D1), D284 (2012)
18. Stothard, P., Van Domselaar, G., Shrivastava, S., Guo, A., O’Neill, B., Cruz, J., Ellison, M., Wishart, D.S.: BacMap: an interactive picture atlas of annotated bacterial genomes. Nucleic Acids Res. 33(suppl. 1), D317–D320 (2005)
19. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
20. Chawla, N., Karakoulas, G.: Learning from labeled and unlabeled data: an empirical study across techniques and domains. J. Artif. Intell. Res. 23(1), 331–366 (2005)
21. Reddy, T., Thomas, A.D., Stamatis, D., Bertsch, J., Isbandi, M., Jansson, J., Mallajosyula, J., Pagani, I., Lobos, E.A., Kyrpides, N.C.: The genomes online database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 43(D1), D1099 (2015)
22. Land, M.L., Hyatt, D., Jun, S.R., Kora, G.H., Hauser, L.J., Lukjancenko, O., Ussery, D.W.: Quality scores for 32,000 genomes. Stand. Genomic Sci. 9(1), 20 (2014)

Acknowledgments

We acknowledge the financial support of the Slovenian Research Agency, via the grant P2-0103 and a young researcher grant to TSP; the Croatian Science Foundation, via the grant HRZZ-9623 (DescriptiveInduction); and the European Commission, via the grants ICT-2013-612944 MAESTRA and ICT-2013-604102 HBP. We would also like to acknowledge the joint support of the Republic of Slovenia and the European Union under the European Regional Development Fund (grant “Raziskovalci-2.0-FIŠ-52900”, implementation of the operation no. C3330-17-529008).

Author information

Authors and Affiliations

Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Jurica Levatić, Tomaž Stepišnik Perdih, Dragi Kocev, Vedrana Vidulin & Sašo Džeroski
Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Jurica Levatić, Tomaž Stepišnik Perdih, Dragi Kocev & Sašo Džeroski
Division of Electronics, Ruđer Bošković Institute, Zagreb, Croatia
Maria Brbić, Vedrana Vidulin, Tomislav Šmuc & Fran Supek
Faculty of Information Studies, Novo Mesto, Slovenia
Vedrana Vidulin
Center for Genomic Regulation, Barcelona, Spain
Fran Supek

Corresponding author

Correspondence to Jurica Levatić.

Editor information

Editors and Affiliations

University of Bari Aldo Moro, Bari, Italy
Annalisa Appice
University of Bari Aldo Moro, Bari, Italy
Corrado Loglisci
CNR, Rende, Italy
Giuseppe Manco
CNR, Rende, Italy
Elio Masciari
University of North Carolina, Charlotte, North Carolina, USA
Zbigniew W. Ras

Cite this paper

Levatić, J. et al. (2018). Phenotype Prediction with Semi-supervised Classification Trees. In: Appice, A., Loglisci, C., Manco, G., Masciari, E., Ras, Z. (eds) New Frontiers in Mining Complex Patterns. NFMCP 2017. Lecture Notes in Computer Science, vol 10785. Springer, Cham. https://doi.org/10.1007/978-3-319-78680-3_10

DOI: https://doi.org/10.1007/978-3-319-78680-3_10
Published: 24 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-78679-7
Online ISBN: 978-3-319-78680-3
eBook Packages: Computer Science, Computer Science (R0)