An Analysis of k-Mer Frequency Features with Machine Learning Models for Viral Subtyping of Polyomavirus and HIV-1 Genomes

  • Conference paper
Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1 (FTC 2020)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1288)

Abstract

Viral subtyping is the process of classifying a virus genome into a subtype within its family, and it plays a major role in the appropriate diagnosis and treatment of illness. Researchers typically use alignment-based methods for viral subtyping. However, alignment-based methods are slow and require exposing the sample genome being queried, which compromises its privacy. For that reason, alternative methods have emerged that use machine learning models to take a viral sample genome and predict its subtype. The performance of these models depends on the feature vector computed, and the most notable methods use k-mer frequencies as features. In this study, we compared the two most relevant k-mer frequency methods, Kameris and Castor-KRFE, on a dataset of Polyomavirus and HIV-1 genomes. Both obtain the same results when their dimensionality reduction and feature elimination steps are skipped; when those steps are applied, Kameris slightly outperforms Castor-KRFE. Moreover, Castor-KRFE can produce a small feature vector for \(k>5\), where \(k\) is the k-mer length.
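To make the notion of a k-mer frequency feature vector concrete, the following is a minimal sketch in Python. It is not the Kameris or Castor-KRFE implementation. The function name `kmer_frequencies`, the restriction to the alphabet A, C, G, T, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return normalized frequencies of all 4**k DNA k-mers in `sequence`.

    Hypothetical helper for illustration only; not taken from Kameris
    or Castor-KRFE.
    """
    alphabet = "ACGT"
    # Fixed lexicographic ordering so vectors from different genomes align.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)

    seq = sequence.upper()
    total = 0
    # Slide a window of length k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1

    if total == 0:
        # Sequence shorter than k, or no valid window: return all zeros.
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


if __name__ == "__main__":
    vector = kmer_frequencies("ACGTAC", 2)
    print(len(vector))  # number of features, 4**k
```

The vector has \(4^k\) entries, so its length grows quickly with \(k\) (4096 entries at \(k=6\)). This is one reason a feature elimination step that yields a small feature vector is of interest for \(k>5\).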

Corresponding author

Correspondence to V. E. Machaca Arceda.

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this paper

Arceda, V.E.M. (2021). An Analysis of k-Mer Frequency Features with Machine Learning Models for Viral Subtyping of Polyomavirus and HIV-1 Genomes. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1. FTC 2020. Advances in Intelligent Systems and Computing, vol 1288. Springer, Cham. https://doi.org/10.1007/978-3-030-63128-4_21
