Fast Approximation of Frequent k-mers and Applications to Metagenomics

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2019)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11467)

Abstract

Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. While several methods have been designed for the exact or approximate solution of this problem, they all require processing the entire dataset, which can be extremely expensive for high-throughput sequencing datasets. While in some applications it is crucial to estimate all k-mers and their abundances, in other situations it may suffice to report only the frequent k-mers, that is, those that appear with relatively high frequency in a dataset. This is the case, for example, in the computation of k-mer abundance-based distances among datasets of reads, commonly used in metagenomic analyses.
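To make the notions of k-mer frequency, frequent k-mers, and abundance-based distance concrete, the following is a minimal sketch of the exact (full-scan) computation. The function names, the threshold parameter theta, and the choice of the Bray-Curtis dissimilarity are assumptions made for illustration; the abstract does not name a specific distance.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (length-k substring) across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequent_kmers(counts, theta):
    """Return {k-mer: frequency} for k-mers with frequency at least theta.

    Frequency is the k-mer's count divided by the total number of k-mer
    occurrences in the dataset (an assumed, commonly used definition).
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items() if c / total >= theta}


def bray_curtis(freq_a, freq_b):
    """Bray-Curtis dissimilarity between two k-mer frequency profiles."""
    keys = set(freq_a) | set(freq_b)
    shared = sum(min(freq_a.get(x, 0.0), freq_b.get(x, 0.0)) for x in keys)
    total = sum(freq_a.values()) + sum(freq_b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0
```

Note that kmer_counts visits every position of every read, which is the full-dataset cost that motivates restricting attention to frequent k-mers.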

In this work, we develop, analyze, and test a sampling-based approach, called SAKEIMA, to approximate the frequent k-mers and their frequencies in a high-throughput sequencing dataset while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme, and we show how the characterization of the VC dimension (a core concept from statistical learning theory) of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA can rigorously approximate the frequent k-mers by processing only a fraction of a dataset, and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer-based distances between high-throughput sequencing datasets. Overall, SAKEIMA is an efficient and rigorous tool for estimating k-mer abundances, providing significant speed-ups in the analysis of large sequencing datasets.
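The general idea of estimating frequent k-mers from a sample can be sketched as follows. This is not SAKEIMA's sampling scheme: it is a plain independent sampling of k-mer occurrences with a hypothetical inclusion probability p, and the parameters theta and eps are chosen by hand rather than derived from a VC-dimension bound on the sample size.

```python
import random
from collections import Counter


def sample_kmer_counts(reads, k, p, seed=0):
    """Keep each k-mer occurrence independently with probability p.

    p and seed are illustrative parameters; a rigorous method would derive
    the sample size from an accuracy and confidence requirement.
    """
    rng = random.Random(seed)
    sample = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            if rng.random() < p:
                sample[read[i:i + k]] += 1
    return sample


def approx_frequent_kmers(sample, theta, eps):
    """Report k-mers whose frequency in the sample is at least theta - eps/2.

    Lowering the threshold by eps/2 is a common way to avoid missing truly
    frequent k-mers when estimates may be off by up to eps/2; it is an
    assumption here, not a rule taken from the paper.
    """
    m = sum(sample.values())
    if m == 0:
        return {}
    return {kmer: c / m for kmer, c in sample.items() if c / m >= theta - eps / 2}
```

For example, approx_frequent_kmers(sample_kmer_counts(reads, 31, 0.1), 1e-4, 2e-5) would estimate frequencies from roughly a tenth of the k-mer occurrences; whether those estimates are within a given error depends on the sample size, which is the question the sample-size bounds address.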

This work is supported, in part, by the University of Padova grants SID2017 and STARS: Algorithms for Inferential Data Mining.


Notes

  1. Available at https://github.com/VandinLab/SAKEIMA.

  2. https://hmpdacc.org/HMASM.

  3. https://github.com/gmarcais/Jellyfish.

  4. Every instance of SAKEIMA and Jellyfish was executed with 1 worker, i.e., sequentially. Note that the Poisson approximation employed by SAKEIMA allows multiple workers to process the input k-mers independently, so SAKEIMA can also be used in a parallel scenario. We will investigate the impact of parallelism in the extended version of this work.


Author information


Corresponding author

Correspondence to Fabio Vandin.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Pellegrina, L., Pizzi, C., Vandin, F. (2019). Fast Approximation of Frequent k-mers and Applications to Metagenomics. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science, vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_13


  • DOI: https://doi.org/10.1007/978-3-030-17083-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-17082-0

  • Online ISBN: 978-3-030-17083-7

  • eBook Packages: Computer Science, Computer Science (R0)
