Fast Approximation of Frequent k-mers and Applications to Metagenomics

Pellegrina, Leonardo; Pizzi, Cinzia; Vandin, Fabio

doi:10.1007/978-3-030-17083-7_13

Leonardo Pellegrina¹⁵,
Cinzia Pizzi¹⁵ &
Fabio Vandin¹⁵

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11467))

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

1936 Accesses
1 Citations

Abstract

Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. While several methods have been designed for the exact or approximate solution of this problem, they all require to process the entire dataset, that can be extremely expensive for high-throughput sequencing datasets. While in some applications it is crucial to estimate all k-mers and their abundances, in other situations reporting only frequent k-mers, that appear with relatively high frequency in a dataset, may suffice. This is the case, for example, in the computation of k-mers’ abundance-based distances among datasets of reads, commonly used in metagenomic analyses.

In this work, we develop, analyze, and test, a sampling-based approach, called SAKEIMA, to approximate the frequent k-mers and their frequencies in a high-throughput sequencing dataset while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme and we show how the characterization of the VC dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA allows to rigorously approximate frequent k-mers by processing only a fraction of a dataset and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer based distances between high-throughput sequencing datasets. Overall, SAKEIMA is an efficient and rigorous tool to estimate k-mers abundances providing significant speed-ups in the analysis of large sequencing datasets.

This work is supported, in part, by the University of Padova grants SID2017 and STARS: Algorithms for Inferential Data Mining.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 74.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
Available at https://github.com/VandinLab/SAKEIMA.
2.
https://hmpdacc.org/HMASM.
3.
https://github.com/gmarcais/Jellyfish.
4.
Every instance of SAKEIMA and Jellyfish was executed with 1 worker, i.e., sequentially. Note that the Poisson approximation employed by SAKEIMA allows multiple workers to independently process the input k-mers, therefore SAKEIMA can be used in a parallel scenario. We will investigate the impact of parallelism in the extended version of this work.

References

Benoit, G., Peterlongo, P., et al.: Multiple comparative metagenomics using multiset k-mer counting. PeerJ Comput. Sci. 2, e94 (2016)
Article Google Scholar
Břinda, K., Sykulski, M., Kucherov, G.: Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics 31(22), 3584–3592 (2015)
Article Google Scholar
Brown, C.T., Howe, A., et al.: A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv preprint arXiv:1203.4802 (2012)
Chikhi, R., Medvedev, P.: Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1), 31–37 (2013)
Article Google Scholar
Danovaro, R., Canals, M., et al.: A submarine volcanic eruption leads to a novel microbial habitat. Nat. Ecol. Evol. 1(6), 0144 (2017)
Article Google Scholar
Dickson, L.B., Jiolle, D., et al.: Carryover effects of larval exposure to different environmental bacteria drive adult trait variation in a mosquito vector. Sci. Adv. 3(8), e1700585 (2017)
Article Google Scholar
Girotto, S., Pizzi, C., Comin, M.: MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics 32(17), i567–i575 (2016)
Article Google Scholar
Hrytsenko, Y., Daniels, N.M., Schwartz, R.S.: Efficient distance calculations between genomes using mathematical approximation. In: Proceedings of the ACM-BCB, p. 546 (2018)
Google Scholar
Kelley, D.R., Schatz, M.C., Salzberg, S.L.: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11(11), R116 (2010)
Article Google Scholar
Kokot, M., Długosz, M., Deorowicz, S.: KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33(17), 2759–2761 (2017)
Article Google Scholar
Li, X., Waterman, M.S.: Estimating the repeat structure and length of DNA sequences using \(\ell \)-tuples. Genome Res. 13(8), 1916–1922 (2003)
Google Scholar
Löffler, M., Phillips, J.M.: Shape fitting on point sets with probability distributions. In: Fiat, A., Sanders, P. (eds.) ESA 2009. LNCS, vol. 5757, pp. 313–324. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04128-0_29
Chapter Google Scholar
Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)
Article Google Scholar
Melsted, P., Halldórsson, B.V.: KmerStream: streaming algorithms for k-mer abundance estimation. Bioinformatics 30(24), 3541–3547 (2014)
Article Google Scholar
Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinform. 12(1), 333 (2011)
Article Google Scholar
Mitzenmacher, M., Upfal, E.: Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, Cambridge (2017)
MATH Google Scholar
Mohamadi, H., Khan, H., Birol, I.: ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics 33(9), 1324–1330 (2017)
Google Scholar
Ondov, B.D., Treangen, T.J., et al.: Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17(1), 132 (2016)
Article Google Scholar
Pandey, P., Bender, M.A., Johnson, R., Patro, R.: Squeakr: an exact and approximate k-mer counting system. Bioinformatics 34(14), 568–575 (2017)
Google Scholar
Patro, R., Mount, S.M., Kingsford, C.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32(5), 462 (2014)
Article Google Scholar
Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. National Acad. Sci. 98(17), 9748–9753 (2001)
Article MathSciNet Google Scholar
Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics 29(5), 652–653 (2013)
Article Google Scholar
Roy, R.S., Bhattacharya, D., Schliep, A.: Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics 30(14), 1950–1957 (2014)
Article Google Scholar
Salmela, L., Walve, R., Rivals, E., Ukkonen, E.: Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics 33(6), 799–806 (2016)
Google Scholar
Sims, G.E., Jun, S.-R., Wu, G.A., Kim, S.-H.: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. National Acad. Sci. 106(8), 2677–2682 (2009)
Article Google Scholar
Sivadasan, N., Srinivasan, R., Goyal, K.: Kmerlight: fast and accurate k-mer abundance estimation. arXiv preprint arXiv:1609.05626 (2016)
Solomon, B., Kingsford, C.: Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34(3), 300 (2016)
Article Google Scholar
Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
MATH Google Scholar
Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 16(2), 264 (1971)
Article Google Scholar
Wood, D.E., Salzberg, S.L.: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15(3), R46 (2014)
Article Google Scholar
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)
Article Google Scholar
Zhang, Q., Pell, J., et al.: These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PloS One 9(7), e101271 (2014)
Article Google Scholar
Zhang, Z., Wang, W.: RNA-Skim: a rapid method for RNA-Seq quantification at transcript level. Bioinformatics 30(12), i283–i292 (2014)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Information Engineering, University of Padova, Padova, Italy
Leonardo Pellegrina, Cinzia Pizzi & Fabio Vandin

Authors

Leonardo Pellegrina
View author publications
You can also search for this author in PubMed Google Scholar
Cinzia Pizzi
View author publications
You can also search for this author in PubMed Google Scholar
Fabio Vandin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Fabio Vandin .

Editor information

Editors and Affiliations

Tufts University, Cambridge, MA, USA
Lenore J. Cowen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Pellegrina, L., Pizzi, C., Vandin, F. (2019). Fast Approximation of Frequent k-mers and Applications to Metagenomics. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science(), vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_13

Download citation

DOI: https://doi.org/10.1007/978-3-030-17083-7_13
Published: 02 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17082-0
Online ISBN: 978-3-030-17083-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics