Redundancy Treatment of NGS Contigs in Microbial Genome Finishing with Hashing-Based Approach

  • Conference paper
Advances in Bioinformatics and Computational Biology (BSB 2020)

Abstract

Repetitive DNA sequences longer than the read length produce assembly gaps. In addition, repeats can cause complex, misassembled rearrangements that create branches in assembly graphs, and the assembler must decide which path to follow. Incorrect decisions create false associations, called chimeric contigs. Reads originating from different copies of a repetitive region of the genome may also be wrongly collapsed into a single contig, a repetitive contig. Furthermore, the growing use of hybrid assembly approaches, which combine data from different sequencing platforms, different fragment sizes, or even distinct assemblers, significantly increases the number of contigs generated and therefore the redundancy in the data. This work presents a hybrid computational method to detect and eliminate redundant contigs from microbial genome assemblies. It combines two hashing-based techniques: a Bloom filter to detect duplicated contigs and Locality-Sensitive Hashing (LSH) to remove similar contigs. Reducing redundancy facilitates downstream analysis and shortens the time required to finish and curate genome assemblies. The hybrid assembly of the GAGE-B dataset was performed with the SPAdes (de Bruijn graph) and Fermi (OLC) assemblers. The proposed pipeline was applied to the resulting contigs, and its performance was compared with that of similar tools such as HSBLASTN, Simplifier and CD-HIT. Results are presented.
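
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the hash functions, the LSH family, or any parameters, so the sketch assumes MinHash signatures over k-mers with banding as the LSH scheme, and every name and default value here (`BloomFilter`, `drop_exact_duplicates`, `remove_similar`, `reduce_redundancy`, `k=16`, 32 hash functions, 8 bands, a 0.8 similarity threshold) is hypothetical and chosen only for illustration.

```python
import hashlib


def _hash64(salt, text):
    # One salted 64-bit hash value derived from SHA-256 (illustrative choice).
    digest = hashlib.sha256(f"{salt}:{text}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class BloomFilter:
    """Minimal Bloom filter over strings; sizes are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        return [_hash64(i, item) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def drop_exact_duplicates(contigs):
    """Stage 1: skip contigs the Bloom filter reports as already seen.

    A Bloom filter has no false negatives but may give false positives,
    so a real pipeline would likely confirm each hit before discarding.
    """
    seen, unique = BloomFilter(), []
    for seq in contigs:
        if seq in seen:
            continue
        seen.add(seq)
        unique.append(seq)
    return unique


def minhash_signature(seq, k=16, num_perm=32):
    """MinHash signature of the k-mer set of seq (assumed LSH family)."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)} or {seq}
    return [min(_hash64(i, km) for km in kmers) for i in range(num_perm)]


def remove_similar(contigs, k=16, num_perm=32, bands=8, threshold=0.8):
    """Stage 2: drop contigs whose estimated Jaccard similarity to an
    already kept (longer or equal-length) contig reaches the threshold."""
    rows = num_perm // bands
    buckets, kept = {}, []
    for seq in sorted(contigs, key=len, reverse=True):
        sig = minhash_signature(seq, k, num_perm)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        # Only contigs sharing at least one band bucket are compared.
        candidates = {idx for key in keys for idx in buckets.get(key, [])}
        redundant = any(
            sum(x == y for x, y in zip(sig, kept[idx][1])) / num_perm >= threshold
            for idx in candidates
        )
        if not redundant:
            kept.append((seq, sig))
            for key in keys:
                buckets.setdefault(key, []).append(len(kept) - 1)
    return [seq for seq, _ in kept]


def reduce_redundancy(contigs):
    """Run both stages: exact duplicates first, then similar contigs."""
    return remove_similar(drop_exact_duplicates(contigs))
```

Calling `reduce_redundancy(contigs)` on a list of contig strings returns the retained contigs, longest first. Running the cheap exact-duplicate stage before the similarity stage means fewer MinHash signatures need to be computed. The sketch ignores reverse-complement matches, which a practical tool would probably need to handle.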

Author information

Corresponding author

Correspondence to Marcus Braga.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Braga, M., Pinheiro, K., Araújo, F., Miranda, F., Silva, A., Ramos, R. (2020). Redundancy Treatment of NGS Contigs in Microbial Genome Finishing with Hashing-Based Approach. In: Setubal, J.C., Silva, W.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2020. Lecture Notes in Computer Science, vol 12558. Springer, Cham. https://doi.org/10.1007/978-3-030-65775-8_2

  • DOI: https://doi.org/10.1007/978-3-030-65775-8_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-65774-1

  • Online ISBN: 978-3-030-65775-8

  • eBook Packages: Computer Science (R0)
