Identifying wrong assemblies in de novo short read primary sequence assembly contigs

Chawla, Vandna; Kumar, Rajnish; Shankar, Ravi

doi:10.1007/s12038-016-9630-0

Published: 05 August 2016

Volume 41, pages 455–474, (2016)

Journal of Biosciences

Vandna Chawla¹,²,
Rajnish Kumar¹ &
Ravi Shankar¹

Abstract

With the advent of short-read-based genome sequencing approaches, a large number of organisms are being sequenced all over the world. Most of these genomes are assembled using de novo short read assemblers and related approaches. However, the contigs produced this way are prone to mis-assembly, and so far there is a conspicuous dearth of reliable tools to identify mis-assembled contigs. Mis-assemblies can result from incorrectly deleted or wrongly arranged genomic sequences. In the present work, various factors related to sequence, sequencing and assembly have been assessed for their role in causing mis-assembly, using different genome sequencing data sets. Some mis-assembly detection tools have also been evaluated for their ability to detect wrongly assembled primary contigs; the results suggest considerable scope for improvement in this area. The present work further proposes a simple, novel, unsupervised learning-based approach to identify mis-assemblies in contigs, which was found to perform reasonably well when compared with existing tools for reporting mis-assembled contigs. It was observed that the proposed methodology may work as a complementary system to the existing tools to enhance their accuracy.

Abbreviations

ACC: accuracy
BAC: bacterial artificial chromosome
CE: compression-expansion
FCD: fragment coverage distribution
FN: false negative
FP: false positive
MCC: Matthews correlation coefficient
NCBI: National Center for Biotechnology Information
PDBG: paired de Bruijn graphs
PE: paired end
SBS: sequencing-by-synthesis
SE: single end
SRA: sequence read archive
TN: true negative
TP: true positive
WGS: whole genome shotgun
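
The evaluation abbreviations above (TP, TN, FP, FN, ACC, MCC) are related through standard formulas. The sketch below shows how ACC and MCC are conventionally computed from the four confusion-matrix counts. It is an illustration only: the function names and the example counts are assumptions made here, not values or code from the article, and returning 0.0 when the MCC denominator vanishes is a common convention rather than something the article specifies.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Returns 0.0 when the denominator is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for a tool that labels contigs as
    # mis-assembled (positive) or correctly assembled (negative).
    tp, tn, fp, fn = 40, 45, 5, 10
    print(f"ACC = {accuracy(tp, tn, fp, fn):.3f}")
    print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

MCC is often reported alongside ACC because it uses all four counts and stays informative when mis-assembled and correctly assembled contigs are unevenly represented, a situation in which ACC alone can look high.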

Acknowledgements

We are thankful to Rohit Chauhan for his help during the analysis. VC is thankful to DST for the SRF-INSPIRE fellowship, and RK is thankful to DBT-BINC. We are thankful to all the researchers who provided their valuable data as an open access resource. The manuscript has IHBT communication ID: 3959. This work was supported under project funding BSC-121 (CSIR-12th FYP GENESIS project).

Author information

Authors and Affiliations

Studio of Computational Biology & Bioinformatics, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India
Vandna Chawla, Rajnish Kumar & Ravi Shankar
Department of Biotechnology, Guru Nanak Dev University, Amritsar, Punjab, India
Vandna Chawla

Corresponding author

Correspondence to Ravi Shankar.

Additional information

[Chawla V, Kumar R and Shankar R 2016 Identifying wrong assemblies in de novo short read primary sequence assembly contigs. J. Biosci.]

Supplementary materials pertaining to this article are available on the Journal of Biosciences Website.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1

(PDF 977 kb)

About this article

Cite this article

Chawla, V., Kumar, R. & Shankar, R. Identifying wrong assemblies in de novo short read primary sequence assembly contigs. J Biosci 41, 455–474 (2016). https://doi.org/10.1007/s12038-016-9630-0

Received: 02 November 2015
Accepted: 14 July 2016
Published: 05 August 2016
Issue Date: September 2016
DOI: https://doi.org/10.1007/s12038-016-9630-0
