De-Bruijn graph with MapReduce framework towards metagenomic data classification

Kamal, Md. Sarwar; Parvin, Sazia; Ashour, Amira S.; Shi, Fuqian; Dey, Nilanjan

doi:10.1007/s41870-017-0005-z

De-Bruijn graph with MapReduce framework towards metagenomic data classification

Original Research
Published: 23 February 2017

Volume 9, pages 59–75, (2017)
Cite this article

International Journal of Information Technology Aims and scope Submit manuscript

Md. Sarwar Kamal¹,
Sazia Parvin²,
Amira S. Ashour³,
Fuqian Shi⁴ &
…
Nilanjan Dey⁵

331 Accesses
23 Citations
Explore all metrics

Abstract

Metagenomic gene classifications are significant in bioinformatics and computational biology research. There are huge interrelated datasets that deal with human characteristics, diseases and molecular functionalities. Analysis of metagenomic reorganization is a challenging issue due to their diversity and efficient classification tools. Graph based MapReducing approach can easily handle the genomic diversity. MapReduce has two parts such as mapping and reducing. In mapping phase, a recursive naive algorithm is used for generating K-mers. De-Bruijn graph is a compact representation of k-mers and finds out an optimal path (solution) for genome assembly. Similarity metrics have been utilized for finding similarity among the De-Oxy Ribonucleic Acid (DNA) sequences. In reducing side, Jaccard similarity and purity of clustering are used as datasets classifier to classify the sequences based on their similarity. Reducing phase can easily classify the DNA sequences from large database. Extensive experimental analysis has demonstrated that graph based MapReduce analysis generate optimal solutions. Remarkable improvements in time and space have recorded and observed. The results established that proposed framework performed faster than SSMA-SFSD when classified elements are increased. It provided better accuracy for metagenomic data clustering.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Clustering Algorithm Optimization Applied to Metagenomics Using Big Data

Improving Metagenome Sequence Clustering Application Performance Using Louvain Algorithm

Iterative Clustering Method for Metagenomic Sequences

References

Wooley JC, Godzik A, Friedberg I (2010) A primer on metagenomics. PLoS Comput Biol 6(2):e1000667
Article Google Scholar
Ley RE, Hamady M, Lozupone C, Turnbaugh PJ, Ramey RR (2008) Evolution of mammals and their gut microbes. Science 320(80):1647–1651
Article Google Scholar
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S (2007) The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol 5:e77
Article Google Scholar
Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH (2012) Structure, function and diversity of the healthy human microbiome. Nature 486:207–214
Article Google Scholar
Besemer J, Borodovsky M (1999) Heuristic approach to deriving models for gene finding. Nucleic Acids Res 27(19):3911–3920
Article Google Scholar
Greenblum S, Turnbaugh PJ, Borenstein E (2009) Metagenomic systems biology of the human gut microbiome reveals topological shifts associated with obesity and inflammatory bowel disease. PNAS 109:594–599
Article Google Scholar
Qin J, Li Y, Cai Z, Li S, Zhu J (2012) A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490:55–60
Article Google Scholar
Handelsman J (2007) Committee on metagenomics: challenges and functional applications. The National Academies Press, Washington
Google Scholar
Pevzner P, Tang H, Waterman M (2001) An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 98:9748–9753
Article MathSciNet MATH Google Scholar
Miller J, Koren S, Sutton G (2010) Assembly algorithms for next-generation sequencing data. Genomics 95:315–327
Article Google Scholar
Compeau P, Pevzner P, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 29:987–991
Article Google Scholar
Peng Y, Leung HCM, Yiu SM, Chin FYL (2011) T-IDBA: a de novo iterative de Bruijn graph assembler for transcriptome. In: Bafna V, Sahinalp SC (eds) Research in computational molecular biology. RECOMB 2011. Lecture notes in computer science, vol 6577. Springer, Berlin, Heidelberg
Google Scholar
Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18:821–829
Article Google Scholar
Simpson JT, Wong K, Jackman K, Schein JE, Jones SJ, Birol I (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19(6):1117–1123
Article Google Scholar
Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB (2008) AllPaths: de novo assembly of whole-genome shotgun microreads. Genome Res. 18:810–820
Article Google Scholar
Namiki T, Hachiya T, Tanaka H, Sakakibara Y (2012) MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res 40(20):e155
Article Google Scholar
Grabherr M (2009) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29:644–652
Article Google Scholar
López V, del Río S, Benítez J, Herrera F (2014) Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets Syst 258:5–38
Article MathSciNet Google Scholar
Miner D, Shook A (2012) MapReduce design patterns: building effective algorithms and analytics for Hadoop and other systems. O’Reilly Media, Inc., Sebastopol, CA
Google Scholar
Dean J, Ghemawat S (2003) MapReduce: simplified data processing on large clusters. In: Proceedings. of Symposium on opearting systems design and implementation, vol 6, pp 1–10
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: OSDI 2004
Yinan W, Renner DW, Albert I, SzparaL ML (2015) VirAmp: a galaxy-based viral genome assembly pipeline. GigaScience 4:19
Article Google Scholar
Chang Z, Li G, Li J, Zhang Y, Ashby C, Liu D, Cramer C, Huang X (2015) Bridger: a new framework for de novo transcriptome assembly using RNA-seqdata. Genome Biol 16:30
Article Google Scholar
Hernandez D (2008) De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res 18:802–809
Article Google Scholar
Wang S, Cho H, Zhai CX, Berger B, Peng J (2015) Exploiting ontology graph for predicting sparsely annotated gene function. Bioinformatics 31:i357–i364
Article Google Scholar
Yuzhen Y, Haixu T (2016) Utilizing de Bruijn graph of metagenome assembly for metatranscriptome analysis. Bioinformatics 32(7):1001–1008
Article Google Scholar
Christopher BB (1997) dentification of genes in human genomicdna. Ph.d. Thesis. Stanford University, Stanford, CA,USA
Gens P, Enrique B, Roderic G (2000) Geneid in drosophila. Genome Res 10:511–515
Article Google Scholar
Arthur D, Kirsten B, Edwin P, Steven S (2007) Identifying bacterial genes and endosymbiontdna with glimmer. Bioinformatics 23:7
Google Scholar
Ewan B, Michele C, Richard D (2004) Gene wise and genome wise. Genome Res 14:988–995
Article Google Scholar
Leila T, Oliver R, Saurabh G, Alexander S, Michael B, Serafim B, Burkhard M (2003) Agenda: homology-based gene prediction. Bioinformatics 19:1575–1577
Article Google Scholar
Green P, Lipman D, Hillier L, Waterston R, States RD, Claverie JM (1993) Ancient conserved regions in new gene sequences and the protein databases. Science 259:1711–1716
Article Google Scholar
Noguchi H, Park J, Takagi T (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res 34(19):5623–5630
Article Google Scholar
Hoff KJ, Lingner T, Meinicke P (2009) Orphelia:predicting genes in metagenomic sequencing reads. Nucleic Acids Res 37:W101–W105
Article Google Scholar
Besemer J, Borodovsky M (1999) Heuristic approach to deriving models for gene finding. Nucleic Acids Res 27(19):3911–3920
Article Google Scholar
Yang B, Peng Y, Leung H, Yiu SM, Qin J, Li R, Chin FYL (2010) Metacluster: unsupervised binning of environmental genomic fragments and taxonomic annotation. In: Proceedingsof the first ACM international conference on bioinformatics and computational biology, pp 170–179
Yang X, Zola J, Aluru S. (2011) Parallel metagenomic sequence clustering via sketching and maximal qQuasi clique enumeration on map-reduce clouds. In: Parallel and distributed processing symposium (IPDPS), 2011 IEEE International, pp 1223–1233
Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig latin: a not-so-foreign language for data processing. In: SIGMOD pp 1099–1110
Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Zhang N, Anthony S, Liu H, Murthy R (2010). Hive-a petabyte scale data warehouse using hadoop. In: ICDE, pp 996–1005
Chaiken R, Jenkins B, Larson PA, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. PVLDB 1(2):1265–1276
Google Scholar
Río S, López V, Benítez J, Herrera F (2014) On the use of MapReduce for imbalanced big data using Random Forest. Inf Sci 285:112–137
Article Google Scholar
Birney E, Zerbino DR (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18:821–829
Article Google Scholar
Pevzner PA, Tang HX, Waterman MS (2001) An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 98(17):9748–9753
Article MathSciNet MATH Google Scholar
Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome Res 18(5):821–829
Article Google Scholar
Limasset A, Cazaux B, Rivals E, Peterlongo P (2016) Read mapping on de Bruijn graphs. Bioinformatics 17(1):237
Google Scholar
Myers EW (2005) The fragment assembly string graph. Bioinformatics 21:ii79–ii85
Google Scholar
Myers EW, Sutton GG, Delcher AL et al (2000) A whole-genome assembly of Drosophila. Science 287:2196–2204
Article Google Scholar
Gross JL, Yellen J (2004) Handbook of graph theory. CRC Press LLC, Boca Raton
MATH Google Scholar
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51:107–113
Article Google Scholar
Dean J, Ghemawat S (2010) Mapreduce:a flexible data processing tool. ACM 53:72–77
Article Google Scholar
Benkrid K, Liu Y, Benkrid A (2009) A highly parameterized and efficient FPGA-based skeleton for pairwise biological sequence alignment. IEEE Trans Very Large Scale Integr Syst 17(4):561–570
Article Google Scholar
Edgar RC (2010) Search and clustering orders of magnitude faster than blast. Bioinformatics 26(19):2460–2461
Article Google Scholar
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659
Article Google Scholar
Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (1998) Min-wise independent permutations. In: Proceedings of the thirtieth annual ACM symposium on theory of computing pp 327–336
Zhao Y, Karypis G (2001) Criterion functions for document clustering: experiments and analysis. Technical report, Department of Computer Science, University of Minnesota, Minneapolis
Needleman S, Wunsch C (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Article Google Scholar
Smith T, Waterman M (1981) Identification of common molecular subsequences. J Mol Bwl 147:195–197
Article Google Scholar
Hugenholtz P, Tyson GW (2008) Microbiology: metagenomic. Nature 455(7212):481–483
Article Google Scholar
Chatterji S, Yamazaki I, Bai Z, Eisen J (2008) Compostbin: a dna composition-based algorithm for binning environmental shotgun reads. In: Annual international conference on research in computational molecular biology, Springer, pp 17–28
Khan I, Kamal S, Chowdhury L (2015) MSuPDA: a memory efficient algorithm for sequence alignment. Comput Life Sci 8(1):84–94
Google Scholar
García S, Cano JR, Herrera F (2008) A memetic algorithm for evolutionary prototype selection: ascalingupapproach. Pattern Recognit 41(8):2693–2709
Article MATH Google Scholar
Price KV, Storn RM, Lampinen JA (2005) The differential evolution algorithm. In: Differential evolution: a practical approach to global optimization, pp 37–134. ISBN 978-3-540-31306-9
Neri F, Tirronen V (2009) Scale factor local searching differential evolution. Memet Comput 1(2):153–171
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
Md. Sarwar Kamal
University of New South Wales-ADFA, Canberra, Australia
Sazia Parvin
Department of Electronics and Electrical Communications Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
Amira S. Ashour
College of Information and Engineering, Wenzhou Medical University, Wenzhou, People’s Republic of China
Fuqian Shi
Department of Information Technology, Techno India College of Technology, Kolkata, India
Nilanjan Dey

Authors

Md. Sarwar Kamal
View author publications
You can also search for this author in PubMed Google Scholar
Sazia Parvin
View author publications
You can also search for this author in PubMed Google Scholar
Amira S. Ashour
View author publications
You can also search for this author in PubMed Google Scholar
Fuqian Shi
View author publications
You can also search for this author in PubMed Google Scholar
Nilanjan Dey
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Nilanjan Dey.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kamal, M.S., Parvin, S., Ashour, A.S. et al. De-Bruijn graph with MapReduce framework towards metagenomic data classification. Int. j. inf. tecnol. 9, 59–75 (2017). https://doi.org/10.1007/s41870-017-0005-z

Download citation

Published: 23 February 2017
Issue Date: March 2017
DOI: https://doi.org/10.1007/s41870-017-0005-z

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

De-Bruijn graph with MapReduce framework towards metagenomic data classification

Abstract

Access this article

Similar content being viewed by others

Clustering Algorithm Optimization Applied to Metagenomics Using Big Data

Improving Metagenome Sequence Clustering Application Performance Using Louvain Algorithm

Iterative Clustering Method for Metagenomic Sequences

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

De-Bruijn graph with MapReduce framework towards metagenomic data classification

Abstract

Access this article

Similar content being viewed by others

Clustering Algorithm Optimization Applied to Metagenomics Using Big Data

Improving Metagenome Sequence Clustering Application Performance Using Louvain Algorithm

Iterative Clustering Method for Metagenomic Sequences

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation