Scalable Cloud-Based Data Analysis Software Systems for Big Data from Next Generation Sequencing

Big Data Analysis: New Algorithms for a New Society

Part of the book series: Studies in Big Data (SBD, volume 16)

Abstract

Next generation sequencing (NGS) technology has posed a serious computational challenge since its commercial introduction in 2008. Thousands of machines worldwide now produce billions of sequenced nucleotide base pairs every day. As sequencing technologies continue to become faster and cheaper, processing the large volumes of data they produce has become the main challenge in bioinformatics. This challenge can be addressed by a new generation of software tools based on the paradigms and principles developed within the Hadoop ecosystem. This chapter presents an overall perspective on data analysis software for genomics and the prospects for emerging applications. To show genomic big data analysis in practice, it presents a case study of SparkSeq, a system that delivers a tool for biological sequence analysis.

Notes

  1. Apache Spark is another state-of-the-art distributed computing framework; it is open source software developed by AMPLab at UC Berkeley [36]. Apache Spark has been widely adopted as a successor to Apache Hadoop MapReduce and can be deployed on a single machine, on small clusters of a few nodes, or on large clusters with thousands of nodes.

References

  1. Shendure, J., Ji, H.: Next-generation DNA sequencing. Nat. Biotechnol. 26 (2008)

  2. DePristo, M.A., Banks, E., Poplin, R., Garimella, K.V., Maguire, J.R., Hartl, C., Philippakis, A.A., del Angel, G., Rivas, M.A., Hanna, M., McKenna, A., Fennell, T.J., Kernytsky, A.M., Sivachenko, A.Y., Cibulskis, K., Gabriel, S.B., Altshuler, D., Daly, M.J.: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011)

  3. Duitama, J., Quintero, J.C., Cruz, D.F., Quintero, C., Hubmann, G., Foulquié-Moreno, M.R., Verstrepen, K.J., Thevelein, J.M., Tohme, J.: An integrated framework for discovery and genotyping of genomic variants from high-throughput sequencing experiments. Nucleic Acids Res. 42, e44 (2014)

  4. Ozsolak, F., Milos, P.M.: RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 12, 87–98 (2011)

  5. Anders, S., McCarthy, D.J., Chen, Y., Okoniewski, M., Smyth, G.K., Huber, W., Robinson, M.D.: Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat. Protoc. 8, 1765–1786 (2013)

  6. Bird, A.P.: CpG-rich islands and the function of DNA methylation. Nature 321, 209–213 (1985)

  7. Suzuki, M.M., Bird, A.: DNA methylation landscapes: provocative insights from epigenomics. Nat. Rev. Genet. 9, 465–476 (2008)

  8. Tatusova, T.A., Madden, T.L.: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 174, 247–250 (1999)

  9. Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence comparison. In: Proceedings of the National Academy of Sciences of the United States of America (1988)

  10. DNA sequencing with chain-terminating inhibitors. In: Proceedings of the National Academy of Sciences of the United States of America, National Academy of Sciences of the United States of America (1977)

  11. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012)

  12. Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010)

  13. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008)

  14. Frazee, A.C., Sabunciyan, S., Hansen, K.D., Irizarry, R.A., Leek, J.T.: Differential expression analysis of RNA-seq data at single-base resolution. Biostatistics (Oxford, England) (2014)

  15. Anders, S., Huber, W.: Differential expression analysis for sequence count data. Nature Precedings (2010)

  16. Robinson, M.D., McCarthy, D.J., Smyth, G.K.: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010)

  17. Anders, S., McCarthy, D.J., Chen, Y., Okoniewski, M., Smyth, G.K., Huber, W., Robinson, M.D.: Count-based differential expression analysis of RNA sequencing data using R and Bioconductor (2013)

  18. Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Muertter, R.N., Holko, M., Ayanbule, O., Yefanov, A., Soboleva, A.: NCBI GEO: archive for functional genomics data sets-10 years on. Nucleic Acids Res. 39, D1005–D1010 (2011)

  19. Kodama, Y., Shumway, M., Leinonen, R.: The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 40, D54–D56 (2012)

  20. Cochrane, G., Akhtar, R., Bonfield, J., Bower, L., Demiralp, F., Faruque, N., Gibson, R., Hoad, G., Hubbard, T., Hunter, C., Jang, M., Juhos, S., Leinonen, R., Leonard, S., Lin, Q., Lopez, R., Lorenc, D., McWilliam, H., Mukherjee, G., Plaister, S., Radhakrishnan, R., Robinson, S., Sobhany, S., Hoopen, P.T., Vaughan, R., Zalunin, V., Birney, E.: Petabyte-scale innovations at the European Nucleotide Archive. Nucleic Acids Res. 37, D19–25 (2009)

  21. Kwok, P.Y.: Single Nucleotide Polymorphisms. Humana, Totowa, NJ (2003)

  22. Okoniewski, M.J., Meienberg, J., Patrignani, A., Szabelska, A., Mátyás, G., Schlapbach, R.: Precise breakpoint localization of large genomic deletions using PacBio and Illumina next-generation sequencers. BioTechniques 54, 98–100 (2013)

  23. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L., et al.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)

  24. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009)

  25. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R.: 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 25, 2078–2079 (2009)

  26. Li, H., Ruan, J., Durbin, R.: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008)

  27. Saunders, C.T., Wong, W.S.W., Swamy, S., Becq, J., Murray, L.J., Cheetham, R.K.: Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics (Oxford, England) 28, 1811–1817 (2012)

  28. Thomas, M.F., Ansel, K.M.: Construction of small RNA cDNA libraries for deep sequencing. Methods Mol. Biol. (Clifton, N.J.) 667, 93–111 (2010)

  29. Kornblihtt, A.R., Schor, I.E., Allo, M., Dujardin, G., Petrillo, E., Muñoz, M.J.: Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nat. Rev. Mol. Cell Biol. 14, 153–165 (2013)

  30. Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., Salzberg, S.L.: TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013)

  31. Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R.: STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 29, 15–21 (2013)

  32. Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., Pachter, L.: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012)

  33. Li, B., Dewey, C.N.: RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinf. 12, 323 (2011)

  34. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., di Palma, F., Birren, B.W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., Regev, A.: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011)

  35. White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Inc. (2012)

  36. Franklin, M.: Spark Becomes Top Level Apache Project

  37. Ousterhout, K., Wendell, P., Zaharia, M., Stoica, I.: Sparrow: Distributed, low latency scheduling. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 69–84. SOSP ’13, New York, NY, USA, ACM (2013)

  38. Bykov, S., Geller, A., Kliot, G., Larus, J., Pandya, R., Thelin, J.: Orleans: Cloud computing for everyone. In: ACM Symposium on Cloud Computing (SOCC 2011), ACM (2011)

  39. O’Driscoll, A., Daugelaite, J., Sleator, R.D.: ‘Big data’, Hadoop and cloud computing in genomics. J. Biomed. Inf. 774–781 (2013)

  40. Dove, E.S., Joly, Y., Tassé, A.M.: Genomic cloud computing: legal and ethical points to consider. Eur. J. Hum. Genet. (2014)

  41. Kuo, A.M.H.: Opportunities and challenges of cloud computing to improve health care services. J. Med. Internet Res. 13 (2011)

  42. Dai, L., Gao, X., Guo, Y., Xiao, J., Zhang, Z.: Bioinformatics clouds for big data manipulation. J. Med. Internet Res. 36(6), 4031–4036 (2012)

  43. Jimerson, B.: Software Architecture for High Availability in the Cloud

  44. Apache: Spark programming guide (2014)

  45. Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J., Manzanares, A., Qin, X.: Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. In: 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1–9 (2010)

  46. Kumar, V.: Running Hadoop in the Cloud

  47. Apache: Spark SQL programming guide (2014)

  48. Apache: Parquet (2014)

  49. He, Y., Lee, R., Huai, Y., Shao, Z., Jain, N., Zhang, X., Xu, Z.: RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: 2011 IEEE 27th International Conference on Data Engineering (ICDE), pp. 1199–1208 (2011)

  50. Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., Heljanko, K.: Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics 28, 876–877 (2012)

  51. Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A., Korpelainen, E., Zanetti, G., Heljanko, K.: SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics 30, 119–120 (2014)

  52. Wiewiórka, M.S., Messina, A., Pacholewska, A., Maffioletti, S., Gawrysiak, P., Okoniewski, M.J.: SparkSeq: fast, scalable, cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 2652–2653 (2014)

  53. Nordberg, H., Bhatia, K., Wang, K., Wang, Z.: BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 29, 3014–3019 (2013)

  54. Leo, S., Santoni, F., Zanetti, G.: Biodoop: bioinformatics on Hadoop. In: IEEE International Conference on Parallel Processing Workshops, 2009. ICPPW’09, pp. 415–422 (2009)

  55. Massie, M., Nothaft, F., Hartl, C., Kozanitis, C., Schumacher, A., Joseph, A.D., Patterson, D.A.: ADAM: Genomics formats and processing patterns for cloud scale computing. Technical Report UCB/EECS-2013-207, EECS Department, University of California, Berkeley (2013)

  56. McCabe, C.: How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop. http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/ (2013)

  57. Callaghan, B., Pawlowski, B., Staubach, P.: NFS version 3 protocol specification. Technical report, RFC 1813, Network Working Group (1995)

  58. Dove, E.S., Joly, Y., Tassé, A.M., Burton, P., Chisholm, R., Fortier, I., Goodwin, P., Harris, J., Hveem, K., Kaye, J., et al.: Genomic cloud computing: legal and ethical points to consider. Eur. J. Hum. Genet. (2014)

  59. Beck, M., Haupt, V.J., Roy, J., Moennich, J., Jäkel, R., Schroeder, M., Isik, Z.: GeneCloud: Secure cloud computing for biomedical research. In: Trusted Cloud Computing, pp. 3–14. Springer (2014)

  60. Hortonworks. Manage Security Policy for Hive & HBase with Knox & Ranger. http://hortonworks.com/hadoop-tutorial/manage-security-policy-hive-hbase-knox-ranger/ (2014)

  61. Sharma, P.P., Navdeti, C.P.: Securing big data Hadoop: a review of security issues, threats and solution. Int. J. Comput. Sci. Inf. Technol. 5 (2014)

  62. Merelli, I., Pérez-Sánchez, H., Gesing, S., D’Agostino, D.: Managing, analysing, and integrating big data in medical bioinformatics: open problems and future perspectives. BioMed Res. Int. 2014 (2014)

  63. Cock, P.J., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al.: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009)

  64. Holland, R.C., Down, T.A., Pocock, M., Prlić, A., Huen, D., James, K., Foisy, S., Dräger, A., Yates, A., Heuer, M., et al.: BioJava: an open-source framework for bioinformatics. Bioinformatics 24, 2096–2097 (2008)

  65. Wadkar, S., Siddalingaiah, M.: Apache Ambari. In: Pro Apache Hadoop, pp. 399–401. Springer (2014)

  66. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al.: Apache Hadoop YARN: Yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, p. 5. ACM (2013)

  67. Franklin, M.: The Berkeley data analytics stack: Present and future. In: 2013 IEEE International Conference on Big Data, pp. 2–3 (2013)

  68. Xiao, W., Ji, C.L., Li, J.D.: Design and implementation of massive data retrieving based on cloud computing platform. Appl. Mech. Mater. 303, 2235–2240 (2013)

  69. Turnbull, J.: The Docker Book: Containerization is the new virtualization. James Turnbull (2014)

  70. Team, R.C., et al.: R: A language and environment for statistical computing (2012)

  71. Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004)

  72. Köster, J., Rahmann, S.: Snakemake: a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012)

  73. Cingolani, P., Sladek, R., Blanchette, M.: BigDataScript: a scripting language for data pipelines. Bioinformatics 31, 10–16 (2015)

Correspondence to Michał J. Okoniewski.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Szczerba, M., Wiewiórka, M.S., Okoniewski, M.J., Rybiński, H. (2016). Scalable Cloud-Based Data Analysis Software Systems for Big Data from Next Generation Sequencing. In: Japkowicz, N., Stefanowski, J. (eds) Big Data Analysis: New Algorithms for a New Society. Studies in Big Data, vol 16. Springer, Cham. https://doi.org/10.1007/978-3-319-26989-4_11

  • DOI: https://doi.org/10.1007/978-3-319-26989-4_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26987-0

  • Online ISBN: 978-3-319-26989-4

  • eBook Packages: Engineering, Engineering (R0)
