Scalable Cloud-Based Data Analysis Software Systems for Big Data from Next Generation Sequencing

Szczerba, Monika; Wiewiórka, Marek S.; Okoniewski, Michał J.; Rybiński, Henryk

doi:10.1007/978-3-319-26989-4_11

Part of the book series: Studies in Big Data (SBD, volume 16)

Abstract

Next generation sequencing (NGS) technology has posed a serious computational challenge since its commercial introduction in 2008. Thousands of machines worldwide now produce billions of sequenced nucleotide base pairs every day. As sequencing technologies continue to become faster and more economical, processing the large volumes of data produced by high-throughput sequencing has become the main challenge in bioinformatics. This challenge can be addressed by a new generation of software tools based on the paradigms and principles developed within the Hadoop ecosystem. This chapter presents an overall perspective on data analysis software for genomics and the prospects for emerging applications. To show genomic big data analysis in practice, it presents a case study of SparkSeq, a system that delivers a tool for biological sequence analysis.

Notes

1. Apache Spark is another state-of-the-art distributed computing framework. It is open source software developed by AMPLab at UC Berkeley [36]. Apache Spark has been widely adopted as a successor to Apache Hadoop MapReduce and can be deployed on a single machine, on small clusters of a few nodes, or on large clusters with thousands of nodes.

References

Shendure, J., Ji, H. (eds.): Next-generation DNA sequencing. In: Shendure, J., Ji, H., (eds.) Nature Biotechnology, vol. 26. Nature Publishing Group (2008)
DePristo, M.A., Banks, E., Poplin, R., Garimella, K.V., Maguire, J.R., Hartl, C., Philippakis, A.A., del Angel, G., Rivas, M.A., Hanna, M., McKenna, A., Fennell, T.J., Kernytsky, A.M., Sivachenko, A.Y., Cibulskis, K., Gabriel, S.B., Altshuler, D., Daly, M.J.: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011)
Duitama, J., Quintero, J.C., Cruz, D.F., Quintero, C., Hubmann, G., Foulquié-Moreno, M.R., Verstrepen, K.J., Thevelein, J.M., Tohme, J.: An integrated framework for discovery and genotyping of genomic variants from high-throughput sequencing experiments. Nucleic Acids Res. 42, e44 (2014)
Ozsolak, F., Milos, P.M.: RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 12, 87–98 (2011)
Anders, S., McCarthy, D.J., Chen, Y., Okoniewski, M., Smyth, G.K., Huber, W., Robinson, M.D.: Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat. Protoc. 8, 1765–1786 (2013)
Bird, A.P.: CpG-rich islands and the function of DNA methylation. Nature 321, 209–213 (1985)
Suzuki, M.M., Bird, A.: DNA methylation landscapes: provocative insights from epigenomics. Nat. Rev. Genet. 9, 465–476 (2008)
Tatusova, T.A., Madden, T.L.: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 174, 247–250 (1999)
Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence comparison. In: Proceedings of the National Academy of Sciences of the United States of America (1988)
DNA sequencing with chain-terminating inhibitors. In: Proceedings of the National Academy of Sciences of the United States of America, National Academy of Sciences of the United States of America (1977)
Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012)
Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010)
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008)
Frazee, A.C., Sabunciyan, S., Hansen, K.D., Irizarry, R.A., Leek, J.T.: Differential expression analysis of RNA-seq data at single-base resolution. Biostatistics (Oxford, England) (2014)
Anders, S., Huber, W.: Differential expression analysis for sequence count data. Nature Precedings (2010)
Robinson, M.D., McCarthy, D.J., Smyth, G.K.: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010)
Anders, S., McCarthy, D.J., Chen, Y., Okoniewski, M., Smyth, G.K., Huber, W., Robinson, M.D.: Count-based differential expression analysis of RNA sequencing data using R and Bioconductor (2013)
Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Muertter, R.N., Holko, M., Ayanbule, O., Yefanov, A., Soboleva, A.: NCBI GEO: archive for functional genomics data sets-10 years on. Nucleic Acids Res. 39, D1005–D1010 (2011)
Kodama, Y., Shumway, M., Leinonen, R.: The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 40, D54–D56 (2012)
Cochrane, G., Akhtar, R., Bonfield, J., Bower, L., Demiralp, F., Faruque, N., Gibson, R., Hoad, G., Hubbard, T., Hunter, C., Jang, M., Juhos, S., Leinonen, R., Leonard, S., Lin, Q., Lopez, R., Lorenc, D., McWilliam, H., Mukherjee, G., Plaister, S., Radhakrishnan, R., Robinson, S., Sobhany, S., Hoopen, P.T., Vaughan, R., Zalunin, V., Birney, E.: Petabyte-scale innovations at the European Nucleotide Archive. Nucleic Acids Res. 37, D19–25 (2009)
Kwok, P.Y.: Single Nucleotide Polymorphisms. Humana, Totowa, NJ (2003)
Okoniewski, M.J., Meienberg, J., Patrignani, A., Szabelska, A., Mátyás, G., Schlapbach, R.: Precise breakpoint localization of large genomic deletions using PacBio and Illumina next-generation sequencers. BioTechniques 54, 98–100 (2013)
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L., et al.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009)
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R.: 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 25, 2078–2079 (2009)
Li, H., Ruan, J., Durbin, R.: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008)
Saunders, C.T., Wong, W.S.W., Swamy, S., Becq, J., Murray, L.J., Cheetham, R.K.: Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics (Oxford, England) 28, 1811–1817 (2012)
Thomas, M.F., Ansel, K.M.: Construction of small RNA cDNA libraries for deep sequencing. Methods Mol. Biol. (Clifton, N.J.) 667, 93–111 (2010)
Kornblihtt, A.R., Schor, I.E., Allo, M., Dujardin, G., Petrillo, E., Muñoz, M.J.: Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nat. Rev. Mol. Cell Biol. 14, 153–165 (2013)
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., Salzberg, S.L.: TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013)
Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R.: STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 29, 15–21 (2013)
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., Pachter, L.: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012)
Li, B., Dewey, C.N.: RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinf. 12, 323 (2011)
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., di Palma, F., Birren, B.W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., Regev, A.: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011)
White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Inc. (2012)
Franklin, M.: Spark Becomes Top Level Apache Project
Ousterhout, K., Wendell, P., Zaharia, M., Stoica, I.: Sparrow: Distributed, low latency scheduling. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 69–84. SOSP ’13, New York, NY, USA, ACM (2013)
Bykov, S., Geller, A., Kliot, G., Larus, J., Pandya, R., Thelin, J.: Orleans: Cloud computing for everyone. In: ACM Symposium on Cloud Computing (SOCC 2011), ACM (2011)
O’Driscoll, A., Daugelaite, J., Sleator, R.D.: ‘Big data’, Hadoop and cloud computing in genomics. J. Biomed. Inf. 774–781 (2013)
Dove, E.S., Joly, Y., Tassé, A.M.: Genomic cloud computing: legal and ethical points to consider. Eur. J. Hum. Genet. (2014)
Kuo, A.M.H.: Opportunities and challenges of cloud computing to improve health care services. J. Med. Internet Res. 13 (2011)
Dai, L., Gao, X., Guo, Y., Xiao, J., Zhang, Z.: Bioinformatics clouds for big data manipulation. J. Med. Internet Res. 36(6), 4031–4036 (2012)
Jimerson, B.: Software Architecture for High Availability in the Cloud
Apache: Spark programming guide (2014)
Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J., Manzanares, A., Qin, X.: Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. In: 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1–9 (2010)
Kumar, V.: Running Hadoop in the Cloud
Apache: Spark SQL programming guide (2014)
Apache: Parquet (2014)
He, Y., Lee, R., Huai, Y., Shao, Z., Jain, N., Zhang, X., Xu, Z.: RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: 2011 IEEE 27th International Conference on Data Engineering (ICDE), pp. 1199–1208 (2011)
Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., Heljanko, K.: Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics 28, 876–877 (2012)
Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A., Korpelainen, E., Zanetti, G., Heljanko, K.: SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics 30, 119–120 (2014)
Wiewiórka, M.S., Messina, A., Pacholewska, A., Maffioletti, S., Gawrysiak, P., Okoniewski, M.J.: SparkSeq: fast, scalable, cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 2652–2653 (2014)
Nordberg, H., Bhatia, K., Wang, K., Wang, Z.: BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 29, 3014–3019 (2013)
Leo, S., Santoni, F., Zanetti, G.: Biodoop: bioinformatics on Hadoop. In: IEEE International Conference on Parallel Processing Workshops, 2009. ICPPW’09, pp. 415–422 (2009)
Massie, M., Nothaft, F., Hartl, C., Kozanitis, C., Schumacher, A., Joseph, A.D., Patterson, D.A.: ADAM: Genomics formats and processing patterns for cloud scale computing. Technical Report UCB/EECS-2013-207, EECS Department, University of California, Berkeley (2013)
McCabe, C.: How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop. http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/ (2013)
Callaghan, B., Pawlowski, B., Staubach, P.: NFS version 3 protocol specification. Technical report, RFC 1813, Network Working Group (1995)
Dove, E.S., Joly, Y., Tassé, A.M., Burton, P., Chisholm, R., Fortier, I., Goodwin, P., Harris, J., Hveem, K., Kaye, J., et al.: Genomic cloud computing: legal and ethical points to consider. Eur. J. Hum. Genet. (2014)
Beck, M., Haupt, V.J., Roy, J., Moennich, J., Jäkel, R., Schroeder, M., Isik, Z.: GeneCloud: Secure cloud computing for biomedical research. In: Trusted Cloud Computing, pp. 3–14. Springer (2014)
Hortonworks. Manage Security Policy for Hive & HBase with Knox & Ranger. http://hortonworks.com/hadoop-tutorial/manage-security-policy-hive-hbase-knox-ranger/ (2014)
Sharma, P.P., Navdeti, C.P.: Securing big data Hadoop: a review of security issues, threats and solution. Int. J. Comput. Sci. Inf. Technol. 5 (2014)
Merelli, I., Pérez-Sánchez, H., Gesing, S., D’Agostino, D.: Managing, analysing, and integrating big data in medical bioinformatics: open problems and future perspectives. BioMed Res. Int. 2014 (2014)
Cock, P.J., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al.: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009)
Holland, R.C., Down, T.A., Pocock, M., Prlić, A., Huen, D., James, K., Foisy, S., Dräger, A., Yates, A., Heuer, M., et al.: BioJava: an open-source framework for bioinformatics. Bioinformatics 24, 2096–2097 (2008)
Wadkar, S., Siddalingaiah, M.: Apache Ambari. In: Pro Apache Hadoop, pp. 399–401. Springer (2014)
Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al.: Apache Hadoop YARN: Yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, p. 5. ACM (2013)
Franklin, M.: The Berkeley data analytics stack: Present and future. In: 2013 IEEE International Conference on Big Data, pp. 2–3 (2013)
Xiao, W., Ji, C.L., Li, J.D.: Design and implementation of massive data retrieving based on cloud computing platform. Appl. Mech. Mater. 303, 2235–2240 (2013)
Turnbull, J.: The Docker Book: Containerization is the new virtualization. James Turnbull (2014)
Team, R.C., et al.: R: A language and environment for statistical computing (2012)
Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004)
Köster, J., Rahmann, S.: Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012)
Cingolani, P., Sladek, R., Blanchette, M.: BigDataScript: a scripting language for data pipelines. Bioinformatics 31, 10–16 (2015)

Author information

Authors and Affiliations

Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland
Monika Szczerba, Marek S. Wiewiórka & Henryk Rybiński
Research Informatics, Scientific IT Services, ETH Zürich, Zurich, Switzerland
Michał J. Okoniewski

Corresponding author

Correspondence to Michał J. Okoniewski.

Editor information

Editors and Affiliations

University of Ottawa, Ottawa, Ontario, Canada
Nathalie Japkowicz
Institute of Computing Sciences, Poznań University of Technology, Poznań, Poland
Jerzy Stefanowski

About this chapter

Cite this chapter

Szczerba, M., Wiewiórka, M.S., Okoniewski, M.J., Rybiński, H. (2016). Scalable Cloud-Based Data Analysis Software Systems for Big Data from Next Generation Sequencing. In: Japkowicz, N., Stefanowski, J. (eds) Big Data Analysis: New Algorithms for a New Society. Studies in Big Data, vol 16. Springer, Cham. https://doi.org/10.1007/978-3-319-26989-4_11

DOI: https://doi.org/10.1007/978-3-319-26989-4_11
Published: 17 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26987-0
Online ISBN: 978-3-319-26989-4
eBook Packages: Engineering, Engineering (R0)