Distributed Storage of Large-Scale Multidimensional Electroencephalogram Data Using Hadoop and HBase

Grid and Cloud Database Management

Abstract

Huge volumes of data are being accumulated from a variety of sources in engineering and scientific disciplines; this has been referred to as the “Data Avalanche”. Cloud computing infrastructures (such as Amazon Elastic Compute Cloud (EC2)) are specifically designed to combine high compute performance with high-performance network capability to meet the needs of data-intensive science. Reliable, scalable, and distributed computing is used extensively on the cloud. Apache Hadoop is one such open-source project; it provides a distributed file system that creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable and rapid computations. Column-oriented databases built on Hadoop (such as HBase), along with the MapReduce programming paradigm, allow large-scale distributed computing applications to be developed with ease. In this chapter, benchmarking results on a small in-house Hadoop cluster composed of 29 nodes, each with 8-core processors, are presented along with a case study on distributed storage of electroencephalogram (EEG) data. Our results indicate that the Hadoop and HBase projects are still in their nascent stages but provide promising performance characteristics with regard to latency and throughput. In future work, we will explore the development of novel machine learning algorithms on this infrastructure.


Notes

  1. Conseil Européen pour la Recherche Nucléaire – European Organization for Nuclear Research.

  2. http://timemachine.iic.harvard.edu/publications/.

  3. http://www.audioscrobbler.net/development/dumbo/.

  4. https://epilepsy.uni-freiburg.de/freiburg-seizure-prediction-project/eeg-database.

  5. The ROOT table is confined to a single region and maps all the regions in the META table. Each row in the ROOT and META tables is approximately 1 KB in size. At the default region size of 256 MB, this means that the ROOT region can map 2.6 × 10⁵ META regions, which in turn map a total of 6.9 × 10¹⁰ user regions, meaning that approximately 1.8 × 10¹⁹ (2⁶⁴) bytes of user data can be addressed.

  6. The META table stores information about every user region in HBase, such as its start and end row keys, whether the region is online or offline, and the address of the server that is currently serving the region.

  7. This figure is obtained from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html.

  8. http://www.uniklinik-freiburg.de/epilepsie/live/indexen.html.

  9. We ascertained that this is not a limitation on the client, which was not overloaded in terms of either CPU or bandwidth utilization.

  10. HBase 0.20.6 and the more recent development releases specifically addressed and fixed many bugs that affected performance and stability.
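The capacity arithmetic in note 5 can be checked with a short calculation. The following is a minimal sketch that assumes exactly 1 KB per catalog row and a 256 MB region size, as stated in the note; the function name `catalog_capacity` and its parameters are hypothetical and are not part of HBase.

```python
# Back-of-the-envelope check of the ROOT -> META -> user-region lookup in note 5.
# Assumptions (taken from the note): ~1 KB per ROOT/META row, 256 MB per region.
# `catalog_capacity` is an illustrative helper, not an HBase API.

def catalog_capacity(row_size=1024, region_size=256 * 1024**2):
    """Return (meta_regions, user_regions, user_bytes) for a three-level lookup."""
    rows_per_region = region_size // row_size      # rows one catalog region can hold
    meta_regions = rows_per_region                 # the single ROOT region maps this many META regions
    user_regions = meta_regions * rows_per_region  # each META region maps this many user regions
    user_bytes = user_regions * region_size        # every user region holds up to region_size bytes
    return meta_regions, user_regions, user_bytes


meta_regions, user_regions, user_bytes = catalog_capacity()
print(f"META regions: {meta_regions:.1e}")
print(f"User regions: {user_regions:.1e}")
print(f"User bytes:   {user_bytes:.1e}")
```

With these inputs, the three values are 2¹⁸, 2³⁶, and 2⁶⁴ respectively, which matches the rounded figures quoted in note 5.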

Acknowledgements

Funding for this work is provided by National Science Foundation award IIS-0916186. Data for this project were provided by the Freiburg Seizure Prediction EEG Database (FSPEEG). The authors would like to thank the administrators of the CLIC Lab Cluster at the Department of Computer Science, Columbia University, for help with cluster set-up and experimentation. Dr. David Waltz, Dr. Catherine A. Schevon, Dr. Ronald Emerson, Phil Gross, and Shen Wang provided insightful comments during different phases of the project.

Author information

Correspondence to Haimonti Dutta.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Dutta, H., Kamil, A., Pooleery, M., Sethumadhavan, S., Demme, J. (2011). Distributed Storage of Large-Scale Multidimensional Electroencephalogram Data Using Hadoop and HBase. In: Fiore, S., Aloisio, G. (eds) Grid and Cloud Database Management. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20045-8_16

  • DOI: https://doi.org/10.1007/978-3-642-20045-8_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20044-1

  • Online ISBN: 978-3-642-20045-8

  • eBook Packages: Computer Science, Computer Science (R0)
