Abstract
Huge volumes of data are being accumulated from a variety of sources in engineering and scientific disciplines; this has been referred to as the “Data Avalanche”. Cloud computing infrastructures (such as Amazon Elastic Compute Cloud (EC2)) are specifically designed to combine high compute performance with high-performance network capability to meet the needs of data-intensive science. Reliable, scalable, and distributed computing is used extensively on the cloud. Apache Hadoop is one such open-source project; it provides a distributed file system that creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable and rapid computations. Column-oriented databases built on Hadoop (such as HBase), along with the MapReduce programming paradigm, allow large-scale distributed computing applications to be developed with ease. In this chapter, benchmarking results on a small in-house Hadoop cluster composed of 29 nodes, each with 8-core processors, are presented along with a case study on distributed storage of electroencephalogram (EEG) data. Our results indicate that the Hadoop and HBase projects are still in their nascent stages but provide promising performance characteristics with regard to latency and throughput. In future work, we will explore the development of novel machine learning algorithms on this infrastructure.
Notes
1. Conseil Européen pour la Recherche Nucléaire (European Organization for Nuclear Research).
2. http://timemachine.iic.harvard.edu/publications/.
3. http://www.audioscrobbler.net/development/dumbo/.
4. https://epilepsy.uni-freiburg.de/freiburg-seizure-prediction-project/eeg-database.
5. The ROOT table is confined to a single region and maps all the regions in the META table. Each row in the ROOT and META tables is approximately 1 KB in size. At the default region size of 256 MB, this means that the ROOT region can map 2.6 × 10^5 META regions, which in turn map a total of 6.9 × 10^10 user regions, meaning that approximately 1.8 × 10^19 (2^64) bytes of user data can be addressed.
6. The META table stores information about every user region in HBase, such as its start and end row keys, whether the region is online or offline, and the address of the server that is currently serving the region.
7. This figure is obtained from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html.
8. http://www.uniklinik-freiburg.de/epilepsie/live/indexen.html.
9. We ascertained that this is not a limitation of the client, which was not overloaded in terms of either CPU or bandwidth utilization.
10. HBase 0.20.6 and the more recent development releases specifically addressed and fixed many bugs that affected performance and stability.
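The capacity figures quoted in note 5 can be checked with a short calculation. The following is a minimal sketch, assuming binary units (1 KB = 2^10 bytes, 1 MB = 2^20 bytes) and exactly 1 KB per catalog row; the variable names are illustrative and do not come from HBase.

```python
# Assumptions (not stated in the notes): binary units, exactly 1 KB per catalog row.
KB = 2**10
MB = 2**20

region_size = 256 * MB  # default region size quoted in note 5
row_size = 1 * KB       # approximate size of one ROOT or META row

# Number of catalog rows that fit in a single region.
rows_per_region = region_size // row_size

# ROOT is a single region, and each of its rows maps one META region.
meta_regions = rows_per_region

# Each META region holds rows_per_region rows, one per user region.
user_regions = meta_regions * rows_per_region

# Each user region stores up to region_size bytes of user data.
user_bytes = user_regions * region_size

print(f"META regions: {meta_regions:.1e}")
print(f"user regions: {user_regions:.1e}")
print(f"user bytes:   {user_bytes:.1e}")
print(f"equals 2**64: {user_bytes == 2**64}")
```

Under these assumptions the three quantities are 2^18, 2^36, and 2^64, which round to the values given in the note.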
Acknowledgements
Funding for this work is provided by National Science Foundation award IIS-0916186. Data for this project were provided by the Freiburg Seizure Prediction EEG Database (FSPEEG). The authors would like to thank the administrators of the CLIC Lab Cluster at the Department of Computer Science, Columbia University, for help with cluster set-up and experimentation. The authors also thank Dr. David Waltz, Dr. Catherine A. Schevon, Dr. Ronald Emerson, Phil Gross, and Shen Wang for their insightful comments during different phases of the project.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Dutta, H., Kamil, A., Pooleery, M., Sethumadhavan, S., Demme, J. (2011). Distributed Storage of Large-Scale Multidimensional Electroencephalogram Data Using Hadoop and HBase. In: Fiore, S., Aloisio, G. (eds) Grid and Cloud Database Management. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20045-8_16
DOI: https://doi.org/10.1007/978-3-642-20045-8_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20044-1
Online ISBN: 978-3-642-20045-8
eBook Packages: Computer Science (R0)