Distributed Storage of Large-Scale Multidimensional Electroencephalogram Data Using Hadoop and HBase

Dutta, Haimonti; Kamil, Alex; Pooleery, Manoj; Sethumadhavan, Simha; Demme, John

doi:10.1007/978-3-642-20045-8_16

Abstract

Huge volumes of data are being accumulated from a variety of sources in engineering and scientific disciplines; this has been referred to as the “Data Avalanche”. Cloud computing infrastructures, such as Amazon Elastic Compute Cloud (EC2), are specifically designed to combine high compute performance with high-performance network capability to meet the needs of data-intensive science. Reliable, scalable, and distributed computing is used extensively on the cloud. Apache Hadoop is one such open-source project; it provides a distributed file system that creates multiple replicas of data blocks and distributes them across the compute nodes of a cluster to enable reliable and rapid computation. Column-oriented databases built on Hadoop (such as HBase), together with the MapReduce programming paradigm, allow large-scale distributed computing applications to be developed with ease. In this chapter, benchmarking results on a small in-house Hadoop cluster composed of 29 nodes, each with 8-core processors, are presented along with a case study on distributed storage of electroencephalogram (EEG) data. Our results indicate that the Hadoop and HBase projects are still in their nascent stages but provide promising performance characteristics with regard to latency and throughput. In future work, we will explore the development of novel machine learning algorithms on this infrastructure.

Notes

1. Conseil Européen pour la Recherche Nucléaire (CERN) – European Organization for Nuclear Research.
2. http://timemachine.iic.harvard.edu/publications/.
3. http://www.audioscrobbler.net/development/dumbo/.
4. https://epilepsy.uni-freiburg.de/freiburg-seizure-prediction-project/eeg-database.
5. The ROOT table is confined to a single region and maps all the regions in the META table. Each row in the ROOT and META tables is approximately 1 KB in size. At the default region size of 256 MB, this means that the ROOT region can map 2.6 × 10⁵ META regions, which in turn map a total of 6.9 × 10¹⁰ user regions, so that approximately 1.8 × 10¹⁹ (2⁶⁴) bytes of user data can be addressed.
6. The META table stores information about every user region in HBase, such as its start and end row keys, whether the region is online or offline, and the address of the server that is currently serving the region.
7. This figure is obtained from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html.
8. http://www.uniklinik-freiburg.de/epilepsie/live/indexen.html.
9. We verified that this is not a limitation of the client, which was not overloaded in terms of either CPU or bandwidth utilization.
10. HBase 0.20.6 and the more recent development releases specifically addressed and fixed many bugs that affected performance and stability.
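
The capacity figures in Note 5 follow from simple arithmetic on the two catalog levels. The sketch below reproduces that calculation under the note's stated assumptions (about 1 KB per ROOT or META row, 256 MB regions, binary units); the function and constant names are illustrative and are not part of HBase.

```python
# Assumptions taken from Note 5: each ROOT/META row is ~1 KB and the
# region size is 256 MB. Names below are hypothetical, not an HBase API.
ROW_SIZE_BYTES = 1024                 # approximate size of one catalog row
REGION_SIZE_BYTES = 256 * 1024 ** 2   # 256 MB region size


def rows_per_region(region_size=REGION_SIZE_BYTES, row_size=ROW_SIZE_BYTES):
    """Number of catalog rows (one per mapped region) that fit in a region."""
    return region_size // row_size


def addressable_capacity(region_size=REGION_SIZE_BYTES, row_size=ROW_SIZE_BYTES):
    """Return (META regions, user regions, user bytes) for the two-level catalog."""
    # The single ROOT region holds one row per META region.
    meta_regions = rows_per_region(region_size, row_size)
    # Each META region holds one row per user region.
    user_regions = meta_regions * rows_per_region(region_size, row_size)
    # Each user region stores up to region_size bytes of data.
    user_bytes = user_regions * region_size
    return meta_regions, user_regions, user_bytes


if __name__ == "__main__":
    meta, user, total = addressable_capacity()
    print(f"META regions mapped by ROOT: {meta:.1e}")
    print(f"User regions mapped by META: {user:.1e}")
    print(f"Addressable user data (bytes): {total:.1e}")
```

With these inputs the three values are exactly 2¹⁸, 2³⁶, and 2⁶⁴, which round to the figures quoted in the note.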

Acknowledgements

Funding for this work was provided by National Science Foundation award IIS-0916186. Data for this project were provided by the Freiburg Seizure Prediction EEG Database (FSPEEG). The authors would like to thank the administrators of the CLIC Lab Cluster at the Department of Computer Science, Columbia University, for help with cluster set-up and experimentation. Dr. David Waltz, Dr. Catherine A. Schevon, Dr. Ronald Emerson, Phil Gross, and Shen Wang provided insightful comments during different phases of the project.

Author information

Authors and Affiliations

Center for Computational Learning Systems (CCLS), Columbia University, New York, NY, 10115, USA
Haimonti Dutta & Manoj Pooleery
School of General Studies, Columbia University, New York, NY, 10027, USA
Alex Kamil
Computer Architecture Laboratory, Department of Computer Science, Columbia University, New York, NY, 10115, USA
Simha Sethumadhavan
Department of Computer Science, Columbia University, New York, NY, 10115, USA
John Demme

Corresponding author

Correspondence to Haimonti Dutta.

Editor information

Editors and Affiliations

Faculty of Engineering, Dept. of Innovation Engineering, University of Salento, Via per Monteroni, 73100, Lecce, Italy
Sandro Fiore Ph.D.
Faculty of Engineering, Dept. of Innovation Engineering, University of Salento, Via per Monteroni, 73100, Lecce, Italy
Giovanni Aloisio

About this chapter

Cite this chapter

Dutta, H., Kamil, A., Pooleery, M., Sethumadhavan, S., Demme, J. (2011). Distributed Storage of Large-Scale Multidimensional Electroencephalogram Data Using Hadoop and HBase. In: Fiore, S., Aloisio, G. (eds) Grid and Cloud Database Management. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20045-8_16

DOI: https://doi.org/10.1007/978-3-642-20045-8_16
Published: 17 May 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20044-1
Online ISBN: 978-3-642-20045-8
eBook Packages: Computer Science, Computer Science (R0)