Abstract
Integrating large-scale processing environments, such as Hadoop, with traditional repository systems, such as Fedora Commons 3, has long proved to be a daunting task. In this paper, we show how this integration can be achieved using software developed in the Scalable Preservation Environments (SCAPE) project, and how it can also be achieved using a more direct local implementation at the Danish State and University Library, inspired by the SCAPE project. Both allow full use of the Hadoop system for massively distributed processing without causing excessive load on the repository. We present a proof-of-concept SCAPE integration and an in-production local integration, both based on repository systems at the Danish State and University Library and the Hadoop execution environment. Both use data from the Newspaper Digitisation Project, a collection that will grow to more than 32 million JP2 images. The use case for the SCAPE integration is feature extraction and validation of the JP2 images. The validation is done against an institutional preservation policy expressed in the machine-readable SCAPE Control Policy vocabulary. The feature extraction is done using the Jpylyzer tool. We perform an experiment with sets of JP2 images of various sizes to test the scalability and correctness of the solution. The first use case considered for the local Danish State and University Library integration is also feature extraction and validation of the JP2 images, this time using Jpylyzer and Schematron requirements translated by hand from the project specification. We further look at two other use cases: generation of histograms of the tonal distributions of the images, and generation of dissemination copies. We discuss the challenges and benefits of the two integration approaches when preservation actions have to be performed on massive collections stored in traditional digital repositories.
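The validation use case described above can be illustrated with a minimal sketch: features already extracted from each image are checked against a set of policy requirements, one image at a time, which is what makes the step easy to distribute. Everything named here is an assumption made for illustration. The property names, the example requirements, and the functions `validate` and `validate_batch` are hypothetical; they do not reproduce Jpylyzer's output schema, the SCAPE Control Policy vocabulary, or the Schematron rules used in the project.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical policy: property name -> predicate the value must satisfy.
# Names and thresholds are illustrative only, not the institutional policy.
Policy = Dict[str, Callable[[object], bool]]

EXAMPLE_POLICY: Policy = {
    "is_valid_jp2": lambda v: v is True,
    "transformation": lambda v: v == "9-7 irreversible",
    "layers": lambda v: isinstance(v, int) and v >= 1,
}


def validate(features: Dict[str, object], policy: Policy) -> List[str]:
    """Return the names of the policy requirements that the features fail."""
    failures = []
    for name, requirement in policy.items():
        # A property missing from the extracted features counts as a failure.
        if name not in features or not requirement(features[name]):
            failures.append(name)
    return failures


def validate_batch(
    batch: Iterable[Tuple[str, Dict[str, object]]], policy: Policy
) -> Dict[str, List[str]]:
    """Validate each (image_id, features) pair independently of the others."""
    return {image_id: validate(features, policy) for image_id, features in batch}


if __name__ == "__main__":
    good = {"is_valid_jp2": True, "transformation": "9-7 irreversible", "layers": 12}
    bad = {"is_valid_jp2": True, "transformation": "5-3 reversible"}
    print(validate_batch([("page-0001", good), ("page-0002", bad)], EXAMPLE_POLICY))
```

Because each image is validated without reference to any other, a batch can be split across workers in any way; in a setting like the one described, each worker would only need the extracted features and the policy, not access to the repository itself.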
Acknowledgments
We would like to thank the following for their invaluable help in discussing and proofreading this paper: Bjarne Andersen, Karen Williams, Kåre Fiedler Christiansen and especially Jette Junge. We would also like to thank Tom Gravgaard Christensen and Jens Henrik Leonard Jensen for creating the Hadoop clusters, and Kim Teglgaard Christensen for log files and input on the Newspaper Digitisation Project. Last but certainly not least, our thanks go to all our colleagues in the SCAPE project for the great discussions on large-scale challenges and solutions, specifically Frank Asseg from FIZ Karlsruhe, Hélder Silva from Keep Solutions and Peter May from the British Library, who kindly helped us with the references.
Additional information
This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
Cite this article
Jurik, B.A., Blekinge, A.A., Ferneke-Nielsen, R.B. et al. Bridging the gap between real world repositories and scalable preservation environments. Int J Digit Libr 16, 267–282 (2015). https://doi.org/10.1007/s00799-015-0152-4
DOI: https://doi.org/10.1007/s00799-015-0152-4