A Holistic View on Web Archives

Holzmann, Helge; Nejdl, Wolfgang

doi:10.1007/978-3-030-63291-5_8

Abstract

To address the requirements of the different user groups and use cases of web archives, we have identified three views for accessing and exploring them: user-, data- and graph-centric. The user-centric view is the natural way to look at archived pages in a browser, just as the live web is consumed. By zooming out from there and looking at whole collections in a web archive, data processing methods can enable analysis at scale. In this data-centric view, the web and its dynamics as well as the contents of archived pages can be looked at from two angles: (1) by retrospectively analysing crawl metadata with respect to the size, age and growth of the web and (2) by processing archival collections to build research corpora from web archives. Finally, the third perspective is what we call the graph-centric view, which considers websites, pages or extracted facts as nodes in a graph. Links among pages or the extracted information are represented by edges in the graph. This structural perspective conveys an overview of the holdings and of the connections among the contained resources and information. Only together do the three views provide the holistic view that is required to work effectively with web archives.

Author information

Authors and Affiliations

Consultant, Hannover, Germany
Helge Holzmann
L3S Research Center, Leibniz Universität Hannover, Hannover, Germany
Wolfgang Nejdl

Corresponding author

Correspondence to Helge Holzmann.

Editor information

Editors and Affiliations

Fundação para a Ciência e a Tecnologia, Lisbon, Portugal
Daniel Gomes
Data Science & Intelligent Systems, Computer Science Institute, University of Bonn, Bonn, Germany
Elena Demidova
School of Advanced Study, University of London, London, UK
Jane Winters
University Library J. C. Senckenberg, Goethe University Frankfurt, Frankfurt am Main, Germany
Thomas Risse

About this chapter

Cite this chapter

Holzmann, H., Nejdl, W. (2021). A Holistic View on Web Archives. In: Gomes, D., Demidova, E., Winters, J., Risse, T. (eds) The Past Web. Springer, Cham. https://doi.org/10.1007/978-3-030-63291-5_8

DOI: https://doi.org/10.1007/978-3-030-63291-5_8
Published: 01 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63290-8
Online ISBN: 978-3-030-63291-5
eBook Packages: Computer Science, Computer Science (R0)
