A Holistic View on Web Archives

  • Chapter
The Past Web

Abstract

To address the requirements of the different user groups and use cases of web archives, we have identified three views for accessing and exploring web archives: user-, data- and graph-centric. The user-centric view is the natural way to look at archived pages in a browser, just as the live web is consumed. Zooming out from there to whole collections in a web archive, data processing methods enable analysis at scale. In this data-centric view, the web and its dynamics, as well as the contents of archived pages, can be examined from two angles: (1) by retrospectively analysing crawl metadata with respect to the size, age and growth of the web and (2) by processing archival collections to build research corpora from web archives. Finally, the third perspective is what we call the graph-centric view, which considers websites, pages or extracted facts as nodes in a graph. Links among pages or among the extracted information are represented by edges in the graph. This structural perspective conveys an overview of the holdings and of the connections among the resources and information they contain. Only the three views together provide the holistic view required to work effectively with web archives.
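
To make the graph-centric view more concrete, the following is a minimal sketch and not the chapter's own method. It assumes that page-level links have already been extracted from an archive as (crawl year, source URL, target URL) records. The sample records and the names `host`, `host_graph` and `in_degree` are hypothetical and chosen only for illustration. The sketch treats hosts as nodes and aggregates page links into weighted edges.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical link records: (crawl year, source page URL, target page URL).
LINKS = [
    (2005, "http://example.org/a.html", "http://example.com/index.html"),
    (2005, "http://example.org/b.html", "http://example.com/news.html"),
    (2006, "http://example.com/index.html", "http://example.net/"),
]


def host(url):
    """Return the lowercased host name of a URL, or an empty string if absent."""
    return urlsplit(url).hostname or ""


def host_graph(links):
    """Aggregate page-level links into a host-level graph.

    The result maps each source host to a dict of target hosts,
    with the number of page links as the edge weight.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for _year, source, target in links:
        graph[host(source)][host(target)] += 1
    return {src: dict(targets) for src, targets in graph.items()}


def in_degree(graph):
    """Count, for each host, how many distinct hosts link to it."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


GRAPH = host_graph(LINKS)
```

Grouping by host rather than by page is only one possible choice of node granularity; the same aggregation would work with pages or extracted facts as nodes, as the abstract describes.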

Author information

Corresponding author

Correspondence to Helge Holzmann.

Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Holzmann, H., Nejdl, W. (2021). A Holistic View on Web Archives. In: Gomes, D., Demidova, E., Winters, J., Risse, T. (eds) The Past Web. Springer, Cham. https://doi.org/10.1007/978-3-030-63291-5_8

  • DOI: https://doi.org/10.1007/978-3-030-63291-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63290-8

  • Online ISBN: 978-3-030-63291-5

  • eBook Packages: Computer Science (R0)
