Analyzing the Web: Are Top Websites Lists a Good Choice for Research?

  • Conference paper
Linking Theory and Practice of Digital Libraries (TPDL 2022)

Abstract

The web has been a subject of research since its beginning, but it is difficult, if not impossible, to analyze the whole web, even if a database of all URLs were freely accessible. Hundreds of studies have used commercial top websites lists as a shortcut, in particular the Alexa One Million Top Sites list. However, quite apart from Amazon's decision to terminate Alexa, we question the usefulness of such lists for research because they have several shortcomings. Our analysis shows that top sites lists miss frequently visited websites and offer little value for language-specific research. We present a heuristic-driven alternative based on the Common Crawl host-level web graph that also takes language-specific requirements into account.
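To make the claim that top sites lists "miss frequently visited websites" concrete, the following is a minimal sketch of one way such a gap could be measured: the share of a reference set of hosts that also appears in a top sites list. It is an illustration only, not the method used in the paper; the function name `coverage` and the example host names are hypothetical.

```python
def coverage(top_list_hosts, reference_hosts):
    """Return (share, missed) for a top sites list against a reference set.

    share  -- fraction of reference hosts that also appear in the top list
    missed -- sorted reference hosts that the top list does not contain

    Hosts are compared case-insensitively after trimming whitespace.
    This is an illustrative measure, not the paper's procedure.
    """
    top = {h.strip().lower() for h in top_list_hosts}
    ref = {h.strip().lower() for h in reference_hosts}
    if not ref:
        return 0.0, []
    found = ref & top
    missed = sorted(ref - top)
    return len(found) / len(ref), missed


# Hypothetical data: hosts from a top sites list, and hosts observed
# elsewhere (for example, in search results for relevant queries).
top_list = ["news.example", "shop.example", "wiki.example"]
observed = ["News.example", "clinic.example", "wiki.example", "forum.example"]

share, missed = coverage(top_list, observed)
print(f"covered: {share:.0%}, missed: {missed}")
```

A low share would indicate that the list omits hosts that matter for the research question at hand, which is the kind of shortcoming the abstract describes.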

Notes

  1. The term sites will be used as a synonym for hosts. A page is regarded as a single web page document on a host.

  2. Relevant queries and their search volume can be identified using tools such as the Google Keyword Planner that provides historical data about search volume of specific queries.

  3. https://s3-us-west-1.amazonaws.com/umbrella-static/index.html.

  4. https://majestic.com/reports/majestic-million.

  5. Available at https://tranco-list.eu/list/3X4L.

  6. Data available at https://commoncrawl.org/2021/05/host-and-domain-level-web-graphs-feb-apr-may-2021/.

  7. https://github.com/CLD2Owners/cld2.

  8. Data dumps of Wikipedia External links are available at https://dumps.wikimedia.org/backup-index.html.

  9. Only search volume data from Google has been taken into account, and, as a consequence, only the hosts found in Google Search.

  10. https://alby.link/ccsample.

Author information

Correspondence to Tom Alby.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Alby, T., Jäschke, R. (2022). Analyzing the Web: Are Top Websites Lists a Good Choice for Research? In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_2

  • DOI: https://doi.org/10.1007/978-3-031-16802-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16801-7

  • Online ISBN: 978-3-031-16802-4

  • eBook Packages: Computer Science (R0)
