Abstract
The web has been a subject of research since its beginning, but analyzing the whole web is difficult if not impossible, even if a database of all URLs were freely accessible. Hundreds of studies have used commercial top websites lists as a shortcut, in particular the Alexa One Million Top Sites list. However, quite apart from Amazon's decision to terminate Alexa, we question the usefulness of such lists for research because they have several shortcomings. Our analysis shows that top sites lists miss frequently visited websites and offer little value for language-specific research. We present a heuristic-driven alternative that is based on the Common Crawl host-level web graph and takes language-specific requirements into account.
Notes
- 1. The term sites will be used as a synonym for hosts. A page is regarded as a single web page document on a host.
- 2. Relevant queries and their search volume can be identified using tools such as the Google Keyword Planner, which provides historical data on the search volume of specific queries.
- 3.
- 4.
- 5. Available at https://tranco-list.eu/list/3X4L.
- 6.
- 7.
- 8. Data dumps of Wikipedia External links are available at https://dumps.wikimedia.org/backup-index.html.
- 9. Only search volume data from Google has been taken into account and, as a consequence, only the hosts found in Google Search.
- 10.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Alby, T., Jäschke, R. (2022). Analyzing the Web: Are Top Websites Lists a Good Choice for Research? In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_2
DOI: https://doi.org/10.1007/978-3-031-16802-4_2
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16801-7
Online ISBN: 978-3-031-16802-4
eBook Packages: Computer Science, Computer Science (R0)