Analyzing the Web: Are Top Websites Lists a Good Choice for Research?

Alby, Tom; Jäschke, Robert

doi:10.1007/978-3-031-16802-4_2

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13541)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

The web has been a subject of research since its beginning, but it is difficult, if not impossible, to analyze the whole web, even if a database of all URLs were freely accessible. Hundreds of studies have used commercial top websites lists as a shortcut, in particular the Alexa One Million Top Sites list. However, apart from the fact that Amazon decided to terminate Alexa, we question the usefulness of such lists for research, as they have several shortcomings. Our analysis shows that top sites lists miss frequently visited websites and offer little value for language-specific research. We present a heuristic-driven alternative based on the Common Crawl host-level web graph that also takes language-specific requirements into account.
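
The claim that top sites lists miss frequently visited websites can be illustrated with a simple coverage check. The following is a minimal sketch only, not the procedure used in the paper: the function names (normalize_host, list_coverage, missing_hosts) are hypothetical, the host names are placeholders, and how hosts are compared (here: lowercased, trailing dot removed) is an assumption.

```python
from typing import Iterable, List


def normalize_host(host: str) -> str:
    """Lowercase a host name and drop surrounding whitespace and a trailing dot."""
    return host.strip().lower().rstrip(".")


def list_coverage(top_list: Iterable[str], reference_hosts: Iterable[str]) -> float:
    """Return the share of reference hosts that also appear in the top list."""
    listed = {normalize_host(h) for h in top_list}
    reference = {normalize_host(h) for h in reference_hosts}
    if not reference:
        raise ValueError("reference_hosts must not be empty")
    return len(reference & listed) / len(reference)


def missing_hosts(top_list: Iterable[str], reference_hosts: Iterable[str]) -> List[str]:
    """Return the reference hosts that the top list does not contain, sorted."""
    listed = {normalize_host(h) for h in top_list}
    reference = {normalize_host(h) for h in reference_hosts}
    return sorted(reference - listed)


if __name__ == "__main__":
    # Placeholder data: a tiny "top list" and a set of hosts known to be visited.
    top_list = ["news.example", "Shop.example."]
    visited = ["shop.example", "forum.example"]
    print(list_coverage(top_list, visited))
    print(missing_hosts(top_list, visited))
```

In such a check, the reference set would be hosts known from another source to be frequently visited; a low coverage value would indicate that the list omits relevant hosts.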

Notes

1. The term sites will be used as a synonym for hosts. A page is regarded as a single web page document on a host.
2. Relevant queries and their search volume can be identified using tools such as the Google Keyword Planner that provides historical data about search volume of specific queries.
3. https://s3-us-west-1.amazonaws.com/umbrella-static/index.html.
4. https://majestic.com/reports/majestic-million.
5. Available at https://tranco-list.eu/list/3X4L.
6. Data available at https://commoncrawl.org/2021/05/host-and-domain-level-web-graphs-feb-apr-may-2021/.
7. https://github.com/CLD2Owners/cld2.
8. Data dumps of Wikipedia External links are available at https://dumps.wikimedia.org/backup-index.html.
9. Only search volume data from Google has been taken into account, and, as a consequence, only the hosts found in Google Search.
10. https://alby.link/ccsample.

Author information

Authors and Affiliations

Humboldt-Universität zu Berlin, Berlin, Germany
Tom Alby & Robert Jäschke

Authors

Tom Alby
Robert Jäschke

Corresponding author

Correspondence to Tom Alby.

Editor information

Editors and Affiliations

University of Padua, Padua, Italy
Gianmaria Silvello
Universidad Politécnica de Madrid, Madrid, Spain
Oscar Corcho
CNR-ISTI – National Research Council, Pisa, Italy
Paolo Manghi
University of Padua, Padua, Italy
Giorgio Maria Di Nunzio
Linnaeus University, Växjö, Sweden
Koraljka Golub
University of Padua, Padua, Italy
Nicola Ferro
Sapienza University of Rome, Rome, Italy
Antonella Poggi

About this paper

Cite this paper

Alby, T., Jäschke, R. (2022). Analyzing the Web: Are Top Websites Lists a Good Choice for Research? In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_2

DOI: https://doi.org/10.1007/978-3-031-16802-4_2
Published: 15 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16801-7
Online ISBN: 978-3-031-16802-4
eBook Packages: Computer Science, Computer Science (R0)
