Definition of the Subject
The World Wide Web holds a vast amount of information in the form of web pages. Almost every subject is covered somewhere in the billions of pages reachable from your computer. Access to this enormous library is a wonderful thing, but the tremendous size of the Web is also a problem: with an ever-growing number of pages covering all sorts of subjects, it becomes very hard to find the one relevant piece of information you need. Web link analysis is a tool that assists with this task. It helps us rank web documents, and so find the most important documents among the huge amount of irrelevant information available online. The input to web link analysis is a network of hyperlinked documents, in which the nodes are documents and the directed edges are hyperlinks from one page to another. The output is a vector containing one numerical score per page, representing the importance of that page in the Web graph. A typical use of this...
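The input/output contract described above (a directed graph of hyperlinked documents in, one importance score per page out) can be illustrated with a small sketch. The text here does not fix a particular algorithm, so the example assumes a PageRank-style power iteration as one possible instance. The function name `link_scores`, the damping value 0.85, the even redistribution of score from pages without outlinks, and the four-page toy web are all illustrative assumptions, not taken from the entry.

```python
def link_scores(links, damping=0.85, tol=1e-10, max_iter=200):
    """Return {page: score} for a dict mapping each page to the pages it links to.

    Assumption: a PageRank-style iteration with an (assumed) damping factor.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}  # start from a uniform score vector
    for _ in range(max_iter):
        # Score held by pages without outlinks is spread evenly over all pages
        # (one common convention, assumed here).
        dangling = sum(scores[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each outgoing link passes on an equal share of the page's score.
                new[q] += damping * scores[p] / len(targets)
        change = sum(abs(new[p] - scores[p]) for p in pages)
        scores = new
        if change < tol:
            break
    return scores


# Hypothetical four-page web: A, B and D link to C; C links back to A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = link_scores(toy_web)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph, page C collects the most incoming links and ends up at the top of the ranked list, with A second because it is the only page C links to. The scores sum to one, so they can be read as a distribution of importance over the pages.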
Glossary
- Graph: A set of vertices connected by directed or undirected edges.
- Web graph: The World Wide Web represented as a graph. The webpages are vertices and hyperlinks are edges. The Web graph is called a directed graph since the hyperlinks are directed links from one page to another.
- Adjacency matrix: A matrix A which represents the structure of a graph. For the Web graph, with directed links, A is typically defined such that \( { A_{ij}=A_{i\to j}=1 } \) if page i points to page j, and 0 otherwise.
- Link analysis: Link analysis studies the relations between nodes in a graph. In large complex networks, it is a tool for analyzing a node's relations to all other nodes in the network. Based on the link structure of the graph, several node properties can be found, for instance: Is the node well connected to the rest of the network? Can node A be reached from node B? How many edges separate A from B? Is a node central in its network? Is it contained in a sub-network? In short, link analysis helps us to extract a node's role in a graph.
- Web-page ranking: Link analysis is used to measure the quality of a webpage. A link is interpreted as a recommendation. Search engines use link analysis scores when the hits (Web pages) returned for a query are presented to the user in the form of a ranked list.
- PageRank: The best-known Web-page ranking algorithm, developed at Stanford University by Larry Page and Sergey Brin. Page and Brin are the founders of Google, which employs the PageRank algorithm.
- Personalization of ranking-scores: When the web-page link analysis score is calculated, some pages may be boosted according to criteria other than just the link-structure of the Web graph. For example, pages relevant to a certain topic (which is preferred by the user) might be boosted to receive a better ranking.
- Link-spamming: A method where a spammer tries to boost a webpage's rank in a search engine hit list. This is done by constructing several pages (a ‘link spam farm’) and manipulating the link-structure in such a way that a targeted page will get a boost in its link analysis score, and hence a higher ranking in a search engine hit list.
- Crawling: A crawler (or spider) is a program that maps the contents of the World Wide Web by starting at some initial webpage(s) and jumping from page to page via hyperlinks in a systematic manner. All pages reachable from the start page(s) by clicking hyperlinks are then visible to the crawler. From each page visited, hyperlinks and useful information are extracted. Search engines use web crawlers to build their searchable databases (indexes). The part of the WWW which has been visited by a crawler and stored in a search engine's database is called the indexable Web or the visible Web.
- Node indegree (outdegree): The number of links pointing to (from) a node. The idea is readily generalized to a set of nodes such as an SCC (see below).
- Strongly connected component (SCC): A maximal set of nodes in a directed graph, such that there is a directed path from any node in the SCC to any other.
- Source (sink) SCC: An SCC with zero indegree (outdegree).
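Several of the definitions above (adjacency matrix, node indegree and outdegree, strongly connected component, source and sink SCC) can be made concrete on a tiny directed graph. The following is a minimal sketch, not a production method: all function names and the four-node example are hypothetical, and the SCCs are found by plain pairwise reachability, which is simple but much slower than the linear-time algorithms one would use on a real Web graph.

```python
def adjacency_matrix(n, edges):
    """A[i][j] = 1 if node i points to node j, and 0 otherwise."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A


def node_degrees(A):
    """Return (indegree, outdegree) lists: column sums and row sums of A."""
    outdeg = [sum(row) for row in A]
    indeg = [sum(col) for col in zip(*A)]
    return indeg, outdeg


def reachable(A, start):
    """Set of nodes reachable from start by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j, linked in enumerate(A[i]):
            if linked and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def strongly_connected_components(A):
    """Group nodes that can all reach each other (simple reachability method)."""
    n = len(A)
    reach = [reachable(A, i) for i in range(n)]
    components, assigned = [], set()
    for i in range(n):
        if i not in assigned:
            comp = {j for j in reach[i] if i in reach[j]}
            components.append(comp)
            assigned |= comp
    return components


def set_degrees(A, comp):
    """(indegree, outdegree) of a node set: links crossing its boundary."""
    n = len(A)
    indeg = sum(A[i][j] for i in range(n) if i not in comp for j in comp)
    outdeg = sum(A[i][j] for i in comp for j in range(n) if j not in comp)
    return indeg, outdeg


# Hypothetical graph: 0 <-> 1, 1 -> 2, 2 <-> 3.
A = adjacency_matrix(4, [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)])
indeg, outdeg = node_degrees(A)
components = strongly_connected_components(A)
for comp in components:
    print(sorted(comp), set_degrees(A, comp))
```

Here the set {0, 1} has zero indegree and is therefore a source SCC, while {2, 3} has zero outdegree and is a sink SCC.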
© 2009 Springer-Verlag
Cite this entry
Bjelland, J., Canright, G., Engø-Monsen, K. (2009). Link Analysis and Web Search. In: Meyers, R. (eds) Encyclopedia of Complexity and Systems Science. Springer, New York, NY. https://doi.org/10.1007/978-0-387-30440-3_312
DOI: https://doi.org/10.1007/978-0-387-30440-3_312
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-75888-6
Online ISBN: 978-0-387-30440-3
eBook Packages: Physics and Astronomy; Reference Module Physical and Materials Science; Reference Module Chemistry, Materials and Physics