Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web

Singh, Aameek; Srivatsa, Mudhakar; Liu, Ling; Miller, Todd

doi:10.1007/978-3-540-24610-7_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2924)

Included in the following conference series:

Workshop on Distributed Information Retrieval

Abstract

This paper describes a decentralized peer-to-peer model for building a Web crawler. Most current systems use a centralized client-server model: the crawl is done by one or more tightly coupled machines, but the distribution of crawling jobs and the collection of crawled results are managed centrally using a centralized URL repository. Centralized solutions are known to suffer from link congestion, a single point of failure, and expensive administration. They require both horizontal and vertical scalability solutions to manage Network File Systems (NFS) and to load-balance DNS and HTTP requests.

In this paper, we present the architecture of Apoidea, a completely distributed and decentralized Peer-to-Peer (P2P) crawler that is self-managing and uses the geographical proximity of web resources to the peers for a better and faster crawl. We use Distributed Hash Table (DHT) based protocols to perform the critical URL-duplicate and content-duplicate tests.
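
To make the idea of a DHT-based duplicate test concrete, the following is a minimal sketch and not the Apoidea implementation. It assumes a Chord-style hash ring simulated in a single process, assumes that the full URL string (or a SHA-1 digest of the page body) is the hashed key, and uses hypothetical names throughout (`HashRing`, `ToyCrawlNetwork`, `is_new_url`, `is_new_content`). The abstract does not specify the key choice or the per-peer data structure, so plain in-memory sets stand in for whatever each peer actually stores.

```python
import hashlib
from bisect import bisect_left


def ring_hash(key: str) -> int:
    """Map a string to a point on the identifier ring (SHA-1, 160 bits)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical Chord-style ring: a key belongs to its successor peer."""

    def __init__(self, peer_names):
        points = sorted((ring_hash(name), name) for name in peer_names)
        self._ids = [point for point, _ in points]
        self._names = [name for _, name in points]

    def owner(self, key: str) -> str:
        # First peer whose id is >= the key's hash, wrapping around the ring.
        index = bisect_left(self._ids, ring_hash(key)) % len(self._ids)
        return self._names[index]


class ToyCrawlNetwork:
    """All peers simulated in one process; a real system would send messages."""

    def __init__(self, peer_names):
        self.ring = HashRing(peer_names)
        # Assumption: each peer remembers the keys it is responsible for.
        self.seen_urls = {name: set() for name in peer_names}
        self.seen_content = {name: set() for name in peer_names}

    def is_new_url(self, url: str) -> bool:
        """URL-duplicate test: only the owning peer is consulted."""
        peer = self.ring.owner(url)
        if url in self.seen_urls[peer]:
            return False
        self.seen_urls[peer].add(url)
        return True

    def is_new_content(self, body: bytes) -> bool:
        """Content-duplicate test: route by the digest of the page body."""
        digest = hashlib.sha1(body).hexdigest()
        peer = self.ring.owner(digest)
        if digest in self.seen_content[peer]:
            return False
        self.seen_content[peer].add(digest)
        return True


if __name__ == "__main__":
    network = ToyCrawlNetwork(["peer-a", "peer-b", "peer-c"])
    print(network.is_new_url("http://example.org/"))  # first sighting
    print(network.is_new_url("http://example.org/"))  # duplicate
```

Because every key has exactly one owner on the ring, each test touches a single peer and no central URL repository is needed. A real deployment would replace the in-process dictionaries with network lookups and would have to handle peers joining and leaving.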

Author information

Authors and Affiliations

College of Computing, Georgia Institute of Technology, Atlanta, GA, 30332
Aameek Singh, Mudhakar Srivatsa, Ling Liu & Todd Miller

Editor information

Editors and Affiliations

Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Ave, 15213, Pittsburgh, PA, USA
Jamie Callan
Department of Computer and Information Science, University of Strathclyde, Scotland
Fabio Crestani
Department of Information Studies, University of Sheffield, Sheffield, UK
Mark Sanderson

About this paper

Cite this paper

Singh, A., Srivatsa, M., Liu, L., Miller, T. (2004). Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web. In: Callan, J., Crestani, F., Sanderson, M. (eds) Distributed Multimedia Information Retrieval. DIR 2003. Lecture Notes in Computer Science, vol 2924. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24610-7_10

DOI: https://doi.org/10.1007/978-3-540-24610-7_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20875-4
Online ISBN: 978-3-540-24610-7
eBook Packages: Springer Book Archive
