Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web

  • Conference paper
Distributed Multimedia Information Retrieval (DIR 2003)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2924))

Abstract

This paper describes a decentralized peer-to-peer model for building a Web crawler. Most current systems use a centralized client-server model: the crawl itself is done by one or more tightly coupled machines, but the distribution of crawling jobs and the collection of crawled results are managed centrally using a centralized URL repository. Centralized solutions are known to suffer from link congestion, a single point of failure, and expensive administration. They require both horizontal and vertical scalability solutions to manage Network File Systems (NFS) and to load-balance DNS and HTTP requests.

In this paper, we present an architecture of a completely distributed and decentralized Peer-to-Peer (P2P) crawler called Apoidea, which is self-managing and uses geographical proximity of the web resources to the peers for a better and faster crawl. We use Distributed Hash Table (DHT) based protocols to perform the critical URL-duplicate and content-duplicate tests.
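As a rough illustration of how a DHT can support the two duplicate tests, the sketch below hashes a key onto an identifier ring and lets the clockwise successor peer answer for it. It is a toy under stated assumptions, not the Apoidea implementation: the `Ring` class, the peer names, the choice of SHA-1, routing URLs by host name, and the in-memory sets are all hypothetical, and a real DHT would route lookups over the network rather than consult a global peer list.

```python
import hashlib
from bisect import bisect_left
from urllib.parse import urlsplit


def ring_id(key: str) -> int:
    """Map a string to a point on the identifier ring (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class Ring:
    """Toy DHT with global knowledge: each key is owned by its clockwise successor."""

    def __init__(self, peer_names):
        self.points = sorted((ring_id(name), name) for name in peer_names)
        self.ids = [point for point, _ in self.points]
        self.urls = {name: set() for name in peer_names}   # per-peer URLs already seen
        self.pages = {name: set() for name in peer_names}  # per-peer content fingerprints

    def owner(self, key: str) -> str:
        """Return the peer responsible for `key`, wrapping around the ring."""
        i = bisect_left(self.ids, ring_id(key)) % len(self.ids)
        return self.points[i][1]

    def is_new_url(self, url: str) -> bool:
        """URL-duplicate test. Routing by host name (an assumption) sends every
        URL of one domain to the same peer, which alone decides if it was seen."""
        host = urlsplit(url).hostname or ""
        peer = self.owner(host)
        if url in self.urls[peer]:
            return False
        self.urls[peer].add(url)
        return True

    def is_new_content(self, body: bytes) -> bool:
        """Content-duplicate test: route by the fingerprint of the page body."""
        fingerprint = hashlib.sha1(body).hexdigest()
        peer = self.owner(fingerprint)
        if fingerprint in self.pages[peer]:
            return False
        self.pages[peer].add(fingerprint)
        return True


ring = Ring(["peer-a", "peer-b", "peer-c"])
first = ring.is_new_url("http://example.org/index.html")   # first sighting
second = ring.is_new_url("http://example.org/index.html")  # duplicate
print(first, second)
```

Because exactly one peer is responsible for each key, no central URL repository is needed: any peer that discovers a link can forward it to the owner, and the owner's answer is authoritative for that key.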

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Singh, A., Srivatsa, M., Liu, L., Miller, T. (2004). Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web. In: Callan, J., Crestani, F., Sanderson, M. (eds) Distributed Multimedia Information Retrieval. DIR 2003. Lecture Notes in Computer Science, vol 2924. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24610-7_10

  • DOI: https://doi.org/10.1007/978-3-540-24610-7_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20875-4

  • Online ISBN: 978-3-540-24610-7

  • eBook Packages: Springer Book Archive
