Text Clustering for Peer-to-Peer Networks with Probabilistic Guarantees

  • Conference paper
Advances in Information Retrieval (ECIR 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5993)

Abstract

Text clustering is an established technique for improving quality in information retrieval, for both centralized and distributed environments. However, for highly distributed environments, such as peer-to-peer networks, current clustering algorithms fail to scale. Our algorithm for peer-to-peer clustering achieves high scalability by using a probabilistic approach for assigning documents to clusters. It enables a peer to compare each of its documents only with very few selected clusters, without significant loss of clustering quality. The algorithm offers probabilistic guarantees for the correctness of each document assignment to a cluster. Extensive experimental evaluation with up to 100000 peers and 1 million documents demonstrates the scalability and effectiveness of the algorithm.
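The idea of comparing each document with only a few selected clusters can be illustrated with a small sketch. This is not the paper's algorithm: the term index, the `top_k` parameter, and the function names `build_term_index` and `assign` are all assumptions made for illustration, and the sketch gives no probabilistic guarantee. It only shows how indexing clusters by their most frequent terms lets a peer skip most similarity computations.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_term_index(centroids, top_k=2):
    """Map each cluster's top_k heaviest terms to that cluster's id.

    Hypothetical stand-in for a distributed index; in a real P2P
    system this lookup would be spread over the peers.
    """
    index = defaultdict(set)
    for cid, centroid in centroids.items():
        top_terms = sorted(centroid, key=centroid.get, reverse=True)[:top_k]
        for term in top_terms:
            index[term].add(cid)
    return index


def assign(doc_terms, centroids, index, top_k=2):
    """Assign a document by comparing it only with candidate clusters.

    Candidates are the clusters indexed under the document's top_k
    most frequent terms. Returns (cluster id, number of comparisons).
    """
    doc = Counter(doc_terms)
    candidates = set()
    for term, _ in doc.most_common(top_k):
        candidates |= index.get(term, set())
    if not candidates:
        return None, 0
    best = max(candidates, key=lambda cid: cosine(doc, centroids[cid]))
    return best, len(candidates)


centroids = {
    "sports": {"football": 3, "goal": 2, "team": 1},
    "finance": {"stock": 3, "market": 2, "bank": 1},
    "cooking": {"recipe": 3, "oven": 2, "salt": 1},
}
index = build_term_index(centroids)
doc = ["stock", "stock", "market", "market", "bank"]
print(assign(doc, centroids, index))
```

In this toy setup the document is compared with one cluster instead of three. The cost of such pruning is that the best cluster can be missed when it shares none of the document's top terms, which is the kind of error that the abstract says the algorithm bounds probabilistically.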

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Papapetrou, O., Siberski, W., Fuhr, N. (2010). Text Clustering for Peer-to-Peer Networks with Probabilistic Guarantees. In: Gurrin, C., et al. Advances in Information Retrieval. ECIR 2010. Lecture Notes in Computer Science, vol 5993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12275-0_27

  • DOI: https://doi.org/10.1007/978-3-642-12275-0_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12274-3

  • Online ISBN: 978-3-642-12275-0

  • eBook Packages: Computer Science, Computer Science (R0)
