Topic-Based Language Models for Distributed Retrieval

  • Chapter
Advances in Information Retrieval

Part of the book series: The Information Retrieval Series ((INRE,volume 7))

Abstract

Effective retrieval in a distributed environment is an important but difficult problem. Lack of effectiveness appears to have two major causes. First, existing collection selection algorithms do not work well on heterogeneous collections. Second, relevant documents are scattered over many collections, so searching only a few collections misses many relevant documents. We propose a topic-oriented approach to distributed retrieval. With this approach, we structure the document set of a distributed retrieval environment around a set of topics. Retrieval for a query involves first selecting the right topics for the query and then dispatching the search process to the collections that contain those topics. The content of a topic is characterized by a language model. In environments where documents are not labeled by topic, document clustering is employed for topic identification. Based on these ideas, three methods are proposed to suit different environments. We show that all three methods improve the effectiveness of distributed retrieval.
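
The two-step process described above (select topics for a query, then dispatch to the collections holding those topics) can be illustrated with a minimal sketch. This is not the chapter's implementation: the abstract does not say how topics are scored, so the sketch assumes a unigram language model per topic, query log-likelihood as the ranking score, and linear interpolation with a background model for smoothing. The function names, the `lam` parameter, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def build_language_model(docs):
    """Pool term counts over a topic's documents (each a list of tokens)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


def query_log_likelihood(query, topic_counts, background_counts, lam=0.5):
    """log P(query | topic), smoothed by interpolation with a background model.

    Assumption: the smoothing scheme and lam=0.5 are illustrative choices,
    not taken from the chapter.
    """
    topic_total = sum(topic_counts.values())
    bg_total = sum(background_counts.values())
    vocab_size = len(background_counts)
    score = 0.0
    for term in query:
        p_topic = topic_counts[term] / topic_total if topic_total else 0.0
        # Add-one smoothing keeps the background probability non-zero.
        p_bg = (background_counts[term] + 1) / (bg_total + vocab_size + 1)
        score += math.log(lam * p_topic + (1 - lam) * p_bg)
    return score


def select_collections(query, topic_models, topic_locations, k=1):
    """Rank topics for the query, then return the collections holding the top k."""
    background = Counter()
    for model in topic_models.values():
        background.update(model)
    ranked = sorted(
        topic_models,
        key=lambda t: query_log_likelihood(query, topic_models[t], background),
        reverse=True,
    )
    top_topics = ranked[:k]
    collections = set()
    for topic in top_topics:
        collections |= topic_locations[topic]
    return top_topics, sorted(collections)


# Toy data (hypothetical): two topics spread over three collections.
topic_models = {
    "sports": build_language_model([["football", "goal", "match"], ["tennis", "match"]]),
    "finance": build_language_model([["stock", "market", "bank"], ["bank", "loan"]]),
}
topic_locations = {"sports": {"c1", "c2"}, "finance": {"c2", "c3"}}

topics, collections = select_collections(["bank", "loan"], topic_models, topic_locations)
print(topics, collections)
```

In this toy setting the query is routed only to the collections that hold the best-matching topic, which is the dispatching idea the abstract describes. Where topic labels are unavailable, the `topic_models` dictionary would have to be built from document clusters instead.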

Copyright information

© 2002 Kluwer Academic Publishers

About this chapter

Cite this chapter

Xu, J., Croft, W.B. (2002). Topic-Based Language Models for Distributed Retrieval. In: Croft, W.B. (eds) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_6

  • DOI: https://doi.org/10.1007/0-306-47019-5_6

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-7923-7812-9

  • Online ISBN: 978-0-306-47019-6

  • eBook Packages: Springer Book Archive
