Similarity Model and Term Association for Document Categorization

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2002)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2553))

Abstract

In information retrieval and document categorization, similarity models based on Euclidean distance and on the cosine measure both assume that term vectors are orthogonal. This assumption does not hold in practice, and as a result such models ignore associations between terms. This paper analyzes the properties of the term-document space, the term-category space, and the category-document space. Then, without assuming term independence, we propose a new mathematical model to estimate the association between terms and define a ∞-similarity model of documents. We make as much use as possible of the existing category membership information represented in the corpus, with the objective of improving categorization performance. Empirical results obtained with a k-NN classifier on the Reuters-21578 corpus show that using term association can improve the effectiveness of a categorization system and that the ∞-similarity model outperforms models without term association.
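
To illustrate why the orthogonality assumption matters, the following is a minimal sketch, not the model proposed in the paper. It compares the ordinary cosine measure with a cosine-style measure that routes the inner product through a term-association matrix. The three-term vocabulary, the association value 0.8, and the function names `bilinear` and `similarity` are assumptions made up for this example.

```python
import math

def bilinear(x, y, assoc):
    """Return x^T * assoc * y for dense vectors x, y and a square matrix assoc."""
    return sum(x[i] * assoc[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))

def similarity(x, y, assoc):
    """Cosine-style similarity where assoc[i][j] is the association of terms i and j.

    With the identity matrix this reduces to the ordinary cosine measure,
    i.e. the case where term vectors are assumed orthogonal.
    """
    denom = math.sqrt(bilinear(x, x, assoc) * bilinear(y, y, assoc))
    return bilinear(x, y, assoc) / denom if denom > 0 else 0.0

# Hypothetical vocabulary: ["car", "automobile", "bank"]
doc_a = [1.0, 0.0, 0.0]   # mentions only "car"
doc_b = [0.0, 1.0, 0.0]   # mentions only "automobile"

# Orthogonal terms: no association between distinct terms.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Assumed association of 0.8 between "car" and "automobile" (illustrative value).
assoc = [[1.0, 0.8, 0.0],
         [0.8, 1.0, 0.0],
         [0.0, 0.0, 1.0]]

plain = similarity(doc_a, doc_b, identity)    # documents look unrelated
with_assoc = similarity(doc_a, doc_b, assoc)  # documents look related
print(plain, with_assoc)
```

Under the identity matrix the two documents share no terms and receive a similarity of zero, while the association matrix lets related terms contribute to the score. How the association values are actually estimated from category membership is the subject of the paper and is not reproduced here.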

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kou, H., Gardarin, G. (2002). Similarity Model and Term Association for Document Categorization. In: Andersson, B., Bergholtz, M., Johannesson, P. (eds) Natural Language Processing and Information Systems. NLDB 2002. Lecture Notes in Computer Science, vol 2553. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36271-1_22

  • DOI: https://doi.org/10.1007/3-540-36271-1_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00307-6

  • Online ISBN: 978-3-540-36271-5

  • eBook Packages: Springer Book Archive
