
Part of Speech Based Term Weighting for Information Retrieval

  • Conference paper
Advances in Information Retrieval (ECIR 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5478)

Abstract

Automatic language processing tools typically assign terms so-called ‘weights’ that correspond to each term's contribution to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the ‘POS contexts’ in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different retrieval model as baseline (a language model with Dirichlet prior smoothing) and our best-performing POS-based term weight show retrieval gains that hold consistently across the whole smoothing range of the baseline.
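The abstract does not give the five weight computations, so the following is only a minimal Python sketch of the general idea, under assumptions that are ours and not the paper's: each POS n-gram is scored with an IDF-like statistic over sentences, a term's weight is the mean score of the POS n-grams it occurs in, and the weight is multiplied into a plain TF-IDF score. The function names (`pos_ngrams`, `pos_term_weights`, `score`) and the toy tagged sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tagged_sentence, n=3):
    """Yield (terms, pos_tags) for every window of n tokens in a tagged sentence."""
    for i in range(len(tagged_sentence) - n + 1):
        window = tagged_sentence[i:i + n]
        yield [term for term, _ in window], tuple(tag for _, tag in window)


def pos_term_weights(tagged_sentences, n=3):
    """Assumed weighting: mean IDF-like score of the POS n-grams a term occurs in."""
    num_sentences = len(tagged_sentences)
    ngram_df = Counter()          # number of sentences containing each POS n-gram
    contexts = defaultdict(list)  # term -> POS n-grams observed around it
    for sentence in tagged_sentences:
        seen = set()
        for terms, tags in pos_ngrams(sentence, n):
            seen.add(tags)
            for term in terms:
                contexts[term].append(tags)
        ngram_df.update(seen)
    # Assumption: rarer POS n-grams are treated as more informative contexts.
    ngram_score = {tags: math.log(num_sentences / df) for tags, df in ngram_df.items()}
    return {
        term: sum(ngram_score[tags] for tags in ctx) / len(ctx)
        for term, ctx in contexts.items()
    }


def score(query_terms, doc_terms, idf, pos_weight):
    """TF-IDF score with each query term scaled by its POS-based weight.

    Terms without a POS-based weight default to 1.0, which leaves the
    baseline TF-IDF contribution unchanged (an assumption of this sketch).
    """
    tf = Counter(doc_terms)
    return sum(tf[t] * idf.get(t, 0.0) * pos_weight.get(t, 1.0) for t in query_terms)


# Toy POS-tagged sentences (hypothetical data, not from the paper).
tagged = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VB")],
    [("the", "DT"), ("dog", "NN"), ("ran", "VB")],
    [("a", "DT"), ("red", "JJ"), ("fox", "NN")],
]
weights = pos_term_weights(tagged, n=3)
```

Because the weight depends only on a term's POS contexts in a tagged corpus, it can be precomputed once and reused for every query, which is consistent with the abstract's description of a general, query-independent measure of term informativeness.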

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lioma, C., Blanco, R. (2009). Part of Speech Based Term Weighting for Information Retrieval. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds) Advances in Information Retrieval. ECIR 2009. Lecture Notes in Computer Science, vol 5478. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00958-7_37

  • DOI: https://doi.org/10.1007/978-3-642-00958-7_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00957-0

  • Online ISBN: 978-3-642-00958-7

  • eBook Packages: Computer Science, Computer Science (R0)
