Part of Speech Based Term Weighting for Information Retrieval

Lioma, Christina; Blanco, Roi

doi:10.1007/978-3-642-00958-7_37

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5478)

Included in the following conference series: European Conference on Information Retrieval

Abstract

Automatic language processing tools typically assign terms so-called ‘weights’ that correspond to each term's contribution to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the ‘POS contexts’ in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different retrieval model as baseline (a language model with Dirichlet prior smoothing) and our best-performing POS-based term weight show consistent retrieval gains across the whole smoothing range of the baseline.
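
The general idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the five weight computations, so everything below is an assumption made for illustration: the IDF-style score for a POS n-gram, the averaging over a term's POS contexts, the multiplicative combination with TF-IDF, the toy tagged corpus, and the hypothetical function names `pos_ngrams`, `pos_term_weights`, and `score`. It should not be read as the authors' formulation.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tagged_sentence, n):
    """Yield (terms, pos_tuple) for every window of n consecutive tokens."""
    for i in range(len(tagged_sentence) - n + 1):
        window = tagged_sentence[i:i + n]
        yield [term for term, _ in window], tuple(pos for _, pos in window)


def pos_term_weights(tagged_sentences, n=2):
    """Weight each term by the mean score of the POS n-grams it occurs in.

    Assumption: a POS n-gram is scored IDF-style, log((N + 1) / sf), where sf
    is the number of sentences containing it. The paper's own measures may
    differ, including in which direction frequency affects the score.
    """
    num_sentences = len(tagged_sentences)
    sentence_freq = Counter()       # POS n-gram -> number of sentences containing it
    contexts = defaultdict(list)    # term -> POS n-grams it appeared in
    for sentence in tagged_sentences:
        seen = set()
        for terms, pos in pos_ngrams(sentence, n):
            seen.add(pos)
            for term in terms:
                contexts[term].append(pos)
        sentence_freq.update(seen)
    ngram_score = {
        pos: math.log((num_sentences + 1) / sf) for pos, sf in sentence_freq.items()
    }
    return {
        term: sum(ngram_score[pos] for pos in ctx) / len(ctx)
        for term, ctx in contexts.items()
    }


def score(query_terms, doc_terms, idf, pos_weight):
    """TF-IDF score with each query term scaled by its POS-based weight.

    Terms without a POS-based weight fall back to a neutral factor of 1.0.
    """
    tf = Counter(doc_terms)
    return sum(tf[t] * idf.get(t, 0.0) * pos_weight.get(t, 1.0) for t in query_terms)


# Toy POS-tagged corpus (hypothetical data, hand-tagged for the example).
corpus = [
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("retrieval", "NN"), ("models", "NN"), ("rank", "VB"), ("documents", "NN")],
]
weights = pos_term_weights(corpus, n=2)
idf = {"retrieval": 1.0}  # placeholder lexical IDF value
matching = score(["retrieval"], ["retrieval", "models"], idf, weights)
non_matching = score(["retrieval"], ["the", "cat"], idf, weights)
```

In this sketch the POS-based weight is computed once from a tagged corpus and is independent of any particular document, which mirrors the abstract's description of a weight that reflects how informative a term is "in general". Multiplying it into the per-term score is only one possible way of integrating it into a matching model.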

Author information

Authors and Affiliations

Computer Science, Katholieke Universiteit Leuven, 3000, Belgium
Christina Lioma
Computer Science, La Coruna University, 15071, Spain
Roi Blanco

Editor information

Editors and Affiliations

Université de Toulouse - IRIT, 118 Route de Narbonne, 31062 Toulouse Cedex 4, France
Mohand Boughanem
Laboratoire d’Informatique de Grenoble, BP 53, Université Joseph Fourier, 38041 Grenoble Cedex 9, France
Catherine Berrut
Université de Toulouse - IRIT, 118 Route de Narbonne, 31062 Toulouse Cedex 4, France
Josiane Mothe & Chantal Soule-Dupuy

About this paper

Cite this paper

Lioma, C., Blanco, R. (2009). Part of Speech Based Term Weighting for Information Retrieval. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds) Advances in Information Retrieval. ECIR 2009. Lecture Notes in Computer Science, vol 5478. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00958-7_37

DOI: https://doi.org/10.1007/978-3-642-00958-7_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00957-0
Online ISBN: 978-3-642-00958-7
eBook Packages: Computer Science, Computer Science (R0)
