Frequent Case Generation in Ad Hoc Retrieval of Three Indian Languages Bengali, Gujarati and Marathi

  • Conference paper
Multilingual Information Access in South Asian Languages

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7536))

Abstract

This paper presents results of a generative method for the management of morphological variation of query keywords in Bengali, Gujarati and Marathi. The method is called Frequent Case Generation (FCG). It is based on the skewed distributions of word forms in natural languages and is suitable for languages that either have a fair amount of morphological variation or are morphologically very rich. We participated in the ad hoc task at FIRE 2011 and applied the FCG method to the monolingual Bengali, Gujarati and Marathi test collections. Our evaluation was carried out with the title and description fields of the test topics, using the Lemur search engine. We used a plain unprocessed word index as the baseline, and n-gramming and stemming as competing methods. The evaluation results show relative mean average precision improvements of 30%, 16% and 70% for Bengali, Gujarati and Marathi, respectively, when comparing the FCG method to plain words. The method shows competitive performance in comparison to n-gramming and stemming.
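The idea behind FCG described above can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not state how forms were selected or combined in queries. The sketch assumes that case suffix observations have already been counted from a corpus, picks the few most frequent suffixes that together cover a chosen share of occurrences (exploiting the skewed distribution), and generates those forms for a query keyword. The function names, the `coverage` parameter, and the suffixes `"a"`, `"b"`, `"c"` are hypothetical placeholders, not real Bengali, Gujarati or Marathi morphology.

```python
from collections import Counter


def frequent_suffixes(suffix_observations, coverage=0.8):
    """Pick the most frequent suffixes until their cumulative share of all
    observations reaches `coverage` (a hypothetical tuning parameter)."""
    counts = Counter(suffix_observations)
    total = sum(counts.values())
    if total == 0:
        return []
    chosen, covered = [], 0
    for suffix, count in counts.most_common():
        if covered / total >= coverage:
            break
        chosen.append(suffix)
        covered += count
    return chosen


def expand_keyword(stem, suffixes):
    """Generate the frequent case forms of one query keyword by plain
    concatenation (real languages may need stem alternation rules)."""
    return [stem + suffix for suffix in suffixes]


# Placeholder data: "" stands for the bare (unmarked) form; "a", "b", "c"
# are invented suffixes with a deliberately skewed frequency distribution.
observations = [""] * 50 + ["a"] * 30 + ["b"] * 15 + ["c"] * 5
top = frequent_suffixes(observations, coverage=0.8)
forms = expand_keyword("stem", top)
```

With this skewed placeholder distribution, two of the four forms already cover 80% of the observed occurrences, which is the property a generative method of this kind relies on: the query grows by a handful of variants per keyword while the index stays unprocessed.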

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Paik, J.H., Kettunen, K., Pal, D., Järvelin, K. (2013). Frequent Case Generation in Ad Hoc Retrieval of Three Indian Languages Bengali, Gujarati and Marathi. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_4

  • DOI: https://doi.org/10.1007/978-3-642-40087-2_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40086-5

  • Online ISBN: 978-3-642-40087-2

  • eBook Packages: Computer Science, Computer Science (R0)
