A Case Study in Decompounding for Bengali Information Retrieval

  • Conference paper
Information Access Evaluation. Multilinguality, Multimodality, and Visualization (CLEF 2013)

Abstract

Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, or Finnish. However, no previous studies exist on the effect of decompounding in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. The standard approach to decompounding for IR, i.e. indexing compound parts (constituents) in addition to compound words, has proven beneficial for European languages. The experiments reported in this paper show that this standard approach does not work particularly well for Bengali IR. Two characteristics distinguish Bengali compounds: i) only one compound constituent may be a valid word, in contrast to the stricter requirement that both be valid words; and ii) the first character of the right constituent can be modified by the rules of Sandhi, in contrast to simple concatenation. As a solution, we first propose a more relaxed decompounding, in which a compound word is decomposed into only one constituent if the other constituent is not a valid word. Second, we perform selective decompounding by ensuring that constituents often co-occur with the compound word, which indicates how closely related the constituents and the compound are. We perform experiments on Bengali ad-hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection prove more effective than the standard frequency-based decomposition method, improving mean average precision (MAP) by up to 2.72% and recall by up to 1.8% compared to not decompounding words.
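The two ideas in the abstract, relaxed decomposition and co-occurrence-based constituent selection, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_len` and `threshold` parameters, and the co-occurrence ratio (the share of documents containing the compound that also contain the constituent) are assumptions made for illustration. The sketch uses English toy words and plain concatenation, so it does not model Sandhi.

```python
from collections import defaultdict


def build_doc_index(docs):
    """Map each token to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            index[token].add(doc_id)
    return index


def candidate_splits(word, vocabulary, min_len=2, relaxed=True):
    """Return (left, right) pairs for every split point of `word`.

    Strict mode keeps a split only if both parts are valid words.
    Relaxed mode (hypothetical reading of the abstract) also keeps a split
    with one valid part; the invalid part is reported as None.
    """
    splits = []
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        left_ok, right_ok = left in vocabulary, right in vocabulary
        if left_ok and right_ok:
            splits.append((left, right))
        elif relaxed and (left_ok or right_ok):
            splits.append((left if left_ok else None,
                           right if right_ok else None))
    return splits


def cooccurrence_ratio(compound, constituent, doc_index):
    """Fraction of the compound's documents that also contain the constituent."""
    compound_docs = doc_index.get(compound, set())
    if not compound_docs:
        return 0.0
    shared = compound_docs & doc_index.get(constituent, set())
    return len(shared) / len(compound_docs)


def select_constituents(word, vocabulary, doc_index, threshold=0.5, relaxed=True):
    """Keep only valid constituents that co-occur often enough with `word`."""
    selected = set()
    for left, right in candidate_splits(word, vocabulary, relaxed=relaxed):
        for part in (left, right):
            if part is None:
                continue
            if cooccurrence_ratio(word, part, doc_index) >= threshold:
                selected.add(part)
    return sorted(selected)


if __name__ == "__main__":
    # Toy collection: "cran" is not a valid word, so only relaxed mode splits "cranberry".
    vocabulary = {"sun", "flower", "berry"}
    docs = [
        ["sunflower", "sun", "flower"],
        ["sunflower", "flower", "seed"],
        ["cranberry", "berry", "juice"],
        ["sunflower", "oil"],
    ]
    index = build_doc_index(docs)
    for word in ("sunflower", "cranberry"):
        print(word, select_constituents(word, vocabulary, index))
```

In an indexing pipeline, the selected constituents would be added to the index alongside the compound itself, as in the standard approach described above. The threshold trades recall against the risk of indexing constituents that are only loosely related to the compound.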

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ganguly, D., Leveling, J., Jones, G.J.F. (2013). A Case Study in Decompounding for Bengali Information Retrieval. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds) Information Access Evaluation. Multilinguality, Multimodality, and Visualization. CLEF 2013. Lecture Notes in Computer Science, vol 8138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40802-1_14

  • DOI: https://doi.org/10.1007/978-3-642-40802-1_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40801-4

  • Online ISBN: 978-3-642-40802-1

  • eBook Packages: Computer Science, Computer Science (R0)
