A Case Study in Decompounding for Bengali Information Retrieval

Ganguly, Debasis; Leveling, Johannes; Jones, Gareth J. F.

doi:10.1007/978-3-642-40802-1_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8138)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, and Finnish. However, no previous study has examined the effect of decompounding on IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. The standard approach to decompounding for IR, i.e. indexing compound parts (constituents) in addition to compound words, has proven beneficial for European languages. The experiments reported in this paper show that this standard approach does not work particularly well for Bengali IR. Bengali compounds have some unique characteristics: i) only one compound constituent may be a valid word, in contrast to the stricter requirement that both be valid words; and ii) the first character of the right constituent can be modified by the rules of Sandhi, in contrast to simple concatenation. As a solution, we first propose a more relaxed decompounding, in which a compound word is decomposed into only one constituent if the other constituent is not a valid word. Second, we perform selective decompounding by ensuring that constituents often co-occur with the compound word, which indicates how related the constituents and the compound are. We perform experiments on the Bengali ad-hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection prove more effective than the standard frequency-based decomposition method, improving mean average precision (MAP) by up to 2.72% and recall by up to 1.8% compared to not decompounding words.
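The two ideas in the abstract, relaxed decompounding and co-occurrence-based constituent selection, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy vocabulary, the document-overlap ratio used as the co-occurrence score, and the 0.5 threshold are all assumptions made for illustration; the abstract does not specify the actual measure. Sandhi modification of the right constituent is not modelled, and English toy strings stand in for Bengali words.

```python
from typing import Dict, List, Set


def split_candidates(word: str, vocab: Set[str],
                     min_len: int = 2, relaxed: bool = True) -> List[List[str]]:
    """Try every split point and return the valid constituents found.

    Strict mode keeps a split only if both halves are in the vocabulary.
    Relaxed mode (assumed reading of the abstract) also keeps a split in
    which exactly one half is a valid word, returning that half alone.
    """
    found = []
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        left_ok, right_ok = left in vocab, right in vocab
        if left_ok and right_ok:
            found.append([left, right])
        elif relaxed and (left_ok or right_ok):
            found.append([left if left_ok else right])
    return found


def cooccurrence(constituent: str, compound: str,
                 postings: Dict[str, Set[int]]) -> float:
    """Hypothetical score: share of the compound's documents that also
    contain the constituent."""
    docs_compound = postings.get(compound, set())
    if not docs_compound:
        return 0.0
    docs_constituent = postings.get(constituent, set())
    return len(docs_compound & docs_constituent) / len(docs_compound)


def decompound(word: str, vocab: Set[str], postings: Dict[str, Set[int]],
               threshold: float = 0.5, relaxed: bool = True) -> List[str]:
    """Return the constituents to index in addition to the compound itself."""
    kept = []
    for parts in split_candidates(word, vocab, relaxed=relaxed):
        for part in parts:
            if part not in kept and cooccurrence(part, word, postings) >= threshold:
                kept.append(part)
    return kept


# Toy data (illustrative only): term -> set of document ids containing it.
vocab = {"sun", "light"}
postings = {
    "sunlight": {1, 2, 3},
    "sunward": {4, 5},
    "sun": {1, 2, 4, 5},
    "light": {3, 7},
}
```

With this toy data, strict splitting finds nothing for "sunward" because "ward" is not in the vocabulary, while the relaxed variant still yields "sun". For "sunlight", both halves are valid words, but "light" shares only one of the compound's three documents and is filtered out by the threshold.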

Author information

Authors and Affiliations

CNGL, School of Computing, Dublin City University, Dublin 9, Ireland
Debasis Ganguly, Johannes Leveling & Gareth J. F. Jones

Editor information

Editors and Affiliations

Center for the Evaluation of Language and Communication Technologies (CELCT), via alla Cascata 56/c, 38123, Povo, Italy
Pamela Forner
HES-SO Valais, University of Applied Sciences Western Switzerland, Technopôle 3, 3960, Sierre, Switzerland
Henning Müller
Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València, Camino de Vera s/n, 46071, València, Spain
Roberto Paredes
Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València, Camino de Vera s/n, 46022, València, Spain
Paolo Rosso
Bauhaus-Universität Weimar, Bauhausstraße 11, 99423, Weimar, Germany
Benno Stein

Cite this paper

Ganguly, D., Leveling, J., Jones, G.J.F. (2013). A Case Study in Decompounding for Bengali Information Retrieval. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds) Information Access Evaluation. Multilinguality, Multimodality, and Visualization. CLEF 2013. Lecture Notes in Computer Science, vol 8138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40802-1_14

DOI: https://doi.org/10.1007/978-3-642-40802-1_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40801-4
Online ISBN: 978-3-642-40802-1
eBook Packages: Computer Science; Computer Science (R0)
