An Efficient Corpus-Based Stemmer

Published in Cognitive Computation

Abstract

Word stemming is a linguistic process in which the various inflected forms of a word are matched to their base form. It is among the basic text pre-processing steps used in Natural Language Processing and Information Retrieval. Stemming is employed at the text pre-processing stage to address vocabulary mismatch and to reduce the size of the word vocabulary, and consequently the dimensionality of training data for statistical models. In this article, we present a fully unsupervised corpus-based text stemming method that clusters morphologically related words on the basis of lexical knowledge. The proposed method performs cognitive-inspired computing to discover morphologically related words from the corpus without any human intervention or language-specific knowledge. Its performance is evaluated on inflection removal (approximating lemmas) and Information Retrieval tasks. Retrieval experiments in four languages using standard Text Retrieval Conference, Cross-Language Evaluation Forum, and Forum for Information Retrieval Evaluation collections show that the proposed stemming method performs significantly better than no stemming. For the highly inflectional languages Marathi and Hungarian, the improvement in Mean Average Precision is nearly 50% compared with unstemmed words. Moreover, the proposed unsupervised method outperforms strong state-of-the-art language-independent and rule-based stemming methods in all the languages. Besides Information Retrieval, the proposed method also performs significantly better in inflection removal experiments. The proposed unsupervised, language-independent stemming method can thus serve as a multipurpose tool for tasks such as approximating lemmas, improving retrieval performance, and supporting other Natural Language Processing applications.
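
The method itself is described in the body of the article; as a rough, hypothetical illustration of the general idea of corpus-based stemming (clustering lexically similar word forms drawn from the corpus vocabulary and mapping each cluster to a single stem), the following Python sketch groups words that share a common prefix of a minimum length and maps each word to the shortest member of its group. The `min_prefix` threshold, the function names, and the toy vocabulary are assumptions introduced here for illustration and are not taken from the paper.

```python
# Minimal sketch of corpus-based stemming by clustering lexically similar words.
# This is NOT the algorithm proposed in the article; it only illustrates the
# general idea of grouping a corpus vocabulary into classes of (putatively)
# morphologically related words and mapping each class to one stem.

from collections import defaultdict


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cluster_vocabulary(vocab, min_prefix=4):
    """Greedily group words that share a sufficiently long common prefix.

    `min_prefix` is an illustrative threshold, not a value from the paper.
    """
    clusters = defaultdict(list)
    for word in sorted(vocab):              # sorting places prefix-sharing words together
        placed = False
        for key in list(clusters):
            if common_prefix_len(word, key) >= min_prefix:
                clusters[key].append(word)
                placed = True
                break
        if not placed:
            clusters[word].append(word)     # start a new cluster keyed by this word
    return clusters


def build_stem_map(vocab, min_prefix=4):
    """Map every word to the shortest member of its cluster (a crude stem)."""
    stem_map = {}
    for members in cluster_vocabulary(vocab, min_prefix).values():
        stem = min(members, key=len)
        for word in members:
            stem_map[word] = stem
    return stem_map


if __name__ == "__main__":
    vocabulary = {"connect", "connected", "connection", "connections", "running", "runner"}
    stems = build_stem_map(vocabulary)
    print(stems["connections"])   # -> "connect"
```

A real corpus-based stemmer would replace this crude prefix heuristic with the lexical evidence discussed in the article to decide which word forms belong to the same morphological class.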

Notes

  1. http://www.isical.ac.in/~fire

  2. http://clef-campaign.org/

  3. http://terrier.org/

  4. Available at http://members.unine.ch/jacques.savoy/clef/index.html

  5. Available at http://linguistica.uchicago.edu/linguistica.html

  6. Available at http://www.cis.hut.fi/projects/morpho/

  7. Available at http://www.isical.ac.in/~clia/resources.html

  8. Available at http://liks.fav.zcu.cz/HPS/

  9. Available at http://members.unine.ch/jacques.savoy/clef/marathiST.txt

  10. Available at http://members.unine.ch/jacques.savoy/clef/hungarianST.txt

  11. Available at http://members.unine.ch/jacques.savoy/clef/bengaliST.txt

  12. Available at http://www.anc.org/data/oanc/

  13. Available at http://www.inf.u-szeged.hu/

Author information

Corresponding author

Correspondence to Vishal Gupta.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Singh, J., Gupta, V. An Efficient Corpus-Based Stemmer. Cogn Comput 9, 671–688 (2017). https://doi.org/10.1007/s12559-017-9479-z
