An Efficient Corpus-Based Stemmer

Published in Cognitive Computation

Abstract

Word stemming is a linguistic process in which the various inflected forms of a word are matched to their base form. It is among the basic text pre-processing steps used in Natural Language Processing and Information Retrieval. Stemming is employed at the text pre-processing stage to address vocabulary mismatch and to reduce the size of the word vocabulary, and consequently the dimensionality of training data for statistical models. In this article, we present a fully unsupervised corpus-based text stemming method that clusters morphologically related words on the basis of lexical knowledge. The proposed method performs cognitive-inspired computing to discover morphologically related words from the corpus without any human intervention or language-specific knowledge. Its performance is evaluated on inflection removal (approximating lemmas) and Information Retrieval tasks. Retrieval experiments in four languages using standard Text Retrieval Conference, Cross-Language Evaluation Forum, and Forum for Information Retrieval Evaluation collections show that the proposed stemming method performs significantly better than no stemming. For the highly inflectional languages Marathi and Hungarian, the improvement in Mean Average Precision is nearly 50% compared with unstemmed words. Moreover, the proposed unsupervised method outperforms strong state-of-the-art language-independent and rule-based stemming methods in all the languages. Besides Information Retrieval, the proposed method also performs significantly better in inflection removal experiments. The proposed unsupervised, language-independent stemming method can thus serve as a multipurpose tool for tasks such as approximating lemmas, improving retrieval performance, and supporting other Natural Language Processing applications.
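
The method itself is described in the body of the article; as a rough, hypothetical illustration of the general idea of corpus-based stemming (clustering lexically similar word forms drawn from the corpus vocabulary and mapping each cluster to a single stem), the following Python sketch groups words that share a common prefix of a minimum length and maps each word to the shortest member of its group. The `min_prefix` threshold, the function names, and the toy vocabulary are assumptions introduced here for illustration and are not taken from the paper.

```python
# Minimal sketch of corpus-based stemming by clustering lexically similar words.
# This is NOT the algorithm proposed in the article; it only illustrates the
# general idea of grouping a corpus vocabulary into classes of (putatively)
# morphologically related words and mapping each class to one stem.

from collections import defaultdict


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cluster_vocabulary(vocab, min_prefix=4):
    """Greedily group words that share a sufficiently long common prefix.

    `min_prefix` is an illustrative threshold, not a value from the paper.
    """
    clusters = defaultdict(list)
    for word in sorted(vocab):              # sorting places prefix-sharing words together
        placed = False
        for key in list(clusters):
            if common_prefix_len(word, key) >= min_prefix:
                clusters[key].append(word)
                placed = True
                break
        if not placed:
            clusters[word].append(word)     # start a new cluster keyed by this word
    return clusters


def build_stem_map(vocab, min_prefix=4):
    """Map every word to the shortest member of its cluster (a crude stem)."""
    stem_map = {}
    for members in cluster_vocabulary(vocab, min_prefix).values():
        stem = min(members, key=len)
        for word in members:
            stem_map[word] = stem
    return stem_map


if __name__ == "__main__":
    vocabulary = {"connect", "connected", "connection", "connections", "running", "runner"}
    stems = build_stem_map(vocabulary)
    print(stems["connections"])   # -> "connect"
```

A real corpus-based stemmer would replace this crude prefix heuristic with the lexical evidence discussed in the article to decide which word forms belong to the same morphological class.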

Notes

  1. http://www.isical.ac.in/~fire

  2. http://clef-campaign.org/

  3. http://terrier.org/

  4. Available at http://members.unine.ch/jacques.savoy/clef/index.html

  5. Available at http://linguistica.uchicago.edu/linguistica.html

  6. Available at http://www.cis.hut.fi/projects/morpho/

  7. Available at http://www.isical.ac.in/~clia/resources.html

  8. Available at http://liks.fav.zcu.cz/HPS/

  9. Available at http://members.unine.ch/jacques.savoy/clef/marathiST.txt

  10. Available at http://members.unine.ch/jacques.savoy/clef/hungarianST.txt

  11. Available at http://members.unine.ch/jacques.savoy/clef/bengaliST.txt

  12. Available at http://www.anc.org/data/oanc/

  13. Available at http://www.inf.u-szeged.hu/

Author information

Corresponding author

Correspondence to Vishal Gupta.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Singh, J., Gupta, V. An Efficient Corpus-Based Stemmer. Cogn Comput 9, 671–688 (2017). https://doi.org/10.1007/s12559-017-9479-z
