Skip to main content

Learning to Lemmatise Slovene Words

  • Chapter
  • First Online:
Learning Language in Logic (LLL 1999)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 1925))

Included in the following conference series:

Abstract

Automatic lemmatisation is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma to each word in a running text is not trivial: nouns and adjectives, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, as word forms cannot be matched against a lexicon giving the correct lemma, its part-of-speech and paradigm class. The paper discusses a machine learning approach to the automatic lemmatisation of unknown words, in particular nouns and adjectives, in Slovene texts. We decompose the problem of learning to perform lemmatisation into two subproblems: the first is to learn to perform morphosyntactic tagging, and the second is to learn to perform morphological analysis, which produces the lemma from the word form given the correct morphosyntactic tag. A statistics-based trigram tagger is used to learn to perform morphosyntactic tagging and a first-order decision list learning system is used to learn rules for morphological analysis. The dataset used is the 90.000 word Slovene translation of Orwell’s ‘1984’, split into a training and validation set. The validation set is the Appendix of the novel, on which extensive testing of the two components, singly and in combination, is performed. The trained model is then used on an open-domain testing set, which has 25.000 words, pre-annotated with their word lemmas. Here 13.000 nouns or adjective tokens are previously unseen cases. Tested on these unknown words, our method achieves an accuracy of 81% on the lemmatisation task.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Brants, T. (2000). TnT-a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000 Seattle, WA. http://www.coli.uni-sb.de/~thorsten/tnt/.

  2. Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21 (4), 543–565.

    Google Scholar 

  3. Chanod, J., & Tapanainen, P. (1995). Creating a tagset, lexicon and guesser for a French tagger. In Proceedings of the ACL SIGDAT workshop From Text to Tags: Issues in Multilingual Language Analysis Dublin.

    Google Scholar 

  4. Cussens, J. (1997). Part-of-speech tagging using Progol. In Proceedings of the 6th International Workshop on Inductive Logic Programming, pp. 93–108 Berlin. Springer.

    Google Scholar 

  5. Cussens, J., Džeroski, S., & Erjavec, T. (1999). Morphosyntactic tagging of Slovene using Progol. In Džeroski, S., & Flach, P. (Eds.), Inductive Logic Programming; 9th International Workshop ILP-99, Proceedings, No. 1634 in Lecture Notes in Artificial Intelligence, pp. 68–79 Berlin. Springer.

    Google Scholar 

  6. Cutting, D., Kupiec, J., Pedersen, J., & Sibun, P. (1992). A practical part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pp. 133–140 Trento, Italy.

    Google Scholar 

  7. Daelemans, W., Zavrel, J., Berck, P., & Gillis, S. (1996). MBT: A memory-based part of speech tagger-generator. In Ejerhed, E., & Dagan, I. (Eds.), Proceedings of the Fourth Workshop on Very Large Corpora, pp. 14–27 Copenhagen.

    Google Scholar 

  8. Dimitrova, L., Erjavec, T., Ide, N., Kaalep, H.-J., Petkevič, V., & Tufiş, D. (1998). Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages. In COLING-ACL ’98, pp. 315–319 Montréal, Québec, Canada.

    Google Scholar 

  9. Džeroski, S., Erjavec, T., & Zavrel, J. (1999). Morphosyntactic Tagging of Slovene: Evaluating PoS Taggers and Tagsets. Research report IJSDP 8018, Jožef Stefan Institute, Ljubljana. http://nl.ijs.si/lll/bib/dzerzareport/.

    Google Scholar 

  10. Erjavec, T. (1999). The ELAN Slovene-English Aligned Corpus. In Proceedings of the Machine Translation Summit VII, pp. 349–357 Singapore. http://nl.ijs.si/elan/.

  11. Erjavec, T., & (eds.), M. M. (1997). Specifications and notation for lexicon encoding. MULTEXT-East final report D1.1F, Jožef Stefan Institute, Ljubljana. http://nl.ijs.si/ME/CD/docs/mte-d11f/.

    Google Scholar 

  12. Erjavec, T., Lawson, A., & Romary, L. (1998). East meets West: A Compendium of Multilingual Resources. CD-ROM. ISBN: 3-922641-46-6.

    Google Scholar 

  13. Manandhar, S., Džeroski, S., & Erjavec, T. (1998). Learning multilingual morphology with CLOG. In Page, D. (Ed.), Inductive Logic Programming; 8th International Workshop ILP-98, Proceedings, No. 1446 in Lecture Notes in Artificial Intelligence, pp. 135–144. Springer.

    Google Scholar 

  14. Mikheev, A. (1997). Automatic rule induction for unknown-word guessing. Computational Linguistics, 23 (3), 405–424.

    Google Scholar 

  15. Mooney, R. J., & Califf, M. E. (1995). Induction of first-order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, pp. 1–24.

    Google Scholar 

  16. Ratnaparkhi, A. (1996). A maximum entropy part of speech tagger. In Proc. ACL-SIGDAT Conference on Empirical Methods in Natural Language Processing, pp. 491–497 Philadelphia.

    Google Scholar 

  17. Sperberg-McQueen, C. M., & Burnard, L. (Eds.). (1994). Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford.

    Google Scholar 

  18. Steetskamp, R. (1995). An implementation os a probabilistic tagger. Master’s thesis, TOSCA Research Group, University of Nijmegen, Nijmegen. 48 p.

    Google Scholar 

  19. van Halteren, H. (Ed.). (1999). Syntactic Wordclass Tagging. Kluwer.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Džeroski, S., Erjavec, T. (2000). Learning to Lemmatise Slovene Words. In: Cussens, J., Džeroski, S. (eds) Learning Language in Logic. LLL 1999. Lecture Notes in Computer Science(), vol 1925. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40030-3_5

Download citation

  • DOI: https://doi.org/10.1007/3-540-40030-3_5

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41145-1

  • Online ISBN: 978-3-540-40030-1

  • eBook Packages: Springer Book Archive

Publish with us

Policies and ethics