Learning to Lemmatise Slovene Words

Džeroski, Sašo; Erjavec, Tomaž

doi:10.1007/3-540-40030-3_5

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1925)

Included in the following conference series:

International Conference on Learning Language in Logic

Abstract

Automatic lemmatisation is a core component of many language processing tasks. In inflectionally rich languages such as Slovene, assigning the correct lemma to each word in running text is not trivial: nouns and adjectives, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, because their word forms cannot be matched against a lexicon that gives the correct lemma, part of speech and paradigm class. The paper discusses a machine learning approach to the automatic lemmatisation of unknown words, in particular nouns and adjectives, in Slovene texts. We decompose the problem of learning to lemmatise into two subproblems: the first is learning morphosyntactic tagging, and the second is learning morphological analysis, which produces the lemma from the word form given the correct morphosyntactic tag. A statistical trigram tagger is used to learn morphosyntactic tagging, and a first-order decision list learning system is used to learn rules for morphological analysis. The dataset is the 90,000-word Slovene translation of Orwell’s ‘1984’, split into a training set and a validation set. The validation set is the Appendix of the novel, on which the two components are tested extensively, singly and in combination. The trained model is then applied to an open-domain test set of 25,000 words, pre-annotated with their lemmas, in which 13,000 noun or adjective tokens are previously unseen cases. Tested on these unknown words, our method achieves an accuracy of 81% on the lemmatisation task.
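
The two-stage decomposition described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the rule list is hand-written rather than learned, the tags are simplified labels in the style of MULTEXT-East, and `toy_tagger` is a dictionary lookup standing in for the trigram tagger. It shows only how a decision list (ordered rules, first match wins) can map a word form and its tag to a lemma; it is not the system used in the chapter.

```python
from typing import Callable, List, Tuple

# Hypothetical decision list: ordered (tag, suffix, replacement) rules.
# The first rule whose tag and suffix both match is applied.
# These rules are illustrative, not rules learned in the chapter.
RULES: List[Tuple[str, str, str]] = [
    ("Ncfsg", "e", "a"),    # feminine noun, genitive singular: mize -> miza
    ("Ncnsg", "a", "o"),    # neuter noun, genitive singular: mesta -> mesto
    ("Afpmsg", "ega", ""),  # masculine adjective, genitive singular: lepega -> lep
]


def analyse(form: str, tag: str) -> str:
    """Stage 2: produce a lemma from a word form and its morphosyntactic tag."""
    for rule_tag, suffix, replacement in RULES:
        if tag == rule_tag and form.endswith(suffix):
            return form[: len(form) - len(suffix)] + replacement
    return form  # default rule: the lemma is the form itself


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stage 1 stand-in: a lookup table, NOT a trigram tagger."""
    lookup = {"mize": "Ncfsg", "mesta": "Ncnsg", "lepega": "Afpmsg"}
    return [lookup.get(token, "X") for token in tokens]


def lemmatise(tokens: List[str],
              tagger: Callable[[List[str]], List[str]]) -> List[str]:
    """Tag the tokens first, then analyse each token given its predicted tag."""
    tags = tagger(tokens)
    return [analyse(token, tag) for token, tag in zip(tokens, tags)]


if __name__ == "__main__":
    print(lemmatise(["lepega", "mesta"], toy_tagger))
```

Because the analysis step depends on the tag, a tagging error in the first stage can propagate to the lemma, which is why the abstract reports testing the two components both singly and in combination.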

Author information

Authors and Affiliations

Department for Intelligent Systems, Jožef Stefan Institute, Jamova 39, SI-1000, Ljubljana, Slovenia
Sašo Džeroski & Tomaž Erjavec

Editor information

Editors and Affiliations

Department of Computer Science, University of York, YO10 5DD, Heslington, York, UK
James Cussens
Jožef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Sašo Džeroski

About this chapter

Cite this chapter

Džeroski, S., Erjavec, T. (2000). Learning to Lemmatise Slovene Words. In: Cussens, J., Džeroski, S. (eds) Learning Language in Logic. LLL 1999. Lecture Notes in Computer Science (LNAI), vol 1925. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40030-3_5

DOI: https://doi.org/10.1007/3-540-40030-3_5
Published: 01 February 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41145-1
Online ISBN: 978-3-540-40030-1
eBook Packages: Springer Book Archive
