Abstract
We integrate and combine taggers to improve tagging accuracy for Icelandic text. The best-performing integrated tagger, which uses our linguistic rule-based tagger for initial disambiguation and a trigram tagger for full disambiguation, achieves 91.80% accuracy. Combining five different taggers by simple voting yields 93.34% accuracy. Adding two linguistically motivated rules to the combined tagger raises the accuracy to 93.48%. This method reduces the error rate by 20.5% relative to the best-performing tagger in the combination pool.
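The simple-voting step and the error-rate arithmetic reported above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the placeholder tag strings, and the tie-breaking rule (ties go to the tagger listed first) are assumptions made for illustration, since the abstract does not say how ties are resolved.

```python
from collections import Counter


def simple_vote(proposals):
    """Return the tag proposed by the most taggers for one token.

    `proposals` holds one tag per tagger. Ties go to the tag that appears
    first in the list (an assumption; the abstract does not specify this).
    """
    counts = Counter(proposals)
    best = max(counts.values())
    # First tag, in tagger order, that reaches the top vote count.
    return next(tag for tag in proposals if counts[tag] == best)


def error_rate_reduction(baseline_accuracy, new_accuracy):
    """Relative error-rate reduction in percent; both inputs are percentages."""
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical proposals for one token from five taggers (placeholder tags).
votes = ["TAG_A", "TAG_A", "TAG_B", "TAG_A", "TAG_C"]
winner = simple_vote(votes)

# Accuracies taken from the abstract: 91.80% (baseline) and 93.48% (combined).
reduction = round(error_rate_reduction(91.80, 93.48), 1)
```

With the accuracies from the abstract, the error rate falls from 8.20% to 6.52%, and the computed relative reduction rounds to the reported 20.5%.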
Abbreviations
- DDT: data-driven taggers
- HMM: Hidden Markov model
- IFD: Icelandic frequency dictionary
- LMR: linguistically motivated rules
Acknowledgements
Thanks to the Institute of Lexicography at the University of Iceland for providing access to the IFD corpus, and to Professor Y. Wilks for valuable comments and suggestions in the preparation of this paper.
Cite this article
Loftsson, H. Tagging Icelandic text: an experiment with integrations and combinations of taggers. Lang Resources & Evaluation 40, 175–181 (2006). https://doi.org/10.1007/s10579-006-9013-5