Named Entity Recognition for Highly Inflectional Languages: Effects of Various Lemmatization and Stemming Approaches

Konkol, Michal; Konopík, Miloslav

doi:10.1007/978-3-319-10816-2_33

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8655)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue


Abstract

In this paper, we study the effects of various lemmatization and stemming approaches on the named entity recognition (NER) task for Czech, a highly inflectional language. Lemmatizers are seen as a necessary component of Czech NER systems, and they have been used in all papers on Czech NER published so far. It is therefore of the utmost importance to explore their benefits and limits, as well as the differences between simple and complex methods. Our experiments are evaluated on the standard Czech Named Entity Corpus 1.1 as well as on the newly created version 2.0.
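
As a rough illustration of why normalizing word forms matters for NER in an inflectional language, the sketch below collapses several inflected forms of a Czech surname into a single token. It is not the method evaluated in the paper: the suffix list and the `naive_stem` function are hypothetical and deliberately tiny, chosen only to show the general idea of a simple stemmer and one of its limits.

```python
# Hypothetical, deliberately tiny suffix list for illustration only;
# it is NOT taken from the paper or from any real Czech stemmer.
SUFFIXES = ("ovi", "em", "a", "u")


def naive_stem(token: str, min_stem: int = 3) -> str:
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


# Four inflected forms of the surname "Novák" collapse to one stem,
# so a model sees a single feature value instead of four.
forms = ["Novák", "Nováka", "Novákovi", "Novákem"]
stems = {naive_stem(form) for form in forms}

# Limit of plain suffix stripping: "Praze" (locative of "Praha") involves
# a stem alternation, so it is not mapped to the same stem as "Prahu".
prague_stems = {naive_stem(form) for form in ["Prahu", "Praze"]}
```

Here `stems` contains a single element, while `prague_stems` still contains two. A dictionary-based lemmatizer could in principle map both "Prahu" and "Praze" to the lemma "Praha", which is the kind of difference between simple and complex methods that the abstract refers to.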



Author information

Authors and Affiliations

Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia, Univerzitní 8, 306 14, Plzeň, Czech Republic
Michal Konkol & Miloslav Konopík


Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Botanická 6a, 60200, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Department of Information Technologies, Masaryk University, 602 00, Brno, Czech Republic
Aleš Horák, Ivan Kopeček & Karel Pala


About this paper

Cite this paper

Konkol, M., Konopík, M. (2014). Named Entity Recognition for Highly Inflectional Languages: Effects of Various Lemmatization and Stemming Approaches. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_33


DOI: https://doi.org/10.1007/978-3-319-10816-2_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10815-5
Online ISBN: 978-3-319-10816-2
eBook Packages: Computer Science, Computer Science (R0)
