
Robustness of Named Entity Recognition: Case of Latvian

Conference paper in: Statistical Language and Speech Processing (SLSP 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13062)


Abstract

Recent developments in Named Entity Recognition (NER) have demonstrated good results for grammatically correct texts, even in low-resource settings. However, when an NER model faces ungrammatical text, its performance often degrades. In this study, we analyze NER performance on datasets containing errors typical of user-generated texts in the Latvian language. We explore three strategies for increasing the robustness of named entity recognition: injecting errors into grammatically correct texts, augmenting grammatically correct texts with erroneous texts, and augmenting grammatically correct texts with erroneous texts containing specific types of errors. We demonstrate that in low-resource settings, the most noise-robust model is obtained by augmenting the training data with datasets containing different error types. Our best model achieves an average F1 score of 83.5 (vs. 84.1 for the baseline) on grammatically correct text, while maintaining good performance on noisy texts (79 F1 vs. 66 for the baseline).
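The error-injection strategy described above can be sketched as follows. This is an illustrative example, not the authors' released code (their data-augmentation scripts are linked in the Notes below): it corrupts a fraction of tokens with errors typical of Latvian user-generated text (lowercasing, omitted diacritics, character-level typos) while keeping the NER labels aligned, so the noisy copy can be added to the training data. The function names and the error-type mix are assumptions for illustration.

```python
import random

def drop_char(token: str, rng: random.Random) -> str:
    """Delete one random character, simulating a typo."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token))
    return token[:i] + token[i + 1:]

def inject_noise(tokens, labels, rng=None, p=0.1):
    """Corrupt roughly a fraction p of tokens; entity labels stay aligned."""
    rng = rng or random.Random(0)
    noisy = []
    for tok in tokens:
        if rng.random() < p:
            op = rng.choice(["lower", "strip_diacritics", "typo"])
            if op == "lower":
                # Capitalization errors are common in user-generated text
                tok = tok.lower()
            elif op == "strip_diacritics":
                # Latvian diacritics are often omitted in informal writing
                table = str.maketrans("āčēģīķļņšūž", "acegiklnsuz")
                tok = tok.translate(table)
            else:
                tok = drop_char(tok, rng)
        noisy.append(tok)
    return noisy, list(labels)

# Example: a labeled sentence before and after noise injection
tokens = ["Rīga", "ir", "Latvijas", "galvaspilsēta"]
labels = ["B-LOC", "O", "B-LOC", "O"]
noisy_tokens, noisy_labels = inject_noise(tokens, labels, p=0.5)
```

Because labels are copied unchanged, the augmented sentence carries the same annotation as the original, which is what allows clean and noisy versions of the corpus to be mixed during training.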


Notes

  1. https://github.com/RinaldsViksna/NER-data-augmentation


Acknowledgements

This research has been supported by the European Regional Development Fund within the joint research project of SIA Tilde and University of Latvia “Multilingual Artificial Intelligence Based Human Computer Interaction” No. 1.1.1.1/18/A/148.

Author information

Corresponding author

Correspondence to Rinalds Viksna.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Viksna, R., Skadiņa, I. (2021). Robustness of Named Entity Recognition: Case of Latvian. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds) Statistical Language and Speech Processing. SLSP 2021. Lecture Notes in Computer Science, vol 13062. Springer, Cham. https://doi.org/10.1007/978-3-030-89579-2_5

  • DOI: https://doi.org/10.1007/978-3-030-89579-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89578-5

  • Online ISBN: 978-3-030-89579-2

  • eBook Packages: Computer Science (R0)
