
Robustness of Named Entity Recognition: Case of Latvian

Conference paper in: Statistical Language and Speech Processing (SLSP 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13062)


Abstract

Recent developments in Named Entity Recognition (NER) have demonstrated good results for grammatically correct texts, even in low-resource settings. However, when an NER model faces ungrammatical text, its performance often degrades. In this study, we analyze NER performance on datasets containing errors typical of user-generated texts in the Latvian language. We explore three strategies for increasing the robustness of named entity recognition: injecting errors into grammatically correct texts, augmenting grammatically correct texts with erroneous texts, and augmenting grammatically correct texts with erroneous texts containing specific types of errors. We demonstrate that in low-resource settings, the most noise-robust model is obtained by augmenting the training data with datasets containing different error types. Our best model achieves an average F1 score of 83.5 (vs. 84.1 for the baseline) on grammatically correct text, while maintaining good performance on noisy texts (79 F1 vs. 66 for the baseline).
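The error-injection strategy described above can be sketched as follows. This is an illustrative example, not the authors' released code (their data-augmentation scripts are linked in the Notes below): it corrupts a fraction of tokens with errors typical of Latvian user-generated text (lowercasing, omitted diacritics, character-level typos) while keeping the NER labels aligned, so the noisy copy can be added to the training data. The function names and the error-type mix are assumptions for illustration.

```python
import random

def drop_char(token: str, rng: random.Random) -> str:
    """Delete one random character, simulating a typo."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token))
    return token[:i] + token[i + 1:]

def inject_noise(tokens, labels, rng=None, p=0.1):
    """Corrupt roughly a fraction p of tokens; entity labels stay aligned."""
    rng = rng or random.Random(0)
    noisy = []
    for tok in tokens:
        if rng.random() < p:
            op = rng.choice(["lower", "strip_diacritics", "typo"])
            if op == "lower":
                # Capitalization errors are common in user-generated text
                tok = tok.lower()
            elif op == "strip_diacritics":
                # Latvian diacritics are often omitted in informal writing
                table = str.maketrans("āčēģīķļņšūž", "acegiklnsuz")
                tok = tok.translate(table)
            else:
                tok = drop_char(tok, rng)
        noisy.append(tok)
    return noisy, list(labels)

# Example: a labeled sentence before and after noise injection
tokens = ["Rīga", "ir", "Latvijas", "galvaspilsēta"]
labels = ["B-LOC", "O", "B-LOC", "O"]
noisy_tokens, noisy_labels = inject_noise(tokens, labels, p=0.5)
```

Because labels are copied unchanged, the augmented sentence carries the same annotation as the original, which is what allows clean and noisy versions of the corpus to be mixed during training.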


Notes

  1. https://github.com/RinaldsViksna/NER-data-augmentation


Acknowledgements

This research has been supported by the European Regional Development Fund within the joint research project of SIA Tilde and University of Latvia “Multilingual Artificial Intelligence Based Human Computer Interaction” No. 1.1.1.1/18/A/148.

Author information

Corresponding author

Correspondence to Rinalds Viksna.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Viksna, R., Skadiņa, I. (2021). Robustness of Named Entity Recognition: Case of Latvian. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds) Statistical Language and Speech Processing. SLSP 2021. Lecture Notes in Computer Science, vol 13062. Springer, Cham. https://doi.org/10.1007/978-3-030-89579-2_5

  • DOI: https://doi.org/10.1007/978-3-030-89579-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89578-5

  • Online ISBN: 978-3-030-89579-2

  • eBook Packages: Computer Science (R0)
