Injecting Temporal-Aware Knowledge in Historical Named Entity Recognition

González-Gallardo, Carlos-Emiliano; Boros, Emanuela; Giamphy, Edward; Hamdi, Ahmed; Moreno, José G.; Doucet, Antoine

doi:10.1007/978-3-031-28244-7_24

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13980)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

In this paper, we address the detection of named entities in multilingual historical collections. We argue that, besides the multiple challenges that depend on the quality of digitization (e.g., misspellings and linguistic errors), historical documents pose another challenge: such collections span a period of time long enough to be affected by the evolution of natural language. We therefore consider the detection of entities in historical collections to be time-sensitive, and explore the inclusion of temporality in the named entity recognition (NER) task by exploiting temporal knowledge graphs. More precisely, we retrieve semantically relevant additional contexts by exploring the time information provided by historical data collections and include them as mean-pooled representations in a Transformer-based NER model. We experiment with two recent multilingual historical collections in English, French, and German, consisting of historical newspapers (19C-20C) and classical commentaries (19C). The results are promising and show the effectiveness of injecting temporal-aware knowledge across the different datasets, languages, and entity types.
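The retrieval and pooling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Context` class, the `retrieve` and `mean_pool` helpers, the `window` parameter, and the toy two-dimensional embeddings are all hypothetical, and the year-window filter is an assumed stand-in for whatever temporal criterion the paper actually applies. The sketch only shows the general pattern of keeping knowledge-graph facts whose validity interval is close to the document date, ranking them by semantic similarity, and averaging token vectors into a single representation.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class Context:
    """Hypothetical knowledge-graph fact with a validity interval in years."""
    text: str
    start_year: int
    end_year: int
    embedding: List[float]  # toy sentence embedding (assumption)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_emb, doc_year, contexts, window=10, top_k=2):
    """Keep facts valid within `window` years of the document date,
    then return the `top_k` most similar to the query embedding."""
    in_time = [c for c in contexts
               if c.start_year - window <= doc_year <= c.end_year + window]
    ranked = sorted(in_time,
                    key=lambda c: cosine(query_emb, c.embedding),
                    reverse=True)
    return ranked[:top_k]


def mean_pool(token_vectors, mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [v for v, m in zip(token_vectors, mask) if m]
    if not kept:
        return [0.0] * len(token_vectors[0])
    return [sum(col) / len(kept) for col in zip(*kept)]


# Toy usage: a document dated 1885 and three candidate facts.
contexts = [
    Context("fact A", 1850, 1900, [1.0, 0.0]),
    Context("fact B", 1990, 2010, [1.0, 0.0]),  # outside the time window
    Context("fact C", 1880, 1890, [0.0, 1.0]),
]
hits = retrieve([1.0, 0.0], doc_year=1885, contexts=contexts)
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], mask=[1, 1, 0])
```

In this toy run, "fact B" is discarded by the temporal filter even though its embedding matches the query, which is the kind of effect a time-aware retrieval step is meant to produce. In a full system the pooled vector would be combined with the sentence representation inside the NER model; how that combination is done is not specified here.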

Notes

1. https://gallica.bnf.fr/.
2. https://trove.nla.gov.au/.
3. The code is available at https://github.com/EmanuelaBoros/clef-hipe-2022-l3i.
4. https://impresso.github.io/CLEF-HIPE-2020/.
5. https://hipe-eval.github.io/HIPE-2022/.
6. https://deepgraphlearning.github.io/project/wikidata5m.
7. https://www.wikidata.org/.
8. https://github.com/mniepert/mmkb/tree/master/TemporalKGs/wikidata.
9. https://www.elastic.co/guide/en/elasticsearch/reference/8.1/release-highlights.html.
10. We leave out the details that can be consulted in [45].
11. In this case, we do not use the additional Transformer layers with adapters, since these were specifically proposed for noisy/non-standard text and they do not bring any increase in performance on standard text [4].
12. https://huggingface.co/bert-base-multilingual-cased.
13. https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.
14. We utilized the HIPE-scorer https://github.com/hipe-eval/HIPE-scorer.
15. We would expect higher results by utilising the temporal information; however, for this experimental setup, we were limited in terms of resources.
16. https://mromanello.github.io/ajax-multi-commentary/.
17. Although the exact date of its first performance is unknown, most scholars date it to relatively early in Sophocles’ career (possibly the earliest Sophoclean play still in existence), somewhere between 450 BCE and 430 BCE, possibly around 444 BCE.

References

Agarwal, P., Strötgen, J., del Corro, L., Hoffart, J., Weikum, G.: diaNED: Time-aware named entity disambiguation for diachronic corpora. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers). Assoc. Comput. Linguist. Melbourne, Australia. 2, pp. 686–693 (Jul 2018). https://doi.org/10.18653/v1/P18-2109. https://aclanthology.org/P18-2109
Bircher, S.: Toulouse and Cahors are French Cities, but Ti*louse and Caa. Qrs as well. Ph.D. thesis, University of Zurich (2019)
Boros, E., González-Gallardo, C.E., Giamphy, E., Hamdi, A., Moreno, J.G., Doucet, A.: Knowledge-based contexts for historical named entity recognition & linking, pp. 1064–1078. http://ceur-ws.org/Vol-3180/#paper-84
Boroş, E., Hamdi, A., Pontes, E.L., Cabrera-Diego, L.A., Moreno, J.G., Sidere, N., Doucet, A.: Alleviating digitization errors in named entity recognition for historical documents. In: Proceedings of the 24th conference on computational natural language learning, pp. 431–441 (2020)
Boros, E., Nguyen, N.K., Lejeune, G., Doucet, A.: Assessing the impact of OCR noise on multilingual event detection over digitised documents. Int. J. Digital Libr, pp. 1–26 (2022)
Boroş, E., et al.: A comparison of sequential and combined approaches for named entity recognition in a corpus of handwritten medieval charters. In: 2020 17th International conference on frontiers in handwriting recognition (ICFHR), pp. 79–84. IEEE (2020)
Cai, B., Xiang, Y., Gao, L., Zhang, H., Li, Y., Li, J.: Temporal knowledge graph completion: a survey. arXiv preprint arXiv:2201.08236 (2022)
Díez Platas, M.L., Ros Munoz, S., González-Blanco, E., Ruiz Fabo, P., Alvarez Mellado, E.: Medieval Spanish (12th-15th centuries) named entity recognition and attribute annotation system based on contextual information. J. Assoc. Inf. Sci. Technol. 72(2), 224–238 (2021)
Dikeoulias, I., Amin, S., Neumann, G.: Temporal knowledge graph reasoning with low-rank and model-agnostic representations. arXiv preprint arXiv:2204.04783 (2022)
Ehrmann, M., Hamdi, A., Linhares Pontes, E., Romanello, M., Doucet, A.: A Survey of Named Entity Recognition and Classification in Historical Documents. ACM Comput. Surv. (2022). https://arxiv.org/abs/2109.11406
Ehrmann, M., Hamdi, A., Pontes, E.L., Romanello, M., Doucet, A.: Named entity recognition and classification on historical documents: a survey. arXiv preprint arXiv:2109.11406 (2021)
Ehrmann, M., Romanello, M., Bircher, S., Clematide, S.: Introducing the CLEF 2020 HIPE shared task: Named entity recognition and linking on historical newspapers. In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F. (eds.) Advances in information retrieval, pp. 524–532. Springer International Publishing, Cham (2020)
Ehrmann, M., Romanello, M., Clematide, S., Ströbel, P.B., Barman, R.: Language resources for historical newspapers: the impresso collection. In: Proceedings of The 12th Language Resources and Evaluation Conference, pp. 958–968 (2020)
Ehrmann, M., Romanello, M., Doucet, A., Clematide, S.: Introducing the HIPE 2022 shared task: Named entity recognition and linking in multilingual historical documents. In: European Conference on Information Retrieval, pp. 347–354 (2022). https://doi.org/10.1007/978-3-030-99739-7_44
Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Extended overview of CLEF HIPE 2020: named entity processing on historical newspapers. In: CEUR Workshop Proceedings, vol. 2696. CEUR-WS (2020)
Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers. In: Arampatzis, A., et al. (eds.) CLEF 2020. LNCS, vol. 12260, pp. 288–310. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58219-7_21
Bellot, P., et al. (eds.): CLEF 2018. LNCS, vol. 11018. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98932-7
García-Durán, A., Dumančić, S., Niepert, M.: Learning sequence encoders for temporal knowledge graph completion. arXiv preprint arXiv:1809.03202 (2018)
Hamdi, A., et al.: A multilingual dataset for named entity recognition, entity linking and stance detection in historical newspapers. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2328–2334 (2021)
Hamdi, A., Pontes, E.L., Sidere, N., Coustaty, M., Doucet, A.: In-depth analysis of the impact of OCR errors on named entity recognition and linking. Nat. Lang. Eng, pp. 1–24 (2022)
Houlsby, N., et al.: Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning, pp. 2790–2799. PMLR (2019)
Huynh, V., Hamdi, A., Doucet, A.: When to Use OCR Post-correction for Named Entity Recognition? In: Ishita, E., Pang, N., Zhou, L. (eds.) ICADL 2020. LNCS, vol. 12504, pp. 33–42. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-64452-9_3
Kogkitsidou, E., Gambette, P.: Normalisation of 16th and 17th century texts in French and geographical named entity recognition. In: Proceedings of the 4th ACM SIGSPATIAL Workshop on Geospatial Humanities, pp. 28–34 (2020)
Kristanti, T., Romary, L.: Delft and entity-fishing: Tools for CLEF HIPE 2020 shared task. In: CLEF 2020-Conference and Labs of the Evaluation Forum, vol. 2696. CEUR (2020)
Leblay, J., Chekol, M.W.: Deriving validity time in knowledge graph. In: Companion Proceedings of the The Web Conference 2018, pp. 1771–1776 (2018)
Linhares Pontes, E., Cabrera-Diego, L.A., Moreno, J.G., Boros, E., Hamdi, A., Doucet, A., Sidere, N., Coustaty, M.: Melhissa: a multilingual entity linking architecture for historical press articles. Int. J. Digit. Libr. 23(2), 133–160 (2022)
Liu, T., Yao, J.G., Lin, C.Y.: Towards improving neural named entity recognition with gazetteers. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5301–5307 (2019)
Makhoul, J., Kubala, F., Schwartz, R., Weischedel, R., et al.: Performance measures for information extraction. In: Proceedings of DARPA broadcast news workshop, pp. 249–252. Herndon, VA (1999)
Manjavacas, E., Fonteyn, L.: Adapting vs pre-training language models for historical languages (2022)
Mayhew, S., Tsygankova, T., Roth, D.: NER and POS when nothing is capitalized. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp. 6256–6261 (Nov 2019). https://doi.org/10.18653/v1/D19-1650. https://aclanthology.org/D19-1650
Meng, T., Fang, A., Rokhlenko, O., Malmasi, S.: Gemnet: Effective gated gazetteer representations for recognizing complex entities in low-context input. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics. Human Language Technol, pp. 1499–1512 (2021)
Nguyen, N.K., Boros, E., Lejeune, G., Doucet, A.: Impact analysis of document digitization on event extraction. In: 4th workshop on natural language for artificial intelligence (NL4AI 2020) co-located with the 19th international conference of the Italian Association for artificial intelligence (AI* IA 2020). 2735, pp. 17–28 (2020)
Oberbichler, S., Boroş, E., Doucet, A., Marjanen, J., Pfanzelter, E., Rautiainen, J., Toivonen, H., Tolonen, M.: Integrated interdisciplinary workflows for research on historical newspapers: Perspectives from humanities scholars, computer scientists, and librarians. J. Assoc. Inf. Sci. Technol. 73(2), 225–239 (2022)
Pfeiffer, J., Vulić, I., Gurevych, I., Ruder, S.: MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 7654–7673 (2020). https://doi.org/10.18653/v1/2020.emnlp-main.617. https://aclanthology.org/2020.emnlp-main.617
Pradhan, S., Moschitti, A., Xue, N., Ng, H.T., Björkelund, A., Uryupina, O., Zhang, Y., Zhong, Z.: Towards robust linguistic analysis using OntoNotes. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning. Assoc. Comput. Linguist. Sofia, Bulgaria, pp. 143–152. (2013). https://aclanthology.org/W13-3516
Rastas, I., et al.: Explainable publication year prediction of eighteenth century texts with the BERT model. In: Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, pp. 68–77 (2022)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Assoc. Comput. Linguist. Hong Kong, China, pp. 3982–3992 (2019). https://doi.org/10.18653/v1/D19-1410. https://aclanthology.org/D19-1410
Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4512–4525 (2020)
Ri, R., Yamada, I., Tsuruoka, Y.: mLUKE: The power of entity representations in multilingual pretrained language models. In: ACL 2022 (to appear) (2022)
Riedl, M., Padó, S.: A named entity recognition shootout for German. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers) 2, pp. 120–125 (2018)
Romanello, M., Najem-Meyer, S., Robertson, B.: Optical character recognition of 19th century classical commentaries: the current state of affairs. In: The 6th International Workshop on Historical Document Imaging and Processing, pp. 1–6 (2021)
Suárez, P.J.O., Dupont, Y., Lejeune, G., Tian, T.: Sinner clef-hipe2020: sinful adaptation of SOTA models for named entity recognition in French and German. In: CLEF (Working Notes) (2020)
Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147 (2003). https://www.aclweb.org/anthology/W03-0419
Todorov, K., Colavizza, G.: Transfer learning for named entity recognition in historical corpora. In: CLEF (Working Notes) (2020)
Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Proc. Syst. 30 (2017)
Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J., Tang, J.: Kepler: A unified model for knowledge embedding and pre-trained language representation. Trans. Assoc. Comput. Linguist. 9, 176–194 (2021)
Wang, X., et al.: Automated concatenation of embeddings for structured prediction. arXiv preprint arXiv:2010.05006 (2020)
Wang, X., et al.: DAMO-NLP at SemEval-2022 task 11: a knowledge-based system for multilingual named entity recognition. arXiv preprint arXiv:2203.00545 (2022)
Xu, C., Chen, Y.Y., Nayyeri, M., Lehmann, J.: Temporal knowledge graph completion using a linear temporal regularizer and multivector embeddings. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Assoc. Comput. Linguist, pp. 2569–2578 (2021). https://doi.org/10.18653/v1/2021.naacl-main.202. https://aclanthology.org/2021.naacl-main.202
Xu, C., Nayyeri, M., Alkhoury, F., Shariat Yazdi, H., Lehmann, J.: TeRo: A time-aware knowledge graph embedding via temporal rotation. In: Proceedings of the 28th International Conference on Computational Linguistics. Int. Committee Comput. Linguist. Barcelona, Spain, pp. 1583–1593 (2020). https://doi.org/10.18653/v1/2020.coling-main.139. https://aclanthology.org/2020.coling-main.139
Xu, C., Su, F., Lehmann, J.: Time-aware relational graph attention network for temporal knowledge graph embeddings (2021)
Yamada, I., Asai, A., Shindo, H., Takeda, H., Matsumoto, Y.: LUKE: Deep contextualized entity representations with entity-aware self-attention. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) Assoc. Comput. Linguist, pp. 6442–6454 (2020). https://doi.org/10.18653/v1/2020.emnlp-main.523, https://aclanthology.org/2020.emnlp-main.523
Zhu, C., Chen, M., Fan, C., Cheng, G., Zhang, Y.: Learning from history: modeling temporal knowledge graphs with sequential copy-generation networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. 35, pp. 4732–4740 (2021)

Acknowledgements

This work has been supported by the ANNA (2019-1R40226) and TERMITRAD (2020–2019-8510010) projects funded by the Nouvelle-Aquitaine Region, France.

Author information

Authors and Affiliations

University of La Rochelle, L3i, 17000, La Rochelle, France
Carlos-Emiliano González-Gallardo, Emanuela Boros, Edward Giamphy, Ahmed Hamdi, José G. Moreno & Antoine Doucet
Preligens, 75009, Paris, France
Edward Giamphy
University of Toulouse, IRIT, 31000, Toulouse, France
José G. Moreno

Corresponding author

Correspondence to Carlos-Emiliano González-Gallardo.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Jaap Kamps
Université Grenoble-Alpes, Saint-Martin-d’Hères, France
Lorraine Goeuriot
Università della Svizzera Italiana, Lugano, Switzerland
Fabio Crestani
University of Copenhagen, Copenhagen, Denmark
Maria Maistro
University of Tsukuba, Ibaraki, Japan
Hideo Joho
Dublin City University, Dublin, Ireland
Brian Davis
Dublin City University, Dublin, Ireland
Cathal Gurrin
Universität Regensburg, Regensburg, Germany
Udo Kruschwitz
Dublin City University, Dublin, Ireland
Annalina Caputo

About this paper

Cite this paper

González-Gallardo, CE., Boros, E., Giamphy, E., Hamdi, A., Moreno, J.G., Doucet, A. (2023). Injecting Temporal-Aware Knowledge in Historical Named Entity Recognition. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13980. Springer, Cham. https://doi.org/10.1007/978-3-031-28244-7_24

DOI: https://doi.org/10.1007/978-3-031-28244-7_24
Published: 17 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28243-0
Online ISBN: 978-3-031-28244-7
eBook Packages: Computer Science, Computer Science (R0)