1 Introduction

The digital preservation and dissemination of funerary material is of great interest and value to researchers in many disciplines: anthropologists, archaeologists, art historians, cataloguers, cultural historians, genealogists, gender theorists, library scientists, linguists, philologists, sociologists, theologians and perhaps more (Saller & Shaw, 1984; Eckert, 1993; Carmack, 2002; Streiter et al., 2007; Veit & Nonestied, 2008; Long, 2016). Inscriptions on tombstones are a rich source of information for all of these areas of scientific study. However, the objects of study, the tombstones, face a constant threat of disappearance or illegibility, caused by the clearance of graveyards, weathering (Matthias, 1967; Streiter et al., 2007), urban development, landslides, social or religious changes (Streiter et al., 2010a) or even political incentives (Streiter & Goudin, 2011).

Fortunately, several initiatives have been launched to ensure the digital preservation of tombstones and cemeteries (Streiter et al., 2010a; Toscano, 2019; Tobin et al., 2019). While many of these efforts are extremely valuable, they only target the first step of digitalisation. We believe that more can and should be done. In this article, we go a step further and propose a formalism that offers a semantic perspective on funerary material by capturing the meaning of tombstone inscriptions. The ultimate aim is to provide researchers of the above-mentioned disciplines with search capabilities that go beyond simple textual string matches.

This form of semantic search allows researchers to search for specific characteristics of their objects of study. For example, they can search for tombstones with pictures of animals or, more specifically, birds or butterflies; stones with names composed of patronymic suffixes; geographical locations of gravestones that refer to Psalm 103; stones where the deceased were married and where the wife is mentioned before the husband; Frisian headstones of teachers or priests; graves from a certain period with weeping trees; and so on. To enable this kind of semantic search for digital humanities research, formal descriptions are required that are ideally compatible with the Resource Description Framework (RDF) data model and other ongoing efforts on the semantic web (Antoniou et al., 2012).

The objectives of this paper are in line with the foundational work of Streiter et al. (2007), who propose an XML-based annotation scheme for graveyards and tombstones, focusing on the physical properties of tombstones. In our work, we aim to add structure to the representation of tombstone inscriptions. We extend the work of Streiter et al. (2007) by associating a tombstone inscription with a meaning representation that provides a model of the entities mentioned on the stone (people, dates, locations, symbols, religious references) and the relations between them (temporal relations, family relations, occupational relations), maintaining the difference between meaning and reference that Streiter et al. (2007) and Streiter et al. (2008) emphasise to be important to include in a corpus of tombstone annotations.

As the saying goes, there is no better data than more data, and large datasets are more easily produced with the help of computational algorithms to save time and money. In this article, we not only propose a formalism to capture the semantic dependencies of tombstone inscriptions, but also present a corpus of 1,100 semantically annotated images of funerary artefacts. This dataset is an important stepping stone for acquiring larger datasets of annotated data, and we envisage that machine learning could play a pivotal role in this (Tobin et al., 2019). As far as we are aware, this is the second large-scale corpus of tombstones and the first one of European tombstones, since Streiter et al. (2008) presented their corpus for Taiwanese tombstones.

This brings us to the research questions that we aim to answer in this article: What information is provided on tombstones in our study area (the north of the Netherlands)? What expressive power is required for a semantic formalism to represent this multi-modal information? Which vocabulary is required to describe this data in a formal meaning representation? And, finally, what challenges will be faced by computational approaches that have as their objective to learn these semantic annotations automatically?

Fig. 1 Churchyard on Vlieland on 20 February 2020 (left) and the Zuiderbegraafplaats graveyard in the city of Groningen on 1 November 2019 (right)

To answer these questions, we first describe the process of data collection and give an informal characterisation of the types of information found on the tombstones in our dataset (Section 2). Next, in Section 3, we motivate the choice of semantic formalism. We proceed with describing how entities are conceptually instantiated and how semantic relations are formally captured (Section 4). Finally, in Section 5, we discuss the challenges that automatic tombstone reading approaches will have to deal with.

2 Collecting and interpreting tombstone data

2.1 Sources

There are several online databases with images of tombstones that would be appropriate for our purposes. However, as we are mainly interested in the information engraved on the stone, many of these pictures are not useful for our study because, too often, the texts are largely illegible. A second reason for collecting a new, raw dataset is to avoid the problems caused by intellectual property and ownership issues. This ensures that anyone can use our dataset for research purposes, as long as privacy issues of the deceased and their commemorators are taken into account (Footnote 1).

Given these considerations, tombstone inscription data was collected by visiting 40 publicly accessible cemeteries in the three northern provinces of the Netherlands (Groningen, Drenthe and Friesland, see Fig. 1). The appendix lists all of the visited graveyards and churchyards. The cemeteries contain stones of the 18th, 19th, 20th and 21st centuries. The major religion in this part of the Netherlands is Protestantism, and this is reflected in the information found on the stones. Only one of the visited cemeteries was exclusively dedicated to the Roman Catholic Church. Although there are several Jewish and Muslim cemeteries in this area of the Netherlands, they were left out of the current study because of the expertise required to transcribe and interpret scripts written in Arabic or Hebrew.

2.2 Digitalisation

Using a Samsung A50 smartphone, high-resolution pictures of tombstones were taken from a close distance, without the use of a flash, light or tripod. Images were stored together with geolocation (using GPS, Global Positioning System) and time and date information. Files were saved in JPEG format, their orientation adjusted when needed, and irrelevant parts of the background were manually cropped from the image (Fig. 2). Each file was named with a unique identifier. In total, more than 1,000 images were collected in this way. Most of these are images of headstones; some of them capture crosses or other burial artefacts used for commemoration.

Fig. 2 Original tombstone image (left) and cropped version (right)

Photographs were not taken if this interfered with mourners visiting the graveyard. If memorial objects or vegetation obstructed part of the view of the inscriptions, the stones were disregarded instead of temporarily removing these objects for a full view of the tombstone. On rare occasions (fewer than 10 times), leaves, branches or small pieces of dirt were removed from stones to obtain a better photograph.

Not all of the tombs were photographed during a visit to a cemetery. Some stones, especially those crafted from marble, were difficult to photograph because of their reflective surface. Damaged stones or stones with unreadable inscriptions were also disregarded. For the sake of simplicity, funerary objects with information that required more than one image for full capture (e.g., obelisks or headstones with information on the front and back) were either ignored or selectively approached by taking only a single picture of the most important part. Tombs were selected with variety in mind, to build a diverse dataset with respect to language, symbols, usages of names, mentions of occupations, causes of death and family relations.

2.3 Information on stones: a concise, informal characterisation

Tombstones can generally be divided into family gravestones and headstones for individuals. Family gravestones are relatively rare and usually contain no more information than the name of the family. Individual stones contain information about one or more deceased person(s), usually with the dates of birth and death, and on rarer occasions, the date of death and the deceased’s age in years and months (fewer than 3% of the stones in our dataset). Other persons mentioned are the commemorators, commonly in connection with a characterisation of the social relation with the deceased. Other types of information are the person’s job or the organisation they worked for, references to biblical texts, epitaphs, poems, cause of death and the name of the designer of the tombstone.

There is an abundance of linguistic variation for all of these types of information, with use of anaphoric and deictic pronouns, graphical coordination and coreference. Numeral expressions are used, but explicit universal quantification, disjunction or negation was not encountered. In just the three provinces of the northern Netherlands, seven languages were found: Dutch (obviously), plus Frisian (5x), Gronings (3x), English (5x), French (3x), Malay (2x) and Hebrew (1x).

The textual information is often decorated with symbols (more than half of the tombstones in our dataset) and sometimes portraits of the deceased (less than 1% in our dataset). The symbols are variations on common burial themes (weeping willows, broken flowers, winged sandglasses, anchors, axes, books, scythes, skulls, bones), animals (birds, snakes, butterflies, caterpillars, moths), religious or cult icons (Star of David, pentagram, swastika) or more contemporary artefacts (musical instruments, boats, cars, bicycles).

3 Information on stones: a formal characterisation

3.1 Choice of formalism

The quest of deciding on a meaning representation for a particular modelling task boils down to finding a healthy balance between expressive power and inferential tractability. As we have seen in our informal description of our domain, tombstone inscriptions require modelling entities of various kinds (persons, locations, date expressions) and the relations between these (family relations, occupations, references). Since negation and disjunction do not seem to play an important role, a simple graph-like representation (i.e., a fragment of first-order logic without negation) offers sufficient descriptive power for our task, rather than frameworks designed to deal with the richness of natural language, such as File Change Semantics (Heim, 1982) or Discourse Representation Theory (Kamp, 1984), which are equivalent to first-order logic.

Existing representation schemes used for semantic web applications (based on RDF) are not entirely suitable for our purpose either. The format used in the FOAF (Friend of a Friend) ontology is suitable for describing social relations between persons (Ding et al., 2005), but would need considerable extensions to describe entities on tombstones such as biblical references, symbolism and events. In event ontologies, such as the Simple Event Model (Van Hage et al., 2011), events take centre stage, and even though there are two events that play key roles in tombstone descriptions (birth and death), it would be cumbersome to describe these ’default’ events in the format required in these frameworks. Nonetheless, our proposed representation for these two core events found on tombstone descriptions allows for a straightforward mapping to standard event ontologies. Likewise, the Schema.org initiative (Guha et al., 2016) lacks coverage of the specific attributes of tombstones and is therefore not directly suitable for our needs. However, it is consistent with the framework that we propose and future work could provide an extension of Schema.org to include tombstones. Existing formal ontologies defined for applications in the digital humanities, such as CIDOC-CRM (Biagetti, 2016), come very close to our goals but are not tailored to tombstone inscriptions specifically and are therefore not suitable for modelling all of the details required for a good characterisation of the information found on tombstones.

Following this line of reasoning, and in accordance with recent developments in computational linguistics, our choice for representing tombstone interpretations is the PENMAN notation, a representation originally introduced for natural language generation systems (Kasper, 1989; Bateman et al., 1989; Bateman and Paris, 1989; Bateman, 1990), also known under the names of Penman Interface Notation or SPL (Sentence Plan Language). This representation format is flexible (it can be visualised in various ways), is relatively easy to understand by human annotators (Banarescu et al., 2013), is supported by various software packages for evaluation purposes (Cai & Knight, 2013), and several semantic parsers have been developed for it (see Section 5). An example of this kind of representation is shown in Fig. 3.

Fig. 3 Tombstone (left) with annotated meaning representation in SPL (right)

SPL describes entities and relations between them. Every entity is assigned a unique identifier and needs to be declared as an instance of a concept taken from a certain ontology. This is done using the slash notation. For instance, in Fig. 3, p2 is an entity of the concept male.n.02, coded as (p2 / male.n.02 ... ), where the dots are used as a placeholder to add more information about p2. Here, p2 is an arbitrarily chosen identifier that has not been used before (any identifier could have been chosen that had not been used before in the same meaning representation). For clarity, however, we use certain notational conventions in the annotations, and identifiers are coded as a single lower-case letter followed by an integer (e.g., p1, p2, r1, r2, s1, s2, and so on). Constants are written in double quotes. Examples of constants in Fig. 3 are the dates and the proper names of people mentioned on the stone. The meaning representations can also be visualised as a directed graph, as shown in Fig. 4.
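The notation described above is regular enough to be read mechanically. The following is a minimal sketch of a reader for it, not the tooling used for the corpus: the function name `parse_spl` and the example inscription are hypothetical, quoted constants keep their double quotes so that they remain distinguishable from identifiers, and no error recovery is attempted.

```python
import re

# Tokens: parentheses, the slash, :relations, quoted constants, bare symbols.
TOKEN = re.compile(r'\(|\)|/|:[^\s()]+|"[^"]*"|[^\s()/":]+')


def parse_spl(text):
    """Parse an SPL string into (root identifier, concepts, relations).

    concepts maps each identifier to its concept (the part after the slash);
    relations is a list of (source, relation, target) tuples, where a target
    is either another identifier or a quoted constant.
    """
    tokens = TOKEN.findall(text)
    pos = 0
    concepts = {}
    relations = []

    def node():
        nonlocal pos
        if tokens[pos] != "(" or tokens[pos + 2] != "/":
            raise ValueError("expected '(identifier / concept ...'")
        ident = tokens[pos + 1]
        concepts[ident] = tokens[pos + 3]
        pos += 4
        while tokens[pos] != ")":
            relation = tokens[pos]
            pos += 1
            if tokens[pos] == "(":
                target = node()        # nested entity
            else:
                target = tokens[pos]   # constant or re-used identifier
                pos += 1
            relations.append((ident, relation, target))
        pos += 1                       # consume the closing parenthesis
        return ident

    root = node()
    return root, concepts, relations


# Hypothetical inscription, for illustration only.
EXAMPLE = '(t1 / tombstone.n.01 :ent (p2 / male.n.02 :nam "JAN JANSEN"))'
root, concepts, relations = parse_spl(EXAMPLE)
print(root, concepts, relations)
```

Each identifier is declared once with the slash notation and can afterwards be re-used as a bare target, which is how the sketch treats an already introduced referent.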

Fig. 4 Graph visualisation of the tombstone information of Fig. 3

Moreover, the information can be simply stored as a collection of database triples, making it an ideal format to link with other knowledge resources or relational databases on the semantic web. Figure 5 illustrates this idea, as well as some SPARQL (Arenas et al., 2018) example queries that could be used to search for tombstone images with specific attributes.
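To make the triple view concrete, here is a small Python analogue of a query for tombstones that mention female people. It is only a sketch under stated assumptions: the triples are invented for illustration (they are not taken from the corpus), the relation name "instance" for the slash notation is our own choice, and identifiers are assumed to have been made unique across stones.

```python
# Hypothetical triples for two stones; identifiers are assumed to be unique
# across the whole collection, which the per-stone annotations do not require.
TRIPLES = [
    ("t1", "instance", "tombstone.n.01"),
    ("t1", ":ent", "p1"),
    ("p1", "instance", "female.n.02"),
    ("t2", "instance", "tombstone.n.01"),
    ("t2", ":ent", "p2"),
    ("p2", "instance", "male.n.02"),
]


def stones_with_concept(triples, concept):
    """Return the stones that display (via :ent) an entity of `concept`."""
    instances = {s for s, r, o in triples if r == "instance" and o == concept}
    return sorted({s for s, r, o in triples if r == ":ent" and o in instances})


print(stones_with_concept(TRIPLES, "female.n.02"))
```

A triple store with a SPARQL endpoint would replace the two set comprehensions with a single graph pattern; the logic of the lookup stays the same.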

Fig. 5 Representation of the tombstone information of Fig. 3 as RDF-like triples (left) and three example SPARQL queries (right): the first selects tombstone inscriptions with female people, the second selects tombstones with people who passed away before the 20th century, and the third returns symbols on stones with widows mentioned on them

The SPL formalism gives us a general representation format, corresponding to a fragment of first-order logic (see the appendix). What SPL does not specify is the vocabulary of terms: the concepts and relations that are used for a specific domain or task. For instance, in the Abstract Meaning Representation project (Banarescu et al., 2013), SPL is used in combination with relations adopted from PropBank (Palmer et al., 2005) and concepts taken from OntoNotes (Hovy et al., 2006). In our work, we populate our vocabulary with concepts defined in WordNet (Section 3.2). WordNet (Fellbaum, 1998) is a lexical resource used extensively in computational linguistics and information retrieval, and has been converted to standard ontologies (Gangemi et al., 2003) and RDF (Van Assem et al., 2006). In addition, we specify a novel set of relations tailored to the task of tombstone inscription interpretation in order to produce compact meaning representations (Section 3.3).

3.2 Encoding concepts: WordNet

There are two kinds of entities in SPL: constants and variables. Constants (literals) represent concrete values and are either numbers or strings. In our variant of SPL, we enclose strings in double quotes and transform them to upper case for normalisation purposes. Numbers are never enclosed in quotes, unless they are used as names. If an entity is represented by a variable, it needs to be linked to a concept. In our variant of SPL, we adopt WordNet synonym sets (synsets) as concepts (Fellbaum, 1998). A synset is a set of lemmata that, when used in an appropriate context, have the same or similar meaning. Ambiguous words are therefore binned in two or more different synsets.

A synset can be referred to by its identifier or by one of its members and a sense number. For notational convenience, we choose the latter option and use the lemma – part-of-speech – sense format familiar in natural language processing. For instance, oak.n.01 is the concept for “the hard durable wood of any oak, used especially for furniture and flooring”, whereas oak.n.02 is the concept for “a deciduous tree of the genus Quercus; has acorns and lobed leaves”. Here, n stands for the part-of-speech noun, and the numbers refer to the first and second sense of the noun “oak”. The other parts of speech are v (verb), a (adjective) and r (adverb).
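The lemma, part-of-speech and sense components can be separated with a single regular expression. The sketch below is illustrative; the names `SynsetId` and `parse_synset_id` are ours, and only the four part-of-speech tags listed above are accepted.

```python
import re
from typing import NamedTuple

POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective", "r": "adverb"}

# lemma . part-of-speech . sense number, e.g. oak.n.02
SYNSET_RE = re.compile(r"^(?P<lemma>.+)\.(?P<pos>[nvar])\.(?P<sense>\d+)$")


class SynsetId(NamedTuple):
    lemma: str
    pos: str
    sense: int


def parse_synset_id(identifier):
    """Split an identifier such as 'oak.n.02' into its three components."""
    match = SYNSET_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a lemma.pos.sense identifier: {identifier!r}")
    return SynsetId(match["lemma"], match["pos"], int(match["sense"]))


tree = parse_synset_id("oak.n.02")
print(tree.lemma, POS_NAMES[tree.pos], tree.sense)
```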

WordNet is organised by relations between synsets. The hypernym relation is especially useful for implementing a semantic search, because it gives us the means to assign specific as well as general concepts to objects. Some examples of how to exploit WordNet relations for the purpose of producing background knowledge are given in the appendix.
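The role of hypernyms in a semantic search can be sketched as follows. The hypernym table is a hand-written, simplified excerpt in which intermediate levels are omitted; it is an assumption made for illustration, and a real system would read these links from WordNet itself.

```python
# Simplified child -> parent links; intermediate WordNet levels are omitted.
HYPERNYM = {
    "bird.n.01": "vertebrate.n.01",
    "vertebrate.n.01": "animal.n.01",
    "butterfly.n.01": "insect.n.01",
    "insect.n.01": "animal.n.01",
    "willow.n.01": "tree.n.01",
}


def is_a(concept, ancestor, hypernym=HYPERNYM):
    """True if `concept` equals `ancestor` or lies below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = hypernym.get(concept)  # None once the top is reached
    return False


# A query for "animals" matches symbols annotated with more specific concepts.
symbols = ["bird.n.01", "willow.n.01", "butterfly.n.01"]
print([s for s in symbols if is_a(s, "animal.n.01")])
```

Annotators can therefore assign the most specific concept that applies, and a general query still retrieves the stone.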

English WordNet, a lexical database incorporating an ontology (Fellbaum, 1998), was chosen for its large coverage. It turns out that we can use the same base of concepts for describing persons, locations, jobs, family roles, symbols, religious references, and more. Even though we make use of the English WordNet, extension to multilingual WordNets (Bond & Foster, 2013) is certainly an idea worth considering. It is, however, important to realise that the meaning representations show concepts, not English words, and although these concepts are triggered by words, we assume some language-independence here. It could be the case that extensions are required in the future, as some non-English words could be untranslatable into English directly. Nonetheless, the original WordNet causes no substantial limitations for our current undertaking.

3.3 Encoding relations

The WordNet concepts provide an ontology of entities. However, we also need a vocabulary of relations to connect these entities. It is possible to re-use existing general purpose ontologies and adopt the event-based thematic roles from PropBank (Palmer et al., 2005), VerbNet (Kipper et al., 2008) or FrameNet (Baker et al., 1998). Here, instead, we design our own set of relations, tailored to the task of describing tombstone texts, to keep the resulting representations compact.

A relation is signalled by a colon in SPL. In the example in Fig. 3, we have several relations, such as :sym, :ent, :nam, :rol and :dob. An entity can be linked to one or more entities. For instance, in Fig. 3, the entity p1 is related to entity r1 by the :rol relation, and related to the constant "01-05-1933" by the :dod relation. This gives us domain-specific relations such as :dod in Fig. 3, rather than the more cumbersome “X died in event E, and E took place on date D”. A link to general purpose relations can still be provided by postulating a set of axioms (see the appendix).

The general idea of a meaning representation is that it not only shows the entities involved but also how they relate to each other. In our domain of gravestone inscriptions, we ensure that the resulting meaning representation satisfies the property of connectivity with the help of the :ent and :sym relations that relate the tombstone to the entities that it displays. As a tombstone could comprise information on more than one deceased person without supplying information on how these people relate to each other, the root of the meaning representation is always the tombstone itself, as the example in Fig. 4 illustrates.

SPL has a feature that makes it possible to invert relations (Bateman, 1990). An inverse relation is indicated with the -of suffix. Take, for instance, the meaning representation (b1 / book.n.01 :prt (c1 / chapter.n.01)), which says that a book b1 (the root) has a chapter c1. We can bring the chapter to the foreground by inverting the relation, arriving at (c1 / chapter.n.01 :prt-of (b1 / book.n.01)), where now the root is taken by c1, a chapter of book b1. It is important to note that these two meaning representations are equivalent, but are just packaged differently. This ability is sometimes useful for annotation purposes. We use inverse relations for biblical references, geographical relations and comparative constructions.
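The equivalence of the two packagings becomes explicit once relations are written as (source, relation, target) triples: an inverse relation is undone by swapping its arguments and dropping the suffix. The sketch below assumes that no ordinary relation name ends in -of; the function name is ours.

```python
def normalise_inverse(triple):
    """Rewrite (x, :rel-of, y) as (y, :rel, x); leave other triples as-is."""
    source, relation, target = triple
    if relation.endswith("-of"):
        return (target, relation[: -len("-of")], source)
    return triple


# Both packagings of the book/chapter example reduce to the same triple.
print(normalise_inverse(("c1", ":prt-of", "b1")))
print(normalise_inverse(("b1", ":prt", "c1")))
```

Normalising in this way before comparing or querying annotations means that the annotator's choice of root does not affect the result.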

The complete collection of relations (and their definitions) is presented in Table 1, covering 32 different relation types. The usages of these relations and how they relate to concepts are explained in Section 4, where we discuss the phenomena found in the inscriptions.

Table 1 Relations used in the tomb meaning representations

4 A formal description of tombstone phenomena

The number of types of entities found in the tomb inscriptions of our collection is, although limited, perhaps more than would be expected, and comprises people, dates, locations, occupations, organisations, family roles, measures, quotes (epitaphs) and references. Besides textual entities, objects are also displayed with the use of graphical images.

In this section, we describe the challenges to represent these entities in a formal way, by summarising the variety in which they appear on tombstones, how they can be linked to an ontology and, for computational purposes, how they can be represented in a normalised way. We cover nearly all of the types of information found on tombstones in our dataset, with the exception of epitaphs and cause of death. Epitaphs pose challenges because of the variety that they introduce and are perhaps less suitable to be included in a meaning representation. Cause of death is rarely mentioned on the tombstones in our dataset (less than 0.1%) and therefore represents a sample that is too small on which to design and test a proper meaning representation.

4.1 People

The names of the deceased or the commemorator (sometimes both) may appear on tombstones. On rare occasions (less than 1% in our dataset), the designer of the stone is also mentioned, although usually in a less prominent position. Full proper names are usually used for the deceased, although sometimes only first names appear (especially for children) or surnames with initials or abbreviations of given names, and occasionally titles or patronymic suffixes are displayed as well. For married women, maiden names are regularly provided, and alternative names or nicknames can sometimes be found as well. Only on very rare occasions are people left unnamed (in 24 cases). Furthermore, people can also be implicitly introduced by the use of possessive pronouns (“my husband”, “our aunt”).

Named people are linked to a stone with the :ent relation. Usually (in 95% of the cases), the perceived gender of the person can be determined on the basis of the first name, maiden name, title or family relations provided on the stone. The WordNet concept female.n.02 is used for female persons and male.n.02 for male persons, without suggesting that these are the only options for representing gender. In cases where the gender identity of the person is non-binary or other-gendered, or in case the gender of the person cannot be recovered on the basis of the information provided by the inscription, the gender-neutral person.n.01 is chosen. For names on family graves, the concept family.n.04 is used. Groups of persons are coded with the WordNet concept people.n.01, meaning “a group of human beings”.

Names are represented as constants; that is, in upper-cased strings in double quotes. Abbreviations occur frequently in names, so the annotation guidelines should be clear on when to expand or not. Initials are not expanded because the ability to do this depends heavily on external resources (e.g., online family trees); multiple initial letters are encoded without spaces between punctuated letters. Abbreviated prepositions in the family name (common in Dutch surnames) are expanded, because these are mostly unambiguous and add another level of normalisation in the meaning representation.

If more than one name is used for the same person, and they are different (for instance in the case of nicknames), an entity will be associated with more than one name via multiple use of the :nam relation. However, if the several names given on a stone for the same person differ only in how specific they are, only the most specific name is coded in the meaning representation (because the less specific name is subsumed by the more specific one, following basic principles of semantics). Pronouns are sometimes used in tomb inscriptions to establish coreference between named entities. These are not only possessive pronouns, but also the pronoun “both” (Dutch beiden).

Patronymic suffixes, honorifics and titles are not considered to be part of the proper name itself and introduce their own relations, either via :pfx (prefix) or :sfx (suffix). Examples of name prefixes are honorifics or titles (paraphrased in English): IR (engineer), JNKV (lady), DS (minister), MEVROUW (Mrs), HEER (Mr), DR (doctor). Examples of name suffixes are AZ (Arent’s son), B.ZOON (Berend’s son), EZ (Egbert’s son or Evert’s son), GZN (Geert’s son), JZ or JZN (Jan’s son), HZ or HZN (Hendrik’s son), HARMZN (Harm’s son), PZ (Pieter’s son), KZ (Koop’s son), RZ or RZN (Roelof’s son), WZN (Willem’s son). These examples show that, even though abbreviated prefixes are generally resolvable, patronymic suffixes require careful study of genealogical databases. We therefore choose not to expand suffixes, and for consistency we do not expand prefixes either.

The representation of names that we propose is deliberately simple. It is possible to analyse names further by breaking them down into given names, family names, maiden names, nicknames, and perhaps more. Doing so is not straightforward in some cases without the addition of external knowledge resources, and the solution we adopt has the advantage of being applicable to many cultures and traditions of naming people. Future work could address a more detailed semantic analysis of personal names.

4.2 Date expressions

An essential piece of information on a gravestone is the date of death (although family graves commonly do not contain dates). In most cases, both the date of birth and date of death are shown (in that order), often with a textual (Fig. 3) or symbolic (Fig. 6) sign that indicates birth or death. The three components of a date (the day, the month and the year) are usually all present and, when they are, they come in this order in our dataset. In some cases (1.3%), only the year is provided.

Fig. 6 Tombstone with biblical reference (left) and corresponding meaning representation (right)

The day of the month and the year are nearly always written in numbers, although the century may sometimes be omitted (e.g., ’42 for 1942). The expressions denoting months show a large variety, ranging from simple numbers to Roman numerals, full names (sometimes with different spellings) in lower-case or upper-case characters and abbreviations in various forms. There is also a noticeable heterogeneity in the use of punctuation in date expressions, with short, long and lowered hyphens, colons, periods and diverse spacing. An impression of this variety is given in Table 2.

Table 2 Different formats of date expressions found on tombstones in Groningen with their normalisation

All date expressions are normalised using the format "YYYY-MM-DD", where DD are digits for the day of month (01–31), MM digits for the month (01–12) and YYYY digits for the year. In case the days and months are not provided, "XX" is used to represent this information. In partial dates where the century is not provided, contextual information (the state of the tombstone, general knowledge of life expectancy) or online genealogy databases are used to complete the missing information.

We adopt this compact notation of date expressions for practical reasons, as it will ease the process of annotation. However, for the purpose of temporal inference (see the appendix) or fine-grained evaluation, we may want to deconstruct this notation into its atomic parts. For instance, the 1st of February 2003 is represented in our default scheme by the constant "2003-02-01". It is straightforward to break this down automatically into the structured, but equivalent (d1 / date.n.05 :dom "01" :moy "02" :yoc "2003"), where the relations :dom, :moy, :yoc designate the day of month, month of year, and year of century, respectively.
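The automatic breakdown mentioned above can be sketched in a few lines of Python. The function assumes the layout of the "2003-02-01" example (year, month, day, separated by hyphens) and accepts "XX" for an unknown day or month; the function name and the default identifier d1 are illustrative choices.

```python
import re

# Year, then month and day, either of which may be the placeholder "XX".
DATE_RE = re.compile(r"^(\d{4})-(\d{2}|XX)-(\d{2}|XX)$")


def decompose_date(constant, identifier="d1"):
    """Turn a compact date constant into the structured SPL form."""
    match = DATE_RE.match(constant)
    if match is None:
        raise ValueError(f"unexpected date constant: {constant!r}")
    year, month, day = match.groups()
    return (f'({identifier} / date.n.05 '
            f':dom "{day}" :moy "{month}" :yoc "{year}")')


print(decompose_date("2003-02-01"))
```

Whether unknown components should be kept as "XX" or dropped from the structured form is not settled here; the sketch simply keeps them.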

4.3 Toponyms

When toponyms (names of villages, towns or cities) are used in our dataset, they commonly denote the birthplace or the place of death. Location names are occasionally (4%) used in connection with the deceased’s line of work. Note that, on tombstones of different cultures, toponyms can also be used to indicate the origin of settlers or social identity, as displayed on Taiwanese tombstones (Streiter et al., 2010b; Streiter et al., 2008). Besides the name of a village or town, further information is sometimes given for non-local places or ambiguous place names, such as the region, province or country. Many tombstone descriptions contain location names. In our dataset, we find an average of one mentioned location per tombstone, but if a location name is found on a stone, it is usually more than one (in 70% of the cases).

Interpreting toponyms raises the challenge of grounding; in other words, of mapping to a location in the real world (Leidner, 2008; Kew et al., 2019). Names are sometimes abbreviated, or are written following a non-standard or old-fashioned spelling, and this can make them hard to interpret (Table 3). An additional problem is that many names for locations are ambiguous with respect to their reference in the real world. Villages that go by the same name are widespread in the Netherlands (some examples from our dataset are Den Oever, Erica, Hoorn, Langelo, Nes, Niekerk, Noordwolde, Oostwold, Spijk, Siegerswoude, Westerlee, Winsum, Zevenhuizen and Zuidwolde) or can be used in other countries (Kampen, Oostburg, Roden, Zutphen). Another type of referential ambiguity is caused by cities, municipalities or provinces that share the same name (Groningen, Grootegast, Vries).

Table 3 Examples of normalisation of toponyms with their GeoNames identifiers

Our approach for toponym interpretation is similar to that of Leidner (2008), who grounds toponyms with an extensional semantics in the form of a pair of latitude and longitude of the centroid of the location. However, instead of using geographical coordinates, we consult the geographical database GeoNames (Maltese and Farazi, 2013) for a unique identifier for a given place name. GeoNames is a widely-used multilingual gazetteer in the semantic web (Ahlers, 2013) that makes it possible to link places to geographical coordinates and further topographical data. It also contains historical names, which is convenient as some villages found on old tombstones no longer exist or have been incorporated into larger urban districts (Footnote 2). Example mappings from toponyms to GeoNames identifiers are shown in Table 3.

As with names of persons, location names are orthographically represented by upper-cased strings. Names of locations may differ across languages, and it would be counter-intuitive to translate names in a particular language for the sole purpose of normalisation. Instead, we preserve the original name and annotate the language using the three-letter codes proposed in the ISO 639-2 standard. A full example in SPL for a toponym (the Dutch-Frisian city of Leeuwarden) is the following:


(c1 / city.n.01 :nam (n1 / name.n.01 :lit "LJOUWERT" :lan "FRY") :geo "2751792")

As this example shows, we also associate every toponym with a WordNet concept that uniquely identifies its type. Table 4 shows concepts that we use in our domain, together with their definitions adapted from WordNet.

Table 4 Types of toponyms, defined by WordNet concepts

A complete SPL representation with toponyms is shown in Fig. 6. This figure also demonstrates the shorthand notation that we use in SPL for Dutch place names (the majority), because we only encode the language if it is not Dutch, to make the annotation task lighter for the human annotator. Hence, the shorthand expression :nam "LEEUWARDEN" is interpreted as :nam (n1 / name.n.01 :lit "LEEUWARDEN" :lan "NLD").
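The shorthand convention can be illustrated with a small sketch. The Python below is not part of any SPL tooling described in this article: the function name `expand_name`, the fixed variable `n1` and the string-based handling are assumptions made purely for illustration. It rewrites a bare string value of `:nam` into the full name entity, with Dutch (`NLD`) as the default language.

```python
def expand_name(value: str, var: str = "n1", default_lang: str = "NLD") -> str:
    """Expand the shorthand value of a :nam relation into a full name entity.

    A bare quoted string such as '"LEEUWARDEN"' is read as a Dutch name.
    A value that is already an entity (it starts with '(') is returned as is.
    """
    value = value.strip()
    if value.startswith("("):
        # Already in full form, e.g. a Frisian name annotated with :lan "FRY".
        return value
    literal = value.strip('"')
    return f'({var} / name.n.01 :lit "{literal}" :lan "{default_lang}")'


print(expand_name('"LEEUWARDEN"'))
# (n1 / name.n.01 :lit "LEEUWARDEN" :lan "NLD")
```

Names in other languages are left untouched by this sketch, because they must already carry an explicit `:lan` value.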

Finally, we found two types of locative demonstrative pronouns in our dataset that are used to refer to locations. The Dutch word aldaar (‘the earlier mentioned place’) is regularly used for the place of death in case it is the same as the birthplace of the deceased (in our dataset, it is never utilised to refer to a location mentioned in connection with another deceased person). Because it refers back to an already mentioned place name, an earlier introduced referent in the SPL representation is re-used, as Fig. 6 shows. Sometimes, the Dutch word alhier (‘here, at this place’) is used to refer to the village, town or city in which the cemetery is situated. This information is impossible to retrieve from the text on the stone, but the metadata of the stone’s image contains the GPS coordinates of the location, which are then used to determine the place name.

4.4 Family relationships

Family relationships (such as wife, husband, mother, father, grandmother, grandfather, great-grandmother, great-grandfather, widow, widower, parent, child, grandparent, grandchild, daughter, son, spouse, aunt, uncle, sister, brother, and rarer ones like grand-aunt, grand-uncle, daughter-in-law, brother-in-law, sister-in-law, foster-parent and housemate) are often displayed on tombstones. We propose to represent them as WordNet concepts. However, as they express a relationship between two persons (or groups of persons), we also need to have the means to signify the person that carries out the role of the relation, and the person that is the target. These links are made via the :rol and :tgt relations.

Several instances of family roles are shown on the tombstone introduced earlier in Fig. 3. Three roles are supplied for person p1: she plays the role of wife.n.01 (“LIEVE VROUW”), mother.n.01 (“MOEDER”) and spouse.n.01 (“ECHTGENOOTE”). As the meaning representation for this stone shows, roles introduce their own entity variable; in this case, we have r1, r2 and r3. The first two roles do not explicitly express their target on the inscription. However, the target of the first role, that of wife, can be derived from further information on the stone, and is denoted by p2. We therefore end up with :rol (r1 / wife.n.01 :tgt p2). The target of the mother role is left unspecified – the stone provides no information. The target of the third role, that of spouse, is explicit: it is again p2. Note that, although the meaning representation contains redundant information, as spouse.n.01 subsumes wife.n.01, we opt to represent both family relationships for reasons of consistency (see the Appendix for notes on inferences).

The motivation to have explicit entities for roles is that a person can play several roles – even the same type of role with different targets – and that roles can be anchored in time. For example, a person can first be a widower of one person, and later a widower of another person. This example also brings up an interesting characteristic of roles, which is the time span that they implicitly denote. This is sometimes made explicit, and to connect a role to a measurement phrase we use the :dur relation conveying duration. If several roles are mentioned, they can also be compared to each other, as in “first widow of X, later of Y”. To express these temporal relations, we use :bef (before) and :aft (after) as relations between roles.

4.5 Phrases of measurement

Tombstones sometimes provide information about the age of the deceased or about the length of a certain state of relationship. This phenomenon is relatively rare in our dataset: nearly 4% of tombstones have descriptions of measurements. We use the WordNet concept measure.n.02 (“how much there is or how many there are of something that you can quantify”) as a group with members that denote a certain unit of measurement (see the Appendix for further information on the semantic interpretation of groups) and a quantity that specifies the number of units that have been measured.

Measurement phrases on stones denote periods and are measured in years, and sometimes (in 10% of the cases) in finer units of time (months or days). In our dataset, we capture these with the following WordNet concepts for units of measurement: year.n.01, month.n.01 and day.n.01. Quantities are described by numbers and linked to the measurement with the :qua relation. Numbers, as usual in formal systems, are constants, but for notational simplicity we do not enclose them in double quotes. For instance, the expression “75 years” is represented as a measure comprising 75 members of the unit of measurement year:


(m1 / measure.n.02 :qua 75 :mem (u1 / year.n.01))

We introduce two relations to link measurements to other entities: :age to specify the age of people and :dur to specify the duration of relationships (see Table 1). Moreover, measurement phrases can be combined with vague modifiers, and global units are sometimes joined with finer units of measurement to add precision to the phrase.

First consider the latter. This concerns measurement phrases that combine different units of measurement, as in the phrase “57 years and four months”. The problem here is that there is some summation taking place of the two measures expressed, rather than viewing two independent phrases separately. One solution would be to normalise the two different units to a common unit and sum the quantities. In the example above, this would yield a measure of units of 57 ∗ 12 + 4 = 688 months. However, the meaning representation would deviate considerably from the surface string. An alternative, and the solution that we propose, is to represent complex measures in a recursive way, using sub-groups:


(m1 / measure.n.02 :sub (m2 / measure.n.02 :qua 57 :mem (u1 / year.n.01)) :sub (m3 / measure.n.02 :qua 4 :mem (u2 / month.n.01)))
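The two options discussed above can be compared side by side in a short sketch. The Python below is illustrative only: the pair-based input, the variable numbering and the `MONTHS_PER_UNIT` table are assumptions, and days are deliberately left out because they have no fixed length in months. One function renders the recursive representation with sub-groups, the other performs the normalisation to a common unit.

```python
# Months per unit; days are omitted because they have no fixed length in months.
MONTHS_PER_UNIT = {"year.n.01": 12, "month.n.01": 1}


def complex_measure(parts):
    """Render (quantity, unit) pairs as a measure with one sub-group per pair."""
    subs = []
    for i, (quantity, unit) in enumerate(parts, start=1):
        subs.append(
            f":sub (m{i + 1} / measure.n.02 :qua {quantity} :mem (u{i} / {unit}))"
        )
    return "(m1 / measure.n.02 " + " ".join(subs) + ")"


def total_months(parts):
    """Normalise the same pairs to a single quantity expressed in months."""
    return sum(quantity * MONTHS_PER_UNIT[unit] for quantity, unit in parts)


parts = [(57, "year.n.01"), (4, "month.n.01")]
print(complex_measure(parts))
print(total_months(parts))  # 57 * 12 + 4 = 688
```

The recursive form keeps both surface quantities (57 and 4) visible, whereas the normalised total can always be derived from it when a single number is needed for search or comparison.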

Measurements can also be modified with expressions that introduce some kind of vagueness. Two different types of modification for measurements are encountered in our dataset: the Dutch adverbs ruim (over) and bijna (nearly). The meaning of ruim, as in “ruim 40 jaar”, indicates an amount more than the concrete value specified by the phrase that it modifies, but its precise meaning is hard to pinpoint. It seems to carry a conventional implicature (Grice, 1975) too – in our example it conveys that the age is less than, but close to, 41 years. Its antonym, bijna, as in “bijna 54 jaar”, denotes an amount less than that expressed by the phrase it modifies. Again, it triggers a conventional implicature (namely that the amount that is lacking to fulfil the specified amount is very small and less than one unit of it).

A simple, practical solution is to ignore this type of vague modification. As ruim N U (where N is a number and U a unit of measurement) entails N U, we could just ignore the contribution of the modifier in our semantic analysis. Likewise, for bijna N U, we could represent its meaning as N-1 U. However, this strategy could have undesired consequences (for instance, consider the case where N = 1) and places little faith in our journey of formalisation. The direction that we take is to make these modifiers explicit in the meaning representations, by introducing an additional measure entity and relating it to the entity that it modifies. We then make a comparison between these two measures, and specify whether the first is more or less than the second. For instance, “more than 40” is mapped to the following representation:


(c1 / more.r.01 :op1 (m1 / measure.n.02) :op2 (m2 / measure.n.02 :qua 40 :mem (u1 / year.n.01)))

Here we have the WordNet concept more.r.01 (and likewise, we adopt the complementary less.r.01), which is related to the comparator via :op1 and the comparand via :op2. We apply relation inversion (see Section 3.3) to make the measure specified by the comparator the root of the meaning representation:


(m1 / measure.n.02 :op1-of (c1 / more.r.01 :op2 (m2 / measure.n.02 :qua 40 :mem (u1 / year.n.01))))
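Relation inversion can be pictured as an operation on single triples. The sketch below assumes, based only on the example above, that inversion swaps the two arguments of a relation and adds (or removes) the suffix `-of`; the function name `invert` and the tuple encoding are hypothetical.

```python
def invert(triple):
    """Swap source and target of a triple and toggle the '-of' suffix."""
    source, relation, target = triple
    if relation.endswith("-of"):
        inverted = relation[: -len("-of")]
    else:
        inverted = relation + "-of"
    return (target, inverted, source)


# The comparison c1 has the measure m1 as its first operand ...
print(invert(("c1", ":op1", "m1")))
# ... which is the same information as m1 being the :op1-of c1.
```

Applying the function twice returns the original triple, so the inverted notation changes only which entity is presented as the root, not the content of the representation.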

This representational treatment of vague modifiers gives us a first handle on their semantic impact, but future work, including empirical research (i.e., looking at more data), is required to fully pinpoint the meaning of this interesting linguistic phenomenon.

4.6 Occupations and organisations

Occasionally, information about the line of work in the life of the deceased is provided on a tombstone (in about 8% of the stone inscriptions in our dataset). We integrate this information into our meaning representation for stone inscriptions by combining an existing classification of job descriptions (HISCO) with WordNet concepts (Table 5). HISCO, short for Historical International Standard Classification of Occupations, is a multilingual resource for classifying occupations (Van Leeuwen et al., 2002; Tobin et al., 2019), and it provides a hierarchical way of organising job titles. This is accomplished with a single code comprising five digits, where the first digit indicates a general class, and the digits following make the occupation more specific (e.g., “0” represents generic technical workers, “04” officers of ships and aircraft, “041” aircraft pilots, and “04120” air transport pilots). HISCO contains examples of occupations in several languages, including Dutch, and is therefore especially suitable for our domain of interest.

Table 5 Occupations found on Dutch tombstones, and corresponding WordNet concepts and HISCO code

Occupations are linked to the person with the :occ (has occupation) relation. The occupation itself is represented by a WordNet concept and is connected to a HISCO code with the relation :hco. For instance, the job of goudsmid is represented as :occ (o1 / goldsmith.n.01 :hco "88050").
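The hierarchical reading of HISCO codes can be made concrete with a short sketch. It assumes, following only the example given above (“0”, “04”, “041”, “04120”), that the levels correspond to the first one, two, three and five digits of a code; the function name `hisco_levels` is hypothetical and no class labels are looked up.

```python
def hisco_levels(code: str) -> list[str]:
    """Return the prefixes of a five-digit HISCO code, from general to specific."""
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected a five-digit code, got {code!r}")
    return [code[:1], code[:2], code[:3], code]


print(hisco_levels("04120"))  # ['0', '04', '041', '04120']
print(hisco_levels("88050"))  # the code used for goldsmith.n.01 above
```

Such prefixes would make it possible to search for all occupations in a general class, rather than for one specific job title.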

In some cases, further information about the job is given on stone inscriptions: the period or duration of the occupation, the organisation that the person worked for, or the location of the workplace. A period is indicated with the :beg and :end relations. Locations (toponyms) are linked to the occupation using the :loc relation. Organisations are represented in a similar way as people: as named entities combining an appropriate WordNet concept for the type of organisation with its name. Organisations comprise, for instance, companies (company.n.01), schools (school.n.01), unions (union.n.07), religious communities (community.n.02) and universities (university.n.02). They are linked to the occupation using the :org relation.

4.7 Biblical references and quotes

Quotes from the Bible, or short references to these texts, appear on around 15% of the tombstones in our dataset. They are meant as a source of comfort and relief for the commemorators, and their use is typical on Protestant gravestones (Krommenhoek, 2016). The references are highly structured, referring to one or more verses of a book in the Old or New Testament. For instance, “Ps.119:1 BER” refers to the rhymed version of the first verse of Psalm 119. There are three ways in which references appear on stones: a reference without the text itself; the text without its reference; or a combination of reference and text. In the first case, we capture just the reference in the meaning representation (because various versions of the text exist, so it is impossible to tell which one was meant). In the second and third cases, we represent both the reference and the quote.

Our annotation makes the structure of biblical references explicit by distinguishing between verses (verse.n.01), chapters (chapter.n.01) and books. Sometimes, just a part of a verse is referred to, in which case we use the concept part.n.09. All of the books of the Bible are listed as instances of the Old or New Testament in WordNet, so there is no need to include another ontology to uniquely identify biblical references. An example of the resulting representation is shown in Fig. 6.

Psalms have been adapted to rhymed versions to make them more suitable for congregational singing (Slenk, 1969), and whether or not this process of conversion is applied is sometimes explicitly indicated on tombstones. This is accomplished by a variety of possible expressions, including “BERIJMD”, “BER.”, “BER” or “berijmdt” (rhymed) and “ONB.” or “O.B.” (unrhymed). We represent these attributes with the aid of the :att relation and the WordNet concepts rhymed.a.01 and unrhymed.a.01. If more than one verse is cited, as in “PSALM 118:17-20”, we use the group construction (see the Appendix).
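The regular structure of such references suggests that they can be decomposed mechanically. The sketch below is a minimal illustration, not the procedure used for our annotation: the regular expression, the function name `parse_reference` and the dictionary layout are assumptions, and it covers only the two surface patterns quoted above. The number before the colon is stored under the key `chapter`, and a verse range is expanded into a list of its members.

```python
import re

# Book abbreviation, number, colon, verse or verse range, optional marker.
REFERENCE = re.compile(
    r"^\s*(?P<book>[A-Za-z]+)\.?\s*(?P<chapter>\d+)\s*:\s*"
    r"(?P<first>\d+)(?:\s*-\s*(?P<last>\d+))?"
    r"(?:\s+(?P<marker>\S+))?\s*$"
)

# Markers listed in the text, compared in upper case.
RHYMED = {"BERIJMD", "BER.", "BER", "BERIJMDT"}
UNRHYMED = {"ONB.", "O.B."}


def parse_reference(text: str) -> dict:
    """Split a biblical reference into book, chapter, verses and attribute."""
    match = REFERENCE.match(text)
    if match is None:
        raise ValueError(f"unrecognised reference: {text!r}")
    first = int(match["first"])
    last = int(match["last"]) if match["last"] else first
    marker = (match["marker"] or "").upper()
    if marker in RHYMED:
        att = "rhymed.a.01"
    elif marker in UNRHYMED:
        att = "unrhymed.a.01"
    else:
        att = None  # no marker, or one that this sketch does not know
    return {
        "book": match["book"],
        "chapter": int(match["chapter"]),
        "verses": list(range(first, last + 1)),
        "att": att,
    }


print(parse_reference("Ps.119:1 BER"))
print(parse_reference("PSALM 118:17-20"))
```

Mapping the book abbreviation to its WordNet concept, and handling references to parts of verses, would require additional lookup tables that are not shown here.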

We adopt the relation :rel to link the biblical reference to other entities of the tombstone inscription. However, it is not always clear to which entity this link should be made. On stones with more than one deceased, biblical references are sometimes clearly positioned to apply to a particular person. In such cases, it is obvious that the intention of the commemorators was to connect specific references to each of the deceased. If there is more than one deceased person but only a single biblical reference (placed at the bottom of a stone), the connection is less clear. In this case, the reference is linked to the stone itself.

4.8 Symbolism

Tombstones are often (in more than half of the cases in our dataset) decorated with interesting symbols – some of them modern, others older. Our primary aim is not to understand the symbolism itself by deciphering its deeper meaning, even though this can give information about the beliefs and social status of the deceased (Krommenhoek, 2016): the Christian cross is a symbol of Catholicism, the Star of David of Judaism, the Square and Compasses of Freemasonry, and a crest of the Salvation Army.

Rather, we take a descriptive approach, capturing the symbols that appear on stones with concepts, and leaving further interpretation for other branches of research. An existing classification system designed for art is ICONCLASS (Couprie, 1983). This is used by researchers (mostly art historians) to annotate an image with the concepts it visualises. Its definitions are hierarchically ordered, in a way similar to the HISCO scheme for occupations (Section 4.6). The inclusion of ICONCLASS in our meaning representation could be realised with a relation similar to :hco, but it requires expert knowledge from an art historian and is therefore the subject of future interdisciplinary work. Instead, we use WordNet to describe the symbols found on the tombstones in our dataset (Fig. 7; see Footnote 3).

Fig. 7 Examples of symbols found in our datasets with their corresponding WordNet concepts

The semantic relation that we use to indicate that a stone is decorated with symbols is :sym (see Table 1). We distinguish two main types of symbol representations: single symbols and composite symbols. Symbols that appear alone are simply represented with an appropriate WordNet concept that is as specific as possible, examples of which are given in Fig. 7. Some symbols are displayed with prominent attributes: a broken flower, a winged sandglass, an inverted torch or clasped hands. These are represented with the help of the :att relation, as Fig. 8 demonstrates, with the corresponding WordNet concept for adjectives to capture the type of modification.

Fig. 8 Tombstone symbols with attributes, paired with their corresponding meaning representations

Composite symbols are two or more symbols that are visualised next to each other or touching each other. Examples are two fronds that are tied together with a ribbon, a weeping willow decorated with a scythe and an inverted torch, a dove with a branch in its beak, a horse standing next to a gate, or a hand holding a torch. We represent these compound symbols with the WordNet concept composite.n.01 and the relation :prt. It goes beyond the scope of this article to describe composite symbols with spatial relations and the position of the symbols on the gravestone. However, the meaning representation that we develop here could accommodate such an extension.

5 Automatic reading of tombstones

To facilitate the process of annotation, dedicated software could be used to produce meaning representations for tombstone images. Research has been carried out in this area of computational semantics for similar meaning representations (Wang & Xue, 2017; Van Noord & Bos, 2017), but these approaches assume textual input, not images of text. Parsing tombstone inscriptions raises new challenges, and here we give an overview of what we think could be potential showstoppers of this ambitious but exciting idea. An automatic tombstone parser has the potential to speed up human annotation, for instance in a setting where the parser produces a draft meaning representation, for correction by a human annotator.

There are two extremes of natural language processing architectures that could be considered for this task. The first is a pipeline architecture, a sequence of tools dedicated to specific subtasks, typically where a component adds information to the output of the previous component in the pipeline (the first component would take the images as input; the last would output the meaning representation). The second is an architecture based on neural networks, where a machine learning model trained on pairs of input-output representations (here: images and their corresponding meaning representations) would predict a meaning representation for images not seen during the training stage. And, of course, anything in between these two extreme types of architectures is also possible. Here, we mainly consider the first type of architecture.

Considering a classic pipeline architecture, the first component faces the task of optical character recognition (OCR). There are various challenges here, as there is a tremendous variation in font type and size, background and foreground colour, orientation, and obscured or unreadable parts. It could also be challenging to map regular symbols, such as the star and cross to denote birth and death, to standard characters (see Fig. 6).

The next component would address the issue of segmentation: dividing characters into word tokens. This requires a proper interpretation of hyphenation (line breaks) and spaces: sometimes there is no spacing at all on small stones. Furthermore, parallel columns of texts on stones could introduce errors.

The next logical step would be to assign linguistic categories to the word tokens that are the result of the segmentation process. Recognising names of people, dates and locations (named entity recognition) is a first step, but this needs to be further extended for dealing with job descriptions, epitaphs, measurements and biblical references.

The interpretation component requires the normalisation of date expressions, the expansion of abbreviations (see Table 6), the resolution of toponyms (see Section 4.3), word sense disambiguation, pronoun resolution and coreference resolution. In other words, nearly all of the challenges of natural language understanding are present on tombstone inscriptions. On top of this, information is presented in a variety of languages, and code switching occurs as well.

Table 6 Abbreviations found on tombstones in Groningen with their expansion and English translation

Connections between entities are sometimes hard to establish because of long-distance dependencies. A case in point is a date of birth that does not belong to the closest person mentioned on the stone, as in the sequence “Here lies person A, married to person B, born on date C”. Without further context, “person B” could be linked erroneously to “date C”.

Finally, coordinate structures may appear within dates, toponyms and personal names, where the coordinator is a parenthesis (Fig. 9). Coordination is a linguistic device that links together shared information, thereby reducing the amount of space required for the expression. Coordination is known as one of the hardest problems in natural language processing (Dahl & McCord, 1983), and the graphical coordination displayed on tombstones increases this difficulty even further.

Fig. 9 Examples of graphical coordination

If automatic approaches to tombstone parsing are developed, it is important to assess their quality. A standard way of doing this in natural language processing is to compare system output with “gold standard” representations; in other words, manually labelled meaning representations. For SPL, several tools have been developed to do this automatically, by computing precision and recall of matching triples (Cai & Knight, 2013). The harmonic mean of precision and recall gives a single score that measures the quality of the parser. A score of around 0.75, which is roughly what is achieved for text interpretation for similar meaning representations (Wang & Xue, 2017), would probably speed up the manual annotation of tombstone interpretation.
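The scoring idea can be sketched in a few lines. The Python below is a simplified illustration rather than the tool of Cai & Knight (2013): it assumes that the variables of the system output have already been aligned with those of the gold standard, so that triples can be compared directly as sets. The function name `triple_f_score` and the example triples are hypothetical.

```python
def triple_f_score(system, gold):
    """Return precision, recall and their harmonic mean over matching triples."""
    system, gold = set(system), set(gold)
    matched = len(system & gold)
    precision = matched / len(system) if system else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Gold standard for "75 years"; the system output picks the wrong unit.
gold = {
    ("m1", "instance", "measure.n.02"),
    ("m1", ":qua", "75"),
    ("m1", ":mem", "u1"),
    ("u1", "instance", "year.n.01"),
}
system = {
    ("m1", "instance", "measure.n.02"),
    ("m1", ":qua", "75"),
    ("m1", ":mem", "u1"),
    ("u1", "instance", "month.n.01"),
}
print(triple_f_score(system, gold))  # three of four triples match on each side
```

In this toy case a single wrong concept costs one triple in both precision and recall, which shows how the measure rewards partially correct draft representations.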

6 Conclusion and future work

The information found on tombstones in the north of the Netherlands is diverse and stretches far beyond the name of the deceased and dates of birth and death. Commemorators, locations of birth and death (or other important places), family status, occupations, durations (someone’s age or length of relationship or occupation) and references to biblical texts are common. Textual information is also regularly accompanied by symbolism, ranging from singular objects to complex collections of burial elements.

The expressive power required for a semantic formalism to capture the information found on tombstones in our dataset can be limited to a fragment of first-order logic without negation. The SPL formalism lends itself well to this purpose, as we have shown in this article.

A closed class of 32 two-place predicates is needed to establish fine-grained relations between the entities found on tombstones. As we have shown, WordNet provides a generic way of representing concepts, covering persons, locations, occupations, family relationships, measurements, organisations, biblical references and symbols. For our dataset of over 1,000 tombstones, a set of around 250 different concepts sufficed. The core formalism can be easily extended to existing third-party databases or classifications, including HISCO (Van Leeuwen et al., 2002), ICONCLASS (Couprie, 1983) and GeoNames (Maltese & Farazi, 2013). This can be seen as a first step towards a standard for tombstone inscriptions for initiatives such as CIDOC-CRM (Biagetti, 2016) and Schema.org (Guha et al., 2016).

Not all phenomena have been integrated into our formalism. Information on the cause of death, for example, is infrequent in our dataset, and more data is needed to design a generalised scheme compatible with existing classifications such as ICD-10 (Tobin et al., 2019). Another phenomenon that is not included is epitaphs (short phrases or texts in honour of the deceased person), as this requires further study (classification) and integration into SPL.

Computational approaches to interpreting tombstone inscriptions are likely to speed up the annotation process for datasets that are several orders of magnitude larger than the one presented in this work. They will need to deal with a large variety of complicated natural language processing issues, ranging from optical character reading and image recognition to segmentation, relation extraction, concept disambiguation, coreference resolution, and more. The dataset originating from this work will be released for research with this purpose in mind. Intelligent software that helps with annotation, and even imperfect approximations, could accelerate data labelling considerably.

There are various possible directions to take for further exploration. One exciting area is to consider tombstones of religions other than Christianity, such as gravestones of Jewish and Muslim communities. This is likely to require additional expertise to transcribe and interpret Hebrew and Arabic scripts. Another area is visualisation and consistency checking of the annotated data, to make the dataset ‘AI-ready’. Exploring alternative ways of visualising the information of one or more, possibly related, tombstones, for instance by displaying them as timelines, will not only offer new perspectives on the data, but also provide means to capture contradictions, discover redundancies and detect anomalies.