1 Introduction

rdf’s distributed graph model encouraged adoption for publication and manipulation of e.g. social and biological data. Coding errors in data stores like DBpedia have largely been handled in a piecemeal fashion with no formal mechanism for detecting or describing schema violations. Extending uptake into environments like medicine, business and banking requires structural validation analogous to what is available in relational or xml schemas.

While owl ontologies can be used for limited structural validation, they are generally used for formal models of reusable classes and predicates describing objects in some domain. Applications typically consume and produce graphs composed of precise compositions of such ontologies. A company’s human resources records may leverage terms from foaf and Dublin Core, but only certain terms, composed into specific structures, and subject to additional use-specific constraints. We would no more want to impose the constraints of a single human resources application suite on foaf and Dublin Core than we would want to assert that such applications need to consume all ontologically valid permutations of foaf and Dublin Core entities. Further, open-world constraints on owl ontologies make it impossible to use conventional owl tools to e.g. detect missing properties. Shape expression schemas (ShEx 1.0) [6, 8] were introduced as a high-level language in which it is easy to mix terms from arbitrary ontologies. They provide a schema language in which one can define structural constraints (arc labels, cardinalities, datatypes, etc.) and, since version 2.0 (ShEx 2.0), mix them using Boolean connectives (disjunction, conjunction and negation).

A schema language for any data format has several uses: communicating to humans and machines the form of input/output data; enabling machine-verification of data for production, publication, or consumption; driving query and input interfaces; static analysis of queries. In this, ShEx plays a role similar to that of relational and xml schemas. A ShEx schema validates nodes in a graph against a schema construct called a shape. In xml, validating an element against an XML Schema type or element or a Relax NG production recursively tests nested elements against constituent rules. In ShEx, validating a node in a graph against a shape recursively tests the nodes which are the object of triples constrained in that shape. An essential difference, however, is that unlike trees, graphs can have cycles, and recursive definitions can yield infinite computation. Moreover, ShEx 2.0 includes a negation operator, and it is well known that mixing recursion with negation can lead to incoherent semantics.

Contributions. In this paper we present shapes schemas, a schema language that is the foundation of ShEx 2.0 (Sect. 2). The precise relationship between shapes schemas and ShEx 2.0 is given at the end of Sect. 2. We formally define the semantics of shapes schemas and show that it is sound for schemas that mix recursion and negation in a stratified manner (Sect. 3). We then propose two algorithms for validating an RDF graph node against a shapes schema. Both algorithms are shown to be correct w.r.t. the semantics (Sect. 4). We finally discuss future research directions and conclude (Sect. 5).

Related Work. In [8] we gave semantics for ShEx 1.0. The latter does not use Boolean operators and, because of negation, the extension to ShEx 2.0 (and thus to shapes schemas) is non-trivial.

Closest to shapes schemas is the shacl language, both in terms of purpose and expressiveness. shacl also defines named constraints called shapes to be checked on rdf graph nodes. Unlike ShEx, shacl is not completely independent from the RDF Schema vocabulary: rdfs:Classes play a particular role there as a shape can be required to hold for all the nodes that are instances of some rdfs:Class. Therefore validation in shacl requires partial RDF Schema entailment in order to discover all rdfs:Classes of a node. Regarding expressiveness, the main differences between shacl and shapes schemas are that shacl allows defining constraints based on property paths and for comparison of values; shacl does not have the algebraic operators some-of and each-of and uses Boolean connectives for defining complex shapes; finally, shacl does not define the semantics of recursive shapes.

Ontology languages such as owl, description logics or RDF Schema are not meant to define (complex) constraints on the data and we do not compare shapes schemas with them. Proposals were made for using owl with a closed world assumption in order to express integrity constraints [5, 9]. They associate alternative semantics with the existing owl syntax and can be misleading for users.

Some approaches use sparql to express constraints on graphs (spin, RDFUnit [3]), or compile a domain specific language into sparql queries [2]. sparql allows expressing complex constraints but does not support recursion. While sparql constraints can be validated by standard sparql engines, they are harder to write and maintain than high-level schemas like ShEx and shacl.

Description Set Profiles is a constraint language that uses an rdf vocabulary to define templates and constrain the value and cardinality of properties. It does not have any equivalent of the each-of algebraic operator, and was not designed to be human-readable.

Introductory Example. Let is: be a namespace prefix from some ontology, ex: be the prefix used in the example schema and instance, and foaf: and xsd: be the standard foaf and xsd prefixes, respectively. The schema \(\mathbf {S}_0\) is as follows

figure a

where \({ Def }_{\mathsf{Client}}\) is the definition of , and similarly for \({ Def }_{\mathsf{User}}\), and − in the definition of is the set difference operator. The schema \(\mathbf {S}_0\) defines four shapes intended to describe users, programmers, clients and issues, respectively. requires that a node has one foaf:name property with string value, and an optional foaf:mbox that is an iri. The optional mailbox is specified by the cardinality constraint [0;1]. Other cardinality constraints used in \(\mathbf {S}_0\) are [0;*] for zero or more, and [1;5] for one up to five. When no cardinality is given, the default is “exactly one”. A node has zero or more ex:expertise properties with values that are iri s, and one ex:experience property whose value is one among ex:senior and ex:junior. A has either a ex:clientNbr that is an integer, or a ex:clientAffil(iation) with unconstrained value (i.e. ), but not both. Finally, an issue () is reported by somebody who is client and user, is reproduced by one to five programmers, and can be related to zero or more issues.

The shapes in \(\mathbf {S}_0\) whose name contains Value specify the set of allowed values for a node. This can be the set of all values of some literal datatype (e.g. string, integer), the set of all nodes of some kind (e.g. iri), or an explicitly given set (e.g. ). is satisfied by every node. It states that the node can have zero or more outgoing triples whose predicates can be any iri , and whose objects match . Finally, uses a conjunction to require that a node has both the client and the user properties. Its definition is a bit technical. The right hand side of the conjunction states that the node must have a foaf:name and an optional foaf:mbox (\( Def _{\mathsf{User}}\)). Moreover (the ; operator), the node can have any number ([0;*]) of properties that can be any iri except for foaf:name and foaf:mbox and whose value is unconstrained. The latter is necessary in order to allow the “client” properties required by the left hand side of the conjunction.

Graph \(\mathrm {\mathbf {G}}_0\) here after is described by schema \(\mathbf {S}_0\). Nodes ex:issue1 and ex:issue2 have shape ; ex:fatima and ex:emin are ; ex:ren and ex:noa have shape .

figure b

The RDF Graph Model. As usual, we assume three disjoint sets: \(\mathsf {IRI} \) a set of iri s, \(\mathsf {Lit} \) a set of literals, and \(\mathsf {Blank} \) a set of blank nodes. An RDF graph is a set of triples over \((\mathsf {IRI} \cup \mathsf {Blank}) \times \mathsf {IRI} \times (\mathsf {IRI} \cup \mathsf {Lit} \cup \mathsf {Blank})\). For a triple (s, p, o) in some graph, s is called its subject, p is called its predicate, and o is called its object. We denote by \(\mathrm {\mathbf {N}odes}(\mathrm {\mathbf {G}})\) the set of nodes of the graph \(\mathrm {\mathbf {G}}\), that is, the elements that appear in a subject or object position in some triple of \(\mathrm {\mathbf {G}}\). The neighbourhood of node n in graph \(\mathrm {\mathbf {G}}\) is the set of triples in \(\mathrm {\mathbf {G}}\) that have n as subject, and is denoted \( neigh _\mathrm {\mathbf {G}}(n)\) or simply \( neigh (n)\) when \(\mathrm {\mathbf {G}}\) is clear from the context. We use disjoint union on sets of triples, denoted \(\uplus \): if \(N, N_1, N_2\) are sets of triples, \(N = N_1 \uplus N_2\) means that \(N_1 \cup N_2 = N\) and \(N_1 \cap N_2 = \emptyset \).
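The definitions above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that RDF terms are plain strings and that a graph is a set of (subject, predicate, object) tuples; the function names and the sample graph are hypothetical and do not come from any RDF library.

```python
from typing import Set, Tuple

# Assumption: RDF terms are plain strings; a triple is (subject, predicate, object).
Triple = Tuple[str, str, str]


def nodes(graph: Set[Triple]) -> Set[str]:
    """Elements that appear in subject or object position in some triple."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}


def neigh(graph: Set[Triple], n: str) -> Set[Triple]:
    """Neighbourhood of n: the triples of the graph that have n as subject."""
    return {t for t in graph if t[0] == n}


def is_disjoint_union(n: Set[Triple], n1: Set[Triple], n2: Set[Triple]) -> bool:
    """True when N is the disjoint union of N1 and N2."""
    return (n1 | n2) == n and not (n1 & n2)


# Hypothetical sample graph.
G = {
    ("ex:a", "ex:p", "ex:b"),
    ("ex:a", "ex:q", '"lit"'),
    ("ex:b", "ex:p", "ex:a"),
}
```

Note that predicates are not nodes: `nodes(G)` contains `ex:a`, `ex:b` and the literal, but neither `ex:p` nor `ex:q`.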

2 Shapes Schemas

A shapes schema \(\mathbf {S}\) defines a set of named shapes. A shape is a description of the graph structure that can be visited starting from a particular node. It can talk about the value of the node itself and about its neighbourhood. Shapes can use (Boolean combinations of) other shapes and can be recursive.

Formally, a shapes schema \(\mathbf {S}\) is a pair \((\mathbf {L}, \mathbf {def})\), where \(\mathbf {L} \) is a set of shape labels used as names of shapes and \(\mathbf {def} \) is a function that with every shape label associates a shape expression. In examples, we write \(L \rightarrow S\) as short for \(\mathbf {def} (L) = S\) (for a shape label L and a shape expression S).

Shape Expressions. The grammar for shape expressions is given on Fig. 1a. A shape expression (\(\mathsf {ShExpr} \)) is a Boolean combination of two atomic components: value description and neighbourhood description. A neighbourhood description (\(\mathsf {NeigDescr} \)) defines the expected neighbourhood of a node and is given by a triple expression (\(\mathsf {TExpr} \), see below). A value description (\(\mathsf {ValueDescr} \)) is a set that declares the admissible values for a node. The set can contain iri s, literals, and the special constant \(\mathsf {\_b} \) to indicate that the node can be a blank node. ShEx 2.0 proposes concrete syntax for different kinds of value description sets (literal datatypes, regex patterns to be matched by \(\mathsf {IRI} \)s, intervals, etc.). Here we focus on defining the semantics so the concrete syntax for such sets is irrelevant. A \(\mathsf {ValueDescr} \) can be an arbitrary set with the unique assumption that it has a finite representation for which membership can be effectively computed.

Fig. 1. The grammar for shape expressions and triple expressions.

Triple expressions. Triple expressions describe the expected neighbourhood of a node. They are inspired by regular expressions, as are dtds and XML Schema for xml. A triple expression will be matched by the neighbourhood of a node in a graph, similarly to type definitions in XML Schema that are matched by the children of some node. The main difference is that the neighbourhood of a node in an rdf graph is an (unordered) set, whereas the children of a node in an xml document form a sequence.

The grammar for triple expressions (\(\mathsf {TExpr} \)) is given on Fig. 1b, in which \(\textit{min}\) is a natural, and \(\textit{max}\) is a natural or the special value \(*\). The basic triple expression is a triple pattern and it constrains triples. A triple expression composed of each-of (separated by a ‘;’), some-of (separated by a ‘\(\mid \)’) and repetition operators is satisfied if some distribution of the triples in the neighborhood of a node exactly satisfies the expression. Section 3.1 defines this and draws the analogy with regular expressions. In examples, we omit the braces for singleton \(\mathsf {PropSet} \)s, e.g. we write instead of .

Example 1

(Shape expressions, triple expressions). In schema \(\mathbf {S}_0\) from the introductory example, the definitions of the five shapes with name ...Shape... are triple expressions and collectively make use of all the operators: each-of (;), some-of (\(\mid \)), repetition. All shapes with name ...Value... are defined by atomic \(\mathsf {ValueDescr} \)s. The definition of is a \(\mathsf {ShapeAnd} \) expression.   \(\square \)

Relationship between Shapes Schemas and ShEx 2.0. Shapes schemas slightly generalize ShEx 2.0 and thus allow for a more concise definition of syntax and semantics. For readers familiar with ShEx 2.0 we now explain how shapes schemas differ from ShEx 2.0. First, \(\mathsf {TriplePattern}\) uses a set of properties, whereas the analogous triple constraint in ShEx 2.0 uses a single property. This slight generalization allows encoding the closed and extra constructs of ShEx 2.0. In shapes schemas, triple expressions are always closed (whereas in ShEx 2.0 they are non-closed by default) but an expression E can be made non-closed by transforming it into , where P is the set of all \(\mathsf {IRI} \)s not mentioned as properties in E, and is as defined in the introductory example. The extra modifier is encoded in a similar way, using sets of properties in triple patterns and negation. Second, a \(\mathsf {ValueDescr} \) is an arbitrary set of values that can be iri s, literals or blank nodes, whereas the analogous node constraint in ShEx 2.0 defines a set of allowed values using a combination of elementary constraints such as xsd datatypes, facets, numerical intervals, node kinds. Using an arbitrary set of values allows us to omit details that are unnecessary for defining the semantics. Third, ShEx 2.0 allows using shape labels in shape definitions; this is syntactic sugar and is equivalent to replacing the label by its definition. Finally, in shapes schemas we omit inverse properties, which would make the proofs longer without representing any additional challenge w.r.t. the semantics.

3 Semantics of Shapes Schemas

A shape defines the structure of a graph when visited starting from a node that has that shape. In this section we give a precise meaning of the following statement

shapes_sem: node n in graph \(\mathrm {\mathbf {G}}\) has shape (or type) L from schema \(\mathbf {S}\)

Giving a sound definition for shapes_sem is not trivial because of the presence of recursion. It also requires making a design choice that we now explain.

Example 2

(Simple recursive schema). Let schema \(\mathbf {S}_1\) and graph \(\mathrm {\mathbf {G}}_1\) be:

figure c

Example 2 captures the essence of recursion. If has shape then also has shape . If on the other hand does not have shape , then neither does . This illustrates two important aspects of the semantics of shapes schemas. First, whether a node has some shape cannot be defined independently of the shapes of the other nodes in the graph. The consequence of this apparently simple fact is that we need a global statement about which nodes satisfy which shapes; we call this a typing. A typing must be correct, i.e. coherent with itself. Second, in the above example there is a (design) choice to make. Clearly, there are two acceptable alternatives: either (1) both and have shape , or (2) none of them does. Such choice is well known for recursive languages: (1) corresponds to a maximal solution, and (2) to a minimal solution. Both choices can lead to sound semantics. In shapes schemas we choose the maximal solution. This is justified by applications: in the above example we do want to consider as a valid . It would not be the case with semantics based on a minimal solution.

3.1 Typing and Correct Typing

The semantics is based on the notion of typing: this is a set of couples that associate a node of an RDF graph with a shape label (a type). In the sequel we consider a graph \(\mathrm {\mathbf {G}}\) and a schema \(\mathbf {S}= (\mathbf {L}, \mathbf {def})\).

Definition 1

(node-type association, typing). A node-type association is a couple (n, L) in \(\mathrm {\mathbf {N}odes}(\mathrm {\mathbf {G}}) \times \mathbf {L} \). A typing of \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}\) is a set of node-type associations.

Example 3

With \(\mathbf {S}_1\) and \(\mathrm {\mathbf {G}}_1\) from Example 2, the following are typings

A typing is correct if, intuitively, it contains an evidence for every node-type association in it. In the above example \( typing _1\) and \( typing _2\) are correct, whereas \( typing _3\) is not correct as it contains e.g. the association but does not contain the association that is required for to have type . The empty typing (\( typing _4\)) is always correct.

Definition 2

(correct typing). Let \( typing \subseteq \mathrm {\mathbf {N}odes}(\mathrm {\mathbf {G}}) \times \mathbf {L} \). We say that \( typing \) is a correct typing if for any \((n, L) \in typing \), it holds that \( typing , n \vdash \mathbf {def} (L)\), where \(\vdash \) is the relation defined on Fig. 2a.

Fig. 2. Definitions of the \(\vdash \) and \(\vDash \) relations.

Discussion on \(\vdash \) . For a shape expression S, the definition of \( typing , n \vdash S\) on Fig. 2a is by recursion on the structure of S. In Rules se-value-descr, V is a subset of \(\mathsf {IRI} \cup \mathsf {Lit} \cup \{\mathsf {\_b} \}\) defining a \(\mathsf {ValueDescr} \). A node n satisfies the value description V if n belongs to the set V, or if n is a blank node and \(\mathsf {\_b} \) is in V. The other base case is Rule se-neig-descr, in which E is a \(\mathsf {TExpr} \) representing a neighbourhood description. A node n satisfies the \(\mathsf {NeigDescr} \) E if the neighbourhood of n matches the triple expression E. The matching relation \(\vDash \) is defined on Fig. 2b and discussed below. The remaining four rules are for the Boolean operators. The rules for \(\mathbin {\mathsf {AND}}\) and \(\mathbin {\mathsf {OR}}\) are as one would expect. Regarding negation, a node satisfies a \(\mathsf {ShapeNot} \) expression if it does not satisfy its sub-expression, as stated by Rule se-shape-not. The premise of that rule is \( typing , n \not \vdash S\) and means that (using the inference rules on Fig. 2a) it is impossible to construct a proof for \( typing ,n \vdash S\).
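As an illustration of the \(\vdash \) relation and of Definition 2, the following Python sketch evaluates shape expressions by recursion on their structure. The encoding is hypothetical: shape expressions are tagged tuples, a neighbourhood description is represented by a Python predicate that stands in for the matching relation \(\vDash \), blank nodes are assumed to be written with a `_:` prefix, and the schema and graph are made up for the example.

```python
# Hypothetical encoding (not ShEx syntax): shape expressions as tagged tuples.
#   ("value", V)      V is a set of admissible nodes, possibly containing BLANK
#   ("neigh", pred)   pred(typing, triples) -> bool stands in for the matching relation
#   ("and", S1, S2), ("or", S1, S2), ("not", S1)
BLANK = "_b"


def is_blank(n):
    # Assumption: blank nodes are written with a "_:" prefix.
    return n.startswith("_:")


def satisfies(typing, graph, n, S):
    """Sketch of 'typing, n |- S', by recursion on the structure of S."""
    kind = S[0]
    if kind == "value":
        return n in S[1] or (is_blank(n) and BLANK in S[1])
    if kind == "neigh":
        neighbourhood = {t for t in graph if t[0] == n}
        return S[1](typing, neighbourhood)
    if kind == "and":
        return satisfies(typing, graph, n, S[1]) and satisfies(typing, graph, n, S[2])
    if kind == "or":
        return satisfies(typing, graph, n, S[1]) or satisfies(typing, graph, n, S[2])
    if kind == "not":
        # No proof exists for the sub-expression.
        return not satisfies(typing, graph, n, S[1])
    raise ValueError(f"unknown shape expression: {kind!r}")


def is_correct(typing, graph, schema):
    """A typing is correct if every association (n, L) in it satisfies def(L)."""
    return all(satisfies(typing, graph, n, schema[L]) for n, L in typing)


def only_p_to_t(typing, triples):
    # Stand-in for a neighbourhood made of zero or more ex:p triples to nodes of type T.
    return all(p == "ex:p" and (o, "T") in typing for _, p, o in triples)


# Made-up schema and graph.
schema = {"T": ("neigh", only_p_to_t), "NotA": ("not", ("value", {"ex:a"}))}
graph = {("ex:a", "ex:p", "ex:b"), ("ex:b", "ex:p", "ex:a")}
```

With this graph, the typing that gives type `T` to both nodes is correct, the empty typing is correct, and the typing containing only `("ex:a", "T")` is not, because it lacks the association for `ex:b` on which it relies.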

Discussion on \(\vDash \) . For a set of triples N, a typing \( typing \) and a \(\mathsf {TExpr} \) E, we say that N matches E with \( typing \), and we write \( typing , N \vDash E\), as defined recursively on the structure of E on Fig. 2b. Note that the \(\vDash \) relation is defined for an arbitrary set of triples N. In practice, N will be (a subset of) the neighbourhood of some node. In the basic Rule te-tpattern, P@L is a \(\mathsf {TriplePattern}\) with P a set of \(\mathsf {IRI} \)s and L a shape label. A singleton set of triples \(\{( subj , pred , obj )\}\) matches the triple pattern if the predicate \( pred \) belongs to P and the object has type L in \( typing \). The other basic rule is Rule te-empty: an empty set of triples satisfies the \(\mathsf {EMPTY} \) triple expression.

The remaining rules are about the composed triple expressions. A set of triples matches a \(\mathsf {SomeOfExpr}\) if it matches one of its sub-expressions (Rules te-some-of). The semantics of an \(\mathsf {EachOfExpr}\) is a bit more complex. A set N matches an each-of triple expression \(E_1 ; E_2\) if N is the disjoint union of two sets \(N_1\) and \(N_2\), and \(N_1\) matches the sub-expression \(E_1\), and \(N_2\) matches the sub-expression \(E_2\). Let us make a parallel between regular expressions and triple expressions. The each-of operator is analogous to concatenation. Recall that a string w matches a regular expression \(R_1\cdot R_2\) (where \(\cdot \) is concatenation) whenever w can be “split” into two strings \(w_1\) and \(w_2\) such that their concatenation gives w (\(w = w_1\cdot w_2\)), and \(w_1\) matches \(R_1\), and \(w_2\) matches \(R_2\). In the case of triple expressions, the set of triples N is “split” into two disjoint sets \(N_1\) and \(N_2\): disjoint union on sets is analogous to concatenation on words. Following the same analogy, repetition in triple expressions corresponds to Kleene star (the star operator) in regular expressions, with the difference that it allows expressing arbitrary intervals for the number of allowed repetitions, whereas Kleene star is always \([0;*]\). So, in Rule te-repet, a set of triples N matches a repetition triple expression \(E[ min ; max ]\) if N can be split as the disjoint union of k sets \(N_1, \ldots , N_k\) such that k is within the interval bound \([ min ; max ]\) and each of these sets matches the sub-expression E. Note that \(k = 0\) is possible only when \(N = \emptyset \).
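The matching relation described above can be sketched as a brute-force Python function that tries every way of splitting the set of triples. This is only an illustration under stated assumptions: the tagged-tuple encoding of triple expressions, the name `matches`, and the use of `None` for the bound \(*\) are all hypothetical, and the exhaustive enumeration of subsets is exponential, so it is not meant as a practical validator.

```python
from itertools import combinations

# Hypothetical encoding of triple expressions as tagged tuples:
#   ("empty",)                  EMPTY
#   ("tp", props, label)        triple pattern P@L
#   ("each", E1, E2)            each-of  E1 ; E2
#   ("some", E1, E2)            some-of  E1 | E2
#   ("rep", E, lo, hi)          repetition E[lo;hi], hi=None for '*', hi >= lo assumed


def subsets(triples):
    """All subsets of a set of triples, as frozensets."""
    items = sorted(triples)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def matches(typing, N, E):
    """Sketch of 'typing, N |= E' by exhaustive search over splits of N."""
    N = frozenset(N)
    kind = E[0]
    if kind == "empty":
        return not N
    if kind == "tp":
        _, props, label = E
        if len(N) != 1:
            return False
        ((_, pred, obj),) = N
        return pred in props and (obj, label) in typing
    if kind == "some":
        return matches(typing, N, E[1]) or matches(typing, N, E[2])
    if kind == "each":
        return any(matches(typing, N1, E[1]) and matches(typing, N - N1, E[2])
                   for N1 in subsets(N))
    if kind == "rep":
        return matches_rep(typing, N, E[1], E[2], E[3])
    raise ValueError(f"unknown triple expression: {kind!r}")


def matches_rep(typing, N, sub, lo, hi):
    if not N:
        # k = 0 if allowed; otherwise lo empty parts, each of which must match sub.
        return lo == 0 or matches(typing, N, sub)
    if hi == 0:
        return False
    pivot = min(N)  # the part containing the pivot is chosen first
    next_hi = None if hi is None else hi - 1
    for extra in subsets(N - {pivot}):
        part = extra | {pivot}
        if matches(typing, part, sub) and \
                matches_rep(typing, N - part, sub, max(lo - 1, 0), next_hi):
            return True
    return False


# Made-up expression: one foaf:name of type Str, then an optional foaf:mbox of type Iri.
E = ("each", ("tp", {"foaf:name"}, "Str"), ("rep", ("tp", {"foaf:mbox"}, "Iri"), 0, 1))
typing = {('"Ann"', "Str"), ("mailto:a@example.org", "Iri"), ("mailto:b@example.org", "Iri")}
name = ("ex:ann", "foaf:name", '"Ann"')
mbox_a = ("ex:ann", "foaf:mbox", "mailto:a@example.org")
mbox_b = ("ex:ann", "foaf:mbox", "mailto:b@example.org")
```

For instance, `matches(typing, {name, mbox_a}, E)` holds because the set splits into `{name}` and `{mbox_a}`, whereas `{name, mbox_a, mbox_b}` does not match since the repetition allows at most one mailbox.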

The laws of Boolean algebra can be used to put a shape expression in disjunctive normal form in which only atomic sub-expressions \(\mathsf {ValueDescr} \) and \(\mathsf {NeigDescr} \) are negated. From now on we consider only shape expressions in disjunctive normal form. Note also that the each-of and some-of operators are associative and commutative and we use them as operators of arbitrary arity, as e.g. in schema \(\mathbf {S}_0\) from the introductory example.

3.2 Stratified Negation

Because of the presence of recursion and negation, the notion of correct typing is not sufficient for defining sound semantics of shapes schemas.

Example 4

(Negation and recursion). Let schema \(\mathbf {S}_2\) and graph \(\mathrm {\mathbf {G}}_2\) below:

figure d

These two typings of \(\mathrm {\mathbf {G}}_2\) by \(\mathbf {S}_2\) are both correct: and    \(\square \)

The two typings in Example 4 strongly contradict each other. In order to prove that node has shape (in \( typing _5\)), we need to prove that does not have shape . The latter however does hold in \( typing _6\). Such strong contradictions are possible only in presence of negation. In comparison, in Example 2 we also have two contradicting typings, but none of them uses in its proof a negative statement that is positive in the other typing.

This problem is well known in logic programming, e.g. in Datalog; see Chap. 15 in [1] for an overview. The literature considers several solutions for defining coherent semantics in this case, among which the most popular are negation-as-failure, stratified negation and well-founded semantics. For instance, well-founded semantics would answer undefined to the question “does n have shape L” whenever there exist two proofs that contradict each other on that fact. We exclude this solution for two reasons: it is not helpful for users, and it might require computing all possible typings, which is costly. We opt for stratification semantics instead. It imposes a syntactic restriction on the use of recursion together with negation, so that schemas such as the one in Example 4 are not allowed. This is a reasonable restriction because negation in ShEx is expected to be used mainly locally, e.g. to forbid some property in the neighbourhood of a node.

We now define stratified negation. The dependency graph of \(\mathbf {S}\) is a graph whose set of nodes is \(\mathbf {L} \), and that has two kinds of edges labelled \( dep ^{-}\) and \( dep ^{+}\) defined by (recall that shape expressions are in disjunctive normal form):

  • There is a negative dependency edge \( dep ^{-}(L_1, L_2)\) from \(L_1\) to \(L_2\) iff the shape label \(L_2\) appears in \(\mathbf {def} (L_1)\) under an occurrence of the \(\mathop {\mathsf {NOT}}\) operator;

  • There is a positive dependency edge \( dep ^{+}(L_1, L_2)\) from \(L_1\) to \(L_2\) iff the shape label \(L_2\) appears in \(\mathbf {def} (L_1)\) but never under an occurrence of \(\mathop {\mathsf {NOT}}\).

Definition 3

(schema with stratified negation). A schema \(\mathbf {S}= (\mathbf {L}, \mathbf {def})\) is with stratified negation if there exists a natural number k and a mapping \({ strat }\) from \(\mathbf {L} \) to the interval [1; k] such that for all shape labels \(L_1, L_2\):

  • if \( dep ^{-}(L_1, L_2)\), then \({ strat }(L_1) > strat (L_2)\);

  • if \( dep ^{+}(L_1, L_2)\), then \({ strat }(L_1) \ge strat (L_2)\).

The mapping \({ strat }\) is called a stratification of \(\mathbf {S}\). A well known property of stratified negation is that the dependency graph does not have a cycle that goes through a negative dependency edge. This intuitively means that if shape \(L_1\) depends negatively on shape \(L_2\), then \(L_2\) does not (transitively) depend on \(L_1\). Positive interdependence is allowed in an unrestricted manner, as in \(\mathbf {S}_1\) from Example 2. \(\mathbf {S}_2\) from Example 4 is not with stratified negation because and
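Definition 3 suggests a simple way to test whether a schema is with stratified negation and to obtain a stratification. The sketch below uses the classical relaxation technique known from stratified Datalog; it is an illustration only, and the function name `stratify`, the representation of dependency edges as pairs of labels, and the sample labels are hypothetical.

```python
def stratify(labels, pos_edges, neg_edges):
    """Return a stratification as a dict label -> stratum (starting at 1),
    or None if some cycle of the dependency graph goes through a negative edge.

    pos_edges and neg_edges are collections of pairs (L1, L2) standing for
    dep+(L1, L2) and dep-(L1, L2).
    """
    strat = {L: 1 for L in labels}
    changed = True
    while changed:
        changed = False
        for l1, l2 in pos_edges:
            # dep+(L1, L2) requires strat(L1) >= strat(L2)
            if strat[l1] < strat[l2]:
                strat[l1] = strat[l2]
                changed = True
        for l1, l2 in neg_edges:
            # dep-(L1, L2) requires strat(L1) > strat(L2)
            if strat[l1] <= strat[l2]:
                strat[l1] = strat[l2] + 1
                changed = True
                if strat[l1] > len(labels):
                    # More strata than labels: a negative edge lies on a cycle.
                    return None
    return strat


# Made-up dependency graph: A and B depend positively on each other,
# C depends negatively on A, and D depends negatively on C.
labels = ["A", "B", "C", "D"]
ok = stratify(labels, [("A", "B"), ("B", "A")], [("C", "A"), ("D", "C")])

# A cycle through a negative edge: no stratification exists.
bad = stratify(["A", "B"], [("B", "A")], [("A", "B")])
```

Here `ok` places `A` and `B` on stratum 1, `C` on stratum 2 and `D` on stratum 3, while `bad` is `None`. The relaxation only raises a stratum when a constraint of Definition 3 is violated, so the result uses the lowest possible stratum for each label.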

Example 5

(Stratification). Let schema \(\mathbf {S}_3\) below.

figure e

The dependency graph contains the edges , , , . The unique loop is around and and it goes through positive dependencies only, so the schema is stratified. A stratification should be such that and are on stratums strictly lower than , and and are on the same stratum. One possible stratification is on stratum 2 and the other three shape labels on stratum 1. Another one is and on stratum 1, on stratum 2, and on stratum 3. The latter is called a most refined stratification as none of the stratums can be split.

3.3 Maximal Correct Typing

Recall from Example 3 that both \( typing _1\) and \( typing _4\) are correct. Note that has shape according to \( typing _1\) but not according to \( typing _4\). Then what is the correct answer of shapes_sem for and ? Does have shape at the end? This section provides an answer to that question. In one sentence: we trust \( typing _1\) because it is greater; actually it is the greatest (maximal) typing. The comparison is based on set inclusion.

The following Lemma 1 establishes that a maximal typing always exists in absence of negation. The proof is based on Lemma 2 in [8] that can be easily extended for the richer schemas we have here.

Lemma 1

Let \(\mathbf {S}\) be a schema that does not use the negation operator \(\mathop {\mathsf {NOT}}\). Then for all graphs \(\mathrm {\mathbf {G}}\), there exists a correct typing \( typing _{\mathrm {g}}\) of \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}\) such that for every \( typing '\), if \( typing '\) is a correct typing of \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}\), then \( typing ' \subseteq typing _{\mathrm {g}}\).

The typing \( typing _{\mathrm {g}}\) can be computed as the union of all correct typings \( typing '\).

Let us now define a maximal typing in presence of negation. Let \({ strat }\) be a stratification of \(\mathbf {S}\) that has k strata, with \(k \ge 1\). For any \(1 \le i \le k\), the schema \(\mathbf {S}_i\) is the restriction of \(\mathbf {S}\) that uses only the shape labels whose stratum is at most i. Formally, \(\mathbf {S}_i = (\mathbf {L} _i, \mathbf {def} _i)\) with \(\mathbf {L} _i = \{L \in \mathbf {L} \mid { strat }(L) \le i\}\), and their respective definitions \(\mathbf {def} _i(L) = \mathbf {def} (L)\). Note that if \(\mathbf {S}\) is stratified, then \(\mathbf {S}_1\) is negation-free.

For a set of labels \(\mathbf {L} _i \subseteq \mathbf {L} \), \( typing _{|\mathbf {L} _i} = \{(n,L) \in typing \mid L \in \mathbf {L} _i\}\) is the restriction of \( typing \) on the labels from \(\mathbf {L} _i\).

Definition 4

(stratification-maximal correct typing). Let \(\mathbf {S}= (\mathbf {L}, \mathbf {def})\) be a schema, \(\mathrm {\mathbf {G}}\) be a graph, and \({ strat }\) be a stratification of \(\mathbf {S}\) with k stratums (for \(k \ge 1\)). For any \(1 \le i \le k\), let \( typing _i\) be the typing of \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}_i\), defined by:

  • \( typing _1\) is the maximal correct typing of \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}_1\), as defined in Lemma 1;

  • for any \(1 \le i < k\), \( typing _{i+1}\) is the union of all correct typings \( typing '\) of \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}_{i+1}\) s.t. \( typing '_{|\mathbf {L} _i} = typing _i\).

The stratification-maximal correct typing of \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}\) with stratification \({ strat }\) is \(Typing(\mathrm {\mathbf {G}}, \mathbf {S}, { strat }) = typing _k\).

\(\mathop { Typing }(\mathrm {\mathbf {G}}, \mathbf {S}, { strat })\) from the above definition is indeed a correct typing for \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}\), as shown in the following proposition that is the core of the proof of soundness for the semantics of shapes schemas.

Proposition 1

For any schema \(\mathbf {S}\), any stratification \({ strat }\) of \(\mathbf {S}\) and any graph \(\mathrm {\mathbf {G}}\), \(\mathop { Typing }(\mathrm {\mathbf {G}}, \mathbf {S}, { strat })\) is a correct typing of \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}\).

Proof

The proof goes by induction on the number of stratums. The base case (1 stratum) is Lemma 1. For the induction case and stratum \(i+1\), by induction hypothesis \( typing _i\) is correct for \(\mathrm {\mathbf {G}}\) and \(\mathbf {S}_i\). It is enough to show that if \( typing '\) and \( typing ''\) are two correct typings for \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}_{i+1}\) and \( typing '_{|\mathbf {L} _i} = typing ''_{|\mathbf {L} _i} = typing _i\), then their union \( typing = typing ' \cup typing ''\) is correct for \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}_{i+1}\). Let \((n,L) \in typing \) and suppose that \((n,L) \in typing '\) (the case \((n,L) \in typing ''\) is symmetric). Because \( typing '\) is correct, we have \( typing ', n \vdash \mathbf {def} (L)\). We will show that (*) the proof for \( typing ', n \vdash \mathbf {def} (L)\) can be used as a proof for \( typing , n \vdash \mathbf {def} (L)\). If \(\mathbf {def} (L)\) does not contain a negation of a triple expression, then (*) easily follows from the definition of \(\vdash \).

So suppose \(\mathbf {def} (L)\) contains a negation operator on top of the triple expressions \(E_1, \ldots , E_l\). That is, (recall that shape expressions are in disjunctive normal form), \(\mathop {\mathsf {NOT}}E_j\) is a sub-expression of \(\mathbf {def} (L)\) for every \(1 \le j \le l\). Then the proof for \( typing ', n \vdash \mathbf {def} (L)\) contains applications of Rule se-shape-not for \(\mathop {\mathsf {NOT}}E_j\) that witness that there does not exist a proof for \( typing ', n \vdash E_j\), for every \(1 \le j \le l\). We need to show that a proof \( typing , n \vdash E_j\) cannot exist. Suppose by contradiction that P is a proof for \( typing , n \vdash E_j\), for some \(1 \le j \le l\). Let \(\mathbf {L} '\) be the set of all shape labels that appear in \(E_j\), then P uses only node-type associations with labels from \(\mathbf {L} '\). That is, \( typing _{|\mathbf {L} '}, n \vdash E_j\) holds. As \(E_j\) is negated in \(\mathbf {def} (L)\), we have \(\mathbf {L} ' \subseteq \mathbf {L} _i\), so \( typing _{|\mathbf {L} _i}, n \vdash E_j\) also holds. But \( typing _{|\mathbf {L} _i} = typing _i \subseteq typing '\). Contradiction.   \(\square \)

Lemma 2 below establishes that \(Typing(\mathrm {\mathbf {G}}, \mathbf {S}, { strat })\) does not depend on the stratification being chosen. This allows us to define the maximal correct typing (Definition 5) and to give a precise meaning to shapes_sem (Definition 6), which was the objective of this section.

Lemma 2

Let \(\mathbf {S}= (\mathbf {L}, \mathbf {def})\) be a schema and \(\mathrm {\mathbf {G}}\) be a graph. Let \({ strat }_1\) and \({ strat }_2\) be two stratifications of \(\mathbf {S}\). Then \(Typing(\mathrm {\mathbf {G}}, \mathbf {S}, {{ strat }}_1) = Typing(\mathrm {\mathbf {G}}, \mathbf {S}, {{ strat }}_2)\).

Proof

(Idea) The proof uses a classical technique as e.g. for stratified Datalog. There exists a unique (up to permutation on the numbering of stratums) most refined stratification \({ strat }_{\text {ref}}\) such that for any other stratification \({ strat }'\), each stratum of \({ strat }'\) can be obtained as a union of stratums of \({ strat }_{\text {ref}}\). Then we show that for any stratification \({ strat }'\), \(Typing(\mathrm {\mathbf {G}}, \mathbf {S}, { strat }') = Typing(\mathrm {\mathbf {G}}, \mathbf {S}, { strat }_{\text {ref}})\).

Definition 5

(maximal correct typing). Let \(\mathbf {S}= (\mathbf {L}, \mathbf {def})\) be a schema and \(\mathrm {\mathbf {G}}\) be a graph. The maximal correct typing of \(\mathrm {\mathbf {G}}\) by \(\mathbf {S}\) is denoted \(Typing(\mathrm {\mathbf {G}}, \mathbf {S})\) and is defined as \(Typing(\mathrm {\mathbf {G}}, \mathbf {S}, { strat })\) for some stratification \({ strat }\) of \(\mathbf {S}\).

Definition 6

(shapes_sem). Let \(\mathbf {S}= (\mathbf {L}, \mathbf {def})\) be a schema and \(\mathrm {\mathbf {G}}\) be a graph. We say that node n (of \(\mathrm {\mathbf {G}}\)) has shape L (from \(\mathbf {S}\)) if \((n,L) \in Typing(\mathrm {\mathbf {G}},\mathbf {S})\).

4 Validation

In Sect. 3 we gave a declarative semantics of the shapes language. We now consider the related computational problem. We are again interested in the shapes_sem statement (as defined in Sect. 3), i.e. checking whether a given node has a given shape.

4.1 Refinement Algorithm

[Algorithm 1: the \( refine \) algorithm (figure not reproduced)]

Algorithm 1 computes \(\mathop { Typing }(\mathrm {\mathbf {G}}, \mathbf {S}, { strat })\). The i-th iteration of the loop on line 2 computes \( typing _i\) from Definition 4. The algorithm is correct thanks to Lemma 2 from [8] applied to every stratum i. According to that lemma, the maximal typing, defined as the union of all correct typings (i.e. \( typing _i\)), can be computed by iteratively removing unsatisfied node-type associations (done on line 11) until a fixed point is reached (detected when \( changing \) remains \( false \)). The advantage of the \( refine \) algorithm is that once \(\mathop { Typing }(\mathrm {\mathbf {G}}, \mathbf {S})\) is computed, testing whether node n has shape L incurs no additional cost: it suffices to test whether \((n, L)\) belongs to \(\mathop { Typing }(\mathrm {\mathbf {G}},\mathbf {S})\). The drawback is that it considers all node-type associations, which is not always necessary, as the next section shows.
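The refinement loop described above can be sketched in Python as follows. This is a minimal sketch reconstructed from the prose, not a transcription of Algorithm 1: the test \( typing , n \vdash \mathbf {def} (L)\) is abstracted into a hypothetical callback `satisfies`, and the toy schema (labels `Person` and `Loner`) and graph are invented for illustration.

```python
def refine(nodes, strata, satisfies):
    """Compute a typing stratum by stratum.

    nodes:     list of graph nodes
    strata:    dict mapping each shape label to its stratum number
    satisfies: hypothetical callback (typing, n, L) -> bool standing for
               the test "typing, n |- def(L)"
    """
    typing = set()
    for i in sorted(set(strata.values())):
        labels_i = [L for L, s in strata.items() if s == i]
        # Start from all associations for the labels of stratum i ...
        typing |= {(n, L) for n in nodes for L in labels_i}
        changing = True
        while changing:
            changing = False
            # ... and remove the unsatisfied ones until a fixed point.
            for (n, L) in list(typing):
                if strata[L] == i and not satisfies(typing, n, L):
                    typing.discard((n, L))
                    changing = True
    return typing


# Toy data (assumption): a Person has a name and knows only Persons;
# a Loner knows no Person (a negated reference, hence a higher stratum).
KNOWS = {"a": ["b"], "b": ["a"], "c": ["d"], "d": []}
NAMED = {"a", "b", "c"}
STRATA = {"Person": 1, "Loner": 2}

def satisfies(typing, n, L):
    if L == "Person":
        return n in NAMED and all((m, "Person") in typing for m in KNOWS[n])
    return not any((m, "Person") in typing for m in KNOWS[n])

result = refine(list(KNOWS), STRATA, satisfies)
```

On the toy data, nodes a and b keep the label `Person` although they depend on each other cyclically, which reflects the maximal-solution semantics; c and d end up with the label `Loner`.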

4.2 Recursive Algorithm

Algorithm 2 checks whether node n has shape L without constructing \(\mathop { Typing }(\mathrm {\mathbf {G}},\mathbf {S})\). The idea is to visit only a sufficiently large portion of \(\mathop { Typing }(\mathrm {\mathbf {G}},\mathbf {S})\).

Example 6

(Motivation of the prove algorithm). Considering schema \(\mathbf {S}_3\) from Example 5 and graph \(\mathrm {\mathbf {G}}_3\) below:

[Figure: graph \(\mathrm {\mathbf {G}}_3\) (not reproduced)]

We want to check whether has shape . Remark that the neighbor nodes of are and 4, whereas the shape labels on which the definition of depends are and . Any correct proof for (or for ) would have as leaves either applications of Rule se-value-descr that do not depend on \( typing \), or applications of Rule te-tpattern that uses node-type associations (nL) where n is a neighbor of and \(L'\) is a label such that or .

Assume schema \(\mathbf {S}= (\mathbf {L}, \mathbf {def})\) and graph \(\mathrm {\mathbf {G}}\). For a shape label L in \(\mathbf {S}\) and a node n in \(\mathrm {\mathbf {G}}\), we denote by \( dep (n, L)\) the set of node-type associations \((n', L')\) s.t. \(n'\) is a neighbor of n (that is, \((n, p, n') \in neigh (n)\) for some \(\mathsf {IRI} \) p) and \(L'\) appears as a shape reference in \(\mathbf {def} (L)\). Algorithm 2 relies on the following property, which is easy to show: \( typing , n \,\models \, \mathbf {def} (L)\) iff \( typing \cap dep (n,L), n \,\models \, \mathbf {def} (L)\). In order to check whether n has shape L, Algorithm 2 (recursively) checks whether \(n'\) has shape \(L'\) for all \((n',L')\) in \( dep (n,L)\). The parameter \( Hyp \) is a stack of node-type associations that is also seen (on line 3) as the set of node-type associations it contains. \( Dep \) is a set of node-type associations.

[Algorithm 2: the \( prove \) algorithm (figure not reproduced)]
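The recursion described above can be sketched in Python as follows. The sketch is reconstructed from the prose only and should not be read as a transcription of Algorithm 2: `dep` and `satisfies` are hypothetical callbacks standing for \( dep (n, L)\) and for the test \( typing , n \vdash \mathbf {def} (L)\), and the toy schema (labels `Person` and `Loner`) and graph are invented for illustration.

```python
def prove(n, L, hyp, dep, satisfies):
    """Check whether node n has shape L, visiting only what is needed.

    hyp:       stack (list) of node-type associations assumed to hold
    dep:       hypothetical callback (n, L) -> set of associations (n', L')
    satisfies: hypothetical callback (typing, n, L) -> bool
    """
    hyp.append((n, L))
    proved = set()                       # plays the role of Dep
    for (n2, L2) in dep(n, L):
        if (n2, L2) in hyp:              # already assumed: do not recurse
            continue
        if prove(n2, L2, hyp, dep, satisfies):
            proved.add((n2, L2))
    # Test def(L) against the proved associations plus the hypotheses.
    result = satisfies(proved | set(hyp), n, L)
    hyp.pop()
    return result


# Toy data (assumption): a Person has a name and knows only Persons;
# a Loner knows no Person.
KNOWS = {"a": ["b"], "b": ["a"], "c": ["d"], "d": []}
NAMED = {"a", "b", "c"}
REFS = {"Person": ["Person"], "Loner": ["Person"]}

def dep(n, L):
    return {(m, L2) for m in KNOWS[n] for L2 in REFS[L]}

def satisfies(typing, n, L):
    if L == "Person":
        return n in NAMED and all((m, "Person") in typing for m in KNOWS[n])
    return not any((m, "Person") in typing for m in KNOWS[n])

a_is_person = prove("a", "Person", [], dep, satisfies)
c_is_loner = prove("c", "Loner", [], dep, satisfies)
```

In the call for node a and label `Person`, the nested call for the same pair is cut off because the association is already on the stack, so the cycle between a and b does not cause infinite recursion.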

Example 7

(Execution trace of the prove algorithm). Here is the tree of recursive calls generated during the evaluation of for graph \(\mathrm {\mathbf {G}}_3\) and schema \(\mathbf {S}_3\), where [] is the empty stack. The returned value is given on the right. generates four recursive calls that correspond to . The call for and does not generate any recursive call: contains only which is on the stack.

[Figure: tree of recursive calls of \( prove \) (not reproduced)]

The correctness of the \( prove \) algorithm is stated by the following:

Proposition 2

(Correctness of the \( prove \) algorithm). For any node n and any shape label L, the evaluation of \( prove (n, L, [])\) terminates and returns true if \((n,L) \in \mathop { Typing }(\mathrm {\mathbf {G}}, \mathbf {S})\) and false otherwise.

Proof

(Sketch). For termination: the recursion cannot have infinite breadth, as \( prove \) generates a finite number of recursive calls on line 4. Infinite-depth recursion is also impossible because \( Hyp \) is a call stack and the condition on line 3 prevents \( prove \) from being called (recursively) with a node and label that are already on the stack.

The proof of correctness goes by induction on the stratum of L using the most refined stratification \({ strat }\). For every stratum i we show that whenever \( Hyp \) contains only node-type associations \((n',L')\) with \({ strat }(L') \ge i\), then for any L s.t. \({ strat }(L) = i\), \(\mathop { Typing }, n \vdash \mathbf {def} (L)\) iff \( prove (n,L, Hyp )\) returns \( true \). For the \(\Rightarrow \) direction, the main argument is that if \(\mathop { Typing }, n \vdash \mathbf {def} (L)\) then also \(\mathop { Typing }\cup Hyp , n \vdash \mathbf {def} (L)\). This is not true in general because of negation, but it is true if \( Hyp \) is on stratum \(\ge { strat }(L)\), as in this case no type in \( Hyp \) is negated in \(\mathbf {def} (L)\). For the \(\Leftarrow \) direction, we need to show that if \( prove (n,L, Hyp )\) returns \( true \) then \((n,L) \in \mathop { Typing }(\mathrm {\mathbf {G}},\mathbf {S})\). The problematic case is when \( prove (n,L, Hyp )\) returns \( true \) whereas n does not have label L. Such an error necessarily comes from the fact that on line 6 the algorithm used some \((n',L') \in Hyp \smallsetminus \mathop { Typing }(\mathrm {\mathbf {G}},\mathbf {S})\) in the proof for \( Dep \cup Hyp , n \vdash \mathbf {def} (L)\). Consequently, \({ strat }(L) = { strat }(L')\), and because we consider the most refined stratification, it follows that L and \(L'\) mutually depend on each other in the dependency graph of \(\mathbf {S}\). We then distinguish two cases. Either all shape labels on stratum i depend only on each other, as for instance and from Example 5. In that case \( prove (n,L, Hyp )\) returns \( true \) based only on hypotheses in \( Hyp \), which is correct w.r.t. the semantics based on the maximal solution: if nothing outside stratum i allows us to disprove that n has label L, then it is indeed the case.

The other possibility is that a shape label \(L'\) on stratum i depends also on shapes from lower strata, as from Example 2 that depends on . Then the test on line 6 of the call of \( prove \) with \(L'\) will take this dependency into account and return \( true \) only if all conditions, including those that depend on the lower strata, are satisfied.    \(\square \)

4.3 On Implementation of the Validation Algorithms

Both algorithms use a test for \( typing , n \vdash \mathbf {def} (L)\), whose non-trivial part is the test of the \(\vDash \) relation required in Rule se-neig-descr. The latter is equivalent to checking whether a word (a string) matches a regular expression, disregarding the ordering of the letters of the word. Here the word is over the alphabet of triple patterns that occur in the triple expression. In [4] we presented an algorithm for this problem based on regular expression derivatives. In [8] we gave another algorithm for so-called deterministic single-occurrence triple expressions. That algorithm can be extended to general expressions, and was used in several of the implementations of ShEx available as open source (see Footnote 8).
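As an illustration of order-insensitive matching with derivatives, the Python sketch below decides whether a bag of symbols matches a small regular expression whose concatenation is unordered. It is written in the spirit of the derivative-based approach but is not the algorithm of [4]: the expression classes and function names are ours, symbols are plain strings, and we assume each triple of the neighbourhood has already been mapped to the single triple pattern it matches.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty:            # matches only the empty bag
    pass

@dataclass(frozen=True)
class Fail:             # matches no bag at all
    pass

@dataclass(frozen=True)
class Sym:              # matches the bag holding exactly this symbol
    name: str

@dataclass(frozen=True)
class Or:               # disjunction
    left: object
    right: object

@dataclass(frozen=True)
class Both:             # unordered concatenation
    left: object
    right: object

@dataclass(frozen=True)
class Star:             # zero or more repetitions
    inner: object

def mk_or(x, y):
    if isinstance(x, Fail):
        return y
    if isinstance(y, Fail) or x == y:
        return x
    return Or(x, y)

def mk_both(x, y):
    if isinstance(x, Fail) or isinstance(y, Fail):
        return Fail()
    if isinstance(x, Empty):
        return y
    if isinstance(y, Empty):
        return x
    return Both(x, y)

def nullable(e):
    """True iff e matches the empty bag."""
    if isinstance(e, (Empty, Star)):
        return True
    if isinstance(e, (Fail, Sym)):
        return False
    if isinstance(e, Or):
        return nullable(e.left) or nullable(e.right)
    return nullable(e.left) and nullable(e.right)     # Both

def deriv(e, a):
    """Expression matching what remains after removing one symbol a."""
    if isinstance(e, (Empty, Fail)):
        return Fail()
    if isinstance(e, Sym):
        return Empty() if e.name == a else Fail()
    if isinstance(e, Or):
        return mk_or(deriv(e.left, a), deriv(e.right, a))
    if isinstance(e, Both):
        # a may be consumed by either side, since the order is irrelevant
        return mk_or(mk_both(deriv(e.left, a), e.right),
                     mk_both(e.left, deriv(e.right, a)))
    return mk_both(deriv(e.inner, a), e)               # Star

def matches(e, bag):
    for a in bag:
        e = deriv(e, a)
    return nullable(e)

# Exactly one "name" and any number of "knows", in any order.
EXPR = Both(Sym("name"), Star(Sym("knows")))
```

Because the concatenation is unordered, `matches` gives the same answer for every ordering of the input symbols, so the bag can be processed in whatever order the triples are enumerated.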

The \( prove \) algorithm was presented in a form that is easier to understand but not optimized. An implementation could considerably reduce the search space of the algorithm by exploring only relevant node-shape associations from \( dep (n,L)\). For instance, in Example 7 checking 4 against is useless first because 4 is accessible from by whereas in the schema is accessible from by .

A more involved version of the \( prove \) algorithm could memoize portions of \(\mathop { Typing }(\mathrm {\mathbf {G}}, \mathbf {S})\) for reuse. This, however, should be done carefully: one should not memoize all node-shape associations \((n, L)\) for which the algorithm returned \( true \), as some of these can be false positives, as discussed in the proof of Proposition 2.

5 Conclusion

In this paper we introduced shapes schemas that formalize the semantics of ShEx 2.0 and we showed that the semantics of ShEx 2.0 is sound. We also presented two algorithms for validating an rdf graph against a shapes schema.

ShEx and the underlying formalism presented here are still evolving, and there are several promising directions, some of which are already being explored: introducing operators for value comparison, using property paths in triple patterns, and defining an rdf transformation language based on ShEx. We also plan to consider several heuristics and optimizations, such as those discussed in Sect. 4.3, in order to accelerate the validation of shapes schemas. These will be validated on examples. Another open problem is error reporting in ShEx: how to give useful feedback for correcting validation errors. We also plan to explore the exact relationship between shapes schemas and shacl and establish whether shapes schemas can be encoded in sparql extended with recursion, such as the one defined in [7].