
1 Introduction

In the past few years, there has been a steady increase in the amount of available rdf data [10]. If a publisher decides to provide live queryable access to datasets, the default choice is to offer a public sparql endpoint. Users can then query this data using the sparql query language [5]. The downside of the flexibility of sparql is that some queries require significant processing power. Many such complex queries together can put a heavy load on the server, causing significant delays or even downtime. Recently, triple pattern fragments (tpf [15]) were introduced as a way to reduce this load on the server by partially offloading query processing to clients. This is done by restricting the tpf server interface to simpler queries. Clients can then obtain answers to complex sparql queries by requesting multiple simple queries and combining the results locally. Concretely, a tpf server only replies to requests for a single triple pattern. The response of the server is a list of matching triples, which is paged if it would otherwise be too large. Furthermore, each tpf contains metadata and hypermedia controls to aid clients with query execution.

The biggest challenge for the client is deciding which triple pattern queries lead to the most efficient solution strategy. Since every subquery causes a new http request to the server, minimizing the number of requests reduces the network load and improves the total response time. The algorithm proposed by Verborgh et al. [15] is greedy: at each decision point, clients choose the local optimum by executing the request that has the fewest results. This works fine for certain classes of queries, but for others it can perform quite badly. In this paper, we therefore propose a new solution that tries to minimize the number of http requests, thus reducing network traffic, server load, and total response time. We make use of all metadata provided by the tpf server and attempt to predict the optimal query path based on a combination of metadata and intermediate results.

In Sect. 2, we outline the core concepts of the problem space and relate them to existing work. In Sect. 3, we take a closer look at the problem statement and its necessity. Section 4 introduces our solution for tpf-based query optimization, while an optimized triple store for this algorithm is described in Sect. 5. The results of our work are evaluated in Sect. 6 before concluding in Sect. 7.

2 Core Concepts and Related Work

Since the way queries are executed on the Web depends on the available interfaces on the server side, we first discuss the range of existing interfaces. We then describe different approaches to execute queries over such interfaces.

2.1 RDF Interfaces on the Web

Linked Data Fragments. In order to characterize the many possibilities for publishing Linked Datasets on the Web, Linked Data Fragments (ldf [15]) was introduced as a uniform view on all possible Web apis to Linked Data. The common characteristic of all interfaces is that, in one way or another, they offer specific parts of a dataset. Consequently, by analyzing the parts offered by an interface, we can analyze the interface itself. Each part is called a Linked Data Fragment, consisting of:

  • data: the triples of the dataset that match an interface-specific selector;

  • metadata: triples to describe the fragment itself;

  • controls: hyperlinks and/or hypermedia forms that lead to other fragments.

The choices made for each of those elements influence the functional and non-functional properties of an interface. This includes the server-side effort to generate fragments, the cacheability of those fragments, the availability and performance of query execution, and the party responsible for executing those queries.

File-Based Datasets. So-called data dumps are conceptually the simplest apis: the data consists of all triples in the dataset. They are combined into a (usually compressed) archive and published at a single URL. Sometimes the archive contains metadata, but controls, with the possible exception of http uris in rdf triples, are not present. Query execution on these file-based datasets is entirely the responsibility of the client; obtaining up-to-date query results requires re-downloading the entire dataset periodically or upon change.

SPARQL Endpoints. The sparql query language [5] allows expressing very precise selections of triples in rdf datasets. sparql endpoints [4] allow the execution of sparql queries on a dataset through http. A sparql fragment’s data consists of the triples matching the query (assuming the CONSTRUCT form); the metadata and control sets are empty. Query execution is performed entirely by the server, and because each client can send highly individualized requests, the reusability of fragments is low. This, combined with the complexity of sparql query execution, likely contributes to the low availability of public sparql endpoints [3].

Triple Pattern Fragments. The triple pattern fragments (tpf) interface [14] has been designed to minimize server-side processing, while at the same time enabling efficient live querying on the client side. A fragment’s data consists of all triples that match a specific triple pattern, and can be paged. Each fragment page mentions the estimated total number of matches to allow for query planning, and contains hypermedia controls to find all other triple pattern fragments of the same dataset. Since requests are less individualized, fragments are more likely to be reused across clients, which increases the benefit of caching [14]. Because of the decreased complexity, the server does not necessarily require a triple store to generate fragments, which enables less expensive servers.

2.2 Query Execution Approaches

Server-Side Query Processing. The traditional way of executing sparql queries is to let the server handle the entire query processing. The server hosts the triple store containing all the data, and is responsible for parsing and executing queries. The client simply pushes a query and receives the results. Several research efforts focus on optimizing how servers execute queries, for example, by using heuristics to predict the optimal join path [13], or by rewriting to produce a less complex query [11]. Quite often, these interfaces are made available through public sparql endpoints, with varying success [3]. Another downside is that it is unclear which queries servers can execute, as not all servers support the complete sparql standard [3].

Client-Side Query Processing. Hartig [6] surveyed several approaches to client-side query processing, in particular link-traversal-based querying. The only assumption for such approaches is the existence of dereferencing, i.e., a server-side api such that a request for a URL results in rdf triples that describe the corresponding entity. sparql queries are then solved by dereferencing known URLs inside of them, traversing links to obtain more information. While this approach works with a limited server-side api, querying is slow and not all queries can be solved in general.

Hybrid Query Processing. With hybrid query processing approaches, clients and servers each solve a part of a sparql query, enabling faster queries than link-traversal-based strategies, yet lower server-side processing cost than that of sparql endpoints. One such strategy is necessary when the server offers a triple pattern interface: complex sparql queries are decomposed into triple patterns by clients [14]. While this reduces server load, it means that clients must execute more complex queries themselves. In this paper, we devise an optimized algorithm for tpf-based querying.

Federated Query Processing. Executing federated queries requires access to data on multiple servers. The associated problems include source selection, i.e., finding which servers are necessary to solve a specific query, and executing the query in such a way that network traffic and response time are minimized [7, 9, 12]. Our approach similarly aims to reduce the number of http requests. The difference is again the type of queries allowed by the interface: while federated systems similarly require splitting up the query depending on the content of the servers, they still assume these servers answer complete sparql queries.

3 Problem Statement

As mentioned in Sect. 1, the greedy algorithm to execute sparql queries against triple pattern fragments [14] performs badly in several situations. For instance, consider the query in Listing 1.1, taken from the original tpf paper [15].

[Listing 1.1: example sparql query with three triple patterns \(p_1\), \(p_2\), and \(p_3\), selecting persons (?person) born in European capital cities (?city)]

The example shows how many matches the server indicates when requesting the first page of each triple pattern. Between different tpf servers, the page size (number of triples per request) can vary. Assuming a page size of 100, the results of \(p_3\) fit on the first page. The greedy algorithm would thus start from the triples of \(p_3\), map all of its ?city bindings to \(p_2\) (57 cities with an average of 750 people per city, i.e., roughly 8 pages per city, \(\approx \pm 430\) calls), then map all ?person bindings to \(p_1\) (\(\pm 43,000\) calls). A more efficient solution would be to download all triples from \(p_1\) (12 calls) and join them locally with the values of \(p_2\), thus reducing the total number of calls from \(\pm 43,440\) to \(\pm 440\).
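Written out with these numbers (page size 100, plus a handful of initial requests to obtain the counts), the two strategies compare roughly as follows:

$$\begin{aligned} \text {greedy:}\quad&\underbrace{57 \cdot \tfrac{750}{100} \approx 430}_{\text {bind ?city into } p_2} + \underbrace{\approx 43{,}000}_{\text {bind ?person into } p_1} \approx \pm 43{,}440 \text { calls}\\ \text {alternative:}\quad&\underbrace{57 \cdot \tfrac{750}{100} \approx 430}_{\text {bind ?city into } p_2} + \underbrace{12}_{\text {download } p_1} \approx \pm 440 \text { calls} \end{aligned}$$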

The problem is that because of the limited information, we cannot know in advance what the optimal solution would be, which means heuristics will be necessary. The algorithm we will propose next tries to find a more efficient solution by looking for a global optimum instead of a local one, and this while emitting results in a streaming way.

4 Client-Side Query Execution Algorithm

To find the optimal queries to ask the server, we need to maximize the utility of all available metadata, which becomes increasingly available as responses arrive. During every iteration, we re-evaluate the choices made based on new data from the server. Decisions are based on estimates, which are updated continuously.

Like typical client-side querying approaches [6], our optimization focuses on Basic Graph Pattern (bgp) queries. Filters, unions, and other non-bgp elements are applied locally to the results of these bgp components. For generality, we assume all triple patterns in the bgp are connected through their variables; if not, Cartesian joins can connect independent parts. The algorithm consists of (1) an initialization, followed by an iteration of (2) selection, (3) reading, and (4) propagation, until (5) termination.

4.1 Initialization

During initialization, we try to use the available information to make our initial assumptions. Information is still sparse at this point: tpfs only contain an estimated match count for each triple pattern. Using these counts, we try to predict which patterns would be best to start with. Once the algorithm is iterating, these predictions are updated based on the new data we receive.

Triple Pattern Roles. Our goal is to find all relevant triples for every triple pattern and then join these locally. The algorithm assigns one of two ways to obtain relevant triples for a pattern, called the role of a pattern.

Patterns with the download role—simply called download patterns—are the most straightforward option. To receive download pattern data, we request the corresponding triple pattern from the server. The server replies with a page of initial triples and a link to the next page. By continuously requesting the remaining pages, we obtain all matches. An advantage of this role is that each new http request results in a full page of data, which is the highest possible number of results per request.

In contrast, bind patterns depend on the results of other patterns. They bind values to one of their variables, hence the name. For each binding that arrives from upstream, the client sends a request to the server for the bound triple pattern (which is subsequently treated as a download pattern). The total number of http requests needed for this role depends on the number of bound values and on the average number of triples per binding. If the number of bindings is low, the bind role potentially needs significantly fewer http requests to retrieve all relevant triples. If, on the other hand, the number of bindings is high, using a download pattern would be more efficient.

To clarify these roles, consider the example in Listing 1.1. Assuming we already obtained all the European capitals from \(p_3\), we then have to choose a role for \(p_2\). Choosing the download role amounts to sending the pattern \(p_2\) to the server and requesting its pages. Assuming a page size of 100, this requires \(\pm 4,300\) requests. The bind role would bind the variable ?city to the local list of European capitals. We would then send all these bound patterns (e.g., ?person dbpprop:birthplace dbpedia-owl:Amsterdam, ?person dbpprop:birthplace dbpedia-owl:Athens...) to the server and request all their pages. In this case, this results in a total of \(\pm 430\) calls—10 times less than if we chose the download role.
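To make this trade-off concrete, the following sketch estimates the number of requests each role would need. It is purely illustrative: the type and function names are ours, and the numbers are the ones from the example above.

```typescript
// Illustrative sketch (not the actual client implementation): the two roles a
// triple pattern can take, and a rough estimate of the HTTP requests each needs.
type Variable = string;

type Role =
  | { kind: "download" }                   // fetch all pages of the pattern itself
  | { kind: "bind"; variable: Variable };  // re-request the pattern once per bound value

// Requests needed to download a pattern completely, given its match estimate.
function downloadRequests(count: number, pageSize: number): number {
  return Math.ceil(count / pageSize);
}

// Requests needed when binding a variable: one or more pages per bound value.
function bindRequests(bindings: number, avgTriplesPerBinding: number, pageSize: number): number {
  return bindings * Math.max(1, Math.ceil(avgTriplesPerBinding / pageSize));
}

// Listing 1.1, pattern p2, page size 100: downloading needs ±4,300 requests,
// binding ?city to the 57 capitals (±750 persons each) needs 57 · 8 = 456 requests,
// in line with the ±430 estimated above.
console.log(downloadRequests(430_000, 100), bindRequests(57, 750, 100));
```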

Initial Role Assignment. The role choice for each pattern has a big influence on the number of http requests. Unfortunately, the initial count metadata provides almost no knowledge about the data properties of each pattern. In general, we can only decide after having executed the query which solution would have been best. At runtime, we are thus forced to make assumptions for role assignment. Our initial role assignment is purposely simple, as can be seen in Algorithm 1. We make use of the following helper functions and sets.

  • \(P\): a query’s bgp, consisting of triple patterns \(t_0,\ldots ,t_n\).

  • \(V\): all variables in \(P\).

  • \(R\): \(\{ download \} \cup \{ bind _v \mid v \in V\}\), the set of possible roles.

  • \(\mathrm {vars}(t)\): \(P \rightarrow 2^V\), all variables in the given triple pattern.

  • \(\mathrm {count}(t)\): \(P \rightarrow \mathbb {N}\), the total match estimate for the given triple pattern.

  • \(\mathrm {role}(t)\): \(P \rightarrow R\), the role of a pattern.

All \( bind _v\) patterns bind their variable \(v\) to values found by other patterns. Since not all patterns can depend on each other, we need at least one download pattern. We choose the smallest pattern to be our initial download pattern, which is the best possible choice given the initial knowledge. Each remaining pattern is assigned a \( bind _v\) role for a specific \(v\), since bind patterns are often a lot more efficient than download patterns. We will show later how to update roles at runtime in case this assumption is proven wrong.
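A minimal sketch of this naive assignment, under the assumptions described above (the identifiers are ours, and the way the bound variable is chosen for each remaining pattern is one plausible reading, not necessarily the exact procedure of Algorithm 1):

```typescript
// Naive initial role assignment: the smallest pattern downloads, the rest bind.
type Variable = string;
type Role = { kind: "download" } | { kind: "bind"; variable: Variable };

interface TriplePattern {
  vars: Variable[];  // vars(t)
  count: number;     // count(t): estimated total matches, from the TPF metadata
  role?: Role;       // role(t), assigned below
}

function assignInitialRoles(patterns: TriplePattern[]): void {
  // The pattern with the smallest estimated count becomes the single download pattern.
  const download = patterns.reduce((min, t) => (t.count < min.count ? t : min));
  download.role = { kind: "download" };

  // Every other pattern binds one of the variables it shares with already assigned
  // patterns, so that bound values can flow outwards from the download pattern.
  const supplied = new Set<Variable>(download.vars);
  const remaining = patterns.filter(t => t !== download).sort((a, b) => a.count - b.count);
  for (const t of remaining) {
    const v = t.vars.find(x => supplied.has(x)) ?? t.vars[0];
    t.role = { kind: "bind", variable: v };
    t.vars.forEach(x => supplied.add(x));  // its other variables are now supplied too
  }
}
```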

Supply Graph. A pattern \(t\) supplies values for a variable \(v\) if \(v \in \mathrm {vars}(t)\) and \(\mathrm {role}(t) \ne bind_v\). A pattern \(t\) is supplied by a variable \(v\) if \(\mathrm {role}(t) = bind _v\). If a pattern is supplied by a variable and has no other variables, we say it filters that variable. These filter patterns provide no new values; they can only be used to check if the bindings found so far are valid. Using these definitions we can introduce the supply graph.

[Algorithm 1: naive initial role assignment]

A supply graph visualizes the dependencies between different patterns. The supply graph in Fig. 1 is the result of applying Algorithm 1 to the query in Listing 1.2. These dependencies will be used multiple times by the algorithm.
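The relations defined above translate directly into code; a small sketch (again with our own naming) of the \(\mathrm {suppliers}\) helper and the filter-pattern test:

```typescript
// "Supplies" / "filter" relations over patterns, as defined in the text.
type Variable = string;
type Role = { kind: "download" } | { kind: "bind"; variable: Variable };
interface TriplePattern { vars: Variable[]; count: number; role: Role; }

// A pattern supplies v if it contains v and is not bound on v.
function supplies(t: TriplePattern, v: Variable): boolean {
  return t.vars.includes(v) && !(t.role.kind === "bind" && t.role.variable === v);
}

// suppliers(v): all patterns that supply the variable v.
function suppliers(patterns: TriplePattern[], v: Variable): TriplePattern[] {
  return patterns.filter(t => supplies(t, v));
}

// A filter pattern is bound on its only variable: it can reject candidate
// bindings but never contributes new values.
function isFilter(t: TriplePattern): boolean {
  return t.role.kind === "bind" && t.vars.length === 1 && t.vars[0] === t.role.variable;
}
```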

[Listing 1.2: example query used to illustrate the supply graph]
Fig. 1. Supply graph after applying Algorithm 1 to Listing 1.2. The top node is a download pattern; the double-sided arrow indicates a filter pattern.

[Algorithm 2: extended initial role assignment]

Extended Initial Role Assignment. The supply graph allows us to improve upon the naive role assignment of Algorithm 1. To this end, Algorithm 2 extends the initial role assignment algorithm, using this helper function:

  • \(\mathrm {suppliers}(v)\): \(V \rightarrow 2^P\), the suppliers for the given variable.

The main purpose of this extended role assignment is to improve our initial assignments. Although suboptimal role assignment would be detected at runtime, this would take some time, increasing the wait until the first results.

The changes are twofold. Firstly, if a pattern has a much lower count than the patterns that supply its bound variable, we change its role to download. The underlying assumption is that even if only 1 in 100 (empirically chosen) bindings can be matched between the suppliers, it would still be more efficient to download this pattern upfront. Secondly, we check whether it would be more efficient to bind a pattern to one of its other variables. If the suppliers of its other variable have a lower count, we assume there will exist fewer bindings for that variable.
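A sketch of these two adjustments follows. It is our reading of Algorithm 2 rather than a faithful reproduction: in particular, how the counts of multiple suppliers are combined (here: their minimum) and the assumption that every bound variable has at least one supplier are ours.

```typescript
// Extended role assignment: reconsider the naive bind roles using the supply graph.
type Variable = string;
type Role = { kind: "download" } | { kind: "bind"; variable: Variable };
interface TriplePattern { vars: Variable[]; count: number; role: Role; }

const FACTOR = 100; // "1 in 100", empirically chosen in the paper

function extendRoleAssignment(
  patterns: TriplePattern[],
  suppliers: (v: Variable) => TriplePattern[]
): void {
  for (const t of patterns) {
    if (t.role.kind !== "bind") continue;
    const supplierCount = (v: Variable) =>
      Math.min(...suppliers(v).filter(s => s !== t).map(s => s.count));

    // (1) If this pattern is much smaller than the patterns supplying its bound
    //     variable, downloading it upfront is cheaper even if only 1 in FACTOR
    //     supplier bindings can be matched.
    if (t.count * FACTOR < supplierCount(t.role.variable)) {
      t.role = { kind: "download" };
      continue;
    }

    // (2) If another of its variables has suppliers with a lower count, fewer
    //     bindings are expected for that variable, so bind it instead.
    for (const v of t.vars) {
      if (v === t.role.variable) continue;
      if (supplierCount(v) < supplierCount(t.role.variable)) {
        t.role = { kind: "bind", variable: v };
      }
    }
  }
}
```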

Pattern Dependencies. During execution, multiple patterns might be bound to the same variable. Binding all known accepted values for that variable to both patterns would be wasteful: there is no need for the second pattern to check a binding rejected by the first. To solve this, we introduce pattern dependencies. These are all preceding patterns a value has to “pass through” before it can be used by a pattern. This is done by generating an ordered list of patterns, using the ordering \(\prec \ \subset P \times P\) defined below.

We define \(\forall t, t' \in P\):

  • \(\mathrm {role}(t) = download \;\wedge \; \mathrm {role}(t') \ne download \Rightarrow t \prec t'\)

  • \(\exists v \in V : v \in \mathrm {vars}(t') \;\wedge \; t \in \mathrm {suppliers}(v) \wedge t' \notin \mathrm {suppliers}(v) \Rightarrow t \prec t'\)

  • If ordering not implied by previous rules: \(\mathrm {count}(t) < \mathrm {count}(t') \Rightarrow t \prec t'\)

This ordering gets applied to the output of Algorithm 2. Algorithm 3 then uses this sorted list to generate the dependencies for a bind pattern. We take all patterns from the ordered list preceding the given pattern. Because of the way this list is structured, this includes all patterns that directly or indirectly supply that pattern.

[Algorithm 3: computing the dependencies of a bind pattern]
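The ordering and the resulting dependencies can be sketched as follows (our reconstruction; the comparator mirrors the three rules above, and dependencies are simply all patterns that precede the given one in the sorted list):

```typescript
// Pattern ordering ≺ and dependency generation.
type Variable = string;
type Role = { kind: "download" } | { kind: "bind"; variable: Variable };
interface TriplePattern { vars: Variable[]; count: number; role: Role; }

function precedes(
  t: TriplePattern, u: TriplePattern,
  suppliers: (v: Variable) => TriplePattern[]
): boolean {
  // Rule 1: download patterns come before bind patterns.
  if (t.role.kind === "download" && u.role.kind !== "download") return true;
  if (u.role.kind === "download" && t.role.kind !== "download") return false;
  // Rule 2: a pattern precedes another if it supplies one of the other's
  // variables that the other pattern does not supply itself.
  const suppliesOther = (a: TriplePattern, b: TriplePattern) =>
    b.vars.some(v => suppliers(v).includes(a) && !suppliers(v).includes(b));
  if (suppliesOther(t, u)) return true;
  if (suppliesOther(u, t)) return false;
  // Rule 3: otherwise, fall back to the match estimates.
  return t.count < u.count;
}

// Dependencies of a pattern: every pattern preceding it in the sorted list.
function dependencies(sorted: TriplePattern[], t: TriplePattern): TriplePattern[] {
  return sorted.slice(0, sorted.indexOf(t));
}
```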

4.2 Selection

This is the first iterative step of the algorithm and assumes we have all the information generated during the initialization phase. In every iteration, we download triples for a single pattern. The choice of this pattern is important: repeatedly downloading triples for one pattern alone will never produce results for the full query. Hence, we try to estimate which pattern has the highest chance of providing new results once it receives more triples. We do this by first finding local optima: for every variable, we find the pattern that would improve the results for that variable the most. Afterwards, we find the global optimum among them, i.e., the pattern expected to contribute most to the query results.

Locally Optimal Patterns. First, we determine for every variable which pattern we need to download triples from to get more bindings for that variable. The reason is that as soon as we get more bindings for a variable, we can use them in all other patterns containing that variable, bringing us closer to a solution for the query. We only add a binding to a variable if each of its suppliers and filter patterns has a triple containing that binding. We can only know that these triples exist if we downloaded them previously, which is why it is important to choose the correct pattern to download from.

We go through four steps to find our local optimum:

  1. Start with all the suppliers and filter patterns of the variable.

  2. Remove patterns that cannot supply new values. These are bind patterns that have no (unused) bindings.

  3. If there are still filter patterns remaining, return one of these.

  4. If not, return the pattern that has downloaded the least triples so far.

We prioritize filter patterns since having a value for a filter pattern means it already passed all other (non-filter) suppliers. After that, we verify the download count to ensure no supplier is ignored.

Globally Optimal Pattern. Because of the previous step, we now have a single pattern for every variable. First, we filter out any pattern that has a supply path going to any of the other patterns in the list of results, as described in Fig. 1. If there are still multiple patterns remaining, we pick the one with the smallest number of stored triples.

We prioritize patterns on the bottom of the supply graph for the same reason we prioritize filter patterns: if a value reaches that point, it has already passed preceding patterns, increasing the odds of this value leading to a query result.
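The selection step can be sketched as follows (our reconstruction; the candidate list per variable, the supply-path test, and the bookkeeping fields are representational assumptions of ours):

```typescript
// Selection: a locally optimal pattern per variable, then one global optimum.
type Variable = string;
type Role = { kind: "download" } | { kind: "bind"; variable: Variable };
interface TriplePattern {
  vars: Variable[]; count: number; role: Role;
  downloaded: number;      // triples retrieved so far
  unusedBindings: number;  // bound values not yet requested (bind patterns only)
  isFilter: boolean;       // bound on its only variable
}

// Candidates are the suppliers and filter patterns of one variable.
function localOptimum(candidates: TriplePattern[]): TriplePattern | undefined {
  // Drop bind patterns that currently have no unused bindings left.
  const usable = candidates.filter(t => t.role.kind === "download" || t.unusedBindings > 0);
  // Prefer filter patterns: their values already passed all non-filter suppliers.
  const filters = usable.filter(t => t.isFilter);
  if (filters.length > 0) return filters[0];
  // Otherwise pick the pattern that has downloaded the fewest triples so far.
  return [...usable].sort((a, b) => a.downloaded - b.downloaded)[0];
}

function globalOptimum(
  locals: TriplePattern[],
  hasSupplyPathTo: (a: TriplePattern, b: TriplePattern) => boolean
): TriplePattern | undefined {
  // Drop patterns that (directly or indirectly) supply another local optimum.
  const sinks = locals.filter(a => !locals.some(b => b !== a && hasSupplyPathTo(a, b)));
  return [...sinks].sort((a, b) => a.downloaded - b.downloaded)[0];
}
```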

4.3 Reading

Once we have chosen a triple pattern, we fetch its results through a single http request. For download patterns, this means requesting the first page we have not yet retrieved. For bind patterns, we check whether a page remains for the current binding; in that case we download the next page. If not, we bind a new stored value to the bound variable and download the first page of that bound pattern. The new triples are then stored in a local triple store.
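As an illustration, the reading step could be organized as follows (a sketch; the state fields and callbacks are ours, and the real client additionally tracks the hypermedia controls it received in earlier responses):

```typescript
// One reading step: exactly one HTTP request, whose triples go into the local store.
interface Page { triples: object[]; nextPageUrl?: string; }

interface PatternState {
  role: "download" | "bind";
  started: boolean;       // whether the first page of the pattern itself was requested
  pendingPage?: string;   // URL of the next page of the current request, if any
}

async function readStep(
  pattern: PatternState,
  nextUnusedBinding: () => string | undefined,  // next stored value, for bind patterns
  urlFor: (binding?: string) => string,         // TPF request URL, optionally with a bound value
  fetchPage: (url: string) => Promise<Page>,
  store: { add: (triples: object[]) => void }
): Promise<void> {
  let url = pattern.pendingPage;                // continue paging the current request
  if (!url && pattern.role === "bind") {
    const value = nextUnusedBinding();          // otherwise bind a new stored value
    if (value !== undefined) url = urlFor(value);
  }
  if (!url && pattern.role === "download" && !pattern.started) {
    url = urlFor();                             // or start downloading the pattern itself
    pattern.started = true;
  }
  if (!url) return;                             // nothing left to read for this pattern
  const page = await fetchPage(url);
  store.add(page.triples);                      // new triples enter the local triple store
  pattern.pendingPage = page.nextPageUrl;       // remember the link to the following page
}
```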

4.4 Propagation

The previous step added new triples to the local triple store. We now want to use these triples to improve our results in the next iteration. For bind patterns, this means finding new values which can be used as bindings. We do this by executing a query on the local triple store per bind pattern, consisting of all the dependencies of the pattern, as described in Algorithm 3.

Cost Estimation. At this point we also want to verify if our pattern role assumptions were correct. Maybe we made a mistake during initialization because of the limited information. To do this we introduce the following functions:

  • \(G\): the set of ground triples.

  • \(\mathrm {triples}(t)\): \(P \rightarrow 2^G\), the triples downloaded so far for the given pattern.

  • \(\mathrm {pagesize}(t)\): \(P \rightarrow \mathbb {N}\), the page size for the given pattern.

  • \(\mathrm {avgTriples}(t)\): \(P \rightarrow \mathbb {N}\), the average triple count per binding (detailed later).

  • \(\mathrm {valCount}(t)\): \(P \rightarrow \mathbb {N}\), the number of bind values found so far for the pattern.

Even though the algorithm has already performed several http requests, it might still be more efficient to let a pattern switch roles. To verify this, we need to estimate how many http requests are still required to finish a bind pattern and compare that to the number of requests needed if the pattern were a download pattern \(\left( \text {which is}\left\lceil \frac{\mathrm {count}(t)}{\mathrm {pagesize}(t)}\right\rceil \right) \).

To estimate the number of requests for a bind pattern, we start by estimating how many values will be bound to its variable. We use the following function to estimate the total number of requests needed for a bind pattern \(t\):

$$\begin{aligned}&(\text {average pages per binding for }t) \\&\cdot (\text {percentage of supplier triples that contain a new binding}) \\&\cdot (\text {total number of supplier triples}) \end{aligned}$$

We estimate each of these factors from the metadata and the triples downloaded so far.

In case we did not find any values yet, we assume the estimate to be \(\infty \), but we do not change the pattern role. The percentage of supplier triples that contain a new binding is estimated by comparing the number of values found so far to the total number of triples downloaded so far. We assume this ratio will remain stable for the remaining triples we download. This assumption might be too strong, which is why we re-evaluate it at every iteration. We take the maximum of these estimates to compensate for the fact that some patterns might already have downloaded more triples than others.

The function \(\mathrm {avgTriples}\) is an estimate of how many triples are returned per variable binding for this pattern: we need to take into account that a single bound value might have multiple pages that need to be downloaded. This is done by looking at the values we have bound so far. We take the average number of triples for these values and assume it represents the average of future values. Because wrong estimates can substantially skew the results, we only trust the estimate after having acquired multiple counts. We only trust the result if the estimate remains within the margin of error after adding a new value, assuming a Gaussian distribution and a 95 % level of confidence. As before, if we have no values to estimate, or we do not trust the estimate, we assume it to be \(\infty \) without changing the pattern role.
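As a concrete illustration, the three factors can be instantiated with the helper functions above. The following formulation is one plausible reading, not the exact estimator: in particular, taking the maximum over the per-supplier estimates and using \(\mathrm {count}(s)\) as the total number of supplier triples are our assumptions. For a pattern \(t\) with role \( bind _v\):

$$\begin{aligned} \mathrm {estRequests}(t) \approx \left\lceil \frac{\mathrm {avgTriples}(t)}{\mathrm {pagesize}(t)}\right\rceil \cdot \max _{s \in \mathrm {suppliers}(v)} \left( \frac{\mathrm {valCount}(t)}{|\mathrm {triples}(s)|} \cdot \mathrm {count}(s)\right) \end{aligned}$$

with \(\mathrm {estRequests}(t) = \infty \) whenever \(\mathrm {valCount}(t) = 0\) or the \(\mathrm {avgTriples}\) estimate is not yet trusted.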

After these steps, we have an estimate for the number of requests of a bind pattern and can compare it to the number of requests if it was a download pattern. If our estimates indicate that continued use of the bind pattern would require at least 10 % more requests (empirically chosen) than switching to the download role, we change its role and update the supply graph.

Intermediate Results. To find intermediate results to the query, we execute the complete query on our local triple store. This will return all answers to the query that can be found using the triples we have downloaded so far. We do this after every iteration to see if we found new results during that iteration. By using the techniques described in Sect. 5, we minimize the local computation time.

4.5 Termination

Once all download patterns finished retrieving all their pages, and all bind patterns finished all their bindings, the algorithm terminates. All results found so far, which have been emitted in a streaming way, form the response to the query.

5 Local Triple Store

Due to the nature of the algorithm, many similar or even identical queries are executed against a client’s local triple store. For example, in the Intermediate results step we need to execute the complete query to find new results. During every Propagation step we execute a query for each pattern. This query contains the dependencies of the pattern and is thus a subquery of the complete query. A standard triple store might cache repeated queries, but this does not serve our purpose since the data changes every iteration. We instead want to maximize reuse of previous query results. For repeated queries, this means storing the results of intermediate steps. For queries where one is a subquery of the other, this means sharing the intermediate steps. At every iteration of the algorithm, we only download new triples for a single triple pattern. This causes the local database, as well as the intermediate query results, to change only slightly. While related work on such specialized caching exists [8], we can cache even more efficiently since we have substantially more information about the queries that will be executed on the store. We introduce helper functions for our local triple store algorithm, explained in depth in the next paragraphs.

  • \(C\): the set of cache entries (described below).

  • \(B\): the set of bindings; a binding maps one or more variables \(v \in V\) to a value.

  • \(\mathrm {cache}(P')\): \(2^P \rightarrow C\), the cache entry corresponding to the given patterns, or an empty entry if not used before.

  • \(\mathrm {patterns}(c)\): \(C \rightarrow 2^P\), the patterns in the given cache entry; the inverse of the cache function.

  • \(\mathrm {bindings}(c)\): \(C \rightarrow 2^B\), the bindings stored in the cache entry.

  • \(\mathrm {tripleCounts}(c)\): \(C \rightarrow (P\rightarrow \mathbb {N})\), a function that returns the value of \(|\mathrm {triples}(t)|\) at the time the cache entry was last updated.

  • \(\mathrm {binding_t}(g)\): \(G \rightarrow B\), transforms a triple \(g\) into a binding based on the given pattern \(t \in P\).

  • \(\mathrm {ids}(b)\): \(B \rightarrow (P\rightarrow \mathbb {N})\), the indices stored for the given binding.

  • \(\mathrm {join}(B', B'')\): \((B \times B) \rightarrow B\), joins the two given sets of bindings.

Our triple store consists of two data components: the cache entries (\(C\)) and the ground triples (\(G\)). For every pattern, we store the triples in the order they were downloaded, which means we can associate an index with each of them. We will use these indices (\(\mathrm {ids}(b)\)) to determine which results can be reused. The cache entries represent these intermediate results. Whenever we calculate a set of bindings \(B'\) for a set of patterns \(P' \subseteq P\), we store them in the cache entry \(\mathrm {cache}(P')\). Besides the bindings, the cache entry also stores \(|\mathrm {triples}(t)|\) for every \(t \in P'\) (\(\mathrm {tripleCounts}(c)\)). This allows us to identify which triples have been downloaded since the last time this cache entry was used. When we generate a binding from a triple (\(\mathrm {binding_t}(g)\)), that binding also includes which pattern the triple belongs to and what its index is for that pattern. If we join bindings (\(\mathrm {join}(B', B'')\)), the indices are also joined.
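A typed sketch of these data components (the concrete layout is our assumption; the names mirror the helper functions above):

```typescript
// Local triple store: ground triples per pattern plus cached intermediate results.
type PatternId = string;                   // identifies a triple pattern t ∈ P
type Triple = [string, string, string];

interface Binding {
  values: Map<string, string>;             // variable -> bound value
  // ids(b): for every contributing pattern, the index (in download order) of the
  // triple this binding was derived from; joining bindings merges these maps.
  ids: Map<PatternId, number>;
}

interface CacheEntry {
  patterns: Set<PatternId>;                // patterns(c): the subquery this entry caches
  bindings: Binding[];                     // bindings(c): intermediate results
  tripleCounts: Map<PatternId, number>;    // |triples(t)| at the last update, to spot new triples
}

interface LocalStore {
  triples: Map<PatternId, Triple[]>;       // ground triples G, stored per pattern in download order
  cache: Map<string, CacheEntry>;          // cache(P'): one entry per set of patterns queried so far
}
```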

[Algorithm 4: executing a (sub)query on the local triple store]

Algorithm 4 describes the process of executing a query on our local triple store. For clarity, we have used less strict notions of lists and sets, preferring legibility over mathematical rigor. When performing the query, we try to maximize the amount of data we reuse. We also try to order the uncached patterns to minimize the size of the join operations. During the join process we split the triples for the current pattern into two sets \(G_{ old }\) and \(G_{ new }\). \(G_{ old }\) represents the triples that were already used in a previous iteration to create bindings for the current cache entry. Similarly, we split the bindings from the previous cache entry \(c_{ prev }\) into \(B'_{used}\) and \(B'_{ unused }\). \(B'_{ unused }\) contains the bindings that were not used to create the bindings in the current cache entry because they did not exist at the time. To do a full join between our results so far and the triples of the current pattern, we would have to calculate \((G_{ old } \cup G_{ new }) \times (B'_{ used } \cup B'_{ unused })\). Since we store old results in our cache entries, we already have a part of this join: \(G_{ old } \times B'_{ used }\) corresponds to the results stored in the cache entry (\(=\mathrm {bindings}(c)\)). If we also calculate \(G_{ old } \times B'_{ unused }\) (\(=B_{ old }\)) and \(G_{ new } \times (B'_{ used } \cup B'_{ unused })\) (\(=B_{ new }\)) we have a full join of these two sets, while limiting the number of joins that need to be executed.
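Written out explicitly, the decomposition that Algorithm 4 exploits is (using the same \(\times \) notation for the join as above):

$$\begin{aligned} (G_{ old } \cup G_{ new }) \times (B'_{ used } \cup B'_{ unused }) ={}&\underbrace{(G_{ old } \times B'_{ used })}_{=\,\mathrm {bindings}(c)\text {, already cached}} \cup \underbrace{(G_{ old } \times B'_{ unused })}_{=\,B_{ old }}\\&\cup \underbrace{(G_{ new } \times (B'_{ used } \cup B'_{ unused }))}_{=\,B_{ new }} \end{aligned}$$

so only the last two terms need to be computed in the current iteration.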

6 Evaluation

To evaluate our implementation, we executed a set of sparql queries using both the original greedy implementation [14] and our proposed algorithm. Since our goal was to reduce execution time by minimizing the number of http requests, we measured both execution time and the number of http requests per query. We also measured the time and number of requests until the first result. To precisely control network latency, the server and client ran on the same machine (Intel Core i5-3230M cpu at 2.60 GHz with 8 GB of ram). To simulate the time it might take a server to respond to a client over the internet, we introduced an artificial delay of 100 ms on the server. We used a query timeout of 5 min and noted how many results were found (and http requests performed) up to that point.

Fig. 2. Results of the WatDiv queries, grouped by the number of triple patterns in the bgp: (a) http requests until the first result, (b) total http requests, (c) total execution time, and (d) number of results found.

The WatDiv benchmark was designed to stress test and compare multiple sparql query algorithms using only bgp queries [1, 2]. This makes WatDiv perfectly suited for our evaluation, since the two algorithms focus on bgp queries. We used a set of \(\pm \)1,000 queries that were generated for the WatDiv stress test [1] against the WatDiv dataset of 10 million triples. We clustered the queries based on the number of triple patterns in the query, ranging from 1 to 12. The median results can be seen in Fig. 2.

Figure 2a shows how many http requests were executed before a first result was found. In most cases, the optimizations aimed at quickly finding results indeed reduce the number of http requests needed for the first result. Figure 2b shows the number of http requests executed during the entire query. This is the most important graph, since reducing requests was the main focus of our work, and the impact of our optimizations on the number of requests is clearly visible. Because of its higher processing time, our algorithm has a lower call count when both algorithms exceed the timeout, which is mostly the case for queries with a higher pattern count. Figures 2c and d show the total execution time and the number of results found, respectively. Note that both algorithms guarantee a complete result set; observed differences are entirely due to the timeout of 5 min. Combining these figures, it becomes clear that our algorithm performs better in the majority of cases. For the queries with 12 patterns, the original algorithm has a median of 0 results because it often timed out before even finding its first result.

Both our evaluation code and our complete evaluation result logs can be found online to repeat the tests and interpret the results.

7 Conclusion

In this paper we introduced an optimized way to query low-cost servers of triple pattern fragments. We designed and implemented an algorithm that uses all metadata present in tpfs to find a solution for queries in a minimal number of http requests. The workload on the client increases, but this is compensated by the reduced number of http requests. Especially in environments with an elevated server response time or network latency, this is a major improvement. It also allows the execution of queries that used to take an excessive amount of time. Besides improving query execution in general, we also reduced the effort required until a first result is returned. This can be useful for streaming applications: the faster a result is found, the faster the remainder of the pipeline can continue.

In the future, we also want to make a more extensive comparison of our methods with those already existing for generic sparql and sql query optimization.

An obvious possible improvement is parallelism. Multiple parts of the algorithm can be done in parallel. For example, we can download triples for multiple patterns at the same time instead of just a single pattern at a time. The multiple queries we execute on our local database can also be executed in parallel, although care has to be taken when accessing the cache entries. Besides that, the algorithm can still be improved in multiple ways: the local triple store can generate better join trees or have even better caching, the prediction of which pattern to download from can be improved, etc.

A remaining optimization is to detect those cases where a greedy algorithm would provide more results faster (at the cost of more http requests). Furthermore, our algorithm mainly focuses on bgp queries. Other query constructs are supported, but not optimized. While bgps are the most essential part of a query, our algorithm could be extended in the future by taking the other components into account. For example, limits could be incorporated in the estimations of the total http requests still needed, and pattern-specific filters could be processed early on. Although we have not arrived at a complete tpf solution yet, the algorithm introduced here drastically increases the scope of efficiently supported queries.