
1 Introduction

dbpedia [2] is currently the most well-known dataset within the Semantic Web community. It consists of hundreds of millions of rdf triples automatically extracted from the free encyclopedia Wikipedia. Such large Linked Datasets come with important challenges – most prominently: how do we provide scalable, queryable access to them? The traditional answer has been to set up a public sparql endpoint [4], but such endpoints suffer from low availability rates [3]. Yet reliable access is a prerequisite for building applications on top of a queryable dbpedia interface.

In mid-October 2014, the dbpedia community opened a Triple Pattern Fragments interface (see footnote 1) [14], maintained by the author of this paper. This interface is designed to allow high availability on the server side, while still enabling live querying on the client side. Queries take more time and bandwidth because they are largely executed by the client, but response times are consistent, so that building applications on top of a public dbpedia interface becomes realistic.

In this paper, we discuss the analysis of the first 9 full months of usage data of the English dbpedia Triple Pattern Fragments interface, as well as availability statistics measured by an independent party.

The remainder of this paper is structured as follows. First, we discuss interfaces to Linked Data in Sect. 2. We then discuss the hardware and software setup of the server and analysis in Sect. 3. Next, Sect. 4 formulates and answers usage questions with log data. Finally, we conclude in Sect. 5.

2 Related Work

In this section, we summarize existing Web apis to publish Linked Datasets. Linked Data Fragments (ldf, [14, 16]) were introduced as a uniform view to capture the characteristics of any Linked Data Web api. The common aspect of all interfaces is that, in one way or another, they offer specific parts of a dataset. Each part is referred to as a Linked Data Fragment, consisting of:

  • data the triples of the dataset that match an interface-specific selector;

  • metadata triples to describe the fragment itself;

  • controls hyperlinks and/or hypermedia forms that lead to other fragments.

File-Based Datasets. So-called data dumps are conceptually the simplest apis: the data consists of all triples in the dataset. These triples are combined into a (usually compressed) archive and published at a single url. Sometimes the archive contains metadata, but controls, with the possible exception of http uris inside the rdf triples themselves, are not present.

Linked Data Documents. Datasets published through the Linked Data principles [1] are available as individual documents per subject, which can be retrieved by performing an http GET request on the subject’s url (“dereferencing”). Each such document is a fragment, in which the data consists of triples related to that subject, the metadata set might contain properties such as author and publication data, and the controls consist of links to other Linked Data documents. Querying is possible through strategies such as link traversal [7].

SPARQL Endpoints. sparql endpoints [4] allow executing sparql queries [6] on a dataset over http. A sparql fragment’s data consists of triples matching the query (assuming the CONSTRUCT form); the metadata and control sets are empty. Query execution is performed entirely by the server, and because each client can send highly individualized requests, the reusability of fragments is low. This, combined with the complexity of sparql query execution, likely contributes to the low availability of public sparql endpoints [3].

Triple Pattern Fragments. The Triple Pattern Fragments interface has been designed to minimize server-side processing, while at the same time enabling efficient live querying on the client side [11, 14]. A fragment’s data consists of all triples that match a specific triple pattern, and can be divided into pages. Each fragment page mentions the estimated total number of matches as metadata, and contains hypermedia controls to find all other Triple Pattern Fragments of the same dataset. Since requests are less individualized, fragments are more likely to be reused across clients, which increases the benefits of caching [14].
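To make the interface concrete, the following minimal sketch requests a single Triple Pattern Fragment over http. The endpoint url and the form field names (subject, predicate, object) are assumptions about a typical deployment; a real client should discover the correct fields from the hypermedia controls of a fragment it has already retrieved, rather than hard-coding them.

```python
# Minimal sketch of requesting one Triple Pattern Fragment over HTTP.
# The endpoint URL and the form field names below are assumptions about a
# typical deployment; clients should discover them from the hypermedia
# controls instead of hard-coding them.
import urllib.parse
import urllib.request

ENDPOINT = "http://fragments.dbpedia.org/2014/en"  # assumed entry point

def fetch_fragment(subject="", predicate="", obj=""):
    """Request the first page of the fragment matching a triple pattern."""
    query = urllib.parse.urlencode({
        "subject": subject,
        "predicate": predicate,
        "object": obj,
    })
    request = urllib.request.Request(
        ENDPOINT + "?" + query,
        headers={"Accept": "text/turtle"},  # ask for a triple-based format
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")

# Example: the fragment for ?s rdf:type ?o.
if __name__ == "__main__":
    print(fetch_fragment(
        predicate="http://www.w3.org/1999/02/22-rdf-syntax-ns#type")[:500])
```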

3 Deployment and Analysis Setup

3.1 Server Specifications

Hardware. The official dbpedia Triple Pattern Fragments interface is hosted on a virtual machine from the Amazon Elastic Compute Cloud (ec2). We opted for a c3.2xlarge machine configuration, which has the following characteristics:

[Figure a: specifications of the c3.2xlarge machine configuration]

We would like to stress that the above specifications are actually too high for our purpose; as a result, the server is currently mostly idle. The issue is, however, that Amazon does not allow customization of machines. While lighter configurations exist, they come with lower disk throughput and/or bandwidth.

Software. The machine has been configured with the following open-source software packages and versions:

  • operating system Ubuntu Linux 14.04 lts

  • Web server nginx 1.4.6

  • application server ldf server (see footnote 2) 1.1.4 on top of Node.js 0.10.36

The nginx server acts as a reverse proxy and cache. All requests first reach nginx, which checks whether a response is present in the cache, based on a unique identifier consisting of the request uri and the value of the http Accept header. If it is, the cached response is sent to the client; if not, the request is forwarded to the application server. The application server then parses the request and retrieves the dbpedia data from an hdt file [5] that is loaded into memory. The response is serialized into a format matching the Accept header, sent to the client, and stored in the cache.
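The sketch below models this caching behaviour in a deliberately simplified way; all names are hypothetical and nothing here corresponds to actual nginx configuration. It only illustrates that the cache key combines the request uri with the Accept header, so each distinct representation is generated at most once per cache lifetime.

```python
# Simplified model of the caching behaviour described above: responses are
# keyed on (request URI, Accept header), so the application server has to
# generate each distinct representation only once per cache lifetime.
# All names are hypothetical; the real cache is nginx's built-in proxy cache.

cache = {}  # (uri, accept) -> serialized response

def handle_request(uri, accept, generate_response):
    """Serve from cache when possible; otherwise forward to the application
    server (represented here by generate_response) and store the result."""
    key = (uri, accept)
    if key not in cache:
        cache[key] = generate_response(uri, accept)  # HDT lookup + serialization
    return cache[key]
```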

3.2 Analysis Setup

All incoming requests are logged line by line in a file by the nginx Web server. Note that logging does not happen on the application server, as this server only receives those requests that are not handled by the cache. The resulting access logs are hosted publicly (see footnote 3). Each log line contains the following fields (a parsing sketch follows the list):

  1. the client’s ip address;

  2. the uri requested by the client;

  3. the value of the client’s Accept header;

  4. the value of the client’s Referer header;

  5. the value of the client’s User-Agent header;

  6. the server’s local time;

  7. the server’s response size;

  8. the server’s response cache status;

  9. the server’s response http status code.
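As a sketch of how one such log line can be turned into these nine fields, the snippet below uses a regular expression. The exact nginx log_format (field order, quoting, delimiters) is an assumption and would have to be aligned with the server configuration.

```python
# Sketch of parsing one access-log line into the fields listed above.
# The exact nginx log format (field order, quoting, delimiter) is an
# assumption; adjust the regular expression to the configured log_format.
import re

LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) "(?P<uri>[^"]*)" "(?P<accept>[^"]*)" "(?P<referer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)" \[(?P<time>[^\]]+)\] (?P<size>\d+) '
    r'(?P<cache_status>\S+) (?P<status>\d{3})$'
)

def parse_log_line(line):
    """Return a dict with the nine logged fields, or None if the line
    does not match the assumed format."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None
```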

Additionally, the availability of the http interface is monitored by the independent third-party service Pingdom, because public availability—by definition—cannot reliably be monitored by the Web server under consideration. Pingdom performs an http request once every minute for the ?s rdf:type ?o fragment and notes whether a response was successfully received. If no timely response arrives, the interface is assumed to be unavailable.
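The following sketch mimics such a minute-by-minute probe. The fragment url is an assumption (see the earlier note on parameter names), and the actual monitoring is of course performed by the external Pingdom service rather than by a script like this.

```python
# Sketch of an availability probe similar in spirit to the Pingdom check:
# request the ?s rdf:type ?o fragment once per minute and record whether a
# timely response arrives. The fragment URL is an assumption.
import time
import urllib.request

FRAGMENT_URL = ("http://fragments.dbpedia.org/2014/en"
                "?predicate=http%3A%2F%2Fwww.w3.org%2F1999%2F02%2F"
                "22-rdf-syntax-ns%23type")

def probe_once(timeout_seconds=30):
    """Return True if the fragment responds with HTTP 200 in time."""
    try:
        with urllib.request.urlopen(FRAGMENT_URL,
                                    timeout=timeout_seconds) as response:
            return response.status == 200
    except Exception:
        return False

def monitor(interval_seconds=60):
    while True:
        available = probe_once()
        print(time.strftime("%Y-%m-%dT%H:%M:%S"),
              "up" if available else "DOWN")
        time.sleep(interval_seconds)
```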

4 Usage Analysis

In this section, we will search for answers to the following usage questions:

  1. How many requests were issued?

  2. Which clients made these requests?

  3. What types of content were those clients interested in?

  4. Where did the requests originate from?

  5. What kinds of triple patterns were requested?

  6. How effective has the cache been?

  7. What fraction of the time was the server (un)available?

We focus on requests that received an http 200 OK response only, in order to remove the (very minimal) noise caused by the 0.19 % of invalid requests against the interface.

4.1 Number of Requests

The server logs reveal a total of 16,776,170 requests for Triple Pattern Fragments of the English dbpedia version (see footnote 4), or an average of 1,864,019 requests per month during the 9 considered months (November 2014–July 2015). Large outliers skew this average, so the median of 486,045 requests (obtained in March) is perhaps a more meaningful number. Figure 1 visualizes the number of requests per month.

Fig. 1. April had an exceptionally high consumption of fragments, followed by November. The median of 486,045 requests per month was obtained in March.

April 2015 showed a strong traffic peak with over 10 million requests. While we have no concrete evidence, we suspect a connection with the submission deadline of the International Semantic Web Conference 2015: main track papers were due April 30, 2015, so perhaps one or more research groups were running experiments against the interface.

4.2 User Agents

Figure 2 shows the proportion of user agents per requested fragment. The majority of requests were issued by clients that identified themselves as Triple Pattern Fragments (tpf) clients. This is the identification sent by the JavaScript client-side library (see footnote 5), which can be used either as a standalone command-line utility or as a library inside other applications (which thus cannot be distinguished by default). If the library runs inside a browser application, the user agent is that of the browser, so such usage is counted in that category.
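A sketch of how such a user-agent categorization could be implemented is shown below. The matching rules and category labels are illustrative assumptions rather than the exact rules used to produce Fig. 2.

```python
# Sketch of a user-agent categorization in the spirit of Fig. 2.
# The category names and matching rules are illustrative assumptions,
# not the exact rules used in the analysis.
def classify_user_agent(user_agent):
    ua = (user_agent or "").lower()
    if "triple pattern fragments" in ua or "tpf" in ua:
        return "tpf client"
    if any(bot in ua for bot in
           ("googlebot", "baiduspider", "bingbot", "crawler", "spider")):
        return "crawler"
    if any(browser in ua for browser in
           ("mozilla", "chrome", "safari", "opera")):
        return "browser"
    return "unknown"
```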

Fig. 2. The Triple Pattern Fragments client is by far the most common client, followed by an unknown client, crawlers, and finally browsers. Other categories (command-line utilities, other languages, ...) are only marginally represented.

The second-most popular client identifies itself with “User-Agent: 2.7” and issued approximately 3 million requests, all of which originated from Germany. This appears to be an error, as “2.7” does not follow any conventional format for user agent strings. We could not determine the identity of this client more precisely. There might be a connection with Python 2.7 (which is a common version), but other Python clients identified themselves in a more conventional way (e.g., “python-requests/2.7.0”).

Crawlers requested a large portion of dbpedia fragments. This is especially remarkable because such usage contrasts with sparql endpoints, which belong to the so-called “deep Web”: in order to access data, a user must write a sparql query in an html form. The only sparql endpoint resources that are accessible on the Web are sparql queries that are explicitly linked from another page. While the Triple Pattern Fragments specification only demands the presence of a hypermedia form (which would thus also hide fragments in the deep Web), the server implementation we use explicitly links to relevant fragments. For instance, the fragment “subjects born in Slovenia” links to fragments for the birthplace predicate and for all individual subjects born in Slovenia. This allows people and crawlers to browse the interface similarly to how Linked Data documents are navigated. An added value of Triple Pattern Fragments is that all resources can be followed within the interface, not only those resources that share the uri space of the current document (as is the case with Linked Data documents [16]).

Finally, browser consumption accounted for almost a million requests, the majority of which (840,006) appear to have been performed by client-side scripts. This latter number was determined by looking for non-human-targeted content types such as Turtle or json, which we discuss in more detail in the next section.

4.3 Requested Content Types

Figure 3 shows the preferred content type indicated in the Accept header of the request. Given that the tpf client made most of the requests (Fig. 2), this client’s preferred content type TriG is also the most popular. Previous versions of the tpf client consumed Turtle, which partly accounts for the popularity of this format in Fig. 3. The drawback of triple-based formats such as Turtle is that they do not allow a clean separation of data and metadata, while quad-based formats like TriG do [14]. In a previous analysis, we predicted the increased popularity of TriG because of this changed client preference [15]. The popularity of json(-ld) might come as a surprise for the Linked Data community, but perhaps less so for JavaScript developers, who employ json as a native format.

Fig. 3. The quad-based TriG and json-ld formats were the most popular, followed by the triple-based Turtle format. Over a million requests did not express a preference.

Only a minority of clients did not indicate a specific preference, either by explicitly accepting */* or by not sending an Accept header at all. These clients received an html representation from the server.
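The sketch below shows how a preferred content type could be derived from the Accept header for a tally such as the one in Fig. 3. It is a deliberate simplification (it ignores q-values and ordering), and the media-type list is an assumption.

```python
# Sketch of mapping an Accept header to a preferred content type, as an
# illustrative simplification of the tally behind Fig. 3 (q-values and
# ordering are ignored on purpose).
def preferred_content_type(accept_header):
    accept = (accept_header or "").lower()
    for media_type, label in (
        ("application/trig", "TriG"),
        ("application/ld+json", "JSON-LD"),
        ("application/json", "JSON"),
        ("text/turtle", "Turtle"),
        ("text/html", "HTML"),
    ):
        if media_type in accept:
            return label
    return "no preference"  # "*/*" or no Accept header: server sends HTML
```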

4.4 Geographic Location

The majority of requests originated from Germany, as visualized in Fig. 4. In second place is Belgium; most Belgian requests were generated by our team at Ghent University – iMinds, which coordinates the dbpedia Triple Pattern Fragments server. Requests from the United States and China were largely made by the Google and Baidu crawlers, respectively.

Fig. 4. Germany requested the most fragments, followed by Belgium. Automated crawlers strongly influence consumption from the United States (Google) and China (Baidu).

In total, requests arrived from 80 different countries. While determining more detailed locations is possible (e.g., city level), we decided not to do so out of privacy concerns.

4.5 Requested Triple Patterns

sparql endpoints typically receive highly specific queries that provide insight into a concrete question. If Triple Pattern Fragments interfaces are used to evaluate sparql queries, only triple patterns arrive at the server. Such individual patterns provide only a limited picture of what clients were doing; this picture could be enhanced by analyzing series of requests made by particular clients. However, there is no guarantee that clients are actually executing sparql queries (except perhaps if the user agent is “TPF Client”), since the interface can be used in many different ways. For example, crawlers do not evaluate complex queries.

Figure 5 shows the four most requested kinds of patterns (a classification sketch follows the list):

  1. “Subjects of a specific type” fragments (?s rdf:type <o> for a specific <o>) are the most common. We expect such patterns to occur frequently in sparql queries.

  2. The pattern “subjects that have a type” (?s rdf:type ?o) is highly popular. The main reason is that the Pingdom service (see Sect. 4.7) requests this fragment to measure availability.

  3. The “all” fragment (?s ?p ?o), which is the most generic fragment of the dataset and thus a natural starting point, was requested 171,746 times. Query evaluations typically start from this fragment (but this is by no means an obligation), so this number might be a vague indication of the number of executed sparql queries.

  4. “Classes of a specific subject” fragments (<s> rdf:type ?o for a specific <s>) were also significantly represented in the total.
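The classification sketch referenced above groups fragment requests by which triple pattern components are bound. The query parameter names are, again, assumptions about the deployment.

```python
# Sketch of classifying a requested fragment URI into the pattern kinds of
# Fig. 5, based on which triple pattern components are bound. The query
# parameter names (subject, predicate, object) are assumptions.
from urllib.parse import urlparse, parse_qs

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def classify_pattern(fragment_uri):
    params = parse_qs(urlparse(fragment_uri).query)
    s = params.get("subject", [""])[0]
    p = params.get("predicate", [""])[0]
    o = params.get("object", [""])[0]
    if not s and not p and not o:
        return "?s ?p ?o (all)"
    if p == RDF_TYPE:
        if not s and o:
            return "?s rdf:type <o> (subjects of a specific type)"
        if not s and not o:
            return "?s rdf:type ?o (subjects that have a type)"
        if s and not o:
            return "<s> rdf:type ?o (classes of a specific subject)"
    return "other pattern"
```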

Fig. 5. Type and class selections were popular, as well as the generic “all” fragment.

The information these patterns yield is quite limited. While this is positive for privacy on the one hand, it restricts the possibilities for analysis on the other hand. In Sect. 5, we review possible strategies to circumvent this obstacle.

4.6 Cache Effectiveness

A premise of the Triple Pattern Fragments interface is that clients partly reuse the same fragments to achieve different but similar goals. With sparql endpoints, clients instead send highly specialized requests; overlapping information between them cannot be reused at the http interface level. With Triple Pattern Fragments, the number of unique requests is comparatively small, so regular http caches function more effectively.

The nginx reverse proxy server has been configured to cache requested fragments for a maximum of one hour. Uniqueness of requests is determined by the combination of url and Accept header. As such, the Triple Pattern Fragments server generates each unique response at most once per hour; all subsequent requests are handled by the cache. Furthermore, the proxy server sets the expiration date of responses to 7 days in the future. Clients with a built-in cache of their own, such as browsers, are thereby instructed to re-request a resource only after a week. Note that the standalone tpf client does not have a persistent cache; each invocation of that client results in new resource accesses.

Fig. 6. 59.5 % of requested fragments were already present in the cache: 16.5 % had expired, but 43.0 % could be reused directly. The remaining 40.5 % were not cached.

As Fig. 6 shows, 43.0 % of responses were served directly from the nginx cache; another 16.5 % were present in the cache but had been there longer than the expiration time. In other words, roughly two fifths of all responses were needed again by the same client or by other clients within the hour. Finally, a few requests (3,054) explicitly asked to bypass the cache. So while 57.0 % of requests were not served by the cache, the caching mechanism was still able to reduce the load on the application server by 43.0 %. Since the dataset in this case is static, and the number of fragments finite, we could set a higher (or even infinite) cache timeout. So far, however, there has been no need to do so.
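A sketch of how this breakdown can be computed from the parsed log entries (see the parsing sketch in Sect. 3.2) is shown below. The cache status labels follow nginx’s conventions (HIT, EXPIRED, MISS, BYPASS), but the exact values present in the logs are an assumption.

```python
# Sketch of computing the cache-effectiveness breakdown of Fig. 6 from the
# parsed log entries. The cache status labels (HIT, EXPIRED, MISS, BYPASS)
# follow nginx conventions; the exact values in the logs are an assumption.
from collections import Counter

def cache_breakdown(log_entries):
    """Return the share of each cache status among HTTP 200 responses."""
    statuses = Counter(entry["cache_status"]
                       for entry in log_entries
                       if entry and entry["status"] == "200")
    total = sum(statuses.values())
    if not total:
        return {}
    return {status: count / total for status, count in statuses.items()}
```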

4.7 Availability

One of the main goals of the Triple Pattern Fragments interface is to maximize availability, in order to allow building applications on public, live queryable Linked Data sources. During the period of November 2014 to July 2015, a fragment was retrieved from the server every minute to verify availability. This amounts to a total of \(273 \times 1{,}440 = 393{,}120\) monitored minutes (273 days at 1,440 min per day). The results are available in an online interface (see footnote 6) and summarized in Fig. 7.

Fig. 7. Most months had 100 % uptime. The total downtime was limited to 13 min over the 9 full months the interface has been available so far.

The results reveal that the interface had 100 % uptime in 7 months out of 9. There were three separate downtime incidents during which ping requests did not result in a timely response: one in November (1 min) and two in July (1 min and 11 min). In all of these cases, the Pingdom requests did not even reach the nginx server, as evidenced by a lack of log entries during those periods. There is no indication that the application server was unavailable or overloaded. The cause of the incidents could unfortunately not be determined.

In any case, these observations allow a precise calculation of the availability during the observed period of 9 months. Dividing the minutes of availability by the total number of minutes gives \(393{,}107 / 393{,}120 \approx 99.9967\,\%\). This corresponds to an availability level of “4 nines”, which allows an average of at most 4 min of downtime per month; this bound was met in every month except July. The next availability level (“5 nines”) would restrict downtime to 26 seconds per month, a bound met in every month except November and July.
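For reference, the following lines reproduce the calculation and the per-month downtime budgets mentioned above, assuming an average month of 30 days.

```python
# Worked check of the figures above: overall availability and the monthly
# downtime budgets for "4 nines" and "5 nines" (assuming 30-day months).
total_minutes = 273 * 1440                      # 393,120 monitored minutes
downtime_minutes = 1 + 1 + 11                   # three incidents, 13 min in total
availability = (total_minutes - downtime_minutes) / total_minutes
print(f"{availability:.4%}")                    # ~99.9967 %

minutes_per_month = 30 * 1440
print(minutes_per_month * (1 - 0.9999))         # 4 nines: ~4.3 min per month
print(minutes_per_month * (1 - 0.99999) * 60)   # 5 nines: ~26 s per month
```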

5 Conclusions

When the official Triple Pattern Fragments interface for dbpedia was released, we mostly heard three types of questions:

  • Will this interface be used?

  • If so, how will clients use it?

  • Will the availability of this interface be sufficient for live application usage?

The analysis in this paper allows us to formulate a preliminary answer to all three of them.

First of all, the interface has indeed been used, as evidenced by more than 16 million requests in the course of its first 9 months. Most of this usage came from the client-side sparql query executor we built for the Triple Pattern Fragments interface, which can serve as a library for many types of applications. Search engine crawlers also consumed many fragments of the interface. Relatively few people browsed the interface directly, as it is primarily targeted at machines; this raises the question of whether it makes sense to improve its accessibility for people. Client ip addresses from 80 countries (as opposed to 47 in a previous analysis [15]) show that usage is spreading geographically.

Second, while the analysis provides us with some insights into how the interface is used, more high-level patterns are absent. On the one hand, this is a blessing for privacy: clients only ask generic questions, and they can combine the responses into answers to more complex questions in any way they see fit. On the other hand, it makes it harder to understand what kind of usage is popular, and for which use cases we could or might need to optimize. This process could be facilitated if we explicitly ask clients to provide feedback [13]. For now, we are in the dark as to precisely what sparql queries (and other tasks) clients have executed. Having more information would allow us to compare this with, for instance, the logs of the public dbpedia sparql endpoint. At the same time, we should realize that not all clients of Triple Pattern Fragments interfaces necessarily have the evaluation of sparql queries as a task or subtask.

Third, the average availability of 99.99 % of the server (100 % in 7 months out of 9) removes any doubt that the Triple Pattern Fragments interface is sufficiently reliable for live applications. We must, however, make two remarks here. While 16.8 million requests is a large quantity for a young interface, it is still nowhere near full capacity: the server is still mostly idling, so in order to really find its limits, more requests are necessary. Also, the number of requests cannot be compared to that of a sparql endpoint, as in many cases more requests are necessary to achieve the same goal. When talking about availability, we therefore need to mention expressivity too; the goal of the Triple Pattern Fragments interface is to reliably balance both.

In the future, we should consider experimenting with more expressive interfaces. For instance, we could provide extra functionality such as substring search [10], which enables faster evaluation of sparql queries with certain FILTER clauses. Such functionality could also drive specific tools and applications, such as autocompletion widgets or linking and reconciliation tasks [8]. To improve query performance, the incorporation of additional metadata in responses can also be beneficial [12].

Our conclusion is that applications now have a reliable interface to query the public dbpedia dataset. Therefore, we seem to have overcome one of the main obstacles that could keep developers from building applications on top of live Linked Data. An important question remains: is this enough? Now that reliable access is possible, what excuses remain for not building intelligent Linked Data clients? It seems the next move should be made by application developers, given that the data and the tools are now really there, 99.99 % of the time. We should keep our eyes, ears, and minds open to the demands of this community to help evolve the concept of Semantic Web applications from vision to reality. Furthermore, the dbpedia use case can act as an inspiration for others who want to publish queryable Linked Data at low cost – or even for free [9].