1 Introduction

As the amount of information on the Web and within information systems grows, efficient and effective querying, exploration, and retrieval approaches become more important. For information available as plain text, Information Retrieval is a long-established research field; a vast number of mature commercial and open implementations, such as Apache Solr, now drive large-scale applications. In the area of the Semantic Web, a number of approaches, techniques, and platforms have also been developed (e.g., [12]) that unify search across unstructured data (Web documents) and structured data (RDF). However, for many applications, heterogeneous information represented in different modalities (structured, semi-structured, or unstructured) and spread across distributed data sources has to be made searchable and explorable for end users in an integrated way. Further, these distributed data sources are accessible through a variety of interfaces. In addition to plain documents on the Web, there exist the Deep Web, whose content is generated from databases and is often accessible through APIs (e.g., for social networks or e-commerce platforms), and the Dark Web, whose content is not even openly accessible. Indexing information from all these sources is generally not feasible; moreover, in many situations, it is not permitted by the terms of use of certain services or by data protection and privacy laws.

FuhSen is a federated semantic hybrid engine that relies on semantics encoded in a knowledge graph to integrate data collected from a federation of heterogeneous sources, which may be unstructured, semi-structured, or structured. To this end, sources are wrapped, and answers to keyword queries are represented using the Resource Description Framework (RDF). Each data source answer is modelled as a set of RDF triples that share the same subject resource, i.e., an RDF molecule [4, 5]. FuhSen utilizes semantic similarity measures to determine the relatedness between two resources in terms of the relatedness of their RDF molecules. Highly similar RDF molecules are aggregated into an integrated RDF molecule that corresponds to the answer of a keyword-based query; further, integrated molecules are included as part of the knowledge graph. This semantic aggregation of the answers returned by the sources allows for a more meaningful integration of the collected data and constitutes the main difference between FuhSen and triple-based integration [10, 13]. Thus, the main contributions of this paper are: (a) A data integration system named FuhSen that is able to provide a unified view of heterogeneous search engines. (b) An RDF-Molecule integration approach that utilizes semantic similarity measures to integrate pieces of information about the same entity in different data sources.

Motivating Example: We briefly describe a distributed and heterogeneous search application scenario in the context of crime investigation. During a crime investigation process, collecting and analysing information from different sources is a key step performed by investigators. Although scene analysis is always required, a crime investigation process can greatly benefit from searching information about people, products, and organisations on the Web. Commonly, data collected from the following data sources are utilised for enhancing crime analysis processes: (1) The Social Web encompasses user-generated content and personal profiles. (2) The Deep Web advertises products and services offered by organisations, e.g., the eBay e-commerce platform. (3) The Web of Data includes billions of machine-comprehensible facts, which can serve as background knowledge for collecting information about different types of entities. (4) The Dark Web comprises sites accessible only with specific software, including the so-called dark-net markets through which restricted goods are traded.

Fig. 1. Pieces of information (RDF Molecules) about Joaquín “El Chapo” Guzmán spread over different social networks on the Web.

Figure 1 illustrates data about a suspected drug dealer (Joaquín “El Chapo” Guzmán) collected from different Web data sources. Although all the social networks share the profile name, the alias of the suspect is found in Twitter, while his birthplace and location are from Google+ and Facebook, respectively. Currently, experts perform this data integration manually, which negatively affects the investigation: the process is extremely cumbersome and time-consuming because it requires accessing a large number of different data sources and manually integrating individual search results. FuhSen exploits REST APIs provided by Web data sources to search, create, and aggregate molecules of data, and then to enrich and summarize information about an entity (e.g., a suspect). Using Linked Data as the core technology, the FuhSen engine is able to: (1) integrate on demand heterogeneous data extracted from APIs into a unified data schema using the OntoFuhSen vocabulary, (2) create a knowledge graph on demand with the data extracted from the different data sources, and (3) enrich this knowledge graph using algorithms for entity disambiguation, typing, entity summarization, and ranking.

Preliminaries: FuhSen creates a knowledge graph on-demand when a keyword query is given as input. A knowledge graph is composed of a set of concepts, their properties, and relations among these concepts. To properly understand these concepts, we follow the notation from Arenas et al. [1], Piro [9], and Fernandez et al. [5], to define RDF triples, knowledge graphs, and RDF molecules.

Definition 1

(RDF triple [1]). Let \(\mathbf {I}\), \(\mathbf {B}\), \(\mathbf {L}\) be disjoint infinite sets of URIs, blank nodes, and literals, respectively. A tuple \((s, p, o) \in (\mathbf {I} \cup \mathbf {B}) \times \mathbf {I} \times (\mathbf {I} \cup \mathbf {B} \cup \mathbf {L})\) is denominated an RDF triple, where s is called the subject, p the predicate, and o the object.

Definition 2

(Knowledge Graph [9]). Given a set T of RDF triples, a knowledge graph is a pair \(G=(V, E)\), where \(V = \{s \mid (s, p, o) \in T\} \cup \{o \mid (s, p, o) \in T\}\) and \(E=\{(s, p, o) \in T\}\).

Definition 3

(RDF Subject Molecule [5]). Given an RDF graph G, an RDF subject-molecule \(M \subseteq G\) is a set of triples \(\{t_1, t_2, \dots , t_n\}\) in which \( subject (t_1) = subject (t_2) = \dots = subject (t_n)\).
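Definitions 2 and 3 can be illustrated with a minimal sketch. It assumes triples are plain string tuples; the `ex:` identifiers and the sample values are hypothetical and are not taken from the FuhSen implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# A triple is (subject, predicate, object); plain strings keep the sketch simple.
Triple = Tuple[str, str, str]


def knowledge_graph(triples: Iterable[Triple]) -> Tuple[Set[str], Set[Triple]]:
    """Build G = (V, E) as in Definition 2: vertices are all subjects and objects."""
    edges = set(triples)
    vertices = {s for s, _, _ in edges} | {o for _, _, o in edges}
    return vertices, edges


def subject_molecules(triples: Iterable[Triple]) -> Dict[str, Set[Triple]]:
    """Partition triples into subject molecules (Definition 3)."""
    molecules: Dict[str, Set[Triple]] = defaultdict(set)
    for s, p, o in triples:
        molecules[s].add((s, p, o))
    return dict(molecules)


# Hypothetical sample data: two profiles of the same person on different sources.
triples = [
    ("ex:twitter/profile1", "foaf:name", "Joaquín Guzmán"),
    ("ex:twitter/profile1", "foaf:nick", "El Chapo"),
    ("ex:facebook/profile7", "foaf:name", "Joaquín Guzmán"),
]

vertices, edges = knowledge_graph(triples)
molecules = subject_molecules(triples)
```

Each key of `molecules` is a subject resource, and each value is the set of triples that share it, which is the unit FuhSen integrates instead of individual triples.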

2 The FuhSen Approach

The knowledge graph (KG) that contains the results of the FuhSen engine is built and accessed in real time, i.e., the graph is created on demand. Based on the definition of the linked data lifecycle [2], the FuhSen approach applies the following process: (1) create the RDF-Molecules from heterogeneous data sources, (2) compute the similarity among the molecules to integrate them, (3) enrich the knowledge graph, and (4) provide faceted browsing to explore the knowledge graph. Figure 2 depicts the FuhSen high-level architecture.

Fig. 2. The FuhSen Architecture. High-level architecture comprising (a) Mediator and wrappers architecture to build the (b) knowledge graph on demand. The answer of a keyword query corresponds to an RDF subject-molecule that integrates RDF molecules collected from the wrappers. (c) The components to enrich the results KG.

2.1 Creating RDF-Molecules from Heterogeneous Data Sources

In comparison with traditional extract, transform, and load (ETL) methods, the on-demand KGs created by FuhSen require new knowledge acquisition approaches. Numerous wrappers expose heterogeneous, highly dynamic data (i.e., facts are frequently added, updated, and deleted), which has to be mapped to the core ontology to obtain a common representation. FuhSen therefore relies on the capabilities of the wrappers to create the RDF-Molecules and data adhering to the common ontology. FuhSen implements a mediator-wrapper architecture to create and aggregate the RDF-Molecules. FuhSen uses the OntoFuhSen vocabulary as its core data model, which allows FuhSen to deal effectively with the heterogeneity of source data, to aggregate the results in a knowledge graph in order to find relations between entities, and to link the KG with external knowledge bases such as DBpedia.
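The role of a wrapper can be sketched as follows. This is only an illustration under stated assumptions: the JSON field names, the `FIELD_TO_PROPERTY` mapping, the `ex:` identifiers, and the function `wrap_record` are hypothetical, and the chosen properties are not the actual OntoFuhSen mapping.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mapping from one source's JSON field names to vocabulary terms.
# The real OntoFuhSen mapping is not given in the text.
FIELD_TO_PROPERTY: Dict[str, str] = {
    "display_name": "foaf:name",
    "screen_name": "foaf:nick",
    "location": "foaf:based_near",
}


def wrap_record(source: str, record: Dict[str, str]) -> List[Triple]:
    """Turn one API result into an RDF molecule: triples sharing one subject."""
    subject = f"ex:{source}/{record['id']}"
    # Keep the origin of the molecule so that facts can be traced to a source.
    molecule: List[Triple] = [(subject, "prov:wasDerivedFrom", f"ex:source/{source}")]
    for field, prop in FIELD_TO_PROPERTY.items():
        value = record.get(field)
        if value:  # fields unknown to the mapping, or empty, are skipped
            molecule.append((subject, prop, value))
    return molecule


# Hypothetical API response from a social network.
record = {"id": "42", "display_name": "Maria Example", "screen_name": "mexample", "bio": "n/a"}
molecule = wrap_record("twitter", record)
```

Because every wrapper emits triples in the same vocabulary, the mediator can aggregate molecules from different sources without source-specific logic.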

2.2 Computing Similarity of RDF-Molecules

Similar molecules should be interlinked in order to create a fused, universal representation of a certain entity. In contrast with triple-based linking engines like Silk [13], we employ a molecule-based approach that increases the abstraction level and considers the semantics of molecules. That is, we do not work with independent triples, but rather with a set of triples belonging to a certain subject. The molecule-based approach allows for natural clustering of a knowledge graph, reducing the complexity of the linking algorithm. We use the Jaccard coefficient to compute a similarity score of two molecules. Let A be an RDF molecule with a set \( T_1\) of n properties and values (i.e., \( card(T_1) = n \)), and let B be an RDF molecule with a set \( T_2 \) of k properties and values (i.e., \( card(T_2) = k \)). The intersection set contains only those pairs of \(\langle property , val \rangle \) which are present in both \( T_1\) and \( T_2\). The union set contains all unique \(\langle property , val \rangle \) pairs. The similarity score is the ratio of the two cardinalities, \( card(T_1 \cap T_2) / card(T_1 \cup T_2) \).
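The score built from the intersection and union sets described above can be sketched in a few lines. The sketch assumes each molecule is reduced to a set of (property, value) string pairs; the sample pairs are hypothetical, and returning 0.0 for two empty molecules is a convention chosen here, not one stated in the text.

```python
from typing import Set, Tuple

PropertyValue = Tuple[str, str]  # a <property, val> pair


def jaccard_similarity(t1: Set[PropertyValue], t2: Set[PropertyValue]) -> float:
    """Return card(T1 ∩ T2) / card(T1 ∪ T2) for two sets of property-value pairs."""
    union = t1 | t2
    if not union:
        return 0.0  # assumption: two empty molecules are treated as dissimilar
    return len(t1 & t2) / len(union)


# Hypothetical molecules: one shared pair out of three distinct pairs.
a = {("foaf:name", "Joaquín Guzmán"), ("foaf:nick", "El Chapo")}
b = {("foaf:name", "Joaquín Guzmán"), ("foaf:based_near", "Mexico")}
score = jaccard_similarity(a, b)
```

Here `score` is 1/3, since only the name pair is shared. Note that this exact-match comparison treats values that differ only in spelling as different pairs.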

2.3 FuhSen Global Vocabulary

The OntoFuhSen vocabulary serves as a global schema to integrate data coming from different sources. The vocabulary is divided into the following three modules:

(1) Search engine metadata: comprises classes modelling user search activity (e.g., fs:Search, fs:SearchableEntity). This module has been designed taking into account the provenance of resources. The PROV vocabulary enables the tracking of provenance. PROV classes have been extended to model the provenance of information related to users’ search activities during a search process.

(2) Data sources metadata: contains classes describing the API services and access points of data sources (e.g., fs:API, fs:Parameter, fs:Operation). These classes model the APIs and services from which the data is extracted, e.g., Facebook or Twitter.

(3) Domain-specific metadata: includes classes for describing the results collected by FuhSen during keyword query processing. For the crime domain, concepts include gr:ProductOrService and org:Organization. Reusing existing terms is considered a best practice in vocabulary engineering [7]. Based on this principle, we built some of the concepts of the FuhSen vocabulary by utilizing existing well-known ontologies, e.g., terms from FOAF, GoodRelations, and the Organization Ontology.

2.4 Enriching the Knowledge Graph of Results

Once the graph is constructed, FuhSen allows for additional quality improvement by enriching the graph with new facts acquired through the typing process [6]. It is thus possible to attach additional semantic information to the KG, e.g., location information: “Mexico” coming from Twitter can be recognized and annotated with resources from other knowledge graphs, such as DBpedia’s Mexico resource. Provenance information is a built-in advantage of the on-demand KGs built by FuhSen, and allows for tracing the origins of a certain fact to a certain source. Additionally, on-demand KGs can be enriched by mining new facts from the existing ones using graph analysis algorithms. Moreover, such KGs are able to evolve over time according to the changes appearing in the source datasets. The ingestion and propagation of updates are therefore tasks to be addressed by FuhSen.

Entity Typing and Linking: FuhSen identifies named entities and tries to link them to semantic entities from external knowledge bases in the Linked Data Cloud. For this purpose, we use DBpedia Spotlight, a well-established entity annotation tool that combines named entity recognition and disambiguation based on the DBpedia linked dataset. The second tool employed during the enrichment is the Silk Framework, which allows for entity linking among several datasets. Given source and target datasets acquired from different wrappers, we check whether they semantically describe the same entity. If they do, we enrich each molecule with the properties of the other and annotate the subjects of the molecules with the owl:sameAs and rdfs:seeAlso properties. We compare the properties foaf:birthday, foaf:name, and foaf:gender, which have two different datatypes (xsd:string and xsd:date). A threshold value indicates the minimal similarity value to be taken into consideration by the linking engine. A weight value represents the degree of importance assigned to each comparison operation and affects the final similarity value. When comparing names from the source and target datasets, we leave room for possible differences in spelling, thus increasing the granularity parameter. When comparing birthdays, we check for exact equality of the property values. The same rule applies to genders. Finally, we compute a weighted average similarity value using the scores from the previous stage.
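The weighted comparison described above can be sketched as follows. This is not the Silk Framework API: it is a plain illustration that swaps in `difflib.SequenceMatcher` from the Python standard library as the tolerant name comparison. The weights, the threshold, the sample values, and all function names are assumptions, since the text does not report the values used.

```python
from difflib import SequenceMatcher
from typing import Dict

Molecule = Dict[str, str]  # property -> value, for a single subject

# Illustrative weights and threshold; the actual values are not given in the text.
WEIGHTS: Dict[str, float] = {"foaf:name": 2.0, "foaf:birthday": 1.0, "foaf:gender": 1.0}
THRESHOLD = 0.8


def name_similarity(a: str, b: str) -> float:
    """Tolerant string comparison that leaves room for spelling differences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_match(a: str, b: str) -> float:
    """Strict comparison, used here for birthdays and genders."""
    return 1.0 if a == b else 0.0


def link_score(source: Molecule, target: Molecule) -> float:
    """Weighted average of per-property similarities over shared properties."""
    total, weight_sum = 0.0, 0.0
    for prop, weight in WEIGHTS.items():
        if prop not in source or prop not in target:
            continue  # a property missing on either side is not compared
        compare = name_similarity if prop == "foaf:name" else exact_match
        total += weight * compare(source[prop], target[prop])
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def same_entity(source: Molecule, target: Molecule) -> bool:
    """Decide whether two molecules should be linked, e.g., via owl:sameAs."""
    return link_score(source, target) >= THRESHOLD


# Hypothetical molecules from two wrappers.
src = {"foaf:name": "Maria Example", "foaf:birthday": "1980-01-01", "foaf:gender": "female"}
other = {"foaf:name": "Ivan Sample", "foaf:birthday": "1975-06-30", "foaf:gender": "male"}
```

With these weights, a mismatch on both birthday and gender caps the score at 0.5, so such a pair can never reach the threshold regardless of how similar the names are.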

FuhSen Entity Summarization: This component enhances query answers with triples containing images and human-understandable textual descriptions for every entity. OntoFuhSen states the properties to be summarized for each entity according to the entity type, i.e., rdf:type. We use the approach described by Thalhammer and Stadtmüller [11] to generate a summary. The summarization component of FuhSen takes several parameters and metrics into account, e.g., the most frequent property, the top-K number of properties to return, and the requested languages, and composes a template representation similar to the Google Knowledge Graph Cards, which is shown to the user.

FuhSen Semantic Ranking: Finally, FuhSen calculates a ranking score for each result. This score is mainly used for ordering the results in the user interface. It is calculated from three factors: (1) exact match of a keyword in the rdfs:label property of the result entity, (2) number of properties and relations of the entity, and (3) data source trustworthiness expressed in terms of the OntoFuhSen vocabulary. An RDF triple with the predicate fs:rank and the ranking score is attached to each entity in the result.
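The text names the three ranking factors but not how they are combined, so the following sketch assumes a weighted sum. The weights, the cap on the property count, the reading of "exact match" as case-insensitive containment of the keyword in the label, and the function names are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def rank_score(
    keyword: str,
    molecule: Sequence[Triple],
    source_trust: float,
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),  # assumed weights
    max_properties: int = 20,  # assumed cap used to normalise factor (2)
) -> float:
    """Combine label match, entity richness, and source trust into one score."""
    labels = [o for _, p, o in molecule if p == "rdfs:label"]
    # Factor (1): keyword found in an rdfs:label (case-insensitive, an assumption).
    label_match = 1.0 if any(keyword.lower() in label.lower() for label in labels) else 0.0
    # Factor (2): number of properties and relations, normalised to [0, 1].
    richness = min(len(molecule), max_properties) / max_properties
    # Factor (3): source trustworthiness, expected in [0, 1].
    w_label, w_rich, w_trust = weights
    return w_label * label_match + w_rich * richness + w_trust * source_trust


def attach_rank(molecule: Sequence[Triple], score: float) -> List[Triple]:
    """Return a copy of the molecule with an fs:rank triple appended."""
    subject = molecule[0][0]
    return list(molecule) + [(subject, "fs:rank", f"{score:.3f}")]


# Hypothetical result entity.
entity = [
    ("ex:e1", "rdfs:label", "Maria Example"),
    ("ex:e1", "foaf:name", "Maria Example"),
]
score = rank_score("maria example", entity, source_trust=1.0)
ranked = attach_rank(entity, score)
```

Sorting results by the value of their fs:rank triple then yields the order shown in the user interface.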

Interacting with the Knowledge Graph of results: FuhSen users pose keyword-based queries and explore query answers using a multi-faceted browsing user interface. In an earlier publication [3], we presented a demo of the user interface, comprising the following elements: a text box for the search query, a result list, entity summaries, and a faceted navigation component. Our choice of JSON-LD, the standard JSON encoding of Linked Data, as the messaging format avoids unnecessary data transformations for the UI components, as they use JSON natively.

3 Experimental Evaluation

We conducted a user evaluation study with FuhSen to answer the following research questions: (1) Are end users able to execute keyword-based queries more efficiently using FuhSen rather than conventional search engines, e.g., Google? (2) Is the FuhSen user interface simpler and more pleasant to use than the interfaces of conventional search engines? We used a formative evaluation technique and a usability evaluation in a controlled environment. We selected ten users with high expertise in using Web search engines. A moderator introduced the experiments to the participants, controlled the task execution time, and provided a usability survey to be filled out anonymously.

Formative evaluation: To assess the quality of FuhSen and answer our research questions, we assigned ten users (as suggested by Xu and Mease [14]) to execute three tasks and measured the execution time that participants required to accomplish the tasks. Task 1: Find a person named “John Smith Allegro”, who is 33 years old and lives in Bonn, Germany. Task 2: Find yourself. Task 3: Find offers of a used Nexus 4 in the United States. We instructed users to stop when they considered that they had invested enough effort to cope with the task. In the longest case, it took a participant five minutes to complete a task.

Results. Five users applied common search engines such as Google and Bing, while the others used FuhSen. The gold standard for Task 1 was built from the Google+ account of John Smith Allegro; an eBay offer for a Nexus 4 was created manually as the ground truth for Task 3. Information about the evaluation participants was used as the gold standard for Task 2. Figure 3(a) reports on the average task execution time (in seconds) during the evaluation.

Fig. 3. User evaluation results. (a) Task execution time (average in mins.); (b) comparison of execution time on FuhSen and traditional engines; (c) average scores of usability tests.

Discussion. We observed that no user was able to complete Task 1 using a conventional search engine, whereas all participants were able to find John Smith Allegro with FuhSen. We assume that the ranking algorithm used by conventional search engines prevented the completion of Task 1 in time. Only one person could not complete Task 2 using FuhSen; from the post-study questionnaire we identified that this person did not have any accounts in the web services which are used as information sources in our current prototype. Similarly, only one person could not complete Task 3 using FuhSen. A possible explanation might be that the participant employed a routine learned from the usage of conventional search engines; cf. the discussion below. Figure 3(c) shows the time participants needed to complete the tasks using FuhSen. The maximum time to complete Task 1 using FuhSen was close to one minute, which is an acceptable value. Participants completed Task 2 faster using conventional search engines. This result is explained by the fact that the participants knew exactly beforehand which keyword combinations would lead them to the expected output. The results of Task 3 illustrate the advantages of using FuhSen for finding rather specific information: the search process tends to be faster with FuhSen than with conventional engines.

Usability Evaluation: This evaluation was performed with those participants who used FuhSen. Two techniques were used during this evaluation: think aloud protocols and a Post-Study System Usability Questionnaire (PSSUQ) [8].

Results. Figure 3(b) summarizes the results: the FuhSen user interface received high scores in all aspects. The evaluation outcome supports the design decisions for user interaction taken during the implementation of FuhSen.

Discussion. One of the main usability problems we identified concerns the filters, i.e., users did not realize at first use that there were any filters. Nevertheless, filters were heavily used once a participant had explored the interface. Another relevant observation is that users tend to apply a search routine learnt from using conventional search engines, such as searching for “John Smith Allegro Facebook” or “Bonn Germany John Smith”. This practice has to be taken into consideration for further improvements of the user interface and query expansion.

4 Conclusions

In this paper we have presented FuhSen, a federated semantic hybrid engine that creates a knowledge graph on demand by integrating data collected from a federation of heterogeneous data sources using an RDF-Molecule integration approach. We showed the creation of RDF-Molecules by using RDF-Wrappers, and we also presented how semantic similarity measures are used to determine the relatedness of two resources in terms of the relatedness of their RDF molecules. The RDF-Molecule integration approach followed by FuhSen constitutes a novel integration paradigm incorporating elements from linked data and federated search engines. Although our initial use case addresses the criminal investigation domain, we believe that there are numerous further use cases, e.g., related to e-commerce (such as price comparison) or social media. This work is the first step of a larger research agenda aiming at establishing the concept of a molecule-based integration approach for building knowledge graphs on demand.