Abstract
A vast amount of information about various types of entities is spread across the Web, e.g., people or organizations on the Social Web, and product offers on the Deep Web or on the Dark Web. These data sources can comprise heterogeneous data and offer different search capabilities, e.g., search APIs. End users such as investigators from law enforcement institutions searching for traces and connections of organized crime have to deal with these interoperability problems not only at search time but also while merging data collected from different sources. We devise FuhSen, a keyword-based federated engine that exploits the search capabilities of heterogeneous sources during query processing and generates knowledge graphs on demand by applying an RDF-Molecule integration approach in response to keyword-based queries. The resulting knowledge graph describes the semantics of entities collected from the integrated sources, as well as relationships among these entities. Furthermore, FuhSen utilizes ontologies to describe the available sources in terms of content and search capabilities, and exploits this knowledge to select the sources relevant for answering a keyword-based query. We conducted a user evaluation in which FuhSen is compared to traditional search engines. During the evaluation study, FuhSen's semantic search capabilities allowed users to complete search tasks that could not be accomplished with traditional Web search engines.
1 Introduction
As the amount of information on the Web and within information systems grows, efficient and effective querying, exploration, and retrieval approaches become more important. For information available as plain text, Information Retrieval is a long-established research field; a vast number of mature commercial and open implementations such as Apache Solr now drive large-scale applications. In the area of the Semantic Web, a number of approaches, techniques, and platforms have also been developed (e.g., [12]) which unify search across unstructured data (Web documents) and structured data (RDF). However, for many applications, heterogeneous information represented in different modalities (structured, semi-structured, or unstructured) and spread across distributed data sources has to be made searchable and explorable for end users in an integrated way. Further, these distributed data sources are accessible through a variety of interfaces. In addition to plain documents on the Web, there are the Deep Web, whose content is generated from databases and often accessible through APIs (e.g., for social networks or e-commerce platforms), and the Dark Web, whose content is not even openly accessible. Indexing information from all these sources is not generally feasible; in many situations, it is also not allowed by the terms of use of certain services or by data protection and privacy laws.
FuhSen is a federated semantic hybrid engine that relies on semantics encoded in a knowledge graph to integrate data collected from a federation of heterogeneous sources, e.g., unstructured, semi-structured, or structured. To this end, sources are wrapped, and answers to keyword queries are represented using the Resource Description Framework (RDF). Each data source answer is modelled as a set of RDF triples that share the same subject resource, i.e., an RDF molecule [4, 5]. FuhSen utilizes semantic similarity measures to determine relatedness between two resources in terms of the relatedness of their RDF molecules. Highly similar RDF molecules are aggregated into an integrated RDF molecule that corresponds to the answer of a keyword-based query; further, integrated molecules are included as part of the knowledge graph. This semantic aggregation of the answers returned by the sources allows a more meaningful integration of the collected data and is the main difference between FuhSen and triple-based integration [10, 13]. Thus, the main contributions of this paper are: (a) A data integration system named FuhSen that provides a unified view of heterogeneous search engines. (b) An RDF-Molecule integration approach that utilizes semantic similarity measures to integrate pieces of information about the same entity in different data sources.
Motivating Example: We briefly describe a distributed and heterogeneous search application scenario in the context of crime investigation. During a crime investigation process, collecting and analysing information from different sources is a key step performed by investigators. Although scene analysis is always required, a crime investigation process can greatly benefit from searching information about people, products, and organisations on the Web. Commonly, data collected from the following data sources are utilised for enhancing crime analysis processes: (1) The Social Web encompasses user generated content and personal profiles. (2) The Deep Web advertises products and services offered by organisations, e.g., the eBay e-commerce platform. (3) The Web of Data includes billions of machine-comprehensible facts, which can serve as background knowledge for collecting information about different types of entities. (4) The Dark Web refers to sites accessible only with specific software, and restricted trading of goods that can be accessed through the so-called dark-net markets.
Figure 1 illustrates data about a suspected drug dealer (Joaquín Chapo Guzmán) collected from different Web data sources. Although all the social networks share the profile name, the alias of the suspect is found in Twitter, while his birthplace and location come from Google+ and Facebook, respectively. Currently, experts perform this data integration manually. This negatively affects the investigation process: the work is extremely cumbersome and time-consuming because it requires accessing a large number of different data sources and manually integrating individual search results. FuhSen exploits REST APIs provided by Web data sources to search, create, and aggregate molecules of data, and then to enrich and summarize information about an entity (e.g., a suspect). Using Linked Data as the core technology, the FuhSen engine is able to: (1) integrate on demand heterogeneous data extracted from APIs into a unified data schema using the OntoFuhSen vocabulary, (2) create a knowledge graph on demand with the data extracted from the different data sources, and (3) enrich this knowledge graph using algorithms such as entity disambiguation, typing and entity summarization, and ranking.
Preliminaries: FuhSen creates a knowledge graph on demand when a keyword query is given as input. A knowledge graph is composed of a set of concepts, their properties, and relations among these concepts. To make these notions precise, we follow the notation of Arenas et al. [1], Pirrò [9], and Fernández et al. [5] to define RDF triples, knowledge graphs, and RDF molecules.
Definition 1
(RDF triple [1]). Let \(\mathbf {I}\), \(\mathbf {B}\), \(\mathbf {L}\) be disjoint infinite sets of URIs, blank nodes, and literals, respectively. A tuple \((s, p, o) \in (\mathbf {I} \cup \mathbf {B}) \times \mathbf {I} \times (\mathbf {I} \cup \mathbf {B} \cup \mathbf {L})\) is denominated an RDF triple, where s is called the subject, p the predicate, and o the object.
Definition 2
(Knowledge Graph [9]). Given a set T of RDF triples, a knowledge graph is a pair \(G=(V, E)\), where \(V = \{s \mid (s, p, o) \in T\} \cup \{o \mid (s, p, o) \in T\}\) and \(E=\{(s, p, o) \in T\}\).
Definition 3
(RDF Subject Molecule [5]). Given an RDF graph G, an RDF subject-molecule \(M \subseteq G\) is a set of triples \(\{t_1, t_2, \dots , t_n\}\) in which \( subject (t_1) = subject (t_2) = \dots = subject (t_n)\).
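The three definitions can be illustrated with a minimal Python sketch. It is not FuhSen's implementation: the `Triple` type, the function names, and the `ex:` identifiers are assumptions made for illustration, and terms are plain strings rather than typed URIs, blank nodes, and literals. Grouping by subject yields the largest subject molecule for each subject; any subset of such a group is also a molecule under Definition 3.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """An RDF triple (s, p, o) as in Definition 1; terms are plain strings here."""
    subject: str
    predicate: str
    obj: str


def knowledge_graph(triples):
    """Return (V, E) as in Definition 2: V holds all subjects and objects, E the triples."""
    edges = set(triples)
    vertices = {t.subject for t in edges} | {t.obj for t in edges}
    return vertices, edges


def subject_molecules(triples):
    """Group triples by subject; each group is a subject molecule (Definition 3)."""
    molecules = defaultdict(set)
    for t in triples:
        molecules[t.subject].add(t)
    return dict(molecules)


# Illustrative data only; the ex: identifiers are hypothetical.
triples = [
    Triple("ex:suspect1", "foaf:name", "Joaquín Guzmán"),
    Triple("ex:suspect1", "foaf:based_near", "ex:Mexico"),
    Triple("ex:Mexico", "rdfs:label", "Mexico"),
]
vertices, edges = knowledge_graph(triples)
molecules = subject_molecules(triples)
```

Here `molecules` has one entry per subject, so the two triples about `ex:suspect1` end up in the same molecule.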
2 The FuhSen Approach
The knowledge graph (KG) that contains the results of the FuhSen engine is built and accessed in real time, i.e., the graph is created on demand. Based on the definition of the linked data lifecycle [2], the FuhSen approach applies the following process: (1) Create the RDF-Molecules from heterogeneous data sources, (2) Compute the similarity among the molecules to integrate them, (3) Enrich the knowledge graph, and (4) Provide faceted browsing to explore the knowledge graph. Figure 2 depicts the FuhSen high-level architecture.
2.1 Creating RDF-Molecules from Heterogeneous Data Sources
In comparison with traditional extract, transform, and load (ETL) methods, the on-demand KGs created by FuhSen require new knowledge acquisition approaches. Numerous wrappers expose heterogeneous, highly dynamic data (i.e., facts are frequently added, updated, and deleted), which has to be mapped to the core ontology to obtain a common representation. FuhSen therefore relies on the capabilities of the wrappers to create the RDF-Molecules and data adhering to the common ontology. FuhSen implements a mediator-wrapper architecture to create and aggregate the RDF-Molecules. FuhSen uses the OntoFuhSen vocabulary as its core data model, which allows FuhSen to deal effectively with the heterogeneity of source data, to aggregate the results in a knowledge graph to find relations between entities, and to link the KG with external knowledge bases such as DBpedia.
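The role of a wrapper can be sketched as follows. This is only an illustration under stated assumptions: the source names `source_a` and `source_b`, their field names, the `ex:` subject scheme, and the choice of target properties are all hypothetical and do not reproduce any real API payload or the actual OntoFuhSen mappings.

```python
# Hypothetical per-source mappings from API field names to vocabulary properties.
FIELD_MAPPINGS = {
    "source_a": {"display_name": "foaf:name", "handle": "foaf:nick"},
    "source_b": {"full_name": "foaf:name", "hometown": "foaf:based_near"},
}


def wrap_profile(source: str, record: dict) -> set:
    """Turn one API record into an RDF molecule: triples sharing a single subject."""
    mapping = FIELD_MAPPINGS[source]
    subject = f"ex:{source}/{record['id']}"
    molecule = {(subject, "rdf:type", "foaf:Person")}
    for field, value in record.items():
        prop = mapping.get(field)
        # Fields without a mapping, and empty values, are skipped.
        if prop is not None and value:
            molecule.add((subject, prop, str(value)))
    return molecule


molecule_a = wrap_profile("source_a", {"id": "42", "display_name": "J. Guzmán", "followers": 10})
molecule_b = wrap_profile("source_b", {"id": "7", "full_name": "J. Guzmán", "hometown": "Mexico"})
```

Because both wrappers emit `foaf:name`, the two molecules become directly comparable even though the underlying records use different field names.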
2.2 Computing Similarity of RDF-Molecules
Similar molecules should be interlinked in order to create a fused, universal representation of a certain entity. In contrast with triple-based linking engines like Silk [13], we employ a molecule-based approach, increasing the abstraction level and considering the semantics of molecules. That is, we do not work with independent triples, but rather with a set of triples belonging to a certain subject. The molecule-based approach allows for natural clustering of a knowledge graph, reducing the complexity of the linking algorithm. We use the Jaccard coefficient to compute a similarity score of two molecules. Let A be an RDF molecule with a set \( T_1\) of n properties and values (i.e., \( card(T_1) = n \)), and let B be an RDF molecule with a set \( T_2 \) of k properties and values (i.e., \( card(T_2) = k \)). The intersection set contains only those pairs of \(\langle property , val \rangle \) which are present in both \( T_1\) and \( T_2\). The union set contains all unique \(\langle property , val \rangle \) pairs. The similarity score is the ratio of the two cardinalities, \( card(T_1 \cap T_2) / card(T_1 \cup T_2) \).
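As a minimal sketch, the Jaccard coefficient over \(\langle property , val \rangle \) pairs can be computed with Python sets. The function name and the example pairs are assumptions made for illustration, and returning 0.0 for two empty molecules is a convention chosen here, not something the approach prescribes.

```python
def jaccard_similarity(t1: set, t2: set) -> float:
    """Jaccard coefficient of two sets of (property, value) pairs."""
    union = t1 | t2
    if not union:
        return 0.0  # convention chosen here for two empty molecules
    return len(t1 & t2) / len(union)


# Illustrative (property, value) pairs of two molecules A and B.
a = {("foaf:name", "Joaquín Guzmán"), ("foaf:nick", "El Chapo"), ("foaf:gender", "male")}
b = {("foaf:name", "Joaquín Guzmán"), ("foaf:gender", "male"), ("foaf:based_near", "Mexico")}

score = jaccard_similarity(a, b)  # 2 shared pairs out of 4 distinct pairs
```

In this example the two molecules share two of four distinct pairs, so the score is 0.5. Pairs match only on exact equality of both property and value, which is why literal normalisation in the wrappers matters for this measure.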
2.3 FuhSen Global Vocabulary
The OntoFuhSen vocabulary serves as a global schema to integrate data coming from different sources. The vocabulary is divided into the following three modules:
(1) Search engine metadata: comprises classes modelling user search activity (e.g., fs:Search, fs:SearchableEntity). This module has been designed taking into account the provenance of resources. The PROV vocabulary enables the tracking of provenance. PROV classes have been extended to model the provenance of information related to users’ search activities during a search process.
(2) Data sources metadata: contains classes describing data sources API services and access points (e.g., fs:API, fs:Parameter, fs:Operation). These classes model the APIs and services from which the data is extracted, e.g., Facebook or Twitter.
(3) Domain-specific metadata: includes classes for describing the results collected by FuhSen during keyword query processing. For the crime domain, concepts include gr:ProductOrService and org:Organization. Reusing existing terms is considered a best practice in vocabulary engineering [7]. Based on this principle, we built some of the concepts of the FuhSen vocabulary by utilizing existing well-known ontologies, e.g., terms from FOAF, GoodRelations, and the Organization Ontology.
2.4 Enriching the Knowledge Graph of Results
Once the graph is constructed, FuhSen allows for additional quality improvement by enriching the graph with new facts acquired through the typing process [6]. It is thus possible to attach additional semantic information to the KG, e.g., location information: the literal “Mexico” coming from Twitter can be recognized and annotated with resources from other knowledge graphs, such as DBpedia’s Mexico resource. Provenance information is a built-in advantage of the on-demand KGs built by FuhSen, and allows for tracing the origins of a certain fact to a certain source. Additionally, on-demand KGs can be enriched by mining new facts from the existing ones using graph analysis algorithms. Moreover, such KGs are able to evolve over time according to the changes appearing in the source datasets. Update ingestion and propagation are therefore tasks to be addressed by FuhSen.
Entity Typing and Linking: FuhSen identifies named entities and tries to link them to semantic entities from external knowledge bases in the Linked Data Cloud. For this purpose, we use DBpedia Spotlight, a well-established entity annotation tool that combines named entity recognition and disambiguation based on the DBpedia linked dataset. The second tool employed during the enrichment is the Silk Framework, which allows for entity linking among several datasets. Given source and target datasets acquired from different wrappers, we check whether they semantically describe the same entity. If they do, we enrich each molecule with the properties of the other and annotate the subjects of the molecules with the owl:sameAs and rdfs:seeAlso properties. We compare three properties (foaf:birthday, foaf:name, and foaf:gender) of two different datatypes (xsd:string and xsd:date). A threshold value indicates the minimal similarity value to be taken into consideration by the linking engine. A weight value represents the degree of importance assigned to each comparison operation and affects the final similarity value. When comparing names from the source and target datasets, we leave room for possible differences in spelling, thus increasing the granularity parameter. When comparing birthdays, we check exact equality of the property values. The same rule applies to genders. Finally, we compute a weighted average similarity value from the results of the individual comparisons.
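The weighted comparison described above can be approximated in a few lines of Python. This sketch does not use or reproduce the Silk Framework or its configuration: the string measure (`difflib.SequenceMatcher` from the standard library), the weights, and the threshold of 0.9 are assumptions chosen for illustration, and the threshold is applied to the aggregated score only.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Tolerant string comparison that leaves room for spelling differences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact(a: str, b: str) -> float:
    """Strict comparison, used for birthdays and genders."""
    return 1.0 if a == b else 0.0


# (property, comparison function, weight); the weights are illustrative assumptions.
COMPARISONS = [
    ("foaf:name", name_similarity, 2.0),
    ("foaf:birthday", exact, 1.0),
    ("foaf:gender", exact, 1.0),
]


def link_score(source: dict, target: dict) -> float:
    """Weighted average over the properties present in both molecules."""
    total = weight_sum = 0.0
    for prop, compare, weight in COMPARISONS:
        if prop in source and prop in target:
            total += weight * compare(source[prop], target[prop])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def same_entity(source: dict, target: dict, threshold: float = 0.9) -> bool:
    """Decide whether to link the two subjects, e.g., via owl:sameAs."""
    return link_score(source, target) >= threshold


src = {"foaf:name": "John Smith", "foaf:birthday": "1983-01-01", "foaf:gender": "male"}
tgt = {"foaf:name": "Jon Smith", "foaf:birthday": "1983-01-01", "foaf:gender": "male"}
linked = same_entity(src, tgt)
```

With these assumed weights, a small spelling difference in the name still yields a link, whereas a differing birthday lowers the score of otherwise identical records to 0.75, below the threshold.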
FuhSen Entity Summarization: This component enhances query answers with triples containing images and human-understandable textual descriptions for every entity. OntoFuhSen states the properties to be summarized for each entity according to the entity type, i.e., rdf:type. We use the approach described by Thalhammer and Stadtmüller [11] to generate a summary. The summarization component of FuhSen takes several parameters and metrics into account, e.g., the most frequent property, the top-K number of properties to return, and the requested languages, and composes a template representation, similar to the Google Knowledge Graph cards, which is shown to the user.
FuhSen Semantic Ranking: Finally, FuhSen calculates a ranking score for each result. This score is mainly used for ordering the results in the user interface. It is calculated from three factors: (1) exact match of a keyword in the rdfs:label property of the result entity, (2) number of properties and relations of the entity, and (3) data source trustworthiness expressed in terms of the OntoFuhSen vocabulary. An RDF triple with the predicate fs:rank and the ranking score is attached to each entity in the result.
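How the three factors are combined is not specified above, so the following is only one plausible sketch. The linear combination, the weights, the `max_properties` cap, the use of case-insensitive substring matching for the label, and a trust value in the range 0 to 1 are all assumptions.

```python
def rank_score(properties: dict, keyword: str, source_trust: float,
               weights=(0.5, 0.3, 0.2), max_properties: int = 20) -> float:
    """Combine label match, entity richness, and source trust into one score."""
    label = properties.get("rdfs:label", "")
    # Factor 1: the keyword occurs in the label (case-insensitive).
    label_match = 1.0 if keyword.lower() in label.lower() else 0.0
    # Factor 2: number of properties, capped and scaled to [0, 1].
    richness = min(len(properties), max_properties) / max_properties
    # Factor 3: source_trust is assumed to be given in [0, 1].
    w_label, w_rich, w_trust = weights
    return w_label * label_match + w_rich * richness + w_trust * source_trust


def rank_triple(subject: str, properties: dict, keyword: str, source_trust: float):
    """Build the (subject, fs:rank, score) triple attached to a result entity."""
    return (subject, "fs:rank", rank_score(properties, keyword, source_trust))


entity = {"rdfs:label": "John Smith Allegro", "foaf:based_near": "Bonn"}
triple = rank_triple("ex:result/1", entity, "john smith", source_trust=1.0)
```

Sorting results in descending order of the third component of these triples then gives the order shown in the user interface.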
Interacting with the Knowledge Graph of results: FuhSen users pose keyword-based queries and explore query answers using a multi-faceted browsing user interface. In an earlier publication [3], we presented a demo of the user interface, comprising the following elements: a text box for the search query, a result list, entity summaries, and a faceted navigation component. Our choice of JSON-LD, the standard JSON encoding of Linked Data, as the messaging format avoids unnecessary data transformations for the UI components, as they use JSON natively.
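To illustrate why JSON-LD needs no extra transformation on the UI side, the sketch below serialises one hypothetical result entity with Python's standard `json` module. The `fs` namespace URI and the result identifier are placeholders, not the real OntoFuhSen namespace or FuhSen's message schema; only the FOAF namespace is the actual one.

```python
import json

# A hypothetical result entity expressed as JSON-LD.
result = {
    "@context": {
        "foaf": "http://xmlns.com/foaf/0.1/",
        "fs": "http://example.org/ontofuhsen#",  # placeholder namespace (assumption)
    },
    "@id": "http://example.org/result/1",  # placeholder identifier (assumption)
    "@type": "foaf:Person",
    "foaf:name": "John Smith Allegro",
    "fs:rank": 0.85,
}

# The message is ordinary JSON, so any JSON parser can read it without an RDF library.
message = json.dumps(result, ensure_ascii=False)
decoded = json.loads(message)
```

A UI component can read `decoded["foaf:name"]` directly, while an RDF-aware consumer can still expand the same document into triples using the `@context`.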
3 Experimental Evaluation
We conducted a user evaluation study with FuhSen to investigate the following research questions: (1) Are end users able to execute keyword-based queries more efficiently using FuhSen than using conventional search engines, e.g., Google? (2) Is the FuhSen user interface simpler and more pleasant to use than the interfaces of conventional search engines? We used a formative evaluation technique and a usability evaluation in a controlled environment. We selected 10 users with high expertise in using Web search engines. A moderator introduced the experiments to the participants, controlled the task execution time, and provided a usability survey to be filled out anonymously.
Formative evaluation: To assess the quality of FuhSen and validate our research hypotheses, we assigned ten users (as suggested by Xu and Mease [14]) to execute three tasks and measured the execution time that participants required to accomplish the tasks. Task1: Find a person named “John Smith Allegro”, who is 33 years old and lives in Bonn, Germany. Task2: Find yourself. Task3: Find offers of a used Nexus 4 in the United States. We instructed users to stop when they considered that they had invested enough effort to cope with the task. In the longest case it took a participant five minutes to complete the task.
Results. Five users applied common search engines such as Google and Bing, while the others used FuhSen. The gold standard for Task1 was built from the Google+ account of John Smith Allegro; an eBay offer for a Nexus 4 was created manually as the ground truth for Task3. Information about the evaluation participants was used as the gold standard for Task2. Figure 3(a) reports the average task execution time (in seconds) during the evaluation.
Discussion. We observed that no user was able to complete Task1 using a conventional search engine, whereas all participants were able to find John Smith Allegro with FuhSen. We assume that the ranking algorithm used by conventional search engines prevented the completion of Task1 on time. Only one person could not complete Task2 using FuhSen; from the post-study questionnaire we identified that this person did not have any accounts in the web services which are used as information sources in our current prototype. Similarly, only one person could not complete Task3 using FuhSen. A possible explanation might be that the participant employed a routine learned from the usage of conventional search engines; cf. the discussion below. Figure 3(c) shows the time participants needed to complete the tasks using FuhSen. The maximum time to complete Task1 using FuhSen was close to one minute, which is an acceptable value. Participants completed Task2 faster using conventional search engines. This result is explained by the fact that the participants knew beforehand exactly which keyword combinations would lead them to the expected output. The results of Task3 illustrate the advantages of using FuhSen for finding rather specific information. The search process tends to be faster with FuhSen than with conventional engines.
Usability Evaluation: This evaluation was performed with those participants who used FuhSen. Two techniques were used during this evaluation: think aloud protocols and a Post-Study System Usability Questionnaire (PSSUQ) [8].
Results. Figure 3(b) summarizes the results. The FuhSen user interface received high scores in all aspects. This outcome supports the design decisions for user interaction taken during the implementation of FuhSen.
Discussion. One of the main usability problems we identified concerns the filters: users did not realize at first use that there were any filters. Nevertheless, filters were heavily used once a participant had explored the interface. Another relevant observation is that users tend to apply a search routine learnt from using conventional search engines, such as searching for “John Smith Allegro Facebook” or “Bonn Germany John Smith”. This practice has to be taken into consideration for further improvements of the user interface and query expansion.
4 Conclusions
In this paper we have presented FuhSen, a federated semantic hybrid engine that creates a knowledge graph on demand by integrating data collected from a federation of heterogeneous data sources using an RDF-Molecule integration approach. We showed the creation of RDF-Molecules by using RDF-Wrappers, and we also presented how semantic similarity measures are used to determine the relatedness of two resources in terms of the relatedness of their RDF molecules. The RDF-Molecule integration approach followed by FuhSen constitutes a novel integration paradigm incorporating elements from linked data and federated search engines. Although our initial use case addresses the criminal investigation domain, we deem that there are numerous further use cases, e.g., related to e-commerce (e.g., price comparison) or social media. This work is the first step of a larger research agenda aiming at establishing the concept of a molecule-based integration approach for building knowledge graphs on demand.
References
1. Arenas, M., Gutierrez, C., Pérez, J.: Foundations of RDF databases. In: Tessaris, S., Franconi, E., Eiter, T., Gutierrez, C., Handschuh, S., Rousset, M.-C., Schmidt, R.A. (eds.) Reasoning Web 2009. LNCS, vol. 5689, pp. 158–204. Springer, Heidelberg (2009). doi:10.1007/978-3-642-03754-2_4
2. Auer, S., Bryl, V., Tramp, S. (eds.): Linked Open Data – Creating Knowledge Out of Interlinked Data. LNCS, vol. 8661. Springer, Heidelberg (2014)
3. Collarana, D., Lange, C., Auer, S.: FuhSen: a platform for federated, RDF based hybrid search. In: WWW Companion Volume (2016)
4. Ding, L., et al.: Tracking RDF graph provenance using RDF molecules. In: International Semantic Web Conference (Poster) (2005)
5. Fernández, J.D., Llaves, A., Corcho, O.: Efficient RDF interchange (ERI) format for RDF data streams. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8797, pp. 244–259. Springer, Heidelberg (2014). doi:10.1007/978-3-319-11915-1_16
6. Gunaratna, K., Thirunarayan, K., Sheth, A., Cheng, G.: Gleaning types for literals in RDF triples with application to entity summarization. In: Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.) ESWC 2016. LNCS, vol. 9678, pp. 85–100. Springer, Heidelberg (2016). doi:10.1007/978-3-319-34129-3_6
7. Heath, T., Bizer, C.: Linked data: evolving the web into a global data space. In: Synthesis Lectures on the Semantic Web. Morgan & Claypool Publishers (2011)
8. Lewis, J.R.: IBM computer usability satisfaction questionnaires: psychometric evaluation and instructions for use. Int. J. Hum. Comput. Interact. 7(1), 57–78 (1995)
9. Pirrò, G.: Explaining and suggesting relatedness in knowledge graphs. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9366, pp. 622–639. Springer, Heidelberg (2015). doi:10.1007/978-3-319-25007-6_36
10. Schultz, A., et al.: LDIF-a framework for large-scale Linked Data integration. In: 21st International World Wide Web Conference (WWW: Developers Track), Lyon, France (2012)
11. Thalhammer, A., Stadtmüller, S.: SUMMA: a common API for linked data entity summaries. In: Cimiano, P., Frasincar, F., Houben, G.-J., Schwabe, D. (eds.) ICWE 2015. LNCS, vol. 9114, pp. 430–446. Springer, Heidelberg (2015). doi:10.1007/978-3-319-19890-3_28
12. Usbeck, R., Ngomo, A.-C.N., Bühmann, L., Unger, C.: HAWK – hybrid question answering using linked data. In: Gandon, F., Sabou, M., Sack, H., d’Amato, C., Cudré-Mauroux, P., Zimmermann, A. (eds.) ESWC 2015. LNCS, vol. 9088, pp. 353–368. Springer, Heidelberg (2015). doi:10.1007/978-3-319-18818-8_22
13. Volz, J., et al.: Silk - a link discovery framework for the web of data. In: Bizer, C., et al. (eds.) Proceedings of the WWW 2009 Workshop on Linked Data on the Web, LDOW 2009, vol. 538, Madrid, Spain, April 20, 2009. CEUR Workshop Proceedings. CEUR-WS.org (2009)
14. Xu, Y., Mease, D.: Evaluating web search using task completion time. In: SIGIR (2009)
Acknowledgments
This work was funded by the German Ministry of Education and Research grant no. 13N13627.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Collarana, D., Galkin, M., Lange, C., Grangel-González, I., Vidal, ME., Auer, S. (2016). FuhSen: A Federated Hybrid Search Engine for Building a Knowledge Graph On-Demand (Short Paper). In: Debruyne, C., et al. On the Move to Meaningful Internet Systems: OTM 2016 Conferences. OTM 2016. Lecture Notes in Computer Science(), vol 10033. Springer, Cham. https://doi.org/10.1007/978-3-319-48472-3_47
DOI: https://doi.org/10.1007/978-3-319-48472-3_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48471-6
Online ISBN: 978-3-319-48472-3
eBook Packages: Computer Science (R0)