1 Motivation

Current NLP research is moving from linear word/sentence/discourse representations towards semantic representations, where entities and their relations are made explicit. Even though NLP components are still being improved by emerging techniques like deep learning, the quality of existing components is sufficient to work on the semantic level – one level of abstraction up from surface text. We want to semantify text by assigning word sense IDs to the content words in a document. Working on the semantic level not only provides us with entities like nouns, but also with the relations between them. After semantification, we can use this representation to grasp the meaning of the document, e.g. which subjects occur in the document and to which class they belong.

Semantic Web (SW) applications use entities, e.g. to disambiguate which Turkey a text refers to: the country or the animal. However, knowledge bases like Freebase [6] or DBpedia [17] only relate concepts, without necessarily disambiguating senses. While linking words to such knowledge bases is useful in their current state, we propose an all-word conceptualization in which every content word is identified by its sense.

With our symbolic conceptualizations, we are able to identify concepts in a text (contextualization). Contextualizing a document will improve the performance of entity linking for the ‘classic Semantic Web’, because we assume that concepts/entities also follow the Distributional Hypothesis: concepts co-occur with similar concepts in a document. Identifying the senses of all content words in a document will enable semantic applications like semantic indexing.

2 State of the Art

2.1 Conceptualizations

The creation of conceptualizations is related to Ontology Learning [2], which uses supervised and unsupervised methods. Supervised approaches use WordNet [11] (e.g. [32]), Wikipedia, DBpedia and Freebase for ontology induction or entity extraction [8, 21, 35]. DBpedia Spotlight [8] relies on explicit links between entities in Wikipedia. Entity extraction is performed with a pre-trained prefix tree, and disambiguation is done with a language model (LM). Sense embeddings can also be trained for conceptualizations [15]. Our proposed approach is related to [7] but does not use knowledge bases.

Unsupervised methods often rely on relation extraction (cf. Sect. 2.3). OntoUSP [27] extracts a probabilistic ontology for the medical domain using dependency path features. OntoGain [9] relies on multi-word expressions (MWEs) to construct a taxonomic graph using hierarchical clustering; this graph is then expanded with non-taxonomic relations. The hybrid approach of [35], which accesses search engines, can also identify novel MWEs that signify entities. Local taxonomic relations can be extracted by identifying certain syntactic patterns in a text [14, 16]. There are also approaches that work directly on the SW graph and utilize distributional methods to establish concept similarities, e.g. [24].

2.2 Contextualization

The two most recent contextualization shared tasks are the Word Sense Disambiguation (WSD) tasks of SemEval 2010 [20] and SemEval 2013 [23]. The participating systems often use knowledge bases like YAGO [19], WordNet or other ontologies to assign sense identifiers to target words (usually nouns) in a sentence. Knowledge-free systems employ co-occurrence and distributional similarities together with language models.

2.3 Relation Extraction

TextRunner [36] extracts explicit relationship tuples \((R, T_1, T_2)\) from POS-tagged text. These relations can be seen as ‘facts’ and aid question answering. GraBTax [34] can build taxonomies by utilizing co-occurrence and lexical similarity of n-gram topics from document titles.

OntoUSP [27] focuses on identifying specific relations in the medical domain. It can identify nominal MWEs and group different spellings into clusters; it also performs hierarchical clustering on verbs. [31] present a relation learning approach: by extracting known facts, they create triggers to extract similar relations for different entities. [28] uses distributional statistics to extract relations between nouns. This produces highly precise relations, but recall is quite low. We believe we can alleviate this problem by extracting relations not on the word level (leaves of the taxonomy graph) but between taxonomic concepts (nodes).

2.4 Linked Open Data

There are many useful data collections available on the Web. Freebase [6] and DBpedia [17] offer a large number of relations, usually as RDF triples, that can be queried using APIs. On top of that, applications like DBpedia Spotlight [8] or Babelfy [22] can annotate texts with e.g. DBpedia entities, which in turn link to Wikipedia.

We plan to extract and display information similar to the Weltmodell [1] or ConceptNet [33]. By utilizing disambiguated concepts, we believe that we can extract more relations (higher recall on the concept level vs. the word level) and more precise ones (handling polysemy).

3 Problem Statement and Contributions

Even the most trivial sentences and phrases can be difficult for computers to understand and process. Supervised ontologies and dictionaries help to a large extent; however, they often do not fit the textual domain to which they are applied. Search engines can be viewed as semantic applications, as they are able to identify word senses from the provided keywords, e.g. throw a ball vs. attend a ball. However, they do not reveal how many senses of the provided query terms they know about. We believe that this is important information, and making it available would help many users. Semantic resources like WordNet are often too fine-grained, which reduces their usability in semantic applications.

To alleviate such problems, we plan to investigate the following research questions:

  • How can we construct a semantic model that improves WSD performance? In this task we are going to use Distributional Semantics (DS) methods [13] to create conceptualizations from an input corpus. Afterwards, we are going to structure the concepts into a global taxonomy graph by utilizing the hypernymy structure.

  • Which unsupervised, knowledge-free methods can we use to obtain state-of-the-art WSD? This task involves identifying concepts in their context to obtain an explicit semantic representation of the corpus. To the best of our knowledge, there have been no attempts to create a fully disambiguated corpus.

  • How can we identify significant relations between concepts to enrich our conceptualization? Using the contextualized corpus, we can extract relations on the concept and hypernym level, allowing us to extract more relations. We need to make sure that we only add significant relations to our concept graph.

Furthermore, we plan to create demonstrator applications based on the conceptualizations to exemplify the semantic annotation capability. We are also going to release them as free, open-source software with a focus on usability.

4 Research Methodology and Approach

4.1 Conceptualizations

To be able to annotate and semantify text, we need a knowledge model. Therefore, we are going to use the JoBimText framework [5] to create symbolic conceptualizations. We believe that an explicit symbolic representation has an advantage over vector-based models like deep learning because of its direct interpretability. We are going to create JoBimText models [30] and extend them to interconnected graphs, introducing new semantic relations between the nodes. A JoBimText model consists of a Distributional Thesaurus (DT) with sense-disambiguated entries. We induce word senses using Chinese Whispers [3], a knowledge-free graph clustering algorithm. The senses are labeled with the most frequent hypernym terms obtained using lexico-syntactic patterns [14, 16], producing local taxonomies. In addition, the models contain significant context features for each word and a DT of such context features. The combination of entities like named entities and multi-word expressions [29] with common words will create an all-word knowledge base. Since it is fully based on the input corpus, there is no need for domain adaptation.
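To make the sense induction step concrete, the following is a minimal sketch of Chinese Whispers [3] on a toy ego network of distributionally similar words; the graph, weights and iteration count are illustrative and not taken from our models:

```python
import random
from collections import defaultdict

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """Cluster a weighted similarity graph by iterative label propagation."""
    rng = random.Random(seed)
    adj = defaultdict(list)           # adjacency list of the undirected graph
    for (u, v), w in edges.items():
        adj[u].append((v, w))
        adj[v].append((u, w))
    label = {n: n for n in nodes}     # every node starts in its own cluster
    for _ in range(iterations):
        order = nodes[:]
        rng.shuffle(order)            # visit nodes in random order
        for n in order:
            if not adj[n]:
                continue
            scores = defaultdict(float)
            for neighbor, w in adj[n]:
                scores[label[neighbor]] += w
            label[n] = max(scores, key=scores.get)  # adopt the strongest label
    return label

# Toy ego network of words distributionally similar to 'jaguar':
nodes = ["panther", "tiger", "leopard", "porsche", "bmw", "ferrari"]
edges = {("panther", "tiger"): 0.9, ("panther", "leopard"): 0.8,
         ("tiger", "leopard"): 0.7, ("porsche", "bmw"): 0.9,
         ("bmw", "ferrari"): 0.8, ("porsche", "ferrari"): 0.7}
print(chinese_whispers(nodes, edges))  # two clusters: cats vs. cars
```

Each induced cluster then becomes one sense entry in the model.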

By utilizing the hypernymy structure, we can aggregate context features on the concept level. E.g. jaguars, tigers and wolves are all animals, but in our corpus we only find sentences where tigers and wolves hunt. From this information, we can infer that jaguars can probably hunt as well, thus projecting contextual information through aggregation into the animal concept.
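A minimal sketch of this aggregation; the feature counts and the lemma#POS_id notation are invented for illustration:

```python
from collections import defaultdict

# Observed context features per word sense (invented counts).
features = {
    "tiger#NN_1":  {"subj-of(hunt)": 12, "subj-of(roar)": 5},
    "wolf#NN_1":   {"subj-of(hunt)": 9,  "subj-of(howl)": 7},
    "jaguar#NN_2": {"subj-of(roar)": 3},  # 'hunt' never observed with jaguar
}
hypernym = {sense: "animal#NN_1" for sense in features}

# Sum the hyponyms' features into their hypernym concept.
concept_features = defaultdict(lambda: defaultdict(int))
for sense, feats in features.items():
    for feature, count in feats.items():
        concept_features[hypernym[sense]][feature] += count

# 'jaguar' inherits 'subj-of(hunt)' through the animal concept:
print(concept_features["animal#NN_1"]["subj-of(hunt)"])  # 21
```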

We apply contextualization to obtain a sense-disambiguated corpus. Then we compute the similarity graph once again, this time using word sense IDs instead of words. This should result in a DT, where each entry is fully disambiguated, allowing us to create a more detailed and more precise model (more entries per word sense, disambiguated context features).

4.2 Contextualization

With our sense-disambiguated semantic models, we can perform semantic text annotation. We put a strong focus on the contextualization technique, since it connects the conceptualized knowledge to the text. Using word sense disambiguation on the input text, we infer the senses of the words. By using similar context features from the model, we are able to identify the word sense even if the term–context feature combination has never been observed in the corpus. Preliminary experiments have shown that utilizing similar features improves recall, with a slight decline in precision. To further increase recall, we will use co-occurrence features and a language model.
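The scoring could look like the following sketch: a sense is scored by its overlap with the observed context features, backing off to distributionally similar features for unseen ones. The feature weights, similarity lists and back-off scheme are assumptions for illustration, not the exact configuration of our system:

```python
def score_sense(context_feats, sense_feats, similar_feats):
    """Score one candidate sense against the observed context features,
    backing off to similar features when a feature is unseen for the sense."""
    score = 0.0
    for f in context_feats:
        if f in sense_feats:
            score += sense_feats[f]
        else:
            for g, sim in similar_feats.get(f, []):   # back-off expansion
                if g in sense_feats:
                    score += sim * sense_feats[g]
                    break
    return score

senses = {
    "jaguar#NN_2": {"subj-of(hunt)": 0.9, "amod(wild)": 0.4},  # animal sense
    "jaguar#NN_1": {"obj-of(drive)": 0.8, "amod(fast)": 0.5},  # car sense
}
similar = {"subj-of(chase)": [("subj-of(hunt)", 0.7)]}
context = ["subj-of(chase)", "amod(wild)"]  # 'chase' unseen with 'jaguar'

print(max(senses, key=lambda s: score_sense(context, senses[s], similar)))
# -> jaguar#NN_2, found via the similar feature subj-of(hunt)
```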

Once we have established a conceptualization with relations between concepts and aggregated context features per concept, we can even infer the concept of yet unseen words – a zero-shot contextualization. E.g. by matching the context features of concepts against the context of the unseen word, we can assume that X in the phrase X hunts its prey is an animal.
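A sketch of this inference under the same invented notation, matching the context of an unknown word against aggregated concept features:

```python
# Aggregated concept features, e.g. from the aggregation sketch above.
concept_features = {
    "animal#NN_1":  {"subj-of(hunt)": 21, "subj-of(roar)": 8},
    "vehicle#NN_1": {"obj-of(drive)": 15, "amod(fast)": 6},
}

def guess_concept(context_feats):
    """Pick the concept whose aggregated features best match the context."""
    return max(concept_features,
               key=lambda c: sum(concept_features[c].get(f, 0)
                                 for f in context_feats))

# 'X hunts its prey' -> X is most likely an animal:
print(guess_concept(["subj-of(hunt)"]))  # animal#NN_1
```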

4.3 Relation Extraction

The conceptualization yields a taxonomy graph with sense-labeled leaf nodes. We want to extend this ‘taxonomic skeleton’ into a dense graph with many types of relations. Our semantic model already contains a large number of facts, like jaguar is-an animal or cars can be driven. While such facts seem trivial, our model also contains many facts that are not considered common knowledge, e.g. impala is-a car (indicating a Chevrolet Impala). We employ Open Information Extraction (OIE) techniques [26] to extract additional semantic relations.

Using the disambiguated corpus, we propagate the dependency relations through the hypernym graph to extend our conceptualization. If we take the input phrase jaguar kills deer, we can extract a multitude of facts (a sketch of this expansion follows the list):

  • basic: jaguar kills deer

  • agent expansion: tiger kills deer

  • object expansion: jaguar kills prey

  • verb expansion: jaguar wounds deer

  • hypernymies: animal kills animal

  • combinations: animal kills deer, jaguar wounds animal, etc.
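The following sketch shows how such an expansion could be generated from the DT and the hypernym relation; the lookup tables are toy stand-ins for the actual model:

```python
from itertools import product

# Toy stand-ins for DT similarity and hypernymy lookups.
similar = {"jaguar": ["tiger"], "deer": ["prey"], "kill": ["wound"]}
hypernym = {"jaguar": "animal", "deer": "animal"}

def expand(subj, verb, obj):
    """Generate expanded relation triples from one extracted fact."""
    subjects = [subj] + similar.get(subj, []) + [hypernym.get(subj)]
    verbs    = [verb] + similar.get(verb, [])
    objects  = [obj]  + similar.get(obj, [])  + [hypernym.get(obj)]
    return {(s, v, o) for s, v, o in product(subjects, verbs, objects)
            if s and o}  # drop combinations with a missing hypernym

for triple in sorted(expand("jaguar", "kill", "deer")):
    print(triple)
# ('animal', 'kill', 'animal'), ('tiger', 'kill', 'deer'),
# ('jaguar', 'wound', 'deer'), ... and further combinations
```

In practice, each expanded triple would of course have to pass a significance filter before being added to the graph.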

To identify relations between named entities in particular [12], we need a larger, richer set of relations. Therefore, we are going to use supervised resources like WordNet to extract examples of a relation tuple \((R, T_1, T_2)\) and – based on these input relations – find patterns that can be used to extract further instances of such relations. This bootstrapping method is similar to [31].
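A minimal sketch of such bootstrapping over surface patterns; the seed relation, the two-sentence corpus and the pattern induction by taking the text between the two terms are deliberately simplistic:

```python
import re

seeds = [("capital-of", "Paris", "France")]
corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]

# 1) Induce surface patterns from the seed pairs
#    (assumes the first term occurs before the second).
patterns = set()
for rel, t1, t2 in seeds:
    for sent in corpus:
        if t1 in sent and t2 in sent:
            between = sent[sent.index(t1) + len(t1):sent.index(t2)]
            patterns.add((rel, between))

# 2) Apply the induced patterns to extract new relation instances.
for rel, between in patterns:
    regex = re.compile(r"(\w+)" + re.escape(between) + r"(\w+)")
    for sent in corpus:
        for t1, t2 in regex.findall(sent):
            print((rel, t1, t2))
# ('capital-of', 'Paris', 'France')
# ('capital-of', 'Berlin', 'Germany')
```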

4.4 Linked Open Data

To make our approach usable for other researchers, as well as to incorporate our conceptualizations into the SW, we are going to publish the models as Linked Open Data. We are going to build a semantic network similar to the hypergraph available in JoBimViz [30] and extend it with the concept taxonomies and contextualization techniques. This will allow browsing our conceptualizations via unique identifiers.

We are going to offer our contextualization technique through an open API. The annotated text would consist of a sequence of sense IDs, e.g. jaguar#NN_2 hunt#VB_1 deer#NN_1 for the source sentence jaguar hunts deer. To bridge the gap between our induced knowledge base and existing knowledge bases, we can create an alignment using Lesk [18]. This will increase usability, since users can obtain identifiers of established SW resources and use this information to semantify their texts.
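A possible alignment step, sketched with NLTK's WordNet interface and a simplified Lesk overlap between an induced sense cluster and synset glosses; the cluster words are invented:

```python
import re
from nltk.corpus import wordnet as wn  # requires nltk plus its wordnet data

def align_sense(cluster_words, lemma):
    """Map an induced sense (a bag of similar cluster words) onto the
    WordNet synset whose gloss and lemmas overlap most, Lesk-style."""
    cluster = {w.lower() for w in cluster_words}
    def overlap(synset):
        gloss = set(re.findall(r"\w+", synset.definition().lower()))
        gloss |= {l.name().lower() for l in synset.lemmas()}
        return len(gloss & cluster)
    return max(wn.synsets(lemma, pos=wn.NOUN), key=overlap)

# Invented cluster for the 'river bank' sense of bank:
print(align_sense(["river", "shore", "slope", "water"], "bank"))
# -> Synset('bank.n.01'), the sloping-land sense
```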

5 Preliminary Results

5.1 Conceptualization

Extending [5], we are now able to create semantic models that contain word senses and (unstructured) hypernyms. We call these models JoBimText models [30] and already use them for contextualization. The next step is to create a concept taxonomy with aggregated context features.

5.2 Contextualization

We have implemented a contextualization system that we are currently extending with new features for an upcoming publication. It performs sense annotation based on a context feature extractor, e.g. trigram or dependency features. Using a large language model and word co-occurrences, we achieve a performance comparable to the systems in SemEval 2013, Task 13 [23].

5.3 Relation Extraction

This task has not yet started, because it relies on a contextualized corpus.

5.4 Web Demonstrators/LOD

Our JoBimViz web application is used to demonstrate our semantic models [30]. It already features word identifiers, consisting of the model and a word representation (e.g. lemma#POS). It can be browsed as a semantic network by following the links, similarly to LOD repositories. Furthermore, it features a transparent Java API for machine access. As a demonstrator for contextualized corpora, we have created a semantic search demo based on Apache Solr and PHP. It incorporates keyword search as well as search for concepts, and displays possible MWE expansions.

6 Evaluation Plan

6.1 Conceptualizations

Using a path-based measure [25], we can assess the structural similarity of our conceptualizations to WordNet. The sense clustering method is flexible and allows for different granularities. We want to identify the best granularity settings extrinsically, by evaluating the performance of several clusterings (with different granularities) using the contextualization technique. For this, we use the Turk Bootstrap Word Sense Inventory (TWSI) 2.0 dataset [4]. It contains sense-annotated sentences from Wikipedia and a crowdsourced sense inventory with substitutions for about 1,000 nouns.
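As a sketch of such a comparison, assuming NLTK's WordNet interface and taking the first noun synset per word (the word pairs are invented, and the exact measure from [25] may differ):

```python
from nltk.corpus import wordnet as wn  # requires nltk plus its wordnet data

def taxonomy_agreement(pairs):
    """Average WordNet path similarity over word pairs that our induced
    taxonomy places under a common parent; higher = closer to WordNet."""
    scores = []
    for w1, w2 in pairs:
        s1 = wn.synsets(w1, pos=wn.NOUN)[0]
        s2 = wn.synsets(w2, pos=wn.NOUN)[0]
        scores.append(s1.path_similarity(s2))
    return sum(scores) / len(scores)

# Pairs our induced taxonomy groups under 'animal' (invented):
print(taxonomy_agreement([("jaguar", "tiger"), ("wolf", "deer")]))
```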

To verify our intuition that a model computed on a domain-specific corpus outperforms general or foreign-domain models, we plan to compute several models and cross-evaluate them.

6.2 Contextualization

To evaluate the performance of the contextualization system, we are going to use the TWSI dataset [4] here as well. It contains contextualized substitutions for about 150,000 sentences, a larger collection than those used in the SemEval WSD tasks. The TWSI dataset is mostly used for parameter tuning and for determining the best feature configuration. Once the best feature set is established, we are going to evaluate our contextualization on the SemEval 2010 [20] and SemEval 2013 [23] datasets. This allows us to compare our unsupervised contextualization technique to the state of the art, and possibly to participate in a future WSD challenge.

To evaluate the zero-shot contextualization, we can remove sentences with certain (even polysemous) target terms from the input corpus and create the conceptualization. Then we can input the sentences with the “unknown” words and evaluate the concept identification. To demonstrate improvements of the complex structured semantic model, we compare it with a simple distributional model.

6.3 Relation Extraction

The evaluation of relation extraction is challenging: most open information extraction systems rely on manually created datasets [10] for their evaluation. To evaluate our approach, we are going to apply relation extraction to a slightly different task: we are going to extract named entities like politicians (from news data) and use our conceptualization to identify the events and relations in which they are involved.

7 Conclusion

In this proposal, we have presented a framework for unsupervised conceptualization based on unstructured text collections. Its advantage is that the resulting models are tied to the input text, thus allowing for applications without domain adaptation. The generated models are data-driven and can therefore be created for every domain where large amounts of text are available. Using a contextualization technique, the framework creates a fully semantified sentence and document representation, which is tied to Linked Open Data resources.