Abstract
In Web search, entity-seeking queries often trigger a special question answering (QA) system. It may use a parser to interpret the question to a structured query, execute that on a knowledge graph (KG), and return direct entity responses. QA systems based on precise parsing tend to be brittle: minor syntax variations may dramatically change the response. Moreover, KG coverage is patchy. At the other extreme, a large corpus may provide broader coverage, but in an unstructured, unreliable form. We present AQQUCN, a QA system that gracefully combines KG and corpus evidence. AQQUCN accepts a broad spectrum of query syntax, from well-formed questions to short “telegraphic” keyword sequences. In the face of inherent query ambiguities, AQQUCN aggregates signals from KGs and large corpora to directly rank KG entities, rather than commit to one semantic interpretation of the query. AQQUCN models the ideal interpretation as an unobservable or latent variable. Interpretations and candidate entity responses are scored as pairs, by combining signals from multiple convolutional networks that operate collectively on the query, KG and corpus. On four public query workloads, amounting to over 8000 queries with diverse query syntax, we see 5–16% absolute improvement in mean average precision (MAP), compared to the entity ranking performance of recent systems. Our system is also competitive at entity set retrieval, almost doubling F1 scores for challenging short queries.
1 Introduction
A large fraction of Web queries involve and seek entities (Lin et al. 2012). Such queries may seek details of celebrities or movies (e.g., kingsman release date), historical events (e.g., Who killed Gandhi?), travel (e.g., nearest airport to baikal lake), and so on. Queries that match certain patterns are handed off to specialized QA systems that directly return entity responses from a KG. Sometimes, a semantic parse of the textual query is attempted (Berant et al. 2013; Yih et al. 2015) to translate it to a structured query over the KG, which is then executed to fetch a set of response entities. While providing precise answers if everything goes well, this approach to KG-driven QA is fraught with several difficulties.
-
The input textual query may range from grammatically well-formed questions (e.g., In which band did Jimmy Page perform before Led Zeppelin?) to free-form “telegraphic” keyword queries (e.g., band jimmy page was in before led zeppelin). QA systems are often brittle with regard to input syntax, backing off if the input does not match specific syntactic patterns.
-
A curated, structured collection of facts in a KG reduces the QA task to “compiling” the textual query into a structured form which is directly executed on the KG. But KG coverage is always patchy—with nodes and/or edges missing—particularly when less popular entities are concerned. For example, over 70% of people in Freebase do not have a place of birth in Freebase (West et al. 2014). If the types and relations expressed textually in the query cannot be mapped confidently to the KG, most QA systems back off.
-
Alternatively, one can extend IR-style text search by using an entity-annotated corpus of Web pages. Any text snippet s in the corpus which mentions entity e and also matches the question q well, can be considered supporting evidence for that entity e to be the answer for q. However, such evidence from the Web corpus can be noisy due to incorrect entity linking of s or q, and imperfect text matching between q and s.
Example query and response Figure 1 demonstrates the advantages and complexities of effective entity-level QA involving both KG and corpus. Tokens in the query have diverse, possibly overlapping roles. Specifically, a query span may hint at an entity, type or relation, or it can be used to match passages in the corpus. Understanding the roles and disambiguating the hint to respective semantic nodes in the KG (wherever applicable) helps interpret that query. For example, the query \(q=\)band jimmy page was in before led zeppelin has a reference to entities \(e_1={{\texttt{Jimmy\_Page}}}\) and \({\texttt{Led\_Zeppelin}}\). The set of such entities, grounded in the query, will be called \({\mathcal {E}}_1\) with members \(e_1, e'_1\), etc. Mentions in both the query and corpus documents are linked to entity nodes in the KG (e.g. jimmy page). The target type of the query will be denoted \(t_2\) and a candidate answer entity will be denoted \(e_2\). The set of all answer candidates will be called \({\mathcal {E}}_2\), and the set of gold (ground truth) entities will be called \({\mathcal {E}}^*_2\). Band hints at \(t_2 = {{\texttt {musical\_group}}}\) of the expected answer entity \(e_2 = {{\texttt {The\_Yardbirds}}}\). The (rather weak) hint was in hints at the relation \(r={{\texttt {/music/musical\_group/member}}}\) connecting \(e_1, e_2\). Thus, identifying \(e_1\), r and \(t_2\) can lead us to many candidate \(e_2\)s, \({\texttt{The\_Yardbirds}}\) being one of them. Yet, the query interpretation is not complete because an important token ‘before’ is not considered. If the KG does not have timestamps on membership, or the QA engine cannot do arithmetic with timestamps, passages in the corpus can still offer supplementary evidence by matching before with prior to, along with mentions of \(e_1\) and \(e_2\). Thus, using both KG and corpus allows combining structured and unstructured evidence to answer a query.
This example also serves to highlight our challenges. The various curved, colored lines in Fig. 1 map the query hints to the KG or the corpus evidence, either created during pre-processing stage (e.g. entity linking in the corpus) or at run-time (e.g. matching the target type \(t_2 = {{\texttt {musical\_group}}}\) to the query text Band through a type model). The machine-learnt models which map the hints to KG entities, types or relations need to handle a great deal of ambiguity, as a hint may match many correct and incorrect KG artifacts. Thus, there may be multiple KG subgraphs and corpus text snippets, each appearing to support different correct or incorrect entity candidates. There is thus a clear need for robust and seamless aggregation of supporting evidence across corpus and KG.
Our contributions We present a new QA system, AQQUCN, with these salient features:
-
AQQUCN is resilient to a spectrum of query styles, from syntactically well-formed questions to short ‘telegraphic’ Web queries. It does not attempt a grammar-based parse of the query.
-
AQQUCN uses KG and corpus signals in conjunction to score responses. Rather than a single comparison network between query and corpus (Severyn and Moschitti 2015; Bahdanau et al. 2014), AQQUCN uses a heterogeneous network architecture tailored to structural properties of queries.
-
Instead of choosing one structured interpretation and executing it on the KG to get a response set, AQQUCN is capable of directly ranking entities based on evidence pooled over multiple structured interpretations.
We review related work in Sect. 2. In Sect. 3 we give an overview of AQQUCN, and in Sect. 4 we describe all the modules in detail. In Sect. 5 we evaluate AQQUCN against recent competitive baseline systems. Our code with relevant data will be made available (CSAW 2018).
2 Related work
Recent QA systems are the result of convergence between several communities: Information Retrieval (IR), NLP, machine learning, and neural networks.
2.1 Corpus-oriented entity search
Early work in the IR community focused on corpus-driven QA in the TREC-QA track (Wang 2006; Cardie 2012). The Web and IR community has traditionally assumed a free-form query that is often ‘telegraphic’. Because Web search queries are far noisier, the goal of structure discovery is more modest. Indeed, in expert search, one of the earliest forms of corpus-based entity search, which focuses on finding experts (people) in a given field, query structure discovery was given no importance. State-of-the-art expert search systems (Balog et al. 2009; Macdonald and Ounis 2011; Petkova and Croft 2007) collect text snippets (or documents) containing query words, match each evidence snippet to an expert (e.g. using the signal that the expert’s name is mentioned in the snippet), and aggregate such evidence snippets to rank the experts. Generative language models (Balog et al. 2006), proximity-based kernels (Petkova and Croft 2007) and feature-based supervised discriminative learning (Fang et al. 2010) were evaluated to score the evidence match. Conversely, document retrieval can be improved by expanding the query with entity features (Dalton et al. 2014).
2.2 Entity search from knowledge graphs (KGQA/KBQA)
As the information extraction and NLP communities developed more tools for annotating corpus spans with named entity (NE) types (Ling and Weld 2012) and canonical entity IDs from KGs (Ganea and Hofmann 2017), corpus-based techniques were refined to match answer types (Murdock et al. 2012). With support from large KGs like Wikipedia and Freebase, the NLP community developed semantic parsers (Berant et al. 2013; Yao and Van Durme 2014; Yih et al. 2015; Kasneci et al. 2008; Pound et al. 2012; Yahya et al. 2012; Kwiatkowski et al. 2013) that translated natural language queries to a target graph query language similar to SPARQL. These approaches typically assume that question utterances are grammatically well-formed, from which precise clause structure, ground constants, variables, and connective relations can be inferred via semantic parsing. A similar system called AQQU (Bast and Haußmann 2015) emerged from the IR community. We base our system on it, so we will describe it separately in Sect. 3.1. Such approaches are often correlated with the assumption that all usable knowledge has been curated into the KG. The query is first translated to a structured form, which is then executed on the KG.
2.3 Combining corpus and KG
There is increasing interest in combining corpus and KG for QA. AQQUCN is related to the single-relation QA system described by Joshi et al. (2014). Unlike AQQUCN, they attempted an explicit 4-way segmentation of the query to identify the mention spans \({\tilde{e}}_{1}\) that mark grounded entities \(e_1\), the mention span \({\tilde{r}}\) that marks a hint to relation r, the span \({\tilde{t}}_{2}\) that marks a hint to the target type \(t_2\), and the remaining tokens \({\tilde{s}}\), which are designated as selectors meant to keyword-match corpus snippets. This overall query segmentation is denoted z. The segmentation z guides their system to propose structured KG artifacts \(e_1, r, t_2\) corresponding to query spans \({\tilde{e}}_{1}, {\tilde{r}}, {\tilde{t}}_{2}\). The candidate response entity \(e_2\) is then scored over admissible values of all latent variables (see Fig. 2). KG and corpus signals are unified as various factors in the model. We address two limitations in this system. First, we do not attempt a hard query segmentation. Second, we replace the traditional discrete language models that inform the factor potentials with continuous neural counterparts.
More work in this vein followed rapidly. Xu et al. (2016) presented a KGQA system with a corpus-based postprocessing pruning stage that removes candidates with weak corpus support. A more symmetric architecture called Text2KB was proposed by Savenkov and Agichtein (2016). A KGQA system as in Sect. 2.2 collects candidate answers. A corpus search collects snippets from top-ranking documents and annotates (Globerson et al. 2016; Ganea and Hofmann 2017) them with KG entities. For each candidate in the union, features are collected from both KG and corpus snippets to rank them. In a similar spirit, Xiong et al. (2017) propose AttR-Duet. Both the query and corpus passage are represented as bags of words as well as entities. Definition and mention texts in the KG and corpus are used to bridge between the space of entities and words, and four families of similarities (\(\{{\text {query}}, {\text {passage}}\} \times \{{\text {word}}, {\text {entity}}\}\)) are defined. These are then combined using a neural network. Reminiscent of how Joshi et al. (2014) incorporated a corpus-based factor/potential into a graphical model (Fig. 2), Bast and Buchhold (2017) proposed QLever, which extended (Chakrabarti 2010) SPARQL with predicates over an entity-annotated corpus. The primary focus of QLever is on high performance in the face of query clauses spanning KG and corpus indices, not ranking accuracy per se.
2.4 Complex QA using neural techniques
Early improvements to QA systems resulted from replacing discrete word matching and scoring with word vector counterparts. In corpus-based QA, Bordes et al. (2014) modeled queries and passages as bags of words and simply added up their word embeddings to represent and compare them. Yang et al. (2014) used embeddings to translate questions into relational predicates. More refined sentence/query embeddings have been created (Severyn and Moschitti 2015; Iyyer et al. 2014) via recurrent networks (RNNs) and convolutional networks (CNNs), but usually applied to syntax-rich, well-formed questions. Dong et al. (2015) obtained better accuracy than Bordes et al. (2014) by replacing the aggregated word vector query representation with multiple parallel CNNs for extracting deep representations for relation r between query entity (i.e. the entity mentioned in the query) and answer entity, answer type and the KG neighborhood of the query entity. While the above works dealt with entity retrieval, CNNs have also been actively explored in query-document matching for complex answer retrieval (Hui et al. 2017, 2018; MacAvaney et al. 2018). EviNets (Savenkov and Agichtein 2017) embeds the query and evidence passage as average word vectors. Then it collects various aggregates (Macdonald and Ounis 2011) of vector match scores, which are combined using a trained pooling network. None of these neural systems seek a structural understanding of the query and how its parts relate differentially to the KG and corpus. Recently, neural learning techniques are being used to translate very complex queries (Saha et al. 2018) like how many countries have more rivers than Brazil into multi-layer expression graphs or multi-step imperative programs (Andreas et al. 2016; Dong and Lapata 2016; Miller et al. 2016; Zhong et al. 2017; Reed and De Freitas 2015; Liang et al. 2016). 
The extreme complexity of the reinforcement learning formulations needed has, thus far, precluded scaling them to Web-scale corpora and noisy query and corpus annotation tools.
3 AQQUCN overview
AQQUCN, a system we have built by extending AQQU (Bast and Haußmann 2015), implements all the inference pathways shown in Fig. 1. Unlike AQQU and some other systems, the end goal of AQQUCN is not to “compile” the input into a structured query to execute on the KG, because broken/missing input syntax can make this attempt fail in brittle ways. Instead, the end goal of AQQUCN is to directly create a ranking over entities using KG and corpus, using interpretations as latent variables. We first review AQQU briefly, and then describe the three stages of AQQUCN in the rest of this section (Fig. 3).
3.1 AQQU review
AQQU interprets the input question with reference to three possible “query templates”, each having a direct translation to a SPARQL query. First, grounded entities \(e_1\) in the query are identified. Then, a guided expansion in the KG locates candidates \(e_2\), resulting in structured interpretations I that may take forms such as these one- and two-hop queries:
-
\(e_1 \xrightarrow{r} e_2\).
-
\(e_1 \xrightarrow{r_1} m \xrightarrow{r_2} e_2\), where m is a mediator ‘entity’ often representing a ternary relation.
AQQU then extracts features from question q, interpretation I (which could have a few distinct structures as above), and all the candidate \(e_2\)s together. These features are used in logistic regression or random forests to score each interpretation. The best interpretation (with unbound placeholders for \(e_2\) and m, if applicable) is then executed on the KG. During training, the gold interpretation is not known, but executing each interpretation gives a system response set \({\hat{{\mathcal {E}}}}_{2}\) that can be compared against the gold \({\mathcal {E}}^*_2\) to get an F1 score. This helps AQQU train logistic regression or random forests to score better interpretations higher than worse ones. The unit of scoring and ranking is a single interpretation, not an entity, nor a joint space of interpretation and entity, as is the case in AQQUCN. Moreover, AQQU does not use corpus signals.
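As a small illustration of the training signal described above, the following sketch computes the F1 score between the response set of each interpretation and the gold set. The function name `set_f1`, the interpretation labels and the toy response sets are hypothetical and are not taken from AQQU.

```python
def set_f1(predicted, gold):
    """F1 between a system response set and the gold entity set."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical response sets produced by executing two interpretations.
gold = {"The_Yardbirds"}
responses = {
    "I_member": {"The_Yardbirds", "The_Firm"},
    "I_genre": {"Hard_rock"},
}
f1_by_interpretation = {name: set_f1(ents, gold) for name, ents in responses.items()}
```

Interpretations with a higher F1 would then serve as the better-ranked examples when training the interpretation scorer.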
3.2 Modifications and new modules
Our implementation of AQQUCN is based on AQQU because it provides a well-written, reusable implementation of query entity linking and interpretation generation. We modify and enhance AQQU in the following ways, as also illustrated in Fig. 4 (for the query template). The important steps are listed below. (Item numbers correspond to modules in Fig. 4.)
-
1.
Given a query q, we first identify the set of in-query entities \({\mathcal {E}}_1\) using the widely used entity tagger TagMe (Ferragina and Scaiella 2010).
-
2.
The KG neighborhood of each \(e_1\) is explored to collect candidate \(e_2\)s, similar to prior KBQA (Knowledge Base driven QA) systems (Bast and Haußmann 2015; Yih et al. 2015; Xu et al. 2016). Like AQQU, we limit \({\mathcal {E}}_2\) to the set of entities occurring in the 2-hop KG neighborhood of any entity in \({\mathcal {E}}_1\).
-
3.
As the example in Fig. 1 illustrated, some evidence supporting the correct entity, such as before in the query and prior to in the snippet, may come from the corpus. As we shall see in Sect. 5, corpus evidence can greatly augment KG-based evidence. Therefore, we also gather text snippets (at roughly the granularity of sentences) that mention some \(e_1\) or words from the query.
-
4.
The next step is to identify the set of candidate answer entities, \({\mathcal {E}}_2\). Apart from KG neighborhoods of \(e_1\)s, we also collect (non-\(e_1\)) entities that occur in any snippet as candidate \(e_2\)s. This allows the system to recover from early errors, such as when \({\mathcal {E}}_1\) identified by the entity tagger is empty or wrong, or when the query and answer entities are more than two hops apart in the KG. The union of the candidates from the KG and the corpus forms \({\mathcal {E}}_2\).
-
5.
We use three neural modules to inform the score combination network, which replaces AQQU’s interpretation ranking module. The query corpus convnet (QCN), described in Sect. 4.1, scores the evidence in the context snippets, given the query and the candidate \(e_2\).
-
6.
Candidates \(e_2\) collected from the KG may be accompanied with designated types \(t_2\).
-
7.
The query type convnet (QTN), described in Sect. 4.2, scores the potential presence of a textual clue to \(t_2\) somewhere in the query.
-
8.
Candidate \(e_2\) found in the KG is also connected to \(e_1\) via a relation r.
-
9.
The query relation convnet (QRN), described in Sect. 4.3, scores the potential presence of a textual clue to r somewhere in the query.
-
10.
The score combination network works on the joint space of candidate interpretations and entities, ending with a ranking of candidate entities that may draw signals from multiple interpretations in general. AQQU scores are used as additional features (not shown). Outputs from the three convnets are wired together in a markedly non-uniform architecture, consistent with the inference pathways shown in Fig. 1.
A summary of notation introduced thus far is given in Table 1.
Unlike Joshi et al. (2014), AQQUCN does not attempt to segment the query into disjoint spans that describe \(e_1, r, t_2\), but lets multiple neural networks run over the query. This allows AQQUCN to process long queries that Joshi et al. (2014) could not. Moreover, a query span can inform multiple networks; consider queries who discovered penicillin and who discovered antarctica, where ‘who’ carries a lot less information about \(t_2\) than the \(e_1\) mentions ‘penicillin’ and ‘antarctica’.
4 Detailed design of AQQUCN modules
In this section we present the details of the new modules we added to AQQU: the query-corpus, query-type and query-relation networks, as well as the score combination network.
4.1 Query-Corpus Network (QCN)
For each query q, we get zero or more evidence snippets from the entity-annotated Web corpus. Each snippet contains a candidate answer entity \(e_2 \in {\mathcal {E}}_2\). We use the query corpus network (QCN) for assigning a relevance score to each snippet. This relevance score, and also the confidence score for linking \(e_2\) to a snippet, are features used in the final score combination network of AQQUCN (Sect. 4.4).
Given a query q and a snippet from the Web corpus, QCN should assign high score to the pair if the snippet contains evidence to correctly answer q. Training data for this network is in the form of positive and negative snippets for each training query. As manual generation of such labels involves considerable effort, we resort to (possibly noisy) indirect supervision instead. We treat all text snippets centered around a gold answer entity and containing at least one query word (non-stopword) as positively labeled snippets. Similarly, we treat all text snippets centered around any non-answer entity and containing some or all query words as negatively labeled snippets (examples in Table 2). To train this network, we use the state-of-the-art short text ranking system proposed by Severyn and Moschitti (2015). Once QCN is trained, we have a score for each snippet belonging to a candidate answer entity \(e_2 \in {\mathcal {E}}_2.\)
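The indirect supervision rule above can be sketched as follows. This is only an illustration: the stopword list is a toy assumption, the function name `label_snippet` is hypothetical, and we assume that the same non-stopword overlap test applies to negative snippets.

```python
# Toy stopword list (an assumption; the actual list is not specified in the text).
STOPWORDS = {"was", "in", "the", "a", "of", "to", "did", "which"}


def label_snippet(query, snippet_tokens, center_entity, gold_entities):
    """Return 1 (positive), 0 (negative) or None (snippet not used)."""
    query_words = {w for w in query.lower().split() if w not in STOPWORDS}
    snippet_words = {w.lower() for w in snippet_tokens}
    if not (query_words & snippet_words):
        return None  # no non-stopword query word occurs in the snippet
    # Positive if the snippet is centered on a gold answer entity.
    return 1 if center_entity in gold_entities else 0


query = "band jimmy page was in before led zeppelin"
gold = {"The_Yardbirds"}
pos = label_snippet(query, "Page played with the Yardbirds prior to Led Zeppelin".split(),
                    "The_Yardbirds", gold)
neg = label_snippet(query, "Led Zeppelin signed with Atlantic Records".split(),
                    "Atlantic_Records", gold)
```

Labels obtained this way are noisy by construction, since a snippet around a gold entity need not actually answer the query.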
While Siamese convolutional networks served well in the QCN module, the broad architecture of AQQUCN can accommodate competitive alternatives (Lv and Zhai 2009; Petkova and Croft 2007; Zhiltsov et al. 2015; Hui et al. 2017). Measuring the effect of this choice on QA accuracy is left for future work.
Match signals from multiple snippets supporting an entity \(e_2\) have to be aggregated before passing on to the combination network shown at the bottom of Fig. 4. Each candidate \(e_2\) may have a different number of supporting snippets. Usually, a number of standard aggregates (sum, max, etc.) are computed and then a weighted combination is learnt (Balog et al. 2009; Macdonald and Ounis 2006; Sawant and Chakrabarti 2013; Joshi et al. 2014). For our data sets, we found a simple sum of snippet scores to be adequate: we add up the snippet scores over all snippets belonging to an entity \(e_2\) for a query q, and use it as a feature for (q, \(e_2\)) (feature 1 in Table 5). More complex pooled aggregators can be explored in future work.
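The sum aggregation described above amounts to a group-by over candidate entities. A minimal sketch, assuming snippet scores arrive as (entity, score) pairs for one query (the function name and data layout are illustrative):

```python
from collections import defaultdict


def aggregate_snippet_scores(snippet_scores):
    """Sum QCN snippet scores per candidate entity for a single query.

    snippet_scores: iterable of (entity, score) pairs, one per snippet.
    """
    totals = defaultdict(float)
    for entity, score in snippet_scores:
        totals[entity] += score
    return dict(totals)


# Hypothetical snippet scores for one query.
scores = [("The_Yardbirds", 0.5), ("The_Firm", 0.25), ("The_Yardbirds", 0.25)]
feature = aggregate_snippet_scores(scores)
```

Entities without any supporting snippet are simply absent from the result, so a caller would treat a missing key as a zero-valued feature.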
4.2 Query-type network (QTN)
The query-type network (QTN) outputs a compatibility score between the query q and a candidate type \(t_2\). This is a multi-class, multi-label classification problem, as a query may imply more than one correct answer type (e.g. \({\texttt{/music/composer}}\) and \({\texttt{/music/artist}}\) for the query saturday night fever music band). Good quality training data in the form of (query, type) pairs is essential to ensure that the network learns to handle different types of (or the lack of) query syntax and its correspondence with the answer type. In our first attempt, we included all (q, t) pairs in the training data, where t was any type connected to the gold answer entity \(e_2\) for the query q in the training set. However, this strategy resulted in many spurious types. For example, \({\texttt{/broadcast/radio\_station\_owner}}\) is not the correct answer type for the query maya moore college, even if the answer entity \({\texttt{University\_of\_Connecticut}}\) belongs to that type. Therefore, we used human supervision. Paid student volunteers were asked to label (q, t) pairs as correct or incorrect, which helped remove approximately 30% of the (q, t) pairs as irrelevant and improved training data quality.
We found that, given the small number of training queries, the data obtained through the above process may not be enough to capture the variety of syntax used to imply a type. For robust training, we resorted to representing the type through additional patterns (Table 3) obtained as follows.
- Freebase relation names :
-
Consider a (\({\texttt{subject}}\), \({\texttt{relation}}\), \({\texttt{object}}\)) fact triple, e.g. (\({{\texttt {Captain\_America:\_The\_First\_Avenger}}}\), \({{\texttt {/film/film/prequel}}}\), \({{\texttt {Thor}}}\)). Relations in Freebase have composite names of the form “/x/y/z”, where x is a topical domain and y and z indicate the types for \({{\texttt {subject}}}\) and \({{\texttt {object}}}\). For example, \({\texttt{prequel}}\) is a type indicator word for \({{\texttt {Thor}}}\). Meanwhile, Freebase declares an expected type for the endpoint entities of each relation r. For \({{\texttt {/film/film/prequel}}}\), the expected end type is \({{\texttt {/film/film}}}\). We combine these two information nuggets and add \({{\texttt {prequel}}}\) as a pattern expressing the type \({{\texttt {/film/film}}}\).
- Freebase type names :
-
Ending substrings of the type name are also considered as patterns (e.g. ‘treatment’ for the type \({{\texttt {medical\_treatment}}}\)).
At the end of this exercise, we have zero or more patterns for each type (Table 3).
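The two pattern sources above can be sketched with simple string operations. The function names are hypothetical, and treating "ending substrings" as word-level suffixes of the last path segment (including the full name) is our assumption.

```python
def pattern_from_relation(relation_name, expected_object_type):
    """Use the last segment of a '/x/y/z' relation name as a pattern
    for the expected type of the relation's object."""
    last_segment = relation_name.strip("/").split("/")[-1]
    return expected_object_type, last_segment.replace("_", " ")


def patterns_from_type_name(type_name):
    """Word-level ending substrings of a type name (assumed reading)."""
    words = type_name.strip("/").split("/")[-1].split("_")
    return [" ".join(words[i:]) for i in range(len(words))]


rel_pattern = pattern_from_relation("/film/film/prequel", "/film/film")
name_patterns = patterns_from_type_name("medical_treatment")
```

Here `rel_pattern` pairs the type `/film/film` with the pattern `prequel`, matching the example in the text, and `name_patterns` contains `treatment` among the suffixes of `medical_treatment`.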
Figure 5 illustrates the design of QTN. This multi-class multi-label architecture is partly inspired by Kim (2014) and Severyn and Moschitti (2015). For each query q input to the network, the network provides an output score vector of size equal to the number of types, as follows:
-
1.
In the initial layer, each query word is represented as a vector embedding learnt during training. Then convolution and pooling layers are used to extract a fixed-length feature vector from the variable length input.
-
2.
In a separate layer, we compute word overlap features inspired by Severyn and Moschitti (2015). Specifically, we compute Jaccard similarity between the query and each type name, by representing each as a bag of words. We also compute Jaccard similarity between the query and each type pattern, and then take max over all patterns of a given type. This process results in two features for each (query, type) pair.
-
3.
Similar to the bag of words based overlap, we compute Jaccard similarity between the word stem (lemmatized) form for each query, type name and type pattern; resulting in two more features for each (query, type) pair.
-
4.
Once the overlap features as well as convolution-pooling features are computed, a fully connected hidden layer with sigmoid activation function is used at the last stage, to score all types.
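The word overlap features of steps 2 and 3 can be sketched as below. This is an illustration only: tokenization by whitespace is an assumption, and the stemmed variant would apply a stemmer or lemmatizer of choice to the tokens before calling the same function.

```python
def jaccard(a, b):
    """Jaccard similarity between two bags of words, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_features(query, type_name, type_patterns):
    """Two features for a (query, type) pair: similarity to the type
    name, and the maximum similarity over all patterns of the type."""
    q = query.lower().split()
    name_sim = jaccard(q, type_name.lower().split())
    pattern_sim = max((jaccard(q, p.lower().split()) for p in type_patterns),
                      default=0.0)
    return name_sim, pattern_sim


# Hypothetical type name and patterns for illustration.
name_sim, pattern_sim = overlap_features(
    "saturday night fever music band", "musical group", ["band", "music band"])
```

In this toy case the surface type name shares no word with the query, while the pattern `music band` does, which is the situation the additional patterns are meant to cover.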
4.3 Query-relation network (QRN)
The query-relation network (QRN) outputs a compatibility score between a candidate relation r and the query q. As with QTN, we generate multi-class, multi-label training data in the form of (q, r) pairs, where r is a relation connecting the gold answer entity \(e_2\) to the entity \(e_1\) mentioned in the query q. There could be multiple relations r for the same (\(q, e_1, e_2\)) tuple. This training data generation process is common in previous work (Dong et al. 2015; Yih et al. 2015; Bordes et al. 2014), and human curation is not used to remove noise.
Similar to QTN, we enrich our training data using relation description patterns. To generate these patterns, we start with (\({{\texttt {subject}}}\), \({{\texttt {relation}}}\), \({{\texttt {object}}}\)) facts in the KG and locate sentences in the annotated corpus where both \({{\texttt {subject}}}\) and \({{\texttt {object}}}\) are mentioned. We identify the path connecting the two in the dependency parse of the sentence, expressed as a sequence of lemmatized words. We count the number of times each path was found, and retain only the most frequent paths. This gives a bag of path patterns that describe relation r (examples in Table 4). The network architecture for QRN is the same as in Fig. 5. The only difference is that for QRN we have (query, relation) and (corpus pattern, relation) tuples as training instances.
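The counting and pruning of dependency paths can be sketched as follows, assuming the paths have already been extracted by a dependency parser. The cutoff `top_k` is a hypothetical parameter, since the text does not state how many paths are retained.

```python
from collections import Counter


def frequent_path_patterns(observed_paths, top_k=2):
    """Keep the top_k most frequent lemmatized paths per relation.

    observed_paths: iterable of (relation, path), where path is a
    sequence of lemmas found between subject and object mentions.
    """
    counts = {}
    for relation, path in observed_paths:
        counts.setdefault(relation, Counter())[tuple(path)] += 1
    return {rel: [p for p, _ in c.most_common(top_k)] for rel, c in counts.items()}


# Hypothetical extracted paths for one relation.
rel = "/music/musical_group/member"
observed = ([(rel, ("be", "member", "of"))] * 3
            + [(rel, ("join",))] * 2
            + [(rel, ("leave",))])
patterns = frequent_path_patterns(observed, top_k=2)
```

Each retained path can then be paired with its relation as a (corpus pattern, relation) training instance.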
4.4 Score combination network
Referring back to Fig. 4, QRN, QTN, and multiple QCNs send their scores as features to a final combination network that represents each candidate \(e_2\) as a feature vector and scores it in conjunction with \(I = \langle e_1, r, t_2 \rangle \). In abstract terms, if I denotes an interpretation and \(e_2\) a candidate entity, at this point we have a score matrix \(S(I, e_2)\) indexed by interpretations as rows and candidate response entities as columns.
During both training and inference, I is latent. Standard learning to rank methods (Liu 2009) are not directly applicable to our score combination network because of the latent variables implicit in interpretation I. In fact, the additional complications posed by these latent variables currently limit us to the relatively simple pairwise ranking paradigm with a linear scoring function (Joachims 2002). Direct optimization of listwise and setwise metrics in presence of latent variables is left as future work. In the rest of this section, we will describe three approaches to train and deploy the score combination network.
4.4.1 Features for score combination
The score combination module shown at the bottom of Fig. 4 uses a feature vector \(\phi \) to describe the match between query q, each candidate query interpretation \(I \in {\mathcal {I}}\) and each candidate answer entity \(e_2 \in {\mathcal {E}}_2\). These features are informed by three role-differentiated convolutional networks (described in detail in the rest of this section). These are
-
the query-corpus network (QCN) described in Sect. 4.1,
-
the query-type network (QTN) described in Sect. 4.2, and
-
the query-relation network (QRN) described in Sect. 4.3.
As in AQQU, we also include additional features such as entity tagger scores. Table 5 shows the complete list of features.
4.4.2 Single interpretation (AQQUCN-1)
In some data sets like SimpleQuestions (Bordes et al. 2015), each query, by construction, can be answered from the KG alone, using exactly one interpretation. This is largely true of WebQuestions (Berant et al. 2013) as well. As we report later, the post-facto single best (‘silver’, because the ‘gold’ interpretation is not provided) interpretation retrieves the gold entity set with accuracy much higher than any system. Therefore, all that remains is to try to infer the silver interpretation. This is exactly what AQQU attempts to do. AQQU first aggregates \(S(I,e_2)\) over all candidates \(e_2\) to get a per-interpretation score, which is used to rank them and choose the top interpretation. Training is provided by comparing the observed F1 scores of competing candidate interpretations. Our resulting system, AQQUCN-1, is similar to AQQU, except that we use convnets and draw on corpus information. The entity set retrieved by the single interpretation \({\tilde{I}}\) can then be sorted by decreasing \(S({\tilde{I}},e)\) for ranking, if needed.
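A minimal sketch of the single-interpretation procedure follows. The aggregation over candidates is taken to be a plain sum, which is our assumption, as the text does not name the aggregator; the data layout and function name are also illustrative.

```python
def rank_with_single_interpretation(S):
    """S maps interpretation -> {entity: S(I, e)}, listing only the
    entities retrieved by that interpretation."""
    # Per-interpretation score: aggregate over candidates (sum assumed).
    per_interp = {I: sum(scores.values()) for I, scores in S.items()}
    best = max(per_interp, key=per_interp.get)
    # Rank the entities retrieved by the chosen interpretation.
    ranked = sorted(S[best], key=S[best].get, reverse=True)
    return best, ranked


# Hypothetical score matrix with two interpretations.
S = {"I1": {"a": 0.9, "b": 0.4}, "I2": {"c": 0.5}}
best, ranked = rank_with_single_interpretation(S)
```

Entities supported only by other interpretations (here `c`) never enter the ranking, which is the limitation the following subsections relax.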
4.4.3 Allowing a limited number of interpretations (AQQUCN-FEW)
The assumption that a single interpretation can recall all relevant answer entities may not be valid in all situations. In particular, as we shall report later, interpretations derived from both KG and corpus can be necessary to cover the gold response entities.
In a set \({\mathcal {E}}_2\) of candidate entities \(e_2\), if the score of each \(e_2\) is determined by its best supporting interpretation, then the number of distinct interpretations supporting the candidate set \({\mathcal {E}}_2\) may approach \(|{\mathcal {E}}_2|\) itself. In the next section, we will allow that to happen freely. In this section, we will take a small step to generalize one interpretation to a limited number of interpretations.
Suppose the universe of available interpretations is \({\mathcal {I}}\), from which we can admit \({\mathcal {I}}' \subseteq {\mathcal {I}}\), with \(|{\mathcal {I}}'| \le K'\), while scoring all the candidate entities. The score of an entity is thereby restricted from \(\max _{I\in {\mathcal {I}}} S(I,e_2)\) to \(S(e_2) = \max _{I\in {\mathcal {I}}'} S(I,e_2)\). We are given the set of gold (relevant) entities \({\mathcal {E}}^+_2\). Let irrelevant candidates be called \({\mathcal {E}}^-_2\). Then we want \(S(e^+_2) > S(e^-_2)\) for any pair \(e^+_2\in {\mathcal {E}}^+_2, e^-_2 \in {\mathcal {E}}^-_2\), which is turned into a hinge loss \([S(e^-_2) + \varDelta - S(e^+_2)]_+\), where \(\varDelta \) is a margin hyperparameter and \([\bullet ]_+ \equiv \max \{0,\bullet \}\) is the hinge or ReLU operator. Summarizing, the loss we seek to minimize during training is
\[ \min _{{\mathcal {I}}' \subseteq {\mathcal {I}},\, |{\mathcal {I}}'| \le K'} \; \sum _{e^+_2\in {\mathcal {E}}^+_2} \sum _{e^-_2 \in {\mathcal {E}}^-_2} \Bigl [ \max _{I\in {\mathcal {I}}'} S(I,e^-_2) + \varDelta - \max _{I\in {\mathcal {I}}'} S(I,e^+_2) \Bigr ]_+ \qquad (1) \]
During inference, we do not know \({\mathcal {E}}^+_2, {\mathcal {E}}^-_2\). Therefore we find
and then sort candidate entities \(e_2\) by decreasing \(\max _{I \in {\mathcal {I}}^*} S(I,e_2)\). Evaluating either expression takes time exponential in \(K'\), but we expect \(K'\) to be very small, usually under 3 (set heuristically). While we can try to optimize expression (1) directly to learn model parameters inside \(S(I,e_2)\), the objective is highly nonconvex. We found it better to use the technique in the next section for training and use expression (2) for inference.
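The restricted scoring described in this section can be illustrated with a minimal sketch. It assumes that scores \(S(I,e_2)\) are already available as a nested dictionary, that an entity not supported by an interpretation receives a default score of 0, and that the admitted subset is found by exhaustive search; the function names and the toy data are hypothetical and are not taken from the AQQUCN code.

```python
from itertools import combinations

def entity_scores(S, subset, missing=0.0):
    """Score each entity by its best interpretation within `subset`.

    S maps interpretation -> {entity: score}. An entity that an
    interpretation does not support gets `missing` (an assumption).
    """
    entities = {e for I in S for e in S[I]}
    return {e: max(S[I].get(e, missing) for I in subset) for e in entities}

def pairwise_hinge_loss(scores, relevant, margin):
    """Sum of [S(e-) + margin - S(e+)]_+ over relevant/irrelevant pairs."""
    pos = [e for e in scores if e in relevant]
    neg = [e for e in scores if e not in relevant]
    return sum(max(0.0, scores[n] + margin - scores[p]) for p in pos for n in neg)

def best_subset_by_loss(S, relevant, k_max, margin):
    """Exhaustive search over subsets of at most k_max interpretations.

    The number of subsets grows exponentially with k_max, which is why
    k_max (K' in the text) has to stay very small.
    """
    best = None
    for k in range(1, k_max + 1):
        for subset in combinations(sorted(S), k):
            loss = pairwise_hinge_loss(entity_scores(S, subset), relevant, margin)
            if best is None or loss < best[0]:
                best = (loss, subset)
    return best

# Toy example: the gold entities a, b come from a KG interpretation, c from the corpus.
S = {
    "kg_rel": {"a": 1.0, "b": 0.75, "x": 0.25},
    "corpus": {"c": 0.75, "x": 0.5},
    "kg_type": {"x": 1.0},
}
relevant = {"a", "b", "c"}
one = best_subset_by_loss(S, relevant, k_max=1, margin=0.25)
two = best_subset_by_loss(S, relevant, k_max=2, margin=0.25)
```

In this toy data no single interpretation separates all gold entities from the distractor, whereas admitting two interpretations (one KG-based, one corpus-based) does.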
4.4.4 Allowing unlimited supporting interpretations (AQQUCN-ALL)
If a candidate entity \(e_2\) is supported by multiple interpretations I, a reasonable view is that the overall score of \(e_2\) is \(\max _I S(I,e_2)\), from the best supporting interpretation, which induces a ranking among candidate \(e_2\)s. The set of gold \(e_2^+\)s is then used to define a loss and train the combination network. We use a pairwise loss, comparing, for a fixed query q, a relevant entity \(e_2^+\) with an irrelevant entity \(e_2^-\):
where \(\varDelta \) is a margin hyperparameter. Note that the best supporting \(e_1^+, t_2^+, r^+\) for \(e_2^+\) may be different from the best supporting \(e_1^-, t_2^-, r^-\) for \(e_2^-\).
- \({\mathcal {Q}}\) is the set of queries, and q is one query.
- \(e_2^+\) is a relevant entity, \(e_2^-\) is an irrelevant entity, for query q.
- \(\phi \) is the feature vector (see Sect. 4.4.1) representing an interpretation, composed of \(e_1\) (one or more entities mentioned in query q), r (relation mentioned or hinted at in q), \(t_2\) (type mentioned or hinted at in q) and \(e_2\) (candidate answer entity). \(\phi \) incorporates inputs from the three convnets.
- \(\xi \) is a vector of non-negative slack variables.
- C is a balancing regularization parameter.
- w is the weight vector to be learnt.
The max in the LHS of constraint (3) leads to nonconvexity, which we address by introducing auxiliary variables \(u(q, e_1^+, t_2^+, r^+; e_2^+)\) for each relevant candidate entity in the following optimization.
For tractability, we relax the 0/1 constraint over u variables to the continuous range [0, 1]:
The relaxation does not correspond to any discrete interpretation, but is a device to make the optimization tractable. We obtain a local optimum for (4) by alternately updating w and u. Each of these is a convex optimization problem and can be directly and efficiently solved. Figure 6 shows the pseudocode for inference in our proposed system. Through optimization (4), AQQUCN integrates query interpretation and entity response ranking into a unified framework, rather than a two-stage compile-and-execute strategy common in other QA systems, which effectively gambles on one best structured interpretation.
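The alternating scheme can be illustrated with a simplified sketch for a single query. It is not the optimization used in the paper: it assumes a linear scorer \(w \cdot \phi \), commits each relevant entity to one interpretation (a hard 0/1 choice rather than the relaxed u in [0, 1]), and replaces the convex solver with plain subgradient steps. The function names, the L2 weight `reg` and the step size `lr` are hypothetical.

```python
import numpy as np

def entity_score(w, feats):
    """Return (index of best interpretation, max_I w . phi(I, e)) for one entity.

    feats has shape (num_interpretations, d): one phi vector per interpretation.
    """
    scores = feats @ w
    best = int(np.argmax(scores))
    return best, float(scores[best])

def train_alternating(pos_feats, neg_feats, margin=1.0, reg=0.01, lr=0.1,
                      rounds=5, steps=50):
    """Alternate between choosing interpretations (u-step) and updating w."""
    w = np.zeros(pos_feats[0].shape[1])
    for _ in range(rounds):
        # u-step: with w fixed, commit each relevant entity to its best interpretation.
        chosen = [f[entity_score(w, f)[0]] for f in pos_feats]
        # w-step: with those choices fixed, the regularized pairwise hinge
        # objective is convex in w; take subgradient steps on it.
        for _ in range(steps):
            grad = reg * w
            for phi_pos in chosen:
                for f in neg_feats:
                    j, s_neg = entity_score(w, f)
                    if s_neg + margin - float(phi_pos @ w) > 0:
                        grad = grad + f[j] - phi_pos
            w = w - lr * grad
    return w

# Toy data: one relevant and one irrelevant entity, two interpretations each.
pos_feats = [np.array([[1.0, 0.0], [0.0, 0.2]])]
neg_feats = [np.array([[0.0, 1.0], [0.2, 0.0]])]
w = train_alternating(pos_feats, neg_feats)
pos_score = entity_score(w, pos_feats[0])[1]
neg_score = entity_score(w, neg_feats[0])[1]
```

Fixing the interpretation of each relevant entity is what removes the max from the left-hand side of the constraint; the max over interpretations of an irrelevant entity can stay, since it keeps the hinge term convex in w.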
4.5 End-to-end versus modular training
In recent years, end-to-end training of complex neural architectures has lost some appeal. Shalev-Shwartz and Shashua (2016) showed that the sample complexity of end-to-end training can be exponentially larger than the modular training of individual stages. Roth (2017) made similarFootnote 8 arguments about some NLP tasks. With complex notions of ‘match’ (Fig. 1), a commensurately complex network (Fig. 4) to train, and comparatively few training instances \((q, E_2^*)\) that do not come with gold structured interpretations, we, too, chose modular training of QRN, QTN and QCN, followed by training the score combination network. It might be argued that the additional labeled data used to train individual modules renders unfair the competition between various systems. While there is some validity to this protest, many open-domain QA systems already use externally trained word embeddings (Bast and Haußmann 2015; Yih et al. 2015; Xu et al. 2016), externally-trained target type recognizers (Murdock et al. 2012; Yavuz et al. 2016), and augmented training from SimpleQuestions (Bordes et al. 2015).
4.6 Set retrieval versus ranking and the threshold module
Choosing \({\mathcal {I}}'\) and then ranking candidate entities is the most natural method to drive AQQUCN. In this mode, AQQUCN can be directly compared against any other entity ranking system, such as the one by Joshi et al. (2014). On the other hand, comparing AQQUCN-FEW and AQQUCN-ALL with other systems that retrieve entity sets is not directly possible, unless the desired size of the entity set is specified, or the ranked list is somehow truncated. One way to approach this is to threshold the ranked list based on some criterion. We explore two thresholding strategies. In the first strategy, we set the score threshold to a fraction x of the top-ranked entity’s score (“relative threshold”). Tuned on held-out data, x turned out to be 0.95. In the second strategy (which we refer to as “ideal threshold”), we threshold at the position that gives the best F1 obtainable from the ranking output by our system. The first strategy gives an unfair advantage to existing KBQA systems, whereas the second gives an unfair advantage to our system and merely provides an idealized, non-constructive upper bound on F1.
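The two thresholding strategies can be sketched as follows. This is an illustration rather than the system's code: it assumes the ranked list is given as (entity, score) pairs in decreasing score order with a positive top score, and the function names are hypothetical.

```python
def relative_threshold(ranked, x=0.95):
    """Keep entities whose score is at least a fraction x of the top score."""
    if not ranked:
        return []
    cutoff = x * ranked[0][1]  # assumes the top score is positive
    return [e for e, s in ranked if s >= cutoff]

def f1(retrieved, gold):
    """F1 of a retrieved entity list against the gold entity set."""
    tp = len(set(retrieved) & set(gold))
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def ideal_threshold(ranked, gold):
    """Best F1 over all cut positions; needs the gold set, so it is only an upper bound."""
    best = (0.0, [])
    for k in range(1, len(ranked) + 1):
        prefix = [e for e, _ in ranked[:k]]
        score = f1(prefix, gold)
        if score > best[0]:
            best = (score, prefix)
    return best

ranked = [("a", 1.0), ("b", 0.96), ("c", 0.5), ("d", 0.4)]
gold = {"a", "c"}
kept = relative_threshold(ranked)          # uses only scores
best_f1, best_prefix = ideal_threshold(ranked, gold)  # uses the gold set
```

The relative threshold is usable at inference time because it looks only at scores, while the ideal threshold inspects the gold set and therefore separates ranking quality from thresholding quality.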
5 Experiments and results
5.1 Testbed
KG We used Freebase (Bollacker et al. 2008) as the KG, specifically, the OpenLink Virtuoso snapshot provided with AQQU. It provides 2.9 billion relation facts on 44 million entities. The set of answer types, with about 4976 member types, is also curated and provided with AQQU. It should be possible to adapt other KGs such as WikiData for use with AQQU and AQQUCN.
Annotated corpus We used ClueWeb09B (ClueWeb09 2009) as the corpus. It has about 50 million Web documents in WARC format, same as the rest of ClueWeb09 and ClueWeb12 (over 500 million documents each). Any corpus in WARC format can be used with AQQUCN, assuming it has useful entity annotations. For reproducibility, we used the public FACC1 entity annotations released by Google (Gabrilovich et al. 2013). The typical document has 13–15 entities annotated. Given enough computational capacity, one can run an entity tagger like TagMe (Ferragina and Scaiella 2010) over the corpus. The number of tokens in each corpus snippet was limited to 20, based on hyperparameter tuning in early versions of the system. We also verified that minor changes to snippet length did not change the results noticeably.
Query sets Joshi et al. (2014) provided syntax-poor translations (CSAW 2018) of syntax-rich queries from TREC and INEX question answering competitions, as well as a fraction of syntax-rich queries from WebQuestions (Berant et al. 2013). This gave us four query sets summarized in Table 6 and called TREC-INEX-KW, TREC-INEX, WebQuestions-KW and WebQuestions.
By design, all WebQuestions queries can be answered using the Freebase KG. In contrast, only 57% of TREC-INEX queries can be answered from KG alone under the restriction that \(e_1\) and \(e_2\) lie within two hops. Thus corpus evidence is important for TREC-INEX.
Convnet training protocol Data used to train the convnets is available (CSAW 2018). Some important design choices are described below.
- QTN and QRN: Initial word vectors are learnt using the CNN-non-static version of Kim (2014). Filter sizes are set to 3 and 4 with 150 feature maps each. Drop-out rate is 0.5, with 100 epochs and early stopping using a validation split of 10%. Training is done through stochastic gradient descent over shuffled mini-batches with the Adadelta update.
- QCN: The width of the convolution filters is set to 5, the number of convolutional feature maps 150, batch size 50. L2 regularization term is \(10^{-5}\) for the parameters of convolutional layers and \(10^{-4}\) for all the others. The dropout rate is set to 0.5. We initialize the word vectors using the word embeddings trained by Huang et al. (2012).
Score combination network training protocol All KG paths emanating from a query entity \(e_1\) can contribute to candidate answer set \({\mathcal {E}}_2\). We use the pruning process of Bast and Haußmann (2015) to restrict the set of KG queries (and hence \({\mathcal {E}}_2\)) to a practical size. We normalize all feature values to [0, 1] before sending them to the score combination stage.
Evaluation protocol We evaluate entity ranking using measures common in Information Retrieval, applied to the response list of entities: mean average precision (MAP), mean reciprocal rank (MRR), and normalized discounted cumulative gain at rank 10 (NDCG@10). For ranking evaluation, the threshold module of Sect. 4.6 is not applied. For set retrieval evaluation, the system output entity set \({\mathcal {E}}_2\) is created by thresholding the ranked list. It is then compared against gold set \({\mathcal {E}}^*_2\) to compute recall, precision and F1.
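For reference, the three ranking measures can be computed per query as in the sketch below, assuming binary relevance (an entity is either in the gold set or not) and a logarithmic base-2 discount for NDCG; MAP and MRR are then means over queries. The function names are illustrative and are not those of any particular evaluation toolkit.

```python
import math

def average_precision(ranked, gold):
    """Mean of precision values at the ranks where gold entities appear."""
    hits, total = 0, 0.0
    for i, e in enumerate(ranked, start=1):
        if e in gold:
            hits += 1
            total += hits / i
    return total / len(gold) if gold else 0.0

def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold entity, or 0 if none is retrieved."""
    for i, e in enumerate(ranked, start=1):
        if e in gold:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, gold, k=10):
    """Binary-gain NDCG@k: DCG of the list divided by DCG of an ideal list."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, e in enumerate(ranked[:k], start=1) if e in gold)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0

def mean_over_queries(metric, runs):
    """runs: list of (ranked, gold) pairs, one per query."""
    return sum(metric(r, g) for r, g in runs) / len(runs)

runs = [(["x", "a", "y", "b"], {"a", "b"}), (["c", "z"], {"c"})]
map_score = mean_over_queries(average_precision, runs)
mrr_score = mean_over_queries(reciprocal_rank, runs)
ndcg_score = mean_over_queries(ndcg_at_k, runs)
```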
5.2 QRN and QTN heatmaps
To understand the workings of QTN and QRN, we used the public implementation of Local Interpretable Model-Agnostic Explanations (LIME)Footnote 9 (Ribeiro et al. 2016). LIME linearly approximates a neural model’s behavior in the vicinity of a particular instance to detect the sensitivity of a label decision to input features. Figure 7 illustrates various sentences, their top predicted classes, and the sensitivity to each query word—positive, neutral, or negative—in predicting that class. The observed polarities and intensities were generally intuitive.
5.3 Entity ranking comparison
The vast majority of KBQA papers report set retrieval accuracy in terms of recall, precision and F1. To our knowledge, only Joshi et al. (2014) report entity ranking accuracy. We compare AQQUCN against their system in Table 7, using standard ranking performance measures: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). In Table 7, we see 5–16% absolute improvement in mean average precision (MAP) over various query sets. The improvement is statistically significant (at \(p < 0.0001\)). Ablation studies in Sect. 5.6 suggest some causes for the improvements.
5.4 Effect of number of interpretations allowed
Figure 8 shows interpolated precision against recall, without thresholding, for the variants AQQUCN-1, AQQUCN-FEW and AQQUCN-ALL. The trends on the two datasets are opposites: on TREC-INEX, performance increases with the number of interpretations, but on WebQuestions, allowing more interpretations reduces F1. This makes sense, because 85% of WebQuestions queries can be answered using a single relation (Yao 2015) and without corpus support. AQQUCN-FEW restricts the number of interpretations and limits the damage.
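As a reminder of how an interpolated precision-recall curve is obtained from a ranked list, the sketch below computes, for one query, the highest precision achieved at any rank whose recall is at least a given level. The choice of evenly spaced recall levels and the function name are assumptions for illustration; the text does not specify which recall levels Fig. 8 uses.

```python
def interpolated_precision(ranked, gold, levels=11):
    """Interpolated precision at `levels` evenly spaced recall points in [0, 1]."""
    if not gold:
        return [0.0] * levels
    points = []  # (recall, precision) after each rank
    hits = 0
    for i, e in enumerate(ranked, start=1):
        if e in gold:
            hits += 1
        points.append((hits / len(gold), hits / i))
    curve = []
    for j in range(levels):
        r = j / (levels - 1)
        # Best precision at any rank that reaches recall >= r (0 if none does).
        curve.append(max((p for rec, p in points if rec >= r), default=0.0))
    return curve

curve = interpolated_precision(["a", "x", "b"], {"a", "b"}, levels=3)
```

Averaging such curves over all queries of a workload gives one curve per system variant.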
For AQQUCN-1, note that features (2–26) in Table 5 are unique to an interpretation. Consequently, when an entity has no support from QCN, particularly when the entity is rare or absent in the corpus, the scores of entities retrieved by an interpretation are all the same because of identical feature vectors. This situation is not so rare that we can ignore its effects. In such cases, even if AQQUCN-1 retrieves a set of reasonable quality, its ranking results from arbitrary tie-breaking. AQQUCN-ALL and AQQUCN-FEW are largely immune to this problem, because ties are less likely among the entity scores \(\max _I S(I,e_2)\).
Table 8 reports F1 scores for AQQUCN-1, AQQUCN-FEW and AQQUCN-ALL, after applying thresholding. The trends are similar to those shown in Fig. 8. WebQuestions has somewhat larger \({\mathcal {E}}^*_2\) on average (natural queries: 2.4; telegraphic: 2.1), but KG-based interpretations are frequently adequate. Each KG-based interpretation covers more entities, and therefore, relatively fewer interpretations are needed to cover \({\mathcal {E}}^*_2\). In contrast, TREC-INEX has smaller \({\mathcal {E}}^*_2\)s on average (natural queries: 1.5; telegraphic: 1.4). But TREC-INEX depends on corpus-based interpretations, each of which typically covers only one entity. Therefore, TREC-INEX needs more interpretations to get good F1 scores.
Given the almost exclusive focus on WebQuestions and SimpleQuestions in prior work, these important considerations were discovered only after instrumenting AQQUCN. Hereafter, we will refer to the best performing variant among AQQUCN-1, AQQUCN-FEW and AQQUCN-ALL as AQQUCN-Best.
5.5 Entity set retrieval comparison
In Table 9, we compare F1 of entity set retrieval across severalFootnote 10 KBQA systems (CodaLab 2016). AQQUCN is presented with both relative threshold (AQQUCN-Best) and ideal threshold, to separate the quality of ranking versus thresholding. Also quoted is the F1 score achievable in principle if the single best interpretation is used from KG and corpus.
The first striking observation on Table 9 is that, for TREC-INEX, using a single interpretation is a terrible plan; both AQQUCN-Best and AQQUCN-ALL with ideal threshold are far better. Predictably, without access to a corpus, Berant and Liang (2015) and Bast and Haußmann (2015) perform poorly. In sharp contrast, owing to the nature of WebQuestions, a clairvoyant choice of the single best interpretation beats everything else (including Savenkov and Agichtein (2016) with ideal threshold) by a very wide margin. While AQQUCN with ideal threshold far exceeds all other systems, it is not even close to the best single interpretation. In terms of non-clairvoyant achievable accuracies, AQQUCN-Best is visibly best for WebQuestions-KW, and ranked second for WebQuestions. Clearly a great deal of ground has been covered since 2013 for WebQuestions, yet there is plenty of room to improve. It is also clear that AQQUCN is much better at ranking than thresholding.
5.6 Ablation tests
To understand the contributions of the three convnets to the score combination network, we removed each network in turn, re-trained the best performing model, and tabulated the resulting F1 scores in Table 10. TREC-INEX (in both query forms) suffers a serious hit if QCN is removed. This makes sense because of the critical evidence brought in by corpus snippets in case of TREC-INEX. The effect of removing QCN is smaller for WebQuestions, but still visible, showing that, even if gold entities are in the KG, corpus evidence can help score them better.
The influence between QRN and QTN is more nuanced. The relation r involved in a query may assert strong selectional preferences on the types of participating entities. Therefore, an accurate r predicted by the QRN can often make up for mistakes made in predicting \(t_2\) by the QTN. Conversely, accurately predicting \(t_2\) may mitigate a misleading choice of r. Overall, though, Table 10 shows at least some performance reduction if either QRN or QTN is removed. For the terse queries in WebQuestions-KW, removing QTN hurts more than removing QCN.
5.7 Wins and losses
We performed a side-by-side analysis of a sample of queries for which we found our system to be doing better and worse than related work. Our system improved on some queries containing qualifiers such as ‘first’, ‘oldest’, since we harness signals from the text corpus. For example, Who was the first U.S. president ever to resign? can be translated to a complex graph query involving a max/sort over dates, making it difficult to interpret it using only the knowledge graph. Yih et al. (2015) handled some of these queries using extensively hand-engineered features. However, such information was readily found in the Web corpus (examples in Fig. 2). The corpus also helped when the KG was incomplete (e.g., president sworn on airplane) or for answering queries with no clear \(e_1\) (e.g., which kennedy died first?).
We performed worse on some queries, especially when the corpus signal added more noise than information and our type or relation CNNs were not able to narrow down to the correct answer. Some of these queries had a non-trivial syntactic structure, possibly not captured by the corpus; one example is What nation is home to the Kaaba? At times, high annotation density in the corpus promoted popular non-answer entities over not-so-popular answer entities. For the query creator of the daily show, \({{\texttt {Jon\_Stewart}}}\) ranked above \({{\texttt {Madeleine\_Smithberg}}}\), purely based on corpus popularity. Such cases highlight an opportunity to improve our corpus, type and relation CNNs.
6 Conclusion
We presented AQQUCN, a system that unifies structured interpretation of queries with ranking of response entities. Apart from seamlessly integrating corpus and KG information, AQQUCN has two salient features: it can deal with the full spectrum of query styles between keyword queries and well-formed questions; and it directly ranks response entities, rather than ‘compile’ the input to a structured query and execute that on the KG alone.
Notes
These are known as KBQA or “Knowledge Base Question Answering” systems.
Our system is named AQQUCN because it augments the AQQU system of Bast and Haußmann (2015) with convolutional networks.
Hint of relation r might be distributed among multiple disjoint spans, but this is not a serious problem for our proposed system because we allow spans with multiple roles.
As in all QA systems, \(e_1\)-linking accuracy does affect QA accuracy, but the variation is hard to characterize without a battery of entity linking methods with carefully controlled recall/precision profiles. AQQU gave slightly better accuracy with TagMe than with its own linker, so we used TagMe for all experiments. SMAPH (Cornolti et al. 2014) would be a better choice, but it is provided only as a network service, and it needs Google search as yet another level of network service, which has severe usage volume restriction.
Two hops are needed to traverse mediator nodes like m.
For simplicity, we describe the single-relation case; multi-hop cases with mediator nodes are handled analogously.
We use K for the number of top entities in the response to the user, and \(K'\) for the number of interpretations to be used internally.
Also see Chapter 11 (End-to-end Deep Learning) of http://www.mlyearning.org/.
For three cases, only AQQU (Bast and Haußmann 2015) and Sempre (Berant and Liang 2015) code were available. Text2KB is available at https://github.com/DenXX/aqqu, but with missing corpus files and no format specification.
References
Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016). Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. CoRR. arXiv:1409.0473
Balog, K., Azzopardi, L., & de Rijke, M. (2006). Formal models for expert finding in enterprise corpora. In SIGIR conference (pp. 43–50). https://doi.org/10.1145/1148170.1148181. http://staff.science.uva.nl/~kbalog/files/sigir2006-expertsearch.pdf
Balog, K., Azzopardi, L., & de Rijke, M. (2009). A language modeling framework for expert finding. Information Processing and Management, 45(1), 1–19. https://doi.org/10.1016/j.ipm.2008.06.003
Bast, H., & Buchhold, B. (2017). QLever: A query engine for efficient sparql+text search. In CIKM (pp. 647–656). https://github.com/ad-freiburg/QLever
Bast, H., & Haußmann, E. (2015). More accurate question answering on freebase. In CIKM (pp. 1431–1440). http://ad-publications.informatik.uni-freiburg.de/CIKM_freebase_qa_BH_2015.pdf
Berant, J., & Liang, P. (2015). Imitation learning of agenda-based semantic parsers. TACL 3, 545–558. https://www.transacl.org/ojs/index.php/tacl/article/viewFile/646/160
Berant, J., Chou, A., Frostig, R., & Liang, P. (2013). Semantic parsing on Freebase from question-answer pairs. In EMNLP conference (pp. 1533–1544). http://aclweb.org/anthology//D/D13/D13-1160.pdf
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008). Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD conference (pp. 1247–1250). http://ids.snu.ac.kr/w/images/9/98/sc17.pdf
Bordes, A., Chopra, S., & Weston, J. (2014). Question answering with subgraph embeddings. arXiv preprint arXiv:1406.3676
Bordes, A., Usunier, N., Chopra, S., & Weston, J. (2015). Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075
Cardie, C. (2012). CS 4740: Introduction to natural language processing. https://www.cs.cornell.edu/courses/cs4740/2012sp/lectures.htm
Chakrabarti, S. (2010). Bridging the structured-unstructured gap: Searching the annotated Web. Keynote talk at WSDM 2010. https://www.cse.iitb.ac.in/~soumen/doc/wsdm2010/TalkSlides.pdf
ClueWeb09. (2009). http://www.lemurproject.org/clueweb09.php/
CodaLab. (2016). Webquestions benchmark for question answering. http://bit.ly/2kvXroJ
Cornolti, M., Ferragina, P., Ciaramita, M., Rued, S., & Schuetze, H. (2014). The SMAPH system for query entity recognition and disambiguation. In ERD challenge workshop. https://research.google.com/pubs/archive/42720.pdf
CSAW. (2018). The CSAW project at IIT Bombay. https://www.cse.iitb.ac.in/~soumen/doc/CSAW/
Dalton, J., Dietz, L., & Allan, J. (2014). Entity query feature expansion using knowledge base links. In SIGIR conference. http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1143
Dong, L., & Lapata, M. (2016). Language to logical form with neural attention. In ACL (Vol. 1, pp. 33–43). arXiv:1601.01280.
Dong, L., Wei, F., Zhou, M., & Xu, K. (2015). Question answering over freebase with multi-column convolutional neural networks. In ACL conference.
Fang, Y., Si, L., & Mathur, A. P. (2010). Discriminative models of integrating document evidence and document-candidate associations for expert search. In SIGIR conference. http://www.cs.purdue.edu/homes/fangy/SIGIR2010_Expert_Search.pdf
Ferragina, P., & Scaiella, U. (2010). TAGME: On-the-fly annotation of short text fragments (by wikipedia entities). arXiv:1006.3498.
Gabrilovich, E., Ringgaard, M., & Subramanya, A. (2013). FACC1: Freebase annotation of ClueWeb corpora. http://lemurproject.org/clueweb12/, version 1 (Release date 2013-06-26, Format version 1, Correction level 0).
Ganea, O. E., & Hofmann, T. (2017). Deep joint entity disambiguation with local neural attention. arXiv preprint arXiv:1704.04920.
Globerson, A., Lazic, N., Chakrabarti, S., Subramanya, A., Ringgaard, M., & Pereira, F. (2016). Collective entity resolution with multi-focal attention. In ACL conference (pp. 621–631). https://www.aclweb.org/anthology/P/P16/P16-1059.pdf
Huang, E. H., Socher, R., Manning, C. D., & Ng, A. Y. (2012). Improving word representations via global context and multiple word prototypes. In ACL conference.
Hui, K., Yates, A., Berberich, K., & de Melo, G. (2017). PACRR: A position-aware neural IR model for relevance matching. arXiv preprint arXiv:1704.03940.
Hui, K., Yates, A., Berberich, K., & de Melo, G. (2018). Co-PACRR: A context-aware neural IR model for ad-hoc retrieval. In WSDM conference (pp. 279–287).
Iyyer, M., Boyd-Graber, J., Claudino, L., Socher, R., & Daumé, III. H. (2014). A neural network for factoid question answering over paragraphs. In EMNLP conference (pp. 633–644).
Joachims, T. (2002). Optimizing search engines using clickthrough data. In SIGKDD conference, ACM (pp. 133–142). http://www.cs.cornell.edu/People/tj/publications/joachims_02c.pdf
Joshi, M., Sawant, U., & Chakrabarti, S. (2014). Knowledge graph and corpus driven segmentation and answer inference for telegraphic entity-seeking queries. In EMNLP conference (pp. 1104–1114). http://www.emnlp2014.org/papers/pdf/EMNLP2014117.pdf, download http://bit.ly/1OCKbVW
Kasneci, G., Suchanek, F. M., Ifrim, G., Ramanath, M., & Weikum, G. (2008). NAGA: Searching and ranking knowledge. In ICDE, IEEE.
Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Kwiatkowski, T., Choi, E., Artzi, Y., & Zettlemoyer, L. S. (2013). Scaling semantic parsers with on-the-fly ontology matching. In EMNLP conference (pp. 1545–1556). http://homes.cs.washington.edu/~lsz/papers/kcaz-emnlp13.pdf
Liang, C., Berant, J., Le, Q., Forbus, K. D., & Lao, N. (2016). Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv:1611.00020.
Lin, T., Pantel, P., Gamon, M., Kannan, A., & Fuxman, A. (2012). Active objects: Actions for entity-centric search. In WWW conference, ACM (pp. 589–598). https://doi.org/10.1145/2187836.2187916. http://research.microsoft.com/apps/pubs/default.aspx?id=161389
Ling, X., & Weld, D. S. (2012). Fine-grained entity recognition. In AAAI conference. http://xiaoling.github.io/pubs/ling-aaai12.pdf
Liu, T. Y. (2009). Learning to rank for information retrieval. In Foundations and trends in information retrieval (Vol. 3, pp. 225–331). Now Publishers. https://doi.org/10.1561/1500000016. http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000016
Lv, Y., & Zhai, C. (2009). Positional language models for information retrieval. In SIGIR conference (pp. 299–306). https://doi.org/10.1145/1571941.1571994. http://sifaka.cs.uiuc.edu/czhai/pub/sigir09-PLM.pdf
MacAvaney, S., Yates, A., Cohan, A., Soldaini, L., Hui, K., Goharian, N., & Frieder, O. (2018). Characterizing question facets for complex answer retrieval. arXiv preprint arXiv:1805.00791.
Macdonald, C., & Ounis, I. (2006). Voting for candidates: Adapting data fusion techniques for an expert search task. In CIKM (pp. 387–396). https://doi.org/10.1145/1183614.1183671.
Macdonald, C., & Ounis, I. (2011). Learning models for ranking aggregates. In Advances in information retrieval. LNCS (Vol. 6611, pp. 517–529). New York: Springer. http://www.dcs.gla.ac.uk/~craigm/publications/macdonald11learned.pdf
Miller, A. H., Fisch, A., Dodge, J., Karimi, A., Bordes, A., & Weston, J. (2016). Key-value memory networks for directly reading documents. arXiv:1606.03126.
Murdock, J. W., Kalyanpur, A., Welty, C., Fan, J., Ferrucci, D. A., Gondek, D. C., Zhang, L., & Kanayama, H. (2012). Typing candidate answers using type coercion. IBM Journal of Research and Development 56(3/4), 7:1–7:13. https://pdfs.semanticscholar.org/765d/0956e46846a33a1062749daede11ba71680f.pdf
Petkova, D., & Croft, W. B. (2007). Proximity-based document representation for named entity retrieval. In CIKM (pp. 731–740). ACM. https://doi.org/10.1145/1321440.1321542. http://portal.acm.org/citation.cfm?id=1321440.1321542
Pound, J., Hudek, A. K., Ilyas, I. F., & Weddell, G. (2012). Interpreting keyword queries over Web knowledge bases. In CIKM. https://cs.uwaterloo.ca/~jpound/pubs/pound-cikm2012.pdf
Reed, S., & De Freitas, N. (2015). Neural programmer-interpreters. arXiv preprint arXiv:1511.06279.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. In SIGKDD conference (pp. 1135–1144).
Roth, D. (2017). On the necessity of learning and reasoning: A perspective from natural language understanding. McCarthy award acceptance speech at IJCAI 2017. https://www.youtube.com/watch?v=tAKn3Gt75rg
Saha, A., Pahuja, V., Khapra, M. M., Sankaranarayanan, K., & Chandar, S. (2018). Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. arXiv preprint arXiv:1801.10314.
Savenkov, D., & Agichtein, E. (2016). When a knowledge base is not enough: Question answering over knowledge bases with external text data. In SIGIR conference (pp. 235–244). https://dl.acm.org/citation.cfm?id=2911536
Savenkov, D., & Agichtein, E. (2017). Evinets: Neural networks for combining evidence signals for factoid question answering. In ACL conference (Vol. 2, pp. 299–304). http://aclweb.org/anthology/P17-2047
Sawant, U., & Chakrabarti, S. (2013). Features and aggregators for web-scale entity search. arXiv:1303.3164.
Severyn, A., & Moschitti, A. (2015). Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 373–382). http://disi.unitn.it/~severyn/papers/sigir-2015-long.pdf
Shalev-Shwartz, S., & Shashua, A. (2016). On the sample complexity of end-to-end training vs. semantic abstraction training. arXiv:1604.06915
Wang, M. (2006). A survey of answer extraction techniques in factoid question answering. Computational Linguistics 1(1). https://nlp.stanford.edu/mengqiu/publication/LSII-LitReview.pdf
West, R., Gabrilovich, E., Murphy, K., Sun, S., Gupta, R., & Lin, D. (2014). Knowledge base completion via search-based question answering. In WWW conference (pp. 515–526). https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42024.pdf
Xiong, C., Callan, J., & Liu, T. Y. (2017). Word-entity duet representations for document ranking. In SIGIR conference (pp. 763–772). arXiv:1706.06636
Xu, K., Reddy, S., Feng, Y., Huang, S., & Zhao, D. (2016). Question answering on Freebase via relation extraction and textual evidence. arXiv preprint arXiv:1603.00957.
Yahya, M., Berberich, K., Elbassuoni, S., Ramanath, M., Tresp, V., & Weikum, G. (2012). Natural language questions for the Web of data. In EMNLP conference, Jeju Island, Korea (pp. 379–390). http://www.aclweb.org/anthology/D12-1035
Yang, M. C., Duan, N., Zhou, M., & Rim, H. C. (2014). Joint relational embeddings for knowledge-based question answering. In EMNLP conference (pp. 645–650).
Yao, X. (2015). Lean question answering over Freebase from scratch. In NAACL conference (pp. 66–70). http://www.aclweb.org/website/old_anthology/N/N15/N15-3014.pdf
Yao, X., & Van Durme, B. (2014). Information extraction over structured data: Question answering with Freebase. In ACL conference, ACL. http://www.cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf
Yavuz, S., Gur, I., Su, Y., Srivatsa, M., & Yan, X. (2016). Improving semantic parsing via answer type inference. In EMNLP conference (pp. 149–159). http://cs.ucsb.edu/~ysu/papers/emnlp16_type.pdf
Yih, S. Wt., Chang, M. W., He, X., & Gao, J. (2015). Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL conference (pp. 1321–1331). http://anthology.aclweb.org/P/P15/P15-1128.pdf
Zhiltsov, N., Kotov, A., & Nikolaev, F. (2015). Fielded sequential dependence model for ad-hoc entity retrieval in the Web of data. In SIGIR conference (pp. 253–262). http://www.cs.wayne.edu/kotov/docs/zhiltsov-sigir15.pdf
Zhong, V., Xiong, C., & Socher, R. (2017). Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103
Acknowledgements
Thanks to the reviewers for their constructive suggestions. Thanks to Elmar Haußmann for generous help with AQQU. Thanks to Doug Oard for advice on set versus ranked retrieval. Thanks to Saurabh Sarda for migrating the code of Joshi et al. (2014) to use AQQU. Partly supported by grants from IBM and nVidia.
Cite this article
Sawant, U., Garg, S., Chakrabarti, S. et al. Neural architecture for question answering using a knowledge graph and web corpus. Inf Retrieval J 22, 324–349 (2019). https://doi.org/10.1007/s10791-018-9348-8
DOI: https://doi.org/10.1007/s10791-018-9348-8