1 Introduction

A large fraction of Web queries involve and seek entities (Lin et al. 2012). Such queries may seek details of celebrities or movies (e.g., kingsman release date), historical events (e.g., Who killed Gandhi?), travel (e.g., nearest airport to baikal lake), and so on. Queries that match certain patterns are handed off to specialized QA systems that directly return entity responses from a KG. Sometimes, a semantic parse of the textual query is attempted (Berant et al. 2013; Yih et al. 2015) to translate it to a structured query over the KG, which is then executed to fetch a set of response entities.Footnote 1 While providing precise answers if everything goes well, this approach to KG-driven QA is fraught with several difficulties.

  • The input textual query may range from grammatically well-formed questions (e.g., In which band did Jimmy Page perform before Led Zeppelin?) to free-form “telegraphic” keyword queries (e.g., band jimmy page was in before led zeppelin). QA systems are often brittle with regard to input syntax, backing off if the input does not match specific syntactic patterns.

  • A curated, structured collection of facts in a KG reduces the QA task to “compiling” the textual query into a structured form which is directly executed on the KG. But KG coverage is always patchy—with nodes and/or edges missing—particularly when less popular entities are concerned. For example, over 70% of people in Freebase do not have a place of birth in Freebase (West et al. 2014). If the types and relations expressed textually in the query cannot be mapped confidently to the KG, most QA systems back off.

  • Alternatively, one can extend IR-style text search by using an entity-annotated corpus of Web pages. Any text snippet s in the corpus which mentions entity e and also matches the question q well, can be considered supporting evidence for that entity e to be the answer for q. However, such evidence from the Web corpus can be noisy due to incorrect entity linking of s or q, and imperfect text matching between q and s.

Example query and response Figure 1 demonstrates the advantages and complexities of effective entity-level QA involving both KG and corpus. Tokens in the query have diverse, possibly overlapping roles. Specifically, a query span may hint at an entity, type or relation, or it can be used to match passages in the corpus. Understanding the roles and disambiguating the hint to respective semantic nodes in the KG (wherever applicable) helps interpret the query. For example, the query \(q=\)band jimmy page was in before led zeppelin has a reference to entities \(e_1={{\texttt{Jimmy\_Page}}}\) and \({\texttt{Led\_Zeppelin}}\). The set of such entities, grounded in the query, will be called \({\mathcal {E}}_1\) with members \(e_1, e'_1\), etc. Mentions in both the query and corpus documents are linked to entity nodes in the KG (e.g. jimmy page). The target type of the query will be denoted \(t_2\) and a candidate answer entity will be denoted \(e_2\). The set of all answer candidates will be called \({\mathcal {E}}_2\), and the set of gold (ground truth) entities will be called \({\mathcal {E}}^*_2\). The token band hints at the type \(t_2 = {{\texttt {musical\_group}}}\) of the expected answer entity \(e_2 = {{\texttt {The\_Yardbirds}}}\). The (rather weak) hint ‘was in’ hints at the relation \(r={{\texttt {/music/musical\_group/member}}}\) connecting \(e_1, e_2\). Thus, identifying \(e_1\), r and \(t_2\) can lead us to many candidate \(e_2\)s, \({\texttt{The\_Yardbirds}}\) being one of them. Yet, the query interpretation is not complete because an important token ‘before’ is not considered. If the KG does not have timestamps on membership, or the QA engine cannot do arithmetic with timestamps, passages in the corpus can still offer supplementary evidence by matching ‘before’ with ‘prior to’, along with mentions of \(e_1\) and \(e_2\). Thus, using both KG and corpus allows combining structured and unstructured evidence to answer a query.

Fig. 1

QA example using KG and corpus. Parts of the query have different roles and match diverse artifacts in the KG and corpus, requiring a complex flow of evidence toward the correct response

This example also serves to highlight our challenges. The various curved, colored lines in Fig. 1 map the query hints to the KG or the corpus evidence, either created during a pre-processing stage (e.g. entity linking in the corpus) or at run-time (e.g. matching the target type \(t_2 = {{\texttt {musical\_group}}}\) to the query text Band through a type model). The machine-learnt models which map the hints to KG entities, types or relations need to handle a great deal of ambiguity, as a hint may match many correct and incorrect KG artifacts. Thus, there may be multiple KG subgraphs and corpus text snippets, each appearing to support different correct or incorrect entity candidates. Hence, there is a clear need for robust and seamless aggregation of supporting evidence across corpus and KG.

Our contributions We present a new QA system, AQQUCN,Footnote 2 with these salient features:

  • AQQUCN is resilient to a spectrum of query styles, ranging from syntactically well-formed questions to short ‘telegraphic’ Web queries. It does not attempt a grammar-based parse of the query.

  • AQQUCN uses KG and corpus signals in conjunction to score responses. Rather than a single comparison network between query and corpus (Severyn and Moschitti 2015; Bahdanau et al. 2014), AQQUCN uses a heterogeneous network architecture tailored to structural properties of queries.

  • Instead of choosing one structured interpretation and executing it on the KG to get a response set, AQQUCN is capable of directly ranking entities based on evidence pooled over multiple structured interpretations.

We review related work in Sect. 2. In Sect. 3 we give an overview of AQQUCN, and in Sect. 4 we describe all the modules in detail. In Sect. 5 we evaluate AQQUCN against recent competitive baseline systems. Our code with relevant data will be made available (CSAW 2018).

2 Related work

Recent QA systems are the result of convergence between several communities: Information Retrieval (IR), NLP, machine learning, and neural networks.

2.1 Corpus-oriented entity search

Early work in the IR community focused on corpus-driven QA in the TREC-QA track (Wang 2006; Cardie 2012). The Web and IR community has traditionally assumed a free-form query that is often ‘telegraphic’. Because Web search queries are far noisier, the goal of structure discovery is more modest. Indeed, in expert search, one of the earliest forms of corpus-based entity search, which focused on finding experts (people) in a given field, query structure discovery was given no importance. State-of-the-art expert search systems (Balog et al. 2009; Macdonald and Ounis 2011; Petkova and Croft 2007) collect text snippets (or documents) containing query words, match each snippet evidence to an expert (e.g. using the signal that the expert’s name is mentioned in the snippet), and aggregate such evidence snippets to rank the experts. Generative language models (Balog et al. 2006), proximity-based kernels (Petkova and Croft 2007) and feature-based supervised discriminative learning (Fang et al. 2010) were evaluated to score the evidence match. Conversely, document retrieval can be improved by expanding the query with entity features (Dalton et al. 2014).

2.2 Entity search from knowledge graphs (KGQA/KBQA)

As the information extraction and NLP communities developed more tools for annotating corpus spans with named entity (NE) types (Ling and Weld 2012) and canonical entity IDs from KGs (Ganea and Hofmann 2017), corpus-based techniques were refined to match answer types (Murdock et al. 2012). With support from large KGs like Wikipedia and Freebase, the NLP community developed semantic parsers (Berant et al. 2013; Yao and Van Durme 2014; Yih et al. 2015; Kasneci et al. 2008; Pound et al. 2012; Yahya et al. 2012; Kwiatkowski et al. 2013) that translated natural language queries to a target graph query language similar to SPARQL. These approaches typically assume that question utterances are grammatically well-formed, from which precise clause structure, ground constants, variables, and connective relations can be inferred via semantic parsing. A similar system called AQQU (Bast and Haußmann 2015) emerged from the IR community. We base our system on it, so we will describe it separately in Sect. 3.1. Such approaches typically rest on the assumption that all usable knowledge has been curated into the KG. The query is first translated to a structured form, which is then executed on the KG.

2.3 Combining corpus and KG

There is increasing interest in combining corpus and KG for QA. AQQUCN is related to the single-relation QA system described by Joshi et al. (2014). Unlike AQQUCN, they attempted an explicit 4-way segmentation of the query to identify the mention spans \({\tilde{e}}_{1}\) that mark grounded entities \(e_1\), the mention span \({\tilde{r}}\) (Footnote 3) that marks a hint to relation r, and the span \({\tilde{t}}_{2}\) that marks a hint to the target type \(t_2\); the remaining tokens \({\tilde{s}}\) are designated as selectors that are meant to keyword-match corpus snippets. This overall query segmentation is denoted z. The segmentation z guides their system to propose structured KG artifacts \(e_1, r, t_2\) corresponding to query spans \({\tilde{e}}_{1}, {\tilde{r}}, {\tilde{t}}_{2}\). The candidate response entity \(e_2\) is then scored over admissible values of all latent variables (see Fig. 2). KG and corpus signals are unified as various factors in the model. We address two limitations of this system. First, we do not attempt a hard query segmentation. Second, we replace the traditional discrete language models that inform the factor potentials with continuous neural counterparts.

Fig. 2

The graphical model used by Joshi et al. (2014) to score each candidate entity \(e_2\) in turn by enumerating over all other variables (nodes), which are unobserved or ‘latent’. Factor nodes are denoted by filled squares. (Although random variables in nodes are conventionally written in uppercase, we use lowercase to avoid confusion with the sets \({\mathcal {E}}_1, {\mathcal {E}}_2\), etc.)

More work in this vein followed rapidly. Xu et al. (2016) presented a KGQA system with a corpus-based postprocessing pruning stage that removes candidates with weak corpus support. A more symmetric architecture called Text2KB was proposed by Savenkov and Agichtein (2016). A KGQA system as in Sect. 2.2 collects candidate answers. A corpus search collects snippets from top-ranking documents and annotates (Globerson et al. 2016; Ganea and Hofmann 2017) them with KG entities. For each candidate in the union, features are collected from both KG and corpus snippets to rank them. In a similar spirit, Xiong et al. (2017) propose AttR-Duet. Both the query and corpus passage are represented as bags of words as well as entities. Definition and mention texts in the KG and corpus are used to bridge between the space of entities and words, and four families of similarities (\(\{{\text {query}}, {\text {passage}}\} \times \{{\text {word}}, {\text {entity}}\}\)) are defined. These are then combined using a neural network. Reminiscent of how Joshi et al. (2014) incorporated a corpus-based factor/potential into a graphical model (Fig. 2), Bast and Buchhold (2017) proposed QLever, which extended SPARQL with predicates over an entity-annotated corpus (Chakrabarti 2010). The primary focus of QLever is on high performance in the face of query clauses spanning KG and corpus indices, not ranking accuracy per se.

2.4 Complex QA using neural techniques

Early improvements to QA systems resulted from replacing discrete word matching and scoring with word vector counterparts. In corpus-based QA, Bordes et al. (2014) modeled queries and passages as bags of words and simply added up their word embeddings to represent and compare them. Yang et al. (2014) used embeddings to translate questions into relational predicates. More refined sentence/query embeddings have been created (Severyn and Moschitti 2015; Iyyer et al. 2014) via recurrent networks (RNNs) and convolutional networks (CNNs), but usually applied to syntax-rich, well-formed questions. Dong et al. (2015) obtained better accuracy than Bordes et al. (2014) by replacing the aggregated word vector query representation with multiple parallel CNNs for extracting deep representations for relation r between query entity (i.e. the entity mentioned in the query) and answer entity, answer type and the KG neighborhood of the query entity. While the above works dealt with entity retrieval, CNNs have also been actively explored in query-document matching for complex answer retrieval (Hui et al. 2017, 2018; MacAvaney et al. 2018). EviNets (Savenkov and Agichtein 2017) embeds the query and evidence passage as average word vectors. Then it collects various aggregates (Macdonald and Ounis 2011) of vector match scores, which are combined using a trained pooling network. None of these neural systems seek a structural understanding of the query and how its parts relate differentially to the KG and corpus. More recently, neural learning techniques have been used to translate very complex queries (Saha et al. 2018) like how many countries have more rivers than Brazil into multi-layer expression graphs or multi-step imperative programs (Andreas et al. 2016; Dong and Lapata 2016; Miller et al. 2016; Zhong et al. 2017; Reed and De Freitas 2015; Liang et al. 2016). The extreme complexity of the reinforcement learning formulations needed has, thus far, precluded scaling them to Web-scale corpora and noisy query and corpus annotation tools.

3 AQQUCN overview

AQQUCN, a system we have built by extending AQQU (Bast and Haußmann 2015), implements all the inference pathways shown in Fig. 1. Unlike AQQU and some other systems, the end goal of AQQUCN is not to “compile” the input into a structured query to execute on the KG, because broken/missing input syntax can make this attempt fail in brittle ways. Instead, the end goal of AQQUCN is to directly create a ranking over entities using KG and corpus, using interpretations as latent variables. We first review AQQU briefly, and then describe the three stages of AQQUCN in the rest of this section (Fig. 3).

Fig. 3

Simplified sketch of AQQU (Bast and Haußmann 2015)—training the interpretation ranker (follow the numbers). The \(e_1\) linker may shortlist alternative candidates, or candidates for multiple grounded entities \(e_1, e'_1\), etc. Exploration around these produces interpretations I. Each I is associated with a system output \({\mathcal {E}}_2\), a set of answer entities. This is compared against the gold \({\mathcal {E}}^*_2\) to compute the reward \(F_1\), which guides the interpretation ranker

3.1 AQQU review

AQQU interprets the input question with reference to three possible “query templates”, each having a direct translation to a SPARQL query. First, grounded entities \(e_1\) in the query are identified. Then, a guided expansion in the KG locates candidates \(e_2\), resulting in structured interpretations I that may take forms such as these one- and two-hop queries:

  • \(e_1 \xrightarrow {r} {?e_2}\).

  • \(e_1 \xrightarrow {r_1} m \xrightarrow {r_2} {?e_2}\), where m is a mediator ‘entity’ often representing a ternary relation.

AQQU then extracts features from question q, interpretation I (which could have a few distinct structures as above), and all the candidate \(e_2\)s together. These features are used in logistic regression or random forests to score each interpretation. The best interpretation (with unbounded placeholders for \(e_2\) and m, if applicable) is then executed on the KG. During training, the gold interpretation is not known, but executing each interpretation gives a system response set \({\hat{{\mathcal {E}}}}_{2}\) that can be compared against the gold \({\mathcal {E}}^*_2\) to get an F1 score. This helps AQQU train logistic regression or random forests to score better interpretations higher than worse ones. The unit of scoring and ranking is a single interpretation, not an entity, nor a joint space of interpretation and entity, as is the case in AQQUCN. Moreover, AQQU does not use corpus signals.
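To make this training signal concrete, the following is a minimal Python sketch (not AQQU's actual code) of how each candidate interpretation can be scored by the F1 of its retrieved entity set against the gold set \({\mathcal {E}}^*_2\); the KG execution routine `execute_on_kg` is a hypothetical stand-in for the SPARQL backend.

```python
def f1(predicted, gold):
    """Set-level F1 between a system response set and the gold entity set."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def interpretation_rewards(interpretations, gold_entities, execute_on_kg):
    """For each candidate interpretation I, execute it on the KG to obtain a
    response set and use its F1 against the gold set as the training reward
    for the interpretation ranker. `execute_on_kg` is a hypothetical callable
    wrapping the SPARQL backend."""
    return {I: f1(execute_on_kg(I), gold_entities) for I in interpretations}
```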

3.2 Modifications and new modules

Our implementation of AQQUCN is based on AQQU because it provides a well-written, reusable implementation of query entity linking and interpretation generation. We modify and enhance AQQU in the following ways, as also illustrated in Fig. 4 (for the query template \(e_1 \xrightarrow {r} {?e_2}\)). The important steps are listed below. (Item numbers correspond to modules in Fig. 4.)

Fig. 4

AQQUCN block diagram showing \(e_1\)-linking (1), joint candidate generation from KG and corpus (2), followed by scoring by three convolutional networks for matches with corpus contexts (3–5), target type (6, 7), and relation (8, 9), followed by the score combination network (10) trained with a ranking loss (11). Here we show only the query template \(e_1 \xrightarrow {r} {?e_2}\); other cases are similar. KG+Corpus emit a candidate set \({\mathcal {E}}_2\) along with multiple interpretations per candidate, but (3–10) are shown as acting on one candidate \(e_2\), to reduce clutter. After each \(e_2\) is scored, \({\mathcal {E}}_2\) can be ranked

  1. Given a query q, we first identify the set of in-query entities \({\mathcal {E}}_1\) using the widely used entity tagger TagMe (Ferragina and Scaiella 2010) (Footnote 4).

  2. The KG neighborhood of each \(e_1\) is explored to collect candidate \(e_2\)s, similar to prior KBQA (Knowledge Base driven QA) systems (Bast and Haußmann 2015; Yih et al. 2015; Xu et al. 2016). Like AQQU, we limit \({\mathcal {E}}_2\) to the set of entities occurring in the 2-hop KG neighborhood (Footnote 5) of any entity in \({\mathcal {E}}_1\).

  3. As the example in Fig. 1 illustrated, some evidence supporting the correct entity, such as before in the query and prior to in the snippet, may come from the corpus. As we shall see in Sect. 5, corpus evidence can greatly augment KG-based evidence. Therefore, we also gather text snippets (at roughly the granularity of sentences) that mention some \(e_1\) or words from the query.

  4. The next step is to identify the set of candidate answer entities, \({\mathcal {E}}_2\). Apart from the KG neighborhoods of the \(e_1\)s, we also collect every (non-\(e_1\)) entity that occurs in any snippet as a candidate \(e_2\). This allows the system to recover from early errors, such as when \({\mathcal {E}}_1\) identified by the entity tagger is empty or wrong, or when the query and answer entities are more than two hops apart in the KG. The union of candidates from the KG and the corpus forms \({\mathcal {E}}_2\), whose members are called \(e_2\)s.

  5. We use three neural modules to inform the score combination network, which replaces AQQU’s interpretation ranking module. The query corpus convnet (QCN), described in Sect. 4.1, scores the evidence in the context snippets, given the query and the candidate \(e_2\).

  6. Candidates \(e_2\) collected from the KG may be accompanied with designated types \(t_2\).

  7. The query type convnet (QTN), described in Sect. 4.2, scores the potential presence of a textual clue to \(t_2\) somewhere in the query.

  8. A candidate \(e_2\) found in the KG is also connected to \(e_1\) via a relation r.

  9. The query relation convnet (QRN), described in Sect. 4.3, scores the potential presence of a textual clue to r somewhere in the query.

  10. The score combination network works on the joint space of candidate interpretations and entities, ending with a ranking of candidate entities that may, in general, draw signals from multiple interpretations. AQQU scores are used as additional features (not shown). Outputs from the three convnets are wired together in a markedly non-uniform architecture, consistent with the inference pathways shown in Fig. 1; a high-level sketch of this pipeline follows the list.
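The following is an illustrative Python sketch of the control flow through modules (1)–(10); all helper objects and functions (`tag_entities`, `kg`, `corpus`, the three convnet scorers and `combine_scores`) are hypothetical placeholders for the components described above, not the actual AQQUCN code.

```python
def aqqucn_pipeline(query, kg, corpus, tag_entities, qcn, qtn, qrn, combine_scores):
    """Illustrative AQQUCN flow: candidate generation from KG and corpus,
    convnet scoring, and score combination into an entity ranking."""
    E1 = tag_entities(query)                       # (1) in-query entities
    kg_candidates = {e2 for e1 in E1               # (2) 2-hop KG neighborhood
                     for e2 in kg.two_hop_neighborhood(e1)}
    snippets = corpus.snippets(query, E1)          # (3) evidence snippets
    E2 = kg_candidates | (                         # (4) union of KG and corpus candidates
        {e for s in snippets for e in s.entities} - set(E1))
    scored = []
    for e2 in E2:
        # (5) QCN: corpus evidence, summed over snippets mentioning e2
        corpus_score = sum(qcn(query, s) for s in snippets if e2 in s.entities)
        # (6, 7) QTN: compatibility of the query with e2's declared types
        type_score = max((qtn(query, t2) for t2 in kg.types(e2)), default=0.0)
        # (8, 9) QRN: compatibility of the query with relations linking some e1 to e2
        rel_score = max((qrn(query, r) for e1 in E1
                         for r in kg.relations(e1, e2)), default=0.0)
        # (10) score combination network over convnet (and AQQU) features
        scored.append((combine_scores(corpus_score, type_score, rel_score), e2))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e2 for _, e2 in scored]
```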

A summary of notation introduced thus far is given in Table 1.

Table 1 Notation summary

Unlike Joshi et al. (2014), AQQUCN does not attempt to segment the query into disjoint spans that describe \(e_1, r, t_2\), but lets multiple neural networks run over the query. This allows AQQUCN to process long queries that Joshi et al. (2014) could not. Moreover, a query span can inform multiple networks; consider queries who discovered penicillin and who discovered antarctica, where ‘who’ carries a lot less information about \(t_2\) than the \(e_1\) mentions ‘penicillin’ and ‘antarctica’.

4 Detailed design of AQQUCN modules

In this section we present the details of the new modules we added to AQQU: the query-corpus, query-type and query-relation networks, as well as the score combination network.

4.1 Query-corpus network (QCN)

For each query q, we get zero or more evidence snippets from the entity-annotated Web corpus. Each snippet contains a candidate answer entity \(e_2 \in {\mathcal {E}}_2\). We use the query corpus network (QCN) for assigning a relevance score to each snippet. This relevance score, and also the confidence score for linking \(e_2\) to a snippet, are features used in the final score combination network of AQQUCN (Sect. 4.4).

Given a query q and a snippet from the Web corpus, QCN should assign high score to the pair if the snippet contains evidence to correctly answer q. Training data for this network is in the form of positive and negative snippets for each training query. As manual generation of such labels involves considerable effort, we resort to (possibly noisy) indirect supervision instead. We treat all text snippets centered around a gold answer entity and containing at least one query word (non-stopword) as positively labeled snippets. Similarly, we treat all text snippets centered around any non-answer entity and containing some or all query words as negatively labeled snippets (examples in Table 2). To train this network, we use the state-of-the-art short text ranking system proposed by Severyn and Moschitti (2015). Once QCN is trained, we have a score for each snippet belonging to a candidate answer entity \(e_2 \in {\mathcal {E}}_2.\)

Table 2 Example positive and negative corpus snippets for queries
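As a concrete illustration, here is a minimal sketch of the indirect supervision described above; the snippet source and stopword list are hypothetical, and the resulting labels are noisy by construction.

```python
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "who", "what", "which"}  # illustrative

def label_snippets(query, gold_entities, candidate_snippets):
    """Build noisy positive/negative (query, snippet) pairs for training QCN.
    `candidate_snippets` is assumed to yield (center_entity, snippet_text)
    pairs, where the snippet is centered around a mention of center_entity."""
    content_words = {w for w in query.lower().split() if w not in STOPWORDS}
    positives, negatives = [], []
    for entity, snippet in candidate_snippets:
        snippet_words = set(snippet.lower().split())
        if not content_words & snippet_words:
            continue                 # require at least one non-stopword query term
        if entity in gold_entities:
            positives.append((query, snippet))   # centered on a gold answer entity
        else:
            negatives.append((query, snippet))   # centered on a non-answer entity
    return positives, negatives
```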

While Siamese convolutional networks served well in the QCN module, the broad architecture of AQQUCN can accommodate competitive alternatives (Lv and Zhai 2009; Petkova and Croft 2007; Zhiltsov et al. 2015; Hui et al. 2017). Measuring the effect of this choice on QA accuracy is left for future work.

Match signals from multiple snippets supporting an entity \(e_2\) have to be aggregated before passing on to the combination network shown at the bottom of Fig. 4. Each candidate \(e_2\) may have a varying number of supporting snippets. Usually, a number of standard aggregates (sum, max, etc.) are computed and then a weighted combination is learnt (Balog et al. 2009; Macdonald and Ounis 2006; Sawant and Chakrabarti 2013; Joshi et al. 2014). For our data sets, we found a simple sum of snippet scores to be adequate: we add up the snippet scores over all snippets belonging to an entity \(e_2\) for a query q, and use it as a feature for (q, \(e_2\)) (feature 1 in Table 5). More complex pooled aggregators can be explored in future work.
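A sketch of this sum-pooling step, assuming per-snippet QCN scores have already been computed:

```python
from collections import defaultdict

def aggregate_snippet_scores(snippet_scores):
    """Sum per-snippet QCN scores for each candidate entity; the sum becomes
    feature 1 in Table 5. `snippet_scores` is an iterable of
    (candidate_entity, qcn_score) pairs for one query."""
    total = defaultdict(float)
    for e2, score in snippet_scores:
        total[e2] += score
    return dict(total)
```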

4.2 Query-type network (QTN)

The query-type network (QTN) outputs a compatibility score between the query q and a candidate type \(t_2\). This is a multi-class, multi-label classification problem, as a query may imply more than one correct answer type (e.g. \({\texttt{/music/composer}}\) and \({\texttt{/music/artist}}\) for the query saturday night fever music band). Good quality training data in the form of (query, type) pairs is essential to ensure that the network learns to handle different types of (or the lack of) query syntax and its correspondence with the answer type. In our first attempt, we included all (q, t) pairs in the training data, where t was any type connected to the gold answer entity \(e_2\) for the query q in the training set. However, this strategy resulted in many spurious types. For example, \({\texttt{/broadcast/radio\_station\_owner}}\) is not the correct answer type for the query maya moore college, even if the answer entity \({\texttt{University\_of\_Connecticut}}\) belongs to that type. Therefore, we used human supervision. Paid student volunteers were asked to label (q, t) pairs as correct or incorrect, which helped remove approximately 30% of the (q, t) pairs as irrelevant and improve training data quality.

We found that, given the small number of training queries, the data obtained through the above process may not be enough to capture the variety of syntax used to imply a type. For robust training, we resorted to representing the type through additional patterns (Table 3) obtained as follows.

Freebase relation names :

Consider a (\({\texttt{subject}}\), \({\texttt{relation}}\), \({\texttt{object}}\)) fact triple, e.g. (\({{\texttt {Captain\_America:\_The\_First\_Avenger}}}\), \({{\texttt {/film/film/prequel}}}\), \({{\texttt {Thor}}}\)). Relations in Freebase have composite names of the form “/x/y/z”, where x is a topical domain and y and z indicate the types of the \({{\texttt {subject}}}\) and \({{\texttt {object}}}\). E.g. \({\texttt{prequel}}\) is a type indicator word for \({{\texttt {Thor}}}\). Meanwhile, Freebase declares the expected type of the endpoint entities of each relation r. For \({{\texttt {/film/film/prequel}}}\), the expected end type is \({{\texttt {/film/film}}}\). We combine these two information nuggets and add \({{\texttt {prequel}}}\) as a pattern expressing the type \({{\texttt {/film/film}}}\).

Freebase type names :

Ending substrings of the type name are also considered as patterns (e.g. ‘treatment’ for the type \({{\texttt {medical\_treatment}}}\)).

Table 3 Example path patterns for types

At the end of this exercise, we have zero or more patterns for each type (Table 3).
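A small sketch of this pattern extraction on the examples above; the exact Freebase metadata lookup (composite relation names and declared end types) is abstracted away.

```python
def pattern_from_relation(relation_name, expected_end_type):
    """The last component of a composite relation name becomes a pattern for
    the relation's declared end type, e.g. '/film/film/prequel' with end type
    '/film/film' yields the pattern 'prequel' for '/film/film'."""
    last = relation_name.rstrip("/").split("/")[-1]
    return expected_end_type, last.replace("_", " ")

def patterns_from_type(type_name):
    """Ending substrings of the type name, e.g. '/medicine/medical_treatment'
    yields 'medical treatment' and 'treatment'."""
    words = type_name.rstrip("/").split("/")[-1].split("_")
    return [" ".join(words[i:]) for i in range(len(words))]

# Illustrative usage:
# pattern_from_relation("/film/film/prequel", "/film/film") -> ("/film/film", "prequel")
# patterns_from_type("/medicine/medical_treatment") -> ["medical treatment", "treatment"]
```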

Figure 5 illustrates the design of QTN. This multi-class multi-label architecture is partly inspired by Kim (2014) and Severyn and Moschitti (2015). For each query q input to the network, the network provides an output score vector of size equal to the number of types, as follows:

  1. In the initial layer, each query word is represented as a vector embedding learnt during training. Convolution and pooling layers then extract a fixed-length feature vector from the variable-length input.

  2. In a separate layer, we compute word overlap features inspired by Severyn and Moschitti (2015). Specifically, we compute the Jaccard similarity between the query and each type name, representing each as a bag of words. We also compute the Jaccard similarity between the query and each type pattern, and then take the maximum over all patterns of a given type. This process results in two features for each (query, type) pair.

  3. Similar to the bag-of-words overlap, we compute Jaccard similarities between the lemmatized (word-stem) forms of the query, the type name and the type patterns, resulting in two more features for each (query, type) pair; these overlap features are sketched after this list.

  4. Once the overlap features and the convolution-pooling features are computed, a fully connected hidden layer with sigmoid activation is used at the last stage to score all types.
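A minimal sketch of the overlap features from steps 2 and 3, with a hypothetical `lemmatize` function standing in for the actual lemmatizer; each type pattern is assumed to be given as a list of words.

```python
def jaccard(a, b):
    """Jaccard similarity between two bags of words, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def overlap_features(query_words, type_name_words, type_patterns, lemmatize):
    """Four overlap features for one (query, type) pair: Jaccard with the type
    name, max Jaccard over the type's patterns, and their lemmatized versions."""
    feats = [jaccard(query_words, type_name_words),
             max((jaccard(query_words, p) for p in type_patterns), default=0.0)]
    lq = [lemmatize(w) for w in query_words]
    feats.append(jaccard(lq, [lemmatize(w) for w in type_name_words]))
    feats.append(max((jaccard(lq, [lemmatize(w) for w in p]) for p in type_patterns),
                     default=0.0))
    return feats
```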

Fig. 5

Query-type network (QTN) architecture and inputs. The query-relation network (QRN) architecture is identical except that types are replaced by relations in the training data

4.3 Query-relation network (QRN)

The query-relation network (QRN) outputs a compatibility score between a candidate relation r and the query q. As with QTN, we generate multi-class, multi-label training data in the form of (q, r) pairs, where r is a relation connecting the gold answer entity \(e_2\) to the entity \(e_1\) mentioned in the query q. There could be multiple r for the same (\(q, e_1, e_2\)) tuple. Such a training data generation process is common in previous work (Dong et al. 2015; Yih et al. 2015; Bordes et al. 2014) and human curation is not used to remove noise.

Similar to QTN, we enrich our training data using relation description patterns. To generate these patterns, we start with (\({{\texttt {subject}}}\), \({{\texttt {relation}}}\), \({{\texttt {object}}}\)) facts in the KG and locate sentences in the annotated corpus where both \({{\texttt {subject}}}\) and \({{\texttt {object}}}\) are mentioned. We identify the path connecting the two in the dependency parse of the sentence, expressed as a sequence of lemmatized words. We count the number of times each path was found, and retain only the most frequent paths. This gives a bag of path patterns that describe relation r (examples in Table 4). The network architecture for QRN is the same as in Fig. 5. The only difference is that for QRN we have (query, relation) and (corpus pattern, relation) tuples as training instances.

Table 4 Example path patterns for relations
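A simplified sketch of this path-pattern extraction, assuming a dependency parser such as spaCy is available; mention matching, lemmatization details and frequency thresholds are all simplified relative to the actual pipeline.

```python
from collections import Counter

def dependency_path_lemmas(doc, i, j):
    """Lemmas of the tokens on the dependency path between tokens i and j
    (endpoints excluded), via their lowest common ancestor."""
    def chain(k):                                # token, its head, ..., the root
        path = [doc[k]]
        while path[-1].head.i != path[-1].i:     # the root is its own head
            path.append(path[-1].head)
        return path
    anc_i, anc_j = chain(i), chain(j)
    anc_j_ids = {t.i for t in anc_j}
    lca_pos = next((p for p, t in enumerate(anc_i) if t.i in anc_j_ids), None)
    if lca_pos is None:
        return None                              # no common ancestor found
    lca = anc_i[lca_pos].i
    lca_pos_j = next(p for p, t in enumerate(anc_j) if t.i == lca)
    on_path = ({t.i for t in anc_i[:lca_pos + 1]} |
               {t.i for t in anc_j[:lca_pos_j + 1]}) - {i, j}
    return tuple(doc[k].lemma_ for k in sorted(on_path))

def relation_patterns(parsed_sentences, top_k=20):
    """Count dependency paths between subject and object mentions and keep the
    most frequent ones as patterns for a relation. `parsed_sentences` yields
    (spacy_doc, subject_token_index, object_token_index) triples."""
    counts = Counter()
    for doc, i, j in parsed_sentences:
        path = dependency_path_lemmas(doc, i, j)
        if path:
            counts[path] += 1
    return [" ".join(path) for path, _ in counts.most_common(top_k)]
```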

4.4 Score combination network

Referring back to Fig. 4, QRN, QTN, and multiple QCNs send their scores as features to a final combination network that represents each candidate \(e_2\) as a feature vector and scores it in conjunction with \(I = \langle e_1, r, t_2 \rangle \) (Footnote 6). In abstract terms, if I denotes an interpretation and \(e_2\) a candidate entity, at this point we have a score matrix \(S(I, e_2)\) indexed by interpretations as rows and candidate response entities as columns.

During both training and inference, I is latent. Standard learning to rank methods (Liu 2009) are not directly applicable to our score combination network because of the latent variables implicit in interpretation I. In fact, the additional complications posed by these latent variables currently limit us to the relatively simple pairwise ranking paradigm with a linear scoring function (Joachims 2002). Direct optimization of listwise and setwise metrics in presence of latent variables is left as future work. In the rest of this section, we will describe three approaches to train and deploy the score combination network.

4.4.1 Features for score combination

The score combination module shown at the bottom of Fig. 4 uses a feature vector \(\phi \) to describe the match between query q, each candidate query interpretation \(I \in {\mathcal {I}}\) and each candidate answer entity \(e_2 \in {\mathcal {E}}_2\). These features are informed by three role-differentiated convolutional networks (described in detail in the preceding subsections). These are

  • the query-corpus network (QCN) described in Sect. 4.1,

  • the query-type network (QTN) described in Sect. 4.2, and

  • the query-relation network (QRN) described in Sect. 4.3.

As in AQQU, we also include additional features such as entity tagger scores. Table 5 shows the complete list of features.

Table 5 Features used by the score combination network

4.4.2 Single interpretation (AQQUCN-1)

In some data sets like SimpleQuestions (Bordes et al. 2015), each query, by construction, can be answered from the KG alone, using exactly one interpretation. This is largely true of WebQuestions (Berant et al. 2013) as well. As we report later, the post-facto single best (‘silver’, because the ‘gold’ interpretation is not provided) interpretation retrieves the gold entity set with accuracy much higher than any system. Therefore, all that remains is to try to infer the silver interpretation. This is exactly what AQQU attempts to do. AQQU first aggregates \(S(I,e_2)\) over all candidates \(e_2\) to get a per-interpretation score, which is used to rank them and choose the top interpretation. Training is provided by comparing the observed F1 scores of competing candidate interpretations. Our resulting system, AQQUCN-1, is similar to AQQU, except that we use convnets and draw on corpus information. The entity set retrieved by the single interpretation \({\tilde{I}}\) can then be sorted by decreasing \(S({\tilde{I}},e)\) for ranking, if needed.
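Given the score matrix \(S(I,e_2)\), this single-interpretation strategy amounts to the following sketch (a simplification of the actual AQQU ranker, which also uses interpretation-level features):

```python
def rank_with_single_interpretation(S):
    """S maps each interpretation I to {candidate entity e2: score S(I, e2)}.
    Aggregate scores per interpretation, keep the top interpretation, and rank
    the entities it retrieves by their scores (AQQUCN-1)."""
    best_I = max(S, key=lambda I: sum(S[I].values()))
    ranked = sorted(S[best_I], key=S[best_I].get, reverse=True)
    return best_I, ranked
```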

4.4.3 Allowing a limited number of interpretations (AQQUCN-FEW)

The assumption that a single interpretation can recall all relevant answer entities may not be valid in all situations. In particular, as we shall report later, interpretations derived from both KG and corpus can be necessary to cover the gold response entities.

In a set \({\mathcal {E}}_2\) of candidate entities \(e_2\), if the score of each \(e_2\) is determined by its best supporting interpretation, then the number of distinct interpretations supporting the candidate set \({\mathcal {E}}_2\) may approach \(|{\mathcal {E}}_2|\) itself. In the next section, we will allow that to happen freely. In this section, we will take a small step to generalize one interpretation to a limited number of interpretations.

Suppose the universe of available interpretations is \({\mathcal {I}}\), from which we can admit \({\mathcal {I}}' \subseteq {\mathcal {I}}\) (Footnote 7), with \(|{\mathcal {I}}'| \le K'\), while scoring all the candidate entities. The score of an entity is thereby restricted from \(\max _{I\in {\mathcal {I}}} S(I,e_2)\) to \(S(e_2) = \max _{I\in {\mathcal {I}}'} S(I,e_2)\). We are given the set of gold (relevant) entities \({\mathcal {E}}^+_2\). Let irrelevant candidates be called \({\mathcal {E}}^-_2\). Then we want \(S(e^+_2) > S(e^-_2)\) for any pair \(e^+_2\in {\mathcal {E}}^+_2, e^-_2 \in {\mathcal {E}}^-_2\), which is turned into a hinge loss \([S(e^-_2) + \varDelta - S(e^+_2)]_+\), where \(\varDelta \) is a margin hyperparameter and \([\bullet ]_+ \equiv \max \{0,\bullet \}\) is the hinge or ReLU operator. Summarizing, the loss we seek to minimize during training is

$$\begin{aligned} \min _{{\mathcal {I}}' \subseteq {\mathcal {I}}, |{\mathcal {I}}'|\le K'} \sum _{e^+_2 \in {\mathcal {E}}^+_2} \sum _{e^-_2 \in {\mathcal {E}}^-_2} \left[ \max _{I\in {\mathcal {I}}'} S(I, e^-_2) + \varDelta - \max _{I\in {\mathcal {I}}'} S(I, e^+_2) \right] _+. \end{aligned}$$
(1)

During inference, we do not know \({\mathcal {E}}^+_2, {\mathcal {E}}^-_2\). Therefore we find

$$\begin{aligned} {\mathcal {I}}^*&= {\mathop {{{\,\mathrm{argmax}\,}}}\limits _{{{\mathcal {I}}}' \subseteq {\mathcal {I}}, |{\mathcal {I}}'|\le K'}} \sum _{e_2 \in {\mathcal {E}}_2} \max _{I\in {\mathcal {I}}'} S(I, e_2) \end{aligned}$$
(2)

and then sort candidate entities \(e_2\) by decreasing \(\max _{I \in {\mathcal {I}}^*} S(I,e_2)\). Both expressions take time exponential in \(K'\) to evaluate, but we expect \(K'\) to be very small, usually under 3 (set heuristically). While we can try to optimize expression (1) directly to learn model parameters inside \(S(I,e_2)\), the objective is highly nonconvex. We found it better to use the technique in the next section for training and use expression (2) for inference.
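Since \(K'\) is tiny, inference rule (2) can be implemented by brute-force enumeration; a sketch follows, in which \((I, e_2)\) pairs not covered by an interpretation are treated as score 0 (an assumption of this sketch only).

```python
from itertools import combinations

def choose_interpretations(S, k_prime):
    """Pick the subset I' of interpretations, |I'| <= K', maximizing
    sum over e2 of max_{I in I'} S(I, e2), as in Eq. (2).
    S maps interpretation -> {candidate entity -> score}."""
    all_candidates = {e2 for scores in S.values() for e2 in scores}
    best_subset, best_value = (), float("-inf")
    for size in range(1, k_prime + 1):
        for subset in combinations(S, size):
            value = sum(max(S[I].get(e2, 0.0) for I in subset)
                        for e2 in all_candidates)
            if value > best_value:
                best_subset, best_value = subset, value
    return best_subset

def rank_by_subset(S, subset):
    """Rank candidates by their best score over the admitted interpretations."""
    candidates = {e2 for I in subset for e2 in S[I]}
    score = {e2: max(S[I].get(e2, 0.0) for I in subset) for e2 in candidates}
    return sorted(candidates, key=score.get, reverse=True)
```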

4.4.4 Allowing unlimited supporting interpretations (AQQUCN-ALL)

If a candidate entity \(e_2\) is supported by multiple interpretations I, a reasonable view is that the overall score of \(e_2\) is \(\max _I S(I,e_2)\), from the best supporting interpretation, which induces a ranking among candidate \(e_2\)s. The set of gold \(e_2^+\)s is then used to define a loss and train the combination network. We use a pairwise loss, comparing, for a fixed query q, a relevant entity \(e_2^+\) with an irrelevant entity \(e_2^-\):

$$\begin{aligned} \max _{e_1^+, t_2^+, r^+} w \cdot \phi (q, e_1^+, t_2^+, r^+; e_2^+) + \xi _{q, e_2^+, e_2^-}&\ge \varDelta + \max _{e_1^-, t_2^-, r^-}w \cdot \phi (q, e_1^-, t_2^-, r^-; e_2^-) \end{aligned}$$
(3)

where \(\varDelta \) is a margin hyperparameter. Note that the best supporting \(e_1^+, t_2^+, r^+\) for \(e_2^+\) may be different from the best supporting \(e_1^-, t_2^-, r^-\) for \(e_2^-\).

  • \({\mathcal {Q}}\) is the set of queries, and q is one query.

  • \(e_2^+\) is a relevant entity, \(e_2^-\) is an irrelevant entity, for query q.

  • \(\phi \) is the feature vector (see Sect. 4.4.1) representing an interpretation, composed of \(e_1\) (one or more entities mentioned in query q), r (relation mentioned or hinted at in q), \(t_2\) (type mentioned or hinted at in q) and \(e_2\) (candidate answer entity). \(\phi \) incorporates inputs from the three convnets.

  • \(\xi \) is a vector of non-negative slack variables.

  • C is a balancing regularization parameter.

  • w is the weight vector to be learnt.

The max in the LHS of constraint (3) leads to nonconvexity, which we address by introducing auxiliary variables \(u(q, e_1^+, t_2^+, r^+; e_2^+)\) for each relevant candidate entity in the following optimization.

$$\begin{aligned} \min _{\xi \ge \mathbf {0}, w} &\tfrac{1}{2}\Vert w\Vert _2^2 + \frac{C}{|{\mathcal {Q}}|} \xi \cdot \mathbf {1} \quad {\text {such}} {\text {that}} \nonumber \\ \forall q,e_2^+, e_2^-; e_1^-, t_2^-, r^-:&\quad \sum _{e_1^+, t_2^+, r^+} u(q, e_1^+, t_2^+, r^+; e_2^+)\, w \cdot \phi (q, e_1^+, t_2^+, r^+; e_2^+) \nonumber \\&\quad \quad \ge \varDelta - \xi _{q,e_2^+, e_2^-} + w \cdot \phi (q, e_1^-, t_2^-, r^-; e_2^-) \nonumber \\ \forall q, e_1^+, t_2^+, r^+; e_2^+:&\quad u(q, e_1^+, t_2^+, r^+; e_2^+) \in \{0, 1\} \nonumber \\ \forall q,e_2^+:&\quad \sum _{e_1^+, t_2^+, r^+} u(q, e_1^+, t_2^+, r^+; e_2^+) = 1 \nonumber \\ \forall q,e_2^+,e_2^-:&\quad \xi _{q,e_2^+, e_2^-} \ge 0 \end{aligned}$$
(4)

For tractability, we relax the 0/1 constraint over u variables to the continuous range [0, 1]:

$$\begin{aligned} \forall q, e_1^+, t_2^+, r^+; e_2^+:&\quad u(q, e_1^+, t_2^+, r^+; e_2^+) \in [0, 1] \end{aligned}$$
(5)

The relaxation does not correspond to any discrete interpretation, but is a device to make the optimization tractable. We obtain a local optimum for (4) by alternately updating w and u. Each of these updates is a convex optimization problem that can be solved directly and efficiently. Figure 6 shows the pseudocode for inference in our proposed system. Through optimization (4), AQQUCN integrates query interpretation and entity response ranking into a unified framework, rather than a two-stage compile-and-execute strategy common in other QA systems, which effectively gambles on one best structured interpretation.

Fig. 6

High-level pseudocode for AQQUCN-ALL inference. Also see Fig. 4
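A minimal Python sketch of this inference step, with the feature extractor \(\phi \) and the learnt weight vector w assumed given:

```python
def aqqucn_all_rank(candidates, interpretations_of, phi, w):
    """Score each candidate e2 by its best supporting interpretation,
    score(e2) = max_I w . phi(q, I; e2), and rank candidates by that score.
    `interpretations_of(e2)` yields the interpretations (e1, r, t2) supporting e2,
    and `phi(I, e2)` returns the feature vector of Table 5 for that pair."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    score = {e2: max((dot(w, phi(I, e2)) for I in interpretations_of(e2)),
                     default=float("-inf"))
             for e2 in candidates}
    return sorted(candidates, key=score.get, reverse=True)
```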

4.5 End-to-end versus modular training

In recent years, end-to-end training of complex neural architectures has lost some appeal. Shalev-Shwartz and Shashua (2016) showed that the sample complexity of end-to-end training can be exponentially larger than the modular training of individual stages. Roth (2017) made similar arguments (Footnote 8) about some NLP tasks. With complex notions of ‘match’ (Fig. 1), a commensurately complex network (Fig. 4) to train, and comparatively few training instances \((q, {\mathcal {E}}^*_2)\) that do not come with gold structured interpretations, we, too, chose modular training of QRN, QTN and QCN, followed by training the score combination network. It might be argued that the additional labeled data used to train individual modules renders the competition between various systems unfair. While there is some validity to this protest, many open-domain QA systems already use externally trained word embeddings (Bast and Haußmann 2015; Yih et al. 2015; Xu et al. 2016), externally trained target type recognizers (Murdock et al. 2012; Yavuz et al. 2016), and augmented training from SimpleQuestions (Bordes et al. 2015).

4.6 Set retrieval versus ranking and the threshold module

Choosing \({\mathcal {I}}'\) and then ranking candidate entities is the most natural method to drive AQQUCN. In this mode, AQQUCN can be directly compared against any other entity ranking system, such as the one by Joshi et al. (2014). On the other hand, comparing AQQUCN-FEW and AQQUCN-ALL with other systems that retrieve entity sets is not directly possible, unless the desired size of the entity set were specified, or the ranked list is somehow truncated. One way to approach this is to threshold the ranked list based on some criterion. We explore two thresholding strategies. In the first strategy, we set the score threshold to a fraction x of the top-ranked entity’s score (“relative threshold”); tuned on held-out data, x turned out to be 0.95. In the second strategy (which we refer to as “ideal threshold”) we threshold at the position which results in the best value of F1 that can be extracted from the ranking output by our system. Obviously, the first strategy gives an unfair advantage to existing KBQA systems, whereas the second gives an unfair advantage to our system and merely provides an idealized, non-constructive upper bound on F1.
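The relative-threshold rule, as a sketch:

```python
def relative_threshold(ranked, x=0.95):
    """Truncate a ranked list of (entity, score) pairs, keeping entities whose
    score is at least x times the score of the top-ranked entity."""
    if not ranked:
        return []
    cutoff = x * ranked[0][1]
    return [e for e, s in ranked if s >= cutoff]
```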

5 Experiments and results

5.1 Testbed

KG We used Freebase (Bollacker et al. 2008) as the KG, specifically, the OpenLink Virtuoso snapshot provided with AQQU. It provides 2.9 billion relation facts on 44 million entities. The set of answer types, with about 4976 member types, is also curated and provided with AQQU. It should be possible to adapt other KGs such as WikiData for use with AQQU and AQQUCN.

Annotated corpus We used ClueWeb09B (ClueWeb09 2009) as the corpus. It has about 50 million Web documents in WARC format, the same as the rest of ClueWeb09 and ClueWeb12 (over 500 million documents each). Any corpus in WARC format can be used with AQQUCN, assuming it has useful entity annotations. For reproducibility, we used the public FACC1 entity annotations released by Google (Gabrilovich et al. 2013). The typical document has 13–15 entities annotated. Given enough computational capacity, one can run an entity tagger like TagMe (Ferragina and Scaiella 2010) over the corpus. The number of tokens in each corpus snippet was limited to 20, based on hyperparameter tuning in early versions of the system. We also verified that minor changes to snippet length did not change the results noticeably.

Query sets Joshi et al. (2014) provided syntax-poor translations (CSAW 2018) of syntax-rich queries from TREC and INEX question answering competitions, as well as a fraction of syntax-rich queries from WebQuestions (Berant et al. 2013). This gave us four query sets summarized in Table 6 and called TREC-INEX-KW, TREC-INEX, WebQuestions-KW and WebQuestions.

Table 6 Summary of different query sets

By design, all WebQuestions queries can be answered using the Freebase KG. In contrast, only 57% of TREC-INEX queries can be answered from KG alone under the restriction that \(e_1\) and \(e_2\) lie within two hops. Thus corpus evidence is important for TREC-INEX.

Convnet training protocol Data used to train the convnets is available (CSAW 2018). Some important design choices are described below.

QTN and QRN :

Initial word vectors are learnt using the CNN-non-static version of Kim (2014). Filter sizes are set to 3 and 4 with 150 feature maps each. Drop-out rate is 0.5, with 100 epochs and early stopping using a validation split of 10%. Training is done through stochastic gradient descent over shuffled mini-batches with the Adadelta update.

QCN :

The width of the convolution filters is set to 5, the number of convolutional feature maps to 150, and the batch size to 50. The L2 regularization term is \(10^{-5}\) for the parameters of the convolutional layers and \(10^{-4}\) for all the others. The dropout rate is set to 0.5. We initialize the word vectors using the word embeddings trained by Huang et al. (2012).

Score combination network training protocol All KG paths emanating from a query entity \(e_1\) can contribute to candidate answer set \({\mathcal {E}}_2\). We use the pruning process of (Bast and Haußmann 2015) to restrict the set of KG queries (and hence \({\mathcal {E}}_2\)) to a practical size. We normalize all feature values in [0, 1] before sending them to the score combination stage.

Evaluation protocol We evaluate entity ranking using measures common in Information Retrieval, applied to the response list of entities: mean average precision (MAP), mean reciprocal rank (MRR), and normalized discounted cumulative gain at rank 10 (NDCG@10). For ranking evaluation, the threshold module of Sect. 4.6 is not applied. For set retrieval evaluation, the system output entity set \({\mathcal {E}}_2\) is created by thresholding the ranked list. It is then compared against gold set \({\mathcal {E}}^*_2\) to compute recall, precision and F1.
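For concreteness, simple binary-relevance versions of these ranking measures are sketched below; the actual evaluation scripts may differ in tie-breaking and cutoff details.

```python
import math

def average_precision(ranked, gold):
    """AP of a ranked entity list against the gold set (binary relevance)."""
    hits, precisions = 0, []
    for k, e in enumerate(ranked, start=1):
        if e in gold:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(gold) if gold else 0.0

def reciprocal_rank(ranked, gold):
    """1 / rank of the first relevant entity, or 0 if none is retrieved."""
    return next((1.0 / k for k, e in enumerate(ranked, start=1) if e in gold), 0.0)

def ndcg_at_10(ranked, gold):
    """NDCG@10 with binary gains."""
    dcg = sum(1.0 / math.log2(k + 1)
              for k, e in enumerate(ranked[:10], start=1) if e in gold)
    ideal = sum(1.0 / math.log2(k + 1) for k in range(1, min(len(gold), 10) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```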

5.2 QRN and QTN heatmaps

To understand the workings of QTN and QRN, we used the public implementation of Local Interpretable Model-Agnostic Explanations (LIME)Footnote 9 (Ribeiro et al. 2016). LIME linearly approximates a neural model’s behavior in the vicinity of a particular instance to detect the sensitivity of a label decision to input features. Figure 7 illustrates various sentences, their top predicted classes, and the sensitivity to each query word (positive, neutral, or negative) in predicting that class. The observed polarities and intensities were generally intuitive.

Fig. 7

QRN and QTN heatmaps showing query words that most strongly determine the predicted relations and types

5.3 Entity ranking comparison

The vast majority of KBQA papers report set retrieval accuracy in terms of recall, precision and F1. To our knowledge, only Joshi et al. (2014) report entity ranking accuracy. We compare AQQUCN against their system in Table 7, using standard ranking performance measures: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). In Table 7, we see 5–16% absolute improvement in mean average precision (MAP) over various query sets. The improvement is statistically significant (at \(p < 0.0001\)). Ablation studies in Sect. 5.6 suggest some causes for the improvements.

Table 7 Entity ranking performance comparison with Joshi et al. (2014)

5.4 Effect of number of interpretations allowed

Figure 8 shows interpolated precision against recall, without thresholding, for the variants AQQUCN-1, AQQUCN-FEW and AQQUCN-ALL. The trends on the two datasets are opposites: on TREC-INEX, performance increases with the number of interpretations, but on WebQuestions, allowing more interpretations reduces F1. This makes sense, because 85% of WebQuestions queries can be answered using a single relation (Yao 2015) and without corpus support. AQQUCN-FEW restricts the number of interpretations and limits the damage.

Fig. 8

Interpolated precision versus recall curves for different number of interpretations allowed (without thresholding)

For AQQUCN-1, note that features (2–26) in Table 5 are unique to an interpretation. Consequently, when an entity has no support from QCN, particularly when the entity is rare or absent in the corpus, the scores of entities retrieved by an interpretation are all the same because of identical feature vectors. This situation is not so rare that we can ignore its effects. In such cases, even if AQQUCN-1 retrieves a set of reasonable quality, its ranking results from arbitrary tie-breaking. AQQUCN-ALL and AQQUCN-FEW are largely immune to this problem, because ties are less likely among the entity scores \(\max _I S(I,e_2)\).

Table 8 reports F1 scores for AQQUCN-1, AQQUCN-FEW and AQQUCN-ALL, after applying thresholding. The trends are similar to those shown in Fig. 8. WebQuestions has somewhat larger \({\mathcal {E}}^*_2\) on average (natural queries: 2.4; telegraphic: 2.1), but KG-based interpretations are frequently adequate. Each KG-based interpretation covers more entities, and therefore, relatively fewer interpretations are needed to cover \({\mathcal {E}}^*_2\). In contrast, TREC-INEX has smaller \({\mathcal {E}}^*_2\)s on average (natural queries: 1.5; telegraphic: 1.4). But TREC-INEX depends on corpus-based interpretations, each of which typically covers only one entity. Therefore, TREC-INEX needs more interpretations to get good F1 scores.

Table 8 F1 (after thresholding) comparison with different number of distinct interpretations

Given the almost exclusive focus on WebQuestions and SimpleQuestions in prior work, these important considerations were discovered only after instrumenting AQQUCN. Hereafter, we will refer to the best performing variant among AQQUCN-1, AQQUCN-FEW and AQQUCN-ALL as AQQUCN-Best.

5.5 Entity set retrieval comparison

In Table 9, we compare F1 of entity set retrieval across severalFootnote 10 KBQA systems (CodaLab 2016). AQQUCN is presented with both relative threshold (AQQUCN-Best) and ideal threshold, to separate the quality of ranking versus thresholding. Also quoted is the F1 score achievable in principle if the single best interpretation is used from KG and corpus.

Table 9 F1 comparison with recent KGQA systems

The first striking observation on Table 9 is that, for TREC-INEX, using a single interpretation is a terrible plan; both AQQUCN-Best and AQQUCN-ALL with ideal threshold are far better. Predictably, without access to a corpus, Berant and Liang (2015) and Bast and Haußmann (2015) perform poorly. In sharp contrast, owing to the nature of WebQuestions, a clairvoyant choice of the single best interpretation beats everything else (including Savenkov and Agichtein (2016) with ideal threshold) by a very wide margin. While AQQUCN with ideal threshold far exceeds all other systems, it is not even close to the best single interpretation. In terms of non-clairvoyant achievable accuracies, AQQUCN-Best is visibly best for WebQuestions-KW, and ranked second for WebQuestions. Clearly a great deal of ground has been covered since 2013 for WebQuestions, yet there is plenty of room to improve. It is also clear that AQQUCN is much better at ranking than thresholding.

5.6 Ablation tests

To understand the contributions of the three convnets to the score combination network, we removed each network in turn, re-trained the best performing model, and tabulated the resulting F1 scores in Table 10. TREC-INEX (in both query forms) suffers a serious hit if QCN is removed. This makes sense because of the critical evidence brought in by corpus snippets in case of TREC-INEX. The effect of removing QCN is smaller for WebQuestions, but still visible, showing that, even if gold entities are in the KG, corpus evidence can help score them better.

Table 10 Ablation shows the relative effectiveness of the QCN, QTN and QRN networks

The influence between QRN and QTN is more nuanced. The relation r involved in a query may assert strong selectional preferences on the types of participating entities. Therefore, an accurate r predicted by the QRN can often make up for mistakes made in predicting \(t_2\) by the QTN. Conversely, accurately predicting \(t_2\) may mitigate a misleading choice of r. Overall, though, Table 10 shows at least some performance reduction if either QRN or QTN is removed. For the terse queries in WebQuestions-KW, removing QTN hurts more than removing QCN.

5.7 Wins and losses

We performed a side-by-side analysis of a sample of queries for which we found our system to be doing better and worse than related work. Our system improved on some queries containing qualifiers such as ‘first’, ‘oldest’, since we harness signals from the text corpus. For example, Who was the first U.S. president ever to resign? can be translated to a complex graph query involving a max/sort over dates, making it difficult to interpret it using only the knowledge graph. Yih et al. (2015) handled some of these queries using extensively hand-engineered features. However, such information was readily found in the Web corpus (examples in Fig. 2). The corpus also helped when the KG was incomplete (e.g., president sworn on airplane) or for answering queries with no clear \(e_1\) (e.g., which kennedy died first?).

We performed worse on some queries, especially when the corpus signal added more noise than information and our type or relation CNNs were not able to narrow down to the correct answer. Some of these queries had a non-trivial syntactic structure, possibly not captured by the corpus. For example, What nation is home to the Kaaba?. At times, high annotation density in the corpus promoted popular non-answer entities over not-so-popular answer entities. For the query creator of the daily show, \({{\texttt {Jon\_Stewart}}}\) ranked above \({{\texttt {Madeleine\_Smithberg}}}\), purely based on corpus popularity. Such cases highlight an opportunity to improve our corpus, type and relation CNNs.

6 Conclusion

We presented AQQUCN, a system that unifies structured interpretation of queries with ranking of response entities. Apart from seamlessly integrating corpus and KG information, AQQUCN has two salient features: it can deal with the full spectrum of query styles between keyword queries and well-formed questions; and it directly ranks response entities, rather than ‘compile’ the input to a structured query and execute that on the KG alone.