LiCQA: A Lightweight Complex Question Answering System
Abstract.
Over the last twenty years, significant progress has been made in designing and implementing Question Answering (QA) systems. However, addressing complex questions, the answers to which are spread across multiple documents, remains a challenging problem. Recent QA systems that are designed to handle complex questions either work on the basis of knowledge graphs, or utilise contemporary neural models that are expensive to train, in terms of both computational resources and the volume of training data required. In this paper, we present LiCQA, an unsupervised question answering model that works primarily on the basis of corpus evidence. We empirically compare the effectiveness and efficiency of LiCQA with two recently presented QA systems, which are based on different underlying principles. The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data, with a noteworthy reduction in latency.
1. Introduction
Open-domain question answering (QA), the task of providing direct answers to (factoid) questions, has remained an active area of research within the Information Retrieval (IR) and Natural Language Processing (NLP) communities. Most recent work in this area belongs to one of two paradigms: extracting answers from structured knowledge bases (KBs) or knowledge graphs (KGs), and extracting answers from unstructured text documents. While KBs are high-quality sources of organised information, they also have important limitations: real-life KBs that are up-to-date, and have both deep and broad coverage, remain a holy grail. Much less technical machinery is required in order to contribute information to, and maintain, predominantly unstructured textual resources like the Web in general, and Wikipedia in particular. For the near future, therefore, traditional QA architectures based on text retrieval and analysis appear to be the paradigm of choice for flexible, open-domain QA. In this study, we present one such system, LiCQA. The three notable features of LiCQA that together distinguish it from most other recently proposed QA systems are: (i) its unsupervised nature, (ii) the ability to handle complex questions, and (iii) its efficiency. These are discussed in greater detail below.
Unsupervised nature. An unsupervised QA system attempts to answer a user-question Q given only the question, and a corpus C (Lewis et al., 2019). In contrast, many of the best-known, modern QA systems are supervised in nature (Zhao et al., 2020; Yang et al., 2018; Clark and Gardner, 2018; Rajpurkar et al., 2016). Given a large number of training samples, each consisting of a question, a textual passage containing the answer, and the answer itself, such systems generally "learn" to extract the answer from a given passage that is known to contain the answer.
An unsupervised QA system like LiCQA has some practical advantages over supervised systems. First, it is concerned with the more general problem of finding answers to a question from a collection of documents, rather than finding the precise answer to a question from a piece of text that contains the answer. Second, it eliminates the need for large volumes of human-annotated training data. Finally, state-of-the-art methods for supervised QA systems often employ end-to-end Deep Learning architectures that are computationally expensive to develop, and which need GPUs for effective deployment. While modules used within the QA pipeline in LiCQA do involve supervised / deep learning techniques for particular sub-tasks (e.g., a Part of Speech (POS) tagger, trained on text annotated with POS tags, and sentence embeddings), we bypass the need for the kind of computational resources required by end-to-end supervised QA systems.
Handling complex questions. A query is regarded as complex if it involves multiple entities, and multiple relationships between them, e.g.,
Which Nolan films won an Oscar, but missed a Golden Globe?
In contrast, a query like Who won the Turing Award in 1970? would be regarded as "simple". Answers to complex questions will often need to be synthesised by combining evidence from multiple documents. LiCQA is specifically designed to answer such complex queries. None of the labeled datasets that are commonly used for training and testing the recent supervised QA systems mentioned above contains an adequate number of complex questions, which comprise our primary focus.
Lightweight. To the best of our knowledge, QUEST (Lu et al., 2019), a recently-proposed system, represents the state-of-the-art for unsupervised, complex QA systems. For an empirical comparison, therefore, we chose QUEST as a baseline, along with DrQA (Chen et al., 2017). Our experimental results suggest that LiCQA is computationally much less expensive than the chosen baselines; yet, it achieves significantly better performance on the datasets used.
Summary of contributions. In sum, this paper describes a lightweight, unsupervised QA system for complex questions. An empirical comparison with two recent, state-of-the-art baselines suggests that the proposed system achieves significant improvements in terms of both effectiveness and efficiency. While the techniques presented in this report are simple to implement, we make the source code available to aid reproducibility, comparisons with other similar systems, and further research (https://github.com/souravsaha/licqa).
2. Background and Related Work
Early research on large-scale QA generally considered the task of finding answers to factoid queries from a corpus of unstructured documents (Voorhees, 1999). The traditional strategy established by these early methods (Singhal et al., 1999; Moldovan et al., 1999; Srihari and Li, 2000) consisted of retrieving a set of paragraphs/ documents for a given question, and subjecting them to more detailed analysis (e.g., by extracting and scoring named entities (NEs) from them). In (Lin and Katz, 2006), the authors argued that the intent behind document retrieval for QA and traditional searches can be different. Further, they proposed a synthetic test collection for QA tasks.
More recently, the structured and interlinked information provided by Knowledge Graphs and Knowledge Bases has been applied to question answering (Ahn et al., 2004; Buscaldi and Rosso, 2006; Cai and Yates, 2013; Berant et al., 2013). However, the incompleteness of KGs has restricted their use in building general purpose QA systems. To overcome this limitation, researchers have successfully used unstructured text documents to supplement KGs for QA (Savenkov and Agichtein, 2016; Fader et al., 2013, 2014).
Most recent work on QA has focused on utilising deep learning models for the task (Dong et al., 2015; Tan et al., 2018). Some of these efforts have also resulted in the creation of benchmark QA models (Berant et al., 2013; Usbeck et al., 2017; Chen et al., 2017; Sawant et al., 2019). However, these models require significant training data, and are also computationally expensive. DrQA (Chen et al., 2017) is one of these state-of-the-art QA systems. It combines a TF-IDF based document retriever with a multilayered bi-LSTM network that encodes the paragraphs from the top retrieved documents based on word embeddings, before training two classifiers to identify the starting and ending locations of the potential answers. Nonetheless, it is still distantly supervised on the SQuAD, TREC Questions, Web Questions, and WikiMovies datasets.
Although they have proven to be effective for factoid question answering, most of the QA systems discussed above are designed for questions that have an answer contained within a single document. These techniques often fail to retrieve the correct answer for complex questions, where the answers involve multiple entities spread across several documents (Abujabal et al., 2017; Bast and Haussmann, 2015; Khot et al., 2017; Bhutani et al., 2020; Talmor and Berant, 2018). A method for quantitative evaluation of interactive QA systems was proposed in (Wacholder et al., 2007). The proposed paradigm, called RUTS, incorporates real users, tasks and systems; unlike the framework presented in this article, RUTS is applicable to the evaluation of user-centered information-access systems.
Recently, an unsupervised method, QUEST, has been proposed by (Lu et al., 2019) for answering complex questions. Unlike the other methods mentioned above, QUEST does not depend upon any annotated datasets for training. It finds answers by building and analysing a quasi knowledge graph, constructed from documents retrieved by a search engine in response to the question. However, the creation of the KG on the fly leads to significant execution latency in finding the answers to a given question.
3. LiCQA: A lightweight complex question answering model
LiCQA uses a standard pipeline (shown in Figure 1) to process questions. First, the question type is determined. Concurrently, a set of documents that are expected to contain the answer is retrieved using keywords from the question. Since answers to most questions are entities, entities are extracted from these documents as potential answers. These entities are then scored, and finally returned in the form of a ranked list. In the following subsections, each stage of the pipeline is described in detail.
3.1. Question Type Classification
The question classifier (QC) module tries to classify a given question according to the predicted type of its answer. For example, consider the question: "Which actor played in Troy and Seven?" Its answer is a named entity of the PERSON type. Similarly, the answer to "Where in New Zealand is the Tomb of the Unknown Warrior located?" is a PLACE-type entity. QC is generally considered an essential early stage in traditional QA pipelines (Moldovan et al., 2002), and has been the focus of much research (Silva et al., 2010; Blunsom et al., 2006; Tayyar Madabushi and Lee, 2016; Mohasseb et al., 2018). For this stage of LiCQA, we experimented with both traditional ("shallow") and deep learning based methods, discussed below.
Traditional machine learning based classifier. Support Vector Machines (SVMs) are reported to perform well for question classification (Silva et al., 2010). Following (Li et al., 2008) and (Li and Roth, 2002), we construct the feature vector of a question $Q$ using:
- (I) all named entities (NEs) in $Q$, and
- (II) the lemma and POS tag of each word in $Q$.
Consider the numbers of distinct NEs, distinct lemmata, distinct lemma-bigrams, distinct POS tags, and distinct POS tag bigrams that all the questions in a test collection together contain. The feature vector is then a binary vector whose dimensionality is the sum of these five counts, with one dimension per distinct feature (see Figure 1). A linear SVM was then trained using the 5500 annotated questions provided by (Li and Roth, 2002).
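The construction of such a binary feature vector can be sketched as follows. This is only an illustrative sketch: the function names, the string prefixes used to keep feature families apart, and the hand-annotated toy input are assumptions, and a real system would obtain the NEs, lemmata and POS tags from the corresponding taggers.

```python
def question_features(entities, lemmas, pos_tags):
    """Collect the string-valued features of one question (names are hypothetical)."""
    feats = {f"NE={e}" for e in entities}
    feats |= {f"LEM={l}" for l in lemmas}
    feats |= {f"LEM2={a}_{b}" for a, b in zip(lemmas, lemmas[1:])}
    feats |= {f"POS={p}" for p in pos_tags}
    feats |= {f"POS2={a}_{b}" for a, b in zip(pos_tags, pos_tags[1:])}
    return feats


def build_vocabulary(feature_sets):
    """Assign one vector index to every distinct feature seen in the collection."""
    return {f: i for i, f in enumerate(sorted(set().union(*feature_sets)))}


def to_binary_vector(feats, vocab):
    """1 where the question has the feature, 0 elsewhere."""
    vec = [0] * len(vocab)
    for f in feats:
        if f in vocab:
            vec[vocab[f]] = 1
    return vec


# Toy, hand-annotated questions (assumed annotations, for illustration only).
q1 = question_features(
    ["Troy", "Seven"],
    ["which", "actor", "play", "in", "troy", "and", "seven"],
    ["WDT", "NN", "VBD", "IN", "NNP", "CC", "NNP"],
)
q2 = question_features(
    ["Apple", "Pixar"],
    ["who", "found", "apple", "and", "pixar"],
    ["WP", "VBD", "NNP", "CC", "NNP"],
)
vocab = build_vocabulary([q1, q2])
v1 = to_binary_vector(q1, vocab)
v2 = to_binary_vector(q2, vocab)
```

The length of each vector equals the number of distinct features in the whole collection, which is why the dimensionality is a sum of per-family counts.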
Neural classifiers. We also try a neural classifier for this stage. A question $Q$ is encoded using the Universal Sentence Encoder (USE) (Cer et al., 2018), a state-of-the-art sentence encoder that is reported to work well for question classification (Reimers and Gurevych, 2019). This encoder embeds $Q$ into a 256-dimension vector space by passing the average embedding of word-level unigrams and bigrams present in $Q$ to DAN, a deep feed-forward averaging network. Next, the embedding for $Q$ is fed to a feed-forward neural network with one dense layer, and a softmax layer at the output level. The network is trained on the same set of 5500 questions as above, using the Adam optimizer, with ReLU as the activation function, and cross entropy as the loss function.
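To make the shape of this classifier concrete, the forward pass (one dense layer with ReLU, followed by a softmax output) and the cross-entropy loss can be sketched in plain Python. This is a minimal sketch under stated assumptions: the three-dimensional "embedding", the weights and the two output classes are hypothetical toy values, and training with Adam is omitted entirely.

```python
import math


def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])


def classify(embedding, w_hidden, b_hidden, w_out, b_out):
    hidden = relu(dense(embedding, w_hidden, b_hidden))
    return softmax(dense(hidden, w_out, b_out))


# Hypothetical toy values; a real question embedding has far more dimensions.
embedding = [0.5, -1.0, 2.0]
w_hidden = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b_hidden = [0.0, 0.0]
w_out = [[1.0, 0.0], [0.0, 1.0]]
b_out = [0.0, 0.0]

probs = classify(embedding, w_hidden, b_hidden, w_out, b_out)
loss = cross_entropy(probs, target_index=1)
```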
3.2. Answer Extraction and Filtering
A given question $Q$ is submitted as a query to a standard IR engine (or search service) to retrieve $D$, a set of documents, from a given collection (or the open Web). Each document is preprocessed: accented characters are replaced by their non-accented equivalents using unicodedata from the Python standard library; and contractions like can't are expanded to cannot.
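The preprocessing step can be sketched as below. The use of `unicodedata` follows the description above; the contraction table is a tiny illustrative subset, and the function names are hypothetical rather than taken from the released code.

```python
import re
import unicodedata

# Small illustrative subset (an assumption); a full system would use a longer list.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "don't": "do not"}
_CONTRACTION_RE = re.compile(
    "|".join(re.escape(c) for c in CONTRACTIONS), flags=re.IGNORECASE
)


def strip_accents(text):
    # NFKD splits an accented character into its base character plus combining
    # marks; dropping the combining marks leaves the unaccented form.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_contractions(text):
    # Note: the replacement is lower-case, so sentence-initial capitals are lost.
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def preprocess(text):
    text = text.replace("\u2019", "'")  # normalise the typographic apostrophe
    return expand_contractions(strip_accents(text))
```

For instance, `preprocess("Beyoncé can't come")` yields `"Beyonce cannot come"`.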
Next, we extract the set of NEs present in the retrieved documents using Flair (Akbik et al., 2019), a recent method based on contextualised embeddings (where a single word is mapped to different embeddings, corresponding to the different sentential contexts in which the word occurs). Each entity is tagged using a label from the OntoNotes 5 tagset (Weischedel et al., 2013).
| Answer Type in (Li and Roth, 2002) | OntoNotes 5 Entity Type |
|---|---|
| HUMAN | PERSON |
| LOCATION | GPE, LOC, ORG |
| ENTITY | NORP, FAC, PRODUCT, EVENT, LANGUAGE, LAW, WORK_OF_ART |
| NUMERIC | DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, CARDINAL |
| money | MONEY |
| date | DATE |
| group | ORG |
The final step within this stage involves selecting only entities whose label (or type) matches the expected answer type for the current question. Using Flair for NE extraction poses an operational problem, however. The tags assigned to candidate answers by Flair and the tags assigned by the QC module (trained using labels defined by (Li and Roth, 2002)) are different. To circumvent this issue, we use a handcrafted table (Table 1) to map the labels provided in (Li and Roth, 2002) to their equivalents in OntoNotes 5. Thus, for a question of type LOCATION, all entities labelled GPE, LOC or ORG are selected as candidate answers.
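The type-based filtering described above amounts to a dictionary lookup. The sketch below encodes the mapping of Table 1; the function name and the representation of tagged entities as `(text, label)` pairs are assumptions made for illustration.

```python
# Mapping from the answer types of (Li and Roth, 2002) to OntoNotes 5 labels (Table 1).
ANSWER_TYPE_TO_ONTONOTES = {
    "HUMAN": {"PERSON"},
    "LOCATION": {"GPE", "LOC", "ORG"},
    "ENTITY": {"NORP", "FAC", "PRODUCT", "EVENT", "LANGUAGE", "LAW", "WORK_OF_ART"},
    "NUMERIC": {"DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"},
    "money": {"MONEY"},
    "date": {"DATE"},
    "group": {"ORG"},
}


def filter_candidates(tagged_entities, answer_type):
    """Keep only entities whose NE label is compatible with the predicted answer type."""
    allowed = ANSWER_TYPE_TO_ONTONOTES.get(answer_type, set())
    return [text for text, label in tagged_entities if label in allowed]


# Hypothetical tagger output for a handful of entities.
tagged = [
    ("Brad Pitt", "PERSON"),
    ("Troy", "WORK_OF_ART"),
    ("2004", "DATE"),
    ("Wellington", "GPE"),
]
people = filter_candidates(tagged, "HUMAN")
places = filter_candidates(tagged, "LOCATION")
```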
3.3. Answer Scoring
The next step involves scoring the selected entities (denoted by $E$), based on their presumed relevance to the given question $Q$. From the example questions cited above (e.g., "Which actor played in Troy and Seven?"), it is clear that the answer entity itself is unlikely to occur in $Q$, but contexts in which the answer occurs and is recognisable as such (e.g., "Pitt's portrayal of Achilles in Troy (2004)", "Detective Mills in Seven", etc.) will probably have a substantial overlap with $Q$. Accordingly, the key idea in our scoring approach involves measuring the semantic similarity between $Q$ and the sentences in which a candidate answer occurs.
To keep computational costs down (in keeping with our objective of designing a system that would be usable in near real-time), we first compute the document frequency (df) of each candidate entity $e \in E$, i.e., the number of documents in the retrieved set $D$ in which $e$ occurs, as follows.
$$\mathrm{df}(e) = |\{\, d \in D : e \text{ occurs in } d \,\}| \qquad (1)$$
If $E$ contains more than 100 entities, these entities are ranked in decreasing order of df, and only the top 100 are retained for further processing. (This number was fixed at 100 so as to avoid yet another tunable parameter in our system.)
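The document-frequency pruning step can be sketched as follows. The input format (one list of entity strings per retrieved document) and the alphabetical tie-breaking, which is added only to make the output deterministic, are assumptions of this sketch.

```python
from collections import Counter


def document_frequency(doc_entities):
    """doc_entities: one collection of entity strings per retrieved document."""
    df = Counter()
    for entities in doc_entities:
        df.update(set(entities))  # count each entity at most once per document
    return df


def top_by_df(df, limit=100):
    # Decreasing df; ties broken alphabetically purely for determinism (assumption).
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [entity for entity, _ in ranked[:limit]]


docs = [
    ["Brad Pitt", "Brad Pitt", "Eric Bana"],
    ["Brad Pitt"],
    ["Morgan Freeman", "Brad Pitt"],
]
df = document_frequency(docs)
kept = top_by_df(df, limit=2)
```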
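Next, we extract from the retrieved documents $D$ all sentences in which any candidate entity $e \in E$ occurs. Let $S_e$ denote the sentences in $D$ that contain $e$, and let $S = \bigcup_{e \in E} S_e$. The match between any $s \in S$ and the question $Q$ is measured using $\mathrm{sim}(Q, s)$, the cosine similarity between the embeddings of $Q$ and $s$. Two types of embeddings were tried.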
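A minimal sketch of this step is given below: collecting the sentences that mention a candidate, and computing the cosine similarity between two embedding vectors. The case-insensitive substring match and the function names are simplifying assumptions; the embeddings themselves would come from a sentence encoder.

```python
import math


def sentences_containing(entity, sentences):
    """Naive case-insensitive substring match (an assumption of this sketch)."""
    needle = entity.lower()
    return [s for s in sentences if needle in s.lower()]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


sentences = [
    "Brad Pitt played Achilles in Troy.",
    "Troy was released in 2004.",
    "Pitt also starred in Seven.",
]
mentions = sentences_containing("Brad Pitt", sentences)
```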
InferSent. In this method, proposed by (Conneau et al., 2017), pre-trained 300-dimensional GloVe word embeddings are used to obtain vector representations of the words in a sentence. A bidirectional LSTM (Bi-LSTM) network is then trained on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015). Max pooling over the hidden states of the network yields a 4096-dimensional encoding of each sentence.
Sentence-BERT. This method fine-tunes a pretrained BERT (Devlin et al., 2019) / RoBERTa (Liu et al., 2019) model based on the input sentences, and applies mean pooling over the output vectors to create the sentence embedding. As discussed in (Reimers and Gurevych, 2019), it is significantly faster than BERT and yields better sentence representations.
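Finally, we need to compute a single score for each candidate answer $e$ by aggregating the similarity scores $\mathrm{sim}(Q, s)$ of all the sentences $s$ in $S_e$, the set of sentences that contain $e$. Three aggregation methods were tried.
- Simple averaging: In this case, the score of $e$, denoted by $\text{avg-score}(e)$, is given by
$$\text{avg-score}(e) = \frac{1}{|S_e|} \sum_{s \in S_e} \mathrm{sim}(Q, s) \qquad (2)$$
- Averaging over the best matching sentence from each document: For each document $d$ in which $e$ occurs, we first pick the sentence containing $e$ that has the maximum similarity with $Q$ (see Equation 3).
$$m_d(e) = \max_{s \in S_e,\; s \in d} \mathrm{sim}(Q, s) \qquad (3)$$
We then compute the average of these scores over all such $d$, where $D_e$ denotes the set of documents in which $e$ occurs.
$$\text{avg-maxscore}(e) = \frac{1}{|D_e|} \sum_{d \in D_e} m_d(e) \qquad (4)$$
- Considering only the single best match: In this case, we consider the score of the best matching sentence in $S_e$ as the score of $e$.
$$\text{max-score}(e) = \max_{s \in S_e} \mathrm{sim}(Q, s) \qquad (5)$$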
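The three aggregation schemes can be sketched as follows. The input is assumed to be a dictionary that maps each document containing the candidate to the list of question-sentence similarities for the sentences of that document which mention the candidate; this representation, and the choice to average only over documents that contain the candidate, are assumptions of the sketch.

```python
def avg_score(sims_by_doc):
    """Mean similarity over every sentence that mentions the candidate."""
    all_sims = [s for sims in sims_by_doc.values() for s in sims]
    return sum(all_sims) / len(all_sims)


def avg_max_score(sims_by_doc):
    """Mean, over documents, of the best-matching sentence in each document."""
    best_per_doc = [max(sims) for sims in sims_by_doc.values()]
    return sum(best_per_doc) / len(best_per_doc)


def max_score(sims_by_doc):
    """Similarity of the single best-matching sentence overall."""
    return max(s for sims in sims_by_doc.values() for s in sims)


# Hypothetical similarities for one candidate that occurs in two documents.
sims = {"doc1": [0.2, 0.8], "doc2": [0.4]}
```

With these toy values the three schemes give roughly 0.47, 0.6 and 0.8 respectively, which illustrates how max-score rewards a single strongly matching context.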
3.4. Answer Ranking
The final ranking of the answers is based on an arithmetic combination of the similarity score obtained above and the (normalised) document frequency of the entities. This was motivated by observations from the query expansion / relevance feedback literature that document frequency is a useful factor for selecting / weighting expansion terms. To combine the two factors, we tried both weighted addition and simple multiplication.
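- Weighted addition: The semantic similarity score and the normalized df of a candidate $e$, written here as $\mathrm{simscore}(e)$ and $\mathrm{ndf}(e)$, were combined as follows.
$$\mathrm{score}(e) = \alpha \cdot \mathrm{simscore}(e) + \beta \cdot \mathrm{ndf}(e) \qquad (6)$$
Here $\alpha$ and $\beta$ are parameters to be tuned.
- Simple multiplication: In this variant, we simply multiplied the semantic similarity with the normalized df.
$$\mathrm{score}(e) = \mathrm{simscore}(e) \times \mathrm{ndf}(e) \qquad (7)$$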
Finally, a list of the entities corresponding to the five highest combined scores is returned as the set of answers for the question. This list is only partially ordered, since LiCQA can assign the same rank (one through five) to all entities that obtain the same combined score.
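The two combination rules and the tie-preserving top-five selection can be sketched as below. The weight names `alpha` and `beta`, the function names, and the example scores are hypothetical; how the document frequency is normalised is not specified here, so the sketch simply takes an already normalised value as input.

```python
def combine_weighted(similarity, normalised_df, alpha, beta):
    """Weighted addition of the two factors (weight names are hypothetical)."""
    return alpha * similarity + beta * normalised_df


def combine_product(similarity, normalised_df):
    """Simple multiplication of the two factors."""
    return similarity * normalised_df


def top_k_with_ties(scores, k=5):
    """Return one group of entities per rank, for the k highest distinct scores.

    Entities with exactly the same combined score share a rank, so the
    result is only partially ordered.
    """
    distinct = sorted(set(scores.values()), reverse=True)[:k]
    return [sorted(e for e, s in scores.items() if s == value) for value in distinct]


# Hypothetical combined scores for five candidates.
scores = {"a": 0.9, "b": 0.9, "c": 0.5, "d": 0.4, "e": 0.4}
ranked = top_k_with_ties(scores, k=2)
```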
4. Experimental Setup
4.1. Baselines
To compare the effectiveness of LiCQA with methods having diverse working principles, we adopt two state-of-the-art complex question answering models, specifically QUEST (Lu et al., 2019) and DrQA (Chen et al., 2017), as our baselines. Two additional graph-based question answering models, BFS (Suchanek et al., 2009) and ShortestPaths (Lu et al., 2019), are also considered for comparing the performance of LiCQA.
LiCQA is designed for answering complex questions (having answers that need to be constructed from multiple documents) in a real-time application (for which computational cost is an essential factor). Therefore, neither end-to-end QA systems that are heavily supervised, nor computationally expensive systems based on recent deep neural models (e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019)) would be appropriate as baselines for our studies. Note, however, that we do use distilled versions of some of these models in some stages of our pipeline, and compare their effect with that of other alternatives.
4.2. Datasets
Datasets that have been commonly used to evaluate QA systems include SQuAD (Rajpurkar et al., 2016), Natural Questions (Kwiatkowski et al., 2019) and WikiMovies (Miller et al., 2016). As mentioned before, these datasets are not focused on complex questions, and are designed for supervised QA systems. In order to experiment with various types of complex questions, and to allow a fair comparison with our chosen baselines, we use the same dataset as QUEST. Specifically, we use two sets of questions, WikiAnswers (CQ-W) and Google Trends (CQ-T), for evaluation. Each set contains 150 complex questions of varying difficulty.
Document collection. Recall from Figure 1 that one of the first steps in LiCQA involves performing a keyword-based retrieval from a corpus with the question to obtain a set of documents containing the potential answers. In principle, we treat the open Web as our target document collection, and use a search service like the Google API to retrieve documents. However, the Web is dynamic. To ensure reproducibility, the dataset provided by (Lu et al., 2019) includes the 10 top-ranked documents that they actually retrieved in the course of their experiments for each query. We refer to this set (comprising the 10 top-ranked documents for each question) as Top10. Given Top10, we skip the retrieval step during our experiments, and start by analysing this set of documents.
To minimize the effect of the QA components apparently built into Google's search engine and to simulate variations in the search algorithm, five additional document collections were constructed using stratified sampling from the top documents returned by Google. These collections are named Strata-1 through Strata-5. These collections also contain 10 documents per query, with a fixed percentage of the documents selected from within the top 10 ranks, a second percentage from documents ranked 11 through 25, and the remainder from ranks 26 to 50. Table 2 shows the three percentages chosen for each of these collections.
| Collection | % from ranks 1-10 | % from ranks 11-25 | % from ranks 26-50 |
|---|---|---|---|
| Top10 | 100 | 0 | 0 |
| Strata-1 | 60 | 30 | 10 |
| Strata-2 | 50 | 40 | 10 |
| Strata-3 | 50 | 30 | 20 |
| Strata-4 | 40 | 40 | 20 |
| Strata-5 | 40 | 30 | 30 |
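The construction of these collections can be sketched as follows, using the percentages of Table 2. Whether documents were drawn uniformly at random within each rank band is not stated, so the uniform sampling, the fixed seed and the function name below are assumptions.

```python
import random

# Percentages taken from ranks 1-10, 11-25 and 26-50 respectively (Table 2).
STRATA = {
    "Top10": (100, 0, 0),
    "Strata-1": (60, 30, 10),
    "Strata-2": (50, 40, 10),
    "Strata-3": (50, 30, 20),
    "Strata-4": (40, 40, 20),
    "Strata-5": (40, 30, 30),
}


def sample_collection(ranked_docs, percentages, size=10, seed=0):
    """ranked_docs: at least 50 documents in rank order; returns `size` documents."""
    rng = random.Random(seed)
    bands = [ranked_docs[0:10], ranked_docs[10:25], ranked_docs[25:50]]
    picked = []
    for band, pct in zip(bands, percentages):
        # Uniform sampling without replacement inside each band (assumption).
        picked.extend(rng.sample(band, size * pct // 100))
    return picked


ranked_docs = list(range(1, 51))  # stand-in document ids, identified by rank
picked = sample_collection(ranked_docs, STRATA["Strata-3"])
```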
4.3. Evaluation metrics
Following standard practice (Lu et al., 2019; Christmann et al., 2019), we evaluate QA systems in terms of precision at rank 1 (P@1), mean reciprocal rank (MRR), and hit at 5 (Hit@5). The commonly used definitions of these metrics are intended for use with a ranked list that does not have any ties. However, QA systems (including QUEST and LiCQA) sometimes do produce ties, i.e., multiple answers that are assigned the same rank. In fact, this would be the correct response to some questions (like Which Nobel Laureate was in the Manhattan Project?) for which multiple correct answers are possible (Niels Bohr, Enrico Fermi, etc.). In addition to the above metrics, therefore, we use tie-aware versions of these metrics (McSherry and Najork, 2008; Saha et al., 2022), namely tMRR, tP@1 and tHit@5, for a more realistic comparison of different systems.
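One common way to make such metrics tie-aware is to take their expected value when ties are broken uniformly at random. The sketch below follows that idea for a single question; it is an illustration under that assumption and may differ in detail from the definitions in the cited works. A ranked list is represented as a list of groups, each group holding the answers tied at that rank, and all function names are hypothetical.

```python
from math import comb


def tie_aware_p_at_1(groups, correct):
    """Probability that the answer placed first after tie-breaking is correct."""
    first = groups[0]
    return sum(a in correct for a in first) / len(first)


def tie_aware_hit_at_k(groups, correct, k=5):
    """Probability that at least one correct answer lands in the top k positions."""
    start = 0  # number of positions consumed by earlier groups
    for group in groups:
        slots = k - start
        if slots <= 0:
            return 0.0
        n = len(group)
        c = sum(a in correct for a in group)
        if n <= slots:
            if c > 0:
                return 1.0  # the whole group fits inside the top k
        else:
            # Only `slots` of the n tied answers make the cut.
            return 1.0 - comb(n - c, slots) / comb(n, slots)
        start += n
    return 0.0


def tie_aware_rr(groups, correct):
    """Expected reciprocal rank of the first correct answer."""
    start = 0
    for group in groups:
        n = len(group)
        c = sum(a in correct for a in group)
        if c > 0:
            # P(first correct answer is the j-th of the group) = C(n-j, c-1) / C(n, c)
            return sum(
                comb(n - j, c - 1) / comb(n, c) / (start + j) for j in range(1, n + 1)
            )
        start += n
    return 0.0


groups = [["a", "b"], ["c"]]  # "a" and "b" are tied at rank 1
correct = {"a"}
```

Averaging `tie_aware_rr` over all questions would give a tie-aware MRR under these assumptions.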
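| Type classification | Sentence embedding | Scoring | Ranking | tMRR | tP@1 | tHit@5 |
|---|---|---|---|---|---|---|
| Traditional (SVM) | InferSent | max-score | | 0.412 | 0.282 | 0.633 |
| Traditional (SVM) | InferSent | max-score | | 0.406 | 0.275 | 0.626 |
| Traditional (SVM) | InferSent | avg-maxscore | | 0.123 | 0.063 | 0.242 |
| Traditional (SVM) | InferSent | avg-maxscore | | 0.124 | 0.063 | 0.242 |
| Traditional (SVM) | InferSent | avg-score | | 0.387 | 0.253 | 0.620 |
| Traditional (SVM) | InferSent | avg-score | | 0.386 | 0.253 | 0.620 |
| Traditional (SVM) | Sent_BERT | max-score | | 0.411 | 0.280 | 0.633 |
| Traditional (SVM) | Sent_BERT | max-score | | 0.398 | 0.280 | 0.613 |
| Traditional (SVM) | Sent_BERT | avg-maxscore | | 0.116 | 0.060 | 0.216 |
| Traditional (SVM) | Sent_BERT | avg-maxscore | | 0.118 | 0.060 | 0.220 |
| Traditional (SVM) | Sent_BERT | avg-score | | 0.408 | 0.260 | 0.640 |
| Traditional (SVM) | Sent_BERT | avg-score | | 0.393 | 0.260 | 0.613 |
| Neural (USE) | InferSent | max-score | | 0.407 | 0.282 | 0.626 |
| Neural (USE) | InferSent | max-score | | 0.401 | 0.275 | 0.620 |
| Neural (USE) | InferSent | avg-maxscore | | 0.117 | 0.056 | 0.208 |
| Neural (USE) | InferSent | avg-maxscore | | 0.117 | 0.056 | 0.225 |
| Neural (USE) | InferSent | avg-score | | 0.385 | 0.260 | 0.613 |
| Neural (USE) | InferSent | avg-score | | 0.385 | 0.260 | 0.613 |
| Neural (USE) | Sent_BERT | max-score | | 0.399 | 0.273 | 0.620 |
| Neural (USE) | Sent_BERT | max-score | | 0.389 | 0.273 | 0.600 |
| Neural (USE) | Sent_BERT | avg-maxscore | | 0.106 | 0.050 | 0.226 |
| Neural (USE) | Sent_BERT | avg-maxscore | | 0.104 | 0.050 | 0.206 |
| Neural (USE) | Sent_BERT | avg-score | | 0.389 | 0.246 | 0.620 |
| Neural (USE) | Sent_BERT | avg-score | | 0.387 | 0.260 | 0.600 |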
4.4. LiCQA configuration
Other than the parameters involved within the various commodity components used in our pipeline, the only parameters in LiCQA are the two weights used in the weighted-addition ranking (see Equation 6). These were varied over a range of values in fixed steps.
| Metric | System | Top10 | Strata-1 | Strata-2 | Strata-3 | Strata-4 | Strata-5 |
|---|---|---|---|---|---|---|---|
| MRR | ShortestPaths | 0.240 | 0.261 | 0.249 | 0.237 | 0.259 | 0.270 |
| MRR | BFS | 0.249 | 0.256 | 0.266 | 0.212 | 0.219 | 0.254 |
| MRR | DrQA | 0.226 | 0.237 | 0.257 | 0.256 | 0.215 | 0.248 |
| MRR | QUEST | 0.355 | 0.380 | 0.340 | 0.302 | 0.356 | 0.318 |
| MRR | LiCQA | 0.432∗§†‡ | 0.424∗§†‡ | 0.431∗§†‡ | 0.420∗§†‡ | 0.426∗§†‡ | 0.416∗§†‡ |
| MRR | % improvement | 21.6% | 11.5% | 26.7% | 39% | 19.6% | 30.8% |
| P@1 | ShortestPaths | 0.147 | 0.173 | 0.193 | 0.140 | 0.147 | 0.187 |
| P@1 | BFS | 0.160 | 0.167 | 0.193 | 0.113 | 0.100 | 0.147 |
| P@1 | DrQA | 0.184 | 0.199 | 0.221 | 0.215 | 0.172 | 0.200 |
| P@1 | QUEST | 0.268 | 0.315 | 0.262 | 0.216 | 0.258 | 0.216 |
| P@1 | LiCQA | 0.293∗† | 0.280§ | 0.320∗§ | 0.273∗§ | 0.306∗§† | 0.280§ |
| P@1 | % improvement | 9.32% | -11.1% | 22.13% | 26.3% | 18.6% | 29.6% |
| Hit@5 | ShortestPaths | 0.347 | 0.367 | 0.387 | 0.327 | 0.393 | 0.340 |
| Hit@5 | BFS | 0.360 | 0.353 | 0.347 | 0.327 | 0.327 | 0.360 |
| Hit@5 | DrQA | 0.313 | 0.315 | 0.322 | 0.322 | 0.303 | 0.340 |
| Hit@5 | QUEST | 0.376 | 0.396 | 0.356 | 0.344 | 0.401 | 0.358 |
| Hit@5 | LiCQA | 0.646∗§†‡ | 0.673∗§†‡ | 0.633∗§†‡ | 0.653∗§†‡ | 0.626∗§†‡ | 0.640∗§†‡ |
| Hit@5 | % improvement | 71.8% | 69.9% | 77.8% | 89.8% | 56.1% | 78.7% |
| Metric | System | Top10 | Strata-1 | Strata-2 | Strata-3 | Strata-4 | Strata-5 |
|---|---|---|---|---|---|---|---|
| MRR | ShortestPaths | 0.266 | 0.224 | 0.248 | 0.219 | 0.232 | 0.222 |
| MRR | BFS | 0.287 | 0.256 | 0.265 | 0.259 | 0.219 | 0.201 |
| MRR | DrQA | 0.355 | 0.330 | 0.356 | 0.369 | 0.365 | 0.380 |
| MRR | QUEST | 0.467 | 0.436 | 0.426 | 0.460 | 0.409 | 0.384 |
| MRR | LiCQA | 0.443∗§ | 0.407∗§ | 0.472∗§†‡ | 0.476∗§†‡ | 0.422∗§ | 0.468∗§†‡ |
| MRR | % improvement | -5.1% | -6.6% | 10.8% | 3.4% | 3.1% | 21.8% |
| P@1 | ShortestPaths | 0.190 | 0.140 | 0.160 | 0.160 | 0.150 | 0.130 |
| P@1 | BFS | 0.210 | 0.170 | 0.180 | 0.180 | 0.140 | 0.130 |
| P@1 | DrQA | 0.286 | 0.267 | 0.287 | 0.300 | 0.287 | 0.320 |
| P@1 | QUEST | 0.394 | 0.360 | 0.347 | 0.377 | 0.333 | 0.288 |
| P@1 | LiCQA | 0.306∗ | 0.260∗ | 0.346∗§ | 0.340∗§ | 0.273∗§ | 0.320∗§ |
| P@1 | % improvement | -22.3% | -27.7% | -0.2% | -9.8% | -18.0% | 11.1% |
| Hit@5 | ShortestPaths | 0.350 | 0.320 | 0.340 | 0.310 | 0.330 | 0.290 |
| Hit@5 | BFS | 0.380 | 0.360 | 0.370 | 0.360 | 0.310 | 0.320 |
| Hit@5 | DrQA | 0.453 | 0.440 | 0.473 | 0.487 | 0.480 | 0.480 |
| Hit@5 | QUEST | 0.531 | 0.496 | 0.510 | 0.500 | 0.503 | 0.459 |
| Hit@5 | LiCQA | 0.646∗§†‡ | 0.633∗§†‡ | 0.680∗§†‡ | 0.673∗§†‡ | 0.640∗§†‡ | 0.686∗§†‡ |
| Hit@5 | % improvement | 21.6% | 27.6% | 33.3% | 34.6% | 27.2% | 49.4% |
5. Result and Discussions
Recall that the LiCQA pipeline contains four components for which multiple options were explored. We first present a comprehensive comparison across all the different combinations to determine what works best. This also allows us to carefully analyse the effect of the choice for any individual module, by fixing the choices for the other modules. In the next subsection, we present an overall comparison between LiCQA and the baselines mentioned in the previous section. While this does involve selecting the most effective design for the pipeline, it does not amount to retrospectively training parameters on the test data. Finally, by way of a qualitative comparison with QUEST, as well as failure analysis, we provide an anecdotal discussion of the results for some specific questions.
5.1. Analysis of pipeline components
To verify the importance of each component and the robustness of LiCQA, we vary the different modules and check the performance. Table 3 shows the performance of different stages of the pipeline for the CQ-W question set. The performance variation with questions from CQ-T was similar; hence, for the sake of brevity, we report performance on the CQ-W question set only. We choose the best module in each stage and report the performance of LiCQA.
Comparing the overall performance of the question type classifiers, we note that the SVM-based classifier works slightly better than the neural classifier. After extracting the set of sentences from the retrieved documents, the next stage of LiCQA involves scoring the entities in those sentences, which consists of a semantic and a statistical comparison. The semantic similarity is computed between the question and the extracted sentences containing entities of the same type as the expected answer. Of the two sentence encoding techniques that we experimented with, InferSent works better than Sent_BERT in a majority of cases, although their performance is mostly comparable. As part of the statistical comparison between the question and the extracted sentences containing the NEs, max-score (Equation 5) is observed to be superior to the other scoring functions (Equations 2 and 4), irrespective of the other components. In the final phase, LiCQA ranks the entities based on the computed similarity scores. The best performance is observed when the normalized document frequency is multiplied with the score of the individual entities (Equation 7). Overall, the best performance is obtained when the SVM-based classifier is used with InferSent as the sentence embedding technique, while scoring and ranking of the entities are performed using max-score and simple multiplication (Equation 7), respectively. Hence, we report the performance of LiCQA based on this configuration and compare the result with the baselines.
5.2. End-to-end evaluation
5.2.1. Evaluation using MRR, P@1 and Hit@5
Tables 4 and 5 compare the overall performance of LiCQA and the baselines for both question sets, WikiAnswers (CQ-W) and Google Trends (CQ-T), in terms of MRR, P@1 and Hit@5. We observe that LiCQA achieved the best performance across all metrics for CQ-W; it yields an improvement of 21.6% and 71.8% over the next best system (QUEST) in terms of MRR and Hit@5, respectively, for the Top10 dataset. Similar trends are observed for Strata-1 through Strata-5. The improvements over ShortestPaths, BFS and the strong baselines, QUEST and DrQA, are statistically significant (paired t-test) for almost all document sets.
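For readers unfamiliar with the significance test mentioned above, the paired t statistic over per-question scores can be sketched as follows. The per-question scores are hypothetical, the function name is an assumption, and the resulting statistic would still need to be compared against a t distribution with the returned degrees of freedom at whatever significance level is chosen.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for per-question scores of two systems."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical per-question reciprocal ranks for two systems.
system_a = [1.0, 0.5, 1.0, 0.2, 1.0]
system_b = [0.5, 0.5, 0.2, 0.2, 0.0]
t_value, dof = paired_t_statistic(system_a, system_b)
```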
For CQ-T, LiCQA achieves the best Hit@5 across document sets, but QUEST does consistently better for P@1, and also achieves better MRR for the Top10 and Strata-1 sets. It turns out, however, that the difference between QUEST and LiCQA is not statistically significant in these cases.
Across query and document sets, LiCQA's ability to consistently retrieve at least one correct answer within the top 5 ranks is particularly encouraging. In this regard, it is successful for at least 63% of the questions, significantly more than the next best system (usually QUEST).
Question: which country borders mali and ghana?
| Rank | Answers |
|---|---|
| 1 | countries three countries |
| 2 | central bank of west african states africa ohada |
| 3 | attack |
| 4 | accra |
| 5 | burkina faso burkina faso burkina |
| Metric | System | CQ-W Top10 | CQ-W Strata-1 | CQ-W Strata-2 | CQ-W Strata-3 | CQ-W Strata-4 | CQ-W Strata-5 | CQ-T Top10 | CQ-T Strata-1 | CQ-T Strata-2 | CQ-T Strata-3 | CQ-T Strata-4 | CQ-T Strata-5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tMRR | DrQA | 0.226 | 0.237 | 0.257 | 0.256 | 0.215 | 0.248 | 0.355 | 0.330 | 0.356 | 0.369 | 0.365 | 0.380 |
| tMRR | QUEST | 0.067 | 0.116 | 0.066 | 0.067 | 0.077 | 0.055 | 0.065 | 0.075 | 0.066 | 0.076 | 0.069 | 0.057 |
| tMRR | LiCQA | 0.412†‡ | 0.411†‡ | 0.408†‡ | 0.398†‡ | 0.418†‡ | 0.399†‡ | 0.433‡ | 0.388†‡ | 0.443†‡ | 0.455†‡ | 0.399†‡ | 0.434†‡ |
| tMRR | % improvement | 82.3% | 73.4% | 58.7% | 55.4% | 94.4% | 60.8% | 21.9% | 17.5% | 24.4% | 23.3% | 9.31% | 14.2% |
| tP@1 | DrQA | 0.184 | 0.199 | 0.221 | 0.215 | 0.172 | 0.200 | 0.286 | 0.267 | 0.287 | 0.300 | 0.287 | 0.320 |
| tP@1 | QUEST | 0.040 | 0.093 | 0.043 | 0.042 | 0.053 | 0.034 | 0.043 | 0.042 | 0.036 | 0.048 | 0.035 | 0.031 |
| tP@1 | LiCQA | 0.282†‡ | 0.270‡ | 0.292‡ | 0.255‡ | 0.303†‡ | 0.262‡ | 0.296‡ | 0.242‡ | 0.312‡ | 0.320‡ | 0.250‡ | 0.290‡ |
| tP@1 | % improvement | 53.2% | 35.6% | 32.1% | 18.6% | 76.1% | 31.0% | 3.4% | -9.3% | 8.7% | 6.6% | -12.8% | -9.3% |
| tHit@5 | DrQA | 0.313 | 0.315 | 0.322 | 0.322 | 0.303 | 0.340 | 0.453 | 0.440 | 0.473 | 0.487 | 0.480 | 0.480 |
| tHit@5 | QUEST | 0.155 | 0.194 | 0.144 | 0.144 | 0.162 | 0.120 | 0.154 | 0.169 | 0.147 | 0.165 | 0.149 | 0.121 |
| tHit@5 | LiCQA | 0.640†‡ | 0.655†‡ | 0.614†‡ | 0.649†‡ | 0.620†‡ | 0.635†‡ | 0.646†‡ | 0.620†‡ | 0.657†‡ | 0.662†‡ | 0.616†‡ | 0.664†‡ |
| tHit@5 | % improvement | 104.4% | 107.9% | 90.6% | 101.5% | 104.6% | 86.7% | 42.6% | 40.9% | 38.9% | 35.9% | 28.3% | 38.3% |
5.2.2. Evaluation using tMRR, tP@1 and tHit@5
Of the various systems whose performance we seek to compare, QUEST and LiCQA both produce ranked lists of answers that may contain ties. The figures reported above are thus likely to over-estimate the accuracy of these two systems. Table 6 shows an example explaining this possibility. In the example, the answers returned by QUEST contain 20 entities at rank 2, all of which are incorrect, but because a correct answer is retrieved at rank 5 (before ties are broken), the values of traditional MRR and Hit@5 are 0.2 and 1, respectively.
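The over-estimation described above can be reproduced with a small sketch that treats each tied group as a single rank, as the conventional metrics implicitly do. Only the size of the rank-2 group (20 incorrect entities) and the position of the correct answer (rank 5) follow the example; the remaining group sizes, the placeholder entity names and the function names are assumptions.

```python
def conventional_rr(groups, correct):
    """Reciprocal rank when every tied group counts as one rank."""
    for rank, group in enumerate(groups, start=1):
        if any(a in correct for a in group):
            return 1.0 / rank
    return 0.0


def conventional_hit_at_k(groups, correct, k=5):
    """1 if any of the first k ranks (groups) contains a correct answer."""
    return 1.0 if any(a in correct for g in groups[:k] for a in g) else 0.0


def flat_position_range(groups, correct):
    """Best and worst list position of the first correct answer once ties are broken."""
    start = 0
    for group in groups:
        c = sum(a in correct for a in group)
        if c > 0:
            return start + 1, start + len(group) - c + 1
        start += len(group)
    return None


groups = [
    ["x1"],                          # rank 1 (placeholder)
    [f"y{i}" for i in range(20)],    # rank 2: twenty tied, incorrect entities
    ["z1"],                          # rank 3 (placeholder)
    ["w1"],                          # rank 4 (placeholder)
    ["burkina faso"],                # rank 5: the correct answer
]
correct = {"burkina faso"}
```

Under these assumed group sizes the correct answer is credited with rank 5, although a user reading the flattened list would only reach it at position 24.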
In this section, therefore, we compare the performance of the best systems using tie-aware versions of the above evaluation measures, viz., , and . The values of these metrics for DrQA, QUEST and are shown in Table 7. Note that DrQA, ShortestPaths and BFS return single answers at each rank; the values of the conventional and tie-aware metrics would be the same for these methods. We have excluded the relatively weak baselines โ ShortestPaths and BFS โ from the table.
Once again, the results show that that consistently outperforms both DrQA and QUEST in terms of , as well as on both sets of questions. The table also confirms that traditional metrics grossly over-report the accuracy of QUEST. When evaluated using tie-aware measures, DrQAโs performance is observed to be significantly better than QUESTโs. In hindsight, this is unsurprising, since QUEST produces lists that often contain many different entities at a particular rank. Naturally, the improvement of over QUEST is statistically significant for all three metrics. More importantly, the and figures for are, almost without exception, also statistically significantly higher than the corresponding figures for DrQA.
Figure 2 graphically compares the performance of our system, DrQA and QUEST using both conventional and tie-aware metrics on CQ-W and CQ-T for the Top10 collection. Each plot contains five bars that correspond, respectively, to the classical and tie-aware metrics for QUEST (in khaki and olive), DrQA (teal) and our system (in skyblue and steelblue). As noted before, DrQA always retrieves a single entity at any rank, so both metrics have the same value. From the figure, it is clear that the skyblue bars are generally the tallest, except for P@1 on CQ-T, where QUEST performed the best. For the tie-aware metrics, our system (steelblue bar) consistently achieves the best performance in all cases.
| Question Set | Question | Answer | Our system's rank | QUEST rank | DrQA rank |
|---|---|---|---|---|---|
| CQ-W | What movie starred Bruce Willis and Haley Joel Osment? | The Sixth Sense | 1 | 5 | 3 |
| CQ-W | Who founded Apple and Pixar? | Steve Jobs | 1 | 2 | 5 |
| CQ-T | Which 2018 studio album is performed by Playboi Carti and features Nicki Minaj? | Die Lit | 1 | 5 | 2 |
| CQ-T | Which American professional golfer is married to Taylor Dowd Simpson and played for Wake Forest University? | Simpson | 1 | 5 | 5 |
5.3. Qualitative comparison
Next, we take a closer look at the relative performance of QUEST and our system. Figures 3 and 4 show the per-query performance difference between the two systems on the Top10 dataset. In these figures, a bar above the x-axis indicates that our system performed better than QUEST for the corresponding question. Figures 3a–3c and 4a–4c show that our system performs better for a majority of the questions on both sets. Note that, for a single question, the values of P@1 and Hit@5 are binary (0 or 1), so the difference in these values can only be -1, 0, or 1.
In contrast, the tie-aware measures have non-binary values. The per-query differences with respect to these metrics are shown in Figures 3d–3f and 4d–4f, for CQ-W and CQ-T, respectively. Once again, it is clear from these figures that our system outperforms QUEST for most questions.
Some example questions for which our system was able to retrieve the correct answer at the top rank, but QUEST and DrQA failed to do so, are presented in Table 8.
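The per-query comparison can be illustrated with a small sketch that computes the difference in binary metrics between the proposed system and QUEST for the four questions in Table 8. It assumes that the ranks listed in Table 8 are conventional ranks of the correct answer; the helper names (`p_at_1`, `hit_at_5`, `per_query_difference`) are hypothetical and are not taken from the paper.

```python
# Rank of the correct answer per question, from Table 8:
# (proposed system, QUEST, DrQA), keyed by the gold answer.
table8_ranks = {
    "The Sixth Sense": (1, 5, 3),
    "Steve Jobs": (1, 2, 5),
    "Die Lit": (1, 5, 2),
    "Simpson": (1, 5, 5),
}

def p_at_1(rank):
    """1 if the correct answer is at rank 1, else 0."""
    return 1 if rank == 1 else 0

def hit_at_5(rank):
    """1 if the correct answer is within the top five ranks, else 0."""
    return 1 if rank <= 5 else 0

def per_query_difference(ranks, metric):
    """Metric value of the proposed system minus that of QUEST, per question."""
    return {q: metric(ours) - metric(quest) for q, (ours, quest, _) in ranks.items()}

p1_diff = per_query_difference(table8_ranks, p_at_1)
hit5_diff = per_query_difference(table8_ranks, hit_at_5)
for question in table8_ranks:
    print(question, p1_diff[question], hit5_diff[question])
```

Because these four questions were selected as cases where only the proposed system ranks the answer first, they are not representative of the full distribution plotted in Figures 3 and 4.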
5.4. Latency analysis
Finally, we compare our system with DrQA and QUEST in terms of the average time taken to produce the list of answers for a given set of questions. Each system is run in a loop for five iterations, and computes the answers for all questions in CQ-W and CQ-T in every iteration. The average of the time taken over these five iterations is reported in Table 9.
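The measurement procedure described above can be sketched as follows. This is only an illustration under stated assumptions: `answer` is a hypothetical callable standing in for any of the three systems, and the choice of `time.perf_counter` as the timer is ours; the paper does not describe its timing code or state the unit of the reported figures.

```python
import time
from statistics import mean

def average_latency(answer, questions, iterations=5):
    """Mean wall-clock time (in seconds) needed to answer every question,
    averaged over `iterations` full passes through the question set."""
    totals = []
    for _ in range(iterations):
        start = time.perf_counter()
        for question in questions:
            answer(question)  # hypothetical QA system call
        totals.append(time.perf_counter() - start)
    return mean(totals)

# Usage with a stand-in system that does no real work.
seen = []
elapsed = average_latency(seen.append, ["q1", "q2", "q3"])
print(f"{elapsed:.6f} s per pass over {len(seen) // 5} questions")
```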
All our experiments were conducted on a machine with an Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz, a 24 GB Nvidia Titan RTX GPU, and 132 GB RAM. Among the methods, QUEST takes the longest to produce its results, spending a significant amount of time creating the quasi knowledge graph from the retrieved documents. DrQA's latency is close to that of QUEST. In contrast, the execution time of our system is significantly lower than that of both baselines. More specifically, Table 9 shows that our system is roughly 7 to 9 times faster than DrQA, the faster of the two baselines, in the total running time taken to process the two question sets for all the document sets (Top10 and Strata-1 to Strata-5).
| System | Query set | Top10 | Strata-1 | Strata-2 | Strata-3 | Strata-4 | Strata-5 |
|---|---|---|---|---|---|---|---|
| Our system | CQ-W | 13 | 15 | 17 | 14 | 16.5 | 14.3 |
| QUEST | CQ-W | 144 | 148 | 151 | 156 | 149 | 162 |
| DrQA | CQ-W | 120 | 131 | 119 | 124.6 | 128 | 129 |
| speedup | CQ-W | 9.23 | 8.73 | 7 | 8.9 | 7.75 | 9.02 |
| Our system | CQ-T | 14.5 | 16 | 18 | 15.8 | 18 | 17.6 |
| QUEST | CQ-T | 155 | 159 | 162 | 178 | 189.5 | 195.4 |
| DrQA | CQ-T | 130 | 135 | 138.6 | 132.8 | 142 | 137 |
| speedup | CQ-T | 8.96 | 8.43 | 7.7 | 8.40 | 7.88 | 7.78 |
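As a quick consistency check on Table 9, the sketch below recomputes the speedup row as the ratio of a baseline's running time to that of the proposed system. The table does not state which baseline the speedup refers to; the arithmetic suggests DrQA, since every reported value agrees with the DrQA ratio to within 0.01. The dictionary layout and the key `"ours"` are our own naming, not the paper's.

```python
# Running times from Table 9, in column order:
# Top10, Strata-1, Strata-2, Strata-3, Strata-4, Strata-5.
latency = {
    "CQ-W": {
        "ours": [13, 15, 17, 14, 16.5, 14.3],
        "QUEST": [144, 148, 151, 156, 149, 162],
        "DrQA": [120, 131, 119, 124.6, 128, 129],
        "reported_speedup": [9.23, 8.73, 7, 8.9, 7.75, 9.02],
    },
    "CQ-T": {
        "ours": [14.5, 16, 18, 15.8, 18, 17.6],
        "QUEST": [155, 159, 162, 178, 189.5, 195.4],
        "DrQA": [130, 135, 138.6, 132.8, 142, 137],
        "reported_speedup": [8.96, 8.43, 7.7, 8.40, 7.88, 7.78],
    },
}

def speedup(baseline_times, our_times):
    """Element-wise ratio of baseline running time to our running time."""
    return [b / o for b, o in zip(baseline_times, our_times)]

for query_set, rows in latency.items():
    vs_drqa = speedup(rows["DrQA"], rows["ours"])
    vs_quest = speedup(rows["QUEST"], rows["ours"])
    print(query_set, "vs DrQA: ", [round(s, 2) for s in vs_drqa])
    print(query_set, "vs QUEST:", [round(s, 2) for s in vs_quest])
```

Since QUEST is slower than DrQA in every column, the speedup over QUEST is larger than the reported figures throughout.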
6. Conclusion and Future Work
In this paper, we present a low-resource, unsupervised, complex question answering system that uses various forms of corpus evidence. To evaluate it, experiments were conducted on diverse question sets taken from public forums like WikiAnswers and Google Trends. Our system was compared with the state-of-the-art systems DrQA and QUEST using MRR, P@1, Hit@5, and their tie-aware variants. It achieved significant improvements over all baselines in a majority of cases, while simultaneously reducing computational cost (as measured by the average time taken to answer a question).
Recently developed neural models for representing complex concepts via low dimensional vectors (Shalaby et al., 2019; Yamada et al., 2020) have been successfully applied to different text processing and information extraction tasks (Gruzdo et al., 2020). In future, we plan to explore such embedding models to further improve our answer extraction module.
References
- Automated template generation for question answering over knowledge graphs. In Proc. of 26th WWW, pp. 1191–1200.
- Using Wikipedia at the TREC QA track. In Proc. of Thirteenth TREC, NIST Special Publication, Vol. 500-261.
- Pooled contextualized embeddings for named entity recognition. In NAACL 2019, pp. 724–728.
- More accurate question answering on Freebase. In Proc. of 24th ACM CIKM, pp. 1431–1440.
- Semantic parsing on Freebase from question-answer pairs. In Proc. of 2013 EMNLP, pp. 1533–1544.
- Answering complex questions by combining information from curated and extracted knowledge bases. In Proc. of First Workshop on Natural Language Interfaces, pp. 1–10.
- Question classification with log-linear models. In Proceedings of 29th SIGIR, pp. 615–616.
- A large annotated corpus for learning natural language inference. In Proc. 2015 EMNLP, pp. 632–642.
- Mining knowledge from Wikipedia for the question answering task. In Proc. of Fifth LREC, pp. 727–730.
- Large-scale semantic parsing via schema matching and lexicon extension. In Proc. of 51st ACL, pp. 423–433.
- Universal sentence encoder for English. In Proc. 2018 EMNLP: System Demonstrations, pp. 169–174.
- Reading Wikipedia to answer open-domain questions. In Proceedings of 55th ACL 2017, pp. 1870–1879.
- Look before you hop: conversational question answering over knowledge graphs using judicious context expansion. In Proc. of 28th ACM CIKM, CIKM '19, pp. 729–738.
- Simple and effective multi-paragraph reading comprehension. In Proc. of 56th ACL, pp. 845–855.
- Supervised learning of universal sentence representations from natural language inference data. In Proc. of 2017 EMNLP, pp. 670–680.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL, pp. 4171–4186.
- Question answering over Freebase with multi-column convolutional neural networks. In Proc. of 53rd ACL, pp. 260–269.
- Open question answering over curated and extracted knowledge bases. In The 20th ACM SIGKDD, pp. 1156–1165.
- Paraphrase-driven learning for open question answering. In Proc. of 51st ACL, pp. 1608–1618.
- Application of paragraphs vectors model for semantic text analysis. In Proc. of 4th COLINS, CEUR Workshop Proceedings, Vol. 2604, pp. 283–293.
- Answering complex questions using open information extraction. In Proc. of 55th ACL 2017, pp. 311–316.
- Natural questions: a benchmark for question answering research. TACL.
- Unsupervised question answering by cloze translation. In Proc. of 57th ACL, pp. 4896–4910.
- Classifying what-type questions by head noun tagging. In Proc. of 22nd COLING, pp. 481–488.
- Learning question classifiers. In Proc. of 19th COLING, COLING '02, pp. 1–7.
- Building a reusable test collection for question answering. J. Am. Soc. Inf. Sci. Technol. 57 (7), pp. 851–861. ISSN 1532-2882.
- RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
- Answering complex questions by joining multi-document evidence with quasi knowledge graphs. In Proc. of 42nd SIGIR, SIGIR '19, pp. 105–114. ISBN 9781450361729.
- Computing information retrieval performance measures efficiently in the presence of tied scores. In Proc. of 30th ECIR, ECIR '08, pp. 414–421.
- Key-value memory networks for directly reading documents. In Proc. 2016 EMNLP, pp. 1400–1409.
- Question categorization and classification using grammar based approach. Information Processing & Management 54 (6), pp. 1228–1243. ISSN 0306-4573.
- LASSO: A tool for surfing the answer net. In Proc. of Eighth TREC 1999, Vol. 500-246.
- Performance issues and error analysis in an open-domain question answering system. In Proceedings of 40th ACL, pp. 33–40.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of 2016 EMNLP, pp. 2383–2392.
- Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proc. of 2019 EMNLP, pp. 3982–3992.
- On modifying evaluation measures to deal with ties in ranked lists. In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries, JCDL '22, New York, NY, USA. ISBN 9781450393454.
- When a knowledge base is not enough: question answering over knowledge bases with external text data. In Proc. of 39th SIGIR, pp. 235–244.
- Neural architecture for question answering using a knowledge graph and web corpus. Inf. Retr. 22 (3–4), pp. 324–349. ISSN 1386-4564.
- Beyond word embeddings: learning entity and concept representations from large scale knowledge bases. Inf. Retr. J. 22 (6), pp. 525–542.
- From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review 35 (2), pp. 137–154.
- AT&T at TREC-8. In Proc. of Eighth TREC 1999, Vol. 500-246.
- A question answering system supported by information extraction. In 6th ANLP, Seattle, Washington, USA, pp. 166–172.
- STAR: Steiner-tree approximation in relationship graphs. In 2013 IEEE 29th ICDE, pp. 868–879. ISSN 1084-4627.
- The web as a knowledge-base for answering complex questions. In Proc. of NAACL-HLT, pp. 641–651.
- Context-aware answer sentence selection with hierarchical gated recurrent neural networks. IEEE ACM Trans. Audio Speech Lang. Process. 26 (3), pp. 540–549.
- High accuracy rule-based question classification using question syntax and semantics. In Proc. of the 26th COLING 2016, pp. 1220–1230.
- 7th open challenge on question answering over linked data (QALD-7). pp. 59–69.
- The TREC-8 question answering track report. In Proc. of TREC-8, pp. 77–82.
- A model for quantitative evaluation of an end-to-end question-answering system. J. Assoc. Inf. Sci. Technol. 58 (8), pp. 1082–1099.
- OntoNotes release 5.0. LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium.
- Wikipedia2Vec: an efficient toolkit for learning and visualizing the embeddings of words and entities from Wikipedia. arXiv preprint 1812.06280v3.
- HotpotQA: a dataset for diverse, explainable multi-hop question answering. In EMNLP, pp. 2369–2380.
- Complex factoid question answering with a free-text knowledge graph. In Proc. of Web Conference 2020, WWW '20, pp. 1205–1216. ISBN 9781450370233.