$\mathsf{LiCQA}$ : A Lightweight Complex Question Answering System

Sourav Saha (Indian Statistical Institute, Kolkata, India; sourav.saha_r@isical.ac.in), Dwaipayan Roy (Indian Institute of Science Education and Research, Kolkata, India; dwaipayan.roy@iiserkol.ac.in) and Mandar Mitra (Indian Statistical Institute, Kolkata, India; mandar@isical.ac.in)

Abstract.

Over the last twenty years, significant progress has been made in designing and implementing Question Answering (QA) systems. However, addressing complex questions, the answers to which are spread across multiple documents, remains a challenging problem. Recent QA systems that are designed to handle complex questions either work on the basis of knowledge graphs, or utilise contemporary neural models that are expensive to train, in terms of both computational resources and the volume of training data required. In this paper, we present $\mathsf{LiCQA}$ , an unsupervised question answering model that works primarily on the basis of corpus evidence. We empirically compare the effectiveness and efficiency of $\mathsf{LiCQA}$ with two recently presented QA systems, which are based on different underlying principles. The results of our experiments show that $\mathsf{LiCQA}$ significantly outperforms these two state-of-the-art systems on benchmark data, with a noteworthy reduction in latency.

1. Introduction

Open-domain question answering (QA), the task of providing direct answers to (factoid) questions, has remained an active area of research within the Information Retrieval (IR) and Natural Language Processing (NLP) communities. Most recent work in this area belongs to one of two paradigms: extracting answers from structured knowledge bases (KBs) or knowledge graphs (KGs), and extracting answers from unstructured text documents. While KBs are high-quality sources of organised information, they also have important limitations: real-life KBs that are up-to-date, and have both deep and broad coverage remain a holy grail. Much less technical machinery is required in order to contribute information to, and maintain, predominantly unstructured textual resources like the Web in general, and Wikipedia in particular. For the near future, therefore, traditional QA architectures based on text retrieval and analysis appear to be the paradigm of choice for flexible, open-domain QA. In this study, we present one such system, $\mathsf{LiCQA}$ . The three notable features of $\mathsf{LiCQA}$ that together distinguish it from most other recently proposed QA systems are: (i) its unsupervised nature, (ii) the ability to handle complex questions, and (iii) its efficiency. These are discussed in greater detail below.

Unsupervised nature. An unsupervised QA system attempts to answer a user-question Q given only the question, and a corpus C (Lewis et al., 2019). In contrast, many of the best-known, modern QA systems are supervised in nature (Zhao et al., 2020; Yang et al., 2018; Clark and Gardner, 2018; Rajpurkar et al., 2016). Given a large number of training samples, each consisting of a question, a textual passage containing the answer, and the answer itself, such systems generally ‘learn’ to extract the answer from a given passage that is known to contain the answer.

An unsupervised QA system like $\mathsf{LiCQA}$ has some practical advantages over supervised systems. First, it is concerned with the more general problem of finding answers to a question from a collection of documents, rather than finding the precise answer to a question from a piece of text that contains the answer. Second, it eliminates the need for large volumes of human-annotated training data. Finally, state-of-the-art methods for supervised QA systems often employ end-to-end Deep Learning architectures that are computationally expensive to develop, and which need GPUs for effective deployment. While modules used within the QA pipeline in $\mathsf{LiCQA}$ do involve supervised / deep learning techniques for particular sub-tasks (e.g., a Part of Speech (POS) tagger, trained on text annotated with POS tags, and sentence embeddings), we bypass the need for the kind of computational resources required by end-to-end supervised QA systems.

Handling complex questions. A query is regarded as complex if it involves multiple entities, and multiple relationships between them, e.g.,

Which Nolan films won an Oscar, but missed a Golden Globe?

In contrast, a query like Who won the Turing Award in 1970? would be regarded as ‘simple’. Answers to complex questions will often need to be synthesised by combining evidence from multiple documents. $\mathsf{LiCQA}$ is specifically designed to answer such complex queries. None of the labeled datasets that are commonly used for training and testing the recent supervised QA systems mentioned above contains an adequate number of complex questions, which comprise our primary focus.

Lightweight. To the best of our knowledge, QUEST (Lu et al., 2019), a recently-proposed system, represents the state-of-the-art for unsupervised, complex QA systems. For an empirical comparison, therefore, we chose QUEST as a baseline, along with DrQA (Chen et al., 2017). Our experimental results suggest that $\mathsf{LiCQA}$ is computationally much less expensive than the chosen baselines; yet, it achieves significantly better performance on the datasets used.

Summary of contributions. In sum, this paper describes a lightweight, unsupervised QA system for complex questions. An empirical comparison with two recent, state-of-the-art baselines suggests that the proposed system achieves significant improvements in terms of both effectiveness and efficiency. While the techniques presented in this report are simple to implement, we make the source code available to aid reproducibility, comparisons with other similar systems, and further research (https://github.com/souravsaha/licqa).

2. Background and Related Work

Early research on large-scale QA generally considered the task of finding answers to factoid queries from a corpus of unstructured documents (Voorhees, 1999). The traditional strategy established by these early methods (Singhal et al., 1999; Moldovan et al., 1999; Srihari and Li, 2000) consisted of retrieving a set of paragraphs/documents for a given question, and subjecting them to more detailed analysis (e.g., by extracting and scoring named entities (NEs) from them). In (Lin and Katz, 2006), the authors argued that the intent behind document retrieval for QA can differ from that behind traditional searches. Further, they proposed a synthetic test collection for QA tasks.

More recently, the structured and interlinked information provided by Knowledge Graphs and Knowledge Bases has been applied to question answering (Ahn et al., 2004; Buscaldi and Rosso, 2006; Cai and Yates, 2013; Berant et al., 2013). However, the incompleteness of KGs has restricted their use in building general purpose QA systems. To overcome this limitation, researchers have successfully used unstructured text documents to supplement KGs for QA (Savenkov and Agichtein, 2016; Fader et al., 2013, 2014).

Most recent work on QA has focused on utilising deep learning models for the task (Dong et al., 2015; Tan et al., 2018). Some of these efforts have also resulted in the creation of benchmark QA models (Berant et al., 2013; Usbeck et al., 2017; Chen et al., 2017; Sawant et al., 2019). However, these models require significant training data, and are also computationally expensive. DrQA (Chen et al., 2017) is one of these state-of-the-art QA systems. It combines a TF-IDF based document retriever with a multilayered bi-LSTM network that encodes the paragraphs from the top retrieved documents based on word embeddings before training two classifiers to identify the starting and ending locations of the potential answers. Nonetheless, it still relies on distant supervision using the SQuAD, TREC Questions, Web Questions, and WikiMovies datasets.

Although they have proven to be effective for factoid question answering, most of the QA systems discussed above are designed for questions that have an answer contained within a single document. These techniques often fail to retrieve the correct answer for complex questions, where the answers involve multiple entities spread across several documents (Abujabal et al., 2017; Bast and Haussmann, 2015; Khot et al., 2017; Bhutani et al., 2020; Talmor and Berant, 2018). A method for quantitative evaluation of interactive QA systems was proposed in (Wacholder et al., 2007). The proposed paradigm, called RUTS, incorporates real users, tasks and systems. Unlike the framework presented in this article, RUTS is applicable to the evaluation of user-centered information-access systems.

Recently, an unsupervised method, QUEST, was proposed by (Lu et al., 2019) for answering complex questions. Unlike the other methods mentioned above, QUEST does not depend upon any annotated datasets for training. It finds answers by building and analysing a quasi knowledge graph, constructed from documents retrieved by a search engine in response to the question. However, the creation of the KG on the fly leads to significant execution latency in finding the answers to a given question.

3. $\mathsf{LiCQA}$ : A lightweight complex question answering model

$\mathsf{LiCQA}$ uses a standard pipeline (shown in Figure 1) to process questions. The question type is determined. Concurrently, a set of documents that are expected to contain the answer is retrieved using keywords from the question. Since answers to most questions are entities, they are extracted from these documents as potential answers. These entities are then scored, and finally returned in the form of a ranked list. In the following subsections, each stage of the pipeline is described in detail.

Figure 1. Architecture of the $\mathsf{LiCQA}$ system.

3.1. Question Type Classification

The question classifier (QC) module tries to classify a given question according to the predicted type of its answer. For example, consider the question: “Which actor played in Troy and Seven?” Its answer is a named entity of the PERSON type. Similarly, the answer to “Where in New Zealand is the Tomb of the Unknown Warrior located?” is a PLACE-type entity. QC is generally considered an essential early stage in traditional QA pipelines (Moldovan et al., 2002), and has been the focus of much research (Silva et al., 2010; Blunsom et al., 2006; Tayyar Madabushi and Lee, 2016; Mohasseb et al., 2018). For this stage of $\mathsf{LiCQA}$ , we experimented with both traditional (‘shallow’) and deep learning based methods, discussed below.

Traditional machine learning based classifier. Support Vector Machines (SVMs) are reported to perform well for question classification (Silva et al., 2010). Following (Li et al., 2008) and (Li and Roth, 2002), we construct the feature vector of a question $Q$ using:

(I) all named entities (NEs) in $Q$ , and

(II) the lemma and POS tag of each word in $Q$ .

If all the questions in a test collection together contain $m$ distinct NEs, $n$ distinct lemmata, $n^{\prime}$ distinct lemma-bigrams, $p$ distinct POS tags, and $p^{\prime}$ distinct POS tag bigrams, then the feature vector is a binary vector of $m+n+n^{\prime}+p+p^{\prime}$ dimensions (see Figure 1). A linear SVM was then trained using the 5500 annotated questions provided by (Li and Roth, 2002).
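The construction of the binary feature vector can be illustrated with a small sketch. It assumes that NE recognition, lemmatisation and POS tagging have already been done by some external tool; the function names and the tuple-based feature encoding are hypothetical and are not taken from the released code.

```python
def question_features(entities, tagged_tokens):
    """Return the set of active features for one question.

    entities: NE strings found in the question.
    tagged_tokens: list of (lemma, pos_tag) pairs, in question order.
    """
    lemmas = [lemma for lemma, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]
    feats = {("NE", e) for e in entities}
    feats |= {("LEMMA", l) for l in lemmas}
    feats |= {("LEMMA2", a, b) for a, b in zip(lemmas, lemmas[1:])}
    feats |= {("POS", t) for t in tags}
    feats |= {("POS2", a, b) for a, b in zip(tags, tags[1:])}
    return feats


def build_vocabulary(feature_sets):
    """Assign one dimension to every distinct feature seen in the collection."""
    all_feats = sorted(set().union(*feature_sets))
    return {feat: i for i, feat in enumerate(all_feats)}


def to_binary_vector(feats, vocab):
    """Binary vector with m + n + n' + p + p' dimensions (one per vocabulary entry)."""
    vec = [0] * len(vocab)
    for feat in feats:
        if feat in vocab:
            vec[vocab[feat]] = 1
    return vec


# Toy, hand-tagged questions (the tags are illustrative, not tagger output).
q1 = question_features(["Troy", "Seven"],
                       [("which", "WDT"), ("actor", "NN"), ("play", "VBD")])
q2 = question_features([], [("who", "WP"), ("play", "VBD")])
vocab = build_vocabulary([q1, q2])
v2 = to_binary_vector(q2, vocab)
```

The resulting vectors could then be passed to any linear SVM implementation; that training step is omitted here.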

Neural classifiers. We also try a neural classifier for this stage. A question $Q$ is encoded employing the Universal Sentence Encoder (USE) (Cer et al., 2018), a state-of-the-art sentence encoder that is reported to work well for question classification (Reimers and Gurevych, 2019). This encoder embeds $Q$ into a 256-dimension vector space by passing the average embedding of word-level unigrams and bigrams present in $Q$ to DAN, a deep feed-forward averaging network. Next, the embedding for $Q$ is fed to a feed-forward neural network with one dense layer, and a softmax layer at the output level. The network is trained on the same set of 5500 questions as above, using the Adam optimizer, with ReLU as the activation function, and cross entropy as the loss function.

3.2. Answer Extraction and Filtering

Given a question $Q$ , it is submitted as a query to a standard IR engine (or search service) to retrieve $\mathcal{D}$ , a set of $k$ documents, from a given collection (or the open Web). Each document $D\in\mathcal{D}$ is preprocessed: accented characters are replaced by their non-accented equivalents using unicodedata from the Python standard library; and contractions like can’t are expanded to cannot.
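A minimal sketch of this preprocessing step is given below. The use of `unicodedata` follows the description above; the contraction table is a tiny, hypothetical stand-in (the actual list used by the system is not specified here), and the helper names are illustrative.

```python
import re
import unicodedata

# Hypothetical, deliberately small contraction table; a real system needs a fuller list.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "isn't": "is not"}
_CONTRACTION_RE = re.compile(
    "|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE
)


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_contractions(text):
    """Expand the contractions listed in CONTRACTIONS (output is lower-cased)."""
    text = text.replace("\u2019", "'")  # normalise typographic apostrophes
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def preprocess(text):
    return expand_contractions(strip_accents(text))


cleaned = preprocess("Beyonc\u00e9 can\u2019t attend")
```

Here `cleaned` is the string "Beyonce cannot attend".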

Next, we extract the set of NEs present in $D$ using Flair (Akbik et al., 2019), a recent method based on contextualised embeddings (where a single word is mapped to different embeddings, corresponding to the different sentential contexts in which the word occurs). Each entity is tagged using a label from the OntoNotes 5 tagset (Weischedel et al., 2013).

Answer Type in (Li and Roth, 2002)	OntoNotes 5 Entity Type
HUMAN	PERSON
LOCATION	GPE, LOC, ORG
ENTITY	NORP, FAC, PRODUCT, EVENT, LANGUAGE, LAW, WORK_OF_ART
NUMERIC	DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, CARDINAL
money	MONEY
date	DATE
group	ORG

Table 1. Mapping question / answer types to OntoNotes tags. Answer types in uppercase correspond to top-level categories, while lowercases refer to second-level sub-categories.

The final step within this stage involves selecting only entities whose label (or type) matches the expected answer type for the current question. Using Flair for NE extraction poses an operational problem, however. The tags assigned to candidate answers by Flair and the tags assigned by the QC module (trained using labels defined by (Li and Roth, 2002)) are different. To circumvent this issue, we use a handcrafted table (Table 1) to map the labels provided in (Li and Roth, 2002) to their equivalents in OntoNotes 5. Thus, for a question of type LOCATION, all entities labelled GPE, LOC or ORG are selected as candidate answers.
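The mapping in Table 1 and the filtering step can be written down directly. In the sketch below the dictionary contents come from Table 1, while the function name and the `(text, tag)` input format are assumptions made for illustration.

```python
# Table 1 as a lookup: answer type (Li and Roth, 2002) -> OntoNotes 5 tags.
ANSWER_TYPE_TO_ONTONOTES = {
    "HUMAN": {"PERSON"},
    "LOCATION": {"GPE", "LOC", "ORG"},
    "ENTITY": {"NORP", "FAC", "PRODUCT", "EVENT", "LANGUAGE", "LAW", "WORK_OF_ART"},
    "NUMERIC": {"DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"},
    "money": {"MONEY"},
    "date": {"DATE"},
    "group": {"ORG"},
}


def filter_candidates(tagged_entities, answer_type):
    """Keep entities whose OntoNotes tag is compatible with the predicted answer type.

    tagged_entities: iterable of (entity_text, ontonotes_tag) pairs.
    Unknown answer types yield no candidates in this sketch.
    """
    allowed = ANSWER_TYPE_TO_ONTONOTES.get(answer_type, set())
    return [text for text, tag in tagged_entities if tag in allowed]


entities = [("Wellington", "GPE"), ("2004", "DATE"), ("New Zealand Army", "ORG")]
candidates = filter_candidates(entities, "LOCATION")
```

For a LOCATION question, `candidates` contains "Wellington" and "New Zealand Army" but not "2004".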

3.3. Answer Scoring

The next step involves scoring the selected entities (denoted by $\mathcal{E}=\{e_{1},e_{2},\ldots\}$ ), based on their presumed relevance to the given question. From the example question cited above ( $Q=$ “Which actor played in Troy and Seven?”), it is clear that the answer entity itself is unlikely to occur in $Q$ , but contexts in which the answer occurs and is recognisable as such (e.g., “Pitt’s portrayal of Achilles in Troy (2004) $\ldots$ Detective Mills in Seven $\ldots$ ”) will probably have a substantial overlap with $Q$ . Accordingly, the key idea in our scoring approach involves measuring the semantic similarity between $Q$ and the sentences in which a candidate answer occurs.

To keep computational costs down (in keeping with our objective of designing a system that would be usable in near real-time), we first compute the document frequency (df) of each entity $e$ in $\mathcal{D}$ as follows.

$$\mathit{df}(e)=\lvert\{D\in\mathcal{D} \mid e\mbox{ occurs in }D;\mbox{ its tag is of the desired type(s)}\}\rvert \tag{1}$$

If $\mathcal{E}$ contains more than 100 entities, these entities are ranked in decreasing order of df, and only the top 100 are retained for further processing. (This number was fixed at 100 so as to avoid yet another tunable parameter in our system.)
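The document-frequency computation and the pruning step can be sketched as follows. The input format (one list of type-filtered entity strings per retrieved document) and the function names are assumptions; how ties at the cut-off are broken is not specified in the text, so this sketch simply relies on the ordering of `Counter.most_common`.

```python
from collections import Counter


def document_frequency(docs_entities):
    """Count, for each entity, the number of documents in which it occurs.

    docs_entities: one list of candidate entities per retrieved document,
    already restricted to tags of the desired type(s).
    """
    df = Counter()
    for entities in docs_entities:
        df.update(set(entities))  # count each entity at most once per document
    return df


def top_by_df(df, limit=100):
    """Keep all entities if there are at most `limit`; otherwise the `limit` with highest df."""
    if len(df) <= limit:
        return list(df)
    return [entity for entity, _ in df.most_common(limit)]


docs = [
    ["brad pitt", "eric bana"],
    ["brad pitt", "brad pitt"],
    ["brad pitt", "morgan freeman", "eric bana"],
]
df = document_frequency(docs)
kept = top_by_df(df, limit=2)
```

With this toy input, "brad pitt" has a df of 3 even though it is mentioned four times, and `kept` retains the two most widely attested entities.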

Next, we extract from $\mathcal{D}$ all sentences in which any $e\in\mathcal{E}$ occurs. Let $S_{e}=\{s^{(e)}_{1},s^{(e)}_{2},\ldots\}$ denote the sentences in $\mathcal{D}$ that contain $e$ , and let $\mathcal{S}=\bigcup_{e\in\mathcal{E}}S_{e}$ . The match between any $s^{(e)}_{i}$ and $Q$ is measured using $\cos(\mathbf{Q},\mathbf{s^{(e)}_{i}})$ , the cosine similarity between the embeddings of $s^{(e)}_{i}$ and $Q$ . Two types of embeddings were tried.

InferSent. In this method, proposed by (Conneau et al., 2017), pre-trained 300-dimensional GloVe word embeddings are used to obtain vector representations of the words in a sentence. A bidirectional LSTM (Bi-LSTM) network is then trained on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015). Max pooling is applied over the hidden states of the network to obtain a 4096-dimensional encoding of each sentence.

Sentence-BERT. This method fine-tunes a pretrained BERT (Devlin et al., 2019) / RoBERTa (Liu et al., 2019) model on the input sentences, and applies mean pooling over the output vectors to create the sentence embedding. As discussed in (Reimers and Gurevych, 2019), it is significantly faster than BERT and yields better sentence representations.

Finally, we need to compute a single $\mathit{score}(e)$ for each candidate answer $e$ by aggregating the scores of all the sentences in $S_{e}$ . Three aggregation methods were tried.

• Simple averaging: In this case, $\mathit{score}(e)$ , denoted by $\mathit{avg\mbox{-}score}(e)$ , is given by

$$\mathit{score}(e)=\mathit{avg\mbox{-}score}(e)=\frac{1}{\lvert S_{e}\rvert}\sum_{s\in S_{e}}\cos(\mathbf{Q},\mathbf{s}) \tag{2}$$

• Averaging over the best matching sentence from each document: For each document $D\in\mathcal{D}$ , we first pick the sentence $s_{D}$ containing $e$ that has the maximum similarity with $Q$ (see Equation 3).

$$s_{D}=\arg\max_{s\in D\cap S_{e}}\cos(\mathbf{Q},\mathbf{s}) \tag{3}$$

We then compute the average of these scores over all $D\in\mathcal{D}$ .

$$\mathit{score}(e)=\mathit{avg\mbox{-}maxscore}(e)=\frac{1}{\lvert\mathcal{D}\rvert}\sum_{D\in\mathcal{D}}\cos(\mathbf{Q},\mathbf{s_{D}}) \tag{4}$$

• Considering only the single best match: In this case, we consider the score of the best matching sentence in $S_{e}$ as $\mathit{score}(e)$ .

$$\mathit{score}(e)=\mathit{max\mbox{-}score}(e)=\max_{s\in S_{e}}\cos(\mathbf{Q},\mathbf{s}) \tag{5}$$
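The three aggregation schemes (Equations 2, 4 and 5) can be compared side by side in a short sketch. Embeddings are represented as plain lists of floats, and sentences are grouped by document; both choices, and all function names, are assumptions made for illustration. One further assumption is labelled in the code: in Equation 4, a document with no sentence mentioning $e$ is taken to contribute zero to the sum.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_scores(q_vec, sentences_by_doc):
    """Map each document to the similarities of its sentences that mention the entity."""
    return {
        doc: [cosine(q_vec, s) for s in vecs]
        for doc, vecs in sentences_by_doc.items()
        if vecs
    }


def avg_score(scores_by_doc):
    """Equation 2: mean similarity over all sentences in S_e (S_e must be non-empty)."""
    all_scores = [x for xs in scores_by_doc.values() for x in xs]
    return sum(all_scores) / len(all_scores)


def avg_maxscore(scores_by_doc, num_docs):
    """Equation 4: best sentence per document, averaged over all |D| documents.

    Assumption: documents that do not mention the entity contribute 0.
    """
    return sum(max(xs) for xs in scores_by_doc.values()) / num_docs


def max_score(scores_by_doc):
    """Equation 5: similarity of the single best matching sentence."""
    return max(max(xs) for xs in scores_by_doc.values())


# Similarities for one entity found in 2 of 10 retrieved documents.
scores = {"d1": [0.5, 1.0], "d2": [0.25]}
```

With these toy values, `avg_score(scores)` is 1.75/3, `avg_maxscore(scores, 10)` is 0.125 and `max_score(scores)` is 1.0. Under the stated assumption, dividing by $\lvert\mathcal{D}\rvert$ gives low values to entities that occur in only a few documents.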

3.4. Answer Ranking

The final ranking of the answers is based on an arithmetic combination of the similarity score obtained above and the (normalised) document frequency of the entities. This was motivated by observations from the query expansion / relevance feedback literature that document frequency is a useful factor for selecting / weighting expansion terms. To combine the two factors, we tried both weighted addition and simple multiplication.

• Weighted addition: The semantic similarity score and the normalized df were combined as follows.

$$\mathit{comb}\mbox{-}\mathit{score}_{+}=\alpha\times\mathit{score}(e)+\beta\times\frac{\mathit{df}(e)}{\lvert\mathcal{D}\rvert} \tag{6}$$

Here $\alpha$ and $\beta$ are parameters to be tuned.

• Simple multiplication: In this variant, we simply multiplied the semantic similarity with the normalized df.

$$\mathit{comb}\mbox{-}\mathit{score}_{*}=\mathit{score}(e)\times\frac{\mathit{df}(e)}{\lvert\mathcal{D}\rvert} \tag{7}$$

Finally, a list of the entities corresponding to the five highest combined scores is returned as the set of answers for $Q$ . This list is only partially ordered, since $\mathsf{LiCQA}$ can assign the same rank (one through five) to all entities that obtain the same combined score.
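The two combination rules (Equations 6 and 7) and the partially ordered output can be sketched as below. The function names are hypothetical, and ties are detected by exact equality of the combined scores, which is an assumption; an implementation working with floating-point similarities might instead round the scores or compare them with a tolerance.

```python
def comb_score_add(score, df, num_docs, alpha, beta):
    """Equation 6: weighted sum of similarity and normalised document frequency."""
    return alpha * score + beta * df / num_docs


def comb_score_mul(score, df, num_docs):
    """Equation 7: product of similarity and normalised document frequency."""
    return score * df / num_docs


def rank_answers(combined, max_rank=5):
    """Group entities by combined score and keep the `max_rank` highest scores.

    combined: {entity: combined score}.
    Returns a list of tie groups, best first; entities in a group share a rank.
    """
    by_score = {}
    for entity, score in combined.items():
        by_score.setdefault(score, []).append(entity)
    best_scores = sorted(by_score, reverse=True)[:max_rank]
    return [sorted(by_score[s]) for s in best_scores]


combined = {"brad pitt": 0.5, "eric bana": 0.25, "morgan freeman": 0.25}
ranking = rank_answers(combined)
```

Here `ranking` has two ranks: "brad pitt" alone at rank 1, and "eric bana" and "morgan freeman" tied at rank 2.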

4. Experimental Setup

4.1. Baselines

To compare the effectiveness of $\mathsf{LiCQA}$ with methods having diverse working principles, we adopt two state-of-the-art complex question answering models, specifically QUEST (Lu et al., 2019) and DrQA (Chen et al., 2017) as our baselines. Two additional graph-based question answering models, BFS (Suchanek et al., 2009) and ShortestPaths (Lu et al., 2019), are also considered for comparing the performance of $\mathsf{LiCQA}$ .

$\mathsf{LiCQA}$ is designed for answering complex questions (having answers that need to be constructed from multiple documents) in a real-time application (for which computational cost is an essential factor). Therefore, neither end-to-end QA systems that are heavily supervised, nor computationally expensive systems based on recent deep neural models (e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019)) would be appropriate as baselines for our studies. Note that we do use distilled versions of some of these models in some stages of our pipeline, however, and compare their effect with that of other alternatives.

4.2. Datasets

Datasets that have been commonly used to evaluate QA systems include SQuAD (Rajpurkar et al., 2016), Natural Questions (Kwiatkowski et al., 2019) and WikiMovies (Miller et al., 2016). As mentioned before, these datasets are not focused on complex questions, and are designed for supervised QA systems. In order to experiment with various types of complex questions, and to allow a fair comparison with our chosen baselines, we use the same dataset as QUEST. Specifically, we use two sets of questions — WikiAnswers (CQ-W), and Google Trends (CQ-T) — for evaluation. Each set contains 150 complex questions of varying difficulty.

Document collection. Recall from Figure 1 that one of the first steps in $\mathsf{LiCQA}$ involves performing a keyword-based retrieval from a corpus with the question to obtain a set of documents containing the potential answers. In principle, we treat the open Web as our target document collection, and use a search service like the Google API to retrieve documents. However, the Web is dynamic. To ensure reproducibility, the dataset provided by (Lu et al., 2019) includes the 10 top-ranked documents that they actually retrieved in the course of their experiments for each query. We refer to this set (comprising the 10 top-ranked documents for each question) as Top10. Given Top10, we skip the retrieval step during our experiments, and start by analysing this set of documents.

To minimize the effect of the QA components apparently built into Google’s search engine and to simulate variations in the search algorithm, five additional document collections were constructed using stratified sampling from the top $50$ documents returned by Google. These collections are named Strata-1 through Strata-5. These collections also contain 10 documents per query, with $x_{1}$ % documents selected from within the top 10 ranks, $x_{2}$ % from documents ranked 11 through 25, and $x_{3}$ % from ranks 26 to 50. Table 2 shows the values of $x_{1}$ , $x_{2}$ , and $x_{3}$ chosen for each of these collections.

Collection	$x_{1}$	$x_{2}$	$x_{3}$
Top10	100	0	0
Strata-1	60	30	10
Strata-2	50	40	10
Strata-3	50	30	20
Strata-4	40	40	20
Strata-5	40	30	30

Table 2. Values of $x_{1}$ , $x_{2}$ and $x_{3}$ for the document collections used in this article.
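The stratified construction of these collections can be illustrated with a short sketch. The percentages are those of Table 2; the use of uniform random sampling within each rank band, the fixed seed, and the function name are assumptions, since the exact sampling procedure is not described here.

```python
import random

# (x1, x2, x3) from Table 2: percentages drawn from ranks 1-10, 11-25 and 26-50.
STRATA = {
    "Top10": (100, 0, 0),
    "Strata-1": (60, 30, 10),
    "Strata-2": (50, 40, 10),
    "Strata-3": (50, 30, 20),
    "Strata-4": (40, 40, 20),
    "Strata-5": (40, 30, 30),
}


def sample_collection(ranked_docs, name, size=10, seed=0):
    """Draw `size` documents for one query according to the named stratification.

    ranked_docs: the top 50 documents for the query, best first.
    Assumption: documents are sampled uniformly without replacement in each band.
    """
    bands = (ranked_docs[:10], ranked_docs[10:25], ranked_docs[25:50])
    rng = random.Random(seed)
    picked = []
    for band, pct in zip(bands, STRATA[name]):
        picked.extend(rng.sample(band, size * pct // 100))
    return picked


ranks = list(range(1, 51))  # stand-in for 50 ranked documents
strata3 = sample_collection(ranks, "Strata-3")
```

For Strata-3, the sample contains five documents from ranks 1 to 10, three from ranks 11 to 25 and two from ranks 26 to 50.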

4.3. Evaluation metrics

Following standard practice (Lu et al., 2019; Christmann et al., 2019), we evaluate QA systems in terms of precision at rank 1 ( $\mathsf{P@1}$ ), mean reciprocal rank ( $\mathsf{MRR}$ ), and hit at 5 ( $\mathsf{Hit@5}$ ). The commonly used definitions of these metrics are intended for use with a ranked list that does not have any ties. However, QA systems (including QUEST and $\mathsf{LiCQA}$ ) sometimes do produce ties, i.e., multiple answers that are assigned the same rank. In fact, this would be the correct response to some questions (like Which Nobel Laureate was in the Manhattan Project?) for which multiple correct answers are possible (Niels Bohr, Enrico Fermi, etc.). In addition to the above metrics, therefore, we use tie-aware versions of these metrics (McSherry and Najork, 2008; Saha et al., 2022), namely $\mathsf{tP@1}$ , $\mathsf{tMRR}$ and $\mathsf{tHit@5}$ , for a more realistic comparison of different systems.
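To make the idea of tie-aware evaluation concrete, the sketch below computes per-question values as expectations under uniformly random tie-breaking, which is the general idea behind tie-aware measures. Whether this coincides exactly with the definitions of $\mathsf{tP@1}$ , $\mathsf{tMRR}$ and $\mathsf{tHit@5}$ used in our experiments is an assumption of the sketch; the function names and the list-of-tie-groups input format are likewise illustrative.

```python
from math import comb


def t_hit_at_k(groups, gold, k):
    """Expected Hit@k when items sharing a rank are ordered uniformly at random.

    groups: tie groups, best first; each group lists the answers sharing a rank.
    gold: set of correct answers.
    """
    seen = 0
    for group in groups:
        slots = k - seen
        if slots <= 0:
            break
        n = len(group)
        c = sum(1 for a in group if a in gold)
        if n > slots:
            # The group straddles the cut-off: only `slots` of its n items count.
            return 1.0 - comb(n - c, slots) / comb(n, slots)
        if c:
            return 1.0
        seen += n
    return 0.0


def t_precision_at_1(groups, gold):
    """Expected P@1; identical to the expected Hit@1."""
    return t_hit_at_k(groups, gold, 1)


def t_reciprocal_rank(groups, gold):
    """Expected reciprocal rank of the first correct answer under random tie-breaking."""
    before = 0
    for group in groups:
        n = len(group)
        c = sum(1 for a in group if a in gold)
        if c:
            # The first correct item lands at offset j with probability C(n-j, c-1) / C(n, c).
            return sum(
                comb(n - j, c - 1) / comb(n, c) / (before + j)
                for j in range(1, n - c + 2)
            )
        before += n
    return 0.0


groups = [["a", "b"], ["c"]]  # "a" and "b" are tied at rank 1
gold = {"b"}
```

For this example the expected P@1 is 0.5 and the expected reciprocal rank is 0.75, whereas a tie-unaware evaluation would credit the system fully or not at all, depending on the arbitrary order in which the tied answers happen to be listed. Averaging the per-question values over all questions gives the collection-level figures.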

	Stages of pipeline			Evaluation metrics
Type classification	Sentence embedding	Scoring	Ranking	tMRR	tP@1	tHit@5
Traditional (SVM)			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.412	0.282	0.633
		max-score	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.406	0.275	0.626
			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.123	0.063	0.242
		avg-maxscore	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.124	0.063	0.242
			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.387	0.253	0.620
	InferSent	avg-score	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.386	0.253	0.620
			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.411	0.280	0.633
		max-score	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.398	0.280	0.613
			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.116	0.060	0.216
		avg-maxscore	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.118	0.060	0.220
			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.408	0.260	0.640
	Sent_BERT	avg-score	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.393	0.260	0.613
Neural (USE)			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.407	0.282	0.626
		max-score	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.401	0.275	0.620
			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.117	0.056	0.208
		avg-maxscore	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.117	0.056	0.225
			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.385	0.260	0.613
	InferSent	avg-score	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.385	0.260	0.613
			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.399	0.273	0.620
		max-score	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.389	0.273	0.600
			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.106	0.050	0.226
		avg-maxscore	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.104	0.050	0.206
			$\mathit{comb}\mbox{-}\mathit{score}_{}$*	0.389	0.246	0.620
	Sent_BERT	avg-score	$\mathit{comb}\mbox{-}\mathit{score}_{+}$	0.387	0.260	0.600

Table 3. Performance of different pipeline configurations for CQ-W questions in terms of the tie-aware versions of MRR, P@1 and Hit@5.

4.4. $\mathsf{LiCQA}$ configuration

Other than the parameters involved within the various commodity components used in our pipeline, the only parameters in $\mathsf{LiCQA}$ are $\alpha$ and $\beta$ (see Equation 6). These were varied over $[0.1,0.7]$ in steps of $0.1$ .

System	Metric	Top10	Strata-1	Strata-2	Strata-3	Strata-4	Strata-5
ShortestPaths		0.240	0.261	0.249	0.237	0.259	0.270
BFS		0.249	0.256	0.266	0.212	0.219	0.254
DrQA	$\mathsf{MRR}$	0.226	0.237	0.257	0.256	0.215	0.248
QUEST		0.355	0.380	0.340	0.302	0.356	0.318
$\mathsf{LiCQA}$		0.432^∗^§^†‡	0.424^∗^§^†‡	0.431^∗^§^†‡	0.420^∗^§^†‡	0.426^∗^§^†‡	0.416^∗^§^†‡
% improvement		21.6%	11.5%	26.7%	39%	19.6%	30.8%
ShortestPaths		0.147	0.173	0.193	0.140	0.147	0.187
BFS		0.160	0.167	0.193	0.113	0.100	0.147
DrQA	$\mathsf{P@1}$	0.184	0.199	0.221	0.215	0.172	0.200
QUEST		0.268	0.315	0.262	0.216	0.258	0.216
$\mathsf{LiCQA}$		0.293^∗^†	0.280^§	0.320^∗^§	0.273^∗^§	0.306^∗^§^†	0.280^§
% improvement		9.32%	-11.1%	22.13%	26.3%	18.6%	29.6%
ShortestPaths		0.347	0.367	0.387	0.327	0.393	0.340
BFS		0.360	0.353	0.347	0.327	0.327	0.360
DrQA	$\mathsf{Hit@5}$	0.313	0.315	0.322	0.322	0.303	0.340
QUEST		0.376	0.396	0.356	0.344	0.401	0.358
$\mathsf{LiCQA}$		0.646^∗^§^†‡	0.673^∗^§^†‡	0.633^∗^§^†‡	0.653^∗^§^†‡	0.626^∗^§^†‡	0.640^∗^§^†‡
% improvement		71.8%	69.9%	77.8%	89.8%	56.1%	78.7%

Table 4. Performance comparison of baseline methods with $\mathsf{LiCQA}$ for CQ-W questions in terms of MRR, P@1 and Hit@5. The superscripts ∗, §, † and ‡ indicate that the difference between $\mathsf{LiCQA}$ and ShortestPaths, BFS, DrQA and QUEST, respectively, is statistically significant (t-test with $p<0.05$ ). The percentage improvement of $\mathsf{LiCQA}$ over the best performing baseline (QUEST) is presented in the last row of each group.

System	Metric	Top10	Strata-1	Strata-2	Strata-3	Strata-4	Strata-5
ShortestPaths		0.266	0.224	0.248	0.219	0.232	0.222
BFS		0.287	0.256	0.265	0.259	0.219	0.201
DrQA	$\mathsf{MRR}$	0.355	0.330	0.356	0.369	0.365	0.380
QUEST		0.467	0.436	0.426	0.460	0.409	0.384
$\mathsf{LiCQA}$		0.443^∗^§	0.407^∗^§	0.472^∗^§^†‡	0.476^∗^§^†‡	0.422^∗^§	0.468^∗^§^†‡
% improvement		-5.1%	-6.6%	10.8%	3.4%	3.1%	21.8%
ShortestPaths		0.190	0.140	0.160	0.160	0.150	0.130
BFS		0.210	0.170	0.180	0.180	0.140	0.130
DrQA	$\mathsf{P@1}$	0.286	0.267	0.287	0.300	0.287	0.320
QUEST		0.394	0.360	0.347	0.377	0.333	0.288
$\mathsf{LiCQA}$		0.306^∗	0.260^∗	0.346^∗^§	0.340^∗^§	0.273^∗^§	0.320^∗^§
% improvement		-22.3%	-27.7%	-0.2%	-9.8%	-18.0%	11.1%
ShortestPaths		0.350	0.320	0.340	0.310	0.330	0.290
BFS		0.380	0.360	0.370	0.360	0.310	0.320
DrQA	$\mathsf{Hit@5}$	0.453	0.440	0.473	0.487	0.480	0.480
QUEST		0.531	0.496	0.510	0.500	0.503	0.459
$\mathsf{LiCQA}$		0.646^∗^§^†‡	0.633^∗^§^†‡	0.680^∗^§^†‡	0.673^∗^§^†‡	0.640^∗^§^†‡	0.686^∗^§^†‡
% improvement		21.6%	27.6%	33.3%	34.6%	27.2%	49.4%

Table 5. Performance comparison of baseline methods with $\mathsf{LiCQA}$ for CQ-T questions in terms of MRR, P@1 and Hit@5. The superscripts ∗, §, † and ‡ indicate that the difference between $\mathsf{LiCQA}$ and ShortestPaths, BFS, DrQA and QUEST, respectively, is statistically significant (t-test with $p<0.05$ ). The percentage improvement of $\mathsf{LiCQA}$ over the best performing baseline (QUEST) is presented in the last row of each group.

5. Results and Discussion

Recall that the $\mathsf{LiCQA}$ pipeline contains four components for which multiple options were explored. We first present a comprehensive comparison across all the different combinations to determine what works best. This also allows us to carefully analyse the effect of the choice for any individual module, by fixing the choices for the other modules. In the next subsection, we present an overall comparison between $\mathsf{LiCQA}$ and the baselines mentioned in the previous section. While this does involve selecting the most effective design for the pipeline, it does not amount to retrospectively training parameters on the test data. Finally, by way of a qualitative comparison with QUEST, as well as failure analysis, we provide an anecdotal discussion of the results for some specific questions.

5.1. Analysis of pipeline components

To assess the importance of each component and the robustness of $\mathsf{LiCQA}$ , we vary the different modules and measure the resulting performance. Table 3 shows the performance of the different pipeline configurations for the CQ-W question set. The variation in performance on CQ-T questions is similar; hence, for the sake of brevity, we report results on the CQ-W question set only. We then choose the best module for each stage when reporting the performance of $\mathsf{LiCQA}$ .

Comparing the overall performance of the question type classifiers, we note that the SVM-based classifier works slightly better than the neural classifier. After extracting the set of sentences from the retrieved documents, the next stage of $\mathsf{LiCQA}$ scores the entities in those sentences, using both a semantic and a statistical comparison. The semantic similarity is computed between the question and the extracted sentences that contain entities of the same type as the expected answer. Of the two sentence encoding techniques that we experimented with, InferSent works better than Sent_BERT in a majority of cases, although their performance is mostly comparable. Among the functions used to aggregate these similarities into a score for each NE, max-score (Equation 5) is superior to the other scoring functions (Equations 2 and 4), irrespective of the other components. In the final phase, $\mathsf{LiCQA}$ ranks the entities based on the computed similarity scores. The best performance is obtained when the normalized document frequency is multiplied with the score of the individual entities using $\mathit{comb}\mbox{-}\mathit{score}_{*}$ (Equation 7). Overall, the best performance is obtained when the SVM-based classifier is used with InferSent as the sentence embedding technique, and scoring and ranking of the entities are performed using max-score and $\mathit{comb}\mbox{-}\mathit{score}_{*}$ , respectively. Hence, we report the performance of $\mathsf{LiCQA}$ based on this configuration and compare the result with the baselines.

5.2. End-to-end evaluation

5.2.1. Evaluation using $\mathsf{MRR}$ , $\mathsf{P@1}$ and $\mathsf{Hit@5}$

Tables 4 and 5 compare the overall performance of $\mathsf{LiCQA}$ and the baselines for both question sets, WikiAnswers (CQ-W) and Google Trends (CQ-T), in terms of $\mathsf{MRR}$ , $\mathsf{P@1}$ and $\mathsf{Hit@5}$ . We observe that $\mathsf{LiCQA}$ achieves the best performance for CQ-W on all metrics and document sets, with the sole exception of $\mathsf{P@1}$ on Strata-1 (0.280, against 0.315 for QUEST); it yields an improvement of 21.6% and 71.8% over the next best system (QUEST) in terms of $\mathsf{MRR}$ and $\mathsf{Hit@5}$ , respectively, for the Top10 dataset. Similar trends are observed for Strata-1 through Strata-5. The improvements over ShortestPaths, BFS and the strong baselines, QUEST and DrQA, are statistically significant (paired t-test with $p<0.05$ ) for almost all document sets.

For CQ-T, $\mathsf{LiCQA}$ achieves the best $\mathsf{Hit@5}$ across document sets, but QUEST does consistently better for $\mathsf{P@1}$ , and also achieves better $\mathsf{MRR}$ for the Top10 and Strata-1 sets. It turns out, however, that the difference between QUEST and $\mathsf{LiCQA}$ is not statistically significant in these cases.

Across query and document sets, $\mathsf{LiCQA}$ ’s ability to consistently retrieve at least one correct answer within the top 5 ranks is particularly encouraging. In this regard, it is successful for at least 63% of the questions, significantly more than the next best system (usually QUEST).

Question: which country borders mali and ghana?

Rank	Answers
1	countries $\lvert$ three countries
2	central bank of west african states $\lvert$ $\ldots$ $\lvert$ africa ohada
3	attack
4	accra
5	burkina faso $\lvert$ burkina faso burkina

Table 6. Example ranked list returned by QUEST. Rank 2 contains 20 answers. Correct answers are in bold face.

		CQ-W						CQ-T
System	Metric	Top10	Strata-1	Strata-2	Strata-3	Strata-4	Strata-5	Top10	Strata-1	Strata-2	Strata-3	Strata-4	Strata-5
DrQA	$\mathsf{tMRR}$	0.226	0.237	0.257	0.256	0.215	0.248	0.355	0.330	0.356	0.369	0.365	0.380
QUEST	$\mathsf{tMRR}$	0.067	0.116	0.066	0.067	0.077	0.055	0.065	0.075	0.066	0.076	0.069	0.057
$\mathsf{LiCQA}$	$\mathsf{tMRR}$	0.412^†‡	0.411^†‡	0.408^†‡	0.398^†‡	0.418^†‡	0.399^†‡	0.433^‡	0.388^†‡	0.443^†‡	0.455^†‡	0.399^†‡	0.434^†‡
% improvement	$\mathsf{tMRR}$	82.3%	73.4%	58.7%	55.4%	94.4%	60.8%	21.9%	17.5%	24.4%	23.3%	9.31%	14.2%
DrQA	$\mathsf{tP@1}$	0.184	0.199	0.221	0.215	0.172	0.200	0.286	0.267	0.287	0.300	0.287	0.320
QUEST	$\mathsf{tP@1}$	0.040	0.093	0.043	0.042	0.053	0.034	0.043	0.042	0.036	0.048	0.035	0.031
$\mathsf{LiCQA}$	$\mathsf{tP@1}$	0.282^†‡	0.270^‡	0.292^‡	0.255^‡	0.303^†‡	0.262^‡	0.296^‡	0.242^‡	0.312^‡	0.320^‡	0.250^‡	0.290^‡
% improvement	$\mathsf{tP@1}$	53.2%	35.6%	32.1%	18.6%	76.1%	31.0%	3.4%	-9.3%	8.7%	6.6%	-12.8%	-9.3%
DrQA	$\mathsf{tHit@5}$	0.313	0.315	0.322	0.322	0.303	0.340	0.453	0.440	0.473	0.487	0.480	0.480
QUEST	$\mathsf{tHit@5}$	0.155	0.194	0.144	0.144	0.162	0.120	0.154	0.169	0.147	0.165	0.149	0.121
$\mathsf{LiCQA}$	$\mathsf{tHit@5}$	0.640^†‡	0.655^†‡	0.614^†‡	0.649^†‡	0.620^†‡	0.635^†‡	0.646^†‡	0.620^†‡	0.657^†‡	0.662^†‡	0.616^†‡	0.664^†‡
% improvement	$\mathsf{tHit@5}$	104.4%	107.9%	90.6%	101.5%	104.6%	86.7%	42.6%	40.9%	38.9%	35.9%	28.3%	38.3%

Table 7. Performance comparison of baseline methods with $\mathsf{LiCQA}$ for CQ-W and CQ-T questions in terms of $\mathsf{tMRR}$ , $\mathsf{tP@1}$ and $\mathsf{tHit@5}$ . A $\dagger$ and $\ddagger$ indicate that the improvement of $\mathsf{LiCQA}$ over DrQA and QUEST, respectively, is statistically significant (t-test with $p<0.05$ ). The percentage improvement of $\mathsf{LiCQA}$ over the best performing baseline (DrQA) is presented in the last row of each group.

5.2.2. Evaluation using $\mathsf{tMRR}$ , $\mathsf{tP@1}$ and $\mathsf{tHit@5}$

Of the various systems whose performance we seek to compare, QUEST and $\mathsf{LiCQA}$ both produce ranked lists of answers that may contain ties. The figures reported above are thus likely to over-estimate the accuracy of these two systems. Table 6 shows an example explaining this possibility. In the example, the list returned by QUEST contains 20 entities at rank 2, all of which are incorrect, but because a correct answer is retrieved at rank 5 (before ties are broken), the traditional $\mathsf{MRR}$ and $\mathsf{Hit@5}$ values are 0.2 and 1, respectively.

In this section, therefore, we compare the performance of the best systems using tie-aware versions of the above evaluation measures, viz., $\mathsf{tMRR}$ , $\mathsf{tP@1}$ and $\mathsf{tHit@5}$ . The values of these metrics for DrQA, QUEST and $\mathsf{LiCQA}$ are shown in Table 7. Note that DrQA, ShortestPaths and BFS return single answers at each rank; the values of the conventional and tie-aware metrics would be the same for these methods. We have excluded the relatively weak baselines — ShortestPaths and BFS — from the table.
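To make the effect of ties concrete, the sketch below contrasts the conventional reciprocal rank and Hit@5 with tie-aware counterparts for a ranked list shaped like the one in Table 6. It assumes that a tie-aware metric is the expected value of the conventional metric when the answers sharing a rank are ordered uniformly at random; the definitions actually used in this paper are those of Section 4.3 and may differ in detail. The function names, and the assumption that exactly one of the two answers at rank 5 is correct, are illustrative only.

```python
from math import comb


def conventional_rr(groups):
    """Reciprocal of the first rank whose tie group holds a correct answer."""
    for rank, (_size, correct) in enumerate(groups, start=1):
        if correct > 0:
            return 1.0 / rank
    return 0.0


def conventional_hit_at_k(groups, k=5):
    """1.0 if any of the first k ranks holds a correct answer, else 0.0."""
    return float(any(correct > 0 for _size, correct in groups[:k]))


def tie_aware_rr(groups):
    """Expected reciprocal rank under uniformly random tie-breaking."""
    start = 1  # position of the first item of the current tie group
    for size, correct in groups:
        if correct > 0:
            total = comb(size, correct)
            # P(first correct item sits at offset j) = C(size-j, correct-1) / total
            return sum(
                comb(size - j, correct - 1) / total / (start + j - 1)
                for j in range(1, size - correct + 2)
            )
        start += size
    return 0.0


def tie_aware_hit_at_k(groups, k=5):
    """Probability that a correct item lands in the top k positions."""
    start = 1
    for size, correct in groups:
        if start > k:
            break
        slots = min(size, k - start + 1)  # positions of this group inside the top k
        if correct > 0:
            return 1.0 - comb(size - slots, correct) / comb(size, correct)
        start += size
    return 0.0


# (group size, number of correct answers) per rank, shaped like Table 6.
# Assumption: one of the two answers tied at rank 5 is correct.
table6_like = [(2, 0), (20, 0), (1, 0), (1, 0), (2, 1)]
print(conventional_rr(table6_like), conventional_hit_at_k(table6_like))
print(tie_aware_rr(table6_like), tie_aware_hit_at_k(table6_like))
```

Under these assumptions the conventional values are 0.2 and 1, whereas the correct answer actually sits at position 25 or 26 once the ties are expanded, so the tie-aware reciprocal rank drops to about 0.039 and the tie-aware Hit@5 to 0.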

Once again, the results show that $\mathsf{LiCQA}$ outperforms QUEST on all three tie-aware metrics for both sets of questions, and outperforms DrQA in terms of $\mathsf{tMRR}$ and $\mathsf{tHit@5}$ throughout. For $\mathsf{tP@1}$ , $\mathsf{LiCQA}$ is ahead of DrQA on CQ-W and on three of the six CQ-T document sets, while DrQA does better on Strata-1, Strata-4 and Strata-5 of CQ-T (Table 7). The table also confirms that traditional metrics grossly over-report the accuracy of QUEST. When evaluated using tie-aware measures, DrQA’s performance is observed to be significantly better than QUEST’s. In hindsight, this is unsurprising, since QUEST produces lists that often contain many different entities at a particular rank. Naturally, the improvement of $\mathsf{LiCQA}$ over QUEST is statistically significant for all three metrics. More importantly, the $\mathsf{tMRR}$ and $\mathsf{tHit@5}$ figures for $\mathsf{LiCQA}$ are, almost without exception, also statistically significantly higher than the corresponding figures for DrQA.

Figure 2 graphically compares the performance of $\mathsf{LiCQA}$ , DrQA and QUEST using both conventional and tie-aware metrics on CQ-W and CQ-T for the Top10 collection. Each plot contains five bars that correspond respectively to each of the classical and tie-aware metrics for QUEST (in khaki and olive), DrQA (teal) and $\mathsf{LiCQA}$ (in skyblue and steelblue). As noted before, DrQA always retrieves a single entity at any rank, so both metrics have the same value. From the figure, it is clear that the skyblue bars are generally the tallest, except for P@1 on CQ-T, where QUEST performed the best. For the tie-aware metrics, $\mathsf{LiCQA}$ consistently achieves the best performance (steelblue bar) in all cases.

Question Set	Question	Answer	$\mathsf{LiCQA}$ rank	QUEST rank	DrQA rank
CQ-W	What movie starred Bruce Willis and Haley Joel Osment?	The Sixth Sense	1	$>$ 5	3
CQ-W	Who founded Apple and Pixar?	Steve Jobs	1	2	5
CQ-T	Which 2018 studio album is performed by Playboi Carti and features Nicki Minaj?	Die Lit	1	$>$ 5	2
	Which American professional golfer is married to Taylor Dowd Simpson and played for Wake Forest University?	Simpson	1	$>$ 5	$>$ 5

Table 8. Some example questions for which $\mathsf{LiCQA}$ has been able to retrieve the correct answer at rank 1, whereas QUEST and DrQA have failed to retrieve the answers in the top rank.

5.3. Qualitative comparison

Next, we take a closer look at the relative performance of QUEST and $\mathsf{LiCQA}$ . Figures 3 and 4 show the per-query performance difference between the two systems on the Top10 dataset. In these figures, a bar above the x-axis indicates that $\mathsf{LiCQA}$ performed better than QUEST for the corresponding question. Figures 3a–3c and 4a–4c show that $\mathsf{LiCQA}$ performs better for a majority of the questions on both sets. Note that, for a single question, the values of $\mathsf{P@1}$ and $\mathsf{Hit@5}$ are binary (0 or 1), and the difference in these values can be $1$ , $0$ or $-1$ .

In contrast, the tie-aware measures have non-binary values. The per-query differences with respect to these metrics are shown in Figures 3d–3f and 4d–4f, for CQ-W and CQ-T, respectively. Once again, it is clear from these figures that $\mathsf{LiCQA}$ outperforms QUEST for most questions.

Some example questions for which $\mathsf{LiCQA}$ was able to retrieve the correct answer at top rank, but QUEST and DrQA failed, are presented in Table 8.

5.4. Latency analysis

Finally, we compare $\mathsf{LiCQA}$ with DrQA and QUEST in terms of the average time taken to produce the list of answers for a given set of questions. Each system is made to run in a loop for five iterations, and computes the answers for all questions in CQ-W and CQ-T in every iteration. The average of the time taken over these five iterations is reported in Table 9.
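The measurement protocol described above can be sketched as follows. This is a minimal illustration, not the harness used in the experiments: `average_runtime`, `answer_fn` and the placeholder questions are hypothetical names, and each QA system is assumed to be callable on a single question.

```python
import time


def average_runtime(answer_fn, questions, iterations=5):
    """Average wall-clock time (seconds) of one full pass over `questions`.

    `answer_fn` stands in for a QA system; it is called once per question
    in every iteration, and the per-pass times are averaged.
    """
    pass_times = []
    for _ in range(iterations):
        start = time.perf_counter()
        for question in questions:
            answer_fn(question)
        pass_times.append(time.perf_counter() - start)
    return sum(pass_times) / len(pass_times)


# Toy stand-in for a QA system, used only to show the call pattern.
def dummy_system(question):
    return question.lower().split()


questions = ["Who founded Apple and Pixar?", "which country borders mali and ghana?"]
avg = average_runtime(dummy_system, questions)
print(f"average time per pass: {avg:.6f} s")
```

Dividing the returned value by `len(questions)` would give an average per-question figure.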

All our experiments were conducted on a machine with an Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz, a 24 GB Nvidia Titan RTX GPU and 132 GB RAM. Among the methods, QUEST takes the longest to produce results, spending a significant amount of time creating the quasi knowledge graph from the retrieved documents. The latency of DrQA is close to that of QUEST. In contrast, the execution time of $\mathsf{LiCQA}$ is significantly lower than that of both baselines. More specifically, we observe a speedup of roughly $8\times$ (between $7\times$ and $9.2\times$ relative to DrQA, the faster baseline) for $\mathsf{LiCQA}$ in the total running time taken to process the two question sets for all the document sets (Top10 and Strata1–5).

System	Query set	Top10	Strata-1	Strata-2	Strata-3	Strata-4	Strata-5
$\mathsf{LiCQA}$	CQ-W	13	15	17	14	16.5	14.3
QUEST		144	148	151	156	149	162
DrQA		120	131	119	124.6	128	129
speedup		9.23 $\times$	8.73 $\times$	7 $\times$	8.9 $\times$	7.75 $\times$	9.02 $\times$
$\mathsf{LiCQA}$	CQ-T	14.5	16	18	15.8	18	17.6
QUEST		155	159	162	178	189.5	195.4
DrQA		130	135	138.6	132.8	142	137
speedup		8.96 $\times$	8.43 $\times$	7.7 $\times$	8.40 $\times$	7.88 $\times$	7.78 $\times$

Table 9. Per query latency (in seconds) analysis of different QA models. The lowest latency in each column is that of $\mathsf{LiCQA}$ . The speedup row indicates the reduction in execution time relative to the baseline with the minimum latency (DrQA).
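The speedup row of Table 9 can be reproduced with a few lines of arithmetic. The sketch below uses the CQ-W latencies copied from the table and assumes that speedup is the latency of the faster baseline divided by the latency of $\mathsf{LiCQA}$ ; the variable and function names are illustrative.

```python
# CQ-W latencies copied from Table 9, in seconds.
DOC_SETS = ["Top10", "Strata-1", "Strata-2", "Strata-3", "Strata-4", "Strata-5"]
LICQA = [13, 15, 17, 14, 16.5, 14.3]
QUEST = [144, 148, 151, 156, 149, 162]
DRQA = [120, 131, 119, 124.6, 128, 129]


def speedup(baseline_latency, system_latency):
    """How many times faster the system is than the baseline."""
    return baseline_latency / system_latency


# Compare against whichever baseline is faster on each document set.
speedups = {
    name: speedup(min(drqa, quest), licqa)
    for name, licqa, quest, drqa in zip(DOC_SETS, LICQA, QUEST, DRQA)
}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

The computed ratios may differ from the table in the last digit for some columns, since the reported figures appear to be truncated rather than rounded.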

6. Conclusion and Future Work

In this paper, we present $\mathsf{LiCQA}$ , a low-resource, unsupervised, complex question answering system that uses various forms of corpus evidence. To evaluate $\mathsf{LiCQA}$ , experiments were conducted on diverse question sets taken from public forums like WikiAnswers and Google Trends. $\mathsf{LiCQA}$ was compared with state-of-the-art systems DrQA and QUEST using $\mathsf{MRR}$ , $\mathsf{P@1}$ , $\mathsf{Hit@5}$ , and their tie-aware variants. $\mathsf{LiCQA}$ achieved significant improvements over all baselines in a majority of cases, while simultaneously reducing computational cost (as measured by the average time taken to answer a question).

Recently developed neural models for representing complex concepts via low dimensional vectors (Shalaby et al., 2019; Yamada et al., 2020) have been successfully applied to different text processing and information extraction tasks (Gruzdo et al., 2020). In future, we plan to explore such embedding models to further improve our answer extraction module.

References

A. Abujabal, M. Yahya, M. Riedewald, and G. Weikum (2017) Automated template generation for question answering over knowledge graphs. In Proc. of 26th WWW, pp. 1191–1200. Cited by: §2.
D. Ahn, V. Jijkoun, G. Mishne, K. Müller, M. de Rijke, and S. Schlobach (2004) Using wikipedia at the TREC QA track. In Proc. of Thirteenth TREC, NIST Special Publication, Vol. 500-261. Cited by: §2.
A. Akbik, T. Bergmann, and R. Vollgraf (2019) Pooled contextualized embeddings for named entity recognition. In NAACL 2019, pp. 724 – 728. Cited by: §3.2.
H. Bast and E. Haussmann (2015) More accurate question answering on freebase. In Proc. of 24th ACM CIKM, pp. 1431–1440. Cited by: §2.
J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on Freebase from question-answer pairs. In Proc. of 2013 EMNLP, pp. 1533–1544. Cited by: §2, §2.
N. Bhutani, X. Zheng, K. Qian, Y. Li, and H. Jagadish (2020) Answering complex questions by combining information from curated and extracted knowledge bases. In Proc. of First Workshop on Natural Language Interfaces, pp. 1–10. Cited by: §2.
P. Blunsom, K. Kocik, and J. R. Curran (2006) Question classification with log-linear models. In Proceedings of 29th SIGIR, pp. 615–616. Cited by: §3.1.
S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proc. 2015 EMNLP, pp. 632–642. Cited by: §3.3.
D. Buscaldi and P. Rosso (2006) Mining knowledge from wikipedia for the question answering task. In Proc. of Fifth LREC, pp. 727–730. Cited by: §2.
Q. Cai and A. Yates (2013) Large-scale semantic parsing via schema matching and lexicon extension. In Proc. of 51st ACL, pp. 423–433. Cited by: §2.
D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil (2018) Universal sentence encoder for English. In Proc. 2018 EMNLP: System Demonstrations, pp. 169–174. Cited by: §3.1.
D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to answer open-domain questions. In Proceedings of 55th ACL 2017, pp. 1870–1879. Cited by: §1, §2, §4.1.
P. Christmann, R. Saha Roy, A. Abujabal, J. Singh, and G. Weikum (2019) Look before you hop: conversational question answering over knowledge graphs using judicious context expansion. In Proc of 28th ACM CIKM, CIKM ’19, pp. 729–738. Cited by: §4.3.
C. Clark and M. Gardner (2018) Simple and effective multi-paragraph reading comprehension. In Proc. of 56th ACL, pp. 845–855. Cited by: §1.
A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proc. of 2017 EMNLP, pp. 670–680. Cited by: §3.3.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL, pp. 4171–4186. Cited by: §3.3, §4.1.
L. Dong, F. Wei, M. Zhou, and K. Xu (2015) Question answering over freebase with multi-column convolutional neural networks. In Proc. of 53rd ACL, pp. 260–269. Cited by: §2.
A. Fader, L. Zettlemoyer, and O. Etzioni (2014) Open question answering over curated and extracted knowledge bases. In The 20th ACM SIGKDD, pp. 1156–1165. Cited by: §2.
A. Fader, L. S. Zettlemoyer, and O. Etzioni (2013) Paraphrase-driven learning for open question answering. In Proc. of 51st ACL, pp. 1608–1618. Cited by: §2.
I. Gruzdo, I. Kyrychenko, G. Tereshchenko, and O. Cherednichenko (2020) Application of paragraphs vectors model for semantic text analysis. In Proc. of 4th COLINS, CEUR Workshop Proceedings, Vol. 2604, pp. 283–293. Cited by: §6.
T. Khot, A. Sabharwal, and P. Clark (2017) Answering complex questions using open information extraction. In Proc. of 55th ACL 2017, pp. 311–316. Cited by: §2.
T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. TACL. Cited by: §4.2.
P. Lewis, L. Denoyer, and S. Riedel (2019) Unsupervised question answering by cloze translation. In Proc. of 57th ACL, pp. 4896–4910. Cited by: §1.
F. Li, X. Zhang, J. Yuan, and X. Zhu (2008) Classifying what-type questions by head noun tagging. In Proc. of 22nd Coling, pp. 481–488. Cited by: §3.1.
X. Li and D. Roth (2002) Learning question classifiers. In Proc. of 19th COLING, COLING ’02, pp. 1–7. Cited by: §3.1, §3.1, §3.2, Table 1.
J. Lin and B. Katz (2006) Building a reusable test collection for question answering. J. Am. Soc. Inf. Sci. Technol. 57 (7), pp. 851–861. External Links: ISSN 1532-2882 Cited by: §2.
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: 1907.11692 Cited by: §3.3, §4.1.
X. Lu, S. Pramanik, R. Saha Roy, A. Abujabal, Y. Wang, and G. Weikum (2019) Answering complex questions by joining multi-document evidence with quasi knowledge graphs. In Proc. of 42nd SIGIR, SIGIR’19, pp. 105–114. External Links: ISBN 9781450361729 Cited by: §1, §2, §4.1, §4.2, §4.3.
F. McSherry and M. Najork (2008) Computing information retrieval performance measures efficiently in the presence of tied scores. In Proc. of 30th ECIR, ECIR’08, pp. 414–421. Cited by: §4.3.
A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston (2016) Key-value memory networks for directly reading documents. In Proc. 2016 EMNLP, pp. 1400–1409. Cited by: §4.2.
A. Mohasseb, M. Bader-El-Den, and M. Cocea (2018) Question categorization and classification using grammar based approach. Information Processing & Management 54 (6), pp. 1228–1243. External Links: ISSN 0306-4573 Cited by: §3.1.
D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, and V. Rus (1999) LASSO: A tool for surfing the answer net. In Proc. of Eighth TREC 1999, Vol. 500-246. Cited by: §2.
D. Moldovan, M. Pasca, S. Harabagiu, and M. Surdeanu (2002) Performance issues and error analysis in an open-domain question answering system. In Proceedings of 40th ACL, pp. 33–40. Cited by: §3.1.
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of 2016 EMNLP, pp. 2383–2392. Cited by: §1, §4.2.
N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proc. of 2019 EMNLP, pp. 3982–3992. Cited by: §3.1, §3.3.
S. Saha, D. Roy, and M. Mitra (2022) On modifying evaluation measures to deal with ties in ranked lists. In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries, JCDL ’22, New York, NY, USA. External Links: ISBN 9781450393454, Link, Document Cited by: §4.3.
D. Savenkov and E. Agichtein (2016) When a knowledge base is not enough: question answering over knowledge bases with external text data. In Proc. of 39th SIGIR, pp. 235–244. Cited by: §2.
U. Sawant, S. Garg, S. Chakrabarti, and G. Ramakrishnan (2019) Neural architecture for question answering using a knowledge graph and web corpus. Inf. Retr. 22 (3–4), pp. 324–349. External Links: ISSN 1386-4564, Link Cited by: §2.
W. Shalaby, W. Zadrozny, and H. Jin (2019) Beyond word embeddings: learning entity and concept representations from large scale knowledge bases. Inf. Retr. J. 22 (6), pp. 525–542. Cited by: §6.
J. Silva, L. Coheur, A. C. Mendes, and A. Wichert (2010) From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review 35 (2), pp. 137–154. Cited by: §3.1, §3.1.
A. Singhal, S. P. Abney, M. Bacchiani, M. Collins, D. Hindle, and F. C. N. Pereira (1999) AT&T at TREC-8. In Proc. of Eighth TREC 1999, Vol. 500-246. Cited by: §2.
R. K. Srihari and W. Li (2000) A question answering system supported by information extraction. In 6th ANLP, Seattle, Washington, USA, pp. 166–172. Cited by: §2.
F. M. Suchanek, G. Weikum, M. Ramanath, G. Kasneci, and M. Sozio (2009) STAR: steiner-tree approximation in relationship graphs. In 2013 IEEE 29th ICDE, pp. 868–879. External Links: ISSN 1084-4627 Cited by: §4.1.
A. Talmor and J. Berant (2018) The web as a knowledge-base for answering complex questions. In Proc. of NAACL-HLT, pp. 641–651. Cited by: §2.
C. Tan, F. Wei, Q. Zhou, N. Yang, B. Du, W. Lv, and M. Zhou (2018) Context-aware answer sentence selection with hierarchical gated recurrent neural networks. IEEE ACM Trans. Audio Speech Lang. Process. 26 (3), pp. 540–549. Cited by: §2.
H. Tayyar Madabushi and M. Lee (2016) High accuracy rule-based question classification using question syntax and semantics. In Proc. of the 26th COLING 2016, pp. 1220–1230. Cited by: §3.1.
R. Usbeck, A. N. Ngomo, B. Haarmann, A. Krithara, M. Röder, and G. Napolitano (2017) 7th open challenge on question answering over linked data (QALD-7). pp. 59–69. Cited by: §2.
E. M. Voorhees (1999) The TREC-8 question answering track report. In Proc. of TREC-8, pp. 77–82. Cited by: §2.
N. Wacholder, D. Kelly, P. B. Kantor, R. Rittman, Y. Sun, B. Bai, S. G. Small, B. Yamrom, and T. Strzalkowski (2007) A model for quantitative evaluation of an end-to-end question-answering system. J. Assoc. Inf. Sci. Technol. 58 (8), pp. 1082–1099. External Links: Link, Document Cited by: §2.
R. Weischedel, M. Palmer, M. Marcus, E. Hovy, S. Pradhan, L. Ramshaw, N. Xue, A. Taylor, J. Kaufman, M. Franchini, M. El-Bachouti, R. Belvin, and A. Houston (2013) Ontonotes release 5.0. LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium. Cited by: §3.2.
I. Yamada, A. Asai, J. Sakuma, H. Shindo, H. Takeda, Y. Takefuji, and Y. Matsumoto (2020) Wikipedia2Vec: an efficient toolkit for learning and visualizing the embeddings of words and entities from wikipedia. arXiv preprint 1812.06280v3. Cited by: §6.
Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In EMNLP, pp. 2369–2380. Cited by: §1.
C. Zhao, C. Xiong, X. Qian, and J. Boyd-Graber (2020) Complex factoid question answering with a free-text knowledge graph. In Proc. of Web Conference 2020, WWW ’20, pp. 1205–1216. External Links: ISBN 9781450370233 Cited by: §1.
