Question Answering System with Deep Learning

Jake Spracher (SCPD, Stanford Law School, Stanford University) and Robert M. Schoenhals (Department of Management Science and Engineering, Stanford University)

Abstract

The Stanford Question Answering Dataset (SQuAD) challenge is a machine reading comprehension task that has gained popularity in recent years. In this paper, we implement various existing deep learning methods with incremental improvements and conduct a comparative study of their performance on the SQuAD dataset. Our best model achieves 76.1 F1 and 66.1 EM scores on the test set. This project is completed for Assignment 4 of CS224n.

1 Introduction

The Stanford Question Answering Dataset (SQuAD) challenge, a machine comprehension task, has gained popularity in recent years from both theoretical and practical perspectives. The Stanford NLP group published the SQuAD [2] dataset, consisting of more than 100,000 question-answer tuples taken from a set of Wikipedia articles, in which the answer to each question is a consecutive segment of text in the corresponding reading passage. The primary task is to build models that take a paragraph and a question about it as input and identify the answer to the question within the given paragraph. There has been a great deal of research on building state-of-the-art deep learning systems on SQuAD that have been reported to accomplish outstanding performance [3][6][5]. Hence, the objective of this paper is to start with the provided starter code for CS224n: Natural Language Processing with Deep Learning (2017-2018 Winter Quarter), make successive improvements by implementing existing models, and compare their performance on SQuAD.

The rest of the paper is organized as follows. Section 2 illustrates the components and the architecture of our system, Section 3 describes the experiments, Section 4 demonstrates error analysis, and Section 5 concludes the paper and discusses future work.

2 Model

Our model is modular in that different components can be swapped for others thanks to independent implementations. Thus, we first present all the modules considered in this paper, and then illustrate the specific combinations of modules that we run in experiments.

2.1 Model Components

2.1.1 Encoding Layer

The $d$-dimensional word embeddings of a question $x_1, \dots, x_M \in \mathbb{R}^d$ and a context $y_1, \dots, y_N \in \mathbb{R}^d$ are fed into a bidirectional LSTM with weights shared between the question and the context. The encoding layer encodes the embeddings into the matrix of context hidden states $H \in \mathbb{R}^{N \times 2h}$ and the matrix of question hidden states $U \in \mathbb{R}^{M \times 2h}$, where $h$ is the size of the hidden states.

2.1.2 Bidirectional Attention Layer

The bidirectional attention layer [3] is one of the attention layers that we use. Given the context hidden states $h_1, \dots, h_N \in \mathbb{R}^{2h}$ and the question hidden states $u_1, \dots, u_M \in \mathbb{R}^{2h}$, a matrix $S \in \mathbb{R}^{N \times M}$ is computed according to $S_{ij} = w^\top [h_i; u_j; h_i \circ u_j] \in \mathbb{R}$, where $w \in \mathbb{R}^{6h}$ is a weight vector learned through training.

We first compute Context-to-Question (C2Q) attention. The C2Q attention distribution is obtained by $\alpha^i = \mathrm{softmax}(S_{i,:}) \in \mathbb{R}^M$ for $i \in \{1, \dots, N\}$. The question hidden states $u_j$ are then weighted according to $\alpha^i$ to get the C2Q attention output $a_i = \sum_{j=1}^{M} \alpha^i_j u_j \in \mathbb{R}^{2h}$.

Next, we compute Question-to-Context (Q2C) attention. The Q2C attention distribution is obtained by $\beta = \mathrm{softmax}(m) \in \mathbb{R}^N$ for $m_i = \max_j S_{ij}$, $i \in \{1, \dots, N\}$. The context hidden states $h_i$ are then weighted according to $\beta$ to get the Q2C attention output $c' = \sum_{i=1}^{N} \beta_i h_i \in \mathbb{R}^{2h}$.

Then, we get the bidirectional attention encoding $b_i = [h_i; a_i; h_i \circ a_i; h_i \circ c'] \in \mathbb{R}^{8h}$ for $i \in \{1, \dots, N\}$.
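As a concrete illustration of the computation above, the following is a minimal sketch of the bidirectional attention layer. It is written with PyTorch, omits batching, and the function name and choice of framework are our own assumptions rather than details of the project's actual implementation.

```python
import torch
import torch.nn.functional as F

def bidaf_attention(H, U, w):
    """H: (N, 2h) context hidden states, U: (M, 2h) question hidden states,
    w: (6h,) learned similarity weight. Returns the (N, 8h) encoding b."""
    N, M = H.size(0), U.size(0)
    # Similarity matrix S with S_ij = w^T [h_i; u_j; h_i * u_j].
    h = H.unsqueeze(1).expand(N, M, H.size(1))        # (N, M, 2h)
    u = U.unsqueeze(0).expand(N, M, U.size(1))        # (N, M, 2h)
    S = torch.cat([h, u, h * u], dim=2) @ w           # (N, M)
    # Context-to-Question: softmax over question positions, then blend the u_j.
    alpha = F.softmax(S, dim=1)                       # (N, M)
    a = alpha @ U                                     # (N, 2h)
    # Question-to-Context: softmax over the row-wise maxima, then blend the h_i.
    beta = F.softmax(S.max(dim=1).values, dim=0)      # (N,)
    c = (beta @ H).unsqueeze(0).expand(N, H.size(1))  # c' broadcast to every i
    # Bidirectional attention encoding b_i = [h_i; a_i; h_i * a_i; h_i * c'].
    return torch.cat([H, a, H * a, H * c], dim=1)     # (N, 8h)
```

With the notation above, calling bidaf_attention(H, U, w) on $H \in \mathbb{R}^{N \times 2h}$, $U \in \mathbb{R}^{M \times 2h}$, and a learned $w \in \mathbb{R}^{6h}$ yields the matrix whose rows are the $b_i$ just defined.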
2.1.3 Coattention Layer

Another type of attention layer we implemented is the coattention layer [6]. Given the question hidden states $u_1, \dots, u_M \in \mathbb{R}^{2h}$, we first compute projected question hidden states $u'_j = \tanh(W u_j + b) \in \mathbb{R}^{2h}$ for $j \in \{1, \dots, M\}$. Also, sentinel vectors $h_\varnothing$ and $u_\varnothing$ are appended to the context and question hidden states, which gives us $\{h_1, \dots, h_N, h_\varnothing\}$ and $\{u'_1, \dots, u'_M, u_\varnothing\}$. We then compute an affinity matrix $L \in \mathbb{R}^{(N+1) \times (M+1)}$, where $L_{ij} = h_i^\top u'_j \in \mathbb{R}$. Using the affinity matrix $L$, we apply the standard attention mechanism in both directions. The Context-to-Question attention output is obtained by $a_i = \sum_{j=1}^{M+1} \alpha^i_j u'_j \in \mathbb{R}^{2h}$ for $\alpha^i = \mathrm{softmax}(L_{i,:}) \in \mathbb{R}^{M+1}$, $i \in \{1, \dots, N\}$. The Question-to-Context attention output is computed in a similar way: $b_j = \sum_{i=1}^{N+1} \beta^j_i h_i \in \mathbb{R}^{2h}$ for $\beta^j = \mathrm{softmax}(L_{:,j}) \in \mathbb{R}^{N+1}$, $j \in \{1, \dots, M\}$. Next, we compute the second-level attention output $s_i = \sum_{j=1}^{M+1} \alpha^i_j b_j \in \mathbb{R}^{2h}$. Finally, $[a_i; s_i] \in \mathbb{R}^{4h}$, $i \in \{1, \dots, N\}$, is fed into a bidirectional LSTM, and the resulting hidden states are the coattention encoding.

2.1.4 Modeling Layer

Following the example of BiDAF [3], we implement a modeling layer comprised of two layers of bidirectional LSTMs, which outputs $M \in \mathbb{R}^{N \times 2h}$.

2.2 Self-Attention Layer

A self-attention layer [4] is used as an alternative to the modeling layer. Given the context hidden states $H \in \mathbb{R}^{N \times 2h}$, we apply the attention mechanism to obtain the attention distribution $A = \mathrm{softmax}(H H^\top / \sqrt{2h}) \in \mathbb{R}^{N \times N}$, where the softmax is taken with respect to the rows of $H H^\top / \sqrt{2h}$. Then, the self-attention output is computed as $A H \in \mathbb{R}^{N \times 2h}$.

2.2.1 Output Layers

The basic output layer we consider has the same structure as that of BiDAF [3]. This module is used in conjunction with the modeling layer. Let $G \in \mathbb{R}^{N \times 8h}$ denote the output of an attention layer. The probability distribution of the start index is computed by $p^{\mathrm{start}} = \mathrm{softmax}(w_1^\top [G; M]) \in \mathbb{R}^N$, where $w_1 \in \mathbb{R}^{10h}$ is a trainable weight. The modeling-layer output $M$ is then passed to a bidirectional LSTM that outputs $M_2 \in \mathbb{R}^{N \times 2h}$. Finally, the probability distribution of the end index is obtained by $p^{\mathrm{end}} = \mathrm{softmax}(w_2^\top [G; M_2]) \in \mathbb{R}^N$.

Another type of output layer we implemented is the answer-pointer layer [6]. Given the blended representation $G$, the probability distribution of the start index is given by $p^{\mathrm{start}} = \mathrm{softmax}(w^\top F_1 + c \otimes e_N) \in \mathbb{R}^N$, where $F_1 = \tanh(V G + b \otimes e_N)$, and $w$, $c \in \mathbb{R}$, $V$, and $b$ are parameters to be trained. The operator $\otimes e_N$ produces a matrix by repeating the element on the left-hand side $N$ times. Then, we compute the hidden vector $h_1$ by applying the attention mechanism, $G\, p^{\mathrm{start}}$, and passing the result to a standard LSTM. Finally, the probability distribution of the end index is obtained by $p^{\mathrm{end}} = \mathrm{softmax}(w^\top F_2 + c \otimes e_N) \in \mathbb{R}^N$, where $F_2 = \tanh(V G + (W_h h_1 + b) \otimes e_N)$ and $W_h$ is another trainable weight.

The start and end indices $(i^*, j^*)$ are selected such that the joint probability $p^{\mathrm{start}}_i \, p^{\mathrm{end}}_j$ is maximized subject to $i \le j$.
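To make the final span-selection step concrete, the sketch below picks $(i^*, j^*)$ by forming the outer product of the two distributions and restricting it to the upper triangle ($i \le j$). It is plain Python with NumPy, and the function name is illustrative rather than taken from our code.

```python
import numpy as np

def select_span(p_start, p_end):
    """p_start, p_end: length-N probability vectors over context positions.
    Returns (i*, j*) maximizing p_start[i] * p_end[j] subject to i <= j."""
    joint = np.outer(p_start, p_end)   # joint[i, j] = p_start[i] * p_end[j]
    joint = np.triu(joint)             # zero out pairs with i > j
    i_star, j_star = np.unravel_index(np.argmax(joint), joint.shape)
    return int(i_star), int(j_star)
```

For example, select_span([0.1, 0.7, 0.2], [0.6, 0.1, 0.3]) returns (1, 2), since 0.7 * 0.3 is the largest product among pairs with $i \le j$.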