Attention Based CNN
Abstract
Understanding open-domain text is one of the primary challenges in natural language processing (NLP). Machine comprehension benchmarks evaluate a system's ability to understand text based on the text content alone. In this work, we investigate machine comprehension on MCTest, a question answering (QA) benchmark. Prior work is mainly based on feature engineering approaches. We propose a neural network framework, named hierarchical attention-based convolutional neural network (HABCNN), to address this task without any manually designed features. Specifically, we explore HABCNN for this task by two routes: one is through traditional joint modeling of document, question and answer; the other is through textual entailment. HABCNN employs an attention mechanism to detect key phrases, key sentences and key snippets that are relevant to answering the question. Experiments show that HABCNN outperforms prior deep learning approaches by a large margin.

1 Introduction

Endowing machines with the ability to understand natural language is a long-standing goal in NLP and holds the promise of revolutionizing the way in which people interact with machines and retrieve information. Richardson et al. [2013] proposed the task of machine comprehension, along with MCTest, a question answering dataset for evaluation. The ability of the machine to understand text is evaluated by posing a series of questions, where the answer to each question can be found only in the associated text. Solutions typically focus on some semantic interpretation of the text, possibly with some form of probabilistic or logical inference, to answer the question. Despite intensive recent work [Weston et al., 2014; Weston et al., 2015; Hermann et al., 2015; Sachan et al., 2015], the problem is far from solved.

Figure 1: One example with 2 out of 4 questions in the MCTest. "*" marks the correct answer.

Machine comprehension is an open-domain question-answering problem which contains factoid questions, but the answers can be derived by extraction or induction of key clues. Figure 1 shows one example in MCTest. Each example consists of one document and four associated questions; each question is followed by four answer candidates, of which only one is correct. Questions in MCTest have two categories: "one" and "multiple". The label means that one or multiple sentences from the document are required to answer the question. To correctly answer the first question in the example, the two blue sentences are required; for the second question, instead, only the red sentence can help. The following observations hold for the whole of MCTest. (i) Most of the sentences in the document are irrelevant for a given question. This hints that we need to pay attention to just some key regions. (ii) The answer candidates can be flexible text in length and abstraction level, and probably do not appear in the document. For example, candidate B for the second question is "outside", which is one word and does not exist in the document, while the answer candidates for the first question are longer texts with some auxiliary words like "Because". This requires our system to handle flexible texts via extraction as well as abstraction. (iii) Some questions require multiple sentences to infer the answer, and those vital sentences mostly appear close to each other (we call them a snippet). Hence, our system should be able to make a choice, or a compromise, between a potential single-sentence clue and a snippet clue.

Prior work on this task is mostly based on feature engineering. This work, instead, takes the lead in presenting a deep neural network based approach without any linguistic features involved.

Concretely, we propose HABCNN, a hierarchical attention-based convolutional neural network, to address this task via two roadmaps. In the first, we project the document in two different ways, one based on question-attention and one based on answer-attention, and then compare the two projected document representations to determine whether the answer matches the question. In the second, every question-answer pair is reformatted into a statement, and the whole task is treated as textual entailment.
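As a rough illustration, the following minimal numpy sketch mimics the first roadmap: it projects a document once under question-attention and once under answer-attention, then compares the two projections, with a stub hinting at the statement reformatting of the second roadmap. The random vectors, the softmax weighting, and the helper names (attentive_projection, to_statement) are illustrative assumptions of ours; the actual model uses learned CNN representations and the attention scheme described below.

```python
# A minimal sketch of the two roadmaps. Random vectors stand in for the
# learned CNN representations; the softmax weighting is a simplification
# of the attention scheme described later in the paper.
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def attentive_projection(sent_reps, query_rep):
    """Weight each sentence by its similarity to the query and sum, so
    the document representation is biased toward the query."""
    scores = np.array([cosine(s, query_rep) for s in sent_reps])
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ sent_reps

def to_statement(question: str, answer: str) -> str:
    """Stub for the second roadmap: reformat a question-answer pair into
    a statement (e.g. "Why did Grandpa answer the door?" + "Because Jimmy
    knocked" -> "Grandpa answered the door because Jimmy knocked").
    Real reformatting needs hand-written rules; this join is a placeholder."""
    return question.rstrip("?") + " " + answer

d = 50                                   # toy dimensionality
sent_reps = rng.normal(size=(6, d))      # six "sentence" vectors
q_rep, a_rep = rng.normal(size=d), rng.normal(size=d)

d_q = attentive_projection(sent_reps, q_rep)   # question-attended document
d_a = attentive_projection(sent_reps, a_rep)   # answer-attended document
print(f"match score: {cosine(d_q, d_a):.3f}")  # high -> answer fits question
print(to_statement("Why did Grandpa answer the door?", "Because Jimmy knocked."))
```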
In both roadmaps, a convolutional neural network (CNN) is used to model all types of text. As human beings usually do for such a QA task, our model is expected to detect the key snippets, key sentences, and key words or phrases in the document. In order to detect those informative parts required by a question, we explore an attention mechanism to model the document so that its representation concentrates the required information. In practice, instead of imitating human beings by processing the QA task top-down, our system models the document bottom-up, accumulating the most relevant information from the word level up to the snippet level.

Figure 2: Illustrations of HABCNN-QAP (top), HABCNN-QP (middle) and HABCNN-TE (bottom). Q, A, S: question, answer, statement; D: document.

Our approach is novel in three aspects. (i) A document is modeled by a hierarchical CNN at different granularities, from the word to the sentence level, then from the sentence to the snippet level. The reason for choosing a CNN rather than other sequence models such as recurrent neural networks [Mikolov et al., 2010], long short-term memory units (LSTM [Hochreiter and Schmidhuber, 1997]) or gated recurrent units (GRU [Cho et al., 2014]) is that we argue CNNs are more suitable for detecting the key sentences within documents and the key phrases within sentences. Considering again the second question in Figure 1, the original sentence "They sat by the fire and talked about the insects" contains more information than required; i.e., we do not need to know that "they talked about the insects". Sequence modeling neural networks usually model the sentence meaning by accumulating over the whole sequence. CNNs, with their convolution-pooling steps, are designed to detect prominent features no matter where those features occur. (ii) In the example in Figure 1, apparently not all sentences are required for a given question, and usually different snippets are required by different questions. Hence, the same document should have different representations depending on what the question is. To this end, attention is incorporated into the hierarchical CNN to guide the learning of dynamic document representations that closely match the information requirements of questions. (iii) Document representations at the sentence and snippet levels are both informative for the question, so a highway network is developed to combine them, enabling our system to make a flexible tradeoff.
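To make points (i) to (iii) concrete, here is a minimal numpy sketch of the hierarchical convolution-pooling pipeline with a highway-style combination. The dimensions, the random filters, the conv helper, and the mean-pooling used for the two document representations are illustrative assumptions of ours; the actual model learns its filters and uses the attention-pooling discussed later.

```python
# A sketch of the hierarchy: convolution-pooling from words to sentence
# representations, convolution over sentences to snippet representations,
# and a highway-style gate combining the two granularities. Filters are
# random and pooling is a mean here; the real model learns both.
import numpy as np

rng = np.random.default_rng(1)
d_word, d = 30, 50   # toy word-vector and representation sizes
w = 2                # filter width; cf. w = 2 for the snippet-CNN in Table 1

W_sent = rng.normal(scale=0.1, size=(d, w * d_word))  # sentence-CNN filters
W_snip = rng.normal(scale=0.1, size=(d, w * d))       # snippet-CNN filters

def conv(seq, W, width):
    """Slide filter bank W over a sequence of vectors; one tanh feature
    vector per window position."""
    return np.stack([np.tanh(W @ seq[i:i + width].ravel())
                     for i in range(len(seq) - width + 1)])

# A toy document: six sentences of 5-9 random "word vectors" each.
doc = [rng.normal(size=(rng.integers(5, 10), d_word)) for _ in range(6)]

# Word -> sentence: convolve over words, max-pool over positions.
sents = np.stack([conv(s, W_sent, w).max(axis=0) for s in doc])
# Sentence -> snippet: each output covers w consecutive sentences.
snips = conv(sents, W_snip, w)

# Document representations at both granularities (a mean stands in for
# the attention-pooling described later).
d_sent, d_snip = sents.mean(axis=0), snips.mean(axis=0)

# Highway-style gate trading off the two levels (cf. the highway networks
# of Srivastava et al. [2015]); learned in the paper, random here.
g = 1.0 / (1.0 + np.exp(-rng.normal(size=d)))
d_overall = g * d_sent + (1.0 - g) * d_snip
print(d_overall.shape)  # (50,)
```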
Overall, we make three contributions. (i) We present a hierarchical attention-based CNN system, "HABCNN". It is, to our knowledge, the first deep learning based system for the MCTest task. (ii) Prior document modeling systems based on deep neural networks mostly generate a generic representation; this work is the first to incorporate attention so that the document representation is biased towards the question requirements. (iii) Our HABCNN systems outperform other deep learning competitors by large margins.

2 Related Work

Existing systems for the MCTest task are mostly based on manually engineered features. Representative work includes [Narasimhan and Barzilay, 2015; Sachan et al., 2015; Wang and McAllester, 2015; Smith et al., 2015]. In these works, a common route is first to define a regularized loss function based on assumed feature vectors, and then to focus the effort on designing effective features based on various rules. Even though this research is groundbreaking for the task, its flexibility and capacity for generalization are limited.

Deep learning based approaches have attracted increasing interest in analogous tasks. Weston et al. [2014] introduce memory networks for factoid QA. The memory network framework is extended in [Weston et al., 2015; Kumar et al., 2015] for the Facebook bAbI dataset. Peng et al. [2015]'s Neural Reasoner infers over multiple supporting facts to generate an entity answer for a given question; it is also tested on bAbI. All of these works deal with short texts of simple grammar, aiming to generate an answer that is restricted to be a single word denoting a location, a person, etc.

Some works have also tackled other kinds of QA tasks. For example, Iyyer et al. [2014] present QANTA, a recursive neural network, to infer an entity based on its description text. This task is basically a matching between a description and an entity; no explicit question exists. Another difference from our work lies in that all the sentences in the entity description contain partial information about the entity, hence a description is supposed to have only one representation. In our task, however, the modeling of the document should change dynamically according to the question analysis. Hermann et al. [2015] incorporate an attention mechanism into an LSTM for a QA task over news text. Still, their work does not handle complex question types like "Why...": they merely aim to find the entity in the document that fills the slot in the query, so that the completed query is true based on the document. Nevertheless, it inspires us to treat our task as a textual entailment problem by first reformatting question-answer pairs into statements.

Some other deep learning systems have been developed for the answer selection task [Yu et al., 2014; Yang et al., 2015; Severyn and Moschitti, 2015; Shen et al., 2015; Wang et al., 2010]. This kind of question answering task, however, does not involve document comprehension: those systems only try to match the question and the answer candidate without any background information. In contrast, we treat machine comprehension in this work as a question-answer matching problem under background guidance.

Overall, for the open-domain MCTest machine comprehension task, this work is the first to resort to deep neural networks.
Figure 3: HABCNN. Feature maps for phrase representations p_i and the max-pooling steps that create sentence representations out of phrase representations are omitted for simplification. Each snippet covers three sentences in the snippet-CNN. The symbol ◦ denotes cosine similarity calculation.

Figure 4: Attention visualization for the statement "Grandpa answered the door because Jimmy knocked" in the example of Figure 1. Overlong sentences are truncated with "...". Left: attention weights for each single sentence after the sentence-CNN. Right: attention weights for each snippet (two consecutive sentences, as the filter width w = 2 in Table 1) after the snippet-CNN.
doubt such kind of attention scheme when applied to long sequences of sentences. In training, the attention weights after softmax normalization actually differ only slightly across sentences, which means the system cannot distinguish key sentences from noise sentences effectively. Our cosine-similarity based attention-pooling, though quite simple, is able to filter out noise sentences more effectively, as we only pick the top-k pivotal sentences to form the final D representation. This trick keeps the system simple while remaining effective.
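As a reference, the following minimal numpy sketch shows one way to realize this top-k cosine attention-pooling; the softmax renormalization over the surviving scores and the helper name topk_attention_pool are our own assumptions, since the paragraph above only states that the top-k pivotal units form the final D representation.

```python
# Cosine-score every unit (sentence or snippet) against the query, keep
# only the top-k, and pool the survivors; all other units are dropped
# entirely rather than merely down-weighted as under a global softmax.
import numpy as np

def topk_attention_pool(unit_reps, query_rep, k):
    """Top-k cosine attention-pooling over an (n, d) array of unit
    representations; returns a single (d,) document representation."""
    norms = np.linalg.norm(unit_reps, axis=1) * np.linalg.norm(query_rep)
    scores = unit_reps @ query_rep / (norms + 1e-8)   # cosine per unit
    keep = np.argsort(scores)[-k:]                    # indices of top-k
    weights = np.zeros_like(scores)
    weights[keep] = np.exp(scores[keep])              # softmax over top-k only
    weights /= weights.sum()
    return weights @ unit_reps

rng = np.random.default_rng(2)
sent_reps = rng.normal(size=(10, 50))   # toy sentence representations
snip_reps = rng.normal(size=(9, 50))    # toy snippet representations
q_rep = rng.normal(size=50)             # toy query representation
d_sent = topk_attention_pool(sent_reps, q_rep, k=1)  # k values as used
d_snip = topk_attention_pool(snip_reps, q_rep, k=3)  # in Section 4.6 / Table 1
```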
4.6 Case Study and Error Analysis

In Figure 4, we visualize the attention distribution at the sentence level as well as the snippet level for the statement "Grandpa answered the door because Jimmy knocked", whose corresponding question requires multiple sentences to answer. From the left part, we can see that "Grandpa answered the door with a smile and welcomed Jimmy inside" has the highest attention weight. This meets the intuition that this sentence has semantic overlap with the statement, and yet this sentence does not contain the answer. Looking further at the right part, the CNN layer over sentence-level representations is supposed to extract high-level features of snippets. At this level, the highest attention weight is cast to the best snippet "Finally, Jimmy arrived...knocked. Grandpa answered the door...", and the neighboring snippets also receive relatively higher attention than other regions. Recall that our system chooses the one sentence with top attention in the left part and the top-3 snippets in the right part (referring to the k values in Table 1) to form D representations at different granularities, then uses a highway network to combine both representations into an overall D representation. This visualization hints that our architecture provides a good way for a question to compromise between key information at different granularities.

We also performed some preliminary error analysis. One big obstacle for our systems is "how many" questions. For example, for the question "how many rooms did I say I checked?", the answer candidates are the four digits "5,4,3,2", which never appear in the D but require the counting of some locations. Such digital answers cannot be modeled well by distributed representations so far. In addition, digital answers also appear for "what" questions, like "what time did...". Another big limitation lies in "why" questions. This question type requires complex inference and long-distance dependencies. We observed that all deep learning systems, including the two baselines, suffered somewhat from it.

5 Conclusion

This work takes the lead in presenting a CNN based neural network system for the open-domain machine comprehension task. Our systems tried to solve this task in a document projection way as well as in a textual entailment way; the latter demonstrates slightly better performance. Overall, our architecture, modeling dynamic document representations by an attention scheme from the sentence level to the snippet level, shows promising results on this task. In the future, more fine-grained representation learning approaches are expected to model complex answer types and question types.
References

[Cho et al., 2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724–1734, 2014.

[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167, 2008.

[Duchi et al., 2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[Hermann et al., 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of NIPS, pages 1684–1692, 2015.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Iyyer et al., 2014] Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. A neural network for factoid question answering over paragraphs. In Proceedings of EMNLP, pages 633–644, 2014.

[Järvelin and Kekäläinen, 2002] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.

[Kumar et al., 2015] Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. arXiv preprint arXiv:1506.07285, 2015.

[Mikolov et al., 2010] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proceedings of INTERSPEECH, pages 1045–1048, 2010.

[Narasimhan and Barzilay, 2015] Karthik Narasimhan and Regina Barzilay. Machine comprehension with discourse relations. In Proceedings of ACL, pages 1253–1262, 2015.

[Peng et al., 2015] Baolin Peng, Zhengdong Lu, Hang Li, and Kam-Fai Wong. Towards neural network-based reasoning. CoRR, abs/1508.05508, 2015.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543, 2014.

[Richardson et al., 2013] Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of EMNLP, volume 1, pages 193–203, 2013.

[Sachan et al., 2015] Mrinmaya Sachan, Avinava Dubey, Eric P. Xing, and Matthew Richardson. Learning answer-entailing structures for machine comprehension. In Proceedings of ACL, pages 239–249, 2015.

[Severyn and Moschitti, 2015] Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of SIGIR, pages 373–382, 2015.

[Shen et al., 2015] Yikang Shen, Wenge Rong, Zhiwei Sun, Yuanxin Ouyang, and Zhang Xiong. Question/answer matching for CQA system via combining lexical and sequential information. In Proceedings of AAAI, pages 275–281, 2015.

[Smith et al., 2015] Ellery Smith, Nicola Greco, Matko Bosnjak, and Andreas Vlachos. A strong lexical matching method for the machine comprehension test. In Proceedings of EMNLP, pages 1693–1698, 2015.

[Srivastava et al., 2015] Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Proceedings of NIPS, pages 2368–2376, 2015.

[Wang and McAllester, 2015] Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. Machine comprehension with syntax, frames, and semantics. In Proceedings of ACL, Volume 2: Short Papers, pages 700–706, 2015.

[Wang et al., 2010] Baoxun Wang, Xiaolong Wang, Chengjie Sun, Bingquan Liu, and Lin Sun. Modeling semantic relevance for question-answer pairs in web social communities. In Proceedings of ACL, pages 1230–1238, 2010.

[Weston et al., 2014] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In Proceedings of ICLR, 2014.

[Weston et al., 2015] Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.

[Yang et al., 2015] Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of EMNLP, pages 2013–2018, 2015.

[Yu et al., 2014] Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. Deep learning for answer sentence selection. In ICLR Workshop, 2014.