Memory Networks for

Language Understanding

Jason Weston

Facebook AI Research
Intelligent Conversational
Agents
End-to-End Dialog Agents
While it is possible to build useful dialog agents as a set
of separate black boxes with joining logic (Google Now,
Cortana, Siri, .. ?) we believe a true dialog agent should:
 Be able to combine all its knowledge to fulfill
complex tasks.
 Handle long open-ended conversations, effectively tracking many latent variables.
 Be able to learn (new tasks) via conversation.
Our bet: Machine Learning End-to-End systems are the way forward in the long run.
Memory Networks
 Class of models that combine large memory with learning
component that can read and write to it.
 Incorporates reasoning with attention over memory (RAM).
 Most ML has limited memory which is more-or-less all that’s
needed for “low level” tasks e.g. object detection.

 Our motivation: long-term memory is required to read a story and then e.g. answer questions about it.
 Similarly, it's also required for dialog: to remember previous dialog (short- and long-term), and respond.
1. We first test this on the toy (bAbI) tasks.
2. Any interesting model has to be good on real data as well.
Memory Networks
Evaluating End-To-End
Learners
 Long Term goal: A learner can be trained (from
scratch?) to understand and use language.
 Our main interest: uncover the learning algorithms
able to do so.
 Inspired by “A Roadmap towards Machine
Intelligence” (Mikolov, Joulin, Baroni 2015) we
advocate a set of tasks to train & evaluate on:
 Classic Language Modeling (Penn TreeBank, Text8)
 Story understanding (Children’s Book Test, News articles)
 Open Question Answering (WebQuestions, WikiQA)
 Goal-Oriented Dialog and Chit-Chat (Movie Dialog,
Ubuntu)
What is a Memory
Network? (Original paper's description of the class of models)
MemNNs have four component networks (which may
or may not have shared parameters):
 I: (input feature map) convert incoming data to the
internal feature representation.
 G: (generalization) update memories given new
input.
 O: produce new output (in feature
representation space) given the memories.
 R: (response) convert output O into
a response seen by the outside world.
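To make the four components concrete, here is a minimal Python sketch of the I/G/O/R interface described above; the component split follows the slide, while the encoder, scorer and responder callables are illustrative placeholders rather than part of the original design.

```python
class MemoryNetwork:
    """Minimal sketch of the I/G/O/R decomposition (illustrative only)."""

    def __init__(self, encoder, scorer, responder):
        self.memories = []          # the external memory (a list of slots)
        self.encoder = encoder      # used by I
        self.scorer = scorer        # used by O
        self.responder = responder  # used by R

    def I(self, x):
        # input feature map: convert incoming data to the internal representation
        return self.encoder(x)

    def G(self, feat):
        # generalization: here, simply write to the next available slot
        self.memories.append(feat)

    def O(self, q_feat):
        # output: select the memory most relevant to the query
        best = max(self.memories, key=lambda m: self.scorer(q_feat, m))
        return (q_feat, best)

    def R(self, o):
        # response: convert the output features into a visible answer
        return self.responder(o)

    def answer(self, question):
        return self.R(self.O(self.I(question)))
```

In the concrete implementations on the following slides, I is a bag-of-words embedder, G simply appends to memory, O performs one or two argmax lookups, and R ranks dictionary words.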
Some Memory Network-
related Publications
 J. Weston, S. Chopra, A. Bordes. Memory Networks. ICLR 2015 (and
arXiv:1410.3916).
 S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. End-To-End Memory Networks.
NIPS 2015 (and arXiv:1503.08895).
 J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, T.
Mikolov. Towards AI-Complete Question Answering: A Set of Prerequisite Toy
Tasks. arXiv:1502.05698.
 A. Bordes, N. Usunier, S. Chopra, J. Weston. Large-scale Simple Question
Answering with Memory Networks. arXiv:1506.02075.
 J. Dodge, A. Gane, X. Zhang, A. Bordes, S. Chopra, A. Miller, A. Szlam, J. Weston.
Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems.
arXiv:1511.06931.
 F. Hill, A. Bordes, S. Chopra, J. Weston. The Goldilocks Principle: Reading
Children's Books with Explicit Memory Representations. arXiv:1511.02301.
 J. Weston. Dialog-based Language Learning. arXiv:1604.06045.
 A. Bordes, Jason Weston. Learning End-to-End Goal-Oriented Dialog.
arXiv:1605.07683.
Memory Network Models
Implemented models:

[Figure by Sainbayar Sukhbaatar]
Variants of the class…
Some options and extensions:
 Representation of inputs and memories could use all
kinds of encodings: bag of words, RNN style reading at
word or character level, etc.
 Different possibilities for output module: e.g. multi-
class classifier or uses an RNN to output sentences.
 If the memory is huge (e.g. Wikipedia) we need to
organize the memories. Solution: hash the memories to
store in buckets (topics). Then, memory addressing and
reading doesn’t operate on all memories.
 If the memory is full, there could be a way of removing
one it thinks is most useless; i.e. it ``forgets’’ somehow.
That would require a scoring function of the utility of each
memory..
Task (1) Factoid QA with
Single Supporting Fact
(“where is actor”)

(Very Simple) Toy reading comprehension task:

John was in the bedroom.


Bob was in the office.
John went to the kitchen. SUPPORTING FACT
Bob travelled back home.
Where is John? A:kitchen
(2) Factoid QA with Two
Supporting Facts (“where is
actor+object”)
A harder (toy) task is to answer questions where two supporting
statements have to be chained to answer the question:

John is in the playground.


Bob is in the office.
John picked up the football.
Bob went to the kitchen.
Where is the football? A:playground
Where was Bob before the kitchen? A:office
(2) Factoid QA with Two
Supporting Facts (“where is
actor+object”)
A harder (toy) task is to answer questions where two supporting
statements have to be chained to answer the question:

John is in the playground. SUPPORTING FACT


Bob is in the office.
John picked up the football. SUPPORTING FACT
Bob went to the kitchen.
Where is the football? A:playground
Where was Bob before the kitchen? A:office

To answer the first question Where is the football? both John


picked up the football and John is in the playground are
supporting facts.

Memory Network Models
Implemented models

[Figure by Sainbayar Sukhbaatar]
The First MemNN
Implemention
 I (input): converts to bag-of-word-embeddings x.
 G (generalization): stores x in the next available slot m_N.
 O (output): loops over all memories k = 1 or 2 times:
  1st loop max: finds the best match m_i with x.
  2nd loop max: finds the best match m_j with (x, m_i).
  The output o is represented with (x, m_i, m_j).

 R (response): ranks all words in the dictionary given o and returns the best single word. (Or: use a full RNN here.)
Matching function
 For a given Q, we want a good match to the relevant
memory slot(s) containing the answer, e.g.:

Match(Where is the football?, John picked up the football)
 We use a q^T U^T U d embedding model with word embedding features:
 LHS features: Q:Where Q:is Q:the Q:football Q:?
 RHS features: D:John D:picked D:up D:the D:football
QDMatch:the QDMatch:football
(QDMatch:football is a feature to say there’s a Q&A word match, which
can help.)

The parameters U are trained with a margin ranking loss: supporting facts should score higher than non-supporting facts.
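A sketch of the q^T U^T U d match score over sparse feature bags, together with the margin ranking loss for a single positive/negative pair; U, feat_index and the feature strings are illustrative.

```python
import numpy as np

def match_score(q_feats, d_feats, U, feat_index):
    # q^T U^T U d: embed both sparse feature bags with U, then take a dot product
    def embed(feats):
        v = np.zeros(U.shape[1])
        for f in feats:
            v += U[feat_index[f]]   # sum the embedding rows of the active features
        return v
    return embed(q_feats) @ embed(d_feats)

def margin_ranking_loss(q, pos_fact, neg_fact, U, feat_index, margin=0.1):
    # supporting facts should score higher than non-supporting facts by a margin
    return max(0.0, margin
               - match_score(q, pos_fact, U, feat_index)
               + match_score(q, neg_fact, U, feat_index))
```

In training, the negatives are other (non-supporting) memories sampled for the same question.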
Matching function: 2nd hop

 On the 2nd hop we match the question & the 1st hop fact to a new fact:

Match([Where is the football?, John picked up the football], John is in the playground)

 We use the same q^T U^T U d embedding model:
 LHS features: Q:Where Q:is Q:the Q:football Q:?
Q2:John Q2:picked Q2:up Q2:the Q2:football
 RHS features: D:John D:is D:in D:the D:playground
QDMatch:the QDMatch:is .. Q2DMatch:John
Objective function
Minimize:

Where: S_O is the matching function for the Output component.
S_R is the matching function for the Response component.
x is the input question.
m_O1 is the first true supporting memory (fact).
m_O2 is the second true supporting memory (fact).
r is the true response.
True facts and responses m_O1, m_O2 and r should score higher than all other candidates by a given margin.
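The loss itself was an image on the original slide and did not survive extraction; to the best of my recollection, the margin ranking objective in the original Memory Networks paper has the following form, with margin γ, candidate facts f̄, f̄′ and candidate responses r̄:

$$
\begin{aligned}
\sum_{\bar{f} \neq m_{O1}} &\max\big(0,\ \gamma - S_O(x, m_{O1}) + S_O(x, \bar{f})\big) \\
{}+ \sum_{\bar{f}' \neq m_{O2}} &\max\big(0,\ \gamma - S_O([x, m_{O1}], m_{O2}) + S_O([x, m_{O1}], \bar{f}')\big) \\
{}+ \sum_{\bar{r} \neq r} &\max\big(0,\ \gamma - S_R([x, m_{O1}, m_{O2}], r) + S_R([x, m_{O1}, m_{O2}], \bar{r})\big)
\end{aligned}
$$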
Comparing triples
 We also need time information for the bAbI tasks. We tried
adding absolute time as a feature: it works, but the
following idea can be better:
 It seems to work better if we compare triples:
 Match(Q, D, D') returns < 0 if D is better than D', and > 0 if D' is better than D.

We can loop through memories, keeping the best m_i at each step.

Now the features include relative time features:
L.H.S: same as before
R.H.S: features(D) DbeforeQ:0-or-1
       -features(D') D'beforeQ:0-or-1 DbeforeD':0-or-1
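A sketch of the triple-comparison inference loop ("keep the best m_i at each step"), assuming a match_triple(Q, D, D') function with the sign convention above; all names are illustrative.

```python
def best_memory(question, memories, match_triple):
    # Loop through memories, keeping the current best at each step.
    best = memories[0]
    for candidate in memories[1:]:
        # match_triple(Q, D, D') < 0 means D is better than D'
        if match_triple(question, candidate, best) < 0:
            best = candidate
    return best
```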


Comparing triples: Objective and
Inference

Similar to before, except that now for both m_O1 and m_O2 we need two terms, considering them as the second or third argument to S_O, as they may appear on either side during inference.
bAbI Experiment 1
• 10k sentences. (Actor: only ask questions about actors.)
• Difficulty: how many sentences in the past when entity
mentioned.
• Fully supervised (supporting sentences are labeled).
• Compare RNN (no supervision)
and MemNN hops k = 1 or 2, & with/without time
features.

Difficulty 5 -- Max memory size required: 65. Average memory size required: 9.
bAbI Experiment 1
 Example test story + predictions:
Antoine went to the kitchen. Antoine got the milk.
Antoine travelled to the office. Antoine dropped the
milk. Sumit picked up the football. Antoine went to
the bathroom. Sumit moved to the kitchen.
 where is the milk now? A: office
 where is the football? A: kitchen
 where is Antoine ? A: bathroom
 where is Sumit ? A: kitchen
 where was Antoine before the bathroom? A: office
Unsegmented setup; R module is an
RNN
Joe went to the garden then Fred picked up the milk; Joe moved
to the bathroom and Fred dropped the milk, and then Dan
moved to the living room.
 Where is Dan? A: living room I believe
 Where is Joe? A: the bathroom
Fred moved to the bedroom and Joe went to the kitchen then
Joe took the milk there and Dan journeyed to the bedroom; Joe
discarded the milk.
 Where is the milk now? A: the milk is in the kitchen
 Where is Dan now? A: I think he is in the bedroom
Joe took the milk there, after that Mike travelled to the office,
then Joe went to the living room, next Dan went back to the
kitchen and Joe travelled to the office.
 Where is Joe now? A: I think Joe is in the office
Larger QA: Reverb Dataset in (Fader et al.,
13)
 14M statements, stored as (subject, relation,
object) triples. Triples are REVERB extractions
mined from ClueWeb09.

 Statements cover diverse topics:


 (milne, authored, winnie-the-pooh)
 (sheep, be-afraid-of, wolf), etc...
 Weakly labeled QA pairs and 35M paraphrased
questions from WikiAnswers:
 ``Who wrote the Winnie the Pooh books?''
 ``Who is Pooh's creator?''
Results: QA on Reverb data
from (Fader et al.)
• 14M statements stored in the memNN memory.
• k=1 loops MemNN, 128-dim embedding.
• R response simply outputs top scoring
statement.
• Time features are not necessary, hence not
used.
• We also tried adding bag of words (BoW)
features.
Fast QA on Reverb data
Scoring all 14M candidates in the memory is slow.

We consider speedups using hashing in S and O as


mentioned earlier:
 Hashing via words (essentially: inverted index)
 Hashing via k-means in embedding space
(k=1000)
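A sketch of the k-means hashing speedup: cluster the memory embeddings into buckets offline, then at query time score only the memories in the query's nearest bucket. The centroids would come from running k-means (k = 1000 on the slide) over the memory embeddings; all function names here are illustrative.

```python
import numpy as np

def build_buckets(memory_vecs, centroids):
    # offline: assign each memory to its nearest k-means centroid
    buckets = {i: [] for i in range(len(centroids))}
    for idx, m in enumerate(memory_vecs):
        c = int(np.argmin(np.linalg.norm(centroids - m, axis=1)))
        buckets[c].append(idx)
    return buckets

def retrieve(query_vec, memory_vecs, centroids, buckets, score):
    # online: only score memories that fall in the query's nearest bucket
    c = int(np.argmin(np.linalg.norm(centroids - query_vec, axis=1)))
    candidates = buckets[c]
    return max(candidates, key=lambda i: score(query_vec, memory_vecs[i]))
```

Hashing via words is the same idea with an inverted index: only memories sharing a word with the question get scored.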
A MemNN multitasked on bAbI
data and Reverb QA data
The “story” told to the model after training:

Antoine went to the kitchen. Antoine picked up the milk.


Antoine travelled to the office.

MemNN’s answers to some questions:


 Where is the milk? A: office

 Where was Antoine before the office? A: kitchen

 Where does milk come from? A: milk come from cow

 What is a cow a type of? A: cow be female of cattle

 Where are cattle found? A: cattle farm become widespread in brazil

 What does milk taste like? A: milk taste like milk

 What does milk go well with? A: milk go with coffee


Related Memory Models
(published before or ~same time as original
paper)
 RNNSearch (Bahdanau et al.) for Machine Translation
 Can be seen as a Memory Network where memory goes back
only one sentence (writes embedding for each word).
 At prediction time, reads memory and performs a softmax to find the best alignment (most useful words). 1 hop only.

 Generating Sequences With RNNs (Graves, ‘13)


 Also does alignment with previous sentence to generate
handwriting (so RNN knows what letter it’s currently on).

 Neural Turing Machines (Graves et al., 14)


[on arxiv just 5 days after MemNNs!]
 Has read and write operations over memory to perform tasks
(e.g. copy, sort, associative recall).
 128 memory slots in experiments; content addressing computes a score for each slot → slow for large memory?

 Earlier work by (Das ‘92), (Schmidhuber et al., 93),


Reasoning, Attention, Memory (RAM)
(e.g. addition, multiplication,
sorting)
Methods include adding stacks and addressable memory to RNNs
 “Neural Net Architectures for Temporal Sequence Processing.” M. Mozer.

 “Neural Turing Machines” A. Graves, G. Wayne, I. Danihelka.


 “Inferring Algorithmic Patterns with Stack Augmented Recurrent Nets.” A. Joulin, T. Mikolov.

 “Learning to Transduce with Unbounded Memory” E. Grefenstette et al.

 “Neural Programmer-Interpreters” S. Reed, N. de Freitas.


 “Reinforcement Learning Neural Turing Machines.” W. Zaremba and I. Sutskever.

 “Learning Simple Algorithms from Examples” W. Zaremba, T. Mikolov, A. Joulin, R. Fergus.
RAM
Classic Language Modeling:
 “Long short-term memory” Sepp Hochreiter, Jürgen Schmidhuber.
Machine translation:
 “Sequence to Sequence Learning with Neural Networks” I.
Sutskever, O. Vinyals, Q. Le.
 “Neural Machine Translation by Jointly Learning to Align and
Translate” D. Bahdanau, K. Cho, Y. Bengio.

Parsing:
 “Grammar as a Foreign Language” O. Vinyals, L. Kaiser, T. Koo, S.
Petrov, I. Sutskever, G. Hinton.

Entailment:
 “Reasoning about Entailment with Neural Attention” T. Rocktäschel,
E. Grefenstette, K. Hermann, T. Kočiský, P. Blunsom.

Summarization:
 “A Neural Attention Model for Abstractive Sentence
Reasoning with synthetic
language
 “A Roadmap towards Machine Intelligence” T. Mikolov, A. Joulin, M.
Baroni.

 “Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks” J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, T. Mikolov.

Several new models that attempt to solve bAbI tasks:

 “Dynamic Memory Networks for Natural Language Processing” A.


Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, R. Socher

 “Towards Neural Network-based Reasoning” B. Peng, Z. Lu, H. Li, K.


Wong.

 “End-To-End Memory Networks” S. Sukhbaatar, A. Szlam, J. Weston, R.


Fergus.
RAM
Understanding news articles:
 “Teaching Machines to Read and Comprehend” K. Hermann, T. Kočiský, E.
Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom.

Understanding children’s books:


 “The Goldilocks Principle: Reading Children's Books with Explicit Memory
Representations” F. Hill, A. Bordes, S. Chopra, J. Weston.

Conducting Dialog:
 “Hierarchical Neural Network Generative Models for Movie Dialogues” I.
Serban, A. Sordoni, Y. Bengio, A. Courville, J. Pineau.
 “A Neural Network Approach to Context-Sensitive Generation of
Conversational Responses” Sordoni et al.
 “Neural Responding Machine for Short-Text Conversation” L. Shang, Z. Lu,
H.Li.
 “Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems”
J. Dodge, A. Gane, X. Zhang, A. Bordes, S. Chopra, A. Miller, A. Szlam, J. Weston.

General Question Answering:


 “Large-scale Simple Question Answering with Memory Networks” A. Bordes, N. Usunier, S. Chopra, J. Weston.
What was next for
MemNNs?
 Make the language much harder: coreference,
conjunctions, negations, etc. etc – will it work?
 MemNNs that reason with more than 2
supporting memories.
 End-to-end? (doesn’t need supporting facts)
 More useful applications on real datasets.
 Dialog: Ask questions? Say statements?
 Do MemNN ideas extend to other ML tasks and model variants, e.g. visual QA, performing actions…? [A: yes!]
bAbI tasks: what reasoning
tasks would we like models to
work on?
 We define 20 tasks (generated by the simulation)
that we can test new models on. (See: http://fb.ai/babi)

 The idea is they are a bit like software tests:


each task checks if an ML system has a certain
skill.
 We would like each “skill” we check to be a natural task for humans w.r.t. text understanding & reasoning; humans should be able to get 100%.
J. Weston, A. Bordes, S. Chopra, T. Mikolov. Towards
AI-Complete Question Answering: A Set of
Prerequisite Toy Tasks. arXiv:1502.05698.
Task (1) Factoid QA with
Single Supporting Fact
(“where is actor”)
Our first task consists of questions where a single supporting fact,
previously given, provides the answer.

We test the simplest case of this, by asking for the location of a person.

A small sample of the task is thus:

John is in the playground. SUPPORTING FACT


Bob is in the office.
Where is John? A:playground

 We could use the supporting facts for supervision at training time (we call this “strong supervision”), but they are not known at test time. However, weak supervision (not requiring them) is much better!
(2) Factoid QA with Two
Supporting Facts (“where is
actor+object”)
A harder task is to answer questions where two supporting
statements have to be chained to answer the question:

John is in the playground. SUPPORTING FACT


Bob is in the office.
John picked up the football. SUPPORTING FACT
Bob went to the kitchen.
Where is the football? A:playground

To answer the question Where is the football? both John picked up


the football and John is in the playground are supporting facts.

(3) Factoid QA with Three
Supporting Facts
Similarly, one can make a task with three supporting facts:

John picked up the apple.


John went to the office.
John went to the kitchen.
John dropped the apple.
Where was the apple before the kitchen? A:office

The first three statements are all required to answer this.


(4) Two Argument Relations:
Subject vs. Object
To answer questions, the ability to differentiate and recognize subjects and objects is crucial.

We consider the extreme case: sentences feature re-ordered words:

The office is north of the bedroom.


The bedroom is north of the bathroom.
What is north of the bedroom? A:office
What is the bedroom north of? A:bathroom

Note that the two questions above have exactly the same words,
but in a different order, and different answers.

So a bag-of-words will not work.


(6) Yes/No Questions
 This task tests, in the simplest case possible (with a single
supporting fact) the ability of a model to answer true/false type
questions:
John is in the playground.
Daniel picks up the milk.
Is John in the classroom? A:no
Does Daniel have the milk? A:yes
(7) Counting
Tests ability to count sets:

Daniel picked up the football.


Daniel dropped the football.
Daniel got the milk.
Daniel took the apple.
How many objects is Daniel holding? A:two

(8) Lists/Sets
 Tests ability to produce lists/sets:

Daniel picks up the football.


Daniel drops the newspaper.
Daniel picks up the milk.
What is Daniel holding? A:milk,football
(11) Basic Coreference (nearest
referent)

Daniel was in the kitchen.


Then he went to the studio.
Sandra was in the office.
Where is Daniel? A:studio

(13) Compound Coreference


Daniel and Sandra journeyed to the office.
Then they went to the garden.
Sandra and John travelled to the kitchen.
After that they moved to the hallway.
Where is Daniel? A:garden
(14) Time manipulation
 While our tasks so far have included time implicitly in the order of
the statements, this task tests understanding the use of time
expressions within the statements:

In the afternoon Julie went to the park.


Yesterday Julie was at school.
Julie went to the cinema this evening.
Where did Julie go after the park? A:cinema

Much harder difficulty: adapt a real time expression labeling


dataset into a question answer format, e.g. Uzzaman et al., ‘12.
(15) Basic Deduction
 This task tests basic deduction via inheritance of properties:

Sheep are afraid of wolves.


Cats are afraid of dogs.
Mice are afraid of cats.
Gertrude is a sheep.
What is Gertrude afraid of? A:wolves

Deduction should prove difficult for MemNNs because it


effectively involves search, although our setup might be simple
enough for it.
(17) Positional Reasoning
 This task tests spatial reasoning, one of many components of the
classical SHRDLU system:

The triangle is to the right of the blue square.


The red square is on top of the blue square.
The red sphere is to the right of the blue square.
Is the red sphere to the right of the blue square? A:yes
Is the red square to the left of the triangle? A:yes
(18) Reasoning about size
 This task requires reasoning about the relative size of objects and is
inspired by the commonsense reasoning examples in the
Winograd schema challenge:

The football fits in the suitcase.


The suitcase fits in the cupboard.
The box of chocolates is smaller than the football.
Will the box of chocolates fit in the suitcase? A:yes

Tasks 3 (three supporting facts) and 6 (Yes/No) are prerequisites.


(19) Path Finding
 In this task the goal is to find the path between locations:

The kitchen is north of the hallway.


The den is east of the hallway.
How do you go from den to kitchen? A:west,north

This is going to prove difficult for MemNNs because it effectively


involves search.
What models could we
try?
 Classic NLP cascade, e.g. SVM-struct with a bunch of features for subtasks (not end-to-end).
 N-gram models with SVM-type classifier?
 (LSTM) Recurrent Neural Nets?
 Memory Network variants … ?
 <Insert your new model here>
End-to-end Memory
Network (MemN2N)
 New end-to-end (MemN2N) model (Sukhbaatar ‘15):
 Reads from memory with soft attention
 Performs multiple lookups (hops) on memory
 End-to-end training with backpropagation
 Only need supervision on the final output

 It is based on “Memory Networks” by


[Weston, Chopra & Bordes ICLR 2015] but that had:
 Hard attention
 requires explicit supervision of attention during training
 Only feasible for simple tasks
MemN2N architecture
[Architecture figure: a Controller module reads from a Memory Module by addressing and then reading the (unordered) memory vectors; the result is combined with the controller's internal state; supervision is applied only at the output.]
Memory Module
[Figure: the addressing signal (the controller state vector) is dot-producted with each memory vector; a softmax gives the attention weights (a soft address); the weighted sum of the memory vectors is sent back to the controller and added to the controller state.]
Question & Answering
[Figure: example QA flow. Input story: “1: Sam moved to garden. 2: Sam went to kitchen. 3: Sam drops apple there.” Question: “Where is Sam?” The controller takes a dot product + softmax over the memories, the Memory Module returns the weighted sum, and the answer “kitchen” is produced.]
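A numpy sketch of a single MemN2N memory hop as drawn in the figures above: dot product of the controller state with each memory vector, softmax to get the soft address, and a weighted sum added back into the controller state. The separate input/output memory matrices reflect my understanding of the MemN2N paper; exact details are omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memory_hop(u, memory_in, memory_out):
    """One MemN2N hop.
    u          : controller state vector, shape (d,)
    memory_in  : input (addressing) memory vectors, shape (N, d)
    memory_out : output memory vectors, shape (N, d)
    """
    p = softmax(memory_in @ u)   # attention weights / soft address
    o = memory_out.T @ p         # weighted sum of output memory vectors
    return u + o, p              # new controller state, attention for inspection
```

Stacking this hop k times gives the 1/2/3-hop variants in the result tables that follow.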
Memory Vectors
E.g. constructing memory vectors with Bag-of-Words (BoW):
1. Embed each word.
2. Sum the embedding vectors.
[Figure: embedding vectors summed into a memory vector.]

E.g. temporal structure: use special words for time and include them in the BoW.
[Figure: a time embedding added into the bag.]
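A sketch of a BoW memory vector with the temporal trick above, where the time step is treated as one extra embedded feature in the bag; word_emb and time_emb are assumed lookup tables.

```python
import numpy as np

def memory_vector(sentence_words, time_index, word_emb, time_emb):
    # 1. embed each word, 2. sum the embedding vectors
    m = np.sum([word_emb[w] for w in sentence_words], axis=0)
    # temporal structure: treat the time step as a special word in the bag
    return m + time_emb[time_index]
```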
Positional Encoding of
Words
Representation of inputs and memories
could use all kinds of encodings: bag of
words, RNN style reading at word or
character level, etc.

We also built a positional encoding variant: words are represented by vectors as before, but instead of a plain bag, position is modeled by a multiplicative term on each word vector, with weights depending on the position in the sentence.
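A sketch of a positional-encoding weighting of this kind; the specific formula below is the one I recall from the End-To-End Memory Networks paper (word j of J weighted per embedding dimension k of d), so treat it as an assumption rather than a quote from these slides.

```python
import numpy as np

def position_weights(J, d):
    # l[j, k] = (1 - (j+1)/J) - ((k+1)/d) * (1 - 2*(j+1)/J), 1-indexed in the paper
    j = (np.arange(J) + 1)[:, None] / J
    k = (np.arange(d) + 1)[None, :] / d
    return (1 - j) - k * (1 - 2 * j)

def positional_memory_vector(word_vectors):
    # word_vectors: shape (J, d); weight each word's embedding by its position, then sum
    J, d = word_vectors.shape
    return np.sum(position_weights(J, d) * word_vectors, axis=0)
```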
Training on 1k stories (test results per task). Weakly supervised: N-grams, LSTMs, MemN2N. Supervised with supporting facts: Memory Networks, StructSVM+coref+srl.

TASK                          N-grams   LSTMs   MemN2N   Memory Networks   StructSVM+coref+srl
T1. Single supporting fact 36 50 PASS PASS PASS
T2. Two supporting facts 2 20 87 PASS 74
T3. Three supporting facts 7 20 60 PASS 17
T4. Two argument relations 50 61 PASS PASS PASS
T5. Three argument relations 20 70 87 PASS 83
T6. Yes/no questions 49 48 92 PASS PASS
T7. Counting 52 49 83 85 69
T8. Sets 40 45 90 91 70
T9. Simple negation 62 64 87 PASS PASS
T10. Indefinite knowledge 45 44 85 PASS PASS
T11. Basic coreference 29 72 PASS PASS PASS
T12. Conjunction 9 74 PASS PASS PASS
T13. Compound coreference 26 PASS PASS PASS PASS
T14. Time reasoning 19 27 PASS PASS PASS
T15. Basic deduction 20 21 PASS PASS PASS
T16. Basic induction 43 23 PASS PASS 24
T17. Positional reasoning 46 51 49 65 61
T18. Size reasoning 52 52 89 PASS 62
T19. Path finding 0 8 7 36 49
T20. Agent’s motivation 76 91 PASS PASS PASS
Attention during memory
lookups
Samples from toy QA tasks

20 bAbI Tasks (test accuracy / number of failed tasks):
MemNN: 93.3% / 4 failed
LSTM: 49% / 20 failed
MemN2N, 1 hop: 74.82% / 17 failed
MemN2N, 2 hops: 84.4% / 11 failed
MemN2N, 3 hops: 87.6% / 11 failed
So we still fail on some
tasks….
.. and we could also make more tasks that we
fail on!

Our hope is that a feedback loop of:

1. Developing tasks that break models, and


2. Developing models that can solve tasks
… leads in a fruitful research direction….
How about on real data?
 Toy AI tasks are important for developing innovative
methods.
 But they do not give all the answers.

 How do these models work on real data?


 Classic Language Modeling (Penn TreeBank, Text8)
 Story understanding (Children’s Book Test, News articles)
 Open Question Answering (WebQuestions, WikiQA)
 Goal-Oriented Dialog and Chit-Chat (Movie Dialog,
Ubuntu)
Language Modeling
The goal is to predict the next word in a text sequence given the
previous words. Results on the Penn Treebank and Text8
(Wikipedia-based) corpora.
Test perplexity:
Model              Penn Treebank   Text8
RNN                129             184
LSTM               115             154
MemN2N (2 hops)    121             187
MemN2N (5 hops)    118             154
MemN2N (7 hops)    111             147

[Figure: hops vs. attention, averaged over PTB and over Text8.]
Language Modeling
The goal is to predict the next word in a text sequence given the
previous words. Results on the Penn Treebank and Text8
(Wikipedia-based) corpora.
Test perplexity:
Model              Penn Treebank   Text8
RNN                129             184
LSTM               115             154
MemN2N (2 hops)    121             187
MemN2N (5 hops)    118             154
MemN2N (7 hops)    111             147

MemNNs are in the same ballpark as LSTMs.
Hypothesis: many words (e.g. syntax words) don't actually need really long-term context, and so MemNNs don't help there.
Maybe MemNNs could eventually help more on things like nouns/entities?
Self-Supervision Memory
Network
Two tricks together that make things work a bit better:

1) Bypass module

Instead of the last output module being a linear layer


from the output of the memory, assume the answer is one
of the memories. Sum the scores of identical memories.

2) Self-Supervision

We know what the right answer is on the training data, so we directly train the memories containing the answer word to be supporting facts (i.e. to have high attention probability).
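A sketch of the two tricks: the bypass module sums the attention scores of identical candidate words taken directly from memory, and self-supervision marks memories containing the known training answer as the attention targets. All names are illustrative.

```python
from collections import defaultdict

def bypass_answer(candidate_words, attention_scores):
    # bypass module: the answer is assumed to be one of the memories' candidate
    # words; sum the attention scores of identical candidates and pick the best
    totals = defaultdict(float)
    for word, score in zip(candidate_words, attention_scores):
        totals[word] += score
    return max(totals, key=totals.get)

def self_supervision_targets(memory_windows, gold_answer):
    # self-supervision: memories containing the known training answer are treated
    # as the supporting facts (targets that should receive high attention)
    return [1.0 if gold_answer in window else 0.0 for window in memory_windows]
```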
Results on Children’s Book
Test
Question Answering on News Articles
We evaluate our models on the data from:
“Teaching Machines to Read and Comprehend”
Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse
Espeholt, Will Kay, Mustafa Suleyman, Phil Blunsom
Results on CNN QA dataset
Latest Fresh Results
 Our best results: QACNN: 69.4 CBT-NE: 66.6 CBT-V: 63.0
 Text Understanding with the Attention Sum Reader Network.
Kadlec et al. (4 Mar ‘16) QACNN: 75.4 CBT-NE: 71.0 CBT-CN:
68.9 Uses RNN style encoding of words + bypass
module + 1 hop
 Iterative Alternating Neural Attention for Machine Reading. Sordoni
et al. (7 Jun ’16) QACNN: 76.1 CBT-NE: 72.0 CBT-CN: 71.0
 Natural Language Comprehension with the EpiReader. Trischler et
al. (7 Jun ’16) QACNN: 74.0 CBT-NE: 71.8 CBT-CN: 70.6
 Gated-Attention Readers for Text Comprehension. Dhingra et al. (5
Jun ’16) QACNN: 77.4 CBT-NE: 71.9 CBT-CN: 69.
Uses RNN style encoding of words + bypass module +
multiplicative combination of query + multiple hops
WebQuestions &
SimpleQuestions
 Decent results on WebQuestions, a popular QA
task:

A. Bordes, N. Usunier, S. Chopra, J. Weston. Large-scale Simple Question Answering with Memory Networks. arXiv:1506.02075.

• However now beaten by many results, especially (Yih et al. ACL ‘15)
that achieves 52.5! Several hand engineered features are used in
that case. Note WebQuestions is very small (4k train+valid).
Recent Work: New Models for QA on
documents Miller et al. Key-Value Memory
Networks for Directly Reading Documents.
arXiv:1606.03126.
Recent Work: New Models for QA on
documents Miller et al. Key-Value Memory
Networks for Directly Reading Documents.
arXiv:1606.03126.

WikiQA Results
What about dialog data? With multiple exchanges?
 Everything we showed so far was question answering
potentially with long-term context.
 We have also built a Movie Dialog Dataset: a closed, but large, domain about movies (75k entities, 3.5M examples).
 Ask facts about movies?
 Ask for opinions (recommendations) about movies?
 Dialog combining facts and opinions?
 General chit-chat about movies (statements not
questions)?

And… combination of all above in one end-to-end


model.
Recent Work: Combines QA with Dialog Tasks
Dodge et al. “Evaluating Prerequisite Qualities
for Learning End-to-End Dialog Systems.” ICLR ‘16
(Dialog 1) QA: facts about
movies
Sample input contexts and target replies (in red) from Dialog Task 1:

What movies are about open source? Revolution OS


Ruggero Raimondi appears in which movies? Carmen
What movies did Darren McGavin star in? Billy Madison, The
Night Stalker, Mrs. Pollifax-Spy, The Challenge
Can you name a film directed by Stuart Ortiz? Grave
Encounters
Who directed the film White Elephant? Pablo Trapero
What is the genre of the film Dial M for Murder? Thriller, Crime
What language is Whity in? German
(Dialog 2) Recs: movie
recommendations
Sample input contexts and target replies (in red) from Dialog Task 2:

Schindler's List, The Fugitive, Apocalypse Now, Pulp Fiction, and


The Godfather are films I really liked. Can you suggest a film?
The Hunt for Red October

Some movies I like are Heat, Kids, Fight Club, Shaun of the
Dead, The Avengers, Skyfall, and Jurassic Park. Can you
suggest something else I might like? Ocean's Eleven
(Dialog 3) QA+Recs: combination
dialog
Sample input contexts and target replies (in red) from Dialog Task 3:

I loved Billy Madison, Blades of Glory, Bio-Dome, Clue, and


Happy Gilmore. I'm looking for a Music movie. School of Rock
What else is that about? Music, Musical, Jack Black, school,
teacher, Richard Linklater, rock, guitar
I like rock and roll movies more. Do you know anything else?
Little Richard
(Dialog 4) Reddit: real dialog example
Sample input contexts and target replies (in red) from Dialog Task 4:

I think the Terminator movies really suck, I mean the first one
was kinda ok, but after that they got really cheesy. Even the
second one which people somehow think is great. And after
that... forgeddabotit.
C’mon the second one was still pretty cool.. Arny was still so
badass, as was Sararah Connor’s character.. and the way they
blended real action and effects was perhaps the last of its
kind...
Results
Ubuntu Data
Dialog dataset: Ubuntu IRC channel logs, users ask
questions about issues they are having with
Ubuntu and get answers by other users. (Lowe et
al., ‘15)

Best results currently reported:


Sentence Pair Scoring: Towards Unified Framework for Text
Comprehension
Petr Baudiš, Jan Pichl, Tomáš Vyskočil, Jan Šedivý
Next Steps
Artificial tasks to help design new methods:
 New methods that succeed on all bAbI tasks?
 Make more bAbI tasks to check other skills.
Real tasks to make sure those methods are actually useful:
 Sophisticated reasoning on bAbI tasks doesn't always happen as clearly on real data… Why? Fix!
 Models that work jointly on all of the tasks built so far.
Dream: can learn from very weak supervision:

We would like to learn in an environment just by communicating with


other agents / humans, as well as seeing other agents communicating
+ acting in the environment.

E.g. a baby talking to its parents, and seeing them talk to each other.
Learning From Human
Responses
Mary went to the hallway.

John moved to the bathroom.

Mary travelled to the kitchen.

Where is Mary? A:playground
No, that's incorrect.
Where is John? A:bathroom
Yes, that's right!

[If you can predict the teacher's response, you are most of the way to knowing how to answer correctly.]


Human Responses Give
Lots of Info
Mary went to the hallway.

John moved to the bathroom.

Mary travelled to the kitchen.

Where is Mary? A:playground
No, the answer is kitchen.
Where is John? A:bathroom
Yes, that's right!

[The textual response gives much more signal than just “No” or zero reward.]


Forward Prediction
Mary went to the hallway.

John moved to the bathroom.

Mary travelled to the kitchen.

Where is Mary? A:playground
No, she's in the kitchen.

[If you can predict the teacher's reply, you are most of the way to knowing how to answer correctly.]

See our new paper! “Dialog-Based Language Learning”


arXiv:1604.06045.
FAIR: paper / data / code
 Papers:
 bAbI tasks: arxiv.org/abs/1502.05698
 Memory Networks: http://arxiv.org/abs/1410.3916
 End-to-end Memory Networks: http://arxiv.org/abs/1503.08895
 Large-scale QA with MemNNs: http://arxiv.org/abs/1506.02075
 Reading Children's Books: http://arxiv.org/abs/1511.02301
 Evaluating End-To-End Dialog: http://arxiv.org/abs/1511.06931
 Dialog-based Language Learning: http://arxiv.org/abs/1604.06045

 Data:
 bAbI tasks: fb.ai/babi
 SimpleQuestions dataset (100k questions): fb.ai/babi
 Children’s Book Test dataset: fb.ai/babi
 Movie Dialog Dataset: fb.ai/babi

 Code:
 Memory Networks: https://github.com/facebook/MemNN
 Simulation tasks generator: https://github.com/facebook/bAbI-tasks
RAM Issues
 How to decide what to write and what not to write in the
memory?
 How to represent knowledge to be stored in memories?
 Types of memory (arrays, stacks, or stored within weights of
model), when they should be used, and how can they be learnt?
 How to do fast retrieval of relevant knowledge from memories
when the scale is huge?
 How to build hierarchical memories, e.g. multiscale attention?
 How to build hierarchical reasoning, e.g. composition of
functions?
 How to incorporate forgetting/compression of information?
 How to evaluate reasoning models? Are artificial tasks a good
way? Where do they break down and real tasks are needed?
 Can we draw inspiration from how animal or human memories work?
Thanks!
