These slides were used by Umemoto of our company at an in-house technical study session.
They explain the Transformer, an architecture that has attracted much attention in recent years.
"Arithmer Seminar" is weekly held, where professionals from within and outside our company give lectures on their respective expertise.
The slides are made by the lecturer from outside our company, and shared here with his/her permission.
Arithmer Inc. is a mathematics company that began at the University of Tokyo Graduate School of Mathematical Sciences. We apply modern mathematics and advanced AI systems to provide solutions to difficult, complex problems in a wide range of fields. At Arithmer, we believe our job is to put AI to good use: improving work efficiency and producing results that are genuinely useful to people and society.
12. And then, Attention appeared
[Cho+2014]
[Sutskever+2014]
[Bahdanau+2014]
Although most of the previous works (see, e.g., Cho et al., 2014a; Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013) used to encode a variable-length input sentence into a fixed-length vector, it is not necessary, and even it may be beneficial to have a variable-length vector, as we will show later.
13. Agenda
- Background 1: Neural Network
- Background 2: Recurrent Neural Network
- Background 3: Encoder-Decoder approach (a.k.a. sequence-to-sequence approach)
- Attention mechanism and its variants
- Global attention
- Local attention
- Pointer networks
- Attention for images (image caption generation)
- Attention techniques
- NN with Memory
15. Although most of the previous works (see, e.g., Cho et al., 2014a; Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013) used to encode a variable-length input sentence into a fixed-length vector, it is not necessary, and even it may be beneficial to have a variable-length vector, as we will show later.
[Bahdanau+2014]
Instead of packing the input sequence into a single fixed-length vector with a simple RNN encoder, keep every hidden state produced during encoding and use them at each decoding step (how they are used is explained shortly).
What is ultimately handed to the decoder is still a single vector, but that vector now changes dynamically with the context (e.g., the decoder's own outputs so far).
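To make the idea concrete, here is a minimal NumPy sketch of additive (Bahdanau-style) attention over the stored encoder hidden states; the parameter names Wa, Ua, va and the toy sizes are illustrative assumptions, not the lecturer's actual code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, Wa, Ua, va):
    # Score each stored encoder hidden state h_i against the current
    # decoder state s: e_i = va . tanh(Wa s + Ua h_i)
    scores = np.array([va @ np.tanh(Wa @ decoder_state + Ua @ h)
                       for h in encoder_states])
    weights = softmax(scores)           # attention distribution over input positions
    context = weights @ encoder_states  # weighted sum = dynamic context vector
    return context, weights

# Toy sizes: 5 encoder positions, hidden size 4, attention size 3 (all made up)
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))    # encoder hidden states, kept rather than discarded
s = rng.normal(size=4)         # current decoder state (changes every output step)
Wa = rng.normal(size=(3, 4))
Ua = rng.normal(size=(3, 4))
va = rng.normal(size=3)

context, weights = additive_attention(s, H, Wa, Ua, va)
print(weights.round(3))  # how much each input position contributes right now
print(context.round(3))  # the vector actually handed to the decoder at this step
```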
16. simple enc-dec vs. enc-dec + attention
Figures from [Luong+2015], shown for comparison.
In the plain enc-dec, the encoder compresses the input sequence into a single vector, which the decoder uses as its initial state when generating the output.
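As a rough sketch of the difference the figures illustrate (again with made-up NumPy tensors, not code from the deck): a plain RNN encoder hands the decoder only its final hidden state, whereas the attention variant above also keeps all intermediate states.

```python
import numpy as np

def rnn_encoder(inputs, Wx, Wh):
    # Plain RNN encoder: in the "simple enc-dec" setting, everything the
    # decoder will ever see is the final hidden state returned first.
    h = np.zeros(Wh.shape[0])
    all_states = []
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h)
        all_states.append(h)
    return h, np.stack(all_states)

rng = np.random.default_rng(1)
xs = rng.normal(size=(7, 3))                   # a length-7 input sequence
Wx, Wh = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))

final_state, all_states = rnn_encoder(xs, Wx, Wh)
# simple enc-dec: the decoder is initialized with `final_state` only.
# enc-dec + attention: the decoder also keeps `all_states` and re-weights
# them at every output step, as in the attention sketch above.
print(final_state.shape, all_states.shape)     # (4,) (7, 4)
```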
76. Reference
[Graves2013] Alex Graves, Generating Sequences With Recurrent Neural Networks.
[Cho+2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.
[Sutskever+2014] Ilya Sutskever, Oriol Vinyals, Quoc V. Le, Sequence to Sequence Learning with Neural Networks.
[Bahdanau+2014] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate.
[Luong+2015] Minh-Thang Luong, Hieu Pham, Christopher D. Manning, Effective Approaches to Attention-based Neural Machine Translation.
[Denil+2011] Misha Denil, Loris Bazzani, Hugo Larochelle, Nando de Freitas, Learning where to Attend with Deep Architectures for Image Tracking.
[Cho+2015] Kyunghyun Cho, Aaron Courville, Yoshua Bengio, Describing Multimedia Content using Attention-based Encoder-Decoder Networks.
[Rush+2015] Alexander M. Rush, Sumit Chopra, Jason Weston, A Neural Attention Model for Abstractive Sentence Summarization.
[Ling+2015] Wang Ling, Isabel Trancoso, Chris Dyer, Alan W Black, Character-based Neural Machine Translation.
[Vinyals+2014] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton, Grammar as a Foreign Language.
[Shang+2015] Lifeng Shang, Zhengdong Lu, Hang Li, Neural Responding Machine for Short-Text Conversation.
[Hermann+15] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, Phil Blunsom, Teaching Machines to Read and Comprehend.
[Vinyals+2015] Oriol Vinyals, Meire Fortunato, Navdeep Jaitly, Pointer Networks.
[Xu+2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
[Vinyals+2015b] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, Show and Tell: A Neural Image Caption Generator.
[Mansimov+2016] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov, Generating Images from Captions with Attention.
[Meng+2016] Fandong Meng, Zhengdong Lu, Zhaopeng Tu, Hang Li, Qun Liu, A Deep Memory-based Architecture for Sequence-to-Sequence Learning.
[Sukhbaatar+2015] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, End-To-End Memory Networks.
[Graves+2014] Alex Graves, Greg Wayne, Ivo Danihelka, Neural Turing Machines.
[Tran+2016] Ke Tran, Arianna Bisazza, Christof Monz, Recurrent Memory Network for Language Modeling.