Algorithmic Intelligence Laboratory
EE807:	Recent	Advances	in	Deep	Learning
Lecture	19
Slide	made	by	
Sangwoo	Mo
KAIST	EE
Advanced	Models	for	Language
1
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
2
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
3
Algorithmic	Intelligence	Laboratory
Why	Deep	Learning	for	Natural	Language	Processing	(NLP)?
• Deep	learning is	now	commonly	used in	natural	language	processing	(NLP)
*Source:	Young	et	al.	“Recent	Trends	in	Deep	Learning	Based	Natural	Language	Processing”,	arXiv	2017 4
Algorithmic	Intelligence	Laboratory
Recap:	RNN	&	CNN	for	Sequence	Modeling
• Language is	sequential: It	is	natural	to	use	RNN	architectures
• RNN (or	LSTM	variants)	is	a	natural	choice	for	sequence	modelling
• Language is	translation-invariant: It	is	natural	to	use	CNN	architectures
• One	can	use	CNN [Gehring	et	al.,	2017]	for	parallelization
*Source:	https://towardsdatascience.com/introduction-to-recurrent-neural-network-27202c3945f3
Gehring	et	al.	“Convolutional	Sequence	to	Sequence	Learning”,	ICML	2017 5
Algorithmic	Intelligence	Laboratory
Limitations	of	prior	works
• However,	prior	works have	several	limitations…
• Network	architecture
• Long-term	dependencies:	Network	forgets previous	information	as	it	summarizes	
inputs	into	a	single feature	vector
• Limitations of softmax: Computation increases linearly with the vocabulary size,
and expressivity is bounded by the feature dimension
• Training	methods
• Exposure	bias:	Model	only	sees	true tokens	at	training,	but	it	sees	generated
tokens	at	inference	(and	noise	accumulates	sequentially)
• Loss/evaluation mismatch: Model uses the MLE objective at training, but is evaluated with
other metrics (e.g., BLEU score [Papineni et al., 2002]) at inference
• Unsupervised setting: How can we train models when no paired data is available?
6
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
7
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Motivation:
• Previous	models	summarize inputs	into	a	single feature	vector
• Hence,	the	model	forgets old	inputs,	especially	for	long sequences
• Idea:
• Keep the input features, but attend to the most important ones
• Example) Translate “Ich möchte ein Bier” ⇔ “I’d like a beer”
• Here, when the model generates “beer”, it should attend to “Bier”
8*Source:	https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/10/06/attention/
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Method:
• Task: Translate a source sequence $[x_1, \dots, x_T]$ to a target sequence $[y_1, \dots, y_{T'}]$
• Now the decoder hidden state $s_t$ is a function of the previous state $s_{t-1}$, the current input
$\hat{y}_{t-1}$, and the context vector $c_t$, i.e., $s_t = f(s_{t-1}, \hat{y}_{t-1}, c_t)$
9*Source:	https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
[Figure: one decoder step, showing the context vector $c_t$, states $s_{t-1}, s_t$, and previous output $\hat{y}_{t-1}$]
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Method:
• Task: Translate a source sequence $[x_1, \dots, x_T]$ to a target sequence $[y_1, \dots, y_{T'}]$
• Now the decoder hidden state $s_t$ is a function of the previous state $s_{t-1}$, the current input
$\hat{y}_{t-1}$, and the context vector $c_t$, i.e., $s_t = f(s_{t-1}, \hat{y}_{t-1}, c_t)$
• The context vector $c_t$ is a linear combination of the input hidden features $[h_1, \dots, h_T]$: $c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$
• Here, the weight $\alpha_{t,i}$ is the alignment score of the two words $y_t$ and $x_i$,
where the score function is also jointly trained, e.g., $\alpha_{t,i} = \mathrm{softmax}_i\big(\mathrm{score}(s_{t-1}, h_i)\big)$ with an additive score $\mathrm{score}(s_{t-1}, h_i) = v^\top \tanh(W_s s_{t-1} + W_h h_i)$
10
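A minimal NumPy sketch of one additive-attention decoder step, following the formulas above. The parameter names (W_s, W_h, v) and the random toy inputs are illustrative, not the paper's exact parameterization.

```python
# Minimal sketch of additive (Bahdanau-style) attention for one decoder step.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_prev, H, W_s, W_h, v):
    """s_prev: previous decoder state (d,); H: encoder hidden states (T, d)."""
    # Alignment scores e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i), one per source position.
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v   # (T,)
    alpha = softmax(scores)                         # attention weights alpha_{t,i}
    c_t = alpha @ H                                 # context vector: weighted sum of h_i
    return c_t, alpha

# Toy usage: T=4 source positions, d=8 features.
T, d = 4, 8
rng = np.random.default_rng(0)
c, a = attention_step(rng.normal(size=d), rng.normal(size=(T, d)),
                      rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                      rng.normal(size=d))
print(a.sum())  # attention weights sum to 1
```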
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Results:	Attention	shows	good	correlation between	source	and	target
11
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Results:	Attention	improves machine	translation	performance
• RNNenc:	no	attention	/	RNNsearch:	with	attention	/	#:	max	length	of	train	data
12
No	UNK:	omit	unknown	words
*: trained longer, until convergence
Algorithmic	Intelligence	Laboratory
Show,	Attend,	and	Tell	[Xu	et	al.,	2015]
• Motivation: Can we apply attention to image captioning?
• Task: Translate a source image $x$ to a target sequence $[y_1, \dots, y_{T'}]$
• Now attend to specific locations on the image, rather than to words
• Idea: Apply attention to convolutional features $[h_1, \dots, h_L]$ (with $K$ channels)
• Apply deterministic soft attention (as in the previous slides) and stochastic hard attention
(pick one $h_i$ by sampling a multinomial distribution with parameter $\alpha$)
• Hard attention picks a more specific area and shows better results, but training is
less stable due to the stochasticity and non-differentiability
13
Up:	hard	attention	/	Down:	soft	attention
Algorithmic	Intelligence	Laboratory
Show,	Attend,	and	Tell	[Xu	et	al.,	2015]
• Results:	Attention	picks	visually	plausible	locations
14
Algorithmic	Intelligence	Laboratory
Show,	Attend,	and	Tell	[Xu	et	al.,	2015]
• Results:	Attention	improves the	image	captioning	performance
15
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Motivation:
• Prior	works	use	RNN/CNN	to	solve	sequence-to-sequence problems
• Attention already handles sequences of arbitrary length, is easy to parallelize, and
does not suffer from forgetting problems… so why use RNN/CNN modules at all?
• Idea:
• Design	architecture	only	using attention modules
• To extract features, the authors use self-attention, where features attend to themselves
• Self-attention	has	many	advantages	over	RNN/CNN	blocks
16
𝑛: sequence	length,	𝑑:	feature	dimension,	𝑘:	(conv)	kernel	size,	𝑟:	window	size	to	consider
Maximum	path	length: maximum	traversal	between	any	two	input/outputs	(lower	is	better)
*Cf.	Now	self-attention	is	widely	used	in	other	architectures,	e.g.,	CNN	[Wang	et	al.,	2018]	or	GAN	[Zhang	et	al.,	2018]
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Multi-head	attention:	The	building	block	of	the	Transformer
• In the previous slides, we introduced additive attention [Bahdanau et al., 2015]
• There, the context vector is a linear combination of
• weights $\alpha_{t,i}$, a function of the inputs $[x_i]$ and the output $y_t$
• and the input hidden states $[h_i]$
• In general, attention is a function of a key $K$, a value $V$, and a query $Q$
• The key $[x_i]$ and the query $y_t$ define the weights $\alpha_{t,i}$, which are applied to the value $[h_i]$
• For sequence length $T$ and feature dimension $d$, $(K, V, Q)$ are $T \times d$, $T \times d$, and $1 \times d$ matrices
• The Transformer uses scaled dot-product attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\, V$
• In addition, the Transformer uses multi-head attention,
an ensemble of attention heads (see the sketch below)
17
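A minimal NumPy sketch of scaled dot-product attention and a naive multi-head version (several heads whose outputs are concatenated). The projection matrices here are random placeholders, not trained weights, and the output projection of the real Transformer is omitted.

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, plus a naive multi-head wrapper.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdp_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (T_q, T_k) similarity of each query to each key
    return softmax(scores, axis=-1) @ V   # (T_q, d_v) weighted sum of values

def multi_head(Q, K, V, heads):
    # heads: list of (W_q, W_k, W_v) projection tuples; head outputs are concatenated.
    return np.concatenate([sdp_attention(Q @ Wq, K @ Wk, V @ Wv)
                           for Wq, Wk, Wv in heads], axis=-1)

T, d, h = 5, 16, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))               # self-attention: Q = K = V = X
heads = [tuple(rng.normal(size=(d, d // h)) for _ in range(3)) for _ in range(h)]
print(multi_head(X, X, X, heads).shape)   # (5, 16)
```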
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Transformer:
• The	final	transformer model	is	built	upon	the	(multi-head)	attention	blocks
• First,	extract	features	with	self-attention	(see	lower	part	of	the	block)
• Then decode features with the usual attention (see the middle part of the block)
• Since the model does not have a sequential structure,
the authors add a position embedding (a handcrafted
feature that represents the position in the sequence)
18
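A small NumPy sketch of the sinusoidal position embedding used by the Transformer, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the array is simply added to the token embeddings.

```python
# Sinusoidal positional encoding table of shape (max_len, d_model).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2) even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dims: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dims: cosine
    return pe                                         # added to token embeddings before the first block

print(positional_encoding(max_len=50, d_model=8).shape)  # (50, 8)
```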
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Results: The Transformer architecture shows good performance on language tasks
19
Algorithmic	Intelligence	Laboratory
BERT [Devlin et al., 2018]
• Motivation:
• Much of CNNs' success comes from ImageNet-pretrained networks
• Can we train a universal encoder for natural language?
• Method:
• BERT (bidirectional encoder representations from Transformers): Design a neural
network based on a bidirectional Transformer, and use it as a pretrained model
• Pretrain with two tasks (masked language model, next sentence prediction)
• Use the pretrained BERT encoder, and fine-tune a simple 1-layer output head for each task
20
Sentence	classification Question	answering
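A simplified sketch of the masked-LM input corruption used for BERT pretraining: roughly 15% of positions are selected; of those, 80% become [MASK], 10% are replaced with a random token, and 10% are left unchanged. The token IDs and vocabulary size below are made-up illustrative values.

```python
# Masked-LM corruption: targets are the original tokens at the selected positions (-1 = not predicted).
import random

MASK_ID, VOCAB_SIZE = 103, 30000  # illustrative values, not BERT's real vocabulary

def mask_tokens(token_ids, mask_prob=0.15, rng=random.Random(0)):
    inputs, targets = list(token_ids), [-1] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                 # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
            # else: 10%: keep the original token
    return inputs, targets

print(mask_tokens([7, 42, 1988, 5, 311, 64]))
```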
Algorithmic	Intelligence	Laboratory
BERT [Devlin et al., 2018]
• Results:
• Even	without task-specific	complex	architectures,	BERT	achieves	SOTA	for	11	NLP	
tasks,	including	classification,	question	answering,	tagging,	etc.
21
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
22
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Motivation:	
• Computation	of	softmax is	expensive,	especially	for	large	vocabularies
• Hierarchical	softmax	[Mnih	&	Hinton,	2009]:
• Cluster the $k$ words into $\sqrt{k}$ balanced groups, which reduces the complexity to $O(\sqrt{k})$
• For hidden state $h$, word $w$, and cluster $C(w)$: $P(w \mid h) = P(C(w) \mid h)\, P(w \mid C(w), h)$
• One can repeat clustering for subtrees (i.e., build a balanced $n$-ary tree), which
reduces the complexity to $O(\log k)$
23*Source:	http://opendatastructures.org/versions/edition-0.1d/ods-java/node40.html
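A minimal NumPy sketch of the two-level factorization $P(w \mid h) = P(C(w) \mid h)\, P(w \mid C(w), h)$: only the cluster softmax and the within-cluster softmax are computed, roughly $\sqrt{k}$ logits each. The weight matrices and the word-to-cluster assignment are random placeholders.

```python
# Two-level hierarchical softmax probability of a single word given a hidden state h.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def two_level_prob(h, word, cluster_of, W_cluster, W_words):
    c = cluster_of[word]
    p_cluster = softmax(h @ W_cluster)[c]                       # P(C(w) | h)
    in_cluster = [w for w in range(len(cluster_of)) if cluster_of[w] == c]
    p_word = softmax(h @ W_words[:, in_cluster])[in_cluster.index(word)]  # P(w | C(w), h)
    return p_cluster * p_word

k, d = 9, 4                                     # 9 words split into 3 balanced clusters of 3
rng = np.random.default_rng(0)
cluster_of = np.repeat(np.arange(3), 3)
p = two_level_prob(rng.normal(size=d), word=4, cluster_of=cluster_of,
                   W_cluster=rng.normal(size=(d, 3)), W_words=rng.normal(size=(d, k)))
print(p)
```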
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• Cluster the $k$ words into $\sqrt{k}$ balanced groups, which reduces the complexity to $O(\sqrt{k})$
• One can repeat clustering for subtrees, which reduces the complexity to $O(\log k)$
• However, putting all words at the leaves drops the performance (by around 5-10%)
• Instead, one can put frequent words in front (similar to Huffman coding)
• Put the top $k_h$ words ($p_h$ of the frequency mass) and tokens “NEXT-$i$” in the first layer, and
put $k_i$ words ($p_i$ of frequencies) in the next layers
24
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• Put the top $k_h$ words ($p_h$ of the frequency mass) and tokens “NEXT-$i$” in the first layer, and
put $k_i$ words ($p_i$ of frequencies) in the next layers
• Let $g(k, B)$ be the computation time for $k$ words and batch size $B$
• Then the computation time of adaptive softmax (with $J$ clusters) is $C = g(k_h + J, B) + \sum_{i=1}^{J} g(k_i, p_i B)$
• For $k, B$ larger than some threshold, one can simply assume $g(k, B) = kB$ (see the paper for details)
• By solving the optimization problem (for $k_i$ and $J$), the model is 3-5x faster than
the original softmax (in practice, $J = 5$ works well)
25
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Results:	Adaptive	softmax	shows	comparable	results to	the	original	softmax	
(while	much	faster)
26
ppl:	perplexity	(lower	is	better)
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Motivation:
• Rank of the softmax layer is bounded by the feature dimension $d$
• Recall: By the definition of softmax, $P(x \mid c) = \dfrac{\exp(h_c^\top w_x)}{\sum_{x'} \exp(h_c^\top w_{x'})}$,
we have $\log P(x \mid c) = h_c^\top w_x - \log \sum_{x'} \exp(h_c^\top w_{x'})$ (where $h_c^\top w_x$ is called the logit)
• Let $N$ be the number of possible contexts and $M$ the vocabulary size; then the logit matrix $A = H W^\top$ is $N \times M$ with $\mathrm{rank}(A) \le d$,
which implies that softmax can represent at most rank $d$ (while the rank of the true log-probability matrix can be larger)
27*Source:	https://www.facebook.com/iclr.cc/videos/2127071060655282/
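A tiny NumPy check of this bottleneck: the logit matrix $A = H W^\top$ over $N$ contexts and $M$ words has rank at most $d$, however large $N$ and $M$ are. The sizes below are arbitrary.

```python
# Rank of the softmax logit matrix is capped by the feature dimension d.
import numpy as np

N, M, d = 200, 500, 32
rng = np.random.default_rng(0)
H = rng.normal(size=(N, d))        # context features h_c
W = rng.normal(size=(M, d))        # word embeddings w_x
A = H @ W.T                        # logits for every (context, word) pair
print(np.linalg.matrix_rank(A))    # 32 = d, far below min(N, M)
```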
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Motivation:
• Rank of the softmax layer is bounded by the feature dimension $d$
• Naïvely increasing the dimension $d$ to the vocab size $M$ is inefficient
• Idea:
• Use a mixture of softmaxes (MoS): $P(x \mid c) = \sum_{k=1}^{K} \pi_{c,k}\, \dfrac{\exp(h_{c,k}^\top w_x)}{\sum_{x'} \exp(h_{c,k}^\top w_{x'})}$
• It is easily implemented by defining $\pi_{c,k}$ and $h_{c,k}$ as functions of the original $h$
• Note that now $\log P(x \mid c)$
is a nonlinear (log-sum-exp) function of $h$ and $w$, hence can represent a high-rank matrix
28
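A minimal NumPy sketch of the MoS output layer: the final distribution is a weighted sum of $K$ softmaxes, each with its own projected context. The projections (W_pi, W_hk) stand in for the small networks that compute $\pi_{c,k}$ and $h_{c,k}$ from $h$; all weights here are random placeholders.

```python
# Mixture of softmaxes: P(x|c) = sum_k pi_k * softmax(h_k^T W_vocab).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mos(h, W_pi, W_hk, W_vocab):
    pi = softmax(h @ W_pi)                          # (K,) mixture weights pi_k
    Hk = np.tanh(h @ W_hk).reshape(len(pi), -1)     # (K, d) per-component contexts h_k
    probs = softmax(Hk @ W_vocab, axis=-1)          # (K, M) one softmax per component
    return pi @ probs                               # mixture: sum_k pi_k * softmax_k

d, K, M = 16, 3, 100
rng = np.random.default_rng(0)
p = mos(rng.normal(size=d), rng.normal(size=(d, K)),
        rng.normal(size=(d, K * d)), rng.normal(size=(d, M)))
print(p.shape, p.sum())                             # (100,) 1.0
```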
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Results: MoS learns full rank (= vocab size) while softmax is bounded by $d$
• Empirical rank, measured by collecting all contexts & outputs in the data
29
MoC:	mixture	of	contexts
(mixture	before softmax)
𝑑 = 400, 280, 280 for
Softmax,	MoC,	MoS,	respectively
Note	that	9981	is	full	rank
as	vocab	size	=	9981
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Results:	Simply	changing	Softmax	to	MoS	improves the	performance
• By	applying	MoS	to	SOTA	models,	the	authors	achieved	new	SOTA	records
30
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
31
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Motivation:	
• Teacher	forcing [Williams	et	al.,	1989]	is	widely	used	for	sequential	training
• It uses the real previous token and the current state to predict the current output
32
*Source:	https://satopirka.com/2018/02/encoder-decoder%E3%83%A2%E3%83%87%E3%83%AB%E3%81%A8
teacher-forcingscheduled-samplingprofessor-forcing/
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Motivation:	
• Teacher	forcing [Williams	et	al.,	1989]	is	widely	used	for	sequential	training
• It uses the real previous token and the current state to predict the current output
• However, the model uses predicted tokens at inference (a.k.a. exposure bias)
33
*Source:	https://satopirka.com/2018/02/encoder-decoder%E3%83%A2%E3%83%87%E3%83%AB%E3%81%A8
teacher-forcingscheduled-samplingprofessor-forcing/
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Motivation:	
• Teacher	forcing [Williams	et	al.,	1989]	is	widely	used	for	sequential	training
• It uses the real previous token and the current state to predict the current output
• However, the model uses predicted tokens at inference (a.k.a. exposure bias)
• Training with predicted tokens is not trivial, since (a) training is unstable, and (b) when the
previous token changes, the target should also change
• Idea: Apply curriculum learning
• At the beginning, use real tokens, and slowly move to predicted tokens
34
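A sketch of scheduled sampling inside a decoder loop (PyTorch): with probability eps feed the ground-truth previous token (teacher forcing), otherwise feed the model's own prediction, and decay eps from 1 toward 0 over training. The tiny GRU model and vocabulary below are purely illustrative.

```python
# Scheduled sampling: mix gold and predicted previous tokens during training.
import torch, torch.nn as nn

vocab, emb_dim, hid = 50, 32, 64
embed = nn.Embedding(vocab, emb_dim)
cell = nn.GRUCell(emb_dim, hid)
out = nn.Linear(hid, vocab)

def decode(targets, eps):
    """targets: (T,) gold token ids; eps: probability of using the gold previous token."""
    h = torch.zeros(1, hid)
    prev = targets[:1]                        # start from the first gold token
    loss = 0.0
    for t in range(1, len(targets)):
        h = cell(embed(prev), h)
        logits = out(h)
        loss = loss + nn.functional.cross_entropy(logits, targets[t:t+1])
        pred = logits.argmax(dim=-1)
        # Scheduled sampling: flip a coin to choose the next input token.
        prev = targets[t:t+1] if torch.rand(1).item() < eps else pred
    return loss / (len(targets) - 1)

print(decode(torch.randint(0, vocab, (10,)), eps=0.75).item())
```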
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Results: Scheduled sampling improves over the baseline on many tasks
35
Image	captioning
Constituency	parsing
Algorithmic	Intelligence	Laboratory
Professor	Forcing	[Lamb	et	al.,	2016]
• Motivation:
• Scheduled sampling (SS) is known to optimize a wrong objective [Huszár et al., 2015]
• Idea:
• Make the features under predicted tokens similar to the features under true tokens
• To this end, train a discriminator that classifies features of true vs. predicted runs
• Teacher	forcing: use	real	tokens	/	Free	running: use	predicted	tokens
36
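A shape-level PyTorch sketch of the professor-forcing losses, given hidden-state sequences collected in teacher-forcing mode (real tokens) and free-running mode (predicted tokens). The discriminator and the random hidden states below are stand-ins; the real discriminator reads whole behavior sequences.

```python
# Professor forcing: adversarial matching of teacher-forced and free-running dynamics.
import torch, torch.nn as nn

hid = 64
D = nn.Sequential(nn.Linear(hid, 32), nn.ReLU(), nn.Linear(32, 1))  # illustrative discriminator
bce = nn.BCEWithLogitsLoss()

h_teacher = torch.randn(20, hid)   # hidden states under teacher forcing (placeholder)
h_free = torch.randn(20, hid)      # hidden states under free running (placeholder)

# Discriminator: tell the two behaviors apart.
d_loss = bce(D(h_teacher), torch.ones(20, 1)) + bce(D(h_free), torch.zeros(20, 1))
# Generator (the RNN): make free-running dynamics look like teacher-forced ones,
# on top of the usual MLE loss on real tokens.
g_match_loss = bce(D(h_free), torch.ones(20, 1))
print(d_loss.item(), g_match_loss.item())
```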
Algorithmic	Intelligence	Laboratory
Professor	Forcing	[Lamb	et	al.,	2016]
• Results:
• Professor forcing improves the generalization performance, especially for
long sequences (test samples are much longer than training samples)
37
NLL	for	MNIST
generation
Human	evaluation
for	handwriting
generation
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
38
Algorithmic	Intelligence	Laboratory
MIXER	[Ranzato	et	al.,	2016]
• Motivation:
• Prior	works	use	word-level objectives	(e.g.,	cross-entropy)	for	training,	but	use	
sequence-level objectives	(e.g.,	BLEU	[Papineni	et	al.,	2002])	for	evaluation
• Idea:	Directly	optimize model	with	sequence-level objective	(e.g.,	BLEU)
• Q. How can we backprop through a (usually non-differentiable) sequence-level objective?
• Sequence generation is a kind of RL problem
• state: hidden state, action: output, policy: generation algorithm
• The sequence-level objective is the reward of the current policy
• Hence, one can use a policy gradient (e.g., REINFORCE) algorithm
• However, the gradient estimator of REINFORCE has high variance
• To reduce variance, MIXER (mixed incremental cross-entropy reinforce) uses
MLE for the first $T'$ steps and REINFORCE for the remaining $T - T'$ steps ($T'$ is annealed to zero)
• Cf.	One	can	also	use	other	variance	reduction	techniques,	e.g.,	actor-critic	[Bahdanau	et	al.,	2017]
39
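A shape-level PyTorch sketch of the REINFORCE part of MIXER: sample a continuation, score the whole sequence with a sequence-level reward (BLEU in the paper; a random stand-in here), and weight the log-likelihood of the sampled tokens by (reward - baseline). The stand-in policy and decoder states are illustrative; in MIXER the first $T'$ positions still use cross-entropy on gold tokens.

```python
# REINFORCE update for a sequence-level reward (variance reduced with a baseline).
import torch, torch.nn as nn

vocab, hid, T = 50, 64, 8
policy = nn.Linear(hid, vocab)                  # stand-in for the decoder output layer
states = torch.randn(T, hid)                    # pretend decoder states for T generated steps

logits = policy(states)                         # (T, vocab)
dist = torch.distributions.Categorical(logits=logits)
samples = dist.sample()                         # sampled output tokens
log_probs = dist.log_prob(samples)              # (T,) log-probabilities of the samples

reward = torch.rand(())                         # placeholder for BLEU(sampled, reference)
baseline = 0.5                                  # variance-reduction baseline (learned in the paper)
reinforce_loss = -(reward - baseline) * log_probs.sum()
reinforce_loss.backward()                       # gradient flows into the policy parameters
print(reinforce_loss.item())
```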
Algorithmic	Intelligence	Laboratory
MIXER	[Ranzato	et	al.,	2016]
• Results:
• MIXER shows	better	performance than	other	baselines
• XENT (= cross entropy): another name for maximum likelihood estimation (MLE)
• DAD (= data as demonstrator): another name for scheduled sampling
• E2D (= end-to-end backprop): uses the top-K vector as input (approximates beam search)
40
Algorithmic	Intelligence	Laboratory
SeqGAN	[Yu	et	al.,	2017]
• Motivation:
• RL-based methods still rely on a handcrafted objective (e.g., BLEU)
• Instead, one can use a GAN loss to generate realistic sequences
• However, it is not trivial to apply GANs to natural language, since the data is discrete
(hence not differentiable) and sequential (hence a new architecture is needed)
• Idea: Backprop the discriminator’s output with policy gradients
• Similar to actor-critic; the only difference is that the reward is now the discriminator’s output
• Use	LSTM-generator	and	CNN	(or	Bi-LSTM)-discriminator	architectures
41
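A small PyTorch sketch of SeqGAN's reward: instead of a handcrafted metric, the reward for a generated sequence is the discriminator's "realness" score, plugged into the same policy-gradient update as above. The discriminator below is a crude mean-pooled stand-in; the paper uses a CNN over token embeddings and Monte Carlo rollouts to reward partial sequences.

```python
# SeqGAN-style reward: discriminator output on a generated token sequence.
import torch, torch.nn as nn

vocab, emb_dim, T = 50, 32, 8
embed = nn.Embedding(vocab, emb_dim)
D = nn.Sequential(nn.Linear(emb_dim, 32), nn.ReLU(), nn.Linear(32, 1))  # illustrative discriminator

generated = torch.randint(0, vocab, (T,))        # tokens sampled from the generator
seq_feat = embed(generated).mean(dim=0)          # crude pooled sequence feature
reward = torch.sigmoid(D(seq_feat))              # D's realness probability used as the reward
print(reward.item())
```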
Algorithmic	Intelligence	Laboratory
SeqGAN	[Yu	et	al.,	2017]
• Results:
• SeqGAN shows	better	performance	than	prior	methods
42
Synthetic	generation
(follow	the	oracle)
Chinese	poem	generation Obama	speech	generation
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
43
Algorithmic	Intelligence	Laboratory
UNMT	[Artetxe	et	al.,	2018]
• Motivation:
• Can we train neural machine translation models in an unsupervised way?
• Idea: Apply the idea of domain transfer from Lecture 12
• Combine two losses: a reconstruction loss and a cycle-consistency loss
• Recall: The cycle-consistency loss forces data translated across domains and back (e.g., L1→L2→L1) to
match the original data
44*Source:	Lample	et	al.	“Unsupervised	Machine	Translation	Using	Monolingual	Corpora	Only”,	ICLR	2018.
Model	architecture	(L1/L2:	language	1,	2)
reconstruction
cross-domain	generation
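A structural PyTorch sketch of the two losses on monolingual data. The encoder/decoders are linear stand-ins for the shared sequence-to-sequence model, Gaussian noise stands in for the paper's word drop/shuffle, and MSE stands in for token-level cross-entropy.

```python
# Unsupervised NMT losses: denoising reconstruction + cycle (back-translation) consistency.
import torch, torch.nn as nn

d = 16
enc = nn.Linear(d, d)                                 # shared encoder (stand-in)
dec_l1, dec_l2 = nn.Linear(d, d), nn.Linear(d, d)     # per-language decoders (stand-ins)
mse = nn.MSELoss()                                    # stand-in for token-level cross-entropy

x_l1 = torch.randn(10, d)                             # a batch of "sentences" in language 1 (placeholder)
noisy = x_l1 + 0.1 * torch.randn_like(x_l1)           # noise model (word drop/shuffle in the paper)

recon_loss = mse(dec_l1(enc(noisy)), x_l1)            # reconstruct L1 from its noisy version
x_l2 = dec_l2(enc(x_l1)).detach()                     # current model's L1 -> L2 translation, treated as data
cycle_loss = mse(dec_l1(enc(x_l2)), x_l1)             # translate back: L1 -> L2 -> L1 must match x_l1
print((recon_loss + cycle_loss).item())
```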
Algorithmic	Intelligence	Laboratory
UNMT	[Artetxe	et	al.,	2018]
• Results:	UNMT	produces	good translation	results
45
BPE	(byte	pair	encoding),
a	preprocessing	method
Algorithmic	Intelligence	Laboratory
Conclusion
• Deep	learning	is	widely	used	for	natural	language	processing	(NLP)
• RNN	and	CNN	were	popular	in	2014-2017
• Recently,	self-attention	based	methods	are	widely	used
• Many	new	ideas	are	proposed	to	solve	language	problems
• New	architectures	(e.g.,	self-attention,	softmax)
• New	training	methods	(e.g.,	loss,	algorithm,	unsupervised)
• Research on natural language with deep learning has only just begun
• Deep learning (especially GANs) is not as widely used in NLP as in computer vision
• The Transformer and BERT were only published in 2017-2018
• There	are	still	many	research	opportunities	in	NLP
46
Algorithmic	Intelligence	Laboratory
Introduction
• [Papineni	et	al.,	2002]	BLEU:	a	method	for	automatic	evaluation	of	machine	translation.	ACL	2002.
link	:	https://dl.acm.org/citation.cfm?id=1073135
• [Cho	et	al.,	2014]	Learning	Phrase	Representations	using	RNN	Encoder-Decoder	for	Statistical...	EMNLP	2014.
link	:	https://arxiv.org/abs/1406.1078
• [Sutskever	et	al.,	2014]	Sequence	to	Sequence	Learning	with	Neural	Networks.	NIPS	2014.
link	:	https://arxiv.org/abs/1409.3215
• [Gehring	et	al.,	2017]	Convolutional	Sequence	to	Sequence	Learning.	ICML	2017.
link	:	https://arxiv.org/abs/1705.03122
• [Young	et	al.,	2017]	Recent	Trends	in	Deep	Learning	Based	Natural	Language	Processing.	arXiv	2017.
link	:	https://arxiv.org/abs/1708.02709
Extension	to	unsupervised	setting
• [Artetxe	et	al.,	2018]	Unsupervised	Neural	Machine	Translation.	ICLR	2018.
link	:	https://arxiv.org/abs/1710.11041
• [Lample	et	al.,	2018]	Unsupervised	Machine	Translation	Using	Monolingual	Corpora	Only.	ICLR	2018.
link	:	https://arxiv.org/abs/1711.00043
References
47
Algorithmic	Intelligence	Laboratory
Learning	long-term	dependencies
• [Bahdanau	et	al.,	2015]	Neural	Machine	Translation	by	Jointly	Learning	to	Align	and	Translate.	ICLR	2015.
link	:	https://arxiv.org/abs/1409.0473
• [Weston	et	al.,	2015]	Memory	Networks.	ICLR	2015.
link	:	https://arxiv.org/abs/1410.3916
• [Xu	et	al.,	2015]	Show,	Attend	and	Tell:	Neural	Image	Caption	Generation	with	Visual	Attention.	ICML	2015.
link	:	https://arxiv.org/abs/1502.03044
• [Sukhbaatar	et	al.,	2015]	End-To-End	Memory	Networks.	NIPS	2015.
link	:	https://arxiv.org/abs/1503.08895
• [Kumar	et	al.,	2016]	Ask	Me	Anything:	Dynamic	Memory	Networks	for	Natural	Language	Processing.	ICML	2016.
link	:	https://arxiv.org/abs/1506.07285
• [Vaswani	et	al.,	2017]	Attention	Is	All	You	Need.	NIPS	2017.
link	:	https://arxiv.org/abs/1706.03762
• [Wang	et	al.,	2018]	Non-local	Neural	Networks.	CVPR	2018.
link	:	https://arxiv.org/abs/1711.07971
• [Zhang	et	al.,	2018]	Self-Attention	Generative	Adversarial	Networks.	arXiv	2018.
link	:	https://arxiv.org/abs/1805.08318
• [Peters	et	al.,	2018]	Deep	contextualized	word	representations.	NAACL	2018.
link	:	https://arxiv.org/abs/1802.05365
• [Devlin et al., 2018] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018.
link	:	https://arxiv.org/abs/1810.04805
References
48
Algorithmic	Intelligence	Laboratory
Improve	softmax	layers
• [Mnih	&	Hinton,	2009]	A	Scalable	Hierarchical	Distributed	Language	Model.	NIPS	2009.
link	:	https://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model
• [Grave	et	al.,	2017]	Efficient	softmax	approximation	for	GPUs.	ICML	2017.
link	:	https://arxiv.org/abs/1609.04309
• [Yang	et	al.,	2018]	Breaking	the	Softmax	Bottleneck:	A	High-Rank	RNN	Language	Model.	ICLR	2018.
link	:	https://arxiv.org/abs/1711.03953
Reduce	exposure	bias
• [Williams	et	al.,	1989]	A	Learning	Algorithm	for	Continually	Running	Fully	Recurrent...	Neural	Computation	1989.
link	:	https://ieeexplore.ieee.org/document/6795228
• [Bengio	et	al.,	2015]	Scheduled	Sampling	for	Sequence	Prediction	with	Recurrent	Neural	Networks.	NIPS	2015.
link	:	https://arxiv.org/abs/1506.03099
• [Huszár	et	al.,	2015]	How	(not)	to	Train	your	Generative	Model:	Scheduled	Sampling,	Likelihood...	arXiv	2015.
link	:	https://arxiv.org/abs/1511.05101
• [Lamb	et	al.,	2016]	Professor	Forcing:	A	New	Algorithm	for	Training	Recurrent	Networks.	NIPS	2016.
link	:	https://arxiv.org/abs/1610.09038
References
49
Algorithmic	Intelligence	Laboratory
Reduce	loss/evaluation	mismatch
• [Ranzato	et	al.,	2016]	Sequence	Level	Training	with	Recurrent	Neural	Networks.	ICLR	2016.
link	:	https://arxiv.org/abs/1511.06732
• [Bahdanau	et	al.,	2017]	An	Actor-Critic	Algorithm	for	Sequence	Prediction.	ICLR	2017.
link	:	https://arxiv.org/abs/1607.07086
• [Yu	et	al.,	2017]	SeqGAN:	Sequence	Generative	Adversarial	Nets	with	Policy	Gradient.	AAAI	2017.
link	:	https://arxiv.org/abs/1609.05473
• [Rajeswar	et	al.,	2017]	Adversarial	Generation	of	Natural	Language.	arXiv	2017.
link	:	https://arxiv.org/abs/1705.10929
• [Maddison	et	al.,	2017]	The	Concrete	Distribution:	A	Continuous	Relaxation	of	Discrete	Random...	ICLR	2017.
link	:	https://arxiv.org/abs/1611.00712
• [Jang	et	al.,	2017]	Categorical	Reparameterization	with	Gumbel-Softmax.	ICLR	2017.
link	:	https://arxiv.org/abs/1611.01144
• [Kusner	et	al.,	2016]	GANS	for	Sequences	of	Discrete	Elements	with	the	Gumbel-softmax...	NIPS	Workshop	2016.
link	:	https://arxiv.org/abs/1611.04051
• [Tucker	et	al.,	2017]	REBAR:	Low-variance,	unbiased	gradient	estimates	for	discrete	latent	variable...	NIPS	2017.
link	:	https://arxiv.org/abs/1703.07370
• [Hjelm	et	al.,	2018]	Boundary-Seeking	Generative	Adversarial	Networks.	ICLR	2018.
link	:	https://arxiv.org/abs/1702.08431
• [Zhao	et	al.,	2018]	Adversarially	Regularized	Autoencoders.	ICML	2018.
link	:	https://arxiv.org/abs/1706.04223
References
50
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Method:
• (Scaled dot-product) attention is given by $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\, V$
• Use multi-head attention (i.e., an ensemble of attention heads)
• The	final	transformer model	is	built	upon	the	attention	blocks
• First,	extract	features	with	self-attention
• Then	decode	feature	with	usual	attention
• Since the model does not have a sequential structure,
the authors add a position embedding (a handcrafted
feature that represents the position in the sequence)
51
*Notation: $(K, V)$ is the (key, value) pair, and $Q$ is the query
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• Put the top $k_h$ words ($p_h$ of frequencies) and a token “NEXT” in the first layer, and
put $k_t = k - k_h$ words ($p_t = 1 - p_h$ of frequencies) in the next layer
• Let $g(k, B)$ be the computation time for $k$ words and batch size $B$
• Then the computation time of the proposed method is $C = g(k_h + 1, B) + g(k_t, p_t B)$
• Here, $g(k, B)$ behaves like a threshold function (due to the initial setup cost of the GPU)
52
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• The computation time of the proposed method is $C = g(k_h + 1, B) + g(k_t, p_t B)$
• Hence, give a constraint that $kB \ge k_0 B_0$ (for efficient usage of the GPU)
• Also, extend the model to the multi-cluster setting (with $J$ clusters): $C = g(k_h + J, B) + \sum_{i=1}^{J} g(k_i, p_i B)$
• By solving the optimization problem (for $k_i$ and $J$), the model is 3-5x faster than
the original softmax (in practice, $J = 5$ shows a good computation/performance trade-off)
53
Algorithmic	Intelligence	Laboratory
Professor	Forcing	[Lamb	et	al.,	2016]
• Motivation:
• Scheduled	sampling	(SS)	is	known	to	optimize	wrong	objective [Huszár	et	al.,	2015]
• Let $P$ and $Q$ be the data and model distributions, respectively
• Assume a length-2 sequence $x_1 x_2$, and let $\epsilon$ be the ratio of real samples
• Then the objective of scheduled sampling is
• If $\epsilon = 1$, it is the usual MLE objective, but as $\epsilon \to 0$, it pushes the conditional
distribution $Q_{x_2 \mid x_1}$ toward the marginal distribution $P_{x_2}$ instead of $P_{x_2 \mid x_1}$
• Hence, the factorized $Q^* = P_{x_1} P_{x_2}$ can minimize the objective
54
Algorithmic	Intelligence	Laboratory
More	Methods	for	Discrete	GAN
• Gumbel-Softmax	(a.k.a.	concrete	distribution):
• Gradient	estimator	of	REINFORCE	has	high	variance
• One can apply the reparameterization trick… but how, for discrete variables?
• One can use the Gumbel-softmax trick [Jang et al., 2017; Maddison et al., 2017] to
obtain a biased but low-variance gradient estimator
• One can also get an unbiased estimator by using the Gumbel-softmax estimator as a control
variate for REINFORCE, called REBAR [Tucker et al., 2017]
• Discrete	GAN	is	still	an	active	research	area
• BSGAN	[Hjelm	et	al.,	2018],	ARAE	[Zhao	et	al.,	2018],	etc.
• However, GANs are not yet as popular for sequences (natural language) as for images
55
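A minimal NumPy sketch of the Gumbel-softmax trick: add Gumbel(0,1) noise to the logits and take a temperature-controlled softmax, giving a differentiable relaxation of a categorical sample (the argmax of the same noisy logits recovers an exact Gumbel-max sample).

```python
# Gumbel-softmax sample: a "soft one-hot" vector that sharpens as the temperature tau -> 0.
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0)):
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0,1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

probs = gumbel_softmax(np.log(np.array([0.1, 0.2, 0.7])))
print(probs, probs.argmax())
```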