Deep Learning Lab Course 2017 (Deep Learning Practical)
Labs:
(Computer Vision) Thomas Brox,
(Robotics) Wolfram Burgard,
(Machine Learning) Frank Hutter,
(Neurorobotics) Joschka Boedecker
University of Freiburg
I Phase 1
I Today: Introduction to Deep Learning (lecture).
I Assignment 1) 16.10 - 06.11: Implementation of a feed-forward neural network from scratch (each person individually, 1-2 page report).
I 23.10: meeting to resolve open questions
I 30.10: meeting to resolve open questions
I 06.11: Intro to CNNs and TensorFlow (mandatory lecture), hand in Assignment 1
I Assignment 2) 06.11 - 20.11: Train a CNN in TensorFlow (each person individually, 1-2 page report)
I 13.11: meeting to resolve open questions
I Phase 2 (split into three tracks)
I 20.11: First meeting in tracks, hand in Assignment 2
I Assignment 3) 20.11 - 18.12: Group work on a track-specific task, 1-2 page report
I Assignment 4) 18.12 - 22.01: Group work on an advanced topic, 1-2 page report
I Phase 3:
I 15.01: meeting on final project topics and resolving open questions
I Assignment 5) 22.01 - 12.02: Solve a challenging problem of your choice using the tools from Phase 2 (group work + 5 min presentation)
I Track 1 Neurorobotics/Robotics
I Robot navigation
I Deep reinforcement learning
I Track 2 Machine Learning
I Architecture search and hyperparameter optimization
I Visual recognition
I Track 3 Computer Vision
I Image segmentation
I Autoencoders
I Generative adversarial networks
I Groups: You will work on your own for the first exercise
I Assignment: You will have three weeks to build an MLP and apply it to
the MNIST dataset (more on this at the end)
I Lecture: Short recap on how MLPs (feed-forward neural networks) work
and how to train them
(Figure: Data and Model M, with inference linking the model to unknown quantities "?" in the data)
1 Learn a model M from the data
2 Let the model M infer unknown quantities from data
What is the difference between deep learning and a standard machine learning pipeline?
(Figure: standard machine learning pipeline — data points x_1, x_2, x_3 are first turned into hand-crafted features (color, height: 4 cm, 3.5 cm, 10 cm), which are then fed to a model)
Supervised Deep Learning Pipeline
(Figure: supervised deep learning pipeline — the data enters as raw pixels x_1, x_2, x_3 and the network maps them directly to classes)
I Let’s formalize!
I We are given:
I Dataset D = {(x_1, y_1), ..., (x_N, y_N)}
I A neural network with parameters θ which implements a function f_θ(x)
I We want to learn:
I The parameters θ such that ∀i ∈ [1, N]: f_θ(x_i) = y_i
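To make this setup concrete, here is a minimal numpy sketch of such a dataset D and a parametric model f_θ (the shapes, the linear-softmax model, and all names are assumptions for illustration, not part of the assignment):

```python
import numpy as np

# Hypothetical toy dataset D = {(x_1, y_1), ..., (x_N, y_N)}:
# N examples with D_in input features and integer class labels.
N, D_in, n_classes = 100, 3, 2
X = np.random.randn(N, D_in)             # inputs x_i
y = np.random.randint(0, n_classes, N)   # targets y_i

# Parameters theta of a (here: linear) model f_theta(x) = softmax(W x + b).
theta = {"W": 0.01 * np.random.randn(D_in, n_classes),
         "b": np.zeros(n_classes)}

def f(theta, X):
    """Model output f_theta(x) as class probabilities (softmax of a linear map)."""
    a = X @ theta["W"] + theta["b"]
    a -= a.max(axis=1, keepdims=True)     # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# Goal of learning: choose theta such that f_theta(x_i) matches y_i for all i.
probs = f(theta, X)                       # shape (N, n_classes)
```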
(Figure: feed-forward network with inputs x_1 ... x_j ... x_N, hidden layers h^(1)(x) and h^(2)(x) with units a_i, weights w_{i,j} and biases b_i, and a softmax output f(x))
I unit i activation:
a_i = Σ_{j=1}^{N} w_{i,j} x_j + b_i
I unit i output:
h_i(x) = t(a_i), where t(·) is an activation or transfer function
I alternatively (and much faster) use vector notation:
I layer activation:
a^(1) = W^(1) x + b^(1)
I layer output:
h^(1)(x) = t(a^(1)), where t(·) is applied element-wise
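As a sanity check on the notation, a small sketch (the layer sizes and the tanh transfer function are assumed purely for illustration) showing that the vectorized form computes exactly the same activations as the per-unit sum:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3                       # assumed sizes for illustration
x = rng.standard_normal(n_in)
W = rng.standard_normal((n_out, n_in))   # W^(1)
b = rng.standard_normal(n_out)           # b^(1)

# per-unit notation: a_i = sum_j w_{i,j} x_j + b_i
a_loop = np.array([sum(W[i, j] * x[j] for j in range(n_in)) + b[i]
                   for i in range(n_out)])

# vector notation: a^(1) = W^(1) x + b^(1)
a_vec = W @ x + b

print(np.allclose(a_loop, a_vec))        # True
h = np.tanh(a_vec)                       # layer output h^(1)(x) = t(a^(1)), applied element-wise
```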
Second layer
I layer 2 activation:
a^(2) = W^(2) h^(1)(x) + b^(2)
I layer 2 output:
h^(2)(x) = t(a^(2)), where t(·) is applied element-wise
Output layer
I output layer activation:
a^(3) = W^(3) h^(2)(x) + b^(3)
I network output:
f(x) = o(a^(3)), where o(·) is the output nonlinearity
I for classification use the softmax:
o_i(z) = e^{z_i} / Σ_{j=1}^{|z|} e^{z_j}
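Putting the three layers together, a minimal numpy sketch of the full forward pass (the layer sizes and the tanh transfer function are assumptions for illustration; only the softmax output nonlinearity matches the formula above):

```python
import numpy as np

def softmax(z):
    """Output nonlinearity o(z): o_i(z) = exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(z - z.max())              # shift by max(z) for numerical stability
    return e / e.sum()

def forward(x, params, t=np.tanh):
    """Forward pass f(x) = o(W^(3) t(W^(2) t(W^(1) x + b^(1)) + b^(2)) + b^(3))."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = t(W1 @ x + b1)                  # layer 1: a^(1), then h^(1)(x)
    h2 = t(W2 @ h1 + b2)                 # layer 2: a^(2), then h^(2)(x)
    return softmax(W3 @ h2 + b3)         # output layer: a^(3), then f(x)

# Example with assumed sizes: 784 inputs (e.g. MNIST pixels), two hidden layers, 10 classes.
sizes = [784, 128, 64, 10]
rng = np.random.default_rng(0)
params = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    params += [0.01 * rng.standard_normal((n_out, n_in)), np.zeros(n_out)]
probs = forward(rng.standard_normal(784), params)
print(probs.shape, probs.sum())          # (10,) and ~1.0
```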
Gradient descent:
θ^0 ← init randomly
do
I θ^{t+1} = θ^t − γ ∂L(f_{θ^t}, D)/∂θ
while (L(f_{θ^{t+1}}, V) − L(f_{θ^t}, V))^2 > ε
I Where V is a validation dataset (why not use D?)
I Remember, in our case: L(f_θ, D) = (1/N) Σ_{i=1}^{N} l(f_θ(x_i), y_i)
I We will get to computing the derivatives shortly
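A minimal sketch of this loop (the quadratic toy loss, its gradient, and the data below are placeholders assumed for illustration; the stopping rule follows the pseudocode above):

```python
import numpy as np

def gradient_descent(loss, grad, theta0, D, V, gamma=0.1, eps=1e-6, max_iter=10000):
    """Plain gradient descent with the validation-based stopping test from the slide.

    loss(theta, data) -> scalar L(f_theta, data)
    grad(theta, data) -> dL/dtheta evaluated on the training data D
    """
    theta = theta0
    prev_val = loss(theta, V)
    for _ in range(max_iter):
        theta = theta - gamma * grad(theta, D)   # theta^{t+1} = theta^t - gamma * dL/dtheta
        val = loss(theta, V)
        if (val - prev_val) ** 2 <= eps:         # stop once the validation loss stops changing
            break
        prev_val = val
    return theta

# Toy usage on a least-squares problem (assumed for illustration):
X, y = np.random.randn(50, 3), np.random.randn(50)
Xv, yv = np.random.randn(20, 3), np.random.randn(20)
loss = lambda th, d: np.mean((d[0] @ th - d[1]) ** 2)
grad = lambda th, d: 2 * d[0].T @ (d[0] @ th - d[1]) / len(d[1])
theta = gradient_descent(loss, grad, np.zeros(3), (X, y), (Xv, yv))
```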
(Figure: loss over time for gradient descent (γ = 2) and stochastic gradient descent (γ^t = 0.01/t) on the same data)
→ Same data, assuming that gradient evaluation on all the data takes 4 times as much time as evaluating a single datapoint
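For contrast with the full-batch loop above, a minimal stochastic gradient descent sketch (the 1/t decay mirrors the γ^t = 0.01/t schedule from the figure; the toy data and per-example gradient are assumptions for illustration):

```python
import numpy as np

def sgd(grad_single, theta0, X, y, gamma0=0.01, n_epochs=5):
    """Stochastic gradient descent: update on one datapoint at a time
    with a decaying step size gamma^t = gamma0 / t."""
    theta, t = theta0, 1
    for _ in range(n_epochs):
        for i in np.random.permutation(len(y)):   # visit datapoints in random order
            theta = theta - (gamma0 / t) * grad_single(theta, X[i], y[i])
            t += 1
    return theta

# Toy usage (least squares on random data, assumed for illustration):
X, y = np.random.randn(200, 3), np.random.randn(200)
grad_single = lambda th, xi, yi: 2 * (xi @ th - yi) * xi
theta = sgd(grad_single, np.zeros(3), X, y)
```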
Neural Network backward pass
(Figure: the same network, now with the loss l(f(x), y) attached to the output f(x))
→ recall: a^(3) = W^(3) h^(2)(x) + b^(3)
I gradient wrt. layer 3 weights (chain rule):
∂l(f(x), y)/∂W^(3) = (∂l(f(x), y)/∂a^(3)) (∂a^(3)/∂W^(3))
I assuming l is the NLL and softmax outputs, the gradient wrt. the layer 3 activation is:
∂l(f(x), y)/∂a^(3) = −(y − f(x))
I gradient of a^(3) wrt. W^(3):
∂a^(3)/∂W^(3) = h^(2)(x)^T
I combined:
∂l(f(x), y)/∂W^(3) = −(y − f(x)) (h^(2)(x))^T
I gradient wrt. the previous layer (chain rule again):
∂l(f(x), y)/∂a^(2) = (∂l(f(x), y)/∂a^(3)) (∂a^(3)/∂h^(2)(x)) (∂h^(2)(x)/∂a^(2)),
where for a sigmoid transfer function ∂h_i(x)/∂a_i = σ(a_i)(1 − a_i)
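A minimal numpy sketch of these layer-3 gradients and the step back to layer 2 (the sigmoid transfer function, the one-hot target, and the sizes are assumptions for illustration):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed small network state for illustration.
rng = np.random.default_rng(0)
h2 = sigmoid(rng.standard_normal(5))   # h^(2)(x), output of layer 2 (sigmoid units)
W3 = rng.standard_normal((3, 5))       # W^(3)
b3 = np.zeros(3)
y = np.array([0.0, 1.0, 0.0])          # one-hot target

a3 = W3 @ h2 + b3                      # a^(3) = W^(3) h^(2)(x) + b^(3)
f = softmax(a3)                        # network output f(x)

dl_da3 = -(y - f)                      # NLL + softmax: dl/da^(3) = -(y - f(x))
dl_dW3 = np.outer(dl_da3, h2)          # dl/dW^(3) = dl/da^(3) (h^(2)(x))^T
dl_db3 = dl_da3                        # bias gradient

# gradient wrt. the previous layer activation (chain rule):
# dl/da^(2) = (W^(3)^T dl/da^(3)) * sigma(a^(2)) (1 - sigma(a^(2))),
# where sigma(a^(2)) is exactly h2, so the sigmoid derivative is h2 * (1 - h2).
dl_da2 = (W3.T @ dl_da3) * h2 * (1.0 - h2)
```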
Gradient Checking:
θ = (W^(1), b^(1), . . . ) ← init randomly
x ← init randomly ; y ← init randomly
g_analytic ← ∂l(f_θ(x), y)/∂θ (compute gradient via backprop)
for i in #θ
I θ̂ = θ
I θ̂_i = θ̂_i + ε
I g_numeric = (l(f_θ̂(x), y) − l(f_θ(x), y)) / ε
I assert(|g_numeric − (g_analytic)_i| < ε)
I can also be used to test partial implementations (i.e. layers, activation functions)
→ simply remove the loss computation and backpropagate ones (a vector of 1s as the incoming gradient)
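A minimal numpy sketch of this check (the quadratic toy loss and the separate assert tolerance are assumptions; the one-sided difference follows the pseudocode above):

```python
import numpy as np

def grad_check(loss, grad, theta, eps=1e-6, tol=1e-4):
    """Compare the analytic gradient against a one-sided finite difference,
    one parameter theta_i at a time (as in the pseudocode above)."""
    g_analytic = grad(theta)
    base = loss(theta)
    for i in range(theta.size):              # for i in #theta
        theta_hat = theta.copy()             # theta_hat = theta
        theta_hat[i] += eps                  # theta_hat_i = theta_hat_i + eps
        g_numeric = (loss(theta_hat) - base) / eps
        assert abs(g_numeric - g_analytic[i]) < tol, f"mismatch at parameter {i}"

# Toy usage with a loss whose gradient is known analytically (assumed for illustration):
theta = np.random.randn(10)
loss = lambda th: 0.5 * np.sum(th ** 2)      # l(theta) = 0.5 ||theta||^2
grad = lambda th: th                         # dl/dtheta = theta
grad_check(loss, grad, theta)
print("gradient check passed")
```

Using a tolerance separate from the perturbation ε is a small practical deviation from the pseudocode, since the one-sided difference itself carries an error on the order of ε.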