Deep Learning From Scratch
www.data4sci.com/newsletter
https://github.com/DataForScience/DeepLearning
References https://github.com/DataForScience/DeepLearning
Requirements
https://github.com/DataForScience/DeepLearning
Machine Learning
What about Neurons?
Biological Neuron
How the Brain “Works” (Cartoon version)
• Each neuron receives input from other neurons
• 10¹¹ neurons, each with ~10⁴ weights
• Weights can be positive or negative
• Weights adapt during the learning process
• “neurons that fire together wire together” (Hebb)
• Different areas perform different functions using same structure (Modularity)
Perceptron
[Diagram: inputs x1…xN, plus a bias unit x0 = 1, are multiplied by the weights w0j…wNj and summed to give zj = wᵀx]
Perceptron - Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate the output using the activation function
[Diagram: inputs x1…xN, plus a bias unit x0 = 1, weights w0j…wNj, weighted sum zj = wᵀx]
Perceptron - Forward Propagation
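A minimal, illustrative numpy sketch of these steps (forward_perceptron and the step threshold are my own choices, not the notebook's code):

import numpy as np

def forward_perceptron(w, x):
    # Multiply the inputs by the respective weights and sum them up
    z = np.dot(w, x)
    # Step activation: the perceptron fires when z is positive
    return int(z > 0)

For example, forward_perceptron(np.array([-0.5, 1.0, 1.0]), np.array([1, 1, 0])) returns 1, since z = 0.5 > 0.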
Perceptron - Training
• Training Procedure:
• If the prediction is correct, do nothing
• If the prediction is wrong, move the weights toward the correct answer: w ← w + α (y − ŷ) x
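A minimal, illustrative numpy sketch of this rule (train_perceptron, the 0/1 labels and the step threshold are my own choices, not the notebook's code):

import numpy as np

def train_perceptron(X, y, epochs=10, alpha=1.0):
    # Add the bias column x0 = 1
    X_ = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
    w = np.zeros(X_.shape[1])

    for _ in range(epochs):
        for xi, yi in zip(X_, y):
            y_hat = int(np.dot(w, xi) > 0)   # step activation
            # If correct, (yi - y_hat) = 0 and nothing changes;
            # otherwise the weights move toward the correct answer
            w += alpha * (yi - y_hat) * xi

    return w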
Linear Boundaries
• Perceptrons rely on hyperplanes to separate the data points. Unfortunately, this is not always
possible:
[Plots: AND, OR and NOR are linearly separable; XOR is impossible to separate with a single hyperplane]
Code - Perceptron / Forward Propagation
https://github.com/DataForScience/DeepLearning
Linear Regression
• Each point is represented by a vector: x⃗i = (x0, x1, ⋯, xn)ᵀ
• Add x0 ≡ 1 to account for the intercept, so that y ≈ f(x⃗) = w0 x0 + w1 x1
[Plot: y vs x1 scatter with the fitted line f(x⃗)]
Optimization Problem https://github.com/DataForScience/DeepLearning
• The constraints: Problem Representation
• The function to optimize: Prediction Error
• The optimization algorithm: Gradient Descent
Linear Regression
• We are assuming that our functional dependence is of the form:
f(x⃗) = w0 + w1x1 + ⋯ + wn xn ≡ X w⃗
hw(X) = X w⃗ ≡ ŷ
and it imposes a Constraint on the solutions that can be found.
[Data matrix: Samples 1…N as rows, Features 1…M as columns, plus the target value column]
• We quantify how far our hypothesis is from the correct value using an Error Function:
Jw(X, y⃗) = (1/2m) Σi [hw(x⃗(i)) − y(i)]²
or, vectorially:
Jw(X, y⃗) = (1/2m) [X w⃗ − y⃗]²
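As a small illustrative numpy sketch of this error function (cost is my own name; X is assumed to already carry the x0 ≡ 1 column):

import numpy as np

def cost(w, X, y):
    # J_w(X, y) = (1/2m) ||X w - y||^2
    m = X.shape[0]
    residual = np.dot(X, w) - y
    return np.dot(residual, residual) / (2 * m)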
Geometric Interpretation
Jw(X, y⃗) = (1/2m) [X w⃗ − y⃗]²
[Plot: the data points, the fitted line, and the errors between them]
Quadratic error means that an error twice as large is penalized four times as much.
Gradient Descent
• Goal: Find the minimum of Jw(X, y⃗) by varying the components of w⃗
• We move in the direction of the negative gradient, −(δ/δw⃗) Jw(X, y⃗)
• Algorithm: repeat wj ← wj − α (δ/δwj) Jw(X, y⃗) for every weight until the cost stops decreasing
[Panels: the 2D, 3D and nD cases, fitting y = w0 + w1x1, y = w0 + w1x1 + w2x2 and y = X w⃗ respectively; add x0 ≡ 1 to account for the intercept]
• The result is the hyperplane that splits the points in two such that the errors on each side balance out
Code - Linear Regression
https://github.com/DataForScience/DeepLearning
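As a minimal, illustrative sketch of batch gradient descent for this problem (not necessarily the notebook's exact code; gradient_descent, alpha and epochs are my own names, and X is assumed to be bias-augmented):

import numpy as np

def gradient_descent(X, y, alpha=0.01, epochs=1000):
    m = X.shape[0]
    w = np.zeros(X.shape[1])

    for _ in range(epochs):
        error = np.dot(X, w) - y          # h_w(X) - y
        grad = np.dot(X.T, error) / m     # (1/m) X^T (h_w(X) - y)
        w -= alpha * grad                 # step against the gradient

    return w

Each pass over the data computes the full gradient and takes one step against it.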
Learning Procedure
[Diagram: the learning procedure ties together a Constraint, an Error Function and a Learning Algorithm]
Learning Procedure
[Same diagram: Constraint, Error Function, Learning Algorithm]
The Constraint is something which we can redefine…
Learning Procedure
…and rewrite it as a pipeline: the input Xᵀ is mapped to z = Xᵀw⃗, the hypothesis ϕ(z) produces the predicted output ŷ⃗, and the Error Function Jw(X, y⃗) compares it with the observed output.
[Diagram: Constraint, Error Function, Learning Algorithm]
Logistic Regression (Classification)
• Not actually regression, but rather Classification
• z encapsulates all the parameters and input values
• We want to maximize the value of z for members of the class
Geometric Interpretation
• An instance is assigned to the class when ϕ(z) ≥ 1/2
Logistic Regression
• Error Function - Cross Entropy
Jw(X, y⃗) = −(1/m) [yᵀ log(hw(X)) + (1 − y)ᵀ log(1 − hw(X))]
measures the "distance" between two probability distributions
hw(X) = 1 / (1 + e^(−X w⃗))
• Effectively treating the labels as probabilities (an instance with label=1 has Probability 1 of
belonging to the class).
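A small illustrative numpy sketch of the sigmoid hypothesis and this cost (my own function names; y is assumed to be a 0/1 label vector and X to be bias-augmented):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    m = X.shape[0]
    h = sigmoid(np.dot(X, w))     # h_w(X)
    return -(np.dot(y, np.log(h)) + np.dot(1 - y, np.log(1 - h))) / m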
Iris dataset
Code - Logistic Regression
https://github.com/DataForScience/DeepLearning
Logistic Regression
Learning Procedure
[Pipeline: input Xᵀ → z = Xᵀw⃗ → hypothesis ϕ(z) → predicted output ŷ⃗ → Error Function Jw(X, y⃗) compared with the observed output; Constraint, Error Function and Learning Algorithm]
Comparison
• Linear Regression:
z = X w⃗ (maps the features to a continuous variable)
Jw(X, y⃗) = (1/2m) [hw(X) − y⃗]²
(δ/δwj) Jw(X, y⃗) = (1/m) Xᵀ · (hw(X) − y⃗)
• Logistic Regression:
z = X w⃗
Jw(X, y⃗) = −(1/m) [yᵀ log(hw(X)) + (1 − y)ᵀ log(1 − hw(X))]
(δ/δwj) Jw(X, y⃗) = (1/m) Xᵀ · (hw(X) − y⃗)
• In both cases the gradient takes exactly the same form; only the hypothesis hw(X) changes.
Learning Procedure
[Diagram: inputs x1…xN plus a bias unit, weights w0j…wNj, the weighted sum zj = wᵀx, and an activation function ϕ(z)]
Generalized Perceptron
• By changing the activation function, we change the underlying algorithm
[Diagram: inputs x1…xN plus a bias unit, weights w0j…wNj, zj = wᵀx, activation function ϕ(z)]
Activation Function
• Non-Linear function
• Differentiable
• non-decreasing
Activation Function - Linear
• Non-Linear function
• Differentiable
• non-decreasing
ϕ (z) = z
• Compute new sets of features
• The simplest
Activation Function - Sigmoid
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
ϕ (z) = 1 / (1 + e^(−z))
• Each layer builds up a more abstract
representation of the data
Forward Propagation
• The output of a perceptron is determined by a sequence of steps: obtain the inputs, multiply them by the respective weights, and calculate the output using the activation function ϕ(z).
Code - Forward Propagation
https://github.com/DataForScience/DeepLearning
Activation Function - ReLu
• Non-Linear function
• Differentiable
• non-decreasing
ϕ (z) = z for z > 0, and 0 otherwise
• Compute new sets of features
Stepwise Regression https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
• Constant
• Hinge functions
• Products of hinges
y(x) = 1.013 + 1.198 max(0, x − 0.485) − 1.803 max(0, 0.485 − x) − 1.321 max(0, x − 0.283) − 1.609 max(0, x − 0.640) + 1.591 max(0, x − 0.907)
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate output using the activation function
• To create a multi-layer perceptron, you can simply use the output of one layer as the input to
the next one.
[Diagram: two chained layers; the first layer's activations a1…aN, plus a bias unit, are the inputs of the second layer, which computes ak = ϕ(wᵀa)]
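A compact, illustrative numpy sketch of this chaining (the names and the 784 → 50 → 10 layer sizes are borrowed from the MNIST example further on; this is not the notebook's code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(W, a_prev):
    a_prev = np.concatenate(([1.0], a_prev))   # prepend the bias unit
    return sigmoid(np.dot(W, a_prev))          # activations of this layer

x = np.random.rand(784)                        # one illustrative input
W1 = 0.1 * np.random.randn(50, 785)            # layer 1: 50 units (+ bias column)
W2 = 0.1 * np.random.randn(10, 51)             # layer 2: 10 units (+ bias column)
a1 = forward_layer(W1, x)                      # the output of layer 1...
a2 = forward_layer(W2, a1)                     # ...is the input of layer 2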
• But how can we propagate back the errors and update the weights?
Stepwise Regression https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
f̂(x) = Σi ci Bi(x)
y(x) = 1.013 + 1.198 max(0, x − 0.485) − 1.803 max(0, 0.485 − x) − 1.321 max(0, x − 0.283) − 1.609 max(0, x − 0.640) + 1.591 max(0, x − 0.907)
[Diagram: the same fit drawn as a small network, with the input x and bias units feeding ReLU units whose outputs are combined by a Linear output unit]
Loss Functions
• For learning to occur, we must quantify how far off we are from the desired output. There are two common ways of doing this: the quadratic error used above for regression, and the cross entropy used for classification.
Regularization
• Helps keep weights relatively small by adding a penalization to the cost function.
• Lasso helps with feature selection by driving less important weights to zero
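As an illustrative sketch in the notation used above (λ, the regularization strength, and the exact 1/m scaling are conventions not fixed by this slide), the penalized cost functions are:
Ridge (L2): Jw(X, y⃗) + (λ/2m) Σj wj²
Lasso (L1): Jw(X, y⃗) + (λ/m) Σj |wj|
The L2 penalty shrinks all weights smoothly, while the L1 penalty can drive individual weights exactly to zero, which is what makes Lasso useful for feature selection.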
Backward Propagation of Errors (BackProp)
• BackProp operates in two phases: a forward pass that computes and stores the activations of every layer, and a backward pass that propagates the errors from the output back through the network, layer by layer.
• The error at the output is a weighted average difference between predicted output and the
observed one.
BackProp
• Let δ(l) be the error at each of the total L layers
• Then, at the output layer: δ(L) = hw(X) − y
• and, moving backwards through the hidden layers: δ(l) = (w(l))ᵀ δ(l+1) ∘ ϕ′(z(l)) (elementwise product)
• And finally, accumulating over the training examples:
Δ(l)ij = Δ(l)ij + a(l)j δ(l+1)i
(∂/∂w(l)ij) Jw(X, y⃗) = (1/m) Δ(l)ij + λ w(l)ij
A practical example - MNIST
[An example digit ("8") and the data matrix: each 28 × 28 image is flattened into a row, giving Samples 1…N as rows and Features 1…M as columns, plus a Label column]
yann.lecun.com/exdb/mnist/
A practical example - MNIST
[Diagram: the data matrix X is fed through the weight layers Θ1 and Θ2; the predicted label is the arg max over the outputs and is compared against the labels y]
yann.lecun.com/exdb/mnist/
A practical example - MNIST
[Network: 5000 training examples; X → Θ1 → Θ2 → arg max. Vector sizes: 784, 50, 10, 1. Weight matrices: 50 × 785 and 10 × 51 for Forward Propagation, 50 × 784 and 10 × 50 for Backward Propagation]

def forward(Theta, X, active):
    N = X.shape[0]

    # Prepend the bias column of ones
    X_ = np.concatenate((np.ones((N, 1)), X), axis=1)

    # Multiply by the weights
    z = np.dot(X_, Theta.T)

    # Apply the activation function
    a = active(z)

    return a

The predicted digit is the position of the largest of the 10 outputs: np.argmax(h2, 1)
Code - Simple Network
https://github.com/DataForScience/DeepLearning
Practical Considerations
• So far we have looked at very idealized cases. Reality is never this simple!
• Data normalization
• Overfitting
• Hyperparameters
• etc…
Data Normalization
• The range of raw data values can vary widely.
• Using features with very different ranges in the same analysis can cause numerical problems. Many algorithms are linear or use euclidean distances that are heavily influenced by the numerical values used (cm vs km, for example).
• To avoid difficulties, it's common to rescale the range of all features in such a way that each feature falls within the same range.
• Several possibilities:
• Rescaling - x̂ = (x − xmin) / (xmax − xmin)
• Standardization - x̂ = (x − μx) / σx
• Normalization - x̂ = x / ||x||
• In the rest of the discussion we will assume that the data has been normalized in some way.
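A numpy sketch of the three options (illustrative; X here is a toy samples × features array):

import numpy as np

X = np.random.rand(100, 3) * [1.0, 100.0, 10000.0]   # toy features with very different ranges

X_rescaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # Rescaling to [0, 1]
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)                    # Standardization
X_normed = X / np.linalg.norm(X, axis=1, keepdims=True)              # Normalization to unit length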
Supervised Learning - Overfitting
[Data matrix (Samples 1…N by Features 1…M plus the target value) split into a Training block and a held-out Testing block]
• "Learning the noise"
• "Memorization" instead of "generalization"
• How can we prevent it?
• Train the model using only the Training dataset and evaluate the results on the previously unseen Testing dataset.
• Single split
• k-fold cross validation: split the dataset in k parts, train on k-1 and evaluate on 1, repeat k times and average the results.
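A small numpy sketch of the k-fold split (illustrative; k_fold_indices is my own helper, not a library function):

import numpy as np

def k_fold_indices(N, k, seed=0):
    idx = np.random.RandomState(seed).permutation(N)   # shuffle the sample indices
    return np.array_split(idx, k)                      # k roughly equal folds

folds = k_fold_indices(N=100, k=5)                     # 100 samples, purely illustrative
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on the train_idx rows, evaluate on the test_idx rows, then average the k scores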
Bias-Variance Tradeoff
[Plot: Training and Testing Error vs Model Complexity, with the Bias and Variance contributions; the low-complexity side is High Bias / Low Variance, the high-complexity side is Low Bias / High Variance]
Learning Rate
wij = wij − α (δ/δwij) Jw(X, y⃗)
α defines the size of the step taken in the direction of −(δ/δwij) Jw(X, y⃗)
Epoch: one full pass of these updates over the training data
Tips
• online learning - update weights after each case
- might be useful to update model as new data is obtained
- subject to fluctuations
• momentum - let the gradient change the velocity of the weight change instead of the value directly
• rmsprop - divide the learning rate for each weight by a running average of "recent" gradients (both updates are sketched below)
• learning rate - vary over the course of the training procedure and use different learning rates
for each weight
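An illustrative sketch of the momentum and rmsprop updates (my own variable names; the values of mu, rho and the 1e-8 constant are typical but arbitrary choices):

import numpy as np

w = np.zeros(10)            # weights (illustrative size)
grad = np.ones(10)          # gradient of J with respect to w (placeholder values)
alpha, mu, rho = 0.01, 0.9, 0.9

# momentum: the gradient changes the velocity v, and v changes the weights
v = np.zeros_like(w)
v = mu * v - alpha * grad
w = w + v

# rmsprop: scale each weight's step by a running average of recent squared gradients
cache = np.zeros_like(w)
cache = rho * cache + (1 - rho) * grad ** 2
w = w - alpha * grad / (np.sqrt(cache) + 1e-8)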
Generalization
• Neural Networks are extremely modular in their design
• Fortunately, we can write code that is also modular and can easily handle arbitrary numbers of layers
• Let's describe the structure of our network as a list of weight matrices and activation functions
• We also need to keep track of the gradients of the activation functions, so let us define a simple class:

import numpy as np

class Activation(object):
    def f(z):
        pass

    def df(z):
        pass

class Linear(Activation):
    def f(z):
        return z

    def df(z):
        return np.ones(z.shape)

class Sigmoid(Activation):
    def f(z):
        return 1./(1 + np.exp(-z))

    def df(z):
        h = Sigmoid.f(z)
        return h*(1 - h)
Generalization
• Now we can describe our simple MNIST model with:
Thetas = []
Thetas.append(init_weights(input_layer_size, hidden_layer_size))
Thetas.append(init_weights(hidden_layer_size, num_labels))
model = []
model.append(Thetas[0])
model.append(Sigmoid)
model.append(Thetas[1])
model.append(Sigmoid)
• Where Sigmoid is an object that contains both the sigmoid function and its gradient as was
defined in the previous slide.
Generalization - Forward propagation
The forward function is unchanged (it still ends with return a); the prediction simply loops over the (weights, activation) pairs of the model:

def predict(model, X):
    h = X

    for theta, activation in zip(model[0::2], model[1::2]):
        h = forward(theta, h, activation)

    return np.argmax(h, 1)
def backprop(model, X, y):
    M = X.shape[0]

    Thetas = model[0::2]
    activations = model[1::2]

    layers = len(Thetas)
    K = Thetas[-1].shape[0]

    J = 0
    Deltas = []

    for i in range(layers):
        Deltas.append(np.zeros(Thetas[i].shape))

    deltas = [0, 0, 0, 0]

    for i in range(M):
        As = []
        Zs = [0]
        Hs = [X[i]]

        # Forward propagation, saving intermediate results
        As.append(np.concatenate(([1], Hs[0]))) # Input layer

        for l in range(layers):
            Zs.append(np.dot(Thetas[l], As[l]))
            Hs.append(activations[l].f(Zs[l + 1]))
            As.append(np.concatenate(([1], Hs[l + 1])))

        y0 = one_hot(K, y[i])

        # Cross entropy
        J -= np.dot(y0.T, np.log(Hs[2]))+np.dot((1-y0).T, np.log(1-Hs[2]))

        # Backward propagation of the errors (written for the two-layer model above)
        deltas[2] = Hs[2] - y0
        deltas[1] = np.dot(Thetas[1][:, 1:].T, deltas[2]) * activations[0].df(Zs[1])

        # Accumulate the gradient contributions of this example
        Deltas[1] += np.outer(deltas[2], As[1])
        Deltas[0] += np.outer(deltas[1], As[0])

    J /= M

    grads = []
    grads.append(Deltas[0]/M)
    grads.append(Deltas[1]/M)

    return J, grads
word2vec Mikolov 2013
[Diagram: each word wj enters the network as a one hot vector; the weight matrix Θ1 holds the word embeddings and Θ2 the context embeddings, and an activation function at the output relates each word wj to its neighbours (wj+1, …)]
“You shall know a word by the company it keeps”
(J. R. Firth)
Analogies
• The embedding of each word is a function of the context it appears in:
embedding(red) = f (context(red))
• words that appear in similar contexts will have similar embeddings:
[Plot: countries (France, Italy, Portugal, USA, …) cluster together in the embedding space ("country context"), and geometrical relations connect each country to its capital, e.g. France → Paris]
Feed Forward Networks
[Diagram: information flows from the input xt to the output ht]
ht = f (xt)
Recurrent Neural Network (RNN)
[Diagram: the output ht depends on the input xt and on the previous output ht−1, which is fed back into the network]
ht = f (xt, ht−1)
Recurrent Neural Network (RNN)
• Each output depends (implicitly) on all previous outputs.
[Diagram: the network unrolled in time, with inputs xt−1, xt, xt+1, hidden states ht−2 … ht+1, and outputs ht−1, ht, ht+1]
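A minimal, illustrative sketch of one recurrent step (tanh is a common choice for f, not necessarily the one used in practice; the shapes and values are toy choices):

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # h_t = f(x_t, h_{t-1}): the new state mixes the current input and the previous state
    return np.tanh(np.dot(W_x, x_t) + np.dot(W_h, h_prev) + b)

W_x, W_h, b = np.random.randn(8, 4), np.random.randn(8, 8), np.zeros(8)
h = np.zeros(8)
for x_t in np.random.randn(5, 4):      # a toy sequence of 5 inputs
    h = rnn_step(x_t, h, W_x, W_h, b)  # each output depends on all previous inputs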
Long-Short Term Memory (LSTM)
• What if we want to keep explicit information about previous states (memory)?
[Diagram: the same unrolled network, but with an explicit memory cell ct−1, ct, ct+1 passed from step to step alongside the hidden state ht]
Convolutional Neural Networks
Curve Fitting?
Interpretability?
“Deep” learning
Events
www.data4sci.com/newsletter
www.data4sci.com