0% found this document useful (0 votes)

72 views

Deep Learning From Scratch

Uploaded by

Amit Meher

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

72 views

Deep Learning From Scratch

Uploaded by

Amit Meher

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 96

Deep Learning From Scratch

www.data4sci.com/newsletter
https://github.com/DataForScience/DeepLearning
References https://github.com/DataForScience/DeepLearning

Neural Networks for Machine Learning Machine Learning

Geoff Hinton Andrew Ng

@bgoncalves www.data4sci.com
Requirements

https://github.com/DataForScience/DeepLearning

@bgoncalves www.data4sci.com
@bgoncalves www.data4sci.com
Machine Learning

@bgoncalves www.data4sci.com
What about Neurons?
Biological Neuron

@bgoncalves www.data4sci.com
How the Brain “Works” (Cartoon version)

@bgoncalves www.data4sci.com
How the Brain “Works” (Cartoon version)
• Each neuron receives input from other neurons
• 1011 neurons, each with with 104 weights
• Weights can be positive or negative
• Weights adapt during the learning process
• “neurons that fire together wire together” (Hebb)
• Different areas perform different functions using same structure (Modularity)

@bgoncalves www.data4sci.com
How the Brain “Works” (Cartoon version)

Inputs f(Inputs) Output

@bgoncalves www.data4sci.com
@bgoncalves www.data4sci.com
Perceptron
Bias
1
w
x1 0j
w
1j
x2 w2j
zj
w 3j wT x
x3
j
N
w

Inputs Weights Output

@bgoncalves www.data4sci.com
Perceptron - Forward Propagation
• The output of a perceptron is determined by a sequence of steps:

• obtain the inputs

• multiply the inputs by the respective weights

1
w
x1 0j
w
1j
x2 w2j
zj
T
w 3j w x
x3
j
N
w

xN
@bgoncalves www.data4sci.com
Perceptron - Forward Propagation

def forward(Theta, X, active):

N = X.shape[0]

# Add the bias column

X_ = np.concatenate((np.ones((N, 1)), X), 1)

# Multiply by the weights

z = np.dot(X_, Theta.T)

return a

@bgoncalves www.data4sci.com
Perceptron - Training
• Training Procedure:

• If correct, do nothing

• If output incorrectly outputs 0, add input to weight

vector
1
• if output incorrectly outputs 1, subtract input to
weight vector w
x1 0j

• Guaranteed to converge, if a correct set of weights

w
1j
exists x2 w2j
zj
• Given enough features, perceptrons can learn almost T
w 3j w x
anything x3
j
• Specific Features used limit what is possible to learn w N

xN
@bgoncalves www.data4sci.com
1
Linear boundaries 0

AND OR NOR

@bgoncalves www.data4sci.com
1
Linear boundaries 0
• Perceptrons rely on hyperplanes to separate the data points. Unfortunately, this is not always
possible:
AND OR NOR

Possible Possible Possible

XOR

Impossible
@bgoncalves www.data4sci.com
Linear Boundaries

@bgoncalves www.data4sci.com
Code - Perceptron / Forward
Propagation
https://github.com/DataForScience/DeepLearning
Perceptron
1
w
x1 0j
w
1j
x2 w2j
zj
T
w 3j w x
x3
j
N
w

@bgoncalves www.data4sci.com
Perceptron
1
w
x1 0j
w
1j
x2 w2j
zj
T
w 3j w x
x3
j
N
w

• This is just a graphical representation of:

z = wT x
which is just linear regression!

@bgoncalves www.data4sci.com
Linear Regression
⃗
x ) x1
13 f( w1
y ≈ x 0+
Each point is w0 Add x0 ≡ 1
xi ⃗ = (x0, x1, ⋯, xn)
T
represented = to account
y
by a vector for intercept
9.75

6.5
y

3.25

0
0 5 10 15 20
x1
@bgoncalves www.data4sci.com
Optimization Problem https://github.com/DataForScience/DeepLearning

• (Machine) Learning can be thought of as an optimization problem.

• Optimization Problems have 3 distinct pieces:

• The constraints

• The function to optimize

• The optimization algorithm.

@bgoncalves www.data4sci.com
Optimization Problem https://github.com/DataForScience/DeepLearning

• (Machine) Learning can be thought of as an optimization problem.

• Optimization Problems have 3 distinct pieces:

• The constraints
Problem Representation
• The function to optimize
Prediction Error
• The optimization algorithm.
Gradient Descent

@bgoncalves www.data4sci.com
Linear Regression
• We are assuming that our functional dependence is of the form:

f ( x )⃗ = w0 + w1x1 + ⋯ + wn xn ≡ X w ⃗

• In other words, at each step, our hypothesis is:

hw (X) = X w ⃗ ≡ ŷ

Feature M
Feature 1
Feature 2
Feature 3
…

value
and it imposes a Constraint on the solutions that can be found.
Sample 1
Sample 2
• We quantify our far our hypothesis is from the correct value using Sample 3
Sample 4
an Error Function: Sample 5

1
[ ]
2 Sample 6
Jw (X, y )⃗ = (
(i)
)
(i) X y
2m ∑
h w x − y .

i
or, vectorially:
1
Jw (X, y )⃗ = [X w ⃗ − y ]⃗
2
2m Sample N

@bgoncalves www.data4sci.com
Linear Regression

@bgoncalves www.data4sci.com
Geometric Interpretation
13
1
Jw (X, y )⃗ = [ ⃗ ⃗
]
2
X w − y
2m

9.75

6.5

Quadratic error
means that an error
3.25 twice as large is
penalized four times
as much.

0
0 5 10 15 20

@bgoncalves www.data4sci.com
Gradient Descent
• Goal: Find the minimum of Jw (X, y )⃗ by varying the components of w ⃗

• Intuition: Follow the slope of the error function until convergence

δ
− Jw (X, y )⃗
δ w⃗
Jw (X, y )⃗

δ
− Jw (X, y )⃗
δ w⃗

Jw (X, y ⃗)
• Algorithm:

⃗ (initial values of the parameters)

• Guess w (0)
step size
• Update until “convergence”:
δ
Jw (X, y )⃗
δ 1 T
wj = wj − α Jw (X, y )⃗ = X ⋅ (hw (X ) − y )⃗
δwj δwj m
@bgoncalves www.data4sci.com
`
Geometric Interpretation

2D 3D nD
13

9.75

6.5

3.25

0
0 5 10 15 20

y = w0 + w1x1 y = w0 + w1x1 + w2 x2 y = X w⃗

Add x0 ≡ 1
Finds the hyperplane that
to account
splits the points in two for intercept
such that the errors on
each side balance out

@bgoncalves www.data4sci.com
Code - Linear Regression
https://github.com/DataForScience/DeepLearning
Learning Procedure

Constraint

input predicted observed

hypothesis
output output
XT hw (X) ŷ ⃗ Jw (X, y )⃗

Learning Error
Algorithm Function

@bgoncalves www.data4sci.com
Learning Procedure

Constraint

input predicted observed

hypothesis
output output
XT hw (X) = ϕ (X T w) ŷ ⃗ Jw (X, y )⃗

Which we
can redefine

Learning Error
Algorithm Function

@bgoncalves www.data4sci.com
Learning Procedure

Constraint

T predicted observed
input z=X w hypothesis
output output
XT ϕ (z) ŷ ⃗ Jw (X, y )⃗
And
rewrite

Learning Error
Algorithm Function

@bgoncalves www.data4sci.com
Logistic Regression (Classification)
• Not actually regression, but rather Classification

• Predict the probability of instance belonging to the given class:

hw (X) ∈ [0,1]
1 - part of the class
• Use sigmoid/logistic function to map weighted inputs to[0,1]
0 - otherwise
hw (X) = ϕ (X w )⃗

z encapsulates all
the parameters and
input values

@bgoncalves
maximize the value
of z for members of
Geometric Interpretation the class

1
ϕ (z) ≥
2

@bgoncalves www.data4sci.com
Logistic Regression
• Error Function - Cross Entropy
1 T
Jw (X, y )⃗ = − [y log (hw (X)) + (1 − y) log (1 − hw (X))]
T
m
measures the “distance” between two probability distributions
1
hw (X) =
1 + e −X w ⃗
• Effectively treating the labels as probabilities (an instance with label=1 has Probability 1 of
belonging to the class).

• Gradient - same as Logistic Regression

δ
wj = wj − α Jw (X, y )⃗
δwj
δ 1 T
Jw (X, y )⃗ = X ⋅ (hw (X ) − y )⃗
δwj m

@bgoncalves www.data4sci.com
Iris dataset

@bgoncalves www.data4sci.com
Code - Logistic Regression
https://github.com/DataForScience/DeepLearning
Logistic Regression

@bgoncalves www.data4sci.com
Logistic Regression

@bgoncalves www.data4sci.com
Learning Procedure

Constraint

T predicted observed
input z=X w hypothesis
output output
XT ϕ (z) ŷ ⃗ Jw (X, y )⃗

Learning Error
Algorithm Function

@bgoncalves www.data4sci.com
Comparison
• Linear Regression • Logistic Regression

z = X w⃗ z = X w⃗
Map features to a
continuous variable

hw (X) = ϕ (Z ) hw (X) = ϕ (Z ) Compare

prediction with
reality
1 Predict based on
ϕ (Z ) = Z ϕ (Z ) =
1 + e −Z continuous variable

1 1 T
⃗
Jw (X, y ) = ⃗
[hw (X) − y ] Jw (X, y ) = − [y log (hw (X)) + (1 − y) log (1 − hw (X))]
⃗
2 T
2m m

δ 1 T δ 1 T
Jw (X, y )⃗ = X ⋅ (hw (X ) − y )⃗ Jw (X, y )⃗ = X ⋅ (hw (X ) − y )⃗
δwj m δwj m

@bgoncalves www.data4sci.com
Learning Procedure
Bias
1
w
x1 0j
w
1j
x2 w2j
zj
w 3j wT x (z)
x3
j
N
w

xN
Activation
Inputs Weights function

@bgoncalves www.data4sci.com
Generalized Perceptron
Bias
1
By changing the
w activation function,
x1 0j we change the
w underlying algorithm
1j
x2 w2j
zj
w 3j wT x (z)
x3
j
N
w

xN
Activation
Inputs Weights function

@bgoncalves www.data4sci.com
Activation Function
• Non-Linear function

• Differentiable

• non-decreasing

• Compute new sets of features

• Each layer builds up a more abstract representation of the data

@bgoncalves www.data4sci.com
Activation Function - Linear
• Non-Linear function

• Differentiable

• non-decreasing

ϕ (z) = z
• Compute new sets of features

• Each layer builds up a more abstract

representation of the data

• The simplest

@bgoncalves www.data4sci.com
Activation Function - Linear
• Non-Linear function

• Differentiable Linear Regression

• non-decreasing

ϕ (z) = z
• Compute new sets of features

• Each layer builds up a more abstract

representation of the data

• The simplest

@bgoncalves www.data4sci.com
Activation Function - Sigmoid
• Non-Linear function

• Differentiable

• non-decreasing

1
• Compute new sets of features ϕ (z) =
1 + e −z
• Each layer builds up a more abstract
representation of the data

• Perhaps the most common

@bgoncalves www.data4sci.com
Activation Function - Sigmoid
• Non-Linear function

• Differentiable Logistic Regression

• non-decreasing

1
• Compute new sets of features ϕ (z) =
1 + e −z
• Each layer builds up a more abstract
representation of the data

• Perhaps the most common

@bgoncalves www.data4sci.com
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:

• obtain the inputs

• multiply the inputs by the respective weights

• calculate output using the activation function

@bgoncalves www.data4sci.com
Code - Forward Propagation
https://github.com/DataForScience/DeepLearning
Activation Function - ReLu
• Non-Linear function

• Differentiable

• non-decreasing

ϕ (z) = z, z > 0
• Compute new sets of features

• Each layer builds up a more abstract

representation of the data

• Results in faster learning than with

sigmoid

@bgoncalves www.data4sci.com
Activation Function - ReLu
• Non-Linear function

• Differentiable Stepwise Regression

• non-decreasing

ϕ (z) = z, z > 0
• Compute new sets of features

• Each layer builds up a more abstract

representation of the data

• Results in faster learning than with

sigmoid

@bgoncalves www.data4sci.com
Stepwise Regression https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline

• Multivariate Adaptive Regression Spline (MARS) is the best known example

• Fit curves using a linear combination of:

f ̂ (x) =
∑
ci Bi (x)
i
• The basis functions can be:
Bi (x)

• Constant

• “Hinge” functions of the form: and

max(0, x − b) max(0, b − x)

• Products of hinges

@bgoncalves www.data4sci.com
Stepwise Regression https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline

• Multivariate Adaptive Regression Spline (MARS) is the best known example

• Fit curves using a linear combination of: f ̂ (x) = ∑ ci Bi (x)

y (x) = 1.013
+ 1.198 max (0, x 0.485)
1.803 max (0, 0.485 x)
1.321 max (0, x 0.283)
1.609 max (0, x 0.640)
<latexit sha1_base64="8sdMuX6MlC9h4jWMaRP7SJkiDEU=">AAAC9XicbZLPb9MwFMedjB8jA9axIxeLimr8WGS3+5EekCZx4Tgkuk1qqspxndaa4wTbYY2i7u/gwgGEuPK/cOO/wWlzKFmfZOmr9z7v+b1nR5ng2iD013G37t1/8HD7kbfz+MnT3dbeswud5oqyAU1Fqq4iopngkg0MN4JdZYqRJBLsMrp+X8UvvzCleSo/mSJjo4RMJY85Jca6xnvOThixKZcl+yyJUqR4vfCKULDYHMxDxacz86rzroN9hHth6HXeWIn7AbyFCZmvMPQWzuEhRP5RcFxnVOShJQPUa5BLytLzBtnr4o01u0GvQZ6g/kby5AitkVWfx/3NNfvotCa9kMnJ2uTjVhv5aGnwrsC1aIPazsetP+EkpXnCpKGCaD3EKDOjkijDqWALL8w1ywi9JlM2tFKShOlRuXy1BXxpPRMYp8oeaeDSu55RkkTrIoksmRAz081Y5dwUG+YmDkYll1lumKSri+JcQJPC6gvACVeMGlFYQajitldIZ0QRauxH8ewScHPku+Ki62Pk44/d9llQr2MbPAcvwAHA4BScgQ/gHAwAdZTz1fnu/HBv3G/uT/fXCnWdOmcf/Gfu739e5dqD</latexit>
+ 1.591 max (0, x 0.907)

@bgoncalves www.data4sci.com
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate output using the activation function
• To create a multi-layer perceptron, you can simply use the output of one layer as the input to
the next one.
1
1
w
w a1 0k
x1 0j w
1k
w
1j a2 w2k
x2 w2j ak
w 3k wT a
w 3j wT x aj
x3
k
j wN
wN
aN
xN

• But how can we propagate back the errors and update the weights?
@bgoncalves www.data4sci.com
Stepwise Regression https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline

f ̂ (x) =
∑
ci Bi (x)
i

1
1

x ReLu
1

x ReLu Linear

x ReLu

@bgoncalves www.data4sci.com
Loss Functions
• For learning to occur, we must quantify how far off we are from the desired output. There are
two common ways of doing this:

• Quadratic error function:

1 X 2
E= |yn an |
N n
• Cross Entropy
1 Xh T T
i
J= yn log an + (1 yn ) log (1 an )
N n

The Cross Entropy is complementary to sigmoid

activation in the output layer and improves its stability.

@bgoncalves www.data4sci.com
Regularization
• Helps keep weights relatively small by adding a penalization to the cost function.

• Two common choices:

Jŵ (X) = Jw (X) + λ

∑
wij “Lasso”
ij

Jŵ (X) = Jw (X) + λ wij2

∑ L2
ij

• Lasso helps with feature selection by driving less important weights to zero

@bgoncalves www.data4sci.com
Backward Propagation of Errors (BackProp)
• BackProp operates in two phases:

• Forward propagate the inputs and calculate the deltas

• Update the weights

• The error at the output is a weighted average difference between predicted output and the
observed one.

• For inner layers there is no “real output”!

@bgoncalves www.data4sci.com
BackProp
• Let δ (l) be the error at each of the total L layers:

• Then:
δ (L) = hw (X) − y

• And for every other layer, in reverse order:

δ (l) = W (l)T δ (l+1). * ϕ † (z (l))

• Until:
δ (1) ≡ 0
as there’s no error on the input layer.

• And finally:

Δ(l)
ij
= Δ (l)
ij
+ a (l) (l+1)
j i
δ
∂ 1 (l)
⃗
(l) w ( )
(l)
J X, y = Δ + λw
∂wij m ij ij

@bgoncalves www.data4sci.com
A practical example - MNIST
8

Feature M
Feature 1
Feature 2
Feature 3
…

Label
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
.

Sample N

yann.lecun.com/exdb/mnist/
@bgoncalves
A practical example - MNIST

Feature M
Feature 1
Feature 2
Feature 3
…

Label
Sample 1
Sample 2
Sample 3
s: Sample 4
er

arg max
Sample 5
Sample 6
er X ⇥1 ⇥2 .
er X y

Sample N

yann.lecun.com/exdb/mnist/
@bgoncalves
A practical example - MNIST
def forward(Theta, X, active):
N = X.shape[0]
5000 examples

# Add the bias column

X_ = np.concatenate((np.ones((N, 1)), X), 1)

arg max
# Multiply by the weights
X ⇥1 ⇥2 z = np.dot(X_, Theta.T)

# Apply the activation function

a = active(z)

return a

Vectors 784 50 10 1
Matrices 50 ⇥ 785 10 ⇥ 51 Forward Propagation
Matrices 50 ⇥ 784 10 ⇥ 50 Backward Propagation

def predict(Theta1, Theta2, X):

h1 = forward(Theta1, X, sigmoid)
h2 = forward(Theta2, h1, sigmoid)

return np.argmax(h2, 1)

@bgoncalves www.data4sci.com
Code - Simple Network
https://github.com/DataForScience/DeepLearning
Practical Considerations
• So far we have looked at very idealized cases. Reality is never this simple!

• In practice, many details have to be considered:

• Data normalization

• Overfitting

• Hyperparameters

• Bias, Variance tradeoffs

• etc…

@bgoncalves www.data4sci.com
Data Normalization
• The range of raw data values can vary widely.
• Using feature with very different ranges in the same analysis can cause numerical problems.
Many algorithms are linear or use euclidean distances that are heavily influenced by the
numerical values used (cm vs km, for example)
• To avoid difficulties, it’s common to rescale the range of all features in such a way that each
feature follows within the same range.
• Several possibilities:
x xmin
• Rescaling - x̂ =
xmax xmin
x µx
• Standardization - x̂ =
x
x
• Normalization - x̂ =
||x||

• In the rest of the discussion we will assume that the data has been normalized in some

@bgoncalves www.data4sci.com
Supervised Learning - Overfitting

Feature M
Feature 1
Feature 2
Feature 3
…

value
Sample 1
• “Learning the noise” Sample 2
Sample 3

Training
• “Memorization” instead of “generalization” Sample 4
Sample 5
Sample 6
• How can we prevent it? .

• Split dataset into two subsets: Training and Testing

Testing
• Train model using only the Training dataset and evaluate results in the previously unseen
Testing dataset. Sample N

• Different heuristics on how to split:

• Single split

• k-fold cross validation: split dataset in k parts, train in k-1 and evaluate in 1, repeat k
times and average results.

@bgoncalves www.data4sci.com
Bias-Variance Tradeoff

@bgoncalves www.data4sci.com
Bias-Variance Tradeoff
High Bias Low Bias
Low Variance High Variance

Training
Error

Testing

Variance
Bias

Model Complexity

@bgoncalves www.data4sci.com
Learning Rate
δ
wij = wij − α Jw (X, y )⃗
δwij
δ
↵ defines size of step in direction of Jw (X, y )⃗
δwij

Very High Learning Rate

Loss

High Learning Rate

Low Learning Rate
Best Learning Rate

Epoch

@bgoncalves www.data4sci.com
Tips
• online learning - update weights after each case
- might be useful to update model as new data is obtained
- subject to fluctuations

• mini-batch - update weights after a “small” number of cases

- batches should be balanced
- if dataset is redundant, the gradient estimated using only a fraction of the data
is a good approximation to the full gradient.

• momentum - let gradient change the velocity of weight change instead of the value directly

• rmsprop - divide learning rate for each weight by a running average of “recent” gradients

• learning rate - vary over the course of the training procedure and use different learning rates
for each weight

@bgoncalves www.data4sci.com
Generalization
• Neural Networks are extremely modular in their design with

• Fortunately, we can write code that is also modular and can class Activation(object):
def f(z):
easily handle arbitrary numbers of layers pass

def df(z):
pass
• Let’s describe the structure of our network as a list of weight
matrices and activation functions class Linear(Activation):
def f(z):
return z

• We also need to keep track of the gradients of the activation def df(z):
return np.ones(z.shape)
functions so let us define a simple class:
class Sigmoid(Activation):
def f(z):
return 1./(1+np.exp(-z))

def df(z):
h = Sigmoid.f(z)
return h*(1-h)

@bgoncalves www.data4sci.com
Generalization
• Now we can describe our simple MNIST model with:

Thetas = []
Thetas.append(init_weights(input_layer_size, hidden_layer_size))
Thetas.append(init_weights(hidden_layer_size, num_labels))

model = []

model.append(Thetas[0])
model.append(Sigmoid)
model.append(Thetas[1])
model.append(Sigmoid)

• Where Sigmoid is an object that contains both the sigmoid function and its gradient as was
defined in the previous slide.

@bgoncalves www.data4sci.com
Generalization - Forward propagation

def forward(Theta, X, active):

N = X.shape[0]

# Add the bias column

X_ = np.concatenate((np.ones((N, 1)), X), 1)

# Multiply by the weights

z = np.dot(X_, Theta.T)

# Apply the activation function

a = active.f(z)

return a

def predict(model, X):

h = X.copy()

for i in range(0, len(model), 2):

theta = model[i]
activation = model[i+1]

h = forward(theta, h, activation)

return np.argmax(h, 1)

@bgoncalves www.data4sci.com
def backprop(model, X, y):
M = X.shape[0]

Thetas = model[0::2]
activations = model[1::2]
layers = len(Thetas)

K = Thetas[-1].shape[0]
J = 0
Deltas = []

for i in range(layers):
Deltas.append(np.zeros(Thetas[i].shape))

deltas = [0, 0, 0, 0]

for i in range(M):
As = []
Zs = [0]
Hs = [X[i]]
# Forward propagation, saving intermediate results
As.append(np.concatenate(([1], Hs[0]))) # Input layer

for l in range(1, layers+1):

Zs.append(np.dot(Thetas[l-1], As[l-1]))
Hs.append(activations[l-1].f(Zs[l]))
As.append(np.concatenate(([1], Hs[l])))

y0 = one_hot(K, y[i])

# Cross entropy
J -= np.dot(y0.T, np.log(Hs[2]))+np.dot((1-y0).T, np.log(1-Hs[2]))

# Calculate the weight deltas

deltas[layers] = Hs[layers]-y0

for l in range(layers-1, 1, -1):

deltas[l] = np.dot(Thetas[l-1].T, deltas[l+1])[1:]*activations[l-1].df(Zs[l-1])
Deltas[l] += np.outer(deltas[l+1], As[l])

J /= M
grads = []
grads.append(Deltas[0]/M)
grads.append(Deltas[1]/M)

@bgoncalves return [J, grads]

Code - Modular Network
https://github.com/DataForScience/DeepLearning
Neural Network Architectures

@bgoncalves www.data4sci.com
word2vec Mikolov 2013

Skipgram Continuous Bag of Words

max p (C|w) max p (w|C)

1
wj

wj
⇥2 ⇥1 word embeddings ⇥2

⇥2 context embeddings
wj

wj
⇥1 ⇥1
wj one hot vector
⇥2 ⇥2
activation function
wj+1

wj+1

Word Context Context Word

@bgoncalves www.data4sci.com
Visualization

www.data4sci.com
“You shall know a word by the company it keeps”
(J. R. Firth)
Analogies
• The embedding of each word is a function of the context it appears in:
(red) = f (context (red))
• words that appear in similar contexts will have similar embeddings:

context (red) ⇡ context (blue) =) (red) ⇡ (blue)

• “Distributional hypotesis” in linguistics

France
Italy Portugal Country context
Geometrical relations Paris
USA

between contexts imply Capital context Rome Lisbon

Washington DC
semantic relations
between words!

(F rance) (P aris) + (Rome) = (Italy)

~b ~a + ~c = d~
@bgoncalves www.data4sci.com
Analogies https://www.tensorflow.org/tutorials/word2vec

What is the word d that is most similar to

~b ~a + ~c = d~
b and c and most dissimilar to a?
⇣ ⌘T
~b ~a + ~c
d† = argmax ~x
x ~b ~a + ~c
⇣ ⌘
d† ⇠ argmax ~bT ~x ~aT ~x + ~cT ~x
x

@bgoncalves www.data4sci.com
Feed Forward Networks

ht Output

xt Input

ht = f (xt)

@bgoncalves www.data4sci.com
Feed Forward Networks

ht Output

Information
Flow

xt Input

ht = f (xt)

@bgoncalves www.data4sci.com
Information
Recurrent Neural Network (RNN) Flow

ht Output

Previous ht−1
Output
xt Input

ht = f (xt, ht−1)

@bgoncalves www.data4sci.com
Recurrent Neural Network (RNN)
• Each output depends (implicitly) on all previous outputs.

• Input sequences generate output sequences (seq2seq)

ht−1 ht ht+1
ht−2 ht−1 ht ht+1

xt−1 xt xt+1

@bgoncalves www.data4sci.com
Long-Short Term Memory (LSTM)
• What if we want to keep explicit information about previous states (memory)?

• How much information is kept, can be controlled through gates.

• LSTMs were first introduced in 1997 by Hochreiter and Schmidhuber

ht−1 ht ht+1
ct−2 ct−1 ct ct+1
ht−2 ht−1 ht ht+1

xt−1 xt xt+1

@bgoncalves www.data4sci.com
Convolutional Neural Networks

@bgoncalves www.data4sci.com
@bgoncalves
Curve Fitting?

@bgoncalves www.data4sci.com
Interpretability?

@bgoncalves www.data4sci.com
“Deep” learning

@bgoncalves www.data4sci.com
Events
www.data4sci.com/newsletter

Time Series for Everyone

Jan 17, 2019 - 5am-9am (PST)

Applied Probability Theory for Everyone

Jan 27, 2019 - 5am-9am (PST)

Time series modeling:

ML and deep learning
approaches

Natural Language Processing (NLP) from Scratch

http://bit.ly/LiveLessonNLP - On Demand

@bgoncalves
@bgoncalves www.data4sci.com
www.data4sci.com

6.86x Machine Learning With Python: Linear Classifiers
No ratings yet
6.86x Machine Learning With Python: Linear Classifiers
7 pages
Deep Learning From Scratch
No ratings yet
Deep Learning From Scratch
96 pages
Bruno Gonçalves: Deep Learning From Scratch
No ratings yet
Bruno Gonçalves: Deep Learning From Scratch
95 pages
NN Theory
No ratings yet
NN Theory
138 pages
Session 6 Machine Learning Algorithms
No ratings yet
Session 6 Machine Learning Algorithms
46 pages
NLP-NeuralNetworks Reading Notes
No ratings yet
NLP-NeuralNetworks Reading Notes
13 pages
Week3_LearningI
No ratings yet
Week3_LearningI
48 pages
Machine Learning Notes Cs229 1
No ratings yet
Machine Learning Notes Cs229 1
217 pages
Lecture 3_Regression (1)
No ratings yet
Lecture 3_Regression (1)
47 pages
6.036: Intro To Machine Learning: Lecturer: Professor Leslie Kaelbling Notes By: Andrew Lin Fall 2019
No ratings yet
6.036: Intro To Machine Learning: Lecturer: Professor Leslie Kaelbling Notes By: Andrew Lin Fall 2019
50 pages
2021 Logistic Regression
No ratings yet
2021 Logistic Regression
33 pages
05_optimization_basics
No ratings yet
05_optimization_basics
94 pages
2021 10 11 - Intro ML - Inserm
No ratings yet
2021 10 11 - Intro ML - Inserm
41 pages
Learning With Linear Neurons: Adapted From Lectures by Geoffrey Hinton and Others Updated by N. Intrator, May 2007
No ratings yet
Learning With Linear Neurons: Adapted From Lectures by Geoffrey Hinton and Others Updated by N. Intrator, May 2007
59 pages
2. Linear_ Regression_SGD
No ratings yet
2. Linear_ Regression_SGD
71 pages
02A-DL2023-NN-basics
No ratings yet
02A-DL2023-NN-basics
52 pages
Lecture 17&18 - Introduction To Machine Learning
No ratings yet
Lecture 17&18 - Introduction To Machine Learning
51 pages
Lecture1
No ratings yet
Lecture1
56 pages
Chapter 2 - 2 Shallow neural network 2_2
No ratings yet
Chapter 2 - 2 Shallow neural network 2_2
34 pages
1.1 Introduction
No ratings yet
1.1 Introduction
73 pages
NISS Deep Learning Tutorial
No ratings yet
NISS Deep Learning Tutorial
58 pages
Pattern Classification 10. Linear Perceptron, Least Squares & Multi-Layer Nns
No ratings yet
Pattern Classification 10. Linear Perceptron, Least Squares & Multi-Layer Nns
38 pages
Lecture 3 - Linear Regression
No ratings yet
Lecture 3 - Linear Regression
31 pages
cours1
No ratings yet
cours1
42 pages
Minsky y Papert
No ratings yet
Minsky y Papert
77 pages
2-Mathematical Optimization and Deep Learning
No ratings yet
2-Mathematical Optimization and Deep Learning
53 pages
Ece18898g Neural Networks
No ratings yet
Ece18898g Neural Networks
47 pages
Lecture 2
No ratings yet
Lecture 2
66 pages
Deep Learning
No ratings yet
Deep Learning
78 pages
Lecture8 DeepLearning
No ratings yet
Lecture8 DeepLearning
94 pages
Maths For ML
No ratings yet
Maths For ML
156 pages
Matematics and Machine Learning
No ratings yet
Matematics and Machine Learning
156 pages
Kagan Lecture2
No ratings yet
Kagan Lecture2
118 pages
DL ppt
No ratings yet
DL ppt
110 pages
DL - M2 - Deep Feedforward NN
No ratings yet
DL - M2 - Deep Feedforward NN
97 pages
ML Fundamentals by Bitspace
No ratings yet
ML Fundamentals by Bitspace
19 pages
Intro ML Linear Classifier
No ratings yet
Intro ML Linear Classifier
18 pages
Machine Learning Overview
No ratings yet
Machine Learning Overview
92 pages
26 Neural Nets
No ratings yet
26 Neural Nets
77 pages
BITS F464 ML Lecture Notes
No ratings yet
BITS F464 ML Lecture Notes
86 pages
CH 1
No ratings yet
CH 1
24 pages
Lect 1
No ratings yet
Lect 1
24 pages
Deep Learning (1)
No ratings yet
Deep Learning (1)
19 pages
Lec1 PerceptronPocket Recap
No ratings yet
Lec1 PerceptronPocket Recap
61 pages
Artificial Neural Networks
No ratings yet
Artificial Neural Networks
100 pages
Machine Learning Shortnote
No ratings yet
Machine Learning Shortnote
14 pages
Lec3 MLP Optimization
No ratings yet
Lec3 MLP Optimization
86 pages
Introduction of Machine Learning
No ratings yet
Introduction of Machine Learning
61 pages
AI2025_Lecture02_recording_slides (1)
No ratings yet
AI2025_Lecture02_recording_slides (1)
52 pages
S02_DNN_Perceptron_wip
No ratings yet
S02_DNN_Perceptron_wip
24 pages
L10 Learning II Gradient Based Learning
No ratings yet
L10 Learning II Gradient Based Learning
72 pages
ECS171: Machine Learning: Lecture 1: Overview of Class, LFD 1.1, 1.2
No ratings yet
ECS171: Machine Learning: Lecture 1: Overview of Class, LFD 1.1, 1.2
29 pages
3 Non Linear Classifiers
No ratings yet
3 Non Linear Classifiers
74 pages
Mathematical Foundations of Computational Linguistics: Manfred Klenner and Jannis Vamvas
No ratings yet
Mathematical Foundations of Computational Linguistics: Manfred Klenner and Jannis Vamvas
32 pages
03 Ai
No ratings yet
03 Ai
59 pages
Anuranan Das Summer of Sciences, 2019. Understanding and Implementing Machine Learning
No ratings yet
Anuranan Das Summer of Sciences, 2019. Understanding and Implementing Machine Learning
17 pages
Linear Models
No ratings yet
Linear Models
30 pages
383-Fall11-Lec19
No ratings yet
383-Fall11-Lec19
30 pages
Machine Learning - Classifiers and Boosting: Reading CH 18.6-18.12, 20.1-20.3.2
No ratings yet
Machine Learning - Classifiers and Boosting: Reading CH 18.6-18.12, 20.1-20.3.2
54 pages
Write Now!: A Guide to Getting Started
From Everand
Write Now!: A Guide to Getting Started
Mary F Burke
No ratings yet
Agent Based
No ratings yet
Agent Based
38 pages
Insight Report - What Is Pathways in ADC Research
No ratings yet
Insight Report - What Is Pathways in ADC Research
6 pages
Information Retrieval
No ratings yet
Information Retrieval
1 page
NLP
No ratings yet
NLP
1 page
Chapter8 Switching
No ratings yet
Chapter8 Switching
60 pages
Seismic Engg MCQ
No ratings yet
Seismic Engg MCQ
45 pages
Refitting The Clutch
No ratings yet
Refitting The Clutch
5 pages
Irjet V11i233
No ratings yet
Irjet V11i233
4 pages
DrWeb Crash
No ratings yet
DrWeb Crash
8 pages
Finite Element and Experimental Analyses of An Armoured Vehicle
No ratings yet
Finite Element and Experimental Analyses of An Armoured Vehicle
7 pages
Geosynthetics Filter Codes of Practice
No ratings yet
Geosynthetics Filter Codes of Practice
42 pages
Final Revision - Grade 6 - 2020
No ratings yet
Final Revision - Grade 6 - 2020
19 pages
222 Chapter 1
No ratings yet
222 Chapter 1
22 pages
Project Report Techno Task Management: XXXXXXX
No ratings yet
Project Report Techno Task Management: XXXXXXX
121 pages
Automated Solar Powered Irrigation System: Page No:1 Lodha Kalyani Bipinchand
No ratings yet
Automated Solar Powered Irrigation System: Page No:1 Lodha Kalyani Bipinchand
20 pages
UFO Turn Upside-Down All My Life
No ratings yet
UFO Turn Upside-Down All My Life
14 pages
2026 2028 Syllabus
No ratings yet
2026 2028 Syllabus
56 pages
Cell and Molecular Biology Concepts and Experiments 6th Edition Karp Test Bank
100% (55)
Cell and Molecular Biology Concepts and Experiments 6th Edition Karp Test Bank
10 pages
United States Patent: (75) Inventors: Prashant Anil Tatake, Mumbai (IN)
No ratings yet
United States Patent: (75) Inventors: Prashant Anil Tatake, Mumbai (IN)
8 pages
Engineering Mechanics Lecture # 03: Engr. Waqar Ahmad M. SC Structural Engineering
No ratings yet
Engineering Mechanics Lecture # 03: Engr. Waqar Ahmad M. SC Structural Engineering
21 pages
LBX 6513DS VTM
No ratings yet
LBX 6513DS VTM
4 pages
Laboratory Session One and Two (Biol-1012)
No ratings yet
Laboratory Session One and Two (Biol-1012)
12 pages
Nano Tech in Construction Industry
No ratings yet
Nano Tech in Construction Industry
19 pages
Pages From 230303 - 1632 - Phase 2 - Concept Design - Issue 2
No ratings yet
Pages From 230303 - 1632 - Phase 2 - Concept Design - Issue 2
1 page
Complex Number
No ratings yet
Complex Number
130 pages
7 Inch Liner Cementing Program
No ratings yet
7 Inch Liner Cementing Program
44 pages
Gov. Pablo Borbon Campus II, Alangilan Batangas City, Philippines 4200
No ratings yet
Gov. Pablo Borbon Campus II, Alangilan Batangas City, Philippines 4200
48 pages
Sample LAS S10 - Q1 - Week1
No ratings yet
Sample LAS S10 - Q1 - Week1
9 pages
Wastewater Treatment Lecture Material
100% (4)
Wastewater Treatment Lecture Material
203 pages
18 Mud Calculations
No ratings yet
18 Mud Calculations
2 pages
Basic Parameter To Calculate Rotary Vacuum Filter
No ratings yet
Basic Parameter To Calculate Rotary Vacuum Filter
5 pages
Block 13 ST3188
No ratings yet
Block 13 ST3188
20 pages
Difference Between Biological Neurons and Artificial Neurons
No ratings yet
Difference Between Biological Neurons and Artificial Neurons
2 pages
HP VEE 4.0 For Win 95 - NT Technical Specifications. HP. 1998
No ratings yet
HP VEE 4.0 For Win 95 - NT Technical Specifications. HP. 1998
10 pages