
Very Deep Learning

Lecture 05

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
[email protected]

M. Zeshan Afzal, Very Deep Learning Ch. 5


Recap

M. Zeshan Afzal, Very Deep Learning Ch. 5 2


Variants of Gradient Descent

◼ Gradient Descent
^ Updates after looking at the complete dataset
◼ Minibatch Gradient Descent
^ Updates after looking at N samples (the batch size)
◼ Stochastic Gradient Descent
^ Updates after looking at every single sample
◼ Related concept
^ Epoch
• one cycle through the full training dataset (see the sketch below)
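
The three variants above differ only in how many samples are consumed per parameter update. A minimal sketch, assuming a hypothetical grad_fn that returns the average gradient on a batch (not the lecture's code):

    import numpy as np

    def run_epoch(w, X, y, grad_fn, lr=0.1, batch_size=None):
        """One epoch: full-batch GD (batch_size=None), minibatch GD, or SGD (batch_size=1)."""
        n = len(X)
        bs = n if batch_size is None else batch_size
        idx = np.random.permutation(n)                     # shuffle once per epoch
        for start in range(0, n, bs):
            batch = idx[start:start + bs]
            w = w - lr * grad_fn(w, X[batch], y[batch])    # one update per batch
        return w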

M. Zeshan Afzal, Very Deep Learning Ch. 5 3


Logistic Regression (Decision Boundary)

◼ The decision boundary

◼ Decide for class 1

◼ Decide for class 0
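
Written out, this is the standard logistic-regression rule (w and b denote the usual weight vector and bias, which the slide's figures do not show explicitly):

    Decide class 1 if \sigma(w^\top x + b) \ge 0.5, i.e. if w^\top x + b \ge 0; otherwise decide class 0.
    The decision boundary is the hyperplane w^\top x + b = 0.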

M. Zeshan Afzal, Very Deep Learning Ch. 5 4


Simple Examples

M. Zeshan Afzal, Very Deep Learning Ch. 5 5


XOR

◼ Linear Classifier

𝒙𝟏 𝒙𝟐 XOR (𝒙𝟏 , 𝒙𝟐 )

0 0 0
0 1 1
1 0 1
1 1 0

M. Zeshan Afzal, Very Deep Learning Ch. 5 6


XOR (Multilayer Perceptron)

◼ Linear Classifier

𝒙𝟏 𝒙𝟐 XOR (𝒙𝟏 , 𝒙𝟐 )

0 0 0
0 1 1
1 0 1
1 1 0

M. Zeshan Afzal, Very Deep Learning Ch. 5 7


Representation Matters

M. Zeshan Afzal, Very Deep Learning Ch. 5 8


Neural Network Playground

◼ https://playground.tensorflow.org/

M. Zeshan Afzal, Very Deep Learning Ch. 5 9


Multilayer Perceptron

M. Zeshan Afzal, Very Deep Learning Ch. 5 10


Multilayer Perceptron

M. Zeshan Afzal, Very Deep Learning Ch. 5 11


A Brief History of Deep Learning

1940 1950 1960 1970 1980 1990 2000 2010 2020

◼ 1986 Backpropagation Algorithm


^ The backpropagation algorithm made it possible to train a neural
network based on the error fed back from the output
^ Allowed efficient calculation of the gradients with
respect to the weights

Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986.

M. Zeshan Afzal, Very Deep Learning Ch. 5 12


Activation Functions

M. Zeshan Afzal, Very Deep Learning Ch. 5 13


Sigmoid

◼ Maps input to range [0, 1]


◼ Historically popular since it has a nice interpretation as the saturating firing rate of a neuron

◼ Problems
^ Saturates: the gradients are killed
^ Outputs are not zero-centred

M. Zeshan Afzal, Very Deep Learning Ch. 5 14


Sigmoid

◼ Maps input to range [0, 1]

◼ Problems

Non-zero-centred outputs restrict the sign of the gradient updates, which makes optimisation inefficient (minibatches help)

M. Zeshan Afzal, Very Deep Learning Ch. 5 15


Tanh

◼ Maps input to range [-1, 1]

◼ Zero centred
◼ Antisymmetric
◼ Problem
^ Saturation kills the gradient

M. Zeshan Afzal, Very Deep Learning Ch. 5 16


ReLU

◼ Does not saturate (for x > 0)


◼ Leads to fast convergence
◼ Computationally efficient
◼ Problems
^ Not zero-centered
^ No learning for x < 0, which leads to dead ReLU units

M. Zeshan Afzal, Very Deep Learning Ch. 5 17


Leaky ReLU

◼ Does not saturate


◼ Closer to zero-centred
outputs
◼ Fast convergence
◼ Computationally efficient

M. Zeshan Afzal, Very Deep Learning Ch. 5 18


Parametric ReLU

◼ Does not saturate


◼ Closer to zero-centred
outputs
◼ Fast convergence
◼ Computationally efficient
◼ Parameter α is learned
from data

M. Zeshan Afzal, Very Deep Learning Ch. 5 19


ELU

◼ All benefits of leaky ReLU


◼ Adds some robustness to
noise
◼ Default value α = 1
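
A minimal NumPy sketch of the activations discussed on the previous slides (the α defaults are the commonly used values, not values fixed by the lecture):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))                      # output in (0, 1), saturates
    def tanh(x):
        return np.tanh(x)                                     # zero-centred, still saturates
    def relu(x):
        return np.maximum(0.0, x)                             # no saturation for x > 0
    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)                  # small slope for x < 0
    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))    # smooth negative saturation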

M. Zeshan Afzal, Very Deep Learning Ch. 5 20


Maxout

◼ Any continuous PWL function


can be expressed as a
difference of two convex PWL
functions.
◼ Any continuous function can
be approximated arbitrarily
well, by a piecewise linear
function.
◼ Generalizes ReLU and leaky ReLU
◼ Increases the number of parameters per neuron
Goodfellow, Ian, et al. "Maxout networks." International conference on machine learning. PMLR, 2013.

M. Zeshan Afzal, Very Deep Learning Ch. 5 21


Training

M. Zeshan Afzal, Very Deep Learning Ch. 5 22


Logistic Regression

◼ As a neural network

M. Zeshan Afzal, Very Deep Learning Ch. 5 23


Logistic Regression

◼ We have already seen the Maximum Likelihood Estimator

◼ We now perform a binary classification


◼ How should we choose the model in this case?

◼ Answer: Bernoulli distribution

where the probability ŷ is predicted by the model:

M. Zeshan Afzal, Very Deep Learning Ch. 5 24


Logistic Regression

◼ Putting it together

◼ In machine learning we use the more general term ‘loss function’ rather than ‘error function’
◼ We minimize the dissimilarity between the empirical data distribution
(defined by the training set) and the model distribution
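
Written out, this loss is the negative Bernoulli log-likelihood, i.e. the binary cross-entropy (standard notation, with \hat{y}_i the predicted probability for sample i):

    L(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]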

M. Zeshan Afzal, Very Deep Learning Ch. 5 25


Logistic Regression

◼ In summary, we have assumed the Bernoulli distribution

where ŷ is predicted by the model
◼ The question is how to choose ŷ
◼ We are working with a discrete distribution, i.e. y ∈ {0, 1}

◼ We can choose ŷ = σ(w^T x + b)

The sigmoid is given as follows:
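
    \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y} = \sigma(w^\top x + b)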

M. Zeshan Afzal, Very Deep Learning Ch. 5 26


Multinomial Distribution

◼ Categorical distribution

^ probability for class c

◼ Alternative notation

^ "one-hot vector" y with

^ only the true class set to 1, all others zero

M. Zeshan Afzal, Very Deep Learning Ch. 5 27


One Hot representation
Class y    One-hot vector

1          (1, 0, 0, 0)^T

2          (0, 1, 0, 0)^T

3          (0, 0, 1, 0)^T

4          (0, 0, 0, 1)^T

M. Zeshan Afzal, Very Deep Learning Ch. 5 28


Multinomial Distribution

◼ Categorical distribution

^ probability for class c

◼ Alternative notation
Example: the one-hot vector for class 1 is y = (1, 0, 0, 0)^T; with predicted class probabilities (0.5, 0.1, 0.2, 0.1) the likelihood is
p(y | x) = 0.5^1 × 0.1^0 × 0.2^0 × 0.1^0 = 0.5
^ "one-hot vector" y with
^ only the true class set to 1, all others zero

M. Zeshan Afzal, Very Deep Learning Ch. 5 29


Categorical Distribution / Cross Entropy Loss
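
For a one-hot label y and predicted class probabilities p, the standard forms are

    p(y \mid x) = \prod_{c=1}^{C} p_c^{\,y_c}, \qquad
    \mathcal{L}_{CE}(y, p) = -\sum_{c=1}^{C} y_c \log p_c = -\log p_{c^*}

where c^* is the index of the true class.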

M. Zeshan Afzal, Very Deep Learning Ch. 5 30


Softmax

◼ How can we ensure that the network predicts a valid categorical (discrete) distribution?


◼ We must guarantee
^ p_c ≥ 0 for all c, and Σ_c p_c = 1
◼ An element-wise sigmoid as output function would ensure the first condition only
◼ Solution: Softmax function

◼ Let s denote the network output after the last affine layer (=scores). Then:
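
In standard notation, with C classes:

    \text{softmax}(s)_c = \frac{e^{s_c}}{\sum_{k=1}^{C} e^{s_k}}, \qquad c = 1, \dots, C

This makes every output positive and the outputs sum to 1.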

M. Zeshan Afzal, Very Deep Learning Ch. 5 31


Putting it all together

◼ Cross entropy loss for a single training sample

Class Label y    Scores s          Softmax(s)                     CE Loss

(1, 0, 0, 0)     (3, 1, -2, -1)    (0.85, 0.11, 0.005, 0.015)     0.16

(0, 1, 0, 0)     (1, 2, -1, -1)    (0.25, 0.68, 0.033, 0.033)     0.38

(0, 0, 1, 0)     (2, 2, 1, 3)      (0.19, 0.19, 0.072, 0.534)     2.6

(0, 0, 0, 1)     (3, 2, 3, -1)     (0.41, 0.15, 0.419, 0.007)     4.9
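
The first row of the table can be checked with a few lines of NumPy (a sketch, not the lecture's code):

    import numpy as np

    s = np.array([3., 1., -2., -1.])      # scores; the true class is the first one
    p = np.exp(s) / np.exp(s).sum()       # softmax
    print(np.round(p, 3))                 # [0.862 0.117 0.006 0.016]
    print(-np.log(p[0]))                  # ~0.149; the table's 0.16 is -ln(0.85) after rounding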

M. Zeshan Afzal, Very Deep Learning Ch. 5 32


Softmax

◼ It is a soft/smooth approximation of the max function
◼ A differentiable approximation of a non-differentiable function
◼ This makes optimization easier

Source: "Why is the softmax activation function called 'softmax'?" - Quora

M. Zeshan Afzal, Very Deep Learning Ch. 5 33


Loss function
◼ Simple example
^ Let's say there is a three-class classification problem (Cat, Dog, Cow)

Computed        Ground Truth   Class   Correct?
0.3 0.3 0.4     0 0 1          Cat     Yes
0.3 0.4 0.3     0 1 0          Dog     Yes
0.1 0.2 0.7     1 0 0          Cow     No
Classification error = 1/3 = 0.33; classification accuracy = 2/3 = 0.67.

Computed        Ground Truth   Class   Correct?
0.1 0.2 0.7     0 0 1          Cat     Yes
0.1 0.7 0.2     0 1 0          Dog     Yes
0.3 0.4 0.3     1 0 0          Cow     No
Classification error = 1/3 = 0.33; classification accuracy = 2/3 = 0.67.

M. Zeshan Afzal, Very Deep Learning Ch. 5 34


Loss function (Cross Entropy)
Computed        Ground Truth   Class   Correct?
0.3 0.3 0.4     0 0 1          Cat     Yes
0.3 0.4 0.3     0 1 0          Dog     Yes
0.1 0.2 0.7     1 0 0          Cow     No
Cross entropy loss = -(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38

Computed        Ground Truth   Class   Correct?
0.1 0.2 0.7     0 0 1          Cat     Yes
0.1 0.7 0.2     0 1 0          Dog     Yes
0.3 0.4 0.3     1 0 0          Cow     No
Cross entropy loss = -(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64

M. Zeshan Afzal, Very Deep Learning Ch. 5 35


Image Classification (MNIST)

◼ We would like to classify into 10 different classes: the digits 0-9
◼ It is old but still used in research
◼ It is based on data from the National Institute of Standards and Technology
◼ It comprises handwritten digits written by census employees and school children
◼ The images have a resolution of 28x28; there are 60K training and 10K test samples, all labelled
◼ The train and test samples are not written by the same participants
M. Zeshan Afzal, Very Deep Learning Ch. 5 37
Image Classification (MNIST)

◼ Curse of dimensionality
^ Assume binary images: there are 2^784 ≈ 10^236 different images
^ For grayscale images we have 256^784 combinations
^ Why is classification possible at all with only 60K images?
^ Because images are concentrated on a low-dimensional manifold in {0,…,255}^784
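
A quick check of the figures above (not part of the original slide):

    import math
    print(math.log10(2 ** 784))     # ~236.0  -> 2^784 ~ 10^236 binary 28x28 images
    print(784 * math.log10(256))    # ~1888   -> 256^784 ~ 10^1888 grayscale images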

M. Zeshan Afzal, Very Deep Learning Ch. 5 38


MLP (MNIST DEMO)

◼ Check Uploaded IPython Notebook

M. Zeshan Afzal, Very Deep Learning Ch. 5 39


Universal Representation

◼ Networks with a single hidden layer can represent any function F(x) with
arbitrary accuracy in the limit of large hidden size
◼ However
^ Limitations of the learning algorithm
• A given learning algorithm may be unable to find an optimum with this accuracy
^ Efficiency
• A network with one hidden layer can be inefficient at representing a nonlinear function
• The required number of hidden neurons can be exponential in the input size
^ A nonlinear function F(x) can often be represented better by
• Deep networks with narrower layers

Kurt Hornik, Approximation capabilities of multilayer feedforward networks,Neural Networks,Volume 4, Issue 2, (1991).
Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314 (1989).

M. Zeshan Afzal, Very Deep Learning Ch. 5 40


Motivation (Convolutional Neural Networks)

◼ Fully connected layers


^ Each input is connected to each node
◼ Pixels are bad features
^ Highly correlated
^ Scale dependent
^ Subject to intensity variations, etc.
^ Pixels are a bad representation from a
machine learning point of view

Image source: https://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/

M. Zeshan Afzal, Very Deep Learning Ch. 5 41


Edge detection

◼ Simple filters
^ Edge detection

M. Zeshan Afzal, Very Deep Learning Ch. 5 45


SIFT

https://medium.com/machine-learning-world/feature-extraction-and-similar-image-search-with-opencv-for-newbies-3c59796bf774

M. Zeshan Afzal, Very Deep Learning Ch. 5 46


Motivation (Convolutional Neural Networks)

◼ Fully connected layers


^ Each input is connected to each node
◼ Pixels are bad features
^ Highly correlated
^ Scale dependent
^ Subject to intensity variations, etc.
^ Pixels are a bad representation from a
machine learning point of view
◼ Can we find a better representation?

M. Zeshan Afzal, Very Deep Learning Ch. 5 47


Motivation (Convolutional Neural Networks)

◼ Can we find a better representation


^ We have a certain degree of locality in an image
^ We can find macro features at different locations
^ Hierarchy of features
• Edges + Corners → Eyes
• Eyes + Nose + Ears → Face
• Face + Body + Legs → human

M. Zeshan Afzal, Very Deep Learning Ch. 5 48


Convolutional Neural Network

▪ Feature hierarchies

M. Zeshan Afzal, Very Deep Learning Ch. 5


Convolutional Neural Network

Feature Extraction Classification

◼ Built-in invariances / equivariances (translation)


◼ Suitable for data on grid topologies
^ 1D (audio signal, time series)
^ 2D (pixelated images)
^ 3D (videos)
◼ CNNs consist of blocks
^ Convolutions + nonlinear activation + pooling (subsampling)
Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998

M. Zeshan Afzal, Very Deep Learning Ch. 5 50


Convolutions

◼ Convolution operation

◼ Discrete convolution
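
In standard notation, for a signal x and kernel w:

    (x * w)(t) = \int x(a)\, w(t - a)\, da              (continuous convolution)
    (x * w)[n] = \sum_{a} x[a]\, w[n - a]               (discrete convolution)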

M. Zeshan Afzal, Very Deep Learning Ch. 5 51


Convolutions
◼ Convolution is a linear operation
^ Example: let us consider an input x = (x1, x2, x3, x4, x5) and a kernel w = (w1, w2, w3)
^ We get the following equations for y (see below)
◼ Convolving x with w can be written as a linear operation (see the matrix form sketched below)

^ We assumed that the index j remains within the
bounds of x. As a result the convolved
output y is reduced to the dimension
n - m + 1. To keep the same dimension we can
pad x with zeros (zero padding)
^ Shared weights
^ Sparse connectivity
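
A sketch of the matrix form for this example (kernel flipping is ignored, as is common in deep learning; the exact indexing convention on the original slide may differ):

    y_i = \sum_{j=1}^{3} w_j\, x_{i+j-1}, \qquad i = 1, 2, 3

    \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} =
    \begin{pmatrix}
      w_1 & w_2 & w_3 & 0   & 0 \\
      0   & w_1 & w_2 & w_3 & 0 \\
      0   & 0   & w_1 & w_2 & w_3
    \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}

Each output row reuses the same three weights (shared weights) and touches only three inputs (sparse connectivity); the output length is n - m + 1 = 5 - 3 + 1 = 3.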
M. Zeshan Afzal, Very Deep Learning Ch. 5 52
Convolution vs Cross-Correlation

◼ Convolution operation

◼ Cross-correlation

^ Cross-correlation is convolution with a flipped kernel (see the NumPy check below)


^ In practice, cross-correlation is used
• The filters (weights) are initialized randomly
• The values are learned with backpropagation
https://en.wikipedia.org/wiki/Convolution
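
A small NumPy check of the flipped-kernel relationship (a sketch, not the lecture's code):

    import numpy as np

    x = np.array([1., 2., 3., 4., 5.])
    w = np.array([1., 0., -1.])
    conv = np.convolve(x, w, mode='valid')     # true convolution (kernel flipped internally)
    corr = np.correlate(x, w, mode='valid')    # cross-correlation (kernel used as-is)
    print(conv, corr)                          # [2. 2. 2.] [-2. -2. -2.]
    print(np.allclose(conv, np.correlate(x, w[::-1], mode='valid')))   # True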

M. Zeshan Afzal, Very Deep Learning Ch. 5 53


Multidimensional Convolutions

◼ 2D convolution

◼ 2D cross-correlation
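
In standard notation, for an image I and kernel K:

    (I * K)[i, j]     = \sum_{m} \sum_{n} I[m, n]\, K[i - m, j - n]        (2D convolution)
    (I \star K)[i, j] = \sum_{m} \sum_{n} I[i + m, j + n]\, K[m, n]        (2D cross-correlation)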

M. Zeshan Afzal, Very Deep Learning Ch. 5 54


A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 55


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 56


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 57


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 58


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter

=> 5x5 output


7

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 59


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 60


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 61


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2
=> 3x3 output!
7

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 62


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 63


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?

7 doesn’t fit!
cannot apply 3x3 filter on
7x7 input with stride 3.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 64


Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33  (does not fit)
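
A small helper implementing this formula, extended with the zero-padding term used on the next slides (a sketch, not the lecture's code):

    def conv_output_size(n, f, stride=1, pad=0):
        """Spatial output size: (N + 2P - F) / stride + 1; must be an integer."""
        size = (n + 2 * pad - f) / stride + 1
        if size != int(size):
            raise ValueError(f"{f}x{f} filter with stride {stride} does not fit input {n}")
        return int(size)

    print(conv_output_size(7, 3, stride=1))          # 5
    print(conv_output_size(7, 3, stride=2))          # 3
    print(conv_output_size(7, 3, stride=1, pad=1))   # 7 (padding preserves the size)
    # conv_output_size(7, 3, stride=3) raises: the 3x3 filter does not fit with stride 3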

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5


In practice: Common to zero pad the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?

(recall:)
(N - F) / stride + 1
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 66


In practice: Common to zero pad the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
7x7 output!

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 67


In practice: Common to zero pad the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
7x7 output!
in general, it is
• common to see CONV layers with stride 1
• filters of size FxF
• and zero-padding with (F-1)/2 (this preserves the size spatially)
e.g. F = 3 => zero pad with 1
     F = 5 => zero pad with 2
     F = 7 => zero pad with 3

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 68
Multichannel Convolutions

◼ Multichannel convolutions

M. Zeshan Afzal, Very Deep Learning Ch. 5 80


Convolution Layer
32x32x3 image -> preserve spatial structure

32 height

32 width
3 depth

M. Zeshan Afzal, Very Deep Learning Ch. 5 81


Convolution Layer
◼ 32x32x3 image

◼ 5x5x3 filter
32

◼ Convolve the filter with the image


◼ i.e. “slide over the image spatially,
computing dot products”
32
3

M. Zeshan Afzal, Very Deep Learning Ch. 5 82


Convolution Layer

Filters always extend the full depth of the input volume

◼ 32x32x3 image

◼ 5x5x3 filter
32
◼ Convolve the filter with the image
◼ i.e. “slide over the image spatially,
computing dot products”

32
3

M. Zeshan Afzal, Very Deep Learning Ch. 5 83


Convolution Layer

32x32x3 image
5x5x3 filter
32

1 number:
the result of taking a dot product between the
filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)
3

M. Zeshan Afzal, Very Deep Learning Ch. 5 84


Convolution Layer
activation map
32x32x3 image
5x5x3 filter
32

28

convolve (slide) over all


spatial locations

32 28
3 1

M. Zeshan Afzal, Very Deep Learning Ch. 5 85


Convolution Layer

consider a second, green filter

32x32x3 image activation maps

5x5x3 filter
32

28

convolve (slide) over all


spatial locations

32 28
3 1

M. Zeshan Afzal, Very Deep Learning Ch. 5 86
For example, if we had 6 5x5 filters, we’ll get 6 separate activation
maps:
activation maps

32

28

Convolution Layer

32 28
3 6

We stack these up to get a “new image” of size 28x28x6!
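
The same arithmetic can be checked with PyTorch (a sketch; the layer is illustrative, not the lecture's code):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 32, 32)                                    # one 32x32x3 image (NCHW)
    conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)   # six 5x5x3 filters
    print(conv(x).shape)                                             # torch.Size([1, 6, 28, 28])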

M. Zeshan Afzal, Very Deep Learning Ch. 5 87


Preview: ConvNet is a sequence of Convolution Layers, interspersed
with activation functions

32 28

CONV,
ReLU
e.g. 6
5x5x3
32 filters 28
3 6

M. Zeshan Afzal, Very Deep Learning Ch. 5 88


Preview: ConvNet is a sequence of Convolutional Layers, interspersed
with activation functions

32 28 24

….
CONV, CONV, CONV,
ReLU ReLU ReLU
e.g. 6 e.g. 10
5x5x3 5x5x6
32 filters 28 filters 24
3 6 10

M. Zeshan Afzal, Very Deep Learning Ch. 5 89


Preview [Zeiler and Fergus 2013]
Visualization of VGG-16 by Lane McIntosh. VGG-16 architecture from [Simonyan and Zisserman 2014].

M. Zeshan Afzal, Very Deep Learning Ch. 5
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently

M. Zeshan Afzal, Very Deep Learning Ch. 5
MAX POOLING

Single depth slice (x: columns, y: rows):

1  1  2  4
5  6  7  8
3  2  1  0
1  2  3  4

max pool with 2x2 filters and stride 2 =>

6  8
3  4
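
The same example checked with PyTorch (a sketch, not the lecture's code):

    import torch
    import torch.nn as nn

    x = torch.tensor([[1., 1., 2., 4.],
                      [5., 6., 7., 8.],
                      [3., 2., 1., 0.],
                      [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)   # NCHW layout
    pool = nn.MaxPool2d(kernel_size=2, stride=2)
    print(pool(x).reshape(2, 2))                               # tensor([[6., 8.], [3., 4.]])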
M. Zeshan Afzal, Very Deep Learning Ch. 5 92
Typical CNN Structure

◼ Image -> convolution -> max pooling -> output

Image → Convolution → Pooling → Flattening → Fully Connected Layer → Softmax → Loss
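
A minimal PyTorch sketch of this pipeline for 28x28 grayscale input (the layer sizes are illustrative assumptions, not specified by the slide):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution
        nn.ReLU(),
        nn.MaxPool2d(2, 2),                          # max pooling: 28x28 -> 14x14
        nn.Flatten(),                                # flattening: 8 * 14 * 14 features
        nn.Linear(8 * 14 * 14, 10),                  # fully connected layer -> class scores
    )
    loss_fn = nn.CrossEntropyLoss()                  # softmax + cross-entropy loss
    x, y = torch.randn(4, 1, 28, 28), torch.randint(0, 10, (4,))
    print(loss_fn(model(x), y))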

M. Zeshan Afzal, Very Deep Learning Ch. 5 94


Thanks a lot for your Attention

M. Zeshan Afzal, Very Deep Learning Ch. 5 95
