
Very Deep Learning

Lecture 05

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
[email protected]

M. Zeshan Afzal, Very Deep Learning Ch. 5


Recap

M. Zeshan Afzal, Very Deep Learning Ch. 5 2


Variants of Gradient Descent

◼ Gradient Descent
^ Updates after looking at the complete dataset
◼ Minibatch Gradient Descent
^ Updates after looking at N samples (the batch size)
◼ Stochastic Gradient Descent
^ Updates after looking at every single sample
◼ Related concept
^ Epoch
• one cycle through the full training dataset (see the sketch below)
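
The three variants above differ only in how many samples are consumed per parameter update. A minimal sketch, assuming a hypothetical grad_fn that returns the average gradient on a batch (not the lecture's code):

    import numpy as np

    def run_epoch(w, X, y, grad_fn, lr=0.1, batch_size=None):
        """One epoch: full-batch GD (batch_size=None), minibatch GD, or SGD (batch_size=1)."""
        n = len(X)
        bs = n if batch_size is None else batch_size
        idx = np.random.permutation(n)                     # shuffle once per epoch
        for start in range(0, n, bs):
            batch = idx[start:start + bs]
            w = w - lr * grad_fn(w, X[batch], y[batch])    # one update per batch
        return w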

M. Zeshan Afzal, Very Deep Learning Ch. 5 3


Logistic Regression (Decision Boundary)

◼ The decision boundary

◼ Decide for class 1

◼ Decide for class 0
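
Written out, this is the standard logistic-regression rule (w and b denote the usual weight vector and bias, which the slide's figures do not show explicitly):

    Decide class 1 if \sigma(w^\top x + b) \ge 0.5, i.e. if w^\top x + b \ge 0; otherwise decide class 0.
    The decision boundary is the hyperplane w^\top x + b = 0.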

M. Zeshan Afzal, Very Deep Learning Ch. 5 4


Simple Examples

M. Zeshan Afzal, Very Deep Learning Ch. 5 5


XOR

◼ Linear Classifier

𝒙𝟏 𝒙𝟐 XOR (𝒙𝟏 , 𝒙𝟐 )

0 0 0
0 1 1
1 0 1
1 1 0

M. Zeshan Afzal, Very Deep Learning Ch. 5 6


XOR (Multilayer Perceptron)

◼ Linear Classifier

𝒙𝟏 𝒙𝟐 XOR (𝒙𝟏 , 𝒙𝟐 )

0 0 0
0 1 1
1 0 1
1 1 0

M. Zeshan Afzal, Very Deep Learning Ch. 5 7


Representation Matters

M. Zeshan Afzal, Very Deep Learning Ch. 5 8


Neural Network Playground

◼ https://playground.tensorflow.org/

M. Zeshan Afzal, Very Deep Learning Ch. 5 9


Multilayer Perceptron

M. Zeshan Afzal, Very Deep Learning Ch. 5 10


Multilayer Perceptron

M. Zeshan Afzal, Very Deep Learning Ch. 5 11


A Brief History of Deep Learning

1940 1950 1960 1970 1980 1990 2000 2010 2020

◼ 1986 Backpropagation Algorithm


^ The backpropagation algorithm made it possible to train a neural
network based on the error fed back from the output
^ Allowed efficient calculation of the gradients with
respect to the weights

Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986.

M. Zeshan Afzal, Very Deep Learning Ch. 5 12


Activation Functions

M. Zeshan Afzal, Very Deep Learning Ch. 5 13


Sigmoid

◼ Maps input to range [0, 1]


◼ Historically popular since it has a nice interpretation as the saturating firing rate of a neuron

◼ Problems
^ Saturates: the gradients are killed
^ Outputs are not zero-centred

M. Zeshan Afzal, Very Deep Learning Ch. 5 14


Sigmoid

◼ Maps input to range [0, 1]

◼ Problems

Non-zero-centred outputs restrict the sign of the gradient updates, which makes optimisation inefficient (minibatches help)

M. Zeshan Afzal, Very Deep Learning Ch. 5 15


Tanh

◼ Maps input to range [-1, 1]

◼ Zero centred
◼ Antisymmetric
◼ Problem
^ Saturation kills the gradient

M. Zeshan Afzal, Very Deep Learning Ch. 5 16


ReLU

◼ Does not saturate (for x > 0)


◼ Leads to fast convergence
◼ Computationally efficient
◼ Problems
^ Not zero-centered
^ No learning for x < 0, which leads to dead ReLU units

M. Zeshan Afzal, Very Deep Learning Ch. 5 17


Leaky ReLU

◼ Does not saturate


◼ Closer to zero-centred
outputs
◼ Fast convergence
◼ Computationally efficient

M. Zeshan Afzal, Very Deep Learning Ch. 5 18


Parametric ReLU

◼ Does not saturate


◼ Closer to zero-centred
outputs
◼ Fast convergence
◼ Computationally efficient
◼ Parameter α is learned
from data

M. Zeshan Afzal, Very Deep Learning Ch. 5 19


ELU

◼ All benefits of leaky ReLU


◼ Adds some robustness to
noise
◼ Default value α = 1
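
A minimal NumPy sketch of the activations discussed on the previous slides (the α defaults are the commonly used values, not values fixed by the lecture):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))                      # output in (0, 1), saturates
    def tanh(x):
        return np.tanh(x)                                     # zero-centred, still saturates
    def relu(x):
        return np.maximum(0.0, x)                             # no saturation for x > 0
    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)                  # small slope for x < 0
    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))    # smooth negative saturation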

M. Zeshan Afzal, Very Deep Learning Ch. 5 20


Maxout

◼ Any continuous PWL function


can be expressed as a
difference of two convex PWL
functions.
◼ Any continuous function can
be approximated arbitrarily
well, by a piecewise linear
function.
◼ Generalizes ReLU and leaky ReLU
◼ Increases the number of parameters per neuron
Goodfellow, Ian, et al. "Maxout networks." International conference on machine learning. PMLR, 2013.

M. Zeshan Afzal, Very Deep Learning Ch. 5 21


Training

M. Zeshan Afzal, Very Deep Learning Ch. 5 22


Logistic Regression

◼ As a neural network

M. Zeshan Afzal, Very Deep Learning Ch. 5 23


Logistic Regression

◼ We have already seen the Maximum Likelihood Estimator

◼ We now perform a binary classification


◼ How should we choose the model in this case?

◼ Answer: Bernoulli distribution

where the probability ŷ is predicted by the model:

M. Zeshan Afzal, Very Deep Learning Ch. 5 24


Logistic Regression

◼ Putting it together

◼ In machine learning we use the more general term ‘loss function’ rather than ‘error function’
◼ We minimize the dissimilarity between the empirical data distribution
(defined by the training set) and the model distribution
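
Written out, this loss is the negative Bernoulli log-likelihood, i.e. the binary cross-entropy (standard notation, with \hat{y}_i the predicted probability for sample i):

    L(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]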

M. Zeshan Afzal, Very Deep Learning Ch. 5 25


Logistic Regression

◼ In summary, we have assumed the Bernoulli distribution

where ŷ is predicted by the model
◼ The question is how to choose ŷ
◼ We are working with a discrete distribution, i.e. y ∈ {0, 1}

◼ We can choose ŷ = σ(w^T x + b)

The sigmoid is given as follows:
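
    \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y} = \sigma(w^\top x + b)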

M. Zeshan Afzal, Very Deep Learning Ch. 5 26


Multinomial Distribution

◼ Categorical distribution

^ probability for class c

◼ Alternative notation

^ "one-hot vector" y with

^ only the true class set to 1, all others zero

M. Zeshan Afzal, Very Deep Learning Ch. 5 27


One Hot representation
Class y    One-hot vector

1          (1, 0, 0, 0)^T

2          (0, 1, 0, 0)^T

3          (0, 0, 1, 0)^T

4          (0, 0, 0, 1)^T

M. Zeshan Afzal, Very Deep Learning Ch. 5 28


Multinomial Distribution

◼ Categorical distribution

^ probability for class c

◼ Alternative notation
Example: the one-hot vector for class 1 is y = (1, 0, 0, 0)^T; with predicted class probabilities (0.5, 0.1, 0.2, 0.1) the likelihood is
p(y | x) = 0.5^1 × 0.1^0 × 0.2^0 × 0.1^0 = 0.5
^ "one-hot vector" y with
^ only the true class set to 1, all others zero

M. Zeshan Afzal, Very Deep Learning Ch. 5 29


Categorical Distribution / Cross Entropy Loss
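
For a one-hot label y and predicted class probabilities p, the standard forms are

    p(y \mid x) = \prod_{c=1}^{C} p_c^{\,y_c}, \qquad
    \mathcal{L}_{CE}(y, p) = -\sum_{c=1}^{C} y_c \log p_c = -\log p_{c^*}

where c^* is the index of the true class.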

M. Zeshan Afzal, Very Deep Learning Ch. 5 30


Softmax

◼ How can we ensure that the network predicts a valid categorical (discrete) distribution?


◼ We must guarantee
^ p_c ≥ 0 for all c, and Σ_c p_c = 1
◼ An element-wise sigmoid as output function would ensure the first condition only
◼ Solution: Softmax function

◼ Let s denote the network output after the last affine layer (=scores). Then:
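
In standard notation, with C classes:

    \text{softmax}(s)_c = \frac{e^{s_c}}{\sum_{k=1}^{C} e^{s_k}}, \qquad c = 1, \dots, C

This makes every output positive and the outputs sum to 1.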

M. Zeshan Afzal, Very Deep Learning Ch. 5 31


Putting it all together

◼ Cross entropy loss for a single training sample

Class Label y    Scores s          Softmax(s)                     CE Loss

(1, 0, 0, 0)     (3, 1, -2, -1)    (0.85, 0.11, 0.005, 0.015)     0.16

(0, 1, 0, 0)     (1, 2, -1, -1)    (0.25, 0.68, 0.033, 0.033)     0.38

(0, 0, 1, 0)     (2, 2, 1, 3)      (0.19, 0.19, 0.072, 0.534)     2.6

(0, 0, 0, 1)     (3, 2, 3, -1)     (0.41, 0.15, 0.419, 0.007)     4.9
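
The first row of the table can be checked with a few lines of NumPy (a sketch, not the lecture's code):

    import numpy as np

    s = np.array([3., 1., -2., -1.])      # scores; the true class is the first one
    p = np.exp(s) / np.exp(s).sum()       # softmax
    print(np.round(p, 3))                 # [0.862 0.117 0.006 0.016]
    print(-np.log(p[0]))                  # ~0.149; the table's 0.16 is -ln(0.85) after rounding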

M. Zeshan Afzal, Very Deep Learning Ch. 5 32


Softmax

◼ It is a soft/smooth approximation of the max function
◼ A differentiable approximation of a non-differentiable function
◼ This makes optimization easier

Source: "Why is the softmax activation function called 'softmax'?" - Quora

M. Zeshan Afzal, Very Deep Learning Ch. 5 33


Loss function
◼ Simple example
^ Let's say there is a three-class classification problem (Cat, Dog, Cow)

Computed        Ground Truth   Class   Correct?
0.3 0.3 0.4     0 0 1          Cat     Yes
0.3 0.4 0.3     0 1 0          Dog     Yes
0.1 0.2 0.7     1 0 0          Cow     No
Classification error = 1/3 = 0.33; classification accuracy = 2/3 = 0.67.

Computed        Ground Truth   Class   Correct?
0.1 0.2 0.7     0 0 1          Cat     Yes
0.1 0.7 0.2     0 1 0          Dog     Yes
0.3 0.4 0.3     1 0 0          Cow     No
Classification error = 1/3 = 0.33; classification accuracy = 2/3 = 0.67.

M. Zeshan Afzal, Very Deep Learning Ch. 5 34


Loss function (Cross Entropy)
Computed        Ground Truth   Class   Correct?
0.3 0.3 0.4     0 0 1          Cat     Yes
0.3 0.4 0.3     0 1 0          Dog     Yes
0.1 0.2 0.7     1 0 0          Cow     No
Cross entropy loss = -(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38

Computed        Ground Truth   Class   Correct?
0.1 0.2 0.7     0 0 1          Cat     Yes
0.1 0.7 0.2     0 1 0          Dog     Yes
0.3 0.4 0.3     1 0 0          Cow     No
Cross entropy loss = -(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64

M. Zeshan Afzal, Very Deep Learning Ch. 5 35


Image Classification (MNIST)

◼ We would like to classify into 10 different classes: the digits 0-9
◼ It is old but still used in research
◼ It is based on data from the National Institute of Standards and Technology
◼ It comprises handwritten digits written by census employees and school children
◼ The images have a resolution of 28x28; there are 60K training and 10K test samples, all labelled
◼ The train and test samples are not written by the same participants
M. Zeshan Afzal, Very Deep Learning Ch. 5 37
Image Classification (MNIST)

◼ Curse of dimensionality
^ Assume binary images: there are 2^784 ≈ 10^236 different images
^ For grayscale images we have 256^784 combinations
^ Why is classification possible at all with only 60K images?
^ Because images are concentrated on a low-dimensional manifold in {0,…,255}^784
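
A quick check of the figures above (not part of the original slide):

    import math
    print(math.log10(2 ** 784))     # ~236.0  -> 2^784 ~ 10^236 binary 28x28 images
    print(784 * math.log10(256))    # ~1888   -> 256^784 ~ 10^1888 grayscale images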

M. Zeshan Afzal, Very Deep Learning Ch. 5 38


MLP (MNIST DEMO)

◼ Check Uploaded IPython Notebook

M. Zeshan Afzal, Very Deep Learning Ch. 5 39


Universal Representation

◼ Networks with a single hidden layer can represent any function F(x) with
arbitrary accuracy in the limit of large hidden size
◼ However
^ Limitations of the learning algorithm
• A given learning algorithm may be unable to find an optimum with this accuracy
^ Efficiency
• A network with one hidden layer can be inefficient at representing a nonlinear function
• The required number of hidden neurons can be exponential in the input size
^ A nonlinear function F(x) can often be represented better by
• Deep networks with narrower layers

Kurt Hornik, Approximation capabilities of multilayer feedforward networks,Neural Networks,Volume 4, Issue 2, (1991).
Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314 (1989).

M. Zeshan Afzal, Very Deep Learning Ch. 5 40


Motivation (Convolutional Neural Networks)

◼ Fully connected layers


^ Each input is connected to each node
◼ Pixels are bad features
^ Highly correlated
^ Scale dependent
^ Subject to intensity variations, etc.
^ Pixels are a bad representation from a
machine learning point of view

Image source: https://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/

M. Zeshan Afzal, Very Deep Learning Ch. 5 41


Edge detection

◼ Simple filters
^ Edge detection

M. Zeshan Afzal, Very Deep Learning Ch. 5 45


SIFT

https://medium.com/machine-learning-world/feature-extraction-and-similar-image-search-with-opencv-for-newbies-3c59796bf774

M. Zeshan Afzal, Very Deep Learning Ch. 5 46


Motivation (Convolutional Neural Networks)

◼ Fully connected layers


^ Each input is connected to each node
◼ Pixels are bad features
^ Highly correlated
^ Scale dependent
^ Subject to intensity variations, etc.
^ Pixels are a bad representation from a
machine learning point of view
◼ Can we find a better representation?

M. Zeshan Afzal, Very Deep Learning Ch. 5 47


Motivation (Convolutional Neural Networks)

◼ Can we find a better representation


^ We have a certain degree of locality in an image
^ We can find macro features at different locations
^ Hierarchy of features
• Edges + Corners → Eyes
• Eyes + Nose + Ears → Face
• Face + Body + Legs → human

M. Zeshan Afzal, Very Deep Learning Ch. 5 48


Convolutional Neural Network

▪ Feature hierarchies

M. Zeshan Afzal, Very Deep Learning Ch. 5


Convolutional Neural Network

Feature Extraction Classification

◼ Built-in invariances / equivariances (translation)


◼ Suitable for data on grid topologies
^ 1D (audio signal, time series)
^ 2D (pixelated images)
^ 3D (videos)
◼ CNNs consist of blocks
^ Convolutions + nonlinear activation + pooling (subsampling)
Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998

M. Zeshan Afzal, Very Deep Learning Ch. 5 50


Convolutions

◼ Convolution operation

◼ Discrete convolution
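
In standard notation, for a signal x and kernel w:

    (x * w)(t) = \int x(a)\, w(t - a)\, da              (continuous convolution)
    (x * w)[n] = \sum_{a} x[a]\, w[n - a]               (discrete convolution)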

M. Zeshan Afzal, Very Deep Learning Ch. 5 51


Convolutions
◼ Convolution is a linear operation
^ Example: let us consider an input x = (x1, x2, x3, x4, x5) and a kernel w = (w1, w2, w3)
^ We get the following equations for y (see below)
◼ Convolving x with w can be written as a linear operation (see the matrix form sketched below)

^ We assumed that the index j remains within the
bounds of x. As a result the convolved
output y is reduced to the dimension
n - m + 1. To keep the same dimension we can
pad x with zeros (zero padding)
^ Shared weights
^ Sparse connectivity
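
A sketch of the matrix form for this example (kernel flipping is ignored, as is common in deep learning; the exact indexing convention on the original slide may differ):

    y_i = \sum_{j=1}^{3} w_j\, x_{i+j-1}, \qquad i = 1, 2, 3

    \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} =
    \begin{pmatrix}
      w_1 & w_2 & w_3 & 0   & 0 \\
      0   & w_1 & w_2 & w_3 & 0 \\
      0   & 0   & w_1 & w_2 & w_3
    \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}

Each output row reuses the same three weights (shared weights) and touches only three inputs (sparse connectivity); the output length is n - m + 1 = 5 - 3 + 1 = 3.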
M. Zeshan Afzal, Very Deep Learning Ch. 5 52
Convolution vs Cross-Correlation

◼ Convolution operation

◼ Cross-correlation

^ Cross-correlation is convolution with a flipped kernel (see the NumPy check below)


^ In practice, cross-correlation is used
• The filters (weights) are initialized randomly
• The values are learned with backpropagation
https://en.wikipedia.org/wiki/Convolution
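
A small NumPy check of the flipped-kernel relationship (a sketch, not the lecture's code):

    import numpy as np

    x = np.array([1., 2., 3., 4., 5.])
    w = np.array([1., 0., -1.])
    conv = np.convolve(x, w, mode='valid')     # true convolution (kernel flipped internally)
    corr = np.correlate(x, w, mode='valid')    # cross-correlation (kernel used as-is)
    print(conv, corr)                          # [2. 2. 2.] [-2. -2. -2.]
    print(np.allclose(conv, np.correlate(x, w[::-1], mode='valid')))   # True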

M. Zeshan Afzal, Very Deep Learning Ch. 5 53


Multidimensional Convolutions

◼ 2D convolution

◼ 2D cross-correlation
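
In standard notation, for an image I and kernel K:

    (I * K)[i, j]     = \sum_{m} \sum_{n} I[m, n]\, K[i - m, j - n]        (2D convolution)
    (I \star K)[i, j] = \sum_{m} \sum_{n} I[i + m, j + n]\, K[m, n]        (2D cross-correlation)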

M. Zeshan Afzal, Very Deep Learning Ch. 5 54


A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 55


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 56


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 57


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 58


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter

=> 5x5 output


7

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 59


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 60


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 61


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2
=> 3x3 output!
7

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 62


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 63


A closer look at spatial
dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?

7 doesn’t fit!
cannot apply 3x3 filter on
7x7 input with stride 3.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 64


Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33  (does not fit)
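
A small helper implementing this formula, extended with the zero-padding term used on the next slides (a sketch, not the lecture's code):

    def conv_output_size(n, f, stride=1, pad=0):
        """Spatial output size: (N + 2P - F) / stride + 1; must be an integer."""
        size = (n + 2 * pad - f) / stride + 1
        if size != int(size):
            raise ValueError(f"{f}x{f} filter with stride {stride} does not fit input {n}")
        return int(size)

    print(conv_output_size(7, 3, stride=1))          # 5
    print(conv_output_size(7, 3, stride=2))          # 3
    print(conv_output_size(7, 3, stride=1, pad=1))   # 7 (padding preserves the size)
    # conv_output_size(7, 3, stride=3) raises: the 3x3 filter does not fit with stride 3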

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5


In practice: Common to zero pad the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?

(recall:)
(N - F) / stride + 1
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 66


In practice: Common to zero pad the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
7x7 output!

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 67


In practice: Common to zero pad the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
7x7 output!
in general, it is
• common to see CONV layers with stride 1
• filters of size FxF
• and zero-padding with (F-1)/2 (this preserves the size spatially)
e.g. F = 3 => zero pad with 1
     F = 5 => zero pad with 2
     F = 7 => zero pad with 3

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

M. Zeshan Afzal, Very Deep Learning Ch. 5 68
Multichannel Convolutions

◼ Multichannel convolutions

M. Zeshan Afzal, Very Deep Learning Ch. 5 80


Convolution Layer
32x32x3 image -> preserve spatial structure

32 height

32 width
3 depth

M. Zeshan Afzal, Very Deep Learning Ch. 5 81


Convolution Layer
◼ 32x32x3 image

◼ 5x5x3 filter
32

◼ Convolve the filter with the image


◼ i.e. “slide over the image spatially,
computing dot products”
32
3

M. Zeshan Afzal, Very Deep Learning Ch. 5 82


Convolution Layer

Filters always extend the full depth of the input volume

◼ 32x32x3 image

◼ 5x5x3 filter
32
◼ Convolve the filter with the image
◼ i.e. “slide over the image spatially,
computing dot products”

32
3

M. Zeshan Afzal, Very Deep Learning Ch. 5 83


Convolution Layer

32x32x3 image
5x5x3 filter
32

1 number:
the result of taking a dot product between the
filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)
3

M. Zeshan Afzal, Very Deep Learning Ch. 5 84


Convolution Layer
activation map
32x32x3 image
5x5x3 filter
32

28

convolve (slide) over all


spatial locations

32 28
3 1

M. Zeshan Afzal, Very Deep Learning Ch. 5 85


Convolution Layer

consider a second, green filter

32x32x3 image activation maps

5x5x3 filter
32

28

convolve (slide) over all


spatial locations

32 28
3 1

M. Zeshan Afzal, Very Deep Learning Ch. 5 86
For example, if we had 6 5x5 filters, we’ll get 6 separate activation
maps:
activation maps

32

28

Convolution Layer

32 28
3 6

We stack these up to get a “new image” of size 28x28x6!
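
The same arithmetic can be checked with PyTorch (a sketch; the layer is illustrative, not the lecture's code):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 32, 32)                                    # one 32x32x3 image (NCHW)
    conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)   # six 5x5x3 filters
    print(conv(x).shape)                                             # torch.Size([1, 6, 28, 28])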

M. Zeshan Afzal, Very Deep Learning Ch. 5 87


Preview: ConvNet is a sequence of Convolution Layers, interspersed
with activation functions

32 28

CONV,
ReLU
e.g. 6
5x5x3
32 filters 28
3 6

M. Zeshan Afzal, Very Deep Learning Ch. 5 88


Preview: ConvNet is a sequence of Convolutional Layers, interspersed
with activation functions

32 28 24

….
CONV, CONV, CONV,
ReLU ReLU ReLU
e.g. 6 e.g. 10
5x5x3 5x5x6
32 filters 28 filters 24
3 6 10

M. Zeshan Afzal, Very Deep Learning Ch. 5 89


Preview [Zeiler and Fergus 2013]
Visualization of VGG-16 by Lane McIntosh. VGG-16 architecture from [Simonyan and Zisserman 2014].

M. Zeshan Afzal, Very Deep Learning Ch. 5
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently

M. Zeshan Afzal, Very Deep Learning Ch. 5
MAX POOLING

Single depth slice (x: columns, y: rows):

1  1  2  4
5  6  7  8
3  2  1  0
1  2  3  4

max pool with 2x2 filters and stride 2 =>

6  8
3  4
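
The same example checked with PyTorch (a sketch, not the lecture's code):

    import torch
    import torch.nn as nn

    x = torch.tensor([[1., 1., 2., 4.],
                      [5., 6., 7., 8.],
                      [3., 2., 1., 0.],
                      [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)   # NCHW layout
    pool = nn.MaxPool2d(kernel_size=2, stride=2)
    print(pool(x).reshape(2, 2))                               # tensor([[6., 8.], [3., 4.]])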
M. Zeshan Afzal, Very Deep Learning Ch. 5 92
Typical CNN Structure

◼ Image -> convolution -> max pooling -> output

Image → Convolution → Pooling → Flattening → Fully Connected Layer → Softmax → Loss
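
A minimal PyTorch sketch of this pipeline for 28x28 grayscale input (the layer sizes are illustrative assumptions, not specified by the slide):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution
        nn.ReLU(),
        nn.MaxPool2d(2, 2),                          # max pooling: 28x28 -> 14x14
        nn.Flatten(),                                # flattening: 8 * 14 * 14 features
        nn.Linear(8 * 14 * 14, 10),                  # fully connected layer -> class scores
    )
    loss_fn = nn.CrossEntropyLoss()                  # softmax + cross-entropy loss
    x, y = torch.randn(4, 1, 28, 28), torch.randint(0, 10, (4,))
    print(loss_fn(model(x), y))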

M. Zeshan Afzal, Very Deep Learning Ch. 5 94


Thanks a lot for your Attention

M. Zeshan Afzal, Very Deep Learning Ch. 5 95
