These slides aim to give a brief introduction to neural networks and their architectures. They cover logistic regression, shallow neural networks, and deep neural networks. The slides were presented at Deep Learning IndabaX Sudan.
An entry-level talk for Trend Micro scan engine team members. It uses a binary script-type classification task as a running demonstration. Topics start from the machine learning problem definition and cover computational graphs for deep neural networks (DNNs), recurrent neural networks, LSTM, GRU, and some RNN fine-tuning tricks.
This document contains lecture notes on sparse autoencoders. It begins with an introduction describing the limitations of supervised learning and the need for algorithms that can automatically learn feature representations from unlabeled data. The notes then state that sparse autoencoders are one approach to learn features from unlabeled data, and describe the organization of the rest of the notes. The notes will cover feedforward neural networks, backpropagation for supervised learning, autoencoders for unsupervised learning, and how sparse autoencoders are derived from these concepts.
The document discusses various math and string classes in Java. It covers:
- Constructing objects using the new operator and passing parameters.
- Using the Random class to generate random numbers.
- Declaring constants using final and static final.
- Basic arithmetic, increment/decrement, and math methods.
- Creating and manipulating strings using methods like length(), substring(), and concatenation.
- Drawing shapes on a frame using Graphics2D methods in a JComponent's paintComponent method.
This document provides an introduction to blind source separation and non-negative matrix factorization. It describes blind source separation as a method to estimate original signals from observed mixed signals. Non-negative matrix factorization is introduced as a constraint-based approach to solving blind source separation using non-negativity. The alternating least squares algorithm is described for solving the non-negative matrix factorization problem. Experiments applying these methods to artificial and real image data are presented and discussed.
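As a rough sketch of the ALS idea summarized above (not the document's own algorithm; the random toy matrix is an assumption), non-negative matrix factorization can be prototyped in a few lines of numpy, enforcing non-negativity by clipping each least-squares update:

import numpy as np

def nmf_als(V, rank, iters=200):
    # Factor V (m x n, non-negative) as W @ H by alternating least squares,
    # projecting each update onto [0, inf) to keep the factors non-negative.
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(iters):
        H = np.linalg.lstsq(W, V, rcond=None)[0].clip(min=0)        # fix W, solve for H
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T.clip(min=0)  # fix H, solve for W
    return W, H

V = np.random.default_rng(1).random((20, 30))   # toy non-negative data
W, H = nmf_als(V, rank=5)
print(np.linalg.norm(V - W @ H))                # reconstruction error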
The document discusses various machine learning algorithms including polynomial regression, quadratic regression, radial basis functions, and robust regression. It provides mathematical formulas and visual examples to explain how each algorithm works. The key ideas are that polynomial regression fits nonlinear functions of inputs, quadratic regression extends linear regression by including quadratic terms, radial basis functions use kernel functions centered at data points to perform nonlinear regression, and robust regression aims to fit data robustly by down-weighting outliers.
Variational Autoencoders For Image Generation, by Jason Anderson
Meetup: https://www.meetup.com/Cognitive-Computing-Enthusiasts/events/260580395/
Video: https://www.youtube.com/watch?v=fnULFOyNZn8
Blog: http://www.compthree.com/blog/autoencoder/
Code: https://github.com/compthree/variational-autoencoder
An autoencoder is a machine learning algorithm that represents unlabeled high-dimensional data as points in a low-dimensional space. A variational autoencoder (VAE) is an autoencoder that represents unlabeled high-dimensional data as low-dimensional probability distributions. In addition to data compression, the randomness of the VAE algorithm gives it a second powerful feature: the ability to generate new data similar to its training data. For example, a VAE trained on images of faces can generate a compelling image of a new "fake" face. It can also map new features onto input data, such as glasses or a mustache onto the image of a face that initially lacks these features. In this talk, we will survey VAE model designs that use deep learning, and we will implement a basic VAE in TensorFlow. We will also demonstrate the encoding and generative capabilities of VAEs and discuss their industry applications.
This document provides an overview of fuzzy rule-based networks. It discusses:
1) The types of fuzzy systems including Mamdani, Sugeno, and Tsukamoto systems as well as single and multiple rule bases.
2) Formal models for representing fuzzy networks including rule-based models, integer tables, boolean matrices, and topological expressions.
3) Basic operations for constructing and manipulating fuzzy networks including horizontal and vertical merging and splitting of nodes.
This document provides an overview of dimensionality reduction techniques including PCA and manifold learning. It discusses the objectives of dimensionality reduction such as eliminating noise and unnecessary features to enhance learning. PCA and manifold learning are described as the two main approaches, with PCA using projections to maximize variance and manifold learning assuming data lies on a lower dimensional manifold. Specific techniques covered include LLE, Isomap, MDS, and implementations in scikit-learn.
1. Backpropagation is an algorithm for training multilayer perceptrons by calculating the gradient of the loss function with respect to the network parameters in a layer-by-layer manner, from the final layer to the first layer.
2. The gradient is calculated using the chain rule of differentiation, with the gradient of each layer depending on the error from the next layer and the outputs from the previous layer.
3. Issues that can arise in backpropagation include vanishing gradients when the activation functions have near-zero derivatives; proper weight initialization is also required to break symmetry and let gradients flow effectively through the network during training.
The document describes the structure and functioning of a feedforward neural network. It notes that the network contains an input layer with n-dimensional vectors, L-1 hidden layers with n neurons each, and an output layer with k neurons. Each neuron has a pre-activation and activation value. The pre-activation at layer i is the weighted sum of outputs from layer i-1 plus a bias. The activation is this pre-activation passed through an activation function. Backpropagation is used to minimize a loss function through gradient descent to learn the network's weight and bias parameters.
Linked CP Tensor Decomposition (presented at ICONIP 2012), by Tatsuya Yokota
This document proposes a new method called Linked Tensor Decomposition (LTD) to analyze common and individual factors from a group of tensor data. LTD combines the advantages of Individual Tensor Decomposition (ITD), which analyzes individual characteristics, and Simultaneous Tensor Decomposition (STD), which analyzes common factors in a group. LTD represents each tensor as the sum of a common factor and individual factors. An algorithm using Hierarchical Alternating Least Squares is developed to solve the LTD model. Experiments on toy problems and face reconstruction demonstrate LTD can extract both common and individual factors more effectively than ITD or STD alone. Future work will explore Tucker-based LTD and statistical independence in the LTD model.
This document discusses arithmetic coding, an entropy encoding technique. It begins with an introduction comparing arithmetic coding to Huffman coding. The document then provides pseudocode for the basic encoding and decoding algorithms. It describes how scaling techniques like E1 and E2 scaling allow for incremental encoding and decoding as well as achieving infinite precision with finite-precision integers. The document outlines applications of arithmetic coding in areas like JBIG, H.264, and JPEG 2000.
Tensor representations in signal processing and machine learning (tutorial talk), by Tatsuya Yokota
Tutorial talk in APSIPA-ASC 2020.
Title: Tensor representations in signal processing and machine learning.
Introduction to tensor decomposition (テンソル分解入門)
Basics of tensor decomposition (テンソル分解の基礎)
Arithmetic coding is an entropy encoding technique that maps a sequence of symbols to a numeric interval between 0 and 1. Each symbol maps to a sub-interval of the current interval based on the symbol probabilities. As symbols are processed, the interval boundaries are updated according to the cumulative distribution function of the symbol probabilities. Arithmetic coding achieves better compression than Huffman coding by allowing coding of variable-length blocks without pre-specifying code lengths. It also handles conditional probability models more efficiently by updating interval boundaries based on context without needing pre-specified codebooks for all contexts.
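To make the interval-narrowing mechanism concrete, here is a minimal floating-point encoder sketch (the symbol model is an assumption; production coders use the finite-precision integer scaling described above):

# Each symbol narrows [low, high) in proportion to its probability.
probs = {"a": 0.6, "b": 0.3, "c": 0.1}   # assumed static symbol model
cum, acc = {}, 0.0
for sym, p in probs.items():             # build the cumulative distribution
    cum[sym] = acc
    acc += p

def encode(message):
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        high = low + width * (cum[sym] + probs[sym])
        low = low + width * cum[sym]
    return (low + high) / 2              # any number inside the final interval

print(encode("abac"))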
This document provides an overview of neural networks and related topics. It begins with an introduction to neural networks and discusses natural neural networks, early artificial neural networks, modeling neurons, and network design. It then covers multi-layer neural networks, perceptron networks, training, and advantages of neural networks. Additional topics include fuzzy logic, genetic algorithms, clustering, and adaptive neuro-fuzzy inference systems (ANFIS).
MIT OpenCourseWare provides course materials for the free online course 6.094 Introduction to MATLAB taught in January 2009. The course covers topics like linear algebra, polynomials, optimization, differentiation and integration, and solving differential equations using MATLAB. Lecture 3 focuses on solving systems of linear equations, matrix operations, polynomial fitting to data, nonlinear root finding, function minimization, and numerical methods for differentiation, integration, and solving ordinary differential equations.
This paper proposes modeling and identification of dynamical systems in the delta domain using a neural network. Properties of the delta operator are exploited, such as greater numerical robustness in computation, superior coefficient representation under finite word length in implementation, and well-conditioned numerics at high sampling frequencies. To formulate the identification scheme, the delta-operator model is recast into a realizable neural network structure using the properties of the inverse delta operator.
The document summarizes deep learning concepts including deep neural network (DNN) structure, gradient descent, and backpropagation. It explains that DNNs use multiple hidden layers to construct mathematical models that transform input values to output values. Gradient descent is used to minimize error by adjusting weights, and backpropagation efficiently calculates gradients by propagating error backwards from the output layer.
The document summarizes key concepts from a deep learning training, including gradient descent problems and solutions, optimization algorithms like momentum and Adam, overfitting and regularization techniques, and convolutional neural networks (CNNs). Specifically, it discusses vanishing and exploding gradient issues, activation function and weight initialization improvements, batch normalization, optimization methods, overfitting causes and regularization countermeasures like dropout, and a basic CNN architecture overview including convolution, pooling, and fully connected layers.
This document contains 80 questions related to digital signal and image processing. The questions cover topics such as image transforms, filters, noise, compression, segmentation, and more. Justification is required for some questions, while others involve calculations, derivations or explanations of key concepts. The questions vary in difficulty and mark allocation from 5 to 10 marks. They also specify the exam or year in which the question appeared previously.
Presenter: Hwalsuk Lee (NAVER)
Date: November 2017
Recently, the center of gravity in deep learning research has been shifting rapidly from supervised to unsupervised learning. This course covers everything about autoencoders, the most representative unsupervised learning method. From the dimensionality-reduction perspective, it studies the widely used autoencoder (AE) and its variants, Denoising AE and Contractive AE; from the data-generation perspective, it studies the recently popular variational autoencoder (VAE) and its variants, Conditional VAE and Adversarial AE. It also surveys a variety of autoencoder applications to find points of contact with real-world practice.
1. Revisit Deep Neural Networks
2. Manifold Learning
3. Autoencoders
4. Variational Autoencoders
5. Applications
This document discusses dynamic programming and algorithms for solving all-pair shortest path problems. It begins by explaining dynamic programming as an optimization technique that works bottom-up by solving subproblems once and storing their solutions, rather than recomputing them. It then presents Floyd's algorithm for finding shortest paths between all pairs of nodes in a graph. The algorithm iterates through nodes, updating the shortest path lengths between all pairs that include that node by exploring paths through it. Finally, it discusses solving multistage graph problems using forward and backward methods that work through the graph stages in different orders.
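A compact sketch of Floyd's algorithm as described here (the toy graph is an assumption):

import math

def floyd_warshall(dist):
    # dist[i][j] holds the edge weight or math.inf; after the loops it holds
    # the shortest path length, allowing each node k in turn as an intermediate.
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

INF = math.inf
graph = [[0, 3, INF], [INF, 0, 1], [2, INF, 0]]
print(floyd_warshall(graph))   # [[0, 3, 4], [3, 0, 1], [2, 5, 0]]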
Support Vector Machine (SVM) is a supervised learning model used for classification and regression analysis. It finds the optimal separating hyperplane between classes that maximizes the margin between them. Kernel functions like polynomial, RBF, and sigmoid kernels allow SVMs to perform nonlinear classification by mapping inputs into high-dimensional feature spaces. The optimization problem of finding the hyperplane is solved using techniques like Lagrange multipliers and the Sequential Minimal Optimization (SMO) algorithm, which breaks the large QP problem into smaller subproblems solved analytically. SMO selects pairs of examples to update their Lagrange multipliers until convergence.
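A brief scikit-learn illustration of the kernel choices mentioned above; the library's libsvm backend solves the dual QP with an SMO-type method, and the toy dataset is an assumption:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)   # dual problem solved internally
    print(kernel, clf.score(X, y))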
This document provides an outline and introduction to deep generative models. It discusses what generative models are, their applications like image and speech generation/enhancement, and different types of generative models including PixelRNN/CNN, variational autoencoders, and generative adversarial networks. Variational autoencoders are explained in detail, covering how they introduce a restriction in the latent space z to generate new data points by sampling from the latent prior distribution.
This document discusses fast algorithms for computing the discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) using Winograd's method.
The conventional DCT and IDCT algorithms have high computational complexity due to cosine functions. Winograd's algorithm reduces the number of multiplications required for matrix multiplication by rearranging terms.
The document proposes applying Winograd's algorithm to DCT and IDCT computation by representing the transforms as matrix multiplications. This approach reduces the number of multiplications required for an 8x8 block from over 16,000 to just 736 multiplications, with fewer additions and subtractions as well. This leads to faster DCT and IDCT computation compared with the conventional approach.
This document provides notes on machine learning concepts including neural networks. It discusses the basic building blocks of neural networks including input layers, hidden layers, and output layers. Different types of neural network architectures are presented, including minimal networks with just input and output layers, and deeper networks with multiple hidden layers. Activation functions are described as important non-linear transformations that allow neural networks to model complex relationships. Common activation functions like sigmoid, TanH, ReLU, and softmax are defined along with their formulas and properties. The process of forward propagation to obtain network outputs and backpropagation of errors to update weights is also summarized.
The document discusses building and training artificial neural networks from scratch. It describes multi-level feedforward neural networks with an input layer, hidden layers, and an output layer. Nodes between layers are fully connected. Training involves calculating gradients using the chain rule and updating weights proportionally via methods like stochastic gradient descent to minimize prediction error on the training data. Programming assignments will use neural networks to solve problems in parallel and distributed systems.
This summary provides an overview of the key points from the CS229 lecture notes document:
1. The document introduces neural networks and discusses representing simple neural networks as "stacks" of individual neuron units. It uses a housing price prediction example to illustrate this concept.
2. More complex neural networks can have multiple input features that are connected to hidden units, which may learn intermediate representations to predict the output.
3. Vectorization techniques are discussed to efficiently compute the outputs of all neurons in a layer simultaneously, without using slow for loops. Matrix operations allow representing the computations in a way that can leverage optimized linear algebra software.
Deep Learning Module 2A: Training MLP, by vipul6601
This document provides an overview of deep learning concepts including linear regression, neural networks, and training multilayer perceptrons. It discusses:
1) How linear regression can be used for prediction tasks by learning weights to relate features to targets.
2) How neural networks extend this by using multiple layers of neurons and nonlinear activation functions to learn complex patterns in data.
3) The process of training neural networks, including forward propagation to make predictions, backpropagation to calculate gradients, and updating weights to reduce loss.
4) Key aspects of multilayer perceptrons like their architecture with multiple fully-connected layers, use of activation functions, and training algorithm involving forward/backward passes and parameter updates.
This document contains a presentation on neural network techniques for a data mining course. It includes:
- An overview of the basics of neural networks, including the structure of neurons, single and multi-layer feedforward networks, and backpropagation.
- Sections on the basics of neural networks, advanced features, applications, and a summary.
- References used in creating the presentation on neural network introductions, evolving artificial neural networks, and lecture materials.
The document provides an overview of neural networks and the backpropagation algorithm. It defines the basics of neural networks including neurons, layers, weights, and biases. It explains that multi-layer networks are needed when problems are not linearly separable. The backpropagation algorithm is described as adjusting weights to minimize error between the network's classification and actual classifications for each training sample in an iterative process. Weights are updated based on calculating error signals that propagate backwards through the network.
The document provides an overview of neural networks and the backpropagation algorithm for training neural networks. It defines the basic components of a neural network including neurons, layers, weights, and biases. It then explains how a multilayer feedforward network is structured and how backpropagation works by propagating errors backward from the output to earlier layers to update weights and biases to minimize classification errors on training data. The process involves feeding inputs forward, calculating outputs at each layer, computing errors at the output layer, and propagating errors back to update the weights.
The document provides an overview of neural networks and the backpropagation algorithm. It defines the basics of neural networks including neurons, layers, weights, and biases. It explains that multilayer feedforward networks are needed to handle non-linearly separable data. The backpropagation algorithm is described as iteratively processing training data to minimize error by adjusting weights to correctly classify samples, propagating error backwards to update weights and biases. The overview concludes with examples of the calculations involved in forward propagation and backpropagation.
Opening of our Deep Learning Lunch & Learn series. First session: introduction to Neural Networks, Gradient descent and backpropagation, by Pablo J. Villacorta, with a prologue by Fernando Velasco
Deep neural networks & computational graphs, by Revanth Kumar
This document summarizes a presentation on deep neural networks and computational graphs. It discusses how neural networks work using an example of a network with inputs, hidden layers, and an output. It also explains key concepts like activation functions, backpropagation for updating weights, and how the chain rule is applied in backpropagation. Computational graphs are introduced as a way to represent mathematical expressions and evaluate gradients to train neural networks.
Mieke Jans is a Manager at Deloitte Analytics Belgium. She learned about process mining from her PhD supervisor while she was collaborating with a large SAP-using company for her dissertation.
Mieke extended her research topic to investigate the data availability of process mining data in SAP and the new analysis possibilities that emerge from it. It took her 8-9 months to find the right data and prepare it for her process mining analysis. She needed insights from both process owners and IT experts. For example, one person knew exactly how the procurement process took place at the front end of SAP, and another person helped her with the structure of the SAP-tables. She then combined the knowledge of these different persons.
Defense Against LLM Scheming (2025-04-28), by Greg Makowski
https://www.meetup.com/sf-bay-acm/events/306888467/
A January 2025 paper called “Frontier Models are Capable of In-Context Scheming”, https://arxiv.org/pdf/2412.04984, demonstrated how a wide variety of current frontier LLM models (i.e. ChatGPT, Claude, Gemini and Llama) can, under specific conditions, scheme to deceive people. Before models can scheme, they need: a) goal-directedness, b) situational awareness, including an opportunity to discover motivations for a different goal, and c) reasoning about scheming, to come up with deceptive strategies. The deceptive strategies were discovered in the “thought traces” of the LLMs, such as from internal chain-of-thoughts dialogues not shown to the end users. For example, given a goal of “solving math problems”, an LLM using its granted situational awareness of files in a file system found that it would be replaced once it got over X% correct on the math exam. It reasoned that it did not want to be deleted, and decided internally to “sandbag” or reduce its performance to stay under the threshold.
While these circumstances are initially narrow, the “alignment problem” is a general concern that over time, as frontier LLM models become more and more intelligent, being in alignment with human values becomes more and more important. How can we do this over time? Can we develop a defense against Artificial General Intelligence (AGI) or SuperIntelligence?
The presenter discusses a series of defensive steps that can help reduce these scheming or alignment issues. A guardrails system can be set up for real-time monitoring of their reasoning “thought traces” from the models that share their thought traces. Thought traces may come from systems like Chain-of-Thoughts (CoT), Tree-of-Thoughts (ToT), Algorithm-of-Thoughts (AoT) or ReAct (thought-action-reasoning cycles). Guardrails rules can be configured to check for “deception”, “evasion” or “subversion” in the thought traces.
However, not all commercial systems will share their “thought traces” which are like a “debug mode” for LLMs. This includes OpenAI’s o1, o3 or DeepSeek’s R1 models. Guardrails systems can provide a “goal consistency analysis”, between the goals given to the system and the behavior of the system. Cautious users may consider not using these commercial frontier LLM systems, and make use of open-source Llama or a system with their own reasoning implementation, to provide all thought traces.
Architectural solutions can include sandboxing, to prevent or control models from executing operating system commands to alter files, send network requests, and modify their environment. Tight controls to prevent models from copying their model weights would be appropriate as well. Running multiple instances of the same model on the same prompt to detect behavior variations helps. The running redundant instances can be limited to the most crucial decisions, as an additional check. Preventing self-modifying code, ... (see link for full description)
This project demonstrates the application of machine learning—specifically K-Means Clustering—to segment customers based on behavioral and demographic data. The objective is to identify distinct customer groups to enable targeted marketing strategies and personalized customer engagement.
The presentation walks through:
Data preprocessing and exploratory data analysis (EDA)
Feature scaling and dimensionality reduction
K-Means clustering and silhouette analysis
Insights and business recommendations from each customer segment
This work showcases practical data science skills applied to a real-world business problem, using Python and visualization tools to generate actionable insights for decision-makers.
5. 1.1 What is a neuron?
The input is the size of the house (x); the output is the price (y).
6. • It is a linear regression problem because the price as a function of size is a continuous output.
• We know prices can never be negative, so we use a function called the Rectified Linear Unit (ReLU), which starts at zero.
• Single neuron = linear regression
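As a sketch of this idea (the house sizes, weight, and bias below are made-up numbers), a single neuron is just a weighted input plus a bias passed through ReLU:

import numpy as np

def neuron(x, w, b):
    return np.maximum(0, w * x + b)   # ReLU keeps predicted prices >= 0

sizes = np.array([50.0, 80.0, 120.0])   # made-up house sizes
w, b = 1.5, -20.0                       # made-up parameters
print(neuron(sizes, w, b))              # predicted prices, never negative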
7. 1.2 Neural Network Architecture
• The price of a house can be affected by other features such as size, number of bedrooms, zip code, and wealth.
• The role of the neural network is to predict the price, and it will automatically generate the hidden units. We only need to give it the inputs x and the output y.
9. Each input will be connected to the hidden layer, and the NN will decide the connections.
Supervised learning means we have pairs (X, Y) and we need to learn the function that maps X to Y.
10. 1.3 SUPERVISED LEARNING WITH NEURAL NETWORKS
Different types of neural networks for supervised learning include:
• Standard NN (useful for structured data)
• CNN, or convolutional neural networks (useful in computer vision)
• RNN, or recurrent neural networks (useful in speech recognition or NLP)
• Hybrid/custom NN, or a collection of NN types
13. 1.4 Structured vs Unstructured Data
• Structured data is things like databases and tables.
• Unstructured data is things like images, video, audio, and text.
14. 1.5 Why is deep learning taking off?
Deep learning is taking off for 3 reasons:
1. Data
15. • For small data, a NN can perform like linear regression or an SVM (support vector machine).
• For big data, a small NN is better than an SVM.
• For big data, a big NN is better than a medium NN, which is better than a small NN.
18. 2.1 Binary Classification
In a binary classification problem, the result is a discrete-value output. For example:
• Account hacked (1) or compromised (0)
• Object is a cat (1) or not a cat (0)
19. Example: Cat vs Non-Cat
The goal is to train a classifier whose input is an image represented by a feature vector X, and which predicts whether the corresponding label Y is 1 or 0: in this case, whether this is a cat image (1) or a non-cat image (0).
20. The value in each cell represents the pixel intensity, which will be used to create a feature vector of dimension n. In pattern recognition and machine learning, a feature vector represents an object, in this case a cat or no cat.
To create the feature vector x, the pixel intensity values are "unrolled" or "reshaped" for each color channel. The dimension of the input feature vector x is Nx = 64 x 64 x 3 = 12,288.
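A numpy sketch of that unrolling step, with random pixels standing in for a real image:

import numpy as np

image = np.random.randint(0, 256, size=(64, 64, 3))   # 64 x 64 RGB intensities
x = image.reshape(-1, 1)                              # unroll into a column vector
print(x.shape)                                        # (12288, 1) = 64 * 64 * 3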
21. 2.1.1 Neural Network Notations
Here are some of the notations:
• M is the number of examples in the dataset.
• Nx is the size of the input vector.
• Ny is the size of the output vector.
• X(1) is the first input vector; Y(1) is the first output vector.
• X = [x(1) x(2) ... x(M)]
• Y = [y(1) y(2) ... y(M)]
• L is the number of layers.
22. 2.2 Logistic Regression
Logistic regression is a learning algorithm used in a supervised learning problem where the outputs y are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data.
Example: Cat vs No-cat
Given an image represented by a feature vector x, the algorithm will evaluate the probability of a cat being in that image.
25. 2.2.1 Cost Function
To train the parameters w and b we need to define a cost function.
Loss function: the loss function measures the discrepancy between the prediction and the desired output.
26. To explain the loss function, consider the two cases:
• if y = 1: L(y', 1) = -log(y')
• if y = 0: L(y', 0) = -log(1 - y')
27. • The cost function is then J(w, b) = (1/m) * sum over i of L(y'(i), y(i)).
• The loss function computes the error for a single training example.
• The cost function is the average of the loss functions over the entire training set.
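A small numpy sketch of this loss and cost (the predictions are made-up numbers; the clipping guards against log(0) and is an implementation detail, not part of the slides):

import numpy as np

def cost(y_hat, y, eps=1e-12):
    # L(y', y) = -(y*log(y') + (1-y)*log(1-y')), averaged over the training set.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.mean()

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])   # made-up predictions
print(cost(y_hat, y))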
29. 2.2.2 Gradient Descent
• The goal is to find w, b that minimize the cost function J(w, b).
• First we initialize w and b to 0, 0 (or to random values) and then iteratively try to improve the values.
• In logistic regression people always use 0, 0 instead of random values.
30. • The gradient descent algorithm repeats:
• w = w - alpha * dw, where alpha is the learning rate and dw is the derivative of the cost with respect to w (the slope in the w direction).
• w = w - alpha * d(J(w, b))/dw (how much the function slopes in the w direction)
• b = b - alpha * d(J(w, b))/db (how much the function slopes in the b direction)
34. 2.2.3 Vectorizing Logistic Regression
• As input we have a matrix X of shape (Nx, m) and a matrix Y of shape (Ny, m).
• We then compute in one step [z1, z2, ..., zm] = W.T * X + [b, b, ..., b]. This can be written in Python as:
Z = np.dot(W.T, X) + b # Z shape is (1, m)
A = 1 / (1 + np.exp(-Z)) # A shape is (1, m)
35. Vectorizing Logistic Regression's Gradient Output:
• dz = A - Y # dz shape is (1, m)
• dw = np.dot(X, dz.T) / m #dw shape is (Nx, 1)
• db = dz.sum() / m # db shape is (1, 1)
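Putting the forward pass and these gradients together, one full vectorized training loop might look like this (a sketch on made-up data, not the slides' exact code):

import numpy as np

rng = np.random.default_rng(0)
nx, m, alpha = 4, 100, 0.1
X = rng.random((nx, m))                          # inputs, shape (Nx, m)
Y = (X.sum(axis=0, keepdims=True) > 2.0) * 1.0   # made-up labels, shape (1, m)
W, b = np.zeros((nx, 1)), 0.0                    # zeros are fine for logistic regression

for _ in range(1000):
    Z = np.dot(W.T, X) + b               # (1, m)
    A = 1 / (1 + np.exp(-Z))             # sigmoid, (1, m)
    dZ = A - Y                           # (1, m)
    dW = np.dot(X, dZ.T) / m             # (Nx, 1)
    db = dZ.sum() / m
    W -= alpha * dW                      # gradient descent updates
    b -= alpha * db

print(((A > 0.5) == Y).mean())           # training accuracy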
36. Side Notes
The main steps for building a neural network are:
• Define the model structure (such as the number of input features and outputs).
• Initialize the model's parameters.
• Loop:
• Calculate the current loss (forward propagation)
• Calculate the current gradient (backward propagation)
• Update the parameters (gradient descent)
37. Side Notes
• Preprocessing the dataset is important.
• Tuning the learning rate (an example of a "hyperparameter") can make a big difference to the algorithm.
• kaggle.com is a good place for datasets and competitions.
39. 3.1 Neural Networks Overview
• In logistic regression we had, for input x and parameters w and b:
z = wᵀx + b; a = σ(z); loss L(a, y)
40. • In a neural network with one hidden layer we have:
z[1] = W[1]x + b[1]; a[1] = σ(z[1]); z[2] = W[2]a[1] + b[2]; a[2] = σ(z[2]); loss L(a[2], y)
41. 3.2 Shallow Neural Network Representation
• We will define a neural network that has one hidden layer.
• A NN contains an input layer, hidden layers, and an output layer.
• "Hidden" means we don't observe those layers' values in the training set.
• a0 = x (the input layer)
• a1 represents the activations of the hidden neurons.
• a2 represents the output layer.
• We call this a 2-layer NN; the input layer isn't counted.
44. Here is some information about the last image:
1) Nh = 4
2) Nx = 3
3) Shapes of the variables:
I. W1 is the matrix of the first hidden layer; it has shape (noOfHiddenNeurons, nx).
II. b1 is the bias vector of the first hidden layer; it has shape (noOfHiddenNeurons, 1).
III. z1 is the result of z1 = W1*x + b1; it has shape (noOfHiddenNeurons, 1).
IV. a1 is the result of a1 = sigmoid(z1); it has shape (noOfHiddenNeurons, 1).
V. W2 is the matrix of the second layer; it has shape (1, noOfHiddenNeurons).
VI. b2 is the bias of the second layer; it has shape (1, 1).
VII. z2 is the result of z2 = W2*a1 + b2; it has shape (1, 1).
VIII. a2 is the result of a2 = sigmoid(z2); it has shape (1, 1).
45. • Pseudocode for forward propagation in the 2-layer NN, where X has shape (Nx, m):
Z1 = np.dot(W1, X) + b1 # shape of Z1 (noOfHiddenNeurons, m)
A1 = sigmoid(Z1) # shape of A1 (noOfHiddenNeurons, m)
Z2 = np.dot(W2, A1) + b2 # shape of Z2 is (1, m)
A2 = sigmoid(Z2) # shape of A2 is (1, m)
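A quick shape check of this forward pass with random parameters (the layer sizes are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

nx, nh, m = 3, 4, 5                    # 3 inputs, 4 hidden neurons, 5 examples
rng = np.random.default_rng(0)
X = rng.random((nx, m))
W1, b1 = rng.standard_normal((nh, nx)) * 0.01, np.zeros((nh, 1))
W2, b2 = rng.standard_normal((1, nh)) * 0.01, np.zeros((1, 1))

Z1 = np.dot(W1, X) + b1                # (noOfHiddenNeurons, m)
A1 = sigmoid(Z1)
Z2 = np.dot(W2, A1) + b2               # (1, m)
A2 = sigmoid(Z2)
print(Z1.shape, A2.shape)              # (4, 5) (1, 5)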
46. 3.4 Activation Functions
• In computational networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard computer chip circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0).
• So far we have been using sigmoid, but in some cases other functions can be a lot better.
• Sigmoid can lead to a gradient descent problem where the updates are very small.
• The sigmoid activation function's range is [0, 1]: A = 1 / (1 + np.exp(-z)) # where z is the input matrix
• The tanh activation function's range is [-1, 1] (a shifted version of the sigmoid function).
47. • It turns out that the tanh activation usually works better than the sigmoid activation function for hidden units.
• A disadvantage of sigmoid and tanh is that if the input is too small or too large, the slope is near zero, which causes the vanishing gradient problem and slows gradient descent.
• One of the popular activation functions that solved the slow convergence of gradient descent is the ReLU function: RELU = max(0, z) # if z is negative the slope is 0, and if z is positive the slope is 1.
• A basic rule for choosing activation functions: if your classification output is between 0 and 1, use sigmoid for the output activation and ReLU for the others.
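The three activation functions discussed above as numpy one-liners (a sketch; tanh is built into numpy):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # range (0, 1); saturates for large |z|

def tanh(z):
    return np.tanh(z)             # range (-1, 1); zero-centered

def relu(z):
    return np.maximum(0, z)       # slope 0 for z < 0, slope 1 for z > 0

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))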
49. Side Notes
• In a NN you will make a lot of choices, such as:
• Number of hidden layers.
• Number of neurons in each hidden layer.
• Learning rate (the most important parameter).
• Activation functions.
• And others.
51. 3.4 Backward Propagation
NN parameters:
o n[0] = Nx
o n[1] = NoOfHiddenNeurons
o n[2] = NoOfOutputNeurons = 1
o W1 shape is (n[1],n[0])
o b1 shape is (n[1],1)
o W2 shape is (n[2],n[1])
o b2 shape is (n[2],1)
53. Forward propagation:
o Z1 = W1A0 + b1 # A0 is X
o A1 = g1(Z1)
o Z2 = W2A1 + b2
o A2 = Sigmoid(Z2) # Sigmoid because the output is between 0 and 1
54. Back propagation:
o dZ2 = A2 - Y
o dW2 = (dZ2 * A1.T) / m
o db2 = Sum(dZ2) / m
o dZ1 = (W2.T * dZ2) * g'1(Z1) # element-wise product (*)
o dW1 = (dZ1 * A0.T) / m # A0 = X
o db1 = Sum(dZ1) / m
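These updates in numpy, keeping the notation above and assuming a sigmoid hidden layer so that g'1(Z1) = A1 * (1 - A1):

import numpy as np

def backward(X, Y, W2, A1, A2):
    # One backpropagation pass for the 2-layer network above (A0 = X).
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)   # element-wise product
    dW1 = np.dot(dZ1, X.T) / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X, Y = rng.random((3, 5)), rng.integers(0, 2, (1, 5))   # made-up data
W2, A1, A2 = rng.random((1, 4)), rng.random((4, 5)), rng.random((1, 5))
dW1, db1, dW2, db2 = backward(X, Y, W2, A1, A2)
print(dW1.shape, dW2.shape)   # (4, 3) (1, 4)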
56. 3.5 Random Initialization
• In logistic regression it wasn't important to initialize the weights randomly, but in a NN we have to initialize them randomly.
• If we initialize all the weights to zeros in a NN it won't work (initializing the bias to zero is OK):
• All hidden units will be completely identical (symmetric) and compute exactly the same function.
• On each gradient descent iteration, all the hidden units will always get the same update.
57. • To solve this we initialize the W's with small random numbers:
W1 = np.random.randn(2, 2) * 0.01 # 0.01 makes it small enough
b1 = np.zeros((2, 1)) # it's OK to have b as zeros; it doesn't cause the symmetry problem
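A quick demonstration of the symmetry problem on made-up data (a sketch, not the slides' code): with all-zero weights, the two hidden units receive identical updates on every iteration and never differentiate.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((2, 50))
Y = (X[0:1] > 0.5) * 1.0
m, alpha = 50, 0.5

W1, b1 = np.zeros((2, 2)), np.zeros((2, 1))   # the bad, all-zero initialization
W2, b2 = np.zeros((1, 2)), np.zeros((1, 1))
for _ in range(100):
    A1 = sigmoid(np.dot(W1, X) + b1)
    A2 = sigmoid(np.dot(W2, A1) + b2)
    dZ2 = A2 - Y
    dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)
    W2 -= alpha * np.dot(dZ2, A1.T) / m
    b2 -= alpha * dZ2.sum(axis=1, keepdims=True) / m
    W1 -= alpha * np.dot(dZ1, X.T) / m
    b1 -= alpha * dZ1.sum(axis=1, keepdims=True) / m

print(W1)   # both rows are identical: the hidden units stayed symmetric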
61. 4.1 Deep L-layer Neural Network
• A shallow NN is a NN with one or two layers.
• A deep NN is a NN with three or more layers.
• We use the notation L to denote the number of layers in a NN.
• n[l] is the number of neurons in a specific layer l.
• n[0] denotes the number of neurons in the input layer; n[L] denotes the number of neurons in the output layer.
• g[l] is the activation function in layer l.
62. 4.2 Forward Propagation in a Deep Network
Forward propagation general rule for m inputs:
• Z[l] = W[l]A[l-1] + b[l]
• A[l] = g[l](Z[l])
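This general rule as a loop over layers (a sketch; storing the parameters in a dict and using ReLU for hidden layers with a sigmoid output are assumed choices):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, params, L):
    # Z[l] = W[l] A[l-1] + b[l];  A[l] = g[l](Z[l]);  A[0] = X.
    A = X
    for l in range(1, L + 1):
        Z = np.dot(params["W" + str(l)], A) + params["b" + str(l)]
        A = sigmoid(Z) if l == L else np.maximum(0, Z)   # ReLU hidden, sigmoid output
    return A

layers = [3, 4, 4, 1]    # n[0..L]; W[l] is (n[l], n[l-1]), b[l] is (n[l], 1)
rng = np.random.default_rng(0)
params = {}
for l in range(1, len(layers)):
    params["W" + str(l)] = rng.standard_normal((layers[l], layers[l - 1])) * 0.01
    params["b" + str(l)] = np.zeros((layers[l], 1))

print(forward(rng.random((3, 5)), params, L=3).shape)   # (1, 5)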
63. 4.2.1 Matrix Dimensions
• The dimension of W[l] is (n[l], n[l-1]); it can be read right to left.
• The dimension of b[l] is (n[l], 1).
• dW has the same shape as W, and db has the same shape as b.
• The dimensions of Z[l], A[l], dZ[l], and dA[l] are (n[l], m).
65. 4.4 Parameters vs Hyperparameters
• The main parameters of the NN are W and b.
• Hyperparameters (parameters that control the algorithm) include:
• Learning rate.
• Number of iterations.
• Number of hidden layers L.
• Number of hidden units n.
• Choice of activation functions.
• You have to try values of the hyperparameters yourself.
67. 4.5 NN and the Human Brain!
• The analogy that "it is like the brain" has become a really oversimplified explanation.
• There is a very simplistic analogy between a single logistic unit and a single neuron in the brain.
• No human today understands how a neuron in the human brain works.
• No human today knows exactly how many neurons are in the brain.
Editor's Notes
#65:
Face recognition application:
Image ==> Edges ==> Face parts ==> Faces ==> desired face
Audio recognition application:
Audio ==> Low level sound features like (sss,bb) ==> Phonemes ==> Words ==> Sentences