Deep Learning Step by Step
1_introduction_to_deep_learning
At the outermost ring you have artificial intelligence (using computers to reason). One layer inside that is machine learning, with artificial neural networks and deep learning at the centre.
Broadly speaking, deep learning is a more approachable name for an artificial neural network. The
“deep” in deep learning refers to the depth of the network. An artificial neural network can be very
shallow.
Deep learning is used for complex tasks. Deep models are more computationally expensive than classical machine learning methods, but they tend to perform better on such tasks.
Mathematics:
Functions
Vectors
Linear Algebra ->> [Matrix, Operations on Matrices]
Differential Calculus ->> [Gradient, Partial Derivatives, Differentiation]
Graphs
Programming:
Python with OOPs concepts
Code readability
System:
At least i3 or i5 processor
At least 8gb of RAM
Good internet connection
IDE / Code Editor:
Google Colab
Paperspace
Pycharm
VS code
jupyter notebook
spyder
Neural Networks:
Biological neurons produce short electrical impulses known as action potentials, which travel along axons to the synapses, where they trigger the release of chemical signals called neurotransmitters.
When a connected neuron receives a sufficient amount of these neurotransmitters within a few milliseconds, it fires its own action potential (or does not fire; think of a logic gate).
These simple units form a powerful network, known as a Biological Neural Network (BNN), that can perform very complex computations.
To understand more about artificial neural networks (ANNs), visit the TensorFlow Playground (https://playground.tensorflow.org/) and experiment with the neural network there.
Installation of TensorFlow, Keras, PyTorch & Basics of Google Colab
TensorFlow:
If you are using Google Colab, you do not have to install TensorFlow separately: Colab has TensorFlow pre-installed, so you only have to import it. To import TensorFlow, write
import tensorflow as tf
If you are working on your local system (for example in Jupyter Notebook), you have to install it first. Create a virtual environment, activate it, and run pip install tensorflow in your Anaconda prompt; this installs the latest version of TensorFlow. To check the installed versions, use tf.__version__ for TensorFlow and tf.keras.__version__ for Keras.
PyTorch:
To install PyTorch, follow the instructions on the PyTorch website (https://pytorch.org).
Deep learning is made accessible by a number of open-source projects. Some of the most popular technologies include, but are not limited to, Deeplearning4j (DL4J), Theano, Torch, TensorFlow, and Caffe. The deciding factors in choosing one are the tech stack it targets and whether it is low-level, academic, or application focused. Here’s an overview of each:
DL4J:
JVM-based
Distributed
Integrates with Hadoop and Spark
Theano:
Torch:
Lua based
In house versions used by Facebook and Twitter
Contains pretrained models
TensorFlow:
Google written successor to Theano
Interfaced with via Python and Numpy
Highly parallel
Can be somewhat slow for certain problem sets
Caffe:
Keras and TensorFlow have both been available for more than four years (Keras was released in March 2015 and TensorFlow in November of the same year). With the rapid development of deep learning over that period, some problems with TensorFlow 1.x and Keras have become apparent:
Using TensorFlow 1.x means programming static graphs, which is difficult and inconvenient for programmers who are used to imperative programming.
The TensorFlow API is powerful and flexible, but it is complex, confusing and difficult to use.
The Keras API is productive and easy to use, but it lacks flexibility for research.
In [ ]: # Verify installation
import tensorflow as tf
Eager execution is enabled by default in TensorFlow 2.0 and needs no special setup. The following code can be used to find out whether a GPU or CPU is available:
GPU/CPU Check
In [ ]: tf.config.list_physical_devices('GPU')
In [ ]: tf.config.list_physical_devices('CPU')
GPU is available
details
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
CPU is available
details
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
In the above diagram we can see neurons of the human brain; artificial neural networks (ANNs) are loosely modelled on these neurons.
The largest and most important part of the human brain is the cerebral cortex. Although it cannot be
observed directly, various regions within the cortex are responsible for different functions, as shown in the
diagram. The cortex plays a crucial role in important cognitive processes such as memory, attention,
perception, thinking, language, and awareness.
Biological Neuron
Biological neurons produce short electrical impulses known as action potentials, which travel along axons to the synapses, where they trigger the release of chemical signals called neurotransmitters.
When a connected neuron receives a sufficient amount of these neurotransmitters within a few milliseconds, it fires its own action potential (or does not fire; think of a logic gate here).
These simple units form a powerful network, known as a Biological Neural Network (BNN), that can perform very complex computations.
Similar to the biological neuron, we have the artificial neuron, which can also be used to perform complex computations.
The first artificial neuron
The artificial neuron was introduced in 1943 by
Neurophysiologist Warren McCulloch and
Mathematician Walter Pitts
They published their work as McCulloch, W.S., Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115–133 (1943). https://doi.org/10.1007/BF02478259. The full paper is available at this link (https://www.cs.cmu.edu/~./epxing/Class/10715/reading/McCulloch.and.Pitts.pdf).
They showed that these simple neurons can perform small logical operations such as OR, NOT and AND.
The following figure shows artificial neurons (ANs) that perform (a) a buffer, (b) OR, (c) AND and (d) A-B operations.
These neurons fire only when at least two of their inputs are active.
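The four gates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `mp_neuron`, the firing threshold of two active inputs, the rule that an active inhibitory input blocks firing, and the reading of "A-B" as "A and not B" are assumptions made for this sketch rather than details taken from the figure.

```python
def mp_neuron(excitatory, inhibitory=(), threshold=2):
    """Return 1 if the neuron fires, else 0.

    Assumption: any active inhibitory input blocks firing; otherwise the
    neuron fires when at least `threshold` excitatory inputs are active.
    """
    if any(inhibitory):
        return 0
    return int(sum(excitatory) >= threshold)


def buffer(a):
    return mp_neuron([a, a])                  # A is wired in twice


def logical_or(a, b):
    return mp_neuron([a, a, b, b])            # either input alone reaches the threshold


def logical_and(a, b):
    return mp_neuron([a, b])                  # both inputs are needed


def a_and_not_b(a, b):
    return mp_neuron([a, a], inhibitory=[b])  # B vetoes the output


# Print the truth table: A, B, OR, AND, A-and-not-B
for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_or(a, b), logical_and(a, b), a_and_not_b(a, b))
```

Wiring an input in twice is what lets a single active signal reach the threshold of two, which is how the buffer and OR gates work in this sketch.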
The Perceptron
It is the simplest ANN architecture. It was invented by Frank Rosenblatt in 1957 and published as
Rosenblatt, Frank (1958), The Perceptron: A Probabilistic Model for Information
Storage and Organization in the Brain, Cornell Aeronautical Laboratory,
Psychological Review, v65, No. 6, pp. 386–408. doi:10.1037/h0042519
Its architecture differs from that of the first neuron we saw above. The unit is known as a threshold logic unit (TLU) or linear threshold unit (LTU).
Here the inputs are not just binary.
Let's look at the architecture shown below -
Common activation functions used for perceptrons are (with the threshold at 0) -
$$\operatorname{sgn}(z) = \begin{cases} -1 & z < 0 \\ 0 & z = 0 \\ 1 & z > 0 \end{cases}$$
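In [2]: import numpy as np
import matplotlib.pyplot as plt

# Assumed plotting range; x_axis is not defined in the original cell.
x_axis = np.linspace(-5, 5, 200)

def sgn(x):
    """Sign function: -1 for x < 0, 0 for x == 0, 1 for x > 0."""
    if x < 0:
        return -1
    elif x > 0:
        return 1
    return 0

# Store the values under a new name so the function sgn is not overwritten.
sgn_values = np.array(list(map(sgn, x_axis)))
plt.plot(x_axis, sgn_values)
plt.xlabel("x_axis")
plt.ylabel(r"$sgn(z)$")
plt.axhline(0, color='k', lw=1)
plt.axvline(0, color='k', lw=1)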
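The perceptron weights are updated using the learning rule
$$w_{i,j} \leftarrow w_{i,j} + \eta\,(y_j - \hat{y}_j)\,x_i$$
where,
$w_{i,j}$ : connection weight between the $i^{th}$ input neuron and the $j^{th}$ output neuron
$x_i$ : $i^{th}$ input value
$\hat{y}_j$ : output of the $j^{th}$ output neuron
$y_j$ : target output of the $j^{th}$ output neuron
$\eta$ : learning rate
For the $j^{th}$ element of the weight vector $\mathbf{w}$ of a single output unit, it can also be written as
$$w_j = w_j + \Delta w_j, \qquad \Delta w_j = \eta\,\big(y^{(i)} - \hat{y}^{(i)}\big)\,x_j^{(i)}$$
where the superscript $(i)$ denotes the $i^{th}$ training sample.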
A single TLU (Threshold Logic Unit) is a simple linear binary classifier and is therefore not suitable for non-linear problems.
Rosenblatt proved that this algorithm converges only if the data are linearly separable, and that it is guaranteed to converge in that case; this result is known as the perceptron convergence theorem.
In 1969 Marvin Minsky and Seymour Papert revealed some serious weaknesses of perceptrons, in particular that they cannot solve some simple logic operations such as XOR (exclusive OR).
These problems were later solved by the multilayer perceptron.
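Derivation:-
Let's assume that you are doing a binary classification with classes +1 and -1.
$$\mathbf{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$$
so, $z = \mathbf{w}^{\top}\mathbf{x}$
Now, for a sample $\mathbf{x}$, the activation with threshold $\theta$ is
$$\phi(z) = \begin{cases} +1 & \text{if } z \geq \theta \\ -1 & \text{if } z < \theta \end{cases}$$
Let's simplify the above equation -
$$\phi(z) = \begin{cases} +1 & \text{if } z - \theta \geq 0 \\ -1 & \text{if } z - \theta < 0 \end{cases}$$
Suppose $w_0 = -\theta$ and $x_0 = 1$
Then,
$$z' = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$
and
$$\phi(z') = \begin{cases} +1 & \text{if } z' \geq 0 \\ -1 & \text{if } z' < 0 \end{cases}$$
Here $w_0 x_0$ is usually known as the bias unit.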
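The bias trick and the perceptron update can be put together in a short NumPy sketch. It is illustrative only: the class name `PerceptronSketch`, the zero initialisation and the use of 0/1 labels (chosen to match the truth tables used for the logic gates, rather than the +1/-1 labels of the derivation) are assumptions, and this is not the `Perceptron` class used in the notebook cells.

```python
import numpy as np


class PerceptronSketch:
    """Single threshold logic unit trained with the perceptron rule (illustrative)."""

    def __init__(self, eta=0.5, epochs=10):
        self.eta = eta          # learning rate
        self.epochs = epochs    # number of passes over the data
        self.weights = None     # [w0, w1, ..., wn], where w0 plays the role of -theta

    @staticmethod
    def _add_bias(X):
        # Prepend the constant input x0 = 1 to every sample.
        X = np.asarray(X, dtype=float)
        return np.c_[np.ones(len(X)), X]

    def predict(self, X):
        z = self._add_bias(X) @ self.weights
        return np.where(z >= 0, 1, 0)

    def fit(self, X, y):
        Xb = self._add_bias(X)
        y = np.asarray(y)
        self.weights = np.zeros(Xb.shape[1])
        for _ in range(self.epochs):
            for xi, target in zip(Xb, y):
                y_hat = 1 if xi @ self.weights >= 0 else 0
                # Perceptron rule: w <- w + eta * (target - prediction) * x
                self.weights += self.eta * (target - y_hat) * xi
        return self


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
and_model = PerceptronSketch(eta=0.5, epochs=10).fit(X, [0, 0, 0, 1])
or_model = PerceptronSketch(eta=0.5, epochs=10).fit(X, [0, 1, 1, 1])
print("AND:", and_model.predict(X), and_model.weights)
print("OR: ", or_model.predict(X), or_model.weights)
```

Because the threshold is folded into `w0`, the update treats the bias exactly like any other weight.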
AND Operation:
Out[3]: x1 x2 y
0 0 0 0
1 0 1 0
2 1 0 0
3 1 1 1
Out[4]: x1 x2
0 0 0
1 0 1
2 1 0
3 1 1
In [5]: y = AND['y']
y.to_frame()
Out[5]: y
0 0
1 0
2 0
3 1
In [7]: model.predict(X)
Out[7]: array([0, 0, 0, 1])
In [8]: model.weights
Out[8]: array([0.50004495, 0.50018565, 0.99996853])
Out[9]: ['Perceptron_model\\AND_model.model']
[0 0 0 1]
OR Operation:
In [12]: data = {"x1": [0,0,1,1], "x2": [0,1,0,1], "y": [0,1,1,1]}
OR = pd.DataFrame(data)
OR
Out[12]: x1 x2 y
0 0 0 0
1 0 1 1
2 1 0 1
3 1 1 1
Out[13]: x1 x2
0 0 0
1 0 1
2 1 0
3 1 1
In [14]: y = OR['y']
y.to_frame()
Out[14]: y
0 0
1 1
2 1
3 1
In [15]: model = Perceptron(eta = 0.5, epochs=10)
model.fit(X,y)
XOR Operation:
In [16]: data = {"x1": [0,0,1,1], "x2": [0,1,0,1], "y": [0,1,1,0]}
XOR = pd.DataFrame(data)
XOR
Out[16]: x1 x2 y
0 0 0 0
1 0 1 1
2 1 0 1
3 1 1 0
Out[17]: x1 x2
0 0 0
1 0 1
2 1 0
3 1 1
In [18]: y = XOR['y']
y.to_frame()
Out[18]: y
0 0
1 1
2 1
3 0
Conclusion:
Here we can see that the perceptron can classify only linearly separable problems such as AND and OR. In the case of XOR it could not classify the data correctly, because XOR is not linearly separable.
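One way to check this conclusion without relying on any particular training run is a brute-force search over candidate weights. The sketch below is an illustration under assumptions: the grid of candidate values and the helper name `best_accuracy` are arbitrary choices, and a unit with threshold at 0 and a bias weight `w0` is assumed.

```python
import itertools

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {"AND": [0, 0, 0, 1], "OR": [0, 1, 1, 1], "XOR": [0, 1, 1, 0]}
grid = np.linspace(-2, 2, 9)  # candidate values for w0 (bias), w1 and w2


def best_accuracy(y):
    """Best accuracy any threshold unit on the weight grid reaches for labels y."""
    y = np.asarray(y)
    best = 0.0
    for w0, w1, w2 in itertools.product(grid, repeat=3):
        pred = (w0 + w1 * X[:, 0] + w2 * X[:, 1] >= 0).astype(int)
        best = max(best, float(np.mean(pred == y)))
    return best


results = {name: best_accuracy(y) for name, y in targets.items()}
print(results)
```

For AND and OR some weight combination classifies all four points correctly, whereas for XOR no single line can do better than three points out of four.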
Let's see this graphically.
Analysis with the graph
In [20]: AND.plot(kind="scatter", x="x1", y="x2", c="y", s=50, cmap="winter")
plt.axhline(y=0, color="black", linestyle="--", linewidth=2)
plt.axvline(x=0, color="black", linestyle="--", linewidth=2)
x = np.linspace(0, 1.4) # >>> 50
y = 1.5 - 1*np.linspace(0, 1.4) # >>> 50
plt.plot(x, y, "r--")
Drawbacks of Perceptron:-
It cannot be used if the data is not linearly separable.
There are two ways to add complexity to the network and overcome this limitation:
1. Add backward connections, so that output neurons feed back to input nodes, resulting in a recurrent network
2. Add neurons between the input nodes and the outputs, creating an additional ("hidden") layer to the network, resulting in a multi-layer perceptron
How to train a multilayer network is not intuitive. Propagating the inputs forward over two layers is straightforward, since the outputs from the hidden layer can be used as inputs for the output layer. However, the process for updating the weights based on the prediction error is less clear, since it is difficult to know whether to change the weights on the input layer or on the hidden layer in order to improve the prediction.
Updating a multi-layer perceptron is a matter of:
1. moving forward through the network, calculating outputs given inputs and current weight estimates
2. moving backward, updating weights according to the resulting error from forward propagation (using the backpropagation method).
In this sense, it is similar to a single-layer perceptron, except it has to be done twice, once for each layer.
## Activation Function of ANN:
In an ANN we use the sigmoid as the activation function in each layer instead of a step function. Because an ANN has to solve non-linear problems, its output needs to vary smoothly: the sigmoid outputs any number between 0 and 1, whereas the step function outputs just 0 or 1.
Formula of Sigmoid:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
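Consider two neurons in sequence with no activation function:
$$z_1 = w_1 x_1, \qquad z_2 = w_2 z_1$$
$$z_2 = w_2 w_1 x_1$$
$$z_2 = W x_1, \qquad W = w_2 w_1$$
So the two neurons have collapsed into a single neuron: without an activation function, stacked layers behave like a single linear transformation.
Without a non-linear activation function the network cannot approximate arbitrary continuous functions. This is why each layer applies a non-linearity such as the sigmoid:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$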
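The collapse of stacked linear layers can be checked numerically. In this sketch the layer sizes and the random weights are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))    # input column vector
W1 = rng.normal(size=(4, 3))   # weights of the first layer
W2 = rng.normal(size=(2, 4))   # weights of the second layer


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


two_linear_layers = W2 @ (W1 @ x)       # two layers, no activation
single_layer = (W2 @ W1) @ x            # one layer with W = W2 W1
with_activation = W2 @ sigmoid(W1 @ x)  # sigmoid between the layers

print(np.allclose(two_linear_layers, single_layer))
print(np.allclose(with_activation, single_layer))
```

The first comparison shows that the two linear layers are equivalent to one; the second shows that inserting the sigmoid breaks this equivalence, which is what gives depth its extra expressive power.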
The key change in the MLP was the introduction of a sigmoid activation function.
In classification there can be multiple output neurons, whereas in regression there is usually a single output neuron.
Simple Example:
Let's take a simple neural network; here we consider bias = 0.
###Neural Network:
A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set
of data through a process that mimics the way the human brain operates.
A neural network consists of a single layer or multiple layers.
It can be very deep.
It can solve non-linear problems.
It can have many hidden layers.
It uses sigmoid, ReLU, softmax, etc. as activation functions.
Authors: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner
LeNet-5 has seven layers in total, not counting the input, and each of them contains trainable parameters. Each layer has multiple feature maps; each feature map extracts one characteristic of its input by means of a convolution filter, and each feature map contains multiple neurons.
Detailed explanation of each layer parameter:
INPUT Layer
The first is the data INPUT layer. The input image is normalized to a uniform size of 32 * 32.
Note: This layer is not counted as part of the LeNet-5 network structure. Traditionally, the input layer is not considered one of the layers of the network hierarchy.
C1 layer-convolutional layer
Input picture: 32 * 32
Number of neurons: 28 * 28 * 6
Detailed description:
1. The first convolution operation is performed on the input image (using 6 convolution kernels of size 5 *
5) to obtain 6 C1 feature maps (6 feature maps of size 28 * 28, 32-5 + 1 = 28).
2. Let's take a look at how many parameters are needed. The size of the convolution kernel is 5 * 5, and
there are 6 * (5 * 5 + 1) = 156 parameters in total, where +1 indicates that a kernel has a bias.
3. For the convolutional layer C1, each pixel in C1 is connected to 5 * 5 pixels and 1 bias in the input image, so there are 156 * 28 * 28 = 122304 connections in total. Thanks to weight sharing, only 156 parameters need to be learned for these 122,304 connections.
S2 layer-pooling layer
Input: 28 * 28
Sampling area: 2 * 2
Sampling method: the 4 inputs are added, multiplied by a trainable parameter, and a trainable offset is added. The result is passed through a sigmoid
Sampling type: 6
Number of neurons: 14 * 14 * 6
Number of connections: (2 * 2 + 1) * 6 * 14 * 14
The size of each feature map in S2 is 1/4 of the size of the feature map in C1.
Detailed description:
Pooling follows immediately after the first convolution. It is performed using 2 * 2 kernels, and S2 consists of 6 feature maps of size 14 * 14 (28/2 = 14).
Each S2 unit is the sum of the pixels in a 2 * 2 area of C1, multiplied by a weight coefficient, plus an offset; the result is then passed through the sigmoid.
Each pooling kernel therefore has two trainable parameters, giving 2x6 = 12 trainable parameters and 5x14x14x6 = 5880 connections.
C3 layer-convolutional layer
Each feature map in C3 is connected to all 6 or to a subset of the feature maps in S2, so each feature map of this layer is a different combination of the feature maps extracted by the previous layer.
One scheme is that the first 6 feature maps of C3 each take a subset of 3 adjacent feature maps in S2 as input. The next 6 feature maps each take a subset of 4 adjacent feature maps in S2 as input. The next 3 each take a subset of 4 non-adjacent feature maps as input. The last one takes all the feature maps in S2 as input.
Detailed description:
The second convolution follows the first pooling. Its output is C3, 16 feature maps of size 10x10, and the convolution kernel size is 5 * 5. S2 has 6 feature maps of size 14 * 14, so how do we get 16 feature maps from 6? The 16 feature maps are computed from particular combinations of the feature maps of S2, as follows:
The first 6 feature maps of C3 (the first red box in the figure above) are each connected to 3 adjacent feature maps of the S2 layer; the next 6 feature maps (the second red box in the figure above) are each connected to 4 adjacent feature maps of the S2 layer; the next 3 feature maps are each connected to 4 non-adjacent feature maps of the S2 layer; and the last one is connected to all the feature maps of the S2 layer. The convolution kernel size is still 5 * 5, so there are 6 * (3 * 5 * 5 + 1) + 6 * (4 * 5 * 5 + 1) + 3 * (4 * 5 * 5 + 1) + 1 * (6 * 5 * 5 + 1) = 1516 parameters. The output size is 10 * 10, so there are 1516 * 10 * 10 = 151600 connections.
S4 layer-pooling layer
Input: 10 * 10
Sampling area: 2 * 2
Sampling method: the 4 inputs are added, multiplied by a trainable parameter, and a trainable offset is added. The result is passed through a sigmoid
Sampling type: 16
The size of each feature map in S4 is 1/4 of the size of the feature map in C3
Detailed description:
S4 is a pooling layer. The window size is still 2 * 2 and there are 16 feature maps in total: the 16 10x10 maps of the C3 layer are pooled in units of 2x2 to obtain 16 5x5 feature maps. This layer has 2x16 = 32 trainable parameters and 5x5x5x16 = 2000 connections.
The connection scheme is similar to that of the S2 layer.
C5 layer-convolution layer
Input: all 16 feature maps of the S4 layer (fully connected to S4)
Detailed description:
The C5 layer is a convolutional layer. Since the 16 feature maps of the S4 layer are of size 5x5, the same as the convolution kernel, each output map after convolution is of size 1x1. There are 120 such convolution results, each connected to all 16 maps of the previous layer. So there are (5x5x16 + 1) x120 = 48120 parameters, and also 48120 connections. The network structure of the C5 layer is as follows:
F6 layer-fully connected layer
Calculation method: calculate the dot product between the input vector and the weight vector, add an offset, and pass the result through the sigmoid function.
Trainable parameters: 84 * (120 + 1) = 10164
Detailed description:
Layer 6 is a fully connected layer. The F6 layer has 84 nodes, corresponding to a 7x12 bitmap in which -1 means white and 1 means black, so that the black and white pixels of each symbol's bitmap correspond to a code. The number of trainable parameters and of connections for this layer is (120 + 1) x84 = 10164. The ASCII encoding diagram is as follows:
Output layer-fully connected layer
The output layer is also a fully connected layer, with a total of 10 nodes representing the digits 0 to 9; if the value of node i is 0, the network recognizes the input as the digit i. A radial basis function (RBF) network connection is used. Assuming x is the input from the previous layer and y is the output of the RBF, the RBF output is calculated as:
$$y_i = \sum_j (x_j - w_{ij})^2$$
In the above formula the value of $w_{ij}$ is determined by the bitmap encoding of i, where i ranges from 0 to 9 and j ranges from 0 to 7 * 12 - 1. The closer the RBF output of node i is to 0, the closer the input is to the ASCII encoding bitmap of i, which means that the network recognizes the current input as the character i. This layer has 84x10 = 840 parameters and connections.
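The parameter and connection counts quoted for each layer can be reproduced with plain arithmetic. The helper `conv_params` and the variable names below are introduced only for this sketch; the numbers themselves come from the layer descriptions above.

```python
def conv_params(n_maps, kernel, n_inputs_per_map):
    """Weights plus one bias per output map for a kernel x kernel convolution."""
    return n_maps * (kernel * kernel * n_inputs_per_map + 1)


c1 = conv_params(6, 5, 1)                        # 6 kernels over 1 input map
s2 = 6 * 2                                       # one coefficient and one offset per map
c3 = (conv_params(6, 5, 3) + conv_params(6, 5, 4)
      + conv_params(3, 5, 4) + conv_params(1, 5, 6))
s4 = 16 * 2
c5 = conv_params(120, 5, 16)
f6 = 84 * (120 + 1)
output = 84 * 10

parameters = {"C1": c1, "S2": s2, "C3": c3, "S4": s4,
              "C5": c5, "F6": f6, "OUTPUT": output}
connections = {"C1": c1 * 28 * 28,
               "S2": (2 * 2 + 1) * 6 * 14 * 14,
               "C3": c3 * 10 * 10,
               "S4": (2 * 2 + 1) * 16 * 5 * 5}
total_parameters = sum(parameters.values())

print(parameters)
print(connections)
print(total_parameters)
```

Summing the per-layer values gives the total number of parameters of the network as described here.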
Code Implementation
=================================================================
Total params: 62,006
Trainable params: 62,006
Non-trainable params: 0
_________________________________________________________________
Epoch 1/2
391/391 [==============================] - 14s 8ms/step - loss: 1.8395 - accuracy: 0.34
66 - val_loss: 1.7231 - val_accuracy: 0.3949
Epoch 2/2
391/391 [==============================] - 2s 5ms/step - loss: 1.6719 - accuracy: 0.411
2 - val_loss: 1.6083 - val_accuracy: 0.4258
313/313 [==============================] - 1s 3ms/step - loss: 1.6083 - accuracy: 0.425
8
Test Loss: 1.6083446741104126
Test accuracy: 0.42579999566078186
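The total of 62,006 parameters in the summary above differs from the layer-by-layer counts of the original LeNet-5. One configuration that reproduces this number is sketched below. It rests on assumptions that the surviving output does not confirm: a 3-channel 32 * 32 input, a second convolution connected to all 6 feature maps, pooling layers without trainable parameters, and an ordinary dense output layer with biases.

```python
# Assumed Keras-style LeNet variant (hypothetical configuration, see lead-in).
conv1 = 6 * (5 * 5 * 3 + 1)     # 5x5 kernels over 3 input channels
conv2 = 16 * (5 * 5 * 6 + 1)    # every output map sees all 6 input maps
dense1 = 120 * (5 * 5 * 16 + 1) # 16 maps of 5x5 feeding 120 units
dense2 = 84 * (120 + 1)
dense_out = 10 * (84 + 1)       # dense output layer with biases

total = conv1 + conv2 + dense1 + dense2 + dense_out
print(conv1, conv2, dense1, dense2, dense_out, total)
```

Under these assumptions the pooling layers contribute nothing, and the extra parameters relative to the original design come from the colour input and the fully connected second convolution.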
Activation Function
Back Propagation
Some derivation of necessary mathematics:
Vectors
Differentiation
Partial differentiation
Gradient of a Function
Maxima & Minima
Vectors:
A vector is an object that has both a magnitude and a direction (e.g. 5 km/h towards the north). Geometrically, we can picture a vector as a directed line segment, whose length is the magnitude of the vector and with an arrow indicating the direction. The direction of the vector is from its tail to its head. Two vectors are the same if they
have the same magnitude and direction. This means that if we take a vector and translate it to a new
position (without rotating it), then the vector we obtain at the end of this process is the same vector we had
in the beginning.
Vector in 3D:
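Now let's work through a short derivation:
$$\overrightarrow{OA} = x\hat{i} + y\hat{j}$$
$$\overrightarrow{OB} = x'\hat{i} + y'\hat{j}$$
We can also represent a vector as a column of its components,
$$\vec{x} = \begin{bmatrix} x \\ y \end{bmatrix}$$
$$|\overrightarrow{AB}| = \text{Length of } AB$$
For example,
$$\overrightarrow{OA} = \hat{i} + \hat{j}$$
$$\overrightarrow{OB} = 3\hat{i} + 2\hat{j}$$
$$\overrightarrow{AB} = \overrightarrow{OB} - \overrightarrow{OA} = (3 - 1)\hat{i} + (2 - 1)\hat{j}$$
$$\therefore \overrightarrow{AB} = 2\hat{i} + \hat{j}$$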
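The same vector subtraction can be written with NumPy arrays. This is a small sketch; the variable names are chosen here for illustration.

```python
import numpy as np

OA = np.array([1.0, 1.0])   # i + j
OB = np.array([3.0, 2.0])   # 3i + 2j

AB = OB - OA                     # component-wise subtraction
length_AB = np.linalg.norm(AB)   # Euclidean length of AB

print(AB, length_AB)
```

`np.linalg.norm` computes the square root of the sum of squared components, which is the magnitude of the vector.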
Differentiation:
Differentiation generally refers to the rate of change of a function with respect to one of its variables. Here it is equivalent to finding the slope of the tangent line to a function at a specific point.
$$\text{slope} = \frac{\Delta y}{\Delta x}$$
$$\text{slope} = \tan\theta = \frac{\text{perpendicular}}{\text{base}}$$
$$\text{slope} = \frac{y_2 - y_1}{x_2 - x_1}$$
Power Rule:
Here we use the power rule to calculate the derivative; it is quite simple.
$$\text{if } f(x) = x^n \text{, then } f'(x) = n\,x^{n-1}$$
Examples
$$f(x) = x^5$$
$$f'(x) = 5x^{(5-1)}$$
$$f'(x) = 5x^4$$
Product Rule:
If f(x) and g(x) are two differentiable functions, the product rule gives the derivative of their product: the derivative of the first function times the second function, plus the first function times the derivative of the second.
$$h(x) = f(x)\,g(x)$$
$$h'(x) = f'(x)\,g(x) + f(x)\,g'(x)$$
Partial Differentiation:
Example
$$f(x, y) = x^4 y$$
Obtaining the partial derivative w.r.t. x
$$\frac{\partial (x^4 y)}{\partial x} = 4x^3 y$$
Obtaining the partial derivative w.r.t. y
$$\frac{\partial (x^4 y)}{\partial y} = x^4$$
Gradient of Function:
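Let's say there is a function of two variables x and y, $f(x, y)$; then $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ are its partial derivatives w.r.t. x and y respectively.
Now the gradient $\nabla$ of f is defined as -
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix}^{\top} = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$$
It is nothing but the vector of partial derivatives.
EXAMPLE
$$f(x, y) = 2x^2 + 4y$$
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 4x \\ 4 \end{bmatrix}$$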
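The gradient of the example function can be checked numerically with central differences. This is a sketch only; the step size `h` and the evaluation point are arbitrary assumptions.

```python
def f(x, y):
    return 2 * x ** 2 + 4 * y


def analytic_gradient(x, y):
    # Gradient derived by hand: [4x, 4]
    return 4 * x, 4.0


def numerical_gradient(func, x, y, h=1e-5):
    # Central differences approximate each partial derivative.
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return df_dx, df_dy


point = (1.5, -2.0)
print(analytic_gradient(*point))
print(numerical_gradient(f, *point))
```

The two results agree up to floating-point rounding, which is a quick way to validate a hand-derived gradient.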
Maxima & Minima:
Let $f(x, y)$ be a bivariate function whose local minima or maxima need to be found.
Find -
$f_x = p = \frac{\partial f}{\partial x}$ and
$f_y = q = \frac{\partial f}{\partial y}$.
Solve $f_x = 0$ and $f_y = 0$ to find the stationary or critical points.
Find -
$r = f_{xx} = \frac{\partial^2 f}{\partial x^2}$,
$s = f_{xy} = \frac{\partial^2 f}{\partial x \partial y}$ and
$t = f_{yy} = \frac{\partial^2 f}{\partial y^2}$
Let's analyse the critical points that we have obtained. Take a critical point (a, b):
if $r t - s^2 > 0$ and
if $r > 0 \Rightarrow f$ has a local minimum at $(a, b)$
if $r < 0 \Rightarrow f$ has a local maximum at $(a, b)$
if $r t - s^2 = 0 \Rightarrow$ the test fails.
if $r t - s^2 < 0 \Rightarrow (a, b)$ is a saddle point (i.e. neither a maximum nor a minimum)
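The classification rule above translates directly into a small function. The function name and the example functions are chosen here for illustration; the values of r, s and t for each example were worked out by hand at the critical point (0, 0).

```python
def classify_critical_point(r, s, t):
    """Second-derivative test using r = f_xx, s = f_xy, t = f_yy at a critical point."""
    discriminant = r * t - s ** 2
    if discriminant > 0:
        return "local minimum" if r > 0 else "local maximum"
    if discriminant < 0:
        return "saddle point"
    return "inconclusive"


examples = {
    "x^2 + y^2": (2, 0, 2),      # bowl
    "-x^2 - y^2": (-2, 0, -2),   # dome
    "x^2 - y^2": (2, 0, -2),     # saddle
    "x^4 + y^4": (0, 0, 0),      # test fails
}
for name, (r, s, t) in examples.items():
    print(name, "->", classify_critical_point(r, s, t))
```

Note that when the discriminant is positive, r cannot be zero, so the two-way split on the sign of r covers every case.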
Backpropagation
Backpropagation is a method for efficiently computing the gradient of the cost function of a neural network
with respect to its parameters. These partial derivatives can then be used to update the network's
parameters using, e.g., gradient descent. This may be the most common method for training neural
networks. Deriving backpropagation involves numerous clever applications of the chain rule for functions of
vectors.
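If $C$ is a scalar-valued function of a scalar $z$, which is itself a function of a scalar $w$, the chain rule states that
$$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial z} \frac{\partial z}{\partial w}$$
For scalar-valued functions of more than one variable, the chain rule essentially becomes additive. In other words, if $C$ is a scalar-valued function of $N$ variables $z_1, \ldots, z_N$, each of which is a function of some variable $w$, the chain rule states that
$$\frac{\partial C}{\partial w} = \sum_{i=1}^{N} \frac{\partial C}{\partial z_i} \frac{\partial z_i}{\partial w}$$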
Notation
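In the following derivation, we'll use the following notation:
$L$ - Number of layers in the network.
$N^n$ - Dimensionality of layer $n \in \{0, \ldots, L\}$. $N^0$ is the dimensionality of the input; $N^L$ is the dimensionality of the output.
$W^m \in \mathbb{R}^{N^m \times N^{m-1}}$ - Weight matrix for layer $m \in \{1, \ldots, L\}$. $W^m_{ij}$ is the weight between the $i^{th}$ unit in layer $m$ and the $j^{th}$ unit in layer $m - 1$.
$b^m \in \mathbb{R}^{N^m}$ - Bias vector for layer $m$.
$z^m = W^m a^{m-1} + b^m$ - Linear mix of the inputs to layer $m$.
$a^m = \sigma^m(z^m)$ - Activations of layer $m$, where $\sigma^m$ is the nonlinearity of layer $m$ and $a^0$ is the input to the network.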
Backpropagation in general
In order to train the network using a gradient descent algorithm, we need to know the gradient of the cost/error function $C$ with respect to each of the parameters; that is, we need to know $\frac{\partial C}{\partial W^m}$ and $\frac{\partial C}{\partial b^m}$. It will be sufficient to derive an expression for these gradients in terms of the following terms, which we can compute based on the neural network's architecture:
$\frac{\partial C}{\partial a^L}$ : The derivative of the cost function with respect to its argument, the output of the network
$\frac{\partial a^m}{\partial z^m}$ : The derivative of the nonlinearity used in layer $m$ with respect to its argument
To compute the gradient of our cost/error function $C$ with respect to $W^m_{ij}$ (a single entry in the weight matrix of layer $m$), we can first note that $C$ is a function of $a^L$, which is itself a function of the linear mix variables $z^m_k$, which are themselves functions of the weight matrices $W^m$ and biases $b^m$. With this in mind, we can use the chain rule as follows:
$$\frac{\partial C}{\partial W^m_{ij}} = \sum_{k=1}^{N^m} \frac{\partial C}{\partial z^m_k} \frac{\partial z^m_k}{\partial W^m_{ij}}$$
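Note that by definition
$$z^m_k = \sum_{l=1}^{N^{m-1}} W^m_{kl}\, a^{m-1}_l + b^m_k$$
It follows that $\frac{\partial z^m_k}{\partial W^m_{ij}}$ will evaluate to zero when $i \neq k$ because $z^m_k$ does not interact with any elements in $W^m$ except for those in the $k^{th}$ row, and we are only considering the entry $W^m_{ij}$. When $i = k$, we have
$$\frac{\partial z^m_i}{\partial W^m_{ij}} = \frac{\partial}{\partial W^m_{ij}} \left( \sum_{l=1}^{N^{m-1}} W^m_{il}\, a^{m-1}_l + b^m_i \right) = a^{m-1}_j$$
$$\rightarrow \frac{\partial z^m_k}{\partial W^m_{ij}} = \begin{cases} 0 & k \neq i \\ a^{m-1}_j & k = i \end{cases}$$
The fact that $\frac{\partial z^m_k}{\partial W^m_{ij}}$ is 0 unless $k = i$ causes the summation above to collapse, giving
$$\frac{\partial C}{\partial W^m_{ij}} = \frac{\partial C}{\partial z^m_i}\, a^{m-1}_j$$
or in vector form
$$\frac{\partial C}{\partial W^m} = \frac{\partial C}{\partial z^m}\, a^{m-1\top}$$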
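Similarly for the bias variables $b^m$, we have
$$\frac{\partial C}{\partial b^m_i} = \sum_{k=1}^{N^m} \frac{\partial C}{\partial z^m_k} \frac{\partial z^m_k}{\partial b^m_i}$$
As above, it follows that $\frac{\partial z^m_k}{\partial b^m_i}$ will evaluate to zero when $i \neq k$ because $z^m_k$ does not interact with any element in $b^m$ except $b^m_k$. When $i = k$, we have
$$\frac{\partial z^m_i}{\partial b^m_i} = \frac{\partial}{\partial b^m_i} \left( \sum_{l=1}^{N^{m-1}} W^m_{il}\, a^{m-1}_l + b^m_i \right) = 1$$
$$\rightarrow \frac{\partial z^m_k}{\partial b^m_i} = \begin{cases} 0 & k \neq i \\ 1 & k = i \end{cases}$$
The summation also collapses to give
$$\frac{\partial C}{\partial b^m_i} = \frac{\partial C}{\partial z^m_i}$$
or in vector form
$$\frac{\partial C}{\partial b^m} = \frac{\partial C}{\partial z^m}$$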
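Now, we must compute $\frac{\partial C}{\partial z^m_k}$. For the final layer ($m = L$), this term is straightforward to compute using the chain rule:
$$\frac{\partial C}{\partial z^L_k} = \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_k}$$
or, in vector form
$$\frac{\partial C}{\partial z^L} = \frac{\partial C}{\partial a^L} \frac{\partial a^L}{\partial z^L}$$
The first term $\frac{\partial C}{\partial a^L}$ is just the derivative of the cost function with respect to its argument, whose form depends on the cost function chosen. Similarly, $\frac{\partial a^m}{\partial z^m}$ (for any layer $m$ including $L$) is the derivative of the layer's nonlinearity with respect to its argument and will depend on the choice of nonlinearity. For other layers, we again invoke the chain rule:
$$\begin{aligned}
\frac{\partial C}{\partial z^m_k} &= \frac{\partial C}{\partial a^m_k} \frac{\partial a^m_k}{\partial z^m_k} \\
&= \left( \sum_{l=1}^{N^{m+1}} \frac{\partial C}{\partial z^{m+1}_l} \frac{\partial z^{m+1}_l}{\partial a^m_k} \right) \frac{\partial a^m_k}{\partial z^m_k} \\
&= \left( \sum_{l=1}^{N^{m+1}} \frac{\partial C}{\partial z^{m+1}_l} \frac{\partial}{\partial a^m_k} \left( \sum_{h=1}^{N^m} W^{m+1}_{lh}\, a^m_h + b^{m+1}_l \right) \right) \frac{\partial a^m_k}{\partial z^m_k} \\
&= \left( \sum_{l=1}^{N^{m+1}} \frac{\partial C}{\partial z^{m+1}_l} W^{m+1}_{lk} \right) \frac{\partial a^m_k}{\partial z^m_k} \\
&= \left( \sum_{l=1}^{N^{m+1}} W^{m+1\top}_{kl} \frac{\partial C}{\partial z^{m+1}_l} \right) \frac{\partial a^m_k}{\partial z^m_k}
\end{aligned}$$
where the last simplification was made because by convention $\frac{\partial C}{\partial z^{m+1}}$ is a column vector, allowing us to write
$$\frac{\partial C}{\partial z^m} = \left( W^{m+1\top} \frac{\partial C}{\partial z^{m+1}} \right) \circ \frac{\partial a^m}{\partial z^m}$$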
Backpropagation in practice
As discussed above, the exact form of the updates depends on both the chosen cost function and each layer's chosen nonlinearity. The following table lists some common choices of nonlinearity and the partial derivative required for deriving the gradient for each layer:
| Nonlinearity | $a^m = \sigma^m(z^m)$ | $\frac{\partial a^m}{\partial z^m}$ | Notes |
| --- | --- | --- | --- |
| Sigmoid | $\frac{1}{1 + e^{-z^m}}$ | $\sigma^m(z^m)(1 - \sigma^m(z^m)) = a^m(1 - a^m)$ | "Squashes" any input to the range [0, 1] |
| Tanh | $\frac{e^{z^m} - e^{-z^m}}{e^{z^m} + e^{-z^m}}$ | $1 - (\sigma^m(z^m))^2 = 1 - (a^m)^2$ | Equivalent, up to scaling, to the sigmoid function |
| ReLU | $\max(0, z^m)$ | $0$ for $z^m < 0$; $1$ for $z^m \geq 0$ | Commonly used in neural networks with many layers |
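The derivatives listed for the three nonlinearities can be verified against central finite differences. In this sketch the test points and the step size `h` are arbitrary assumptions, and z = 0 is avoided because ReLU has a kink there.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    a = sigmoid(z)
    return a * (1 - a)


def tanh_prime(z):
    a = np.tanh(z)
    return 1 - a ** 2


def relu(z):
    return np.maximum(0, z)


def relu_prime(z):
    return (z >= 0).astype(float)


def central_diff(func, z, h=1e-6):
    return (func(z + h) - func(z - h)) / (2 * h)


z = np.array([-2.0, -0.5, 0.5, 2.0])
for name, func, prime in [("sigmoid", sigmoid, sigmoid_prime),
                          ("tanh", np.tanh, tanh_prime),
                          ("relu", relu, relu_prime)]:
    print(name, np.allclose(prime(z), central_diff(func, z), atol=1e-6))
```

Expressing each derivative in terms of the activation itself is convenient in practice, because the activations are already available from the forward pass.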
Similarly, the following table collects some common cost functions and the partial derivative needed to compute the gradient for the final layer:
| Cost Function | $C$ | $\frac{\partial C}{\partial a^L}$ | Notes |
| --- | --- | --- | --- |
| Cross-Entropy | $(y - 1)\log(1 - a^L) - y\log(a^L)$ | $\frac{a^L - y}{a^L(1 - a^L)}$ | Commonly used for binary classification tasks; can yield faster convergence |
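In practice, backpropagation proceeds in the following manner for each training sample: the gradients for each layer are obtained from
$$\frac{\partial C}{\partial W^m} = \frac{\partial C}{\partial z^m}\, a^{m-1\top} \quad \text{and} \quad \frac{\partial C}{\partial b^m} = \frac{\partial C}{\partial z^m}$$
For example, for a network with sigmoid nonlinearities and the cross-entropy cost, $\frac{\partial C}{\partial z^L} = a^L - y$, so that
$$\frac{\partial C}{\partial W^L} = \frac{\partial C}{\partial z^L}\, a^{L-1\top} = (a^L - y)\, a^{L-1\top}$$
$$\frac{\partial C}{\partial W^{L-1}} = \frac{\partial C}{\partial z^{L-1}}\, a^{L-2\top} = \left( W^{L\top}(a^L - y) \circ a^{L-1}(1 - a^{L-1}) \right) a^{L-2\top}$$
and so on. Standard gradient descent then updates each parameter as follows:
$$W^m = W^m - \lambda \frac{\partial C}{\partial W^m}$$
$$b^m = b^m - \lambda \frac{\partial C}{\partial b^m}$$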
Toy Python example
Due to the recursive nature of the backpropagation algorithm, it lends itself well to software
implementations. The following code implements a multi-layer perceptron which is trained using
backpropagation with user-supplied non-linearities, layer sizes, and cost function.
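A minimal NumPy sketch of such a network is given here. It is not the original code of this section: it assumes sigmoid nonlinearities in every layer and a summed cross-entropy cost (so that the output error is a^L - y), and the layer sizes, learning rate, number of steps and XOR data are arbitrary choices made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(params, x):
    """Return the activations a^0 ... a^L for inputs x of shape (N^0, batch)."""
    activations = [x]
    for W, b in params:
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations


def cost(params, x, y):
    a = forward(params, x)[-1]
    return float(-np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)))


def backward(params, x, y):
    """Gradients of the summed cross-entropy cost w.r.t. every weight matrix and bias."""
    activations = forward(params, x)
    grads = [None] * len(params)
    delta = activations[-1] - y          # dC/dz at the output (sigmoid + cross-entropy)
    for m in reversed(range(len(params))):
        a_prev = activations[m]
        grads[m] = (delta @ a_prev.T, delta.sum(axis=1, keepdims=True))
        if m > 0:
            W = params[m][0]
            # Propagate the error to the layer below: (W^T delta) o a(1 - a)
            delta = (W.T @ delta) * a_prev * (1 - a_prev)
    return grads


rng = np.random.default_rng(0)
sizes = [2, 3, 1]                        # input, hidden and output dimensionality
params = [(rng.normal(size=(n_out, n_in)), np.zeros((n_out, 1)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = np.array([[0, 0, 1, 1], [0, 1, 0, 1]], dtype=float)   # XOR inputs as columns
y = np.array([[0, 1, 1, 0]], dtype=float)

learning_rate = 0.1
initial_cost = cost(params, x, y)
for _ in range(1000):
    grads = backward(params, x, y)
    params = [(W - learning_rate * dW, b - learning_rate * db)
              for (W, b), (dW, db) in zip(params, grads)]
final_cost = cost(params, x, y)
print(initial_cost, final_cost)
```

A common sanity check for code like this is to compare one entry of the analytic gradient with a finite-difference estimate of the cost; whether the network actually solves XOR depends on the initial weights and the number of steps.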
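For the logistic function $g(h) = \frac{1}{1 + e^{-\beta h}}$, the derivative takes a convenient form:
$$\frac{dg}{dh} = \beta\, g(h)(1 - g(h))$$
Alternatively, the hyperbolic tangent function is also sigmoid:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Notice that the hyperbolic tangent function asymptotes at -1 and 1, rather than 0 and 1, which is sometimes beneficial, and its derivative is simple:
$$\frac{d\tanh(x)}{dx} = 1 - \tanh^2(x)$$
The simplest algorithm for iterative minimization of differentiable functions is known simply as gradient descent. Recall that the gradient of a function is defined as the vector of partial derivatives:
$$\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right]^{\top}$$
The gradient points in the direction of maximum increase. Equivalently, it points away from the direction of maximum decrease; thus, if we start at any point and keep moving in the direction of the negative gradient, we will eventually reach a local minimum.
This simple insight leads to the gradient descent algorithm. Outlined algorithmically, it looks like this: starting from an initial point $x$, repeat
$$x \leftarrow x - \alpha \nabla f(x)$$
until convergence.
Note that the step size, $\alpha$, is simply a parameter of the algorithm and has to be fixed in advance.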
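Gradient descent with a fixed step size can be sketched in a few lines. The example function f(x) = (x - 3)^2, the starting point, the step size and the number of steps are assumptions made for illustration only.

```python
def gradient_descent(grad, x0, alpha=0.1, n_steps=100):
    """Repeatedly step against the gradient with a fixed step size alpha."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * grad(x)
    return x


# f(x) = (x - 3)**2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)
```

With this step size the distance to the minimum shrinks by a constant factor at every step; a step size that is too large would instead make the iterates overshoot and diverge.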
Performing gradient descent will allow us to change the weights in the direction that optimally reduces the
error. The next trick will be to employ the chain rule to decompose how the error changes as a function of
the input weights into the change in error as a function of changes in the inputs to the weights, multiplied by
the changes in input values as a function of changes in the weights.
$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial h} \frac{\partial h}{\partial w}$$
This will allow us to write a function describing the activations of the output weights as a function of the
activations of the hidden layer nodes and the output weights, which will allow us to propagate error
backwards through the network.
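$$\frac{\partial E}{\partial h_k} = \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial h_k} = \frac{\partial E}{\partial g(h_k)} \frac{\partial g(h_k)}{\partial h_k}$$
The second term of this chain rule is just the derivative of the activation function, which we have chosen to have a convenient form, while the first term simplifies to:
$$\frac{\partial E}{\partial g(h_k)} = \frac{\partial}{\partial g(h_k)} \left[ \frac{1}{2} \sum_k (y_k - t_k)^2 \right] = y_k - t_k$$
where $y_k = g(h_k)$ is the output of node $k$ and $t_k$ is its target.
Combining these, and assuming (for illustration) a logistic activation function with $\beta = 1$, we have the error term at each output node:
$$\delta_k = (y_k - t_k)\, y_k (1 - y_k)$$
and, propagating it backwards, the error term at each hidden node:
$$\delta_{hj} = \left[ \sum_k w_{jk} \delta_k \right] a_j (1 - a_j)$$
update output layer weights:
$$w_{jk} \leftarrow w_{jk} - \eta\, \delta_k a_j$$
update hidden layer weights:
$$v_{ij} \leftarrow v_{ij} - \eta\, \delta_{hj} x_i$$
Return to the forward pass and iterate until learning completes. Best practice is to shuffle the input vectors between iterations to avoid training in the same order.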
It is important to be aware that because gradient descent is a hill-climbing (or descending) algorithm, it is
liable to be caught in local minima with respect to starting values. Therefore, it is worthwhile training several
networks using a range of starting values for the weights, so that you have a better chance of discovering a
globally-competitive solution.
One useful performance enhancement for the MLP learning algorithm is the addition of momentum to the
weight updates. This is just a coefficient on the previous weight update that increases the correlation
between the current weight and the weight after the next update. This is particularly useful for complex
models, where falling into local minima is an issue; adding momentum will give some weight to the previous
direction, making the resulting weights essentially a weighted average of the two directions. Adding
momentum, along with a smaller learning rate, usually results in a more stable algorithm with quicker
convergence. With momentum, an update is no longer guaranteed to follow the direction of steepest
descent, but this is generally seen as a small price to pay for the improvement momentum usually gives.
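A minimal sketch of a momentum update, assuming the common formulation in which a velocity term stores the previous weight update. The function name `momentum_step`, the coefficient `alpha`, and the toy objective are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    """Return the updated weights and velocity.

    velocity holds the previous weight update; alpha is the momentum
    coefficient applied to it, eta is the learning rate.
    """
    velocity = alpha * velocity - eta * grad
    return w + velocity, velocity

# Toy usage: minimise f(w) = w^2, whose gradient is 2w
w = np.array([5.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, 2.0 * w, v)
```

Setting `alpha=0` recovers plain gradient descent, which makes it easy to compare the two on the same problem.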
Vanishing Gradient
• Saturating non-linearities such as sigmoid or tanh are not suitable for deep
networks, as the signal tends to get trapped in the saturation region as the
network grows deeper. This makes it difficult for the network to learn and
can result in slow convergence during training. To overcome this problem
we can use the following.
• Non-linearities like ReLU which do not saturate.
• Smaller learning rates
• Careful initializations
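The effect can be illustrated numerically. Backpropagation multiplies one activation derivative per layer, and the sigmoid derivative never exceeds 0.25, so the product shrinks quickly with depth. The sketch below ignores the weight matrices and assumes the same pre-activation value at every layer, purely for illustration.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return (z > 0).astype(float)

depth = 20
z = np.full(depth, 0.5)  # assumed pre-activation at every layer

# Product of per-layer derivatives along one path through the network
sigmoid_factor = np.prod(sigmoid_grad(z))  # bounded above by 0.25 ** depth
relu_factor = np.prod(relu_grad(z))        # stays at 1 while units are active
```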
What is Normalization?
3) Scale and shift the normalized activations using the learned parameters $\gamma$ and $\beta$, respectively:
• $y_i = \gamma \hat{x}_i + \beta$
• The parameters $\gamma$ and $\beta$ are learned during training using backpropagation.
4) During inference, the running mean and variance of each
layer are used for normalization instead of the mini-batch
statistics. These running statistics are updated using a
moving average of the mini-batch statistics during training.
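The training and inference behaviour described above can be sketched in NumPy. This is a forward-pass illustration only: the class name `BatchNorm1D` and the `momentum` and `eps` defaults are assumptions, and the gradient updates for `gamma` and `beta` are omitted.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, n_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(n_features)    # learned scale
        self.beta = np.zeros(n_features)    # learned shift
        self.running_mean = np.zeros(n_features)
        self.running_var = np.ones(n_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Mini-batch statistics
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # Moving averages kept for use at inference time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

# Toy usage on made-up data
rng = np.random.default_rng(0)
bn = BatchNorm1D(3)
batch = rng.normal(loc=4.0, scale=2.0, size=(64, 3))
out_train = bn(batch, training=True)       # normalized with batch statistics
out_infer = bn(batch[:1], training=False)  # normalized with running statistics
```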
The benefits of batch normalization include:
• Improved training performance: Batch normalization reduces the internal
covariate shift, which is the change in the distribution of the activations
of each layer due to changes in the distribution of the inputs. This allows
the network to converge faster and with more stable gradients.
• Regularization: Batch normalization acts as a form of regularization by
adding noise to the activations of each layer, which can help prevent
overfitting.
Practical discussion of Callback functions:
A callback is a powerful tool to customize the behavior of a Keras model during training, evaluation, or
inference. Examples include tf.keras.callbacks.TensorBoard, which visualizes training progress and results
with TensorBoard, and tf.keras.callbacks.ModelCheckpoint, which periodically saves your model during
training. Another, tf.keras.callbacks.ReduceLROnPlateau, reduces the learning rate when a monitored
metric (e.g. accuracy or loss) has stopped improving; models often benefit from reducing the learning rate
once learning stagnates. Keras provides many other callback functions.
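The idea of lowering the learning rate when a metric stops improving can be sketched in plain Python. This is not the Keras implementation: the class `PlateauScheduler` and its `factor` and `patience` arguments are hypothetical names used only to show the logic.

```python
class PlateauScheduler:
    """Halve the learning rate after `patience` epochs without improvement."""

    def __init__(self, lr, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def on_epoch_end(self, val_loss):
        if val_loss < self.best:
            # New best value: remember it and reset the counter
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr

# Validation losses of the first five epochs in the training log that follows
scheduler = PlateauScheduler(lr=0.01)
val_losses = [0.0670, 0.0673, 0.0669, 0.0694, 0.0681]
lrs = [scheduler.on_epoch_end(v) for v in val_losses]
```

The learning rate stays at 0.01 until two consecutive epochs fail to beat the best validation loss, at which point it is halved.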
Out[3]: []
In [4]: tf.config.list_physical_devices("CPU")
In [8]: X_test.shape
In [9]: len(X_test[1][0])
Out[9]: 28
In [10]: # create a validation data set from the full training data
# Scale the data between 0 to 1 by dividing it by 255. as its an unsigned data between 0
X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
# scale the test set as well
X_test = X_test / 255.
In [11]: len(X_train_full[5000:] )
Out[11]: 55000
In [12]: # Lets view some data
plt.imshow(X_train[0], cmap="binary")
plt.show()
In [16]: model_clf.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
inputLayer (Flatten) (None, 784) 0
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
Out[18]: 266610
Out[19]: 'hiddenLayer1'
Out[20]: True
In [21]: len(hidden1.get_weights()[1])
Out[21]: 300
In [22]: hidden1.get_weights()
shape
(784, 300)
shape
(300,)
Epoch 1/30
1719/1719 [==============================] - 8s 4ms/step - loss: 0.0278 - accuracy: 0.9939 - val_loss: 0.0670 - val_accuracy: 0.9794
Epoch 2/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.0264 - accuracy: 0.9944 - val_loss: 0.0673 - val_accuracy: 0.9794
Epoch 3/30
1719/1719 [==============================] - 6s 4ms/step - loss: 0.0248 - accuracy: 0.9947 - val_loss: 0.0669 - val_accuracy: 0.9804
Epoch 4/30
1719/1719 [==============================] - 6s 3ms/step - loss: 0.0233 - accuracy: 0.9952 - val_loss: 0.0694 - val_accuracy: 0.9798
Epoch 5/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.0221 - accuracy: 0.9960 - val_loss: 0.0681 - val_accuracy: 0.9798
Epoch 6/30
1719/1719 [==============================] - 9s 5ms/step - loss: 0.0208 - accuracy: 0.9962 - val_loss: 0.0675 - val_accuracy: 0.9808
Epoch 7/30
1719/1719 [==============================] - 6s 3ms/step - loss: 0.0198 - accuracy: 0.9967 - val_loss: 0.0664 - val_accuracy: 0.9802
Epoch 8/30
1719/1719 [==============================] - 11s 6ms/step - loss: 0.0186 - accuracy: 0.9972 - val_loss: 0.0662 - val_accuracy: 0.9802
Epoch 9/30
1719/1719 [==============================] - 10s 6ms/step - loss: 0.0177 - accuracy: 0.9972 - val_loss: 0.0667 - val_accuracy: 0.9798
Epoch 10/30
1719/1719 [==============================] - 11s 7ms/step - loss: 0.0169 - accuracy: 0.9973 - val_loss: 0.0698 - val_accuracy: 0.9798
Epoch 11/30
1719/1719 [==============================] - 10s 6ms/step - loss: 0.0160 - accuracy: 0.9977 - val_loss: 0.0660 - val_accuracy: 0.9800
Epoch 12/30
1719/1719 [==============================] - 13s 7ms/step - loss: 0.0150 - accuracy: 0.9981 - val_loss: 0.0661 - val_accuracy: 0.9806
Epoch 13/30
1719/1719 [==============================] - 12s 7ms/step - loss: 0.0143 - accuracy: 0.9982 - val_loss: 0.0676 - val_accuracy: 0.9804
Epoch 14/30
1719/1719 [==============================] - 11s 6ms/step - loss: 0.0136 - accuracy: 0.9983 - val_loss: 0.0666 - val_accuracy: 0.9810
Epoch 15/30
1719/1719 [==============================] - 11s 6ms/step - loss: 0.0130 - accuracy: 0.9985 - val_loss: 0.0660 - val_accuracy: 0.9812
Epoch 16/30
1719/1719 [==============================] - 11s 6ms/step - loss: 0.0123 - accuracy: 0.9987 - val_loss: 0.0666 - val_accuracy: 0.9810
Saving the Model
In [32]: import time
import os

def save_model_path(MODEL_dir="TRAINED_MODEL"):
    # Create the target directory if needed and build a timestamped file name
    os.makedirs(MODEL_dir, exist_ok=True)
    fileName = time.strftime("Model_%Y_%m_%d_%H_%M_%S_.h5")
    model_path = os.path.join(MODEL_dir, fileName)
    print(f"Model {fileName} will be saved at {model_path}")
    return model_path
Out[33]: 'TRAINED_MODEL\\Model_2023_07_26_00_53_52_.h5'
In [35]: history.params
In [36]: # history.history
Out[37]:
loss accuracy val_loss val_accuracy
In [38]: pd.DataFrame(history.history).plot()
Out[42]: array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. ,
0. ],
[0. , 0. , 0.999, 0.001, 0. , 0. , 0. , 0. , 0. ,
0. ],
[0. , 0.998, 0. , 0. , 0. , 0. , 0. , 0.001, 0.001,
0. ]], dtype=float32)
In [43]: y_prob
In [45]: y_pred
In [46]: actual
Q1. Explain the concept of batch normalization in the context of Artificial Neural Networks.
Batch normalization is a technique used in Artificial Neural Networks to normalize the inputs of each layer
during training. It aims to stabilize and speed up the training process by reducing internal covariate shift.
Internal covariate shift refers to the change in the distribution of each layer's inputs during training, which
can slow down learning and require more careful tuning of hyperparameters.
Batch normalization addresses this issue by normalizing the inputs of each layer to have zero mean and unit
variance. It does this by computing the mean and variance of the inputs within a mini-batch (a small subset
of the training data) and then applying a normalization transformation. Additionally, batch normalization
introduces two learnable parameters, a scale and a shift, which allow the network to learn the optimal mean
and variance for each layer. These parameters give the model more flexibility to adapt to the data distribution.
Improved training stability: Batch normalization helps to mitigate the vanishing and exploding gradient
problems, making it easier for deep neural networks to converge during training.
Faster convergence: By reducing internal covariate shift, batch normalization accelerates the training
process, leading to faster convergence and fewer training iterations required.
Reduced sensitivity to initialization: Batch normalization makes neural networks less sensitive to the
choice of weight initialization, allowing for a wider range of initialization strategies.
Regularization effect: Batch normalization acts as a form of regularization, reducing the need for other
regularization techniques like dropout.
Allows for larger learning rates: The improved stability provided by batch normalization allows for the
use of larger learning rates, which can speed up training further.
Q3. Discuss the working principle of batch normalization, including the normalization step and the
learnable parameters.
The working principle of batch normalization involves two main steps: normalization and learnable
parameters.
Normalization Step: For each mini-batch of input data during training, batch normalization computes the
mean and variance of the data within that mini-batch. It then normalizes the data by subtracting the mean
and dividing by the standard deviation, resulting in inputs with zero mean and unit variance:
$$\hat{x} = \frac{x - \text{mean}}{\sqrt{\text{variance} + \epsilon}}$$
where x is the input data, mean and variance are the batch-wise statistics, and epsilon is a small constant
added to avoid division by zero.
Learnable Parameters: Batch normalization introduces two learnable parameters for each layer: scale
(gamma) and shift (beta). These parameters are applied after normalization and allow the model to learn the
optimal scaling and shifting of the normalized inputs. The transformed output is given by:
$$y = \gamma \hat{x} + \beta$$
Implementation
In [4]: X_test.shape
In [5]: len(X_test[1][0])
Out[5]: 28
In [7]: # create a validation data set from the full training data
# Scale the data between 0 to 1 by dividing it by 255. as its an unsigned data between 0
X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
# scale the test set as well
X_test = X_test / 255.
In [8]: len(X_train_full[5000:] )
Out[8]: 55000
In [13]: model_clf_without_bn.layers
In [14]: model_clf_without_bn.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
inputLayer (Flatten) (None, 784) 0
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
Out[16]: 266610
Epoch 1/10
860/860 [==============================] - 6s 5ms/step - loss: 0.2387 - accuracy: 0.9303 - val_loss: 0.1005 - val_accuracy: 0.9700
Epoch 2/10
860/860 [==============================] - 4s 4ms/step - loss: 0.0893 - accuracy: 0.9728 - val_loss: 0.0895 - val_accuracy: 0.9714
Epoch 3/10
860/860 [==============================] - 4s 5ms/step - loss: 0.0595 - accuracy: 0.9817 - val_loss: 0.0751 - val_accuracy: 0.9762
Epoch 4/10
860/860 [==============================] - 4s 5ms/step - loss: 0.0426 - accuracy: 0.9861 - val_loss: 0.0703 - val_accuracy: 0.9814
Epoch 5/10
860/860 [==============================] - 4s 5ms/step - loss: 0.0313 - accuracy: 0.9892 - val_loss: 0.0843 - val_accuracy: 0.9780
Epoch 6/10
860/860 [==============================] - 5s 6ms/step - loss: 0.0278 - accuracy: 0.9908 - val_loss: 0.0726 - val_accuracy: 0.9818
Epoch 7/10
860/860 [==============================] - 5s 6ms/step - loss: 0.0204 - accuracy: 0.9933 - val_loss: 0.0752 - val_accuracy: 0.9812
Epoch 8/10
860/860 [==============================] - 5s 5ms/step - loss: 0.0159 - accuracy: 0.9948 - val_loss: 0.0866 - val_accuracy: 0.9808
Epoch 9/10
860/860 [==============================] - 4s 5ms/step - loss: 0.0155 - accuracy: 0.9946 - val_loss: 0.0835 - val_accuracy: 0.9776
Epoch 10/10
860/860 [==============================] - 4s 4ms/step - loss: 0.0135 - accuracy: 0.9954 - val_loss: 0.0837 - val_accuracy: 0.9814
In [19]: history.params
In [21]: pd.DataFrame(history.history).plot()
In [25]: model_clf_with_bn.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
inputLayer (Flatten) (None, 784) 0
=================================================================
Total params: 268,210
Trainable params: 267,410
Non-trainable params: 800
_________________________________________________________________
Epoch 1/10
860/860 [==============================] - 8s 7ms/step - loss: 0.2105 - accuracy: 0.9372 - val_loss: 0.0989 - val_accuracy: 0.9702
Epoch 2/10
860/860 [==============================] - 6s 7ms/step - loss: 0.0910 - accuracy: 0.9722 - val_loss: 0.0835 - val_accuracy: 0.9756
Epoch 3/10
860/860 [==============================] - 4s 5ms/step - loss: 0.0654 - accuracy: 0.9791 - val_loss: 0.0852 - val_accuracy: 0.9726
Epoch 4/10
860/860 [==============================] - 4s 5ms/step - loss: 0.0488 - accuracy: 0.9843 - val_loss: 0.0738 - val_accuracy: 0.9774
Epoch 5/10
860/860 [==============================] - 4s 5ms/step - loss: 0.0426 - accuracy: 0.9858 - val_loss: 0.0756 - val_accuracy: 0.9786
Epoch 6/10
860/860 [==============================] - 5s 6ms/step - loss: 0.0369 - accuracy: 0.9880 - val_loss: 0.0636 - val_accuracy: 0.9800
Epoch 7/10
860/860 [==============================] - 5s 5ms/step - loss: 0.0288 - accuracy: 0.9902 - val_loss: 0.0733 - val_accuracy: 0.9802
Epoch 8/10
860/860 [==============================] - 5s 5ms/step - loss: 0.0281 - accuracy: 0.9902 - val_loss: 0.0687 - val_accuracy: 0.9802
Epoch 9/10
860/860 [==============================] - 5s 5ms/step - loss: 0.0224 - accuracy: 0.9922 - val_loss: 0.0764 - val_accuracy: 0.9810
Epoch 10/10
860/860 [==============================] - 5s 5ms/step - loss: 0.0203 - accuracy: 0.9930 - val_loss: 0.0688 - val_accuracy: 0.9802
Total training time: 49.95 seconds
In [31]: pd.DataFrame(history.history)
In [32]: pd.DataFrame(history.history).plot()
The experiment compares two models trained on the same dataset with the same settings, one without
and one with batch normalization. Let's analyze the effect of batch normalization on training dynamics
and model performance.
Experiment Results:
Observations:
Both models achieve high accuracy (about 98% on the validation set), indicating that they learn effectively
from the data and make accurate predictions on unseen samples. In the training logs above, the model with
batch normalization ends with a lower validation loss (0.0688 versus 0.0837), while the final validation
accuracy is almost identical (0.9802 versus 0.9814). This suggests that batch normalization gives at most a
modest improvement for this small network.
Out[41]: array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. ,
0. ],
[0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. ],
[0. , 0.998, 0. , 0. , 0. , 0.001, 0. , 0. , 0. ,
0. ]], dtype=float32)
In [36]: df=pd.read_csv("wine.csv")
In [37]: df.head()
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 bad
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 bad
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 bad
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 good
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 bad
In [38]: df.shape
In [39]: df.columns
In [40]: # Calculate class distribution
class_distribution = df['quality'].value_counts()
# Display the class distribution
print(class_distribution)
# Check if the target variable is imbalanced
if len(class_distribution) == 2:
    majority_class_count = max(class_distribution)
    minority_class_count = min(class_distribution)
    class_ratio = majority_class_count / minority_class_count
    if class_ratio > 2:
        print("The target variable is imbalanced.")
    else:
        print("The target variable is not imbalanced.")
else:
    print("The target variable is not binary.")
good 855
bad 744
Name: quality, dtype: int64
The target variable is not imbalanced.
In [42]: sns.boxplot(df['fixed acidity'])
In [43]: sns.boxplot(df['alcohol'])
The box plots show some outliers in the data. We keep them for this exercise and proceed with the deep
learning model.
In [44]: y = df.quality
X = df.drop(columns = ['quality'])
In [45]: X.shape
In [46]: X.head()
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
In [47]: y.head()
Out[47]: 0 bad
1 bad
2 bad
3 good
4 bad
Name: quality, dtype: object
In [52]: print(X_train_full.shape)
print(X_test.shape)
print(X_train.shape)
print(X_valid.shape)
(1199, 11)
(400, 11)
(899, 11)
(300, 11)
In [53]: X_train.shape[1:]
Out[53]: (11,)
In [54]: X_train.shape[1:]
Out[54]: (11,)
In [55]: y_train.shape[:1]
Out[55]: (899,)
In [57]: # Logging
import os
import time

def get_log_path(log_dir="logs/fit"):
    # Build a timestamped sub-directory for this run's TensorBoard logs
    fileName = time.strftime("log_%Y_%m_%d_%H_%M_%S")
    logs_path = os.path.join(log_dir, fileName)
    print(f"Saving logs at {logs_path}")
    return logs_path

log_dir = get_log_path()
tb_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
In [60]: # Q13. Use binary cross-entropy as the loss function, Adam optimizer, and ['accuracy'] a
loss_function = 'binary_crossentropy'
optimizer = 'adam'
metrics = ['accuracy']
# Q14. Compile the model
model.compile(optimizer=optimizer, loss=loss_function, metrics=metrics)
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
HiddenLayer1 (Dense) (None, 30) 360
=================================================================
Total params: 731
Trainable params: 731
Non-trainable params: 0
_________________________________________________________________
In [68]: # Original train
EPOCHS = 40
VALIDATION_SET = (X_valid, y_valid)
history = model.fit(X_train, y_train, epochs=EPOCHS,
validation_data=VALIDATION_SET, batch_size=64, callbacks=[tb_cb, ear
Epoch 1/40
15/15 [==============================] - 1s 20ms/step - loss: 0.4931 - accuracy: 0.7575 - val_loss: 0.5633 - val_accuracy: 0.7333
Epoch 2/40
15/15 [==============================] - 0s 11ms/step - loss: 0.4900 - accuracy: 0.7608 - val_loss: 0.5635 - val_accuracy: 0.7233
Epoch 3/40
15/15 [==============================] - 0s 12ms/step - loss: 0.4862 - accuracy: 0.7631 - val_loss: 0.5632 - val_accuracy: 0.7233
Epoch 4/40
15/15 [==============================] - 0s 17ms/step - loss: 0.4844 - accuracy: 0.7608 - val_loss: 0.5631 - val_accuracy: 0.7333
Epoch 5/40
15/15 [==============================] - 0s 11ms/step - loss: 0.4816 - accuracy: 0.7620 - val_loss: 0.5643 - val_accuracy: 0.7300
Epoch 6/40
15/15 [==============================] - 0s 12ms/step - loss: 0.4788 - accuracy: 0.7653 - val_loss: 0.5647 - val_accuracy: 0.7200
Epoch 7/40
15/15 [==============================] - 0s 12ms/step - loss: 0.4773 - accuracy: 0.7631 - val_loss: 0.5635 - val_accuracy: 0.7300
Epoch 8/40
15/15 [==============================] - 0s 12ms/step - loss: 0.4746 - accuracy: 0.7664 - val_loss: 0.5651 - val_accuracy: 0.7267
Epoch 9/40
15/15 [==============================] - 0s 14ms/step - loss: 0.4729 - accuracy: 0.7664 - val_loss: 0.5665 - val_accuracy: 0.7200
In [70]: # Q18. Plot the model's training history
plt.plot(history_df['accuracy'], label='Train Accuracy')
plt.plot(history_df['val_accuracy'], label='Validation Accuracy')
plt.plot(history_df['loss'], label='Train Loss')
plt.plot(history_df['val_loss'], label='Validation Loss')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Metric Value')
plt.title('Training History')
plt.show()
# Q19. Evaluate the model's performance using the test data
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}')
In [71]: X_test.shape
In [74]: new.reshape((1,11))
In [83]: # Assuming you have already trained the model and loaded the test data (X_test, y_test)
# Make predictions on the test data using the trained model
y_pred_probs = model.predict(X_test)
# Convert the predicted probabilities to class labels (0 or 1)
y_pred_labels = (y_pred_probs > 0.5).astype(int)
# If the output layer instead used softmax with one unit per class,
# pick the class with the highest probability:
# y_pred_labels = np.argmax(y_pred_probs, axis=1)
# Evaluate the model's performance
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred_labels)
# Create a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_labels)
# Print the classification report
class_report = classification_report(y_test, y_pred_labels)
print("Accuracy:", accuracy)
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)
In [87]: # Assuming you have already trained the model and loaded the test data (X_test)
# Make predictions on the test data using the trained model
y_pred_probs = model.predict(X_test)
# Verify the shape of the y_pred_probs array
print("Shape of y_pred_probs:", y_pred_probs.shape)
# Assuming '0' represents 'Bad' and '1' represents 'Good',
# you can access the probability of 'Good' wine for the first sample (index 0) as follow
prob_good = y_pred_probs[0][0]
print("Probability of Good Wine:", prob_good)
Assignment Question
Answer: The purpose of forward propagation in a neural network is to compute the output of the network
based on given input data. It involves passing the input through the network's layers, applying weights and
biases to the data, and activating neurons using specific functions until the final output is generated.
Answer: In a single-layer feedforward neural network, the forward propagation process can be
mathematically represented as follows:
Input: The input features are denoted as a vector x = [x₁, x₂, ..., xₙ].
Weighted Sum: Each input feature is multiplied by its corresponding weight, and the bias is added to the
weighted sum: z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b.
Activation Function: The weighted sum is then passed through an activation function f, producing the
output of the neuron: a = f(z).
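As a small illustration of these three steps, the sketch below computes the output of a layer with two neurons. The logistic activation and the numeric values of `x`, `W`, and `b` are assumptions made for the example.

```python
import numpy as np

def forward(x, W, b):
    """Forward pass of one layer: weighted sum, then logistic activation."""
    z = W @ x + b                     # weighted sum plus bias, one value per neuron
    return 1.0 / (1.0 + np.exp(-z))   # activation a = f(z)

x = np.array([1.0, 2.0, 3.0])         # input features
W = np.array([[0.1, 0.2, -0.1],       # one row of weights per neuron
              [0.0, -0.3, 0.2]])
b = np.array([0.0, 0.1])              # one bias per neuron
a = forward(x, W, b)
```

Here the weighted sums are z = [0.2, 0.1] before the activation is applied.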
Answer: Activation functions are applied during forward propagation to introduce non-linearity in the neural
network, allowing it to learn complex patterns and make predictions for more diverse datasets. The
activation function takes the output of a neuron and transforms it into a new value, which becomes the input
for the next layer in the network.
Answer: The weights and biases are crucial parameters in forward propagation. The weights determine the
strength of connections between neurons, controlling how much influence each input has on the neuron's
output. Biases, on the other hand, shift the weighted sum before the activation function is applied, so a
neuron can produce a non-zero output even when all of its inputs are zero.
Q5. What is the purpose of applying a softmax function in the output layer during forward
propagation?
Answer: The softmax function is typically applied in the output layer of a neural network when dealing with
multi-class classification problems. It converts the raw output scores (logits) of the network into probabilities.
The softmax function ensures that the output probabilities sum up to 1, making it easier to interpret the
model's certainty about each class.
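A minimal NumPy sketch of softmax follows. Subtracting the maximum logit before exponentiating is a common numerical-stability step and does not change the result; the example logits are made up.

```python
import numpy as np

def softmax(logits):
    # Shift by the maximum so that np.exp never overflows
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

probs = softmax(np.array([2.0, 1.0, 0.1]))      # probabilities for three classes
large = softmax(np.array([1000.0, 1000.0]))     # stays finite thanks to the shift
```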
Answer: The purpose of backward propagation, also known as backpropagation, is to adjust the network's
weights and biases based on the computed error during forward propagation. By propagating the error
backward through the network, it allows the model to learn and improve its performance through the process
of gradient descent.
Q8. Can you explain the concept of the chain rule and its application in backward propagation?
Answer: The chain rule is a fundamental concept in calculus that allows us to find the derivative of a
composite function. In the context of neural networks, it enables us to compute the gradients of the loss
function with respect to the weights and biases of each layer. During backward propagation, the chain rule is
applied to calculate how the changes in the output of a layer affect the error, and these gradients are used
to update the parameters of the network through gradient descent.
Q9. What are some common challenges or issues that can occur during backward propagation, and
how can they be addressed?
Vanishing gradients: When gradients become very small, hindering the learning process. This can be
mitigated using activation functions that preserve gradients, like ReLU.
Exploding gradients: When gradients become extremely large, causing instability during learning.
Techniques like gradient clipping can be applied to control the magnitude of gradients.
Overfitting: When the model becomes too complex and performs well on training data but poorly on
unseen data. Regularization techniques, such as L1 or L2 regularization, can be used to address this
problem.
Learning rate selection: Choosing an appropriate learning rate is essential for stable and efficient
learning. Techniques like learning rate scheduling or adaptive learning rate methods (e.g., Adam) can
be used.
Local minima: The optimization process may get stuck in local minima, leading to suboptimal solutions.
Training from several random initializations or adding momentum can help.
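Of the remedies listed above, gradient clipping is easy to show in isolation. The sketch below rescales a list of gradient arrays so that their combined L2 norm does not exceed a threshold; the function name `clip_by_global_norm` and the example values are illustrative assumptions.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

Because every array is scaled by the same factor, the direction of the overall gradient is preserved and only its length changes.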
HEMANT THAPA
In [88]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.animation as ani
from matplotlib.animation import FuncAnimation
import statistics as st
import yfinance as yf
import tensorflow as tf
import PIL
import math
import warnings
warnings.filterwarnings("ignore")
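class stock:
    # Thin wrapper around yfinance; the constructor is reconstructed from
    # the later calls such as stock("JPM", "2y").chart()
    def __init__(self, ticker, time):
        self.ticker = ticker
        self.time = time

    def chart(self):
        return yf.download(self.ticker, period=self.time)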
DATASET
In [4]: df = pd.read_csv('advertising.csv')
In [5]: df.isnull().sum()
Out[5]:
TV 0
Radio 0
Newspaper 0
Sales 0
dtype: int64
In [6]: df.info()
STANDARD SCALE
In [7]: class StandardScale:
    def __init__(self, data):
        self.data = data

    def scale_fit(self):
        # Standardize to zero mean and unit standard deviation
        return (self.data - self.data.mean()) / self.data.std()
In [9]: data[:10]
In [11]: model
Out[11]: ▾ LinearRegression
LinearRegression()
In [12]: X = data['TV'].values.reshape(-1,1)
y = data['Sales'].values
Out[14]: ▾ LinearRegression
LinearRegression()
INTERCEPT
In [15]: model.intercept_
Out[15]: -0.009917747255374178
COEFFICIENT
In [16]: model.coef_
Out[16]: array([0.81519156])
In [18]: y_pred
Out[18]: array([ 0.7616275 , -0.04925105, 1.35411956, 0.15584235, -0.86012961,
-0.98356546, 0.95627634, 1.37500871, -0.69681449, -0.8145533 ,
-0.8724732 , -1.1801133 , -1.04528338, 0.687566 , 0.34954168,
1.22118865, -1.14498156, 0.47012901, 0.27453066, -0.70915808,
-0.56768161, -1.32633731, -1.16871923, -0.7566334 , 1.07496465,
0.66762636, -0.75093636, -0.73574426, 0.43309825, -0.770876 ,
1.12813701, 0.42835072, -0.99780806, -0.68067288, 0.5574836 ,
0.47012901, 0.58406979, -0.51450924, -0.0843828 , -1.22853814,
-0.09672638, 0.26218708, 0.20996422, 0.79770875, -0.45279132,
0.66857587, 0.44923986, -1.22758863, 0.87366927, 0.48247259,
0.177681 , -0.69111745, 1.11579343, 0.53184693, -0.35499215,
1.22308767, -0.49267059, 0.55843311, -0.68162239, -0.77847205])
In [19]: plt.figure(figsize=(8,5))
plt.scatter(X_train,y_train)
plt.plot(X_test, y_pred, color="red")
plt.grid(True, linestyle="--", alpha=0.5)
plt.show()
Out[21]: 0.526344665266148
In [23]: mean_squared_error(y_test, y_pred)
Out[23]: 0.40920214509628555
Out[24]: 0.6396891003419439
In [26]: data[:5]
In [27]: X = data['TV']
y = data['Sales']
GRADIENT DESCENT
In [28]: class GradientDescent:
    def __init__(self, x, y, m_curr=0, c_curr=0, iteration=100, rate=0.01):
        self.x = x
        self.y = y
        # Initialize predicted_y using the initial slope and intercept
        self.predicted_y = (m_curr * x) + c_curr
        self.m_curr = m_curr
        self.c_curr = c_curr
        self.iteration = iteration
        self.rate = rate

    def cost_function(self):
        N = len(self.y)
        # Mean squared error
        return np.sum((self.y - self.predicted_y) ** 2) / N

    def calculation(self):
        N = float(len(self.y))
        history = []
        # Perform gradient descent iterations
        for i in range(self.iteration):
            # Predicted y values for the current slope and intercept
            self.predicted_y = (self.m_curr * self.x) + self.c_curr
            cost = self.cost_function()
            # Gradients of the cost with respect to slope and intercept
            m_gradient = -(2/N) * np.sum(self.x * (self.y - self.predicted_y))
            c_gradient = -(2/N) * np.sum(self.y - self.predicted_y)
            # Update the slope and intercept using the gradients and learning rate
            self.m_curr -= self.rate * m_gradient
            self.c_curr -= self.rate * c_gradient
            # Record the parameters and cost for this iteration
            history.append({'m_curr': self.m_curr, 'c_curr': self.c_curr, 'cost': cost})
        return pd.DataFrame(history, columns=['m_curr', 'c_curr', 'cost'])
In [29]: gd = GradientDescent(X, y)
gradient_descent_result = gd.calculation()
GRADIENT DESCENT
In [30]: gradient_descent_result[:10]
In [31]: plt.figure(figsize=(8,5))
gradient_descent_result.cost.plot()
plt.grid(True, linestyle="--", alpha=0.5)
plt.xlabel('Iteration')
plt.ylabel('Cost Function')
plt.title('Gradient Descent')
plt.show()
In [32]: gradient_descent_result.reset_index(inplace=True)
MACHINE LEARNING
In [34]: jp = stock("JPM", "2y").chart()
gs = stock("GS", "2y").chart()
[*********************100%***********************] 1 of 1 completed
[*********************100%***********************] 1 of 1 completed
In [35]: X = jp.Close.values.reshape(-1,1)
y = gs.Close.values
In [36]: print(X.shape)
(504, 1)
In [37]: print(y.shape)
(504,)
In [38]: y = y[:2519]
print(y.shape)
(504,)
Out[41]: ▾ LinearRegression
LinearRegression()
In [43]: y_pred_sklearn
[*********************100%***********************] 1 of 1 completed
[*********************100%***********************] 1 of 1 completed
In [48]: x = jp.Close.values.reshape(-1,1)
y = gs.Close.values
[*********************100%***********************] 1 of 1 completed
[*********************100%***********************] 1 of 1 completed
In [60]: x = X.Close.values
y = Y.Close.values
In [61]: print(x.shape)
print(y.shape)
(504,)
(504,)
In [63]: df = pd.DataFrame(dataset)
df[:5]
Out[63]: x y
0 158.929993 408.350006
1 157.009995 404.970001
2 155.580002 398.799988
3 154.279999 393.570007
4 154.720001 395.869995
X Mean: 138.566210413736
y Mean: 347.76824412270196
In [67]: df[:5]
Out[68]: 1.5913570686889646
Out[69]: 127.25992569936074
In [71]: predicted_values = []
for i in x_value:
    y = bo + b1 * i
    predicted_values.append(y)
In [74]: df[:5]
Out[74]: x y xi - x mean yi - y mean (xi - x mean)(yi - y mean) (xi - x mean)**2 y predicted Residual
0 158.929993 408.350006 20.363782 60.581762 1233.673810 414.683628 380.174293 28.175713
1 157.009995 404.970001 18.443784 57.201757 1055.016858 340.173172 377.118890 27.851111
2 155.580002 398.799988 17.013791 51.031744 868.243442 289.469098 374.843261 23.956726
3 154.279999 393.570007 15.713788 45.801763 719.719214 246.923145 372.774492 20.795515
4 154.720001 395.869995 16.153791 48.101751 777.025623 260.944957 373.474693 22.395302
R SQUARE
In [76]: r_square = ssr/sst
f"R square: {r_square}"
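The hand computation in the tables can be condensed into a few NumPy lines. This is a sketch assuming `ssr` means the regression sum of squares, the sum of (y predicted - y mean)**2, and `sst` the total sum of squares; the function name `simple_linear_fit` and the small data set are made up for illustration.

```python
import numpy as np

def simple_linear_fit(x, y):
    """Least-squares slope, intercept and R square for one predictor."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of (xi - x mean)(yi - y mean) over sum of (xi - x mean)**2
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b0 = y_mean - b1 * x_mean
    y_pred = b0 + b1 * x
    ssr = np.sum((y_pred - y_mean) ** 2)   # variation explained by the line
    sst = np.sum((y - y_mean) ** 2)        # total variation in y
    return b0, b1, ssr / sst

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b0, b1, r_square = simple_linear_fit(x, y)
pearson_r = np.corrcoef(x, y)[0, 1]
```

For a single predictor, R square equals the square of the Pearson correlation coefficient, which gives a quick consistency check between the two sections.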
PEARSON CORRELATION
In [80]: df['x square'] = df['x'] * df['x']
df['y square'] = df['y'] * df['y']
df['xy'] = df['x'] * df['y']
In [81]: df[:5]
Out[81]: x y xi - x mean yi - y mean (xi - x mean)(yi - y mean) (xi - x mean)**2 y predicted Residual (y pred - y mean)**2 x square y square
0 158.929993 408.350006 20.363782 60.581762 1233.673810 414.683628 380.174293 28.175713 1050.152002 25258.742572 166749.727485
1 157.009995 404.970001 18.443784 57.201757 1055.016858 340.173172 377.118890 27.851111 861.460432 24652.138375 164000.701889
2 155.580002 398.799988 17.013791 51.031744 868.243442 289.469098 374.843261 23.956726 733.056558 24205.136970 159041.430264
3 154.279999 393.570007 15.713788 45.801763 719.719214 246.923145 372.774492 20.795515 625.312449 23802.318023 154897.350665
4 154.720001 395.869995 16.153791 48.101751 777.025623 260.944957 373.474693 22.395302 660.821530 23938.278778 156713.053034
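The `x square`, `y square`, and `xy` columns are the terms of the computational formula for the Pearson correlation coefficient. A minimal sketch of that formula, assuming NumPy arrays; `pearson_r` and the toy data are illustrative names and values:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from sums of x, y, x*y, x**2 and y**2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sum_x, sum_y = x.sum(), y.sum()
    sum_xy = np.sum(x * y)    # 'xy' column
    sum_x2 = np.sum(x * x)    # 'x square' column
    sum_y2 = np.sum(y * y)    # 'y square' column
    numerator = n * sum_xy - sum_x * sum_y
    denominator = np.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_up = pearson_r(x_demo, 3.0 * x_demo + 2.0)   # perfectly increasing line
r_down = pearson_r(x_demo, -x_demo)            # perfectly decreasing line
```

For simple linear regression with an intercept, the square of this coefficient equals the R square of the fit.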
REFERENCES:
D. Kass. (2021). Gradient Descent with Linear Regression from Scratch. Retrieved from:
https://dmitrijskass.netlify.app/2021/04/03/gradient-descent-with-linear-regression-from-scratch/
S. Sayad. Machine Learning Recipes. Retrieved from: http://saedsayad.com/mlr.htm
It minimizes the mean absolute error: the mean of the absolute differences between the true values and the
predicted values.
Out[5]: 50005100.0
The disadvantage of the L2 norm is that when there are outliers, their squared errors dominate the loss.
In [6]: def MSE(actual, pred):
    return np.square(actual - pred)
Out[8]: -1572109088.0
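A squared error can never be negative, so a negative result like the one above suggests fixed-width integer overflow; this diagnosis is an assumption, since the input cell is not shown. A minimal sketch of both regression losses that casts to `float64` first and averages over the samples (the names `mae` and `mse` are illustrative):

```python
import numpy as np

def mae(actual, pred):
    """L1 loss: mean of the absolute differences."""
    actual = np.asarray(actual, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    return np.mean(np.abs(actual - pred))

def mse(actual, pred):
    """L2 loss: mean of the squared differences."""
    actual = np.asarray(actual, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    return np.mean(np.square(actual - pred))

# One outlier (100 vs 4) among otherwise perfect predictions
actual_demo = np.array([1, 2, 3, 100])
pred_demo = np.array([1, 2, 3, 4])
mae_demo = mae(actual_demo, pred_demo)   # 96 / 4
mse_demo = mse(actual_demo, pred_demo)   # 96**2 / 4
```

The single outlier contributes 24 to the mean absolute error but 2304 to the mean squared error, which illustrates why the L2 loss is more sensitive to outliers.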
3. Huber Loss
Huber Loss is often used in regression problems. Compared with L2 loss, Huber Loss is less sensitive to
outliers: it is a piecewise function, and when the residual is large the loss is only a linear function of
the residual.
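$$
L_\delta(y, \hat{y}) =
\begin{cases}
\frac{1}{2}(y - \hat{y})^2, & |y - \hat{y}| \le \delta \\
\delta\left(|y - \hat{y}| - \frac{1}{2}\delta\right), & \text{otherwise}
\end{cases}
$$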
Here, $\delta$ is a parameter chosen in advance, $y$ is the real value, and $\hat{y}$ is the predicted value.
The advantage is that when the residual is small the loss behaves like the L2 norm, and when the residual
is large it behaves like the L1 norm, growing linearly.
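The piecewise definition can be sketched in NumPy as follows. The function name `huber_loss` and the default `delta=1.0` are illustrative choices, not values from the notebook:

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    """Element-wise Huber loss: quadratic for small residuals, linear for large ones."""
    residual = np.abs(np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float))
    quadratic = 0.5 * residual ** 2
    linear = delta * (residual - 0.5 * delta)
    return np.where(residual <= delta, quadratic, linear)

# Residuals of 0.5, 1.0 and 3.0 with delta = 1
losses_demo = huber_loss([0.0, 0.0, 0.0], [0.5, 1.0, 3.0], delta=1.0)
```

The two branches meet at a residual of exactly `delta` (both give 0.5 here), so the loss is continuous.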
The Pseudo-Huber loss function ensures that derivatives are continuous for all degrees
For Classification
5. Hinge Loss
Hinge loss is often used for binary classification problems, where the ground truth is t = 1 or -1 and the
predicted value is y = wx + b.
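The text does not spell out the formula; the sketch below assumes the standard definition, max(0, 1 - t * y). The function name `hinge_loss` is illustrative.

```python
import numpy as np

def hinge_loss(t, score):
    """Standard hinge loss for labels t in {-1, +1} and raw scores y = wx + b."""
    t = np.asarray(t, dtype=float)
    score = np.asarray(score, dtype=float)
    return np.maximum(0.0, 1.0 - t * score)

# Confidently correct, correct but inside the margin, and wrong
hinge_demo = hinge_loss([1, 1, -1], [2.0, 0.5, 0.5])
```

Under this definition a prediction on the correct side with a margin of at least 1 incurs no loss, while the other two cases are penalised in proportion to how far they fall short of the margin.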
6. Cross-entropy loss
$$J(w) = -y \log(\hat{y}) - (1 - y)\log(1 - \hat{y}) = -\sum_i p_i \log(q_i)$$
Cross-entropy loss is mainly applied to binary classification problems. The predicted value is a probability,
and the loss is defined according to the cross entropy. Note the range of the predicted value: $\hat{y}$ must
be a probability in [0,1].
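A minimal sketch of the binary cross-entropy above. Clipping the prediction with a small `eps` is an added safeguard against `log(0)`, not something the text prescribes; `binary_cross_entropy` is an illustrative name.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Element-wise -y*log(y_hat) - (1-y)*log(1-y_hat) for probabilities y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Two confident, correct predictions and one undecided prediction
bce_demo = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.5])
```

The two confident predictions receive the same small loss, and the undecided one (0.5) receives log(2).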
7. Sigmoid-Cross-entropy loss
The cross-entropy loss above requires the predicted value to be a probability. Generally, we calculate
$\text{scores} = x \cdot w + b$. Passing this value through the sigmoid function compresses it into the range (0,1).
The sigmoid function smooths the predicted value: 0.1 and 0.01 differ far less after the sigmoid is applied
than when they are used directly. As a result, the sigmoid cross-entropy loss grows less steeply when the
predicted value is far from the label.
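A sketch of sigmoid cross-entropy computed directly from the raw scores. It uses the algebraically equivalent form max(s, 0) - s*y + log(1 + exp(-|s|)), chosen here for numerical stability; that rearrangement is an addition and is not taken from the text.

```python
import numpy as np

def sigmoid(z):
    """Squash raw scores into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cross_entropy(y, scores):
    """Binary cross-entropy of sigmoid(scores) against labels y in {0, 1}."""
    y = np.asarray(y, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.maximum(scores, 0.0) - scores * y + np.log1p(np.exp(-np.abs(scores)))

scores_demo = np.array([-2.0, 0.0, 3.0])
labels_demo = np.array([0.0, 1.0, 1.0])
loss_demo = sigmoid_cross_entropy(labels_demo, scores_demo)

# Reference: apply the sigmoid first, then the plain cross-entropy formula
probs_demo = sigmoid(scores_demo)
naive_demo = -(labels_demo * np.log(probs_demo)
               + (1.0 - labels_demo) * np.log(1.0 - probs_demo))
```

Both routes give the same values for moderate scores; the rearranged form avoids taking the logarithm of a value that has been rounded to 0 or 1 when the scores are very large in magnitude.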
$$H(p, q) = -\sum_x p(x) \log(q(x))$$
where $p(x)$ is the probability that class $x$ is the correct class; its value can only be 0 or 1. This is
the prior value.
$q(x)$ is the predicted probability that class $x$ is the correct class; its value range is (0,1).
So, for a classification problem with a total of C classes, exactly one of the $p(x_j)$, $1 \le j \le C$, is 1
and the remaining C-1 are 0 (there can be only one correct class: the probability that the correct class is
correct is 1, and the probability that any other class is correct is 0).
$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$
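A sketch of this per-sample softmax cross-entropy, with $f$ the vector of class scores and $y_i$ the index of the correct class. Subtracting the maximum score before exponentiating is an added numerical-stability step that leaves the result unchanged; the function name is illustrative.

```python
import numpy as np

def softmax_cross_entropy(scores, correct_class):
    """-log of the softmax probability assigned to the correct class."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()                        # stability shift
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))  # log softmax
    return -log_probs[correct_class]

loss_uniform = softmax_cross_entropy([1.0, 1.0, 1.0], correct_class=0)    # equal scores
loss_confident = softmax_cross_entropy([10.0, 0.0, 0.0], correct_class=0)
loss_wrong = softmax_cross_entropy([10.0, 0.0, 0.0], correct_class=1)
```

With equal scores over three classes the loss is log(3); it approaches 0 when the correct class has a much higher score and grows when a wrong class does.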
Disadvantages:
Because this method calculates the gradient over the entire data set for a single update, each update is
very slow, large data sets become difficult to handle, and the model cannot be updated in real time as new
data arrives.
We define the number of iterations (epochs) in advance, first calculate the gradient vector params_grad, and
then update the parameters params in the direction opposite to the gradient. The learning rate determines
how large a step we take each time.
Batch gradient descent can converge to a global minimum for convex functions and to a local
minimum for non-convex functions.
x += - learning_rate * dx
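The update rule above, applied with a gradient computed over the whole data set, could be sketched as follows. The mean-squared-error objective, the toy data, and the hyperparameter values are assumptions made for the example; `params` and `params_grad` follow the names used in the text.

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, n_epochs=2000):
    """Fit y ~ X @ params by minimising the mean squared error."""
    params = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        # Gradient of the MSE over the *entire* data set: one update per epoch
        params_grad = 2.0 / len(X) * X.T @ (X @ params - y)
        params += -learning_rate * params_grad
    return params

# Hypothetical toy data on the line y = 1 + 2x; the column of ones models the intercept
x_demo = np.linspace(0.0, 1.0, 20)
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])
y_demo = 1.0 + 2.0 * x_demo
params_demo = batch_gradient_descent(X_demo, y_demo)   # approaches [1, 2]
```

Because this objective is convex, the iterates approach the global minimum, in line with the convergence statement above.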
For large data sets that contain similar samples, BGD recomputes gradients for those similar samples before
each update, which is redundant. SGD performs one update per sample, so it avoids this redundancy, is
faster, and can incorporate new samples.
Disadvantages: However, because SGD updates more frequently, the cost function oscillates heavily. BGD
converges to the local minimum of the basin it starts in, whereas the oscillation of SGD may let it jump to
a better local minimum.
When we slowly decrease the learning rate, SGD shows the same convergence behaviour as BGD.
MBGD uses a small batch of samples, that is, n samples to calculate each time. In this way, it can reduce
the variance when the parameters are updated, and the convergence is more stable. It can make full use of
the highly optimized matrix operations in the deep learning library for more efficient gradient calculations.
The difference from SGD is that each cycle does not act on each sample, but a batch with n
samples.
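A sketch of the mini-batch loop, assuming the same mean-squared-error setting as a linear model; the function name, toy data, and hyperparameters are illustrative. Setting `batch_size=1` gives SGD, and setting it to the data set size gives BGD.

```python
import numpy as np

def minibatch_gradient_descent(X, y, learning_rate=0.1, n_epochs=500,
                               batch_size=5, seed=0):
    """Mini-batch gradient descent on the mean squared error of y ~ X @ params."""
    rng = np.random.default_rng(seed)
    params = np.zeros(X.shape[1])
    n_samples = len(X)
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)          # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            # Gradient estimated from the n samples of this batch only
            params_grad = 2.0 / len(batch) * X_b.T @ (X_b @ params - y_b)
            params -= learning_rate * params_grad
    return params

# Hypothetical noise-free data on the line y = 1 + 2x
x_demo = np.linspace(0.0, 1.0, 20)
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])
y_demo = 1.0 + 2.0 * x_demo
params_demo = minibatch_gradient_descent(X_demo, y_demo)
```

Each batch gradient is a single matrix product, which is the kind of operation that deep learning libraries optimize heavily.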
Cons:
However, this threshold has to be set in advance, so it cannot adapt to the characteristics of the data
set.
In addition, this method applies the same learning rate to all parameter updates. If our data is sparse,
we would prefer larger updates for features that occur less frequently.
In addition, for non-convex functions, it is also necessary to avoid getting trapped at a local minimum or
saddle point. Because the error around a saddle point is almost the same in every direction, the gradients in
all dimensions are close to 0, and SGD is easily trapped there.
At a saddle point of a smooth function, the curve, surface, or hypersurface in the neighborhood of the point
lies on different sides of the tangent at that point. For example, a surface shaped like a saddle curves up
in the x-axis direction and down in the y-axis direction, and the saddle point is (0,0).
Momentum
One disadvantage of the SGD method is that its update direction depends entirely on the current batch, so
its update is very unstable. A simple way to solve this problem is to introduce momentum.
Momentum simulates the inertia of a moving object: the direction of the previous update is retained to a
certain extent during the update, while the current update gradient is
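A minimal sketch of a momentum update on a one-dimensional problem. It assumes the common formulation in which a velocity term keeps a fraction `gamma` of the previous update and adds the current gradient step; the function name, `gamma=0.9`, and the test function are illustrative.

```python
def momentum_minimize(grad_fn, x0, learning_rate=0.1, gamma=0.9, n_steps=300):
    """Gradient descent with momentum for a scalar parameter."""
    x = float(x0)
    velocity = 0.0
    for _ in range(n_steps):
        # Keep part of the previous update direction, add the current gradient step
        velocity = gamma * velocity + learning_rate * grad_fn(x)
        x -= velocity
    return x

# Minimise f(x) = (x - 3)**2, whose gradient is 2 * (x - 3)
x_min = momentum_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With `gamma=0` this reduces to plain gradient descent; larger values retain more of the previous direction.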
Adagrad
Adagrad is an algorithm for gradient-based optimization which adapts the learning rate to the parameters,
using low learning rates for parameters associated with frequently occurring features, and using high
learning rates for parameters associated with infrequent features.
But the same update rate may not be suitable for all parameters. For example, some parameters may have
reached the stage where only fine-tuning is needed, but some parameters need to be adjusted a lot due to
the small number of corresponding samples.
Adagrad addresses this problem: it adaptively assigns a different learning rate to each parameter. The
implication is that, for each parameter, the learning rate slows as the total distance it has been updated
increases.
GloVe word embeddings use Adagrad, where infrequent words require larger updates and frequent words
require smaller updates.
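A sketch of the Adagrad update, assuming the usual formulation in which each parameter's step is divided by the square root of its own accumulated squared gradients. The function name, `eps`, and the quadratic test function are illustrative.

```python
import numpy as np

def adagrad_minimize(grad_fn, params0, learning_rate=0.5, eps=1e-8, n_steps=500):
    """Adagrad: per-parameter learning rates from accumulated squared gradients."""
    params = np.asarray(params0, dtype=float).copy()
    grad_sq_sum = np.zeros_like(params)          # grows monotonically
    for _ in range(n_steps):
        grad = grad_fn(params)
        grad_sq_sum += grad ** 2
        params -= learning_rate * grad / (np.sqrt(grad_sq_sum) + eps)
    return params

# f(p) = p0**2 + 10 * p1**2: the second coordinate has ten times larger gradients
grad_demo = lambda p: np.array([2.0 * p[0], 20.0 * p[1]])
params_demo = adagrad_minimize(grad_demo, [1.0, 1.0])
```

Although the two gradients differ by a factor of ten, both coordinates take almost identical steps, because each one is normalised by its own gradient history. Since `grad_sq_sum` only grows, the effective learning rate can only shrink.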
Adadelta
There are three problems with the Adagrad algorithm
It does this by restricting the window of the past accumulated gradient to some fixed size of
w. Running average at time t then depends on the previous average and the current
gradient.
In Adadelta we do not need to set the default learning rate as we take the ratio of the
running average of the previous time steps to the current gradient.
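A sketch of an Adadelta step, assuming the standard formulation: one running average of squared gradients, one running average of squared updates, and their ratio in place of a learning rate. The decay `rho=0.95`, `eps=1e-6`, and the test function are illustrative choices.

```python
import numpy as np

def adadelta_minimize(grad_fn, params0, rho=0.95, eps=1e-6, n_steps=100):
    """Adadelta: step size from the ratio of two running averages, no learning rate."""
    params = np.asarray(params0, dtype=float).copy()
    avg_sq_grad = np.zeros_like(params)     # running average of squared gradients
    avg_sq_update = np.zeros_like(params)   # running average of squared updates
    for _ in range(n_steps):
        grad = grad_fn(params)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
        update = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
        avg_sq_update = rho * avg_sq_update + (1.0 - rho) * update ** 2
        params += update
    return params

# Minimise f(x) = x**2 starting from x = 1
x_start = np.array([1.0])
x_final = adadelta_minimize(lambda p: 2.0 * p, x_start)
```

The running averages replace the fixed window of size w with an exponential decay, so no past gradients need to be stored. In this sketch the first steps are small because both averages start at zero.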
RMSProp
RMSProp stands for Root Mean Square Propagation. It is an adaptive learning rate optimization algorithm
proposed by Geoff Hinton.
RMSProp tries to resolve Adagrad’s radically diminishing learning rates by using a moving
average of the squared gradient. It uses the magnitude of recent gradients to
normalize the gradient.
Adagrad accumulates all previous squared gradients, whereas RMSprop only computes a moving average of
them, so it alleviates the problem of Adagrad's learning rate dropping too quickly.
The difference is that RMSProp uses an exponentially weighted average of the squared gradients.
This damps the directions in which the updates swing widely, so the swing amplitude in each dimension
becomes smaller. It also helps the network converge faster.
In RMSProp learning rate gets adjusted automatically and it chooses a different learning
rate for each parameter.
RMSProp divides the learning rate by the square root of an exponentially decaying average of
squared gradients.
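A sketch of the RMSProp update described above. The decay of 0.9, the learning rate, and the test function are illustrative choices, not values from the notebook.

```python
import numpy as np

def rmsprop_minimize(grad_fn, params0, learning_rate=0.01, decay=0.9,
                     eps=1e-8, n_steps=1000):
    """RMSProp: normalise each gradient by a moving RMS of recent gradients."""
    params = np.asarray(params0, dtype=float).copy()
    avg_sq_grad = np.zeros_like(params)
    for _ in range(n_steps):
        grad = grad_fn(params)
        # Exponentially decaying average, so old gradients are gradually forgotten
        avg_sq_grad = decay * avg_sq_grad + (1.0 - decay) * grad ** 2
        params -= learning_rate * grad / (np.sqrt(avg_sq_grad) + eps)
    return params

# f(p) = p0**2 + 10 * p1**2, starting from (1, 1)
grad_demo = lambda p: np.array([2.0 * p[0], 20.0 * p[1]])
params_demo = rmsprop_minimize(grad_demo, [1.0, 1.0])
```

Unlike Adagrad's ever-growing sum, the moving average can shrink again, so the effective learning rate does not decay towards zero. With a constant learning rate the iterates end up hovering close to the minimum rather than settling exactly on it.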
Adam
Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each
parameter. In addition to storing an exponentially decaying average of past squared gradients like Adadelta
and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
Adam uses exponential moving averages of the gradients to scale the learning
rate, instead of a simple accumulated sum as in Adagrad.
Adam optimizer is one of the most popular and famous gradient descent optimization algorithms.
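A sketch of the Adam update with its two moving averages. The bias-correction terms and the default values `beta1=0.9`, `beta2=0.999` follow the commonly published formulation and are not stated in the text above; the function name and test function are illustrative.

```python
import numpy as np

def adam_minimize(grad_fn, params0, learning_rate=0.01, beta1=0.9,
                  beta2=0.999, eps=1e-8, n_steps=2000):
    """Adam: momentum-like first moment plus RMSProp-like second moment."""
    params = np.asarray(params0, dtype=float).copy()
    m = np.zeros_like(params)   # decaying average of past gradients
    v = np.zeros_like(params)   # decaying average of past squared gradients
    for t in range(1, n_steps + 1):
        grad = grad_fn(params)
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        m_hat = m / (1.0 - beta1 ** t)   # correct the bias from starting at zero
        v_hat = v / (1.0 - beta2 ** t)
        params -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return params

# f(p) = p0**2 + 10 * p1**2, starting from (1, 1)
grad_demo = lambda p: np.array([2.0 * p[0], 20.0 * p[1]])
params_demo = adam_minimize(grad_demo, [1.0, 1.0])
```

The two state arrays `m` and `v` are the extra memory that Adam needs compared with plain SGD.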
Comparisons
Q1: What is the role of optimization algorithms in artificial neural networks? Why are they
necessary?
A1: Optimization algorithms play a crucial role in artificial neural networks as they are responsible for
updating the network's parameters during the training process. The main objective of training a neural
network is to minimize the loss function, which measures the difference between the predicted outputs and
the actual targets. Optimization algorithms are necessary because they determine how the network should
adjust its weights and biases to reach the optimal configuration that minimizes the loss and improves the
model's performance.
Q2: Explain the concept of gradient descent and its variants. Discuss their differences and tradeoffs
in terms of convergence speed and memory requirements.
A2: Gradient descent is a popular optimization algorithm used in training neural networks. It works by
calculating the gradient of the loss function with respect to the model parameters and then updating the
parameters in the opposite direction of the gradient to minimize the loss. The basic variants of gradient
descent include:
Batch Gradient Descent: It computes the gradient using the entire training dataset, making it
computationally expensive and memory-intensive. However, it provides a more stable convergence
path, leading to better convergence.
Stochastic Gradient Descent (SGD): This variant randomly selects one training sample at a time to
compute the gradient, which reduces memory requirements but introduces more noise and oscillations
during training. It converges faster in many cases but might not reach the global minimum.
Mini-batch Gradient Descent: It strikes a balance between batch and stochastic gradient descent. It
randomly samples a small subset (mini-batch) of the training data to compute the gradient, combining
the advantages of both batch and SGD. It is widely used due to its efficiency and convergence
properties.
Q3: Describe the challenges associated with traditional gradient descent optimization methods (e.g.,
slow convergence, local minima). How do modern optimizers address these challenges?
A3: Traditional gradient descent methods, such as Batch Gradient Descent and Stochastic Gradient
Descent, face several challenges during training:
Slow Convergence: Traditional methods may take a long time to converge to the optimal solution,
especially when dealing with complex and high-dimensional data.
Local Minima: They can get trapped in local minima, preventing the model from finding the global
minimum of the loss function.
A4: Momentum and learning rate are essential concepts in optimization algorithms:
Momentum: Momentum introduces inertia to the parameter updates, helping the optimizer to continue
moving in the same direction as previous updates. This accelerates convergence and smoothes the
optimization path, reducing oscillations. It can help escape local minima and speed up convergence,
especially in areas with high curvature.
Learning Rate: The learning rate controls the step size taken during parameter updates. A large
learning rate may lead to overshooting and unstable updates, while a small learning rate may slow
down convergence. It needs to be carefully tuned to find the right balance between fast convergence
and avoiding divergence.
Both momentum and learning rate significantly impact convergence and model performance. Properly
tuned momentum can help the optimizer navigate complex loss surfaces, while a suitable learning rate
is crucial for achieving fast and stable convergence without overshooting the optimal solution.
Q1: Explain the concept of Stochastic Gradient Descent (SGD) and its advantages compared to
traditional gradient descent. Discuss its limitations and scenarios where it is most suitable.
A1: Stochastic Gradient Descent (SGD) is an optimization technique that computes the gradient and
updates the model's parameters for each individual training sample. Its advantages over traditional gradient
descent methods include:
Faster Updates: Since it processes one training sample at a time, it updates the parameters more
frequently, leading to faster convergence, especially in large datasets.
Lower Memory Requirements: SGD uses less memory as it only needs to store information about a single
sample at a time, making it suitable for training on large datasets that do not fit entirely in memory.
Escaping Local Minima: The noise introduced by the randomness of sample selection in SGD can help
escape shallow local minima, leading to the possibility of finding better solutions.
Limitations:
Noisy Updates: The stochastic nature of SGD can cause noisy updates, leading to fluctuations in the
optimization path, which may slow down convergence.
Learning Rate Sensitivity: It requires careful tuning of the learning rate, as a large learning rate can lead to
divergence, while a small learning rate may slow down convergence.
SGD is most suitable when working with large datasets where memory constraints are an issue and when
the optimization landscape has many local minima, as it increases the chances of finding better solutions.
Q2: Describe the concept of the Adam optimizer and how it combines momentum and adaptive
learning rates. Discuss its benefits and potential drawbacks.
A2: The Adam optimizer is an extension of Stochastic Gradient Descent with momentum. It combines
the advantages of both momentum and adaptive learning rates. The key features of Adam are:
Momentum: Adam uses momentum, just like traditional momentum-based optimization, to smooth the
optimization path and speed up convergence.
Adaptive Learning Rates: It incorporates adaptive learning rates for each parameter based on the historical
gradient information. It maintains separate learning rates for each parameter, allowing faster convergence
for frequently updated parameters and more stability for less frequently updated ones.
Benefits of Adam:
Fast Convergence: Adam typically converges faster compared to standard stochastic gradient descent
methods due to its adaptive learning rates.
Robustness: It performs well across a wide range of different architectures and datasets, requiring less
manual tuning of hyperparameters.
Potential Drawbacks:
Memory Intensive: Adam needs to store the historical gradient information for each parameter, making it
more memory-intensive compared to basic SGD.
Sensitivity to Learning Rate: While Adam adapts the learning rates, it can still be sensitive to the initial
learning rate and may require tuning.
Q3: Explain the concept of RMSprop optimizer and how it addresses the challenges of adaptive
learning rates. Compare it with Adam and discuss their relative strengths and weaknesses.
A3: RMSprop (Root Mean Square Propagation) is an optimization algorithm that addresses the
challenges of adaptive learning rates. It works by dividing the learning rate for each parameter by the root
mean square of the historical gradients for that parameter. The formula for RMSprop update is similar to that
of Adam but lacks the momentum term.
Adaptive Learning Rates: Both Adam and RMSprop adapt learning rates based on historical gradients,
allowing them to perform well in various situations without extensive manual tuning.
Momentum: Adam includes momentum, which helps it accumulate velocity and overcome potential noisy
updates. RMSprop lacks momentum and, therefore, may exhibit more oscillations in the optimization path.
Memory Requirements: RMSprop requires less memory compared to Adam since it does not store the
momentum information.
Performance: Adam often shows faster convergence than RMSprop, but this can vary depending on the
dataset and architecture.
Choosing between RMSprop and Adam depends on the specific problem, and it is advisable to
experiment with both optimizers to determine which one performs better in a given scenario.
Generally, Adam is preferred when faster convergence is crucial, while RMSprop can be a good
In [1]: import tensorflow as tf
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt
# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
# Define the deep learning model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
# Function to get the chosen optimizer
def get_optimizer(optimizer_name, learning_rate):
    if optimizer_name == 'SGD':
        return tf.keras.optimizers.SGD(learning_rate=learning_rate)
    elif optimizer_name == 'Adam':
        return tf.keras.optimizers.Adam(learning_rate=learning_rate)
    elif optimizer_name == 'RMSprop':
        return tf.keras.optimizers.RMSprop(learning_rate=learning_rate)
    else:
        raise ValueError("Invalid optimizer name")
# Compile the model with the chosen optimizer
optimizer_name = 'Adam'  # Replace with 'SGD' or 'RMSprop' to use different optimizers
learning_rate = 0.01  # Experiment with different learning rates
optimizer = get_optimizer(optimizer_name, learning_rate)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=64, validation_data=(x_test, y_test))
# Compare model performance
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Curve')
plt.show()
SGD:
Final Training Accuracy: 94.65%
Final Validation Accuracy: 95.60%
Final Training Loss: 0.1883
Final Validation Loss: 0.1530
Adam:
Final Training Accuracy: 95.49%
Final Validation Accuracy: 95.91%
Final Training Loss: 0.1803
Final Validation Loss: 0.2242
RMSprop:
Final Training Accuracy: 95.76%
Final Validation Accuracy: 96.45%
Final Training Loss: 0.3231
Final Validation Loss: 0.4432
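Based on the provided results, RMSprop achieved the highest validation accuracy (96.45%), but it also had
the highest validation loss (0.4432); SGD had the lowest validation loss (0.1530). Judged by validation
accuracy, RMSprop performed best among the three optimizers on the given neural network architecture and
MNIST dataset, although the differences are small and the ranking changes if validation loss is used instead.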
Dropout:
Refer to the paper (https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
ℓ1 and ℓ2 regularization
In [ ]: from tensorflow import keras
In [ ]: model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten_1 (Flatten) (None, 784) 0
_________________________________________________________________
dense_4 (Dense) (None, 300) 235500
_________________________________________________________________
dense_5 (Dense) (None, 100) 30100
_________________________________________________________________
dense_6 (Dense) (None, 10) 1010
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten_2 (Flatten) (None, 784) 0
_________________________________________________________________
dense_7 (Dense) (None, 300) 235500
_________________________________________________________________
dense_8 (Dense) (None, 100) 30100
_________________________________________________________________
dense_9 (Dense) (None, 10) 1010
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
Max-Norm Regularization
In [ ]: from functools import partial
# Dense layer pre-configured with ELU, He initialization, l2 regularization and a max-norm constraint
RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01),
                           kernel_constraint=keras.constraints.max_norm(1.))
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
# n_epochs = 2
# history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
#                     validation_data=(X_valid_scaled, y_valid))
Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten_6 (Flatten) (None, 784) 0
_________________________________________________________________
dense_13 (Dense) (None, 300) 235500
_________________________________________________________________
dense_14 (Dense) (None, 100) 30100
_________________________________________________________________
dense_15 (Dense) (None, 10) 1010
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
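What a max-norm constraint does to a weight matrix can be sketched in NumPy. This mirrors, as far as I understand it, the behaviour of `keras.constraints.max_norm` with its default axis: each column, holding the incoming weights of one unit, is rescaled so that its L2 norm does not exceed the limit. The function name and the toy matrix are illustrative.

```python
import numpy as np

def apply_max_norm(weights, max_value=1.0, eps=1e-7):
    """Rescale each column of `weights` so that its L2 norm is at most max_value."""
    norms = np.sqrt(np.sum(weights ** 2, axis=0, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)      # norms above the limit are capped
    return weights * desired / (eps + norms)

# Two units: incoming weight norms of 5.0 and 0.5
w_demo = np.array([[3.0, 0.3],
                   [4.0, 0.4]])
w_clipped = apply_max_norm(w_demo, max_value=1.0)
```

The first column is shrunk to unit norm while the second, already below the limit, is left essentially unchanged. Such a constraint is applied after each training step and, unlike l2 regularization, adds no term to the loss.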
Dropout
In [ ]: model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
# n_epochs = 2
# history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
#                     validation_data=(X_valid_scaled, y_valid))
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten_3 (Flatten) (None, 784) 0
_________________________________________________________________
dropout (Dropout) (None, 784) 0
_________________________________________________________________
dense_10 (Dense) (None, 300) 235500
_________________________________________________________________
dropout_1 (Dropout) (None, 300) 0
_________________________________________________________________
dense_11 (Dense) (None, 100) 30100
_________________________________________________________________
dropout_2 (Dropout) (None, 100) 0
_________________________________________________________________
dense_12 (Dense) (None, 10) 1010
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
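The summary shows that the Dropout layers add no parameters. What they do during training can be sketched in NumPy as "inverted" dropout: a random fraction `rate` of the activations is zeroed and the survivors are scaled up so that the expected value is unchanged. The function name and the toy input are illustrative.

```python
import numpy as np

def dropout_forward(activations, rate=0.2, training=True, rng=None):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return activations                  # dropout is inactive at inference time
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

a_demo = np.ones((1000, 50))
dropped = dropout_forward(a_demo, rate=0.2, rng=np.random.default_rng(0))
unchanged = dropout_forward(a_demo, rate=0.2, training=False)
```

With `rate=0.2`, roughly one unit in five is set to 0 and the others become 1.25, so the mean activation stays close to 1.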
In [3]: housing.keys()
Out[4]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
Out[5]:
target
0 4.526
1 3.585
2 3.521
3 3.413
4 3.422
In [6]: X.shape
Out[6]: (20640, 8)
In [7]: y.shape
Out[7]: (20640, 1)
In [9]: print(X_train_full.shape)
print(X_test.shape)
print(X_train.shape)
print(X_valid.shape)
(15480, 8)
(5160, 8)
(11610, 8)
(3870, 8)
In [10]: X_train.shape[1:]
Out[10]: (8,)
Architecture used:
In [11]: LAYERS = [
    tf.keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(1)
]
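Q) When defining the layers for classification, you used Flatten without an activation function, but here
you start directly with a Dense layer and apply ReLU in the very first layer. Why?
The choice of layer architecture and activation functions depends on the task and on the input. Flatten
only reshapes the input (for example, a 28x28 image into a vector of 784 values), so it has no weights and
takes no activation; the housing data is already a flat vector of 8 features, so the first layer can be a
Dense layer. A good default is to use the ReLU activation in the hidden Dense layers. In the output layer,
use sigmoid for binary classification and softmax for multi-class classification; for regression, as here,
the output layer has no activation.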
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 30) 270
=================================================================
Total params: 641
Trainable params: 641
Non-trainable params: 0
_________________________________________________________________
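The parameter totals in these summaries can be checked by hand: a Dense layer with `n_inputs` inputs and `units` outputs has `n_inputs * units` weights plus `units` biases. A small sketch (the helper name `dense_param_count` is illustrative):

```python
def dense_param_count(n_inputs, layer_units):
    """Parameter count of each Dense layer in a stack: weights plus biases."""
    counts = []
    for units in layer_units:
        counts.append(n_inputs * units + units)
        n_inputs = units            # this layer's output feeds the next layer
    return counts

# Housing regression model: 8 features -> 30 -> 10 -> 5 -> 1
housing_counts = dense_param_count(8, [30, 10, 5, 1])
# MNIST models from the regularization section: 784 pixels -> 300 -> 100 -> 10
mnist_counts = dense_param_count(784, [300, 100, 10])
```

The housing layers contribute 270, 310, 55 and 6 parameters, which sum to the 641 reported above; the MNIST layers contribute 235500, 30100 and 1010, which sum to 266,610.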
In [17]: pd.DataFrame(history.history)
Out[17]:
loss val_loss
0 0.765455 0.603858
1 0.461002 0.392607
2 0.408732 0.434563
3 0.389118 0.379874
4 0.375180 0.356915
5 0.366648 0.355371
6 0.361943 0.376984
7 0.359288 0.386361
8 0.357164 0.349506
9 0.352737 0.348626
10 0.349832 0.348474
11 0.346712 0.337208
12 0.342892 0.361606
13 0.340700 0.328218
14 0.339156 0.338185
15 0.338889 0.338527
16 0.333643 0.353137
17 0.333382 0.388871
18 0.330911 0.366475
19 0.331260 0.328746
In [18]: pd.DataFrame(history.history).plot()
Out[19]: 0.3217606842517853
In [20]: X_test.shape
Out[20]: (5160, 8)
In [30]: new
In [23]: new.shape
Out[23]: (8,)
In [24]: X_test[0]
In [25]: new.reshape((1,8))
In [31]: new2.reshape((1,8))
In [26]: model.predict(new.reshape((1,8)))
In [32]: model.predict(new2.reshape((1,8)))