UNIT 4 ML NN ,DL,CNN-1
Digital Notes
[Department of Computer Science Engineering]
Course : B.TECH
Branch : CSE 3rd Yr
Subject Name :Machine Learning Techniques
(BCS055)
Prepared by: Mr. Abhishek Singh Sengar
Unit 4
• ARTIFICIAL NEURAL NETWORKS – Perceptrons,
Multilayer perceptron, Gradient descent and the Delta
rule, Multilayer networks, Derivation of Backpropagation
Algorithm, Generalization, Unsupervised Learning – SOM
Algorithm and its variant
• DEEP LEARNING – Introduction, concept of convolutional
neural network, Types of layers (Convolutional layers,
Activation function, pooling, fully connected), Concept of
Convolution (1D and 2D) layers, Training of network, Case
studies of CNN, e.g., Diabetic Retinopathy, Building a
smart speaker, Self-driving car, etc.
ARTIFICIAL NEURAL NETWORKS
• Artificial Neural Networks (ANNs), or neural
networks, are computational models inspired by
the human brain, designed to learn from data
and make predictions. They consist of
interconnected nodes (neurons) arranged in
layers that process and transmit information,
mimicking how biological neurons communicate.
• ANNs are a type of machine learning algorithm
and a key component of deep learning
Key aspects of ANNs:
• Structure:
• ANNs typically have an input layer, one or more hidden layers, and an output
layer.
• Nodes:
• Each node (neuron) receives input, processes it, and produces an output, which
is then passed to other nodes.
• Interconnections:
• The connections between nodes have weights, which determine the strength of
the connection, according to ScienceDirect.com.
• Learning:
• ANNs learn by adjusting the weights during a process called "training," where
they compare their predictions with actual outcomes and refine their internal
parameters, according to NVIDIA Developer.
• Applications:
• ANNs are used in a wide range of applications, including image recognition,
natural language processing, and predictive modeling
Perceptron
• A perceptron is a fundamental unit in artificial neural networks, essentially a single
neuron. It takes inputs, applies weights and biases, and then uses an activation function to
produce an output. A multilayer perceptron (MLP), on the other hand, is a more complex
neural network architecture that builds upon the perceptron by stacking multiple layers of
perceptrons together. This allows MLPs to learn more intricate patterns and relationships in
data compared to a single-layer perceptron.
• Here's a more detailed breakdown:
• Perceptron:
• Definition:
• A perceptron is a basic computational unit in neural networks, often considered the building
block for more complex networks.
• Structure:
• It has an input layer, where data is fed in, and a single output neuron that produces a binary
output (usually 0 or 1) based on a threshold.
• Function:
• The perceptron calculates a weighted sum of its inputs and then applies an activation
function to this sum. This function determines the final output.
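The weighted-sum-then-threshold behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `perceptron_output`, the step activation with threshold 0, and the AND-gate weights are all assumptions chosen for the example.

```python
def perceptron_output(inputs, weights, bias, threshold=0.0):
    """Return 1 if the weighted sum plus bias exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > threshold else 0


# Hypothetical weights and bias that make the perceptron act like a logical AND gate.
and_weights = [1.0, 1.0]
and_bias = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron_output([a, b], and_weights, and_bias))
```

Only the input (1, 1) gives a weighted sum above the threshold, so only that input produces 1.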
Limitations:
• Gradient descent minimizes a cost function, which quantifies the error or loss of the
model’s predictions compared to the true labels. The following models illustrate how it is applied:
• 1. Linear Regression
• Gradient descent minimizes the Mean Squared Error (MSE) which serves as the loss
function to find the best-fit line. Gradient Descent is used to iteratively update the
weights (coefficients) and bias by computing the gradient of the MSE with respect
to these parameters.
• Since MSE is a convex function, gradient descent converges to the
global minimum provided the learning rate is chosen appropriately. For each iteration:
• The algorithm computes the gradient of the MSE with respect to the weights and
biases.
• It updates the weights (w) and bias (b) using the formula:
• w(new) = w(old) – α · ∂J/∂w
• b(new) = b(old) – α · ∂J/∂b
• Here J is the cost function (the MSE in this case) and α is the learning rate.
• The formula is the parameter update rule for
gradient descent, which adjusts the weights w
and biases b to minimize a cost function. This
process iteratively adjusts the line’s slope and
intercept to minimize the error.
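As a rough sketch of this update rule for a one-feature linear regression, the loop below computes the MSE gradients by hand and adjusts the slope and intercept. The function name `fit_line`, the learning rate, the epoch count, and the toy data (points on the line y = 2x + 1) are assumptions made for illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) with respect to w and b
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Parameter update: step against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data lying exactly on y = 2x + 1 (assumed for the example)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = fit_line(xs, ys)
print("slope:", round(w, 3), "intercept:", round(b, 3))
```

With a learning rate that is too large the same loop would diverge, which is why the learning rate has to be chosen with care.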
2. Logistic Regression
• In logistic regression, gradient descent minimizes
the Log Loss (Cross-Entropy Loss) to optimize the
decision boundary for binary classification. Since
the output is probabilistic (between 0 and 1), the
sigmoid function is applied. The process involves:
• Calculating the gradient of the log-loss with respect
to the weights.
• Updating weights and biases iteratively with the
gradient descent update rule to maximize the
likelihood of the correct classification.
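A minimal sketch of these two steps for a single input feature is given below. For the sigmoid combined with log loss, the gradient with respect to each weight reduces to (prediction minus label) times the input. The function names, hyperparameters, and toy data are assumptions for illustration only.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=10000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss: (prediction - label) * input
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy binary data (assumed): small x values are class 0, large x values are class 1
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(xs, ys)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print("predicted classes:", predictions)
```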
3. Support Vector Machines (SVMs)
• For SVMs, gradient descent optimizes the hinge loss,
which ensures a maximum-margin hyperplane. The
algorithm:
• Calculates gradients for the hinge loss and the
regularization term (if used, such as L2 regularization).
• Updates the weights to maximize the margin between
classes while minimizing misclassification penalties,
using the same gradient descent update rule as for the models above.
• In this way, gradient descent moves the hyperplane
toward the placement that separates the classes with
the largest possible margin.
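The sketch below illustrates one way this could look for a linear SVM with labels +1 and -1. Because the hinge loss is not differentiable at the hinge point, the code uses a subgradient: a point contributes to the gradient only while its margin is below 1. The function name, the L2 regularization strength, the other hyperparameters, and the toy data are assumptions.

```python
def train_linear_svm(points, labels, learning_rate=0.01, reg_strength=0.01, epochs=1000):
    """Minimize (reg_strength/2)*||w||^2 + mean hinge loss by subgradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        # Start with the gradient of the L2 regularization term
        grad_w = [reg_strength * wi for wi in w]
        grad_b = 0.0
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:
                # Hinge loss is active for this point: add its subgradient
                grad_w[0] -= y * x1 / n
                grad_w[1] -= y * x2 / n
                grad_b -= y / n
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


# Toy linearly separable data (assumed for the example)
points = [(-2.0, -2.0), (-1.5, -2.5), (-2.5, -1.0), (2.0, 2.0), (1.5, 2.5), (2.5, 1.0)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(points, labels)
predictions = [1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1 for x1, x2 in points]
print("predicted labels:", predictions)
```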
• Gradient Descent Python Implementation
• To explore the concept further, here is a practical
implementation.
• Import the necessary libraries and generate the data
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# set random seed for reproducibility
torch.manual_seed(42)

# input features: 100 samples with 2 features each
# (assumed; the definition of x is missing from these notes)
x = torch.randn(100, 2)

# true weights and bias of the linear regression model that generates the data
true_weights = torch.tensor([1.3, -1.0])
true_bias = torch.tensor([-3.5])

# Target variable
y = x @ true_weights + true_bias

# plot the target against each input feature
fig, ax = plt.subplots(1, 2, sharey=True)
ax[0].scatter(x[:, 0], y)
ax[1].scatter(x[:, 1], y)
ax[0].set_xlabel('X1')
ax[0].set_ylabel('Y')
ax[1].set_xlabel('X2')
ax[1].set_ylabel('Y')
plt.show()
Define the loss function
• Vanishing and exploding gradients are common problems that can occur
during the training of deep neural networks. These problems can significantly
slow down the training process or even prevent the network from learning
altogether.
• The vanishing gradient problem occurs when gradients become too small
during backpropagation. As a result, the weights of the network barely
change, and the network is unable to discover the underlying
patterns in the data. Deep neural networks with many layers are especially prone
to this issue: the gradient values can shrink exponentially as they move backward
through the layers, making it hard to update the weights in
the earlier layers effectively.
• The exploding gradient problem, on the other hand, occurs when gradients
become too large during backpropagation. When this happens, the weights
are updated by a large amount, which can cause the network to diverge or
oscillate, making it difficult to converge to a good solution.
• To address these problems, the following techniques can be used:
• Weight initialization: The initialization of weights can be adjusted
to ensure that they are in an appropriate range. Using a different
activation function, such as the Rectified Linear Unit (ReLU), can also
help to mitigate the vanishing gradient problem.
• Gradient clipping: It involves limiting the magnitude of the gradient
during backpropagation, for example by capping each value to a fixed
range or by rescaling the gradient when its norm exceeds a threshold.
This prevents the gradients from becoming too large and helps to
stabilize the training process; it does not address vanishing gradients.
• Batch normalization: It can also help to address these problems by
normalizing the input to each layer, which can prevent the activation
function from saturating and help to reduce the vanishing and
exploding gradient problems.
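As a small illustration of gradient clipping, the sketch below rescales a gradient vector whenever its Euclidean norm exceeds a chosen limit. The helper name `clip_by_norm` and the threshold value are assumptions; deep learning frameworks ship their own clipping utilities, which should be preferred in practice.

```python
import math


def clip_by_norm(gradient, max_norm):
    """Rescale the gradient so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        # Small gradients pass through unchanged
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


# An "exploding" gradient with norm 50 is scaled down to norm 5 (direction preserved)
large = clip_by_norm([30.0, -40.0], max_norm=5.0)
# A gradient that is already small is left as it is
small = clip_by_norm([0.3, -0.4], max_norm=5.0)

print("clipped large gradient:", large)
print("small gradient:", small)
```

Clipping by norm keeps the direction of the update and only shortens the step, which is why it helps with exploding gradients but does nothing for vanishing ones.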
Different Variants of Gradient Descent
• There are several variants of gradient descent that differ in the way the
step size or learning rate is chosen and the way the updates are made.
Here are some popular variants:
• Batch Gradient Descent
• In batch gradient descent, the entire training dataset is used to compute the
gradient and update the model parameters (weights and bias) at each
iteration. This can be slow for large datasets, but the gradient is exact
rather than a noisy estimate.
• It is effective for convex or relatively smooth error surfaces because each
step moves directly along the negative gradient of the cost function.
However, because every update requires a pass over the entire training
dataset, training times and computational costs are higher.
• Stochastic Gradient Descent (SGD)
• In SGD, only one training example is used to compute the
gradient and update the parameters at each iteration. This can
be faster than batch gradient descent but may lead to more
noise in the updates.
• Mini-batch Gradient Descent
• In Mini-batch gradient descent a small batch of training
examples is used to compute the gradient and update the
parameters at each iteration. This can be a good compromise
between batch gradient descent and Stochastic Gradient
Descent, as it can be faster than batch gradient descent and
less noisy than Stochastic Gradient Descent.
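The three variants above differ only in how many examples feed each update, so a single routine with a `batch_size` parameter can sketch all of them. The function name, hyperparameters, and the noise-free toy data (points on y = 3x - 1) are assumptions for illustration.

```python
import random


def gradient_descent(xs, ys, batch_size, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w*x + b on MSE; batch_size selects batch, mini-batch, or stochastic updates."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient estimated from the current batch only
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy data lying exactly on y = 3x - 1 (assumed for the example)
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [3.0 * x - 1.0 for x in xs]

batch_w, batch_b = gradient_descent(xs, ys, batch_size=len(xs))  # batch gradient descent
sgd_w, sgd_b = gradient_descent(xs, ys, batch_size=1)            # stochastic gradient descent
mini_w, mini_b = gradient_descent(xs, ys, batch_size=4)          # mini-batch gradient descent

print("batch:", round(batch_w, 3), round(batch_b, 3))
print("sgd:", round(sgd_w, 3), round(sgd_b, 3))
print("mini-batch:", round(mini_w, 3), round(mini_b, 3))
```

On this clean data all three settings recover roughly the same line; on noisy real data the single-example updates would fluctuate more from step to step.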
• Momentum-based Gradient Descent
• Momentum is a variant of gradient descent that incorporates
information from the previous weight updates to help the
algorithm converge more quickly. It adds a term to the weight
update that is proportional to the running average of the past
gradients, allowing faster movement in directions where the
gradients consistently agree.
• The updates to the parameters are based on the current gradient
and the previous updates. This can help the optimization
process move past shallow local minima and flat regions, and
often reach a good minimum faster.
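A minimal sketch of the momentum update on a one-dimensional problem follows. The velocity term carries over a fraction of the previous update and adds the current gradient step. The function name, the momentum coefficient 0.9, and the quadratic test function f(x) = (x - 3)^2 are assumptions for illustration.

```python
def minimize_with_momentum(gradient_fn, start, learning_rate=0.1, momentum=0.9, steps=300):
    """Gradient descent where each update also carries part of the previous update."""
    position = start
    velocity = 0.0
    for _ in range(steps):
        # New velocity = decayed previous velocity minus the current gradient step
        velocity = momentum * velocity - learning_rate * gradient_fn(position)
        position += velocity
    return position


# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); the minimum is at x = 3
minimum = minimize_with_momentum(lambda x: 2.0 * (x - 3.0), start=0.0)
print("estimated minimum:", round(minimum, 4))
```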
Nesterov Accelerated Gradient (NAG)
• Training (Self-Organizing Map):
• Step 1: Initialize the weights wij (random values may be assumed). Initialize the
learning rate α.
• Step 2: For each input vector x, calculate the squared Euclidean distance to every unit j:
• D(j) = Σ (wij – xi)^2, summed over i = 1 to n, for each j = 1 to m
• Step 3: Find the index J for which D(j) is minimum; this is the winning
index.
• Step 4: For each j within a specific neighborhood of J, and for all i, calculate the
new weight:
• wij(new) = wij(old) + α[xi – wij(old)]
• Step 5: Update the learning rate using:
• α(t+1) = 0.5 * α(t)
• Step 6: Test the stopping condition.
• Below is the implementation of the above approach:
import math


class SOM:

    # Function here computes the winning vector
    # by Euclidean distance
    def winner(self, weights, sample):

        D0 = 0
        D1 = 0

        for i in range(len(sample)):
            D0 = D0 + math.pow((sample[i] - weights[0][i]), 2)
            D1 = D1 + math.pow((sample[i] - weights[1][i]), 2)

        # Selecting the cluster with smallest distance as winning cluster
        if D0 < D1:
            return 0
        else:
            return 1

    # Function here updates the winning vector
    def update(self, weights, sample, J, alpha):
        # Here iterating over the weights of winning cluster and modifying them
        for i in range(len(weights[0])):
            weights[J][i] = weights[J][i] + alpha * (sample[i] - weights[J][i])

        return weights


# Driver code
def main():

    # Training Examples ( m, n )
    T = [[1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]]

    m, n = len(T), len(T[0])

    # weight initialization ( C, n ): one weight vector per cluster
    weights = [[0.2, 0.6, 0.5, 0.9], [0.8, 0.4, 0.7, 0.3]]

    # training
    ob = SOM()

    epochs = 3
    alpha = 0.5

    for i in range(epochs):
        for j in range(m):

            # training sample
            sample = T[j]

            # Compute winner vector
            J = ob.winner(weights, sample)

            # Update winning vector
            weights = ob.update(weights, sample, J, alpha)

    # classify test sample
    s = [0, 0, 0, 1]
    J = ob.winner(weights, s)

    print("Test Sample s belongs to Cluster : ", J)
    print("Trained weights : ", weights)


if __name__ == "__main__":
    main()
Generalization
• In machine learning, generalization refers to a model's ability to make accurate predictions on new,
unseen data that it wasn't explicitly trained on. A model that generalizes well learns the underlying
patterns in the training data and can apply those patterns to new, similar examples. Good generalization
is crucial for building effective and reliable machine learning models.
• Elaboration:
• Deep learning has made significant advancements in various fields, but there
are still some challenges that need to be addressed. Here are some of the main
challenges in deep learning:
• Data availability: Deep learning requires large amounts of data to learn from,
so gathering enough training data is a major concern.
• Computational resources: Training a deep learning model is
computationally expensive and often requires specialized hardware such as GPUs
and TPUs.
• Time-consuming: Depending on the computational resources available,
training (for example, on sequential data) can take days or even months.
• Interpretability: Deep learning models are complex and behave like a black box,
so their results are difficult to interpret.
• Overfitting: When a model is trained for too long on the same data, it can become too
specialized to the training data, leading to poor performance on
new data.
• Deep Learning Applications
• 1. Computer vision
• In computer vision, deep learning models enable machines to identify and
understand visual data. Some of the main applications of deep learning in
computer vision include:
• Object detection and recognition: Deep learning models are used to identify
and locate objects within images and videos, enabling applications such as
self-driving cars, surveillance, and robotics.
• Image classification: Deep learning models can be used to classify images
into categories such as animals, plants, and buildings. This is used in
applications such as medical imaging, quality control, and image retrieval.
• Image segmentation: Deep learning models can be used for image
segmentation into different regions, making it possible to identify specific
features within images.
• 2. Natural language processing (NLP)
• In NLP, deep learning models enable machines to understand and generate human
language. Some of the main applications of deep learning in NLP include:
• Automatic text generation: Deep learning models can learn from a corpus of text, and
the trained models can then automatically generate new text such as summaries
and essays.
• Language translation: Deep learning models can translate text from one language
to another, making it possible to communicate with people from different linguistic
backgrounds.
• Sentiment analysis: Deep learning models can analyze the sentiment of a piece of
text, making it possible to determine whether the text is positive, negative, or
neutral.
• Speech recognition: Deep learning models can recognize and transcribe spoken
words, making it possible to perform tasks such as speech-to-text conversion, voice
search, and voice-controlled devices.
3. Reinforcement learning
• A complete Convolutional Neural Network architecture is also known as a ConvNet. A ConvNet is a sequence of
layers, and every layer transforms one volume to another through a differentiable function.
• Let’s take an example by running a ConvNet on an image of dimension 32 x 32 x 3.
• Input Layer: This is the layer through which we give input to our model. In a CNN, the input is generally an
image or a sequence of images. This layer holds the raw input of the image with width 32, height 32, and
depth 3.
• Convolutional Layers: These layers extract features from the input. They apply
a set of learnable filters, known as kernels, to the input image. The filters/kernels are small matrices,
usually of shape 2×2, 3×3, or 5×5. Each kernel slides over the input image and computes the dot product between
the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps.
If we use a total of 12 filters for this layer (with stride 1 and padding that preserves the spatial size), we get
an output volume of dimension 32 x 32 x 12.
• Activation Layer: By applying an activation function to the output of the preceding layer, activation layers add
nonlinearity to the network. The function is applied element-wise to the output of the
convolution layer. Some common activation functions are ReLU: max(0, x), Tanh, Leaky ReLU, etc. The
volume remains unchanged, hence the output volume will have dimensions 32 x 32 x 12.
• Pooling Layer: This layer is periodically inserted in the ConvNet. Its main function is to reduce the size of
the volume, which speeds up computation, reduces memory use, and also helps to limit overfitting. Two common
types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and
stride 2, the resultant volume will be of dimension 16 x 16 x 12.
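The volume sizes quoted above follow from the standard output-size formula, output = (input + 2 × padding − kernel) / stride + 1. The short sketch below checks the 32 to 32 to 16 progression; the helper name `conv_output_size`, the 3×3 kernel, and the padding of 1 are assumptions chosen so that the convolution preserves the spatial size.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of the output of a convolution or pooling layer along one axis."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Convolution: 32-wide input, 3x3 kernel, stride 1, padding 1 (assumed "same" padding)
after_conv = conv_output_size(32, kernel_size=3, stride=1, padding=1)

# Max pooling: 2x2 window with stride 2 and no padding
after_pool = conv_output_size(after_conv, kernel_size=2, stride=2)

num_filters = 12
print("after convolution:", (after_conv, after_conv, num_filters))
print("after pooling:", (after_pool, after_pool, num_filters))
```

The depth of the volume is set by the number of filters, while pooling changes only the width and height.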
# import the necessary libraries
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from itertools import product
• Convolutional Layer:
• This layer uses filters (kernels) to scan the input image, extracting
features like edges, textures, and shapes.
• Feature Extraction:
• CNNs learn to automatically extract relevant features from the
input data, reducing the need for manual feature engineering.
• Pooling Layers:
• These layers reduce the spatial dimensions of the feature maps,
making the network more robust to small variations in the input.
• Fully Connected Layers:
• These layers integrate the extracted features to make a final
prediction or classification.
# Placeholder inputs (assumed: the image-loading and kernel-definition
# steps are missing from these notes). Replace them with your own image
# and kernel; the kernel below is a common edge-detection example.
image = tf.random.uniform([64, 64, 1])
kernel = tf.constant([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]])

# Reformat
image = tf.image.convert_image_dtype(image, dtype=tf.float32)
image = tf.expand_dims(image, axis=0)
kernel = tf.reshape(kernel, [*kernel.shape, 1, 1])
kernel = tf.cast(kernel, dtype=tf.float32)

# convolution layer
conv_fn = tf.nn.conv2d
image_filter = conv_fn(
    input=image,
    filters=kernel,
    strides=1,  # or (1, 1)
    padding='SAME',
)

plt.figure(figsize=(15, 5))

# Plot the convolved image
plt.subplot(1, 3, 1)
plt.imshow(
    tf.squeeze(image_filter)
)
plt.axis('off')
plt.title('Convolution')

# activation layer
relu_fn = tf.nn.relu
# Image detection
image_detect = relu_fn(image_filter)

plt.subplot(1, 3, 2)
plt.imshow(
    # Reformat for plotting
    tf.squeeze(image_detect)
)
plt.axis('off')
plt.title('Activation')

# Pooling layer
pool = tf.nn.pool
image_condense = pool(input=image_detect,
                      window_shape=(2, 2),
                      pooling_type='MAX',
                      strides=(2, 2),
                      padding='SAME',
                      )

plt.subplot(1, 3, 3)
plt.imshow(tf.squeeze(image_condense))
plt.axis('off')
plt.title('Pooling')
plt.show()
How CNNs Work:
• 1. Input:
• A CNN takes an image as input (e.g., a 3D matrix representing the image's pixels).
• 2. Convolution:
• Convolutional layers apply filters to the input, creating feature maps that highlight
specific patterns.
• 3. Activation:
• Non-linear activation functions (like ReLU) are applied to introduce non-linearity into
the model.
• 4. Pooling:
• Pooling layers reduce the spatial dimensions of the feature maps.
• 5. Fully Connected Layers:
• The extracted features are passed to fully connected layers for classification or
prediction.
Applications:
• 1D convolutions operate along a single axis, such as time, while 2D convolutions operate on two axes, like width
and height. 1D convolutions are commonly used for sequential data like time series or text, and 2D convolutions
are used for images.
• 1D Convolution:
• Input: 1D data (e.g., time series, text sequences).
• Kernel: A 1D vector (filter) that slides along the input sequence.
• Output: A 1D feature map.
• Applications: Time series analysis, NLP (Natural Language Processing), audio processing.
• 2D Convolution:
• Input: 2D data (e.g., images).
• Kernel: A 2D matrix (filter) that slides over the input image in two directions (width and height).
• Output: A 2D feature map.
• Applications: Image processing, computer vision, video analysis.
• Key Differences:
• Input Data: 1D convolutions work on 1D data, while 2D convolutions work on 2D data.
• Kernel: 1D convolutions use a 1D kernel, while 2D convolutions use a 2D kernel.
• Output: 1D convolutions produce a 1D output, while 2D convolutions produce a 2D output.
• Direction of Sliding: The kernel in 1D convolution slides in one direction, while the kernel in 2D convolution
slides in two directions (width and height).
• Applications: 1D convolutions are suitable for sequential data, while 2D convolutions are suitable for images.
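To make the difference concrete, the sketch below implements both operations in plain Python with no padding and stride 1. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation. The function names `conv1d` and `conv2d` and the toy inputs are assumptions for illustration.

```python
def conv1d(signal, kernel):
    """Slide a 1D kernel along a sequence (one direction)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def conv2d(image, kernel):
    """Slide a 2D kernel over an image (two directions: rows and columns)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# 1D: a short sequence and a difference kernel
signal = [1, 2, 3, 4, 5]
feature_1d = conv1d(signal, [1, 0, -1])
print(feature_1d)  # [-2, -2, -2]

# 2D: a 3x3 "image" and a 2x2 kernel
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
feature_2d = conv2d(image, [[1, 0], [0, -1]])
print(feature_2d)  # [[-4, -4], [-4, -4]]
```

The 1D output is a list (a 1D feature map), while the 2D output is a grid (a 2D feature map), matching the key differences listed above.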
• In machine learning, network training refers to the process of teaching a neural network or a similar model to perform a specific task by exposing it to
data and adjusting its internal parameters (weights and biases) to minimize prediction errors. This is typically achieved iteratively: backpropagation
computes the gradient of the error with respect to the parameters, and an optimizer such as gradient descent uses that gradient to update them.
• Here's a more detailed explanation:
• 1. The Goal: The goal of network training is to find the optimal set of parameters for a neural network (or other model) so that it can accurately predict
the desired output for new, unseen input data.
• 2. The Process:
• Data:
• Training involves providing the network with a large dataset of labeled examples, where each example consists of an input and the corresponding
correct output.
• Forward Propagation:
• The network processes the input data, passing it through its layers and neurons until it produces an output.
• Error Calculation:
• The difference between the network's output and the correct output (target value) is calculated, often using a "loss function".
• Backward Propagation (Backpropagation):
• This algorithm calculates the error gradient (the rate of change of the loss function with respect to the network's parameters).
• Parameter Adjustment:
• The network's weights and biases are adjusted based on the error gradient, typically using an optimization algorithm like gradient descent, to minimize
the loss function.
• Iteration:
• This process (forward propagation, error calculation, backward propagation, and parameter adjustment) is repeated iteratively for many epochs (passes
through the entire training dataset) until the network's performance improves to a desired level.
• 3. Key Concepts:
• Weights and Biases:
• These are the parameters of the network that control how it processes data. They are adjusted during training to minimize errors.
• Loss Function:
• A mathematical function that quantifies the difference between the network's predictions and the actual target values.
• Optimization Algorithm:
• An algorithm used to adjust the network's parameters in a way that minimizes the loss function (e.g., gradient descent).
• Backpropagation:
• The most common algorithm for training neural networks, it calculates the error gradient and allows for efficient adjustment of the network's
parameters.
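The cycle described above (forward propagation, error calculation, backpropagation, parameter adjustment, repeated over many epochs) can be sketched for a tiny network with one hidden layer. Everything specific here is an assumption for illustration: the XOR toy dataset, three sigmoid hidden units, the squared-error loss, the learning rate, and the random seed. Whether the network fully learns XOR depends on the initialization; the sketch only shows the loss going down.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_xor(hidden_units=3, learning_rate=0.5, epochs=2000, seed=1):
    """Train a 2-input, one-hidden-layer network with backpropagation; return the loss per epoch."""
    rng = random.Random(seed)
    data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden_units)]
    b1 = [0.0] * hidden_units
    w2 = [rng.uniform(-1, 1) for _ in range(hidden_units)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        gw1 = [[0.0, 0.0] for _ in range(hidden_units)]
        gb1 = [0.0] * hidden_units
        gw2 = [0.0] * hidden_units
        gb2 = 0.0
        loss = 0.0
        for x, target in data:
            # Forward propagation
            hidden = [sigmoid(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j])
                      for j in range(hidden_units)]
            output = sigmoid(sum(w2[j] * hidden[j] for j in range(hidden_units)) + b2)
            # Error calculation (squared-error loss)
            loss += 0.5 * (output - target) ** 2
            # Backward propagation: chain rule from the output back to each weight
            delta_out = (output - target) * output * (1.0 - output)
            for j in range(hidden_units):
                delta_hidden = delta_out * w2[j] * hidden[j] * (1.0 - hidden[j])
                gw2[j] += delta_out * hidden[j]
                gb1[j] += delta_hidden
                gw1[j][0] += delta_hidden * x[0]
                gw1[j][1] += delta_hidden * x[1]
            gb2 += delta_out
        # Parameter adjustment: one gradient descent step per epoch
        for j in range(hidden_units):
            w2[j] -= learning_rate * gw2[j]
            b1[j] -= learning_rate * gb1[j]
            w1[j][0] -= learning_rate * gw1[j][0]
            w1[j][1] -= learning_rate * gw1[j][1]
        b2 -= learning_rate * gb2
        losses.append(loss)
    return losses


losses = train_xor()
print("first epoch loss:", round(losses[0], 4))
print("last epoch loss:", round(losses[-1], 4))
```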
Types of Network Training:
• Supervised Learning:
• The network learns from labeled data, where each input
has a corresponding correct output.
• Unsupervised Learning:
• The network learns from unlabeled data, trying to
discover patterns and relationships within the data.
• Reinforcement Learning:
• The network learns by interacting with an environment
and receiving rewards or penalties for its actions.
Here is how to build a smart speaker using a
Convolutional Neural Network (CNN) in Python