Machine Learning - AL3451 - Notes - Unit 4 - Neural Networks
Machine Learning - AL3451 - Notes - Unit 4 - Neural Networks
4th Semester
2nd Semester
Deep Learning -
AD3501
Embedded Systems
Data and Information Human Values and
and IoT - CS3691
5th Semester
7th Semester
8th Semester
Open Elective-1
Distributed Computing Open Elective 2
- CS3551 Project Work /
Elective-3
Open Elective 3 Intership
Big Data Analytics - Elective-4
CCS334 Open Elective 4
Elective-5
Elective 1 Management Elective
Elective-6
Elective 2
All Computer Engg Subjects - [ B.E., M.E., ] (Click on Subjects to enter)
Programming in C Computer Networks Operating Systems
Programming and Data Programming and Data Problem Solving and Python
Structures I Structure II Programming
Database Management Systems Computer Architecture Analog and Digital
Communication
Design and Analysis of Microprocessors and Object Oriented Analysis
Algorithms Microcontrollers and Design
Software Engineering Discrete Mathematics Internet Programming
Theory of Computation Computer Graphics Distributed Systems
Mobile Computing Compiler Design Digital Signal Processing
Artificial Intelligence Software Testing Grid and Cloud Computing
Data Ware Housing and Data Cryptography and Resource Management
Mining Network Security Techniques
Service Oriented Architecture Embedded and Real Time Multi - Core Architectures
Systems and Programming
Probability and Queueing Theory Physics for Information Transforms and Partial
Science Differential Equations
Technical English Engineering Physics Engineering Chemistry
Engineering Graphics Total Quality Professional Ethics in
Management Engineering
Basic Electrical and Electronics Problem Solving and Environmental Science and
and Measurement Engineering Python Programming Engineering
www.BrainKart.com Page 1 of 34
Multi-Layer perceptron defines the most complex architecture of artificial neural networks. It is
substantially formed from multiple layers of the perceptron.
MLP networks are used for supervised learning format. A typical learning algorithm for MLP networks
is also called back propagation's algorithm.
A multilayer perceptron (MLP) is a feed forward artificial neural network that generates a set of outputs
from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed
graph between the input nodes connected as a directed graph between the input and output layers. MLP
uses backpropagation for training the network. MLP is a deep learning method.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 2 of 34
Input Layer: This layer accepts input features. It provides information from the outside world to the
network, no computation is performed at this layer, nodes here just pass on the information(features)
to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world, they are part of the abstraction
provided by any neural network. The hidden layer performs all sorts of computation on the features
entered through the input layer and transfers the result to the output layer.
Output Layer: This layer bring up the information learned by the network to the outer world.
What is an activation function and why use them?
The activation function decides whether a neuron should be activated or not by calculating the
weighted sum and further adding bias to it. The purpose of the activation function is to introduce non-
linearity into the output of a neuron.
Explanation: We know, the neural network has neurons that work in correspondence with weight,
bias, and their respective activation function. In a neural network, we would update the weights and
biases of the neurons on the basis of the error at the output. This process is known as back-
propagation. Activation functions make the back-propagation possible since the gradients are
supplied along with the error to update the weights and biases.
Why do we need Non-linear activation function?
A neural network without an activation function is essentially just a linear regression model. The
activation function does the non-linear transformation to the input making it capable to learn and
perform more complex tasks.
Mathematical proof
Suppose we have a Neural net like this :-
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 3 of 34
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 4 of 34
Tanh Function
• The activation that works almost always better than sigmoid function is Tanh function also
known as Tangent Hyperbolic function. It’s actually mathematically shifted version of the
sigmoid function. Both are similar and can be derived from each other.
• Equation :-
• Value Range :- -1 to +1
• Nature :- non-linear
• Uses :- Usually used in hidden layers of a neural network as it’s values lies between -1 to
1 hence the mean for the hidden layer comes out be 0 or very close to it, hence helps
in centering the data by bringing mean close to 0. This makes learning for the next layer
much easier.
RELU Function
• It Stands for Rectified linear unit. It is the most widely used activation function. Chiefly
implemented in hidden layers of Neural network.
• Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.
• Value Range :- [0, inf)
• Nature :- non-linear, which means we can easily backpropagate the errors and have
multiple layers of neurons being activated by the ReLU function.
• Uses :- ReLu is less computationally expensive than tanh and sigmoid because it involves
simpler mathematical operations. At a time only a few neurons are activated making the
network sparse making it efficient and easy for computation.
In simple words, RELU learns much faster than sigmoid and Tanh function.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 5 of 34
Softmax Function
The softmax function is also a type of sigmoid function but is handy when we are trying to handle
multi- class classification problems.
• Nature :- non-linear
• Uses :- Usually used when trying to handle multiple classes. the softmax function was
commonly found in the output layer of image classification problems.The softmax function
would squeeze the outputs for each class between 0 and 1 and would also divide by the sum
of the outputs.
• Output:- The softmax function is ideally used in the output layer of the classifier where we
are actually trying to attain the probabilities to define the class of each input.
• The basic rule of thumb is if you really don’t know what activation function to use, then
simply use RELU as it is a general activation function in hidden layers and is used in most
cases these days.
• If your output is for binary classification then, sigmoid function is very natural choice for
output layer.
• If your output is for multi-class classification then, Softmax is very useful to predict the
probabilities of each classes.
➢ Supervised learning
➢ Unsupervised learning
➢ Reinforced learning
➢ Hebbian learning
➢ Gradient descent learning
➢ Competitive learning
➢ Stochastic learning
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 6 of 34
Supervised learning :
Every input pattern that is used to train the network is associated with an output pattern which is the
target or the desired pattern.
A teacher is assumed to be present during the training process, when a comparison is made
between the network’s computed output and the correct expected output, to determine the
error.The error can then be used to change network parameters, which result in an improvement
in performance.
Unsupervised learning:
In this learning method the target output is not presented to the network.It is as if there is no
teacher to present the desired patterns and hence the system learns of its own by discovering
and adapting to structural features in the input patterns.
Reinforced learning:
In this method, a teacher though available, doesnot present the expected answer but only
indicates if the computed output correct or incorrect.The information provided helps the
network in the learning process.
Hebbian learning:
This rule was proposed by Hebb and is based on correlative weight adjustment.This is the
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 7 of 34
oldestlearning mechanism inspired by biology.In this, the input-output pattern pairs (𝑥𝑖, 𝑦𝑖) are
associated by the weight matrix W, known as the correlation matrix.
It is computed as
𝑛 𝑥𝑖𝑦𝑖𝑇
W = ∑ 𝑖= ------------ eq(1)
1
Here 𝑦𝑖𝑇 is the transposeof the associated output vector 𝑦𝑖.Numerous variants of the
rule havebeen proposed.
Gradient descent learning:
This is based on the minimization of error E defined in terms of weights and activation
function of the network.Also it is required that the activation function employed by the network
is differentiable, as the weight update is dependent on the gradient of the error E.
Thus if ∆𝑤𝑖𝑗 is the weight update of the link connecting the 𝑖𝑡ℎ and 𝑗𝑡ℎ neuron of the two
neighbouring layers, then ∆𝑤𝑖𝑗 is defined as,
∆𝑤 = ɳ 𝜕 𝐸 ----------- eq(2)
𝑖𝑗 𝜕𝑤𝑖𝑗
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 8 of 34
weight 𝑤𝑖𝑗.
❖ The word ‘stochastic‘ means a system or a process that is linked with a random
probability.
❖ Hence, in Stochastic Gradient Descent, a few samples are selected randomly
instead of the whole data set for each iteration.
❖ In Gradient Descent, there is a term called “batch” which denotes the totalnumber
of samples from a dataset that is used for calculating the gradient for each iteration.
❖ In typicalGradient Descent optimization, like Batch Gradient Descent, the batch is
taken to be the whole dataset.
❖ Although, using the whole dataset is really useful for getting to the minima in a
less noisy and less random manner, but the problem arises when our
datasets gets big.
❖ Suppose, you have a million samples in your dataset, so if you use a typical
Gradient Descent optimization technique, you will have to use all of the one million
samples for completing one iteration while performing the Gradient Descent, and
it has to be done for every iteration until the minima is reached. Hence, it becomes
computationally very expensive to perform
4.5 Backpropagation
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 9 of 34
❖ It includes the following factors that are responsible for the training and
performance of the network:
Working of Backpropagation
Need of Backpropagation
o Since it is fast as well as simple, it is very easy to implement.
o Apart from no of inputs, it does not encompass of any other parameter to perform
tuning.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 10 of 34
o As it does not necessitate any kind of prior knowledge, so it tends out to be more
flexible.
o It is a standard method that results well.
• Static Back-propagation
• Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input
for static output. It is useful to solve static classification issues like optical character
recognition.
Recurrent Backpropagation:
Recurrent Back propagation in data mining is fed forward until a fixed value is achieved.
After that, the error is computed and propagated backward.
The main difference between both of these methods is: that the mapping is rapid in
static back-propagation while it is nonstatic in recurrent backpropagation.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 11 of 34
• Discomfort (bias)
The main features of Backpropagation are the iterative, recursive and efficient method
through which it calculates the updated weight to improve the network until it is not able
to perform the task for which it is being trained. Derivatives of the activation function to
be known at network design time is required to Backpropagation.
Now, how error function is used in Backpropagation and how Backpropagation works?
Let start with an example and do it mathematically to understand how exactly updates
the weight using Backpropagation.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 12 of 34
Input values
X1=0.05
X2=0.10
Initial weight
W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
Forward Pass
To find the value of H1 we first multiply the input value from the weights as
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 13 of 34
Now, we calculate the values of y1 and y2 in the same way as we calculate the H1 and H2.
To find the value of y1, we first multiply the input value i.e., the outcome of H1 and H2
from the weights as
y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597
y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214
Our target values are 0.01 and 0.99. Our y1 and y2 value is not matched with our target
values T1 and T2.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 14 of 34
Now, we will find the total error, which is simply the difference between the outputs
from the target outputs. The total error is calculated as
Now, we will backpropagate this error to update the weights using a backward pass.
To update the weight, we calculate the error correspond to each weight with the help of
a total error. The error on weight w is calculated by differentiating total error with
respect to w.
From equation two, it is clear that we cannot partially differentiate it with respect to w5
because there is no any w5. We split equation one into multiple terms so that we can
easily differentiate it with respect to w5 as
Now, we calculate each term one by one to differentiate Etotal with respect to w5 as
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 15 of 34
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 16 of 34
Now, we will calculate the updated weight w5new with the help of the following formula
In the same way, we calculate w6new,w7new, and w8new and this will give us the following
values
w5new=0.35891648
w6new=408666186
w7new=0.511301270
w8new=0.561370121
Now, we will backpropagate to our hidden layer and update the weight w1, w2, w3, and
w4 as we have done with w5, w6, w7, and w8 weights.
From equation (2), it is clear that we cannot partially differentiate it with respect to w1
because there is no any w1. We split equation (1) into multiple terms so that we can easily
differentiate it with respect to w1 as
Now, we calculate each term one by one to differentiate Etotal with respect to w1 as
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 17 of 34
We again Split both because there is no any y1 and y2 term in E1 and E2. We
split it as
Now, we find the value of by putting values in equation (18) and (19) as
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 18 of 34
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 19 of 34
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 20 of 34
We calculate the partial derivative of the total net input to H1 with respect to w1 the same
as we did for the output neuron:
So, we put the values of in equation (13) to find the final result.
Now, we will calculate the updated weight w1new with the help of the following formula
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 21 of 34
In the same way, we calculate w2new,w3new, and w4 and this will give us the following
values
w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229
We have updated all the weights. We found the error 0.298371109 on the network when
we fed forward the 0.05 and 0.1 inputs. In the first round of Backpropagation, the total
error is down to 0.291027924. After repeating this process 10,000, the total error is down
to 0.0000351085. At this point, the outputs neurons generate 0.159121960 and
0.984065734 i.e., nearby our target value when we feed forward the 0.05 and 0.1.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 22 of 34
Certain activation functions, like the sigmoid function, squishes a large input space into a
small input space between 0 and 1. Therefore, a large change in the input of the sigmoid
function will cause a small change in the output. Hence, the derivative becomes small.
As an example, Image 1 is the sigmoid function and its derivative. Note how when the
inputs of the sigmoid function becomes larger or smaller (when |x| becomes bigger), the
derivative becomes close to zero.
For shallow network with only a few layers that use these activations, this isn’t a big
problem. However, when more layers are used, it can cause the gradient to be too small
for training to work effectively.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 23 of 34
However, when n hidden layers use an activation like the sigmoid function, n small
derivatives are multiplied together. Thus, the gradient decreases exponentially as we
propagate down to the initial layers.
A small gradient means that the weights and biases of the initial layers will not be
updated effectively with each training session. Since these initial layers are often crucial
to recognizing the core elements of the input data, it can lead to overall inaccuracy of the
whole network.
Solutions:
The simplest solution is to use other activation functions, such as ReLU, which doesn’t
cause a small derivative.
Residual networks are another solution, as they provide residual connections straight to
earlier layers. As seen in Image 2, the residual connection directly adds the value at the
beginning of the block, x, to the end of the block (F(x)+x). This residual connection
doesn’t go through activation functions that “squashes” the derivatives, resulting in a
higher overall derivative of the block.
Finally, batch normalization layers can also resolve the issue. As stated before, the
problem arises when a large input space is mapped to a small one, causing the
derivatives to disappear. In Image 1, this is most clearly seen at when |x| is big. Batch
normalization reduces this problem by simply normalizing the input so |x| doesn’t reach
the outer edges of the sigmoid function. As seen in Image 3, it normalizes the input so
that most of it falls in the green region, where the derivative isn’t too small.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 24 of 34
❖ Here the prefix "hyper" suggests that the parameters are top-level parameters that
are used in controlling the learning process.
❖ The value of the Hyperparameter is selected and set by the machine learning
engineer before the learning algorithm begins training the model.
❖ Hence, these are external to the model, and their values cannot be changed
during the training process.
Model Parameters:
Model parameters are configuration variables that are internal to the model, and a model
learns them on its own. For example, W Weights or Coefficients of independent
variables in the Linear regression model. or Weights or Coefficients of independent
variables in SVM, weight, and biases of a neural network, cluster centroid in
clustering. Some key points for model parameters are as follows:
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 25 of 34
Model Hyperparameters:
Hyperparameters are those parameters that are explicitly defined by the user to control
the learning process. Some key points for model parameters are as follows:
Categories of Hyperparameters
Broadly hyperparameters can be divided into two categories, which are given below:
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 26 of 34
o Batch Size: To enhance the speed of the learning process, the training set is
divided into different subsets, which are known as a batch. Number of Epochs: An
epoch can be defined as the complete cycle for training the machine learning
model. Epoch represents an iterative learning process. The number of epochs
varies from model to model, and various models are created with more than one
epoch. To determine the right number of epochs, a validation error is taken into
account. The number of epochs is increased until there is a reduction in a
validation error. If there is no improvement in reduction error for the consecutive
epochs, then it indicates to stop increasing the number of epochs.
Hyperparameters that are involved in the structure of the model are known as
hyperparameters for specific models. These are given below:
o A number of Hidden Units: Hidden units are part of neural networks, which refer
to the components comprising the layers of processors between input and output
units in a neural network.
It is important to specify the number of hidden units hyperparameter for the neural
network. It should be between the size of the input layer and the size of the output layer.
More specifically, the number of hidden units should be 2/3 of the size of the input layer,
plus the size of the output layer.
For complex functions, it is necessary to specify the number of hidden units, but it should
not overfit the model.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 27 of 34
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 28 of 34
❖ Even though the input X was normalized but the output is no longer on
the same scale.
❖ The data passes through multiple layers of network with multiple
times(sigmoidal) activation functions are applied, which leads to an
internal co-variate shift in the data.
❖ This motivates us to move towards Batch Normalization
❖ Normalization is the process of altering the input data to have mean as
zero and standard deviationvalue as one.
Procedure to do Batch Normalization:
(1) Consider the batch input from layer h, for this layer we need to
calculate the mean of this hidden activation.After calculating the
mean the next step is to calculate the standard deviation of the
hidden activations.
(2) Now we normalize the hidden activations using these Mean &
Standard Deviation values. To dothis, we subtract the mean from
each input and divide the whole value with the sum of standard
deviation and the smoothing term (ε).
(3) As the final stage, the re-scaling and offsetting of the input is
performed. Here two components of the BN algorithm is used,
γ(gamma) and β (beta). These parameters are used for re-scaling
(γ) and shifting(β) the vector contains values from the previous
operations.
These two parameters are learnable parameters, Hence
during the training of neural network,the optimal values of γ and β
are obtained and used. Hence we get the accurate normalization of
eachbatch.
4.9 Regularization
Definition: - “any modification we make to a learning algorithm that is intended
to reduce its generalization error but not its training error.”
❖ In the context of deep learning, most regularization strategies
are based onregularizing estimators.
❖ Regularization of an estimator works by trading increased bias
for reducedvariance.An effective regularizer is one that makes a
profitable trade, reducing variancesignificantly while not overly
increasing the bias.
❖ Many regularization approaches are based on limiting the capacity
of models, such as neural networks, linear regression, or logistic
regression, by adding a parameter norm penalty Ω(θ) to the
objective function J. We denote the regularized objective function
by J˜
J˜(θ; X, y) = J(θ; X, y) + αΩ(θ)
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 29 of 34
We can see that the addition of the weight decay term has modified the learning
rule to multiplicatively shrink the weight vector by a constant factor on each
step, just before performing the usual gradient update. This describes what
happens in a single step.The approximation ^J is Given by
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 30 of 34
The minimum of ˆJ occurs where its gradient ∇wˆJ(w) = H(w − w∗) is equal to ‘0’To
study the eff ect of weight decay,
As α approaches 0, the regularized solution ˜w approaches w*. But what happens as α grows?
Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an
orthonormal basis of eigenvectors, Q, such that H = QΛQT. Applying Decomposition to theabove
equation, We Obtain
L1 Regularization
While L2 weight decay is the most common form of weight decay, there are other
ways to penalize the size of the model parameters. Another option is to use L1 regularization.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 31 of 34
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 32 of 34
Normalization Standardization
This technique uses minimum and max This technique uses mean and standard deviation
values for scaling of model. for scaling of model.
It is helpful when features are of different It is helpful when the mean of a variable is set to 0
scales. and the standard deviation is set to 1.
Scales values ranges between [0, 1] or [-1, 1]. Scale values are not restricted to a specific range.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 33 of 34
values multiplied by so that the overall sum of the neuron values remains the
same.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
www.BrainKart.com Page 34 of 34
The two images represent dropout applied to a layer of 6 units, shown at multiple
training steps. The dropout rate is 1/3, and the remaining 4 neurons at each training
step have their value scaled by x1.5. Thereby, we are choosing a random sample of
neurons rather than training the whole network at once. This ensures that the co-
adaptation is solved and they learn the hidden features better.
Why dropout works?
• By using dropout, in every iteration, you will work on a smaller neural
network than the previous one and therefore, it approaches regularization.
• Dropout helps in shrinking the squared norm of the weights and this tends to
a reduction in overfitting.
https://play.google.com/store/apps/details?id=info.therithal.brainkart.annauniversitynotes&hl=en_IN
Click on Subject/Paper under Semester to enter.
Professional English Discrete Mathematics Environmental Sciences
Professional English - - II - HS3252 - MA3354 and Sustainability -
I - HS3152 GE3451
Digital Principles and
Statistics and Probability and
Computer Organization
Matrices and Calculus Numerical Methods - Statistics - MA3391
- CS3351
- MA3151 MA3251
3rd Semester
1st Semester
4th Semester
2nd Semester
Deep Learning -
AD3501
Embedded Systems
Data and Information Human Values and
and IoT - CS3691
5th Semester
7th Semester
8th Semester
Open Elective-1
Distributed Computing Open Elective 2
- CS3551 Project Work /
Elective-3
Open Elective 3 Intership
Big Data Analytics - Elective-4
CCS334 Open Elective 4
Elective-5
Elective 1 Management Elective
Elective-6
Elective 2
All Computer Engg Subjects - [ B.E., M.E., ] (Click on Subjects to enter)
Programming in C Computer Networks Operating Systems
Programming and Data Programming and Data Problem Solving and Python
Structures I Structure II Programming
Database Management Systems Computer Architecture Analog and Digital
Communication
Design and Analysis of Microprocessors and Object Oriented Analysis
Algorithms Microcontrollers and Design
Software Engineering Discrete Mathematics Internet Programming
Theory of Computation Computer Graphics Distributed Systems
Mobile Computing Compiler Design Digital Signal Processing
Artificial Intelligence Software Testing Grid and Cloud Computing
Data Ware Housing and Data Cryptography and Resource Management
Mining Network Security Techniques
Service Oriented Architecture Embedded and Real Time Multi - Core Architectures
Systems and Programming
Probability and Queueing Theory Physics for Information Transforms and Partial
Science Differential Equations
Technical English Engineering Physics Engineering Chemistry
Engineering Graphics Total Quality Professional Ethics in
Management Engineering
Basic Electrical and Electronics Problem Solving and Environmental Science and
and Measurement Engineering Python Programming Engineering