Greater Noida
ASSOCIATION MINING AND
SUPERVISED LEARNING
Unit: 2
MACHINE LEARNING
Dr. Hitesh Singh
Associate Professor
B Tech 5th Sem Section A & B IT DEPARTMENT
I am pleased to introduce myself as Dr. Hitesh Singh, presently associated with NIET, Greater Noida, as
Assistant Professor in the IT Department. I completed my Ph.D. degree under the supervision of Boncho Bonev
(PhD), Technical University of Sofia, Sofia, Bulgaria, in 2019. My research interests include radio wave
propagation and machine learning, and I have rich experience with millimetre-wave technologies.
I started my research career in 2009 and have since published research articles in SCI/Scopus-indexed
journals and conferences (Springer, IEEE, Elsevier). I have presented research work at reputed international
conferences such as the IEEE International Conference on Infocom Technologies and Unmanned
Systems (ICTUS 2017), Dubai, and ELECTRONICA, Sofia. Four patents and two book chapters have been
published (Elsevier Publication) under my inventorship and authorship.
Matrix of CO/PO:
CO        PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
KCS055.1   3   2   2   1   2   2   -   -   -   1    -    -
KCS055.2   3   2   2   3   2   2   1   -   2   1    1    2
KCS055.3   2   2   2   2   2   2   2   1   1   -    1    3
KCS055.4   3   3   1   3   1   1   2   -   2   1    1    2
KCS055.5   3   2   1   2   1   2   1   1   2   1    1    1
AVG       2.8 2.2 1.6 2.2 1.6 1.8 1.2 0.4 1.4 0.8  0.8  1.6
Matrix of CO/PSO:
CO       PSO1 PSO2 PSO3 PSO4
RCS080.1 3 2 3 1
RCS080.2 3 2 2 3
RCS080.3 3 2 3 2
RCS080.4 2 1 1 1
RCS080.5 2 2 1 2
Prerequisites:
• Statistics.
• Linear Algebra.
• Calculus.
• Probability.
• Programming Languages.
https://www.youtube.com/watch?v=PPLop4L2eGk&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN
➢ Unit 2 Content:
Classification and Regression,
Regression: Linear Regression,
Multiple Linear Regression,
Logistic Regression,
Polynomial Regression,
Decision Trees: ID3, C4.5, CART.
Apriori Algorithm: Market basket analysis, Association Rules.
Neural Networks: Introduction, Perceptron, Multilayer Perceptron, Support vector
machine.
Linear Regression
Logistic Regression
Polynomial Regression
• Logistic Regression:
• Logistic regression is one of the most popular machine learning algorithms that
come under supervised learning techniques.
• It can be used for classification as well as regression problems, but it is mainly
used for classification.
• Logistic regression is used to predict a categorical dependent variable with the
help of independent variables.
• The output of a logistic regression model is always between 0 and 1.
• Logistic regression is used where the probability of one of two classes is
required, such as whether it will rain today or not: 0 or 1, true or false, etc.
• Logistic regression is based on the concept of maximum likelihood estimation,
which chooses the parameters under which the observed data are most probable.
• In logistic regression, we pass the weighted sum of inputs through an activation
function that maps values between 0 and 1. This activation function is known as
the sigmoid function, sigma(z) = 1 / (1 + e^(-z)), and the curve obtained is called
the sigmoid curve or S-curve.
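• As a quick illustration, here is a minimal Python sketch of the sigmoid (the function name and the example weights, inputs and bias are my own, chosen only for illustration):

    import numpy as np

    def sigmoid(z):
        # Maps any real-valued net input into the open interval (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([2.0, -1.0])   # example inputs (hypothetical values)
    w = np.array([0.5, 1.2])    # example weights
    b = -0.3                    # example bias
    print(sigmoid(np.dot(w, x) + b))   # about 0.378, strictly between 0 and 1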
(Slides 34–57 presented a graphical walk-through of regression use cases, regression fits, and logistic regression fits; the figures are not reproduced here.)
Logistic Regression!!
Example: hours studied versus passing an exam.

Hours: 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
Pass:   0    0    0    0    0    0    1    0    1    0    1    0    1    0    1    1    1    1    1    1

Fitted logistic regression model:

            Coefficient   Std. Error   z-value   P-value (Wald)
Intercept   −4.0777       1.7610       −2.316    0.0206
Hours        1.5046       0.6287        2.393    0.0167
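A minimal scikit-learn sketch of fitting this data (assuming NumPy and scikit-learn are available; the variable names are mine, and a very weak penalty, C = 1e6, is used so the fit approximates the unregularized maximum-likelihood coefficients in the table):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    hours = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
                      2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]).reshape(-1, 1)
    passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

    model = LogisticRegression(C=1e6)   # large C = almost no regularization
    model.fit(hours, passed)
    print(model.intercept_, model.coef_)        # roughly -4.08 and 1.50
    print(model.predict_proba([[3.0]])[:, 1])   # P(pass | 3 hours), about 0.61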
Linear Regression vs. Logistic Regression:
• Linear regression is used to predict a continuous dependent variable from a given set of independent variables; logistic regression is used to predict a categorical dependent variable.
• Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
• In linear regression, we predict the value of a continuous variable; in logistic regression, we predict the value of a categorical variable.
• In linear regression, we find the best-fit line, from which we can easily predict the output; in logistic regression, we find the S-curve, by which we can classify the samples.
• Linear regression parameters are estimated by the least squares method; logistic regression parameters are estimated by maximum likelihood.
• The output of linear regression must be a continuous value, such as a price or an age; the output of logistic regression must be a categorical value, such as 0 or 1, Yes or No.
• Linear regression requires a linear relationship between the dependent and independent variables; logistic regression does not.
• In linear regression, there may be collinearity between the independent variables; in logistic regression, there should not be collinearity between the independent variables.
Decision Tree (CO1,2,3,5)
Introduction
• Each node in the tree acts as a test case for some attribute,
and each edge descending from the node corresponds to the
possible answers to the test case.
The ID3 (Iterative Dichotomiser 3) algorithm builds decision trees using a top-
down greedy search approach through the space of possible branches with no
backtracking. A greedy algorithm, as the name suggests, always makes the choice
that seems to be the best at that moment.
The main attribute selection measures are:
1. Entropy,
2. Information gain,
3. Gini index,
4. Gain Ratio,
5. Reduction in Variance
6. Chi-Square
• Entropy:
• Entropy is a measure of the randomness in the information being
processed.
• The higher the entropy, the harder it is to draw any conclusions from that
information.
• Flipping a fair coin is an example of an action whose outcome is maximally
random.
• For a set S with class proportions p_i, entropy is defined as
H(S) = − sum over i of p_i · log2(p_i).
Information Gain:
• Information gain (IG) is a statistical property that measures how well a given
attribute separates the training examples according to their target classification.
• For an attribute A splitting S into subsets S_v, the gain is
IG(S, A) = H(S) − sum over v of (|S_v| / |S|) · H(S_v).
• Constructing a decision tree is all about finding the attribute that returns the
highest information gain and the smallest entropy.
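• A small Python sketch of both measures (the helper names are mine; the 9-positive/5-negative set and its split are a common textbook illustration):

    import numpy as np
    from collections import Counter

    def entropy(labels):
        # H(S) = -sum(p_i * log2(p_i)) over the classes present in S.
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(parent, children):
        # IG = H(parent) - weighted average of the children's entropies.
        n = len(parent)
        return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

    parent = ['+'] * 9 + ['-'] * 5
    left, right = ['+'] * 6 + ['-'] * 1, ['+'] * 3 + ['-'] * 4
    print(entropy(parent))                          # about 0.940
    print(information_gain(parent, [left, right]))  # about 0.151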
Gini Index
• You can understand the Gini index as a cost function used to evaluate
splits in the dataset.
• It is calculated by subtracting the sum of the squared probabilities of each
class from one.
• It favors larger partitions and is easy to implement, whereas information gain
favors smaller partitions with distinct values.
• The Gini gain of a split is Gain = Gini(before) − sum over j = 1..K of (n_j / n) · Gini(j, after),
where "before" is the dataset before the split, K is the number of subsets generated by the
split, and (j, after) is subset j after the split.
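• A minimal sketch of the Gini index and the gain formula above, reusing the same hypothetical split (function names are mine):

    import numpy as np
    from collections import Counter

    def gini(labels):
        # Gini = 1 - sum of squared class probabilities.
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def gini_gain(before, subsets):
        # Gain = Gini(before) - sum over the K subsets of (n_j / n) * Gini(j, after).
        n = len(before)
        return gini(before) - sum(len(s) / n * gini(s) for s in subsets)

    before = ['+'] * 9 + ['-'] * 5
    left, right = ['+'] * 6 + ['-'] * 1, ['+'] * 3 + ['-'] * 4
    print(gini(before), gini_gain(before, [left, right]))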
Decision Tree (CO1,2,3,5)
Reduction in Variance:
• Variance = sum of (X − X_bar)² / n, where X_bar is the mean of the values, X is an
actual value, and n is the number of values.
• Steps to calculate the variance for a split:
• Calculate the variance for each node.
• Calculate the variance for each split as the weighted average of each node's
variance.
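• A minimal sketch of these two steps (the example target values and split are hypothetical):

    import numpy as np

    def variance(values):
        # Var = sum((X - X_bar)^2) / n, with X_bar the mean of the values.
        x = np.asarray(values, dtype=float)
        return np.mean((x - x.mean()) ** 2)

    def variance_reduction(parent, subsets):
        # Parent variance minus the weighted average of each node's variance.
        n = len(parent)
        return variance(parent) - sum(len(s) / n * variance(s) for s in subsets)

    parent = [10, 12, 18, 20, 30, 34]          # hypothetical target values
    left, right = [10, 12, 18], [20, 30, 34]   # one candidate split
    print(variance_reduction(parent, [left, right]))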
Chi-Square:
• The acronym CHAID stands for Chi-squared Automatic Interaction
Detector; the tree it generates is called a CHAID tree.
• It is one of the oldest tree classification methods.
• It finds the statistical significance of the differences between
sub-nodes and the parent node.
• We measure it by the sum of squares of standardized differences between
the observed and expected frequencies of the target variable.
• It works with a categorical target variable such as "Success" or "Failure".
• It can perform two or more splits.
• The higher the value of chi-square, the higher the statistical significance of
the differences between sub-node and parent node.
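• A sketch of one common form of the statistic, sum of (Observed − Expected)² / Expected, with hypothetical sub-node counts:

    import numpy as np

    def chi_square(observed, expected):
        # Sum of squared standardized differences between observed and
        # expected frequencies of the target classes in a sub-node.
        o, e = np.asarray(observed, float), np.asarray(expected, float)
        return np.sum((o - e) ** 2 / e)

    # Hypothetical sub-node of 20 records; the parent is 50:50 Success/Failure,
    # so 10 of each is expected, while 14 successes and 6 failures are observed.
    print(chi_square([14, 6], [10, 10]))   # 3.2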
In the above diagram, the 'Age' attribute on the left-hand side of the tree has been
pruned, as it has more importance on the right-hand side of the tree; pruning it
reduces overfitting.
Random Forest
• Random Forest is an example of ensemble learning, in which we combine
multiple machine learning algorithms to obtain better predictive
performance.
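• A minimal scikit-learn sketch of a random forest (assuming scikit-learn is installed; the Iris data set and the parameter values are my own choices for illustration):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # An ensemble of 100 decision trees, each grown on a bootstrap sample;
    # the forest predicts by majority vote over the trees.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))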
Introduction of Machine Learning Approaches (CO1,2,3,4)
A Neural Network
• A neural network is a processing device, either an algorithm or actual
hardware, whose design was inspired by the design and functioning of
animal brains and their components.
• Neural networks have the ability to learn by example, which makes them
very flexible and powerful.
• These networks are also well suited for real-time systems because of their
fast response and computation times, which result from their parallel
architecture.
• To depict the basic operation of a neural net, consider a set of neurons, say X1
and X2, transmitting signals to another neuron, Y.
• Here X1 and X2 are input neurons, which transmit signals, and Y is the output
neuron, which receives signals.
• Input neurons X1 and X2 are connected to the output neuron Y over
weighted interconnection links (W1 and W2).
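• A minimal sketch of this net input computation (the signal and weight values are hypothetical):

    import numpy as np

    x = np.array([0.7, 0.4])   # signals from input neurons X1 and X2
    w = np.array([0.2, 0.9])   # weights W1 and W2 on the interconnection links
    y_in = np.dot(x, w)        # net input to Y: x1*w1 + x2*w2
    print(y_in)                # 0.5; an activation function is then applied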
Activation Function
Characteristics of ANN:
• Identity function: f(x) = x. The output here remains the same as the input;
the input layer uses the identity activation function.
• Binary step function: f(x) = 1 if x ≥ Ɵ, else 0, where Ɵ represents the threshold
value. This function is most widely used in single-layer nets to convert the net
input to a binary output (1 or 0).
Weights:
Threshold
• Threshold is a set value based upon which the final output of the network
may be calculated.
• The threshold value is used in the activation function.
• A comparison is made between the calculated net input and the threshold
to obtain the network output.
• For each and every application there is a threshold limit.
Learning Rate:
• The learning rate is denoted by "α". It is used to control the amount of
weight adjustment at each step of training. The learning rate, ranging
from 0 to 1, determines the rate of learning at each time step.
• NOTE: Everyone has to solve the questions sent to you and revert
back in the group.
McCulloch-Pitts Neuron:
• For inhibition to be absolute, the threshold of the activation function should
satisfy the condition Ɵ > n·w − p, where n is the number of excitatory inputs
with weight w and p is the inhibitory weight.
• The Hebb rule, w_i(new) = w_i(old) + x_i · t, is more suited for bipolar data than
binary data. If binary data is used, this weight-updating formula cannot distinguish
two conditions, namely:
1. A training pair in which an input unit is "on" and the target value is "off."
2. A training pair in which both the input unit and the target value are "off."
• Example: design a Hebb net to implement the logical AND function (use bipolar inputs and
targets).
• Solution: The training data for the AND function is given in the table below:

x1   x2   target
 1    1    1
 1   −1   −1
−1    1   −1
−1   −1   −1
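• A minimal sketch of Hebb training on this table (the variable names are mine; the final weights and bias follow directly from the rule above):

    import numpy as np

    # Bipolar training data for AND: inputs and targets from the table above.
    X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
    t = np.array([1, -1, -1, -1])

    w = np.zeros(2)   # weights start at zero
    b = 0.0           # bias starts at zero

    # Hebb rule: w(new) = w(old) + x*t and b(new) = b(old) + t, one pass.
    for x_i, t_i in zip(X, t):
        w += x_i * t_i
        b += t_i

    print(w, b)   # final weights [2. 2.] and bias -2.0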
• A perceptron network consists of three units, namely, the sensory unit (input unit),
the associator unit (hidden unit) and the response unit (output unit).
• The sensory units are connected to associator units with fixed weights having
values 1, 0 or −1, which are assigned at random.
• The binary activation function is used in the sensory unit and the associator unit.
• The response unit has an activation of 1, 0 or −1.
• A binary step function with a fixed threshold Ɵ is used as the activation for the
associator unit.
• The output signals that are sent from the associator unit to the response unit are
only binary.
• The output of the perceptron network is given by y = f(y_in), where
f(y_in) = 1 if y_in > Ɵ, 0 if −Ɵ ≤ y_in ≤ Ɵ, and −1 if y_in < −Ɵ.
• The perceptron learning rule is used in the weight updating between the associator unit
and the response unit.
• For each training input, the net calculates the response and determines whether
an error has occurred.
• The error calculation is based on comparing the targets with the
calculated outputs.
• The weights on the connections from the units that send a nonzero signal get
adjusted suitably.
• The weights are adjusted on the basis of the learning rule if an error has occurred for
a particular training pattern, i.e., w_i(new) = w_i(old) + α·t·x_i and b(new) = b(old) + α·t.
• If no error occurs, there is no weight updating, and hence the training process may be
stopped.
• In the above equations, the target value t is +1 or −1 and α is the learning rate.
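• A minimal sketch of the perceptron learning rule applied to the bipolar AND data (the threshold and learning rate values are my own choices):

    import numpy as np

    def step(y_in, theta=0.2):
        # Activation of 1, 0 or -1 depending on net input and threshold.
        if y_in > theta:
            return 1
        if y_in < -theta:
            return -1
        return 0

    X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
    t = np.array([1, -1, -1, -1])
    w, b, alpha = np.zeros(2), 0.0, 1.0

    for epoch in range(10):
        errors = 0
        for x_i, t_i in zip(X, t):
            if step(np.dot(w, x_i) + b) != t_i:   # update only on error
                w += alpha * t_i * x_i            # w(new) = w(old) + alpha*t*x
                b += alpha * t_i
                errors += 1
        if errors == 0:   # no error in a full pass: training stops
            break

    print(w, b)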
• The Support Vector Machine (SVM) is a supervised learning algorithm mostly used for
classification, but it can also be used for regression.
• The main idea is that, based on the labeled data (training data), the algorithm tries
to find the optimal hyperplane which can be used to classify new data points.
• In two dimensions the hyperplane is a simple line.
• Usually a learning algorithm tries to learn the most common characteristics of a
class (what differentiates one class from another), and the classification is based on
those representative characteristics learnt (so classification is based on differences
between classes).
• SVM works the other way around: it finds the most similar examples
between the classes. Those will be the support vectors.
• So if we compare the picture above with the picture below, we can easily observe
that the first is the optimal hyperplane (line) and the second is a sub-optimal
solution, because its margin is far smaller.
• Because we want to maximize the margin taking into consideration all the classes,
instead of using one margin for each class, we use a "global" margin which takes
all the classes into consideration. This margin would look like the purple line in the
following picture:
• This margin is orthogonal to the boundary and equidistant to the support vectors.
• So where do we have vectors? Each of the calculations (calculating distances and optimal
hyperplanes) is made in vector space, so each data point is considered a vector. The
dimension of the space is defined by the number of attributes of the examples. To
understand the math behind it, please read this brief mathematical description of vectors,
hyperplanes and optimization: SVM Succinctly.
• All in all, support vectors are data points that define the position and the margin of the
hyperplane. We call them "support" vectors because these are the representative data
points of the classes: if we move one of them, the position and/or the margin will change.
Moving other data points won't have an effect on the margin or the position of the
hyperplane.
• To make classifications, we don't need all the training data points (as in the case of KNN); we
have to save only the support vectors. In the worst case all the points will be support vectors, but
this is very rare, and if it happens then you should check your model for errors or bugs.
• So basically the learning is equivalent to finding the hyperplane with the best margin, so
it is a simple optimization problem.
• Basic Steps
• The basic steps of the SVM are:
1. select two hyperplanes (in 2D) which separate the data with no points between
them (the red lines)
2. maximize their distance (the margin)
3. the average line (here, the line halfway between the two red lines) will be the
decision boundary
• This is very nice and easy, but finding the best margin, the optimization problem, is
not trivial (it is easy in 2D, when we have only two attributes, but what if we have
N dimensions, with N a very big number?).
• To solve the optimization problem, we use Lagrange multipliers. To
understand this technique you can read the following two articles: Duality
Lagrange Multiplier and A Simple Explanation of Why Lagrange Multipliers
Work.
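• A minimal scikit-learn sketch of these steps on linearly separable data (assuming NumPy and scikit-learn; the two Gaussian point clouds are hypothetical data):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=-2, size=(20, 2)),
                   rng.normal(loc=2, size=(20, 2))])
    y = np.array([0] * 20 + [1] * 20)

    clf = SVC(kernel='linear')   # linear kernel: the boundary is a straight line
    clf.fit(X, y)

    # Only the support vectors define the hyperplane and its margin.
    print(clf.support_vectors_)
    print(clf.predict([[1.5, 2.0]]))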
• Until now we had linearly separable data, so we could use a line as class boundary.
But what if we have to deal with non-linear data sets?
• In this case we cannot find a straight line to separate apples from lemons. So how
can we solve this problem? We will use the Kernel Trick!
• The basic idea is that when a data set is inseparable in the current dimensions, we add
another dimension; maybe that way the data will be separable.
• Just think about it: the example above is in 2D and it is inseparable, but maybe in
3D there is a gap between the apples and the lemons, maybe there is a level
difference, so apples are on level one and lemons are on level two.
• In this case we can easily draw a separating hyperplane (in 3D a hyperplane is a
plane) between level 1 and 2.
• Now we have to map the apples and lemons (which are just simple points) to this
new space.
• Think about it carefully, what did we do?
• We just used a transformation in which we added levels based on distance.
• If you are in the origin, then the points will be on the lowest level.
• As we move away from the origin, it means that we are climbing the hill (moving
from the center of the plane towards the margins) so the level of the points will be
higher.
• Now if we consider that the origin is the lemon from the center, we will have
something like this:
• Kernel Rules
• Define a kernel or window function as follows:
2. Gaussian kernel
• It is a general-purpose kernel, used when there is no prior knowledge
about the data. Its equation is K(x, y) = exp(−‖x − y‖² / (2σ²)).
6. Sigmoid kernel
• We can use it as a proxy for neural networks. Its equation is
K(x, y) = tanh(γ·xᵀy + r).
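• A small sketch of the two kernels as plain functions (the parameter names sigma, gamma and r follow the usual conventions; the test vectors are mine):

    import numpy as np

    def gaussian_kernel(x, y, sigma=1.0):
        # K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
        return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

    def sigmoid_kernel(x, y, gamma=0.1, r=0.0):
        # K(x, y) = tanh(gamma * <x, y> + r)
        return np.tanh(gamma * np.dot(x, y) + r)

    a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
    print(gaussian_kernel(a, b), sigmoid_kernel(a, b))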
• Mapping from 1D to 2D
• Another, easier example in 2D would be:
• After using the kernel and after all the transformations we will get:
• So after the transformation, we can easily delimit the two classes using just a
single line.
• In real-life applications we won't have a simple straight line; we will have lots
of curves and high dimensions. In some cases we won't have two hyperplanes
that separate the data with no points between them, so we need some trade-offs
and tolerance for outliers. Fortunately, the SVM algorithm has a so-called
regularization parameter to configure the trade-off and to tolerate outliers.
• Tuning Parameters
• As we saw in the previous section, choosing the right kernel is crucial, because if
the transformation is incorrect, the model can have very poor results. As a
rule of thumb, always check whether you have linear data, and in that case always use
linear SVM (a linear kernel). Linear SVM is a parametric model, but an RBF-kernel
SVM isn't, so the complexity of the latter grows with the size of the training set.
Not only is it more expensive to train an RBF-kernel SVM, but you also have to keep
the kernel matrix around, and the projection into this "infinite" higher-dimensional
space where the data becomes linearly separable is more expensive
as well during prediction. Furthermore, you have more hyperparameters to tune,
so model selection is more expensive as well! And finally, it's much easier to
overfit a complex model!
• Regularization
• The regularization parameter (in Python it's called C) tells the SVM optimization how much
you want to avoid misclassifying each training example.
• If C is higher, the optimization will choose a smaller-margin hyperplane, so the training-data
misclassification rate will be lower.
• On the other hand, if C is low, then the margin will be big, even if there are misclassified
training data examples. This is shown in the following two diagrams:
• Gamma
• The next important parameter is gamma. The gamma parameter defines how far
the influence of a single training example reaches. This means that with a high gamma
only points close to the plausible hyperplane are considered, while with a low gamma
points at greater distances are also considered.
• Decreasing gamma means that finding the correct hyperplane will consider points
at greater distances, so more and more points will be used (green lines indicate
which points were considered when finding the optimal hyperplane).
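• A minimal sketch of how C and gamma interact (assuming scikit-learn; make_moons is just a convenient non-linear toy data set):

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # C trades margin width against misclassification; gamma controls how far
    # the influence of a single training example reaches.
    for C in (0.1, 100):
        for gamma in (0.1, 10):
            clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y)
            print(f"C={C}, gamma={gamma}: train accuracy={clf.score(X, y):.2f}")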
• Margin
• The last parameter is the margin. We've already talked about margin: a higher
margin results in a better model, and thus better classification (or
prediction). The margin should always be maximized.
Disadvantages
1. Training time is high when we have large data sets.
2. When the data set has more noise (i.e. target classes are
overlapping), SVM doesn't perform well.
Applications
1. Text classification
2. Detecting spam
3. Sentiment analysis
4. Aspect-based recognition
5. Handwritten digit recognition
SVM EXAMPLE (CO1)
Apriori Algorithm:
• Algorithms that use association rules include AIS, SETM and Apriori. The
Apriori algorithm is the one most commonly cited by data scientists in research articles
about market basket analysis. It identifies frequent items in the database
and then extends them to larger and larger itemsets, evaluating the frequency
of each as the itemsets grow.
• SUPPORT
• CONFIDENCE
• LIFT
• For example, suppose 5000 transactions have been made through a popular
e-commerce website, and we want to calculate the support, confidence and
lift for two products, say a pen and a notebook: out of the 5000
transactions, 500 are counted for the pen, 700 for the notebook, and
1000 for both.
SUPPORT
• Support is the number of transactions containing the item(s) divided by the total
number of transactions made:
1. Support = freq(A, B)/N
• support(pen) = transactions containing a pen / total transactions
• i.e. support → 500/5000 = 10 percent
CONFIDENCE
• Confidence measures how often item B is bought when item A is bought:
1. Confidence(A → B) = freq(A, B)/freq(A)
LIFT
• Lift compares how often A and B occur together with how often they would if
they were independent:
1. Lift(A → B) = Support(A, B)/(Support(A) × Support(B))
• Lift → 20/10 = 2
• When the lift value is below 1, the combination is not frequently bought by
consumers. But in this case, it shows that the probability of buying both things
together is high when compared to the transactions for the individual items sold.
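• A minimal sketch computing the three measures from the counts above using the standard definitions (note that the slide's shorthand 20/10 = 2 divides the joint support, 20 percent, by the support of pen, 10 percent):

    # Counts from the example: 5000 transactions in total, 500 with a pen,
    # 700 with a notebook, and 1000 counted for both.
    n, n_pen, n_notebook, n_both = 5000, 500, 700, 1000

    support_pen = n_pen / n               # 0.10, i.e. 10 percent
    support_both = n_both / n             # 0.20, i.e. 20 percent
    confidence = n_both / n_pen           # freq(A, B) / freq(A)
    lift = confidence / (n_notebook / n)  # confidence / support(B)

    print(support_pen, support_both, confidence, lift)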
1. Calculate the regression coefficient and obtain the lines of regression for the following data
2. Calculate the two regression equations of X on Y and Y on X from the data given below, taking deviations from the actual
means of X and Y.
• What is entropy?
• What is information gain?
• How are entropy and information gain related vis-à-vis decision trees?
• How do you calculate the entropy of children nodes after a split based on a
feature?
• How do you decide feature suitability when working with a decision tree?
• Explain feature selection using the information gain/entropy technique.
• Which (packaged) algorithm is used for building models based on the decision tree?
• What are some of the techniques to decide on decision tree pruning?
1.Decision Tree
2.Entropy, Information Gain, Gini Impurity
3.Decision Tree Working For Categorical and Numerical Features
4.What are the scenarios where Decision Tree works well
5.Decision Tree Low Bias And High Variance- Overfitting
6.Hyperparameter Techniques
7.Library used for constructing decision tree
8.Impact of Outliers on Decision Tree
9.Impact of missing values on Decision Tree
10.Does Decision Tree require Feature Scaling?
Question 1 :
SVM stands for?
Options :
a. Simple Vector Machine
b. Support Vector Machine
c. Super Vector Machine
d. All the Above
Question 2 :
SVM is classified into how many types?
Options :
a. One
b. Two
c. Three
d. Four
Question 3 :
SVM finds the hyperplane that best segregates the data into how many classes?
Options :
a. One
b. Two
c. Three
d. Four
Question 4 :
SVM is a supervised machine learning algorithm that can be used for
Options :
a. Regression
b. Classification
c. Either a or b
d. None of These
YouTube videos:
• https://www.youtube.com/watch?v=PDYfCkLY_DE
• https://www.youtube.com/watch?v=ncOirIPHTOw
• https://www.youtube.com/watch?v=cW03t3aZkmE
Assignment 1
1. What are Support Vector Machines (SVMs)?
Text books:
Thank you