ML UNIT II
Supervised Learning:
Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means that the input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as a supervisor that
teaches the machine to predict the output correctly. It applies the same concept as a student learning
under the supervision of a teacher.
Supervised learning can be further divided into two types of problems:
1. Regression
2. Classification
Regression:
Regression algorithms are used when there is a relationship between the input variables and the output variable, and the value to be predicted is a continuous, real number such as a salary or a price.
Types of Regression Algorithms:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
Classification:
Classification algorithms are used when the output variable is categorical, which means it takes one of a set of discrete classes such as Yes/No, Male/Female, or True/False.
Types of Classification Algorithms:
o Decision Trees
o Logistic Regression
o Support vector Machines
Regression Algorithms:
Linear regression is one of the easiest and most popular Machine Learning algorithms.
It is a statistical method that is used for predictive analysis.
Linear regression makes predictions for continuous/real or numeric variables such as sales, salary,
age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence it is called linear regression.
Regression analysis is a form of predictive modeling technique which investigates the relationship
between a dependent and independent variable.
Linear Regression in Machine Learning
Linear Regression is a simple machine learning model for regression problems, i.e., when the target
variable is a real value.
Linear regression is a linear model, e.g. a model that assumes a linear relationship between the
input variables (x) and the single output variable (y). More specifically, that y can be calculated
from a linear combination of the input variables (x).
y = β0 + β1x + ε
where β0 is the intercept, β1 is the coefficient of the input variable, and ε is the random error term.
The values for x and y variables are training datasets for Linear Regression model representation.
Linear regression can be further divided into two types of algorithm:
o Simple Linear Regression: a single independent variable is used to predict the output.
o Multiple Linear Regression: more than one independent variable is used to predict the output.
When working with linear regression, our main goal is to find the best-fit line, which means the error
between the predicted values and the actual values should be minimized. The best-fit line has the
least error.
Different values for the weights or line coefficients (β0, β1) give different lines of regression, so we
need to calculate the best values for β0 and β1 to find the best-fit line; to do this we use a cost
function.
Cost function:
As noted above, different values of the coefficients (β0, β1) give different regression lines; the cost
function is used to estimate the coefficient values for the best-fit line.
The cost function optimizes the regression coefficients or weights and measures how well a linear
regression model is performing.
We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as the hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. It can be written
as:
MSE = (1/n) Σ (yi – (β0 + β1xi))²
where n is the number of training examples, yi is the actual value of the i-th example, and (β0 + β1xi) is the predicted value.
Residuals: The distance between an actual value and its predicted value is called a residual. If the
observed points are far from the regression line, the residuals will be high and so the cost function
will be high. If the scatter points are close to the regression line, the residuals will be small and
hence the cost function will also be small.
Estimating Error
We can calculate an error for our predictions, called the Root Mean Squared Error or RMSE.
RMSE = sqrt( sum( (pi – yi)^2 )/n )
p is the predicted value and y is the actual value, i is the index for a specific instance, n is the
number of predictions, because we must calculate the error across all predicted values.
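As a quick illustration, the RMSE calculation above can be written as a few lines of Python (a minimal sketch; the function name rmse is my own choice, not from the notes):

import math

def rmse(predicted, actual):
    # Root Mean Squared Error: square root of the average squared
    # difference between predicted and actual values.
    n = len(predicted)
    total = sum((p - y) ** 2 for p, y in zip(predicted, actual))
    return math.sqrt(total / n)

print(rmse([1.2, 2.0], [1.0, 3.0]))  # sqrt((0.04 + 1.0) / 2) ~ 0.721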
Example:
y = B0 + B1 * x
Data:
x y
1 1
2 3
4 3
3 2
5 5
Calculating mean values of x:
x mean(x) x - mean(x)
1 3 -2
2 3 -1
4 3 1
3 3 0
5 3 2
Calculating mean values of y:
y mean(y) y - mean(y)
1 2.8 -1.8
3 2.8 0.2
3 2.8 0.2
2 2.8 -0.8
5 2.8 2.2
Calculating the products and squares needed for B1:
x - mean(x)   y - mean(y)   (x - mean(x)) * (y - mean(y))   (x - mean(x))²
-2            -1.8           3.6                              4
-1             0.2          -0.2                              1
 1             0.2           0.2                              1
 0            -0.8           0.0                              0
 2             2.2           4.4                              4
Sums:                         8                               10
B1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))²) = 8 / 10 = 0.8
B0 = mean(y) - B1 * mean(x) = 2.8 - 0.8 * 3 = 0.4
y = 0.4 + 0.8 * x
x y predicted y
1 1 1.2
2 3 2
4 3 3.6
3 2 2.8
5 5 4.4
RMSE = sqrt( sum( (pi – yi)^2 )/n )
predicted y   actual y   error (difference)   squared error
1.2           1           0.2                  0.04
2             3          -1                    1
3.6           3           0.6                  0.36
2.8           2           0.8                  0.64
4.4           5          -0.6                  0.36
Sum of squared errors = 2.4
RMSE = SQRT(2.4 / 5) = 0.69282
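The whole worked example can be reproduced with a short Python sketch (variable names b0 and b1 are illustrative; the formulas are exactly the ones used in the hand calculation above):

import math

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

mean_x = sum(x) / len(x)   # 3.0
mean_y = sum(y) / len(y)   # 2.8

# B1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)        # 8 / 10 = 0.8
b0 = mean_y - b1 * mean_x                       # 0.4

predictions = [b0 + b1 * xi for xi in x]        # [1.2, 2.0, 3.6, 2.8, 4.4]
rmse = math.sqrt(sum((p - yi) ** 2 for p, yi in zip(predictions, y)) / len(y))
print(b0, b1, rmse)                             # 0.4 0.8 0.6928...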
Classification:
Classification is a process of finding a function which helps in dividing the dataset into classes
based on different parameters.
In Classification, a computer program is trained on the training dataset and based on that training,
it categorizes the data into different classes.
The task of the classification algorithm is to find the mapping function to map the input(x) to the discrete
output(y).
The Classification algorithm is a Supervised Learning technique that is used to identify the category of new
observations on the basis of training data.
Classification algorithms can be further divided into two main categories:
o Linear Models
   o Logistic Regression
   o Support Vector Machines
o Non-linear Models
   o K-Nearest Neighbours
   o Kernel SVM
   o Naïve Bayes
   o Decision Tree Classification
   o Random Forest Classification
Naïve Bayes
The Naïve Bayes classifier is based on Bayes' theorem, with the "naïve" assumption that the predictors (attributes) are independent of each other:
P(c|x) = ( P(x|c) * P(c) ) / P(x)
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
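As a hedged sketch of how this looks in practice, scikit-learn's GaussianNB implements Naive Bayes for numeric predictors (assuming a normal distribution per class, as noted under the pros below); the tiny toy dataset here is made up for illustration only:

from sklearn.naive_bayes import GaussianNB

# Toy data: two numeric features per example, binary class labels.
X = [[2.0, 1.0], [1.5, 1.8], [1.0, 0.6], [5.0, 8.0], [6.0, 9.0], [7.0, 8.5]]
y = [0, 0, 0, 1, 1, 1]

model = GaussianNB()
model.fit(X, y)

# Posterior probabilities P(c|x) for a new point, and the predicted class.
print(model.predict_proba([[1.2, 1.0]]))  # probabilities for class 0 and class 1
print(model.predict([[1.2, 1.0]]))        # [0]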
Pros:
It is easy and fast to predict the class of a test data set. It also performs well in multi-class
prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression, and you need less training data.
It performs well with categorical input variables compared to numerical variables.
For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Cons:
If a categorical variable has a category (in the test data set) which was not observed in the training
data set, then the model will assign it a 0 (zero) probability and will be unable to make a prediction.
This is often known as the "Zero Frequency" problem. To solve this, we can use a smoothing technique; one
of the simplest smoothing techniques is called Laplace estimation.
On the other hand, naive Bayes is also known to be a bad estimator, so the probability outputs
from predict_proba are not to be taken too seriously.
Another limitation of Naive Bayes is the assumption of independent predictors. In real life,
it is almost impossible that we get a set of predictors which are completely independent.
Applications of Naive Bayes:
Real-time Prediction: Naive Bayes is an eager learning classifier and it is very fast. Thus, it
can be used for making predictions in real time.
Multi-class Prediction: This algorithm is also well known for its multi-class prediction feature.
Here we can predict the probabilities of multiple classes of the target variable.
Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers, mostly
used in text classification (due to better results in multi-class problems and the independence assumption), have a
higher success rate compared to other algorithms. As a result, they are widely used in spam filtering
(identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and
negative customer sentiments).
Recommendation System: A Naive Bayes classifier and collaborative filtering together
build a recommendation system that uses machine learning and data mining techniques to filter
unseen information and predict whether a user would like a given resource or not.
Some machine learning models belong to either the “generative” or “discriminative” model
categories. Yet what is the difference between these two categories of models? What does it mean
for a model to be discriminative or generative?
The short answer is that generative models are those that include the distribution of the data set,
returning a probability for a given example. Generative models are often used to predict what
occurs next in a sequence.
Meanwhile, discriminative models are used for either classification or regression and they return a
prediction based on conditional probability.
Generative Models
Generative models are those that center on the distribution of the classes within the dataset.
The machine learning algorithms typically model the distribution of the data points.
Generative models rely on finding the joint probability: the probability that a given input feature
and a desired output/label occur together.
Generative models are typically employed to estimate probabilities and likelihood, modeling data
points and discriminating between classes based on these probabilities. Because the model learns a
probability distribution for the dataset, it can reference this probability distribution to generate
new data instances.
Generative models often rely on Bayes theorem to find the joint probability, finding p(x,y).
Essentially, generative models model how the data was generated, answering the following question:
“What’s the likelihood that this class or another class generated this data point/instance?”
Examples of generative machine learning models include Linear Discriminant Analysis (LDA),
Hidden Markov models, and Bayesian networks like Naive Bayes.
Discriminative Models
While generative models learn about the distribution of the dataset, discriminative models learn
about the boundary between classes within a dataset.
With discriminative models, the goal is to identify the decision boundary between classes to apply
reliable class labels to data instances. Discriminative models separate the classes in the dataset by
using conditional probability, not making any assumptions about individual data points.
Examples of discriminative models in machine learning include support vector machines, logistic
regression, decision trees, and random forests.
Generative models:
Generative models aim to capture the actual distribution of the classes in the dataset.
Generative models predict the joint probability distribution – p(x,y) – utilizing Bayes
Theorem.
Generative models are computationally expensive compared to discriminative models.
Generative models are useful for unsupervised machine learning tasks.
Generative models are impacted by the presence of outliers more than discriminative
models.
Discriminative models:
Discriminative models model the decision boundary for the dataset classes.
Discriminative models learn the conditional probability – p(y|x).
Discriminative models are computationally cheap compared to generative models.
Discriminative models are useful for supervised machine learning tasks.
Discriminative models have the advantage of being more robust to outliers than generative models.
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve
classification problems. In classification problems, we have dependent variables in a binary or
discrete format, such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or
No, True or False, Spam or Not Spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression
algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function (also called the logistic function) to model
the data. The function can be represented as:
f(x) = 1 / (1 + e^-x)
When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels: values above the threshold level are rounded up
to 1, and values below the threshold level are rounded down to 0.
Example:
This dataset has two input variables (X1 and X2) and one output variable (Y). The input variables
are real-valued random numbers drawn from a Gaussian distribution. The output variable has two
values, making the problem a binary classification problem.
X1 X2 Y
2.7810836 2.550537003 0
1.465489372 2.362125076 0
3.396561688 4.400293529 0
1.38807019 1.850220317 0
3.06407232 3.005305973 0
7.627531214 2.759262235 1
5.332441248 2.088626775 1
6.922596716 1.77106367 1
8.675418651 -0.2420686549 1
7.673756466 3.508563011 1
Logistic Function
Before we dive into logistic regression, let’s take a look at the logistic function, the heart of the
logistic regression technique.
The logistic function is defined as:
transformed = 1 / (1 + e^-x)
Where e is the numerical constant Euler's number and x is an input we plug into the function.
Let’s plug in a series of numbers from -5 to +5 and see how the logistic function transforms them:
X Transformed
-5 0.006692850924
-4 0.01798620996
-3 0.04742587318
-2 0.119202922
-1 0.2689414214
0 0.5
1 0.7310585786
2 0.880797078
3 0.9525741268
4 0.98201379
5 0.9933071491
You can see that all of the inputs have been transformed into the range [0, 1] and that the smallest
negative numbers resulted in values close to zero and the larger positive numbers resulted in
values close to one. You can also see that 0 transformed to 0.5 or the midpoint of the new range.
From this we can see that as long as our mean value is zero, we can plug in positive and negative
values into the function and always get out a consistent transform into the new range.
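The table above can be reproduced with a few lines of Python (a minimal sketch using math.exp for e):

import math

def logistic(x):
    # Logistic (sigmoid) function: maps any real number into the range (0, 1).
    return 1 / (1 + math.exp(-x))

for value in range(-5, 6):
    print(value, logistic(value))
# -5 -> 0.00669..., 0 -> 0.5, 5 -> 0.99330..., matching the table above.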
Applying the model to the training data gives, for each instance, a probability of belonging to class=1. We can convert these probabilities into crisp class values using:
prediction = IF (output < 0.5) Then 0 Else 1
With this simple procedure we can convert all of the outputs to class values:
0, 0, 0, 0, 0, 1, 1, 1, 1, 1
Finally, we can calculate the accuracy for the model on the training dataset:
accuracy = (correct predictions / num predictions made) * 100
accuracy = (10 /10) * 100
accuracy = 100%
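The notes above do not show how the model coefficients were learned; as a hedged sketch, scikit-learn's LogisticRegression can be fitted on the same ten points and is expected to reproduce the 100% training accuracy, since the data is linearly separable:

from sklearn.linear_model import LogisticRegression

X = [[2.7810836, 2.550537003], [1.465489372, 2.362125076],
     [3.396561688, 4.400293529], [1.38807019, 1.850220317],
     [3.06407232, 3.005305973], [7.627531214, 2.759262235],
     [5.332441248, 2.088626775], [6.922596716, 1.77106367],
     [8.675418651, -0.242068655], [7.673756466, 3.508563011]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

predictions = model.predict(X)
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y) * 100
print(list(predictions))  # expected: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(accuracy)           # expected: 100.0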
Decision Tree
Introduction: Decision Trees are a type of Supervised Machine Learning (that is, you explain
what the input is and what the corresponding output is in the training data) where the data is
continuously split according to a certain parameter. The tree can be explained by two entities,
namely decision nodes and leaves. The leaves are the decisions or final outcomes, and the
decision nodes are where the data is split.
An example of a decision tree can be explained using a binary tree. Let's say you want to
predict whether a person is fit given information like their age, eating habits, and physical
activity. The decision nodes here are questions like 'What is the age?', 'Does the person exercise?', and
'Does the person eat a lot of pizza?', and the leaves are outcomes like 'fit' or 'unfit'. In this case it is
a binary classification problem (a yes/no type problem).
There are two main types of Decision Trees:
1. Classification Trees: what we have seen above is an example of a classification tree, where the
outcome was a categorical variable like 'fit' or 'unfit'.
2. Regression Trees: here the decision or outcome variable is continuous, e.g. a number like 123.
Working: Now that we know what a Decision Tree is, we will see how it works internally. There are
many algorithms that construct Decision Trees, but one of the best known is the ID3 Algorithm.
ID3 stands for Iterative Dichotomiser 3. Before discussing the ID3 algorithm, we will go through a
few definitions.
Entropy: Entropy, also called Shannon Entropy, denoted by H(S) for a finite set S, is the measure of
the amount of uncertainty or randomness in the data. It is defined as H(S) = – Σ p(c) log2 p(c),
where the sum runs over the classes c and p(c) is the proportion of examples in S belonging to class c.
A classification tree is an algorithm where the target variable is fixed or categorical. The
algorithm is then used to identify the “class” within which a target variable would most likely fall.
An example of a classification-type problem would be determining who will or will not subscribe
to a digital platform; or who will or will not graduate from high school.
These are examples of simple binary classifications where the categorical dependent variable can
assume only one of two, mutually exclusive values. In other cases, you might have to predict
among a number of different categories. For instance, you may have to predict which type of
smartphone a consumer may decide to purchase.
In such cases, there are multiple values for the categorical dependent variable. Here’s what a
classic classification tree looks like
Classification trees are used when the dataset needs to be split into classes belonging to the
response variable. In many cases, the classes are simply Yes or No: there are just two classes, and
they are mutually exclusive. In some cases, there may be more than two classes, in which case a
variant of the classification tree algorithm is used.
Regression Trees
A regression tree refers to an algorithm where the target variable is continuous and the algorithm is
used to predict its value. As an example of a regression-type problem, you may want to predict the
selling price of a residential house, which is a continuous dependent variable.
This will depend on both continuous factors like square footage as well as categorical factors like
the style of home, area in which the property is located and so on.
Regression trees are used when the response variable is continuous. For instance, if the response
variable is something like the price of a property or the temperature of the day, a regression tree is
used.
In other words, regression trees are used for prediction-type problems while classification trees
are used for classification-type problems.
A Classification and Regression Tree (CART) is a predictive algorithm used in machine learning.
It explains how a target variable’s values can be predicted based on other values.
It is a decision tree where each fork is a split in a predictor variable and each node at the end
has a prediction for the target variable.
The CART algorithm is an important decision tree algorithm that lies at the foundation of
machine learning. Moreover, it is also the basis for other powerful machine learning algorithms
like bagged decision trees, random forest and boosted decision trees.
Summing up
The Classification and regression tree (CART) methodology is one of the oldest and most
fundamental algorithms. It is used to predict outcomes based on certain predictor variables.
They are excellent for data mining tasks because they require very little data pre-processing.
Decision tree models are easy to understand and implement which gives them a strong advantage
when compared to other analytical models.
Gini index
The Gini index is a metric for classification tasks in CART. It is computed from the sum of the
squared probabilities of each class. We can formulate it as illustrated below.
Gini = 1 – Σ (Pi)², where the sum runs from i = 1 to the number of classes
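A minimal Python sketch of this formula (the helper name gini_index is my own):

def gini_index(class_counts):
    # class_counts: number of instances of each class in one group,
    # e.g. [6, 2] for 6 "Yes" and 2 "No" instances.
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1 - sum((count / total) ** 2 for count in class_counts)

print(gini_index([6, 2]))  # 0.375, as in the Wind = Weak calculation below
print(gini_index([3, 3]))  # 0.5,   as in the Wind = Strong calculation below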
Outlook
Outlook is a nominal feature. It can be sunny, overcast or rain. The Gini index is calculated for each
candidate feature in the same way; the calculation for the Wind feature is shown below as an example.
Wind     Yes   No   Number of instances
Weak     6     2    8
Strong   3     3    6
Gini(Wind=Weak) = 1 – (6/8)² – (2/8)² = 1 – 0.5625 – 0.0625 = 0.375
Gini(Wind=Strong) = 1 – (3/6)² – (3/6)² = 1 – 0.25 – 0.25 = 0.5
Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428
Time to decide
We've calculated the Gini index values for each feature. The winner is the outlook feature because
its Gini index is the lowest.
Feature Gini index
Outlook 0.342
Temperature 0.439
Humidity 0.367
Wind 0.428
We’ll put outlook decision at the top of the tree.
You might notice that the sub dataset in the overcast leaf has only yes decisions. This means that the
overcast branch is complete (it is a pure leaf).
We will apply the same principles to the other sub datasets in the following steps.
Focus on the sub dataset for sunny outlook. We need to find the gini index scores for temperature,
humidity and wind features respectively.
Gini of temperature for sunny outlook
Temperature   Yes   No   Number of instances
Hot           0     2    2
Cool          1     0    1
Mild          1     1    2
Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)² – (2/2)² = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)² – (0/1)² = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)² – (1/2)² = 1 – 0.25 – 0.25 = 0.5
Gini(Outlook=Sunny and Temp.) = (2/5) x 0 + (1/5) x 0 + (2/5) x 0.5 = 0.2
Gini of humidity for sunny outlook
Humidity   Yes   No   Number of instances
High       0     3    3
Normal     2     0    2
Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)² – (3/3)² = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)² – (0/2)² = 0
Gini(Outlook=Sunny and Humidity) = (3/5) x 0 + (2/5) x 0 = 0
Gini of wind for sunny outlook
Wind Yes No Number of instances
Weak 1 2 3
Strong 1 1 2
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)² – (2/3)² = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Sunny and Wind) = (3/5) x 0.444 + (2/5) x 0.5 = 0.466
Decision for sunny outlook
We've calculated the Gini index scores for each feature when the outlook is sunny. The winner is humidity
because it has the lowest value.
Feature Gini index
Temperature 0.2
Humidity 0
Wind 0.466
We’ll put humidity check at the extension of sunny outlook.
As seen, the decision is always no for high humidity and sunny outlook, and it is always yes for
normal humidity and sunny outlook. This branch is complete.
Now, we need to focus on rain outlook.
Rain outlook
Gini of temperature for rain outlook
Temperature   Yes   No   Number of instances
Cool          1     1    2
Mild          2     1    3
Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)² – (1/3)² = 0.444
Gini(Outlook=Rain and Temp.) = (2/5) x 0.5 + (3/5) x 0.444 = 0.466
Gini of humidity for rain outlook
Humidity   Yes   No   Number of instances
High       1     1    2
Normal     2     1    3
Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)² – (1/3)² = 0.444
Gini(Outlook=Rain and Humidity) = (2/5) x 0.5 + (3/5) x 0.444 = 0.466
Gini of wind for rain outlook
Wind     Yes   No   Number of instances
Weak     3     0    3
Strong   0     2    2
Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)² – (0/3)² = 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)² – (2/2)² = 0
Gini(Outlook=Rain and Wind) = (3/5) x 0 + (2/5) x 0 = 0
Decision for rain outlook
The winner is the wind feature for the rain outlook because it has the minimum Gini index score
among the features.
Feature Gini index
Temperature 0.466
Humidity 0.466
Wind 0
Put the wind feature for rain outlook branch and monitor the new sub data sets.
As seen, the decision is always yes when the wind is weak, and always no when the wind is strong.
This means that this branch is complete.
Artificial Neural Network
The term "Artificial Neural Network" is derived from Biological neural networks that develop
the structure of a human brain. Similar to the human brain that has neurons interconnected to one
another, artificial neural networks also have neurons that are interconnected to one another in
various layers of the networks. These neurons are known as nodes.
The typical Artificial Neural Network looks something like the given figure.
A neuron takes inputs, does some math with them, and produces one output. Here’s what a 2-
input neuron looks like:
The activation function is used to turn an unbounded input into an output that has a nice,
predictable form. A commonly used activation function is the sigmoid function.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the programmer.
Hidden Layer:
The hidden layer presents in-between input and output layers. It performs all the calculations to
find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally results in
output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the inputs and
includes a bias. This computation is represented in the form of a transfer function.
The weighted total is then passed as an input to an activation function to produce the output.
Activation functions decide whether a node should fire or not; only the nodes that fire make it to
the output layer. There are various activation functions available, which can be applied depending on
the sort of task we are performing.
Assume we have a 2-input neuron that uses the sigmoid activation function and has the following
parameters:
w = [0, 1], b = 4
w = [0, 1] is just a way of writing w1 = 0, w2 = 1 in vector form. Now, let's give the neuron an input
of x = [2, 3]. We'll use the dot product to write things more concisely:
(w · x) + b = (w1*x1 + w2*x2) + b = (0*2 + 1*3) + 4 = 7
y = f((w · x) + b) = f(7) = 1 / (1 + e^-7) ≈ 0.999
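A minimal Python sketch of this forward pass (numpy is used for the dot product; the class name Neuron is illustrative):

import numpy as np

def sigmoid(x):
    # Activation function: squashes the weighted sum into the range (0, 1).
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weighted sum of the inputs plus the bias, passed through the activation.
        total = np.dot(self.weights, inputs) + self.bias
        return sigmoid(total)

neuron = Neuron(np.array([0, 1]), 4)         # w = [0, 1], b = 4
print(neuron.feedforward(np.array([2, 3])))  # sigmoid(7) ~ 0.999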
Feedback ANN:
In this type of ANN, the output returns into the network to achieve the best-evolved results
internally. As per the University of Massachusetts Lowell Centre for Atmospheric Research,
feedback networks feed information back into themselves and are well suited to solving optimization
problems. Internal system error corrections utilize feedback ANNs.
Feed-Forward ANN:
A feed-forward network is a basic neural network comprising an input layer, an output layer,
and at least one layer of neurons. By assessing its output against its input, the strength of the
network can be observed based on the group behavior of the associated neurons, and the
output is decided. The primary advantage of this network is that it learns to evaluate and
recognize input patterns.
Support Vector Machine (SVM):
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, and it
is used for Classification as well as Regression problems. However, it is primarily used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category
in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified
using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN classifier.
Suppose we see a strange cat that also has some features of dogs; if we want a model that can
accurately identify whether it is a cat or a dog, such a model can be created using the SVM
algorithm. We first train the model with lots of images of cats and dogs so that it can learn
their different features, and then we test it with this strange creature. The support vector machine
creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases
(support vectors); on the basis of the support vectors, it will classify the new example as a cat.
Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can be
classified into two classes by using a single straight line, then such data is termed linearly separable
data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means if a
dataset cannot be classified by using a straight line, then such data is termed non-linear data, and
the classifier used is called a Non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of
the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are
called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a
classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the
below image:
As it is a 2-D space, we can easily separate these two classes just by using a straight line. But
there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this margin. The
hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used the two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as: z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we
convert it back to 2-D space with z = 1, it becomes a circle:
hence we get a circumference of radius 1 in the case of non-linear data.
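A small Python sketch of this idea (the circular toy data below is made up; adding the z = x² + y² feature by hand mimics what a kernel SVM does implicitly):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: class 0 lies inside the unit circle, class 1 outside it.
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# Add the third dimension z = x^2 + y^2; the classes become linearly separable.
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3 = np.hstack([X, z])

linear_svm = SVC(kernel="linear").fit(X3, y)
print(linear_svm.score(X3, y))  # close to 1.0 on this toy data

# A kernel SVM learns the same separation without manual feature engineering.
rbf_svm = SVC(kernel="rbf").fit(X, y)
print(rbf_svm.score(X, y))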
Ensemble methods:
Ensemble learning is a machine learning paradigm where multiple models (often called “weak
learners”) are trained to solve the same problem and combined to get better results. The main
hypothesis is that when weak models are correctly combined, we can obtain more accurate and/or
robust models.
Ensemble methods take multiple small models and combine their predictions to obtain more
powerful predictive performance.
i. Boosting
ii. Bagging
1. Bagging:
Bagging is a type of ensemble technique in which a single training algorithm is used on different
subsets of the training data, where the subset sampling is done with replacement (bootstrap). Once
the algorithm is trained on all the subsets, bagging predicts by aggregating all the predictions
made by the algorithm on the different subsets.
For aggregating the outputs of base learners, bagging uses majority voting (most frequent
prediction among all predictions) for classification and averaging (mean of all the predictions) for
regression.
Advantages of Bagging:
1. Bagging methods work so well because of the diversity in the training data, since the sampling is
done by bootstrapping.
2. Also, if the training set is very huge, it can save computational time by training the model on a
relatively smaller data set and still increase the accuracy of the model.
Disadvantages of Bagging:
1. The main disadvantage of bagging is that it improves the accuracy of the model at the expense
of interpretability, i.e., if a single tree were used as the base model, it would have a more
attractive and easily interpretable diagram, but with the use of bagging this interpretability is lost.
2. There is less control over which features are being selected, i.e., there are chances that some
features are never used, so some information in the data may be ignored.
3. Each base model is trained not on all the data but on a fraction of the original dataset. There
might be some data that are never sampled at all; the remaining data which are not sampled are
called out-of-bag instances.
The Random Forest approach is a bagging method where deep trees (Decision Trees), fitted on
bootstrap samples, are combined to produce an output with lower variance.
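A hedged sketch of bagging and Random Forest with scikit-learn (the dataset from make_classification is synthetic and only for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: many base learners (decision trees by default), each fitted on a
# bootstrap sample, combined by majority voting.
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))

# Random Forest: bagging of deep decision trees with extra feature randomness.
forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_train, y_train)
print("Random Forest accuracy:", forest.score(X_test, y_test))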
2. Boosting:
· Boosting models fall inside this family of ensemble methods.
· Boosting, initially named Hypothesis Boosting, consists of the idea of filtering or weighting the
data that is used to train our team of weak learners, so that each new learner gives more weight to, or is
only trained with, observations that have been poorly classified by the previous learners.
· By doing this, the team of models learns to make accurate predictions on all kinds of data, not just
the most common or easy observations. Also, if one of the individual models is very bad at
making predictions on some kind of observation, it does not matter, as the other N-1 models will
most likely make up for it.
Definition: The term 'Boosting' refers to a family of algorithms which converts weak learners to
strong learners. Boosting is an ensemble method for improving the model predictions of any given
learning algorithm. The idea of boosting is to train weak learners sequentially, each trying to
correct its predecessor. The weak learners are sequentially corrected by their predecessors and, in
the process, they are converted into strong learners.
Advantages of Boosting:
1. Provably effective.
Disadvantages of Boosting:
1. A disadvantage of boosting is that it is sensitive to outliers, since every classifier is obliged to fix
the errors of its predecessors. Thus, the method is too dependent on outliers.
2. Another disadvantage is that the method is almost impossible to scale up. This is because every
estimator bases its correctness on the previous predictors, thus making the procedure difficult to
streamline.
AdaBoost (Adaptive Boosting), Gradient Boosting, and XGBoost (Extreme Gradient Boosting) are a few
common examples of boosting techniques.
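A hedged sketch of AdaBoost with scikit-learn (synthetic data again, only to show the API):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# AdaBoost fits weak learners (shallow trees by default) sequentially,
# re-weighting the training examples so that later learners focus on the
# observations that earlier learners misclassified.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)
boosting.fit(X_train, y_train)
print("AdaBoost accuracy:", boosting.score(X_test, y_test))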