ML 4
AIML
Logistic regression is a supervised machine learning algorithm
that accomplishes binary classification tasks by predicting the
probability of an outcome, event, or observation. The model delivers a
binary or dichotomous outcome limited to two possible outcomes:
yes/no, 0/1, or true/false.
For example, if the output of the sigmoid function is above 0.5, the prediction is treated as class 1; if it is below 0.5, the prediction is treated as class 0. As the input moves further toward the negative end of the curve, the predicted probability of y approaches 0, and vice versa toward the positive end. In other words, a sigmoid output of 0.65 implies a 65% chance of the event occurring (a coin toss landing heads, for example).
The logistic (sigmoid) function used by the model is:
y = 1 / (1 + e^-(b0 + b1*x))
Here,
x = input value
y = predicted output (probability of the positive class)
b0 = bias or intercept term
b1 = coefficient for input (x)
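To make the decision rule concrete, here is a minimal Python sketch of the sigmoid and the 0.5 threshold described above; the coefficient values b0 and b1 are made-up illustrative numbers, not parameters fitted from any dataset.

import math

def predict_probability(x, b0=-1.0, b1=0.8):
    # Sigmoid of the linear term b0 + b1*x, i.e. an estimate of P(y = 1 | x).
    z = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, threshold=0.5):
    # Apply the 0.5 decision rule discussed above.
    return 1 if predict_probability(x) >= threshold else 0

print(predict_probability(3.0))   # ~0.80, so the event is predicted to occur
print(predict_class(3.0))         # 1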
Basic Components of Perceptron
o Input Nodes or Input Layer:
This is the primary component of the Perceptron; it accepts the initial data into the system for further processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units and is another important component of the Perceptron. Weight is directly proportional to the strength of the associated input neuron in deciding the output. Bias can be considered as the intercept term in a linear equation.
o Activation Function:
This is the final and most important component; it determines whether the neuron will fire or not. The activation function can be considered primarily as a step function. Common choices are:
o Sign function
o Step function, and
o Sigmoid function
The data scientist chooses the activation function based on the problem statement and the desired outputs. The choice of activation function (e.g., Sign, Step, or Sigmoid) may differ between perceptron models depending on whether the learning process is slow or suffers from vanishing or exploding gradients.
How does Perceptron work?
In Machine Learning, Perceptron is considered as a single-layer neural network that
consists of four main parameters named input values (Input nodes), weights and Bias,
net sum, and an activation function. The perceptron model begins with the multiplication
of all input values and their weights, then adds these values together to create the
weighted sum. Then this weighted sum is applied to the activation function 'f' to obtain
the desired output. This activation function is also known as the step function and is
represented by 'f'.
This step function or Activation function plays a vital role in ensuring that output is
mapped between required values (0,1) or (-1,1). It is important to note that the weight of
input is indicative of the strength of a node. Similarly, an input's bias value gives the
ability to shift the activation function curve up or down.
Step-1
In the first step, multiply all input values with their corresponding weight values and then add them to determine the weighted sum. Mathematically, we can calculate the weighted sum as ∑ wi*xi.
Add a special term called the bias 'b' to this weighted sum to improve the model's performance:
∑ wi*xi + b
Step-2
In the second step, an activation function is applied to the above-mentioned weighted sum, which gives us an output either in binary form or as a continuous value, as follows:
Y = f(∑wi*xi + b)
In a single-layer perceptron model, the algorithm starts with no recorded data, so it begins with randomly allocated values for the weight parameters. It then sums up all the weighted inputs. If the total sum exceeds a pre-determined threshold, the model is activated and shows the output value as +1.
The multi-layer perceptron model is also known as the Backpropagation algorithm, which executes in two stages as follows:
o Forward Stage: Activation starts from the input layer in the forward stage and terminates at the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement: the error between the actual and desired output is propagated backward, starting at the output layer and ending at the input layer.
A multi-layer perceptron model has greater processing power and can process linear
and non-linear patterns. Further, it can also implement logic gates such as AND, OR,
XOR, NAND, NOT, XNOR, NOR.
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input 'x' with the learned weight coefficients 'w' and applying a threshold:
f(x) = 1 if w·x + b > 0
f(x) = 0 otherwise
o 'w' represents real-valued weights vector
o 'b' represents the bias
o 'x' represents a vector of input x values.
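Below is a small NumPy sketch of this decision rule; the weight vector, bias, and input shown are assumed values chosen only to illustrate the computation.

import numpy as np

def perceptron_predict(x, w, b):
    # Return 1 if w.x + b > 0, otherwise 0 (hard-limit / step activation).
    weighted_sum = np.dot(w, x) + b
    return 1 if weighted_sum > 0 else 0

w = np.array([0.5, -0.6])    # learned weight vector (assumed)
b = -0.1                     # bias term (assumed)
x = np.array([1.0, 0.2])     # input vector (assumed)
print(perceptron_predict(x, w, b))   # 0.5 - 0.12 - 0.1 = 0.28 > 0, so output is 1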
Characteristics of Perceptron
The perceptron model has the following characteristics.
o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit
transfer function.
o Perceptron can only be used to classify linearly separable sets of input vectors. If the input vectors are not linearly separable, it cannot classify them properly.
Since, to predict a class label y, we are only interested in the arg max over y of P(y | x), the denominator P(x) can be removed from (ii).
Hence, to predict the label y from the example x, generative models evaluate:
y = arg max_y P(x | y) P(y)
The most important part in the above is P(x | y). This is what allows the model to
be generative! P(x | y) means – what x (features) are there given class y. Hence, with
the joint probability distribution function (i), given a y, you can calculate (“generate”) its
corresponding x. For this reason they are called generative models!
Generative learning algorithms make strong assumptions on the data. To explain this
let’s look at a generative learning algorithm called Gaussian Discriminant Analysis
(GDA)
I won’t go into the maths involved, but just note that the multivariate normal distribution in n dimensions, also called the multivariate Gaussian distribution, is parameterized by a mean vector μ ∈ R^n and a covariance matrix Σ ∈ R^(n×n).
A Gaussian distribution is fit for each class. This allows us to find P(y) and P(x | y).
Using these two, we can finally find P(y | x), which is required for prediction.
For a two class dataset, pictorially what the algorithm is doing can be seen as follows:
Shown in the figure are the training set, as well as the contours of the two Gaussian
distributions that have been fit to the data for each of the two classes. Also shown in the
figure is the straight line giving the decision boundary at which p(y = 1|x) = 0.5. On one
side of the boundary, we’ll predict y = 1 to be the most likely outcome, and on the other
side, we’ll predict y = 0.
As we now have the Gaussian distribution (model) for each class, we can also generate
new samples of the classes. The features, x, for these new samples will be taken from
the respective Gaussian distribution (model).
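As a rough sketch of this idea (not the exact GDA derivation), the snippet below fits one Gaussian per class on a tiny made-up 2-D dataset and then scores a new point with P(x | y) * P(y); the data values and the small covariance regularization term are assumptions for illustration.

import numpy as np
from scipy.stats import multivariate_normal

X = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.5], [1.2, 2.1],    # class 0
              [3.0, 3.5], [3.4, 3.1], [2.8, 3.8], [3.1, 3.3]])   # class 1
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

params = {}
for c in (0, 1):
    Xc = X[y == c]
    params[c] = {
        "prior": len(Xc) / len(X),                              # P(y = c)
        "mean": Xc.mean(axis=0),                                # class mean vector
        "cov": np.cov(Xc, rowvar=False) + 1e-6 * np.eye(2),     # class covariance (regularized)
    }

def predict(x):
    # arg max over classes of P(x | y) * P(y); the denominator P(x) is dropped.
    scores = {c: multivariate_normal.pdf(x, p["mean"], p["cov"]) * p["prior"]
              for c, p in params.items()}
    return max(scores, key=scores.get)

print(predict(np.array([1.1, 2.0])))   # expected: class 0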
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence, each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A given the observed evidence B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.
Problem: If the weather is sunny, should the Player play or not?
Solution: To solve this, first consider the below dataset:

     Weather    Play
0    Rainy      Yes
1    Sunny      Yes
2    Overcast   Yes
3    Overcast   Yes
4    Sunny      No
5    Rainy      Yes
6    Sunny      Yes
7    Overcast   Yes
8    Rainy      No
9    Sunny      No
10   Sunny      Yes
11   Rainy      No
12   Overcast   Yes
13   Overcast   Yes
Frequency table for the Weather Conditions:
Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4

Likelihood table of Weather Conditions:
Weather     No             Yes
Overcast    0              5              5/14 = 0.35
Rainy       2              2              4/14 = 0.29
Sunny       2              3              5/14 = 0.35
All         4/14 = 0.29    10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player can play the game on a sunny day.
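A short Python sketch that reproduces this calculation by counting frequencies in the 14-row dataset above and applying Bayes' theorem (exact fractions give 0.60 and 0.40; the 0.41 above comes from rounding the intermediate values).

from collections import Counter

data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

n = len(data)
play_counts = Counter(play for _, play in data)                 # {"Yes": 10, "No": 4}
sunny_counts = Counter(play for w, play in data if w == "Sunny")
p_sunny = sum(1 for w, _ in data if w == "Sunny") / n           # P(Sunny) = 5/14

for label in ("Yes", "No"):
    p_label = play_counts[label] / n                            # P(Yes), P(No)
    p_sunny_given_label = sunny_counts[label] / play_counts[label]
    posterior = p_sunny_given_label * p_label / p_sunny         # Bayes' theorem
    print(label, round(posterior, 2))                           # Yes 0.6, No 0.4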
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
Types of Naïve Bayes Model:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes these values are sampled from a Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data
point in the correct category in the future. This best decision boundary is called a
hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be created by using the SVM algorithm. We first train the model with many images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. Since SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors), it will consider the extreme cases of cats and dogs and, on the basis of the support vectors, classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a two-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance to the nearest data points of either class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect its position are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
The working of the SVM algorithm can be understood by using an example. Suppose
we have a dataset that has two tags (green and blue), and the dataset has two features
x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either
green or blue. Consider the below image:
So as it is 2-d space so by just using a straight line, we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the
below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
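As a minimal, hedged illustration of fitting such a maximum-margin classifier with scikit-learn, the sketch below uses a handful of made-up, linearly separable points; the support vectors it prints are the extreme points discussed above.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])   # toy points (assumed)
y = np.array([0, 0, 0, 1, 1, 1])                                 # two linearly separable tags

clf = SVC(kernel="linear", C=1.0)    # linear maximum-margin classifier
clf.fit(X, y)

print(clf.support_vectors_)          # the extreme points that define the margin
print(clf.predict([[3, 2], [7, 6]])) # -> [0 1]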
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
Since we are in 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:
Hence, we obtain a circle of radius 1 as the decision boundary in the case of non-linear data.
This scalar is, in essence, the dot product of the two input vectors.
However, it's not computed in the original space of these vectors.
Instead, it's as if this dot product is calculated in a much higher-
dimensional space, known as the Z space. This is where the kernel's
true power and elegance come into play. It manages to convey how
close or similar these two vectors are in the Z space without the
computational overhead of actually mapping the vectors to this higher-
dimensional space and calculating their dot product there.
The kernel thus serves as a kind of guardian of the Z space. It allows
you to glean the necessary information about the vectors in this more
complex space without having to access the space directly. This
approach is particularly useful in SVMs, where understanding the
relationship and position of vectors in a higher-dimensional space is
crucial for classification tasks.
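The following small numerical check illustrates the point for one concrete choice of kernel: the polynomial kernel K(a, b) = (a·b)² in 2-D equals an ordinary dot product in an explicit 3-D "Z space" mapping φ(x) = (x1², x2², √2·x1·x2). The vectors are arbitrary examples.

import numpy as np

def poly_kernel(a, b):
    # Kernel evaluated entirely in the original 2-D space.
    return np.dot(a, b) ** 2

def phi(x):
    # Explicit mapping to the higher-dimensional Z space.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

print(poly_kernel(a, b))          # 16.0, computed without leaving the original space
print(np.dot(phi(a), phi(b)))     # also 16.0: the same similarity measured in Z space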
A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection. Each
machine learning process depends on feature engineering, which mainly contains two
processes; which are Feature Selection and Feature Extraction. Although feature
selection and extraction processes may have the same objective, both are completely
different from each other. The main difference between them is that feature selection is
about selecting the subset of the original feature set, whereas feature extraction creates
new features. Feature selection is a way of reducing the input variable for the model by
using only relevant data in order to reduce overfitting in the model.
So, we can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in
model building." Feature selection is performed by either including the important
features or excluding the irrelevant features in the dataset without changing them.
Selecting the best features helps the model to perform well. For example, suppose we want to create a model that automatically decides which car should be crushed for spare parts, and for this we have a dataset. This dataset contains the model of the car, the year, the owner's name, and the miles driven. In this dataset, the name of the owner does not contribute to the model's performance, as it does not determine whether the car should be crushed; so we can remove this column and select the rest of the features (columns) for model building.
1. Wrapper Methods
In the wrapper methodology, feature selection is treated as a search problem in which different combinations of features are made, evaluated, and compared with other combinations. The algorithm is trained iteratively using subsets of features. On the basis of the model's output, features are added or removed, and the model is trained again with the new feature set.
2. Filter Methods
In Filter Method, features are selected on the basis of statistics measures. This method
does not depend on the learning algorithm and chooses the features as a pre-
processing step.
The filter method filters out the irrelevant feature and redundant columns from the model
by using different metrics through ranking.
The advantage of using filter methods is that it needs low computational time and does
not overfit the data.
Some common techniques of Filter methods are as follows:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Fisher's Score:
Fisher's score is one of the popular supervised techniques of feature selection. It ranks the variables by Fisher's criterion in descending order; we can then select the variables with a large Fisher's score.
Missing Value Ratio:
The missing value ratio can be used to evaluate a feature against a threshold value. It is computed as the number of missing values in a column divided by the total number of observations. A variable whose missing value ratio exceeds the threshold can be dropped.
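A hedged sketch of two of these filter checks, the missing value ratio and a univariate chi-square score, on a tiny invented DataFrame; the column names and the 0.3 threshold are assumptions for illustration.

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

df = pd.DataFrame({
    "f1": [1, 2, np.nan, 4, 5, np.nan],
    "f2": [3, 3, 2, 5, 4, 1],
    "target": [0, 0, 0, 1, 1, 1],
})

# Missing value ratio: missing values per column / total observations.
missing_ratio = df.drop(columns="target").isna().mean()
keep = missing_ratio[missing_ratio < 0.3].index      # drop columns above the threshold
print(missing_ratio.to_dict(), list(keep))           # f1 = 0.33 is dropped, f2 is kept

# Univariate filter: chi-square score per (non-negative) feature against the target.
selector = SelectKBest(chi2, k=1).fit(df[list(keep)], df["target"])
print(selector.scores_)                              # ranking score for the kept feature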
3. Embedded Methods
Embedded methods combined the advantages of both filter and wrapper methods by
considering the interaction of features along with low computational cost. These are fast
processing methods similar to the filter method but more accurate than the filter method.
These methods are also iterative, which evaluates each iteration, and optimally finds the
most important features that contribute the most to training in a particular iteration.
Some techniques of embedded methods are:
o Regularization- Regularization adds a penalty term to different parameters of the
machine learning model for avoiding overfitting in the model. This penalty term is added
to the coefficients; hence it shrinks some coefficients to zero. Those features with zero
coefficients can be removed from the dataset. The types of regularization techniques are
L1 Regularization (Lasso Regularization) or Elastic Nets (L1 and L2 regularization).
o Random Forest Importance - Tree-based methods of feature selection provide feature importances that give a way of selecting features. Here, feature importance specifies which feature has more importance in model building or a greater impact on the target variable. Random Forest is such a tree-based method; it is a type of bagging algorithm that aggregates a number of decision trees. It automatically ranks the nodes by their performance, i.e., the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged by their impurity values, which allows pruning of the tree below a specific node; the remaining nodes correspond to a subset of the most important features (a short sketch using random forest importances follows this list).
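A brief sketch of this embedded approach with scikit-learn: it fits a random forest on a synthetic dataset and keeps the top-ranked features by impurity-based importance. The dataset, the number of trees, and the choice of keeping the top 3 features are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_         # impurity-based (Gini) importance per feature

ranked = sorted(enumerate(importances), key=lambda t: t[1], reverse=True)
selected = [idx for idx, _ in ranked[:3]]          # keep the top 3 features
print(ranked)
print("selected feature indices:", selected)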
Below are some univariate statistical measures, which can be used for filter-based
feature selection:
1. Numerical Input, Numerical Output:
Numerical input variables with a numerical output correspond to regression predictive modelling. The common measure for such a case is the correlation coefficient.
2. Numerical Input, Categorical Output:
Numerical input with a categorical output is the case for classification predictive modelling problems. Here, too, correlation-based techniques are used, but with a categorical output:
o ANOVA correlation coefficient (linear).
o Kendall's rank coefficient (nonlinear).
3. Categorical Input, Numerical Output:
This is the case of regression predictive modelling with categorical input. It is a less common regression setting; the same measures as in the previous case can be used, but in reverse.
4. Categorical Input, Categorical Output:
The commonly used technique for such a case is the Chi-Squared test. Information gain can also be used.
Here are the major types under which the models are categorized based on their behavior.

Type                   Description                                        Examples
Classification         Predicts categorical variables for a given         Decision trees, logistic regression,
                       dataset.                                           neural networks
Regression             Predicts continuous numerical values for a         Polynomial regression,
                       given dataset.                                     linear regression
Clustering             Uses unsupervised learning algorithms to           Hierarchical clustering, K-means,
                       group similar data points.                         DBSCAN
Dimension reduction    Reduces the number of features in a dataset.       Principal component analysis, linear
                                                                          discriminant analysis, t-SNE
Generative             Generates new data that is comparable to the       Autoregressive models, generative adversarial
                       training dataset.                                  networks, variational autoencoders
Features to consider
Selecting a suitable model is the most crucial step in machine learning because it influences the
observations and the results obtained. Let's discuss a few important features when selecting a
model.
Complexity
Determine the complexity of the problem to be solved. In some cases, simple models are sufficient to solve the problem, but at other times a complex model is necessary. Hence, the size of the dataset, the complexity of the inputs, and potential connections between them should be kept in mind when selecting the model.
Data availability
Analyze the existing data accessibility and quality. If the dataset is limited, it is preferred to use
simpler models with limited parameters than a complicated model with many parameters to avoid
overfitting. It is essential to consider the missing data, outliers, noise, and models' responses to the
difficulties before selecting the model.
Regularization
Analyze the model's capacity to determine whether it fits well on the fresh and untested data. We
can incorporate penalty terms into the model's objective function and implement approaches such as
L1 or L2 regularization to overcome the overfitting issues. The regularised models potentially
perform better on sparse training data.
Domain Expertise
Consider your expertise and domain knowledge. On the basis of previous knowledge of the data or
particular features of the domain, consider if particular models are appropriate for the task. Models
that are more likely to capture important patterns can be found by using domain expertise to direct
the selection process.
Resource constraints
Take into account any resource limitations you may have, such as constrained memory space,
processing speed, or time. Make sure that the chosen model can be successfully implemented using the resources at hand. Some models require significant resources during training or inference.
Scalability
If you're working with massive datasets or real-time applications, take the model's scalability and
computing efficiency into consideration. Deep neural networks and support vector machines are two
examples of models that could need a lot of time and computing power to train.
Interpretability
Consider whether the model's interpretability is crucial in your particular setting. Some models, like
decision trees or linear regression, offer interpretability by giving precise insights into the correlations
between the input data and the desired outcome. Complex models, such as neural networks, may
perform better but offer less interpretability.
Formulate problem: Precisely define the problem to be catered to, predictions to be made,
and the expected task it should perform.
Choose potential models: Choose models that are appropriate for the requirements. The
chosen models can be simple, like decision trees and linear regression, or complex, like
deep neural networks and random forests.
Do hyperparameter tuning: Find the best combination of hyperparameters for the model, like learning rate and regularisation strength, to achieve optimal performance. This helps to avoid overfitting and underfitting.
Train and evaluate each model: Train each model using a subset of the original dataset,
and measure its performance using the other subset that is not trained to evaluate its
effectiveness.
Compare the performance and accuracy: Compare the performance of the chosen
models based on different metrics, including the F1-score, mean squared error, accuracy,
precision, and recall. Also, consider factors like data handling capabilities, interpretability,
and computational difficulty.
Finalize the best-suited model: Based on the observation and comparison results, select
the model that performs the best. The finalized model can be used on the fresh dataset to
perform the required tasks and make predictions.
Combining classifiers
Combining classifiers (ensemble learning) means training multiple models and aggregating their predictions; techniques such as bagging and boosting have produced improved results across various tasks in machine learning and data analysis.
What is Bagging?
Bagging (Bootstrap Aggregating) is an ensemble technique used to improve the accuracy and stability of machine learning algorithms. It involves the following steps:
1. Bootstrap Sampling: Multiple random subsets of the training data are drawn with replacement.
2. Model Training: A separate base model is trained on each subset, independently and in parallel.
3. Aggregation: The predictions of all base models are combined (averaging for regression or majority voting for classification) to produce the final output.
Key Benefits:
Reduces Variance: By averaging multiple predictions, bagging reduces the variance of the model and helps prevent overfitting.
What is Boosting?
Boosting is another ensemble learning technique that focuses on creating a strong classifier by combining multiple weak learners trained sequentially. It involves the following steps:
1. Sequential Training: Base models are trained one after another, with each new model trying to correct the errors of the previous ones.
2. Weight Adjustment: Each instance in the training set is weighted. Initially, all instances have equal weights. After each model is trained, the weights of misclassified instances are increased so that the next model focuses more on difficult cases.
Key Benefits:
Reduces Bias: By focusing on hard-to-classify instances, boosting reduces the bias of the model and improves predictive accuracy.
Popular boosting algorithms include:
AdaBoost
XGBoost
LightGBM
Both bagging and boosting involve training multiple models on different subsets of the training data and then combining their predictions to make a final prediction. These techniques aim to reduce the variance of the model and improve its overall performance. Additionally, bagging and boosting can be used with various base models, such as decision trees, to create a diverse set of models that capture different aspects of the data.
Differences between Bagging and Boosting
While bagging and boosting share some similarities, their approach and methodology differ.
Bagging trains each base model independently and in parallel, using bootstrap sampling to create multiple subsets of the training data. The final prediction is then made by averaging the predictions of all base models. Bagging focuses on reducing variance and preventing overfitting.
Boosting, in contrast, trains the base models sequentially, with each new model focusing on correcting the errors made by the previous ones. Boosting adjusts the weights of the training instances, focusing on reducing bias and improving predictive accuracy. The final prediction is made by combining the predictions of all models through a weighted voting approach.
Boosting is generally more complex due to its sequential nature and may be more prone to overfitting if the number of iterations is not carefully tuned.
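A hedged scikit-learn sketch of the two approaches on a synthetic dataset; the dataset, split, and number of estimators are illustrative assumptions, and both ensembles use decision trees (the library defaults) as base models.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Bagging: independent trees trained in parallel on bootstrap samples, majority vote.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Boosting: decision stumps trained sequentially, re-weighting misclassified instances.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("bagging accuracy :", bagging.score(X_te, y_te))
print("boosting accuracy:", boosting.score(X_te, y_te))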
Introduction to AdaBoost
AdaBoost is a boosting algorithm that was introduced by Yoav Freund and Robert Schapire in 1996. It is part of a class of ensemble learning techniques that aim to improve the overall performance of machine learning models by combining the outputs of multiple weaker models, known as weak learners or base learners. The fundamental idea behind AdaBoost is to give greater weight to the training instances that are misclassified by the current model, thereby focusing on the samples that are hard to classify.
1. Weight Initialization
At the start, every training instance is assigned an equal weight. These weights determine the importance of each example in the learning process.
2. Model Training
A weak learner is trained on the dataset with the aim of minimizing classification errors. A weak learner is usually a simple model, such as a decision stump (a one-level decision tree) or a small neural network.
3. Weighted Error Calculation
After the weak learner is trained, it is used to make predictions on the training dataset. The weighted error is then calculated by summing up the weights of the misclassified instances. This step emphasizes the importance of the samples that are hard to classify.
4. Model Weight Calculation
The weight of the weak learner is calculated based on its performance in classifying the training data. Models that perform well are assigned higher weights, indicating that they are more reliable.
5. Update Instance Weights
The instance weights are updated to give more weight to the misclassified samples from the previous step. This adjustment focuses the learning process on the instances that the current model struggles with.
6. Repeat
Steps 2 through 5 are repeated for a predefined number of iterations or until a specified performance threshold is met.
8. Classification
To make predictions on new data, AdaBoost uses the final ensemble model. Each weak learner contributes its prediction, weighted by its importance, and the combined result is used to classify the input.
1. Weak Learners
Weak learners are the individual models that make up the ensemble. These are typically models with accuracy only slightly better than random guessing. In the context of AdaBoost, weak learners are trained sequentially, with each new model focusing on the instances that previous models found difficult to classify.
2. Strong Classifier
The strong classifier, also known as the ensemble, is the final model created by combining the predictions of all the weak learners. It holds the collective knowledge of all the models and is capable of making accurate predictions.
3. Weighted Voting
In AdaBoost, every weak learner contributes to the final prediction with a weight based on its performance. This weighted voting scheme ensures that the more accurate models have a greater say in the final decision.
4. Error Rate
The error rate is a measure of how a weak learner performs on the training data. It is used to calculate the weight assigned to each weak learner. Models with lower error rates are given higher weights.
5. Iterations
The number of iterations or rounds in AdaBoost is a hyperparameter that determines how many weak learners are trained. Increasing the number of iterations may result in a more complex ensemble; however, it can also increase the risk of overfitting.
Advantages of AdaBoost
AdaBoost offers several advantages that make it a popular choice in machine learning:
1. Improved Accuracy
2. Versatility
AdaBoost can be used with a variety of base learners, making it a flexible algorithm that can be applied to different types of problems.
3. Feature Selection
It automatically emphasizes the most informative features, reducing the need for extensive feature engineering.
4. Resistance to Overfitting
Disadvantages of AdaBoost
1. Sensitivity to Noisy Data
AdaBoost can be sensitive to noisy data and outliers because it gives greater weight to misclassified instances. Outliers can dominate the learning process and lead to suboptimal results.
2. Computationally Intensive
3. Overfitting
Although AdaBoost is less prone to overfitting than some other algorithms, it may still overfit if the number of iterations is too high.
4. Model Selection
Selecting the proper weak learner and tuning the hyperparameters can be difficult, as the performance of AdaBoost is highly dependent on these choices.
Practical Applications
AdaBoost has found applications in a wide range of domains, including but not limited to:
1. Face Detection
AdaBoost has been used in computer vision for tasks like face detection, where it helps detect faces in images or videos.
2. Speech Recognition
3. Anomaly Detection
4. Natural Language Processing
In NLP, AdaBoost can enhance the performance of sentiment analysis and text classification models.
5. Bioinformatics
AdaBoost has been used for protein classification, gene prediction, and other bioinformatics tasks.
Implementation and Understanding
Step 1 - Creating the First Base Learner
In the first step of the AdaBoost algorithm, we begin by creating the first base learner, which is basically a decision stump; we'll call it f1. In this example, we have three features (f1, f2, and f3) in our dataset, so we create three stumps. The choice of which stump to use as the first base learner is based on evaluating Gini impurity or entropy, just as in decision trees. We select the stump with the lowest Gini impurity or entropy; in this case, let's assume that f1 has the lowest entropy.
Next, we calculate the Total Error (TE), which is the sum of the sample weights of the misclassified records. In this case (five records, each with a sample weight of 1/5), there is only one error, so TE = 1/5.
Performance = ½ * ln((1 - TE) / TE)
In our case, TE is 1/5. Substituting this value into the formula and solving, we find that the Performance of the stump is ½ * ln(4) ≈ 0.693.
The next step involves updating the sample weights. For incorrectly classified records, the formula for updating the weight is:
New weight = Sample Weight * e^(Performance)
Here the Sample Weight is 1/5 and the Performance is 0.693, so the updated weight for the incorrectly classified record is 1/5 * e^0.693 ≈ 0.399.
For correctly classified records, the same formula is used, but with a negative Performance value:
New weight = Sample Weight * e^(-Performance)
In this case, the updated weight for each correctly classified record is 1/5 * e^(-0.693) ≈ 0.100.
Ideally, the sum of all the updated weights should be 1; however, in this example the sum is 0.799.
To normalize the weights, each updated weight is divided by the total sum of the updated weights. For example, if an updated weight is 0.399 and the total sum of the updated weights is 0.799, then the normalized weight is 0.399 / 0.799 ≈ 0.50. This normalization ensures that the sum of the weights becomes approximately 1.
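A tiny numeric check of the arithmetic above, assuming the exponential weight-update rule just described (5 samples with initial weight 1/5 each, one of them misclassified).

import math

total_error = 1 / 5
performance = 0.5 * math.log((1 - total_error) / total_error)
print(round(performance, 3))                 # ~0.693

w = 1 / 5
w_wrong = w * math.exp(performance)          # updated weight of the misclassified record
w_right = w * math.exp(-performance)         # updated weight of each correct record
print(round(w_wrong, 3), round(w_right, 3))  # ~0.4 and ~0.1

total = w_wrong + 4 * w_right                # ~0.8 (the "0.799" above)
print(round(w_wrong / total, 2))             # ~0.50 after normalization
print(round(w_right / total, 3))             # ~0.125 for each correct record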
In this step, we create a new dataset from the previous one, taking the normalized weights into account. The new dataset will contain more instances of the incorrectly classified records than of the correctly classified ones. To create this new dataset, the algorithm divides the normalized weights into cumulative buckets (for instance, 0 to 0.13, 0.13 to 0.63, 0.63 to 0.76, and so on) and selects records at random from these buckets according to their weights.
This process is repeated several times (in this case, 5 iterations) to form the new dataset. Incorrectly classified records are likely to be selected more often, since their weights are higher. The result is a new dataset that is used to train the next decision tree/stump in the AdaBoost algorithm.
The AdaBoost algorithm continues iterating through these steps, sequentially selecting stumps and creating new datasets, with a focus on correctly classifying the data that were previously misclassified. This iterative procedure helps AdaBoost improve the overall performance of its ensemble of weak learners.
Deciding the Output of the Algorithm for Test Data
In our example, if the first two trees (stumps) say the output is 1 and the third one says it is 0, the majority opinion wins, and the final output for the test data will be 1.
This technique of combining the opinions of multiple experts, with extra weight given to the better experts, is what makes AdaBoost a powerful algorithm for classification tasks. It leverages the strengths of each expert to make a more accurate final decision.
Evaluating your machine learning algorithm is an essential part of any project. Your
model may give you satisfying results when evaluated using one metric, say accuracy_score, but may give poor results when evaluated against other metrics, such as logarithmic_loss. Most of the time, we use classification accuracy to measure the performance of our model; however, it is not enough to truly judge the model. In this post, we will cover the different types of evaluation metrics available.
available.
Classification Accuracy
Logarithmic Loss
Confusion Matrix
A confusion matrix is an N x N matrix, where N is the number of classes being predicted; the output can be two or more classes. For the problem in hand, we have N = 2, and hence we get a 2 x 2 matrix. A confusion matrix is a table with 4 different combinations of predicted and actual values.
Here are a few definitions you need to remember for a confusion matrix:
True Positive: You predicted positive, and it’s true.
True Negative: You predicted negative, and it’s true.
False Positive (Type 1 Error): You predicted positive, and it’s false.
False Negative (Type 2 Error): You predicted negative, and it’s false.
Accuracy: the proportion of the total number of predictions that were correct.
Positive Predictive Value or Precision: the proportion of positive cases that were
correctly identified.
Negative Predictive Value: the proportion of negative cases that were correctly
identified.
Sensitivity or Recall: the proportion of actual positive cases which are correctly
identified.
Specificity: the proportion of actual negative cases which are correctly identified.
Rate: It is a measuring factor in a confusion matrix. It has 4 types: TPR, FPR, FNR, and TNR.
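The sketch below computes these quantities from raw confusion-matrix counts; the TP/TN/FP/FN values are illustrative numbers, not the counts from the tables referenced in the text.

# Illustrative confusion-matrix counts (assumed, not from the text's tables).
TP, TN, FP, FN = 50, 38, 6, 6

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)      # Positive Predictive Value
npv         = TN / (TN + FN)      # Negative Predictive Value
recall      = TP / (TP + FN)      # Sensitivity / True Positive Rate
specificity = TN / (TN + FP)      # True Negative Rate

print(accuracy, precision, npv, recall, specificity)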
The accuracy for the problem in hand comes out to be 88%. As you can see from the above two tables, the Positive Predictive Value is high, but the Negative Predictive Value is quite low. The same holds for Sensitivity and Specificity. This is driven by the threshold we have chosen: if we alter the threshold value, the two pairs of starkly different numbers will come closer.
In general, we are concerned with one of the above-defined metrics. For instance, a pharmaceutical company will be more concerned about minimizing wrong positive diagnoses, and hence about high Specificity. On the other hand, an attrition model will be more concerned with Sensitivity. Confusion matrices are generally used only with class-output models.
F1 Score
In the last section, we discussed precision and recall for classification problems and also highlighted the importance of choosing a precision/recall basis for our use case. What if, for a use case, we are trying to get the best precision and recall at the same time? F1-Score is the harmonic mean of precision and recall values for a classification problem:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Now, an obvious question that comes to mind is why we take a harmonic mean and not an arithmetic mean. This is because the harmonic mean punishes extreme values more. Consider a classifier with:
Precision: 0, Recall: 1
Here, if we take the arithmetic mean, we get 0.5. It is clear that the above result comes from a dumb classifier that ignores the input and predicts one of the classes as output. Now, if we were to take the harmonic mean, we would get 0, which is accurate, as this model is useless for all practical purposes.
This seems simple. There are situations, however, for which a data scientist would like to give more importance to either precision or recall. Altering the above expression a bit so that we can include an adjustable parameter β, we get:
Fβ = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)
Fβ measures the effectiveness of a model with respect to a user who attaches β times as much importance to recall as to precision.
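A quick numerical check of the harmonic-mean argument and the Fβ variant above; the precision/recall values used are arbitrary examples.

def f_beta(precision, recall, beta=1.0):
    # F-beta from the formula above; beta = 1 gives the ordinary F1 score.
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.0, 1.0))           # 0.0: the harmonic mean exposes the useless classifier
print((0.0 + 1.0) / 2)            # 0.5: the arithmetic mean hides the problem
print(f_beta(0.8, 0.6))           # F1 ~0.686
print(f_beta(0.8, 0.6, beta=2))   # F2 ~0.632: weights recall more than precision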
Gain and Lift charts are mainly concerned with checking the rank ordering of the predicted probabilities. Here are the steps to build them:
Step 1: Calculate the probability for each observation.
Step 2: Rank these probabilities in decreasing order.
Step 3: Build deciles with each group having almost 10% of the observations.
Step 4: Calculate the response rate at each decile for Good (Responders), Bad (Non-responders), and Total.
This is a very informative table. The cumulative Gain chart is the graph between Cumulative %Right and Cumulative %Population. For the case in hand, here is the graph:
This graph tells you how well your model is segregating responders from non-responders. For example, the first decile has 10% of the population but 14% of the responders. This means we have a 140% lift at the first decile.
What is the maximum lift we could have reached in the first decile? From the first
table of this article, we know that the total number of responders is 3850. Also, the
first decile will contain 543 observations. Hence, the maximum lift at the first decile
could have been 543/3850 ~ 14.1%. Hence, we are quite close to perfection with
this model.
Let’s now plot the lift curve. The lift curve is the plot between total lift and %population. Note that for a random model, this always stays flat at 100%. Here is the plot for the case in hand. Beyond a certain decile, every decile will be skewed towards non-responders. Any model with lift @ decile above 100% until at least the 3rd decile and at most the 7th decile is a good model; otherwise, you might consider oversampling first.
Lift / Gain charts are widely used in campaign targeting problems. They tell us up to which decile we can target customers for a specific campaign. Also, they tell you how much response you can expect from the new target base.
Area Under the ROC Curve (AUC-ROC)
This is again one of the popular evaluation metrics used in the industry. The biggest
advantage of using the ROC curve is that it is independent of the change in the
proportion of responders. This statement will get clearer in the following sections.
Let’s first try to understand what the ROC (Receiver Operating Characteristic) curve is. If we look at the confusion matrix below, we observe that for a probabilistic model, we get a different sensitivity/specificity pair for each threshold we choose. Hence, for each sensitivity, we get a different specificity. The two vary as follows:
The ROC curve is the plot between sensitivity and (1- specificity). (1- specificity) is
also known as the false positive rate, and sensitivity is also known as the True
Positive rate. Following is the ROC curve for the case in hand.
Let’s take an example of threshold = 0.5 (refer to confusion matrix). Here is the
confusion matrix:
As you can see, the sensitivity at this threshold is 99.6%, and the (1-specificity) is ~60%. This coordinate becomes one point on our ROC curve. To bring this curve down to a single number, we find the area under this curve (AUC).
Note that the area of the entire square is 1*1 = 1. Hence, AUC itself is the ratio of the area under the curve to the total area. For the case in hand, the computed AUC ROC places the current model in the excellent band. But this might simply be over-fitting; in such cases, it becomes very important to perform in-time and out-of-time validations.
Points to Remember
1. A model which gives a class as output will be represented as a single point in the ROC plot.
2. Such models cannot be compared with each other as the judgment needs to be
taken on a single metric and not using multiple metrics. For instance, a model with
parameters (0.2,0.8) and a model with parameters (0.8,0.2) can be coming out of
the same model; hence these metrics should not be directly compared.
3. In the case of the probabilistic model, we were fortunate enough to get a single number, which was the AUC-ROC. But we still need to look at the entire curve to make conclusive decisions. It is also possible that one model performs better in some regions while another performs better in other regions.
Lift is dependent on the total response rate of the population. Hence, if the response
rate of the population changes, the same model will give a different lift chart. A
solution to this concern can be a true lift chart (finding the ratio of lift and perfect
model lift at each decile). But such a ratio rarely makes sense for the business.
The ROC curve, on the other hand, is almost independent of the response rate. This is because it has the two axes coming out of columnar calculations of the confusion matrix: the numerator and denominator of both the x and y axes change on a similar scale if the response rate changes.
Log Loss
AUC ROC considers the predicted probabilities for determining our model’s performance. However, there is an issue with AUC ROC: it only takes into account the order of the probabilities, not the model’s capability to predict a higher probability for samples that are more likely to be positive. In that case, we could use the log loss, which is nothing but the negative average of the log of the corrected predicted probabilities for each instance:
Log Loss = -(1/N) * Σ [ yi * log(p(yi)) + (1 - yi) * log(1 - p(yi)) ]
where
p(yi) = predicted probability of the positive class
yi = 1 for the positive class and 0 for the negative class (actual values)
Let us calculate log loss for a few random values to get the gist of the above mathematical function:
It’s apparent from the gentle downward slope towards the right that the Log Loss
approaches 0.
So, the lower the log loss, the better the model. However, there is no absolute measure of a good log loss: it is use-case dependent.
Whereas the AUC is computed with regards to binary classification with a varying
decision threshold, log loss actually takes the “certainty” of classification into
account.
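A minimal sketch of the log loss formula above, checked against scikit-learn's implementation on made-up labels and probabilities.

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted P(y = 1) for each instance

manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(round(manual, 4))                        # ~0.26
print(round(log_loss(y_true, p_pred), 4))      # matches the manual value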
Gini Coefficient
The Gini coefficient is sometimes used in classification problems. It can be derived straight away from the AUC ROC number. Gini is nothing but the ratio between the area between the ROC curve and the diagonal line, and the area of the upper triangle:
Gini = 2*AUC – 1
Gini above 60% is a good model. For the case in hand, we get Gini as 92.7%.
Concordant-Discordant Ratio
This is, again, one of the most important evaluation metrics for any classification prediction problem. To understand this, let’s assume we have 3 students who have some likelihood of passing this year. Here are our predicted probabilities:
B – 0.5
C – 0.3
Now picture this: if we were to form pairs of two from these three students, how many pairs would we have? We would have 3 pairs: AB, BC, and CA. Now, after the year ends, we see that A and C passed this year while B failed. Now, we choose all the pairs where we will find one responder and one non-responder. How many such pairs do we have?
We have two pairs AB and BC. Now for each of the 2 pairs, the concordant pair is
where the probability of the responder was higher than the non-responder. Whereas
discordant pair is where the vice-versa holds true. In case both the probabilities
were equal, we say it’s a tie. Let’s see what happens in our case :
AB – Concordant
BC – Discordant
A concordant ratio of more than 60% is considered to be a good model. This metric is generally not used when deciding how many customers to target, etc. It is primarily used to assess the model’s predictive power. Decisions like how many to target are again taken by KS / Lift charts.
Root Mean Squared Error (RMSE)
RMSE is the most popular evaluation metric used in regression problems. It follows an assumption that errors are unbiased and follow a normal distribution. Here are the key points to consider on RMSE:
1. The power of ‘square root’ empowers this metric to show large deviations.
2. The ‘squared’ nature of this metric helps to deliver more robust results and prevents the positive and negative error values from canceling out. In other words, this metric aptly displays the plausible magnitude of the error term.
3. It avoids the use of absolute error values, which is highly undesirable in mathematical calculations.
4. When we have more samples, reconstructing the error distribution using RMSE is considered to be more reliable.
5. RMSE is highly affected by outlier values. Hence, make sure you’ve removed outliers from your data set prior to using this metric.
6. As compared to mean absolute error, RMSE gives higher weightage to, and punishes, large errors.
RMSE = sqrt( (1/N) * Σ (predicted_i - actual_i)² )
Root Mean Squared Logarithmic Error (RMSLE)
In the case of Root Mean Squared Logarithmic Error, we take the log of the predictions and the actual values. So basically, what changes is the variance that we are measuring. RMSLE is usually used when we don’t want to penalize huge differences between the predicted and the actual values, when both the predicted and true values are themselves huge numbers.
1. If both predicted and actual values are small: RMSE and RMSLE are the same.
2. If either the predicted or the actual value is big: RMSE > RMSLE.
3. If both predicted and actual values are big: RMSE > RMSLE (RMSLE becomes almost negligible).
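A short sketch comparing RMSE and RMSLE on made-up regression values; the large 750-vs-850 pair illustrates how RMSLE shrinks the penalty when both values are big but the relative error is modest.

import numpy as np

actual    = np.array([60.0, 80.0, 90.0, 750.0])
predicted = np.array([67.0, 78.0, 91.0, 850.0])

rmse  = np.sqrt(np.mean((predicted - actual) ** 2))
rmsle = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

print(round(rmse, 3))    # ~50.1, dominated by the single large absolute error
print(round(rmsle, 3))   # ~0.08, since the relative (log-scale) errors are small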
R-Squared/Adjusted R-Squared
We learned that when the RMSE decreases, the model’s performance will improve. But these values alone are not intuitive. In the case of a classification problem, if the model has an accuracy of 0.8, we could gauge how good our model is against a random model, which has an accuracy of 0.5. So the random model can be treated as a benchmark. But when we talk about the RMSE metric, we do not have a benchmark to compare against.
This is where we can use the R-Squared metric. The formula for R-Squared is as follows:
R² = 1 - MSE(model) / MSE(baseline)
MSE(model): Mean Squared Error of the predictions against the actual values
MSE(baseline): Mean Squared Error of the mean prediction against the actual values
In other words, how good is our regression model as compared to a very simple
model that just predicts the mean value of the target from the train set as
predictions?
Adjusted R-Squared
A model performing equal to the baseline would give R-Squared as 0. Better the
model, the higher the r2 value. The best model with all correct predictions would
give R-Squared of 1. However, on adding new features to the model, the R-Squared
value either increases or remains the same. R-Squared does not penalize for
adding features that add no value to the model. So an improved version of the R-
Squared is the adjusted R-Squared. The formula for adjusted R-Squared is given
by:
Adjusted R² = 1 - [ (1 - R²) * (n - 1) / (n - k - 1) ]
where
k: number of features
n: number of samples
As you can see, this metric takes the number of features into account. When we
add more features, the term in the denominator n-(k +1) decreases, so the whole
expression increases.
If R-Squared does not increase, that means the feature added isn’t valuable for our
model. So overall, we subtract a greater value from 1 and adjusted r2, in turn, would
decrease.
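A small numeric sketch of these formulas on toy values; the arrays, the assumed feature count k, and the sample count n are illustrative.

import numpy as np

actual    = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
predicted = np.array([2.8, 5.3, 6.9, 9.4, 10.6, 13.2])

mse_model    = np.mean((actual - predicted) ** 2)
mse_baseline = np.mean((actual - actual.mean()) ** 2)   # always predicting the mean

r2 = 1 - mse_model / mse_baseline
n, k = len(actual), 2                                   # assume the model used 2 features
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(r2, 4), round(adjusted_r2, 4))              # adjusted R² <= R²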
Beyond these 12 evaluation metrics, there is another method to check the model performance. These 7 methods are statistically prominent in data science. But, with the arrival of machine learning, we are now blessed with more robust methods of model selection.
Cross Validation
These days, I don’t get much time to participate in data science competitions. A long time back, I participated in a competition and submitted several models. You will notice that the third entry, which has the worst Public score, turned out to be the best model on the Private ranking. There were more than 20 models above the one I finally chose (which really worked out well). What caused this phenomenon? The dissimilarity between my public and private leaderboard scores was caused by over-fitting.
Over-fitting is nothing but your model becoming so complex that it starts capturing noise as well. This ‘noise’ adds no value to the model, only inaccuracy.
In the following section, I will discuss how you can know whether a solution is an over-fit or not before we actually know the test results.
The idea of validation is simple: leave out a sample on which you do not train the model, and test the model on this sample before finalizing it.
The above diagram shows how to validate the model with an in-time sample. We simply divide the population into 2 samples and build a model on one sample, keeping the other for validation. A downside of this approach is that we lose a good amount of data from training the model. Hence, the model has very high bias. And this won’t give the best estimate for the coefficients. So what’s the next best option?
What if we make a 50:50 split of the training population and train on the first 50
and validate on the rest 50? Then, we train on the other 50 and test on the first 50.
This way, we train the model on the entire population, however, on 50% in one go.
This reduces bias because of sample selection to some extent but gives a smaller
sample to train the model on. This approach is known as 2-fold cross-validation.
K-Fold Cross-Validation
Let’s extrapolate the last example from 2-fold to k-fold cross-validation; take 7-fold cross-validation as an example. Here’s what goes on behind the scenes: we divide the entire population into 7 equal samples, train the model on 6 of them, and validate on the remaining sample (grey box). Then, at the second iteration, we train the model with a different sample held out as validation. In 7 iterations, we have built a model on each combination of samples and held each of them out as validation. This is a way to reduce the selection bias and reduce the variance in prediction power. Once we have all 7 models, we take an average of the error terms to find which of the models is best.
A model is considered stable if the performance metrics at each of the k rounds of modeling are close to each other and the mean of the metric is high. In a Kaggle competition, you might rely more on the cross-validation score than on the Kaggle public score. This way, you will be sure that the public score is not just by chance.
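A hedged sketch of 7-fold cross-validation with scikit-learn on a synthetic dataset; the estimator and data are placeholders, and the spread of the fold scores is what indicates stability.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=7)
print(scores)                           # one accuracy value per fold
print(scores.mean(), scores.std())      # close scores with a high mean indicate a stable model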
Classification Error
The minimum classification error (MCE) is used to assess the ability of a classification model to generalize to new and
unseen data, known as generalizability. A model with a low MCE is able to correctly
classify most test data and has better generalization ability than a model with a high
MCE.
The MCE is determined by comparing the model predictions with the actual labels of the
test data. Classification error is defined as the proportion of examples that are
misclassified. The MCE is achieved when the minimum possible classification error
value is found for the model, implying that the model is as accurate as possible in the
classification task.
MCE is used in binary and multiclass classification problems, and is useful when the
number of training samples is limited or when the classes are unbalanced.
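As a minimal sketch, the classification error on held-out test data is simply the proportion of test examples the model labels incorrectly (1 - accuracy); the label arrays below are invented.

import numpy as np

y_test = np.array([0, 1, 1, 0, 1, 0, 1, 1])   # actual labels (assumed)
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0])   # model predictions (assumed)

classification_error = np.mean(y_pred != y_test)
print(classification_error)                    # 0.25 -> 25% of the test examples are misclassified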