
Deccan Education Society’s

WILLINGDON COLLEGE, SANGLI


B.Sc. Computer Science (Entire) Department

B.Sc. Part III

Subject – Machine Learning


B.Sc. Part III Computer Science Entire (Semester I)
Course Title: Machine Learning

3. Machine Learning Modelling


• ML modelling flow; how to treat data in ML
• Types of machine learning, performance measures
• Bias-Variance Trade-Off
• Overfitting & Underfitting, Bootstrap Sampling, Bagging, Aggregation
Types of machine learning
ML algorithms help to solve different business problems, such as regression,
classification, forecasting, clustering, and association.
Based on the methods and way of learning, machine learning is divided
into four main types:
• Supervised Machine Learning
• Unsupervised Machine Learning
• Semi-Supervised Machine Learning
• Reinforcement Learning
Types of machine learning
1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on
supervision. In the supervised learning technique, we train the machines
using a "labelled" dataset, and based on the training, the machine
predicts the output. Here, labelled data means that some of the inputs
are already mapped to outputs. More precisely, we first train the machine
with inputs and their corresponding outputs, and then we ask the machine
to predict the output for the test dataset.
Types of machine learning
1. Supervised Machine Learning
Let’s understand it with the help of an example.
Example: Consider a scenario where you have to build an image
classifier to differentiate between cats and dogs. If you feed a dataset
of labelled dog and cat images to the algorithm, the machine will learn
to distinguish dogs from cats using these labelled images. When we
input a new dog or cat image that it has never seen before, it will use
the learned model to predict whether it is a dog or a cat. This is
how supervised learning works, and this particular task is image
classification.
Types of machine learning
1. Supervised Machine Learning
Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of
problems, which are given below:
a) Classification b) Regression
a) Classification
Classification algorithms are used to solve classification problems, in
which the output variable is categorical, such as "Yes" or "No", "Male" or
"Female", "Red" or "Blue", etc. Classification algorithms predict the
categories present in the dataset. Some real-world examples of
classification problems are spam detection, email filtering, etc.
Some popular classification algorithms are given below:
• Random Forest Algorithm
• Decision Tree Algorithm
• Logistic Regression Algorithm
• Support Vector Machine Algorithm
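A minimal supervised classification sketch with scikit-learn (assuming it is installed; the iris dataset stands in for any labelled dataset, such as the cat/dog images above):

# A minimal supervised classification sketch using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # features and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)  # one of the algorithms listed above
model.fit(X_train, y_train)                # learn from labelled training data
print(model.score(X_test, y_test))         # accuracy on unseen test data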
Types of machine learning
1. Supervised Machine Learning
b) Regression
Regression algorithms are used to solve regression problems, in which
there is a relationship between the input and output variables. They are
used to predict continuous output variables, such as market trends,
weather measurements, etc.
Some popular Regression algorithms are given below:
• Simple Linear Regression Algorithm
• Multivariate Regression Algorithm
• Decision Tree Algorithm
• Lasso Regression
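A minimal simple linear regression sketch with scikit-learn (the noisy synthetic data is invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is roughly 3x + 2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=100)

model = LinearRegression()
model.fit(X, y)                      # fit a line to the training data
print(model.coef_, model.intercept_) # learned slope and intercept
print(model.predict([[5.0]]))        # continuous prediction for a new input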
Types of machine learning
1. Supervised Machine Learning
Advantages and Disadvantages of Supervised Learning
Advantages -
• Since supervised learning works with labelled datasets, we can have
an exact idea about the classes of objects.
• These algorithms are helpful in predicting the output on the basis of
prior experience.
Disadvantages -
• These algorithms are not able to solve complex tasks.
• It may predict the wrong output if the test data is different from the
training data.
• It requires lots of computational time to train the algorithm.
Types of machine learning
2. Unsupervised Machine Learning
Unsupervised learning is different from the supervised learning
technique: as its name suggests, there is no need for supervision. In
unsupervised machine learning, the machine is trained using an
unlabelled dataset, and it predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that
is neither classified nor labelled, and the model acts on that data
without any supervision.
The main aim of an unsupervised learning algorithm is to group or
categorize the unsorted dataset according to similarities, patterns,
and differences. Machines are instructed to find the hidden patterns
in the input dataset.
Types of machine learning
2. Unsupervised Machine Learning

Let's take an example to understand it more precisely. Suppose
there is a basket of fruit images, and we input it into the machine
learning model. The images are totally unknown to the model, and the
task of the machine is to find the patterns and categories of the objects.
The machine will discover patterns and differences on its own, such as
differences in colour and shape, and predict the output when it is
tested with the test dataset.
Types of machine learning
2. Unsupervised Machine Learning
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which
are given below:
1) Clustering 2) Association
1) Clustering
The clustering technique is used when we want to find the inherent
groups from the data. It is a way to group the objects into a cluster such
that the objects with the most similarities remain in one group and have
fewer or no similarities with the objects of other groups. An example of
the clustering algorithm is grouping the customers by their purchasing
behaviour.
Some of the popular clustering algorithms are given below:
• K-Means clustering algorithm
• Mean-shift algorithm
• DBSCAN algorithm
• Principal Component Analysis
• Independent Component Analysis
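A minimal K-Means clustering sketch with scikit-learn (the two blobs of 2-D points are invented for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Synthetic unlabelled data: two loose blobs of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)     # group points by similarity, no labels needed
print(kmeans.cluster_centers_)     # centre of each discovered group
print(labels[:10])                 # cluster assignment for the first few points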
Types of machine learning
2. Unsupervised Machine Learning
Categories of Unsupervised Machine Learning
2) Association
Association rule learning is an unsupervised learning technique that
finds interesting relations among variables within a large dataset.
The main aim of this learning algorithm is to find the dependency of one
data item on another data item and map those variables accordingly so
that it can generate maximum profit. This algorithm is mainly applied
in Market Basket analysis, Web usage mining, continuous production,
etc.
Some popular association rule learning algorithms are:
• Apriori algorithm
• Eclat algorithm
• FP-Growth algorithm
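A minimal hand-rolled sketch of the support and confidence measures that association rule algorithms such as Apriori are built on (the tiny transaction list is invented for illustration):

# Each transaction is one shopping basket (invented market-basket data).
transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "butter"}, {"milk", "butter"}]
n = len(transactions)

def support(itemset):
    # Fraction of baskets that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

# Rule: bread -> milk (how often milk appears when bread does).
antecedent, consequent = {"bread"}, {"milk"}
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)
print(rule_support, confidence)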
Types of machine learning
2. Unsupervised Machine Learning
Advantages and Disadvantages of Unsupervised Learning Algorithm
Advantages -
 These algorithms can be used for more complicated tasks than
supervised ones, because they work on unlabelled datasets.
 Unsupervised algorithms are preferable for various tasks, as obtaining
an unlabelled dataset is easier than obtaining a labelled one.
Disadvantages:
 The output of an unsupervised algorithm can be less accurate, as
the dataset is not labelled and the algorithm is not trained with the exact
output in advance.
 Working with unsupervised learning is more difficult, as it works
with unlabelled data that does not map to known outputs.
Types of machine learning
3. Semi-supervised Machine Learning
Semi-Supervised learning is a type of Machine Learning algorithm
that lies between Supervised and Unsupervised machine learning. It
represents the intermediate ground between Supervised (With Labelled
training data) and Unsupervised learning (with no labelled training data)
algorithms and uses the combination of labelled and unlabelled
datasets during the training period.
The main aim of semi-supervised learning is to effectively use all
the available data, rather than only the labelled data as in supervised
learning. Initially, similar data is clustered using an unsupervised
learning algorithm, and this clustering then helps to label the
unlabelled data.
Types of machine learning
3. Semi-supervised Machine Learning
We can understand these algorithms with an example. Supervised
learning is where a student is under the supervision of an instructor at
home and college. Further, if that student analyses the same concept
on their own, without any help from the instructor, it comes under
unsupervised learning. Under semi-supervised learning, the student
revises the concept on their own after studying it under the guidance
of an instructor at college.
Types of machine learning
3. Semi-supervised Machine Learning
Advantages and disadvantages of Semi-supervised Learning
Advantages:
• The algorithm is simple and easy to understand.
• It is highly efficient.
• It is used to solve drawbacks of Supervised and Unsupervised
Learning algorithms.
Disadvantages:
• Iteration results may not be stable.
• We cannot apply these algorithms to network-level data.
• Accuracy is low.
Types of machine learning
4. Reinforcement Machine Learning
Reinforcement learning works on a feedback-based process, in which
an AI agent (a software component) automatically explores its
surroundings by trial and error: taking actions, learning from experience,
and improving its performance. The agent is rewarded for each good
action and punished for each bad action; hence the goal of a
reinforcement learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data as in supervised
learning; agents learn only from their experiences.
The reinforcement learning process is similar to how a human being
learns; for example, a child learns various things through experience in
day-to-day life. Playing a game is an example of reinforcement learning,
where the game is the environment, the agent's moves at each step define
states, and the goal of the agent is to get a high score. The agent
receives feedback in terms of punishments and rewards.
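A minimal tabular Q-learning sketch for a toy environment (the 5-state corridor and all constants are invented for illustration; this is not a full RL framework):

import random

# Toy corridor: states 0..4, reward 1 only for reaching state 4.
n_states, actions = 5, [-1, +1]          # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != 4:
        # Explore sometimes, otherwise pick the best-known action.
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda a: Q[(s, a)])
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == 4 else 0.0      # reward only at the goal
        # Q-learning update: move Q towards reward + discounted best future value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

# Learned policy: the best action in each non-goal state (should be +1, move right).
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(4)})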
Types of machine learning
4. Reinforcement Machine Learning
Due to its way of working, reinforcement learning is employed in
different fields such as game theory, operations research, information
theory, and multi-agent systems.
Categories of Reinforcement Learning
Reinforcement learning is categorized mainly into two types of
methods/algorithms:
Positive Reinforcement Learning - Positive reinforcement learning
increases the tendency that the required behaviour will occur again by
adding something rewarding. It enhances the strength of the agent's
behaviour and positively impacts it.
Negative Reinforcement Learning - Negative reinforcement learning
works in exactly the opposite way to positive RL. It increases the
tendency that a specific behaviour will occur again by avoiding a
negative condition.
Types of machine learning
4. Reinforcement Machine Learning
Advantages and Disadvantages of Reinforcement Learning
Advantages
• It helps in solving complex real-world problems that are difficult
to solve with general techniques.
• The learning model of RL is similar to how human beings learn;
hence highly accurate results can be obtained.
• It helps in achieving long-term results.
Disadvantages
• RL algorithms are not preferred for simple problems.
• RL algorithms require huge data and computations.
• Too much reinforcement learning can lead to an overload of states
which can weaken the results.
Overfitting
• A statistical model is said to be overfitted when it does not make
accurate predictions on testing data. When a model is trained on too
much data, it starts learning from the noise and inaccurate entries in
the data set.
• Testing on test data then results in high variance: the model does
not categorize the data correctly, because of too many details and
noise.
• Common causes of overfitting are non-parametric and non-linear
methods, because these types of machine learning algorithms have
more freedom in building the model from the dataset and can
therefore build unrealistic models. A solution to avoid overfitting is
to use a linear algorithm if we have linear data, or to use parameters
such as the maximal depth if we are using decision trees.
Overfitting
Reasons for Overfitting
• High variance and low bias.
• The model is too complex.
• The size of the training data.

Techniques to Reduce Overfitting
• Improve the quality of the training data: this reduces overfitting by
focusing on meaningful patterns and mitigating the risk of fitting
noise or irrelevant features.
• Increase the training data: more data can improve the model's ability
to generalize to unseen data and reduce the likelihood of overfitting.
• Reduce model complexity.
• Stop training early: keep an eye on the loss over the training period
and, as soon as the validation loss begins to increase, stop training
(see the sketch below).
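A minimal, framework-agnostic early-stopping sketch (train_step and validation_loss are hypothetical placeholders for whatever training loop and held-out evaluation are in use):

# Stop once validation loss has not improved for `patience` consecutive epochs.
def train_with_early_stopping(train_step, validation_loss,
                              max_epochs=100, patience=5):
    best_loss, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                 # one pass over the training data (placeholder)
        loss = validation_loss()     # loss on held-out validation data (placeholder)
        if loss < best_loss:
            best_loss, bad_epochs = loss, 0
        else:
            bad_epochs += 1          # validation loss is no longer improving
        if bad_epochs >= patience:
            print(f"Early stop at epoch {epoch}, best loss {best_loss:.4f}")
            break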
Underfitting
• A statistical model or a machine learning algorithm is said to
underfit when the model is too simple to capture the complexities of
the data. Underfitting represents the inability of the model to learn the
training data effectively, resulting in poor performance on both the
training and the testing data.

• In simple terms, an underfit model's predictions are inaccurate,
especially when applied to new, unseen examples.

• It mainly happens when we use a very simple model with overly
simplified assumptions. To address the underfitting problem, we need
to use more complex models, with enhanced feature representation
and less regularization.

Note that an underfitting model has high bias and low variance.
Underfitting
Reasons for Underfitting
• The model is too simple, so it may not be capable of representing
the complexities in the data.
• The input features used to train the model are not adequate
representations of the underlying factors influencing the target
variable.
• The size of the training dataset used is not enough.
• Excessive regularization is used to prevent overfitting, which
constrains the model from capturing the data well.
• Features are not scaled.

Techniques to Reduce Underfitting
• Increase model complexity.
• Increase the number of features.
• Remove noise from the data.
Bias-Variance Trade-Off

• Bias is the difference between the values predicted by the machine
learning model and the correct values. High bias gives a large error
on both training and testing data. It is recommended that an
algorithm always be low-biased, to avoid the problem of underfitting.

• The variability of model predictions for a given data point, which
tells us the spread of our predictions, is called the variance of the
model. A model with high variance fits the training data in a very
complex way and is thus unable to fit accurately on data it hasn't
seen before. As a result, such models perform very well on training
data but have high error rates on test data. When a model has high
variance, it is said to overfit the data.
Bias-Variance Trade-Off

While building a machine learning model, it is really important to take
care of bias and variance in order to avoid overfitting and underfitting.
If the model is very simple, with fewer parameters, it may have low
variance and high bias. Whereas, if the model has a large number of
parameters, it will have high variance and low bias. So, it is necessary
to strike a balance between the bias and variance errors, and this
balance is known as the bias-variance trade-off.

For accurate predictions, an algorithm needs both low variance and low
bias. But this is not fully achievable, because bias and variance are
related to each other:
• If we decrease the variance, it will increase the bias.
• If we decrease the bias, it will increase the variance.
Bias-Variance Trade-Off

The bias-variance trade-off is a central issue in supervised learning.
Ideally, we want a model that accurately captures the regularities in the
training data and simultaneously generalizes well to unseen data.
Unfortunately, doing both at once is not possible: a high-variance
algorithm may perform well on training data but may overfit to noisy
data, whereas a high-bias algorithm generates a much simpler model that
may not capture even the important regularities in the data. So, we need
to find a sweet spot between bias and variance to build an optimal model.
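A minimal sketch of the trade-off using polynomial fits of different degrees (NumPy only; the noisy sine data and the chosen degrees are invented for illustration):

import numpy as np

# Noisy samples of a sine curve; a train/test split exposes the trade-off.
rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 60)
y = np.sin(2 * x) + rng.normal(0, 0.2, 60)
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

for degree in (1, 4, 15):                     # underfit, balanced, overfit
    coeffs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

Degree 1 has high bias (large error everywhere), while degree 15 has high variance (tiny training error, larger test error); a middle degree sits near the sweet spot.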
Bootstrapping
• In statistics and machine learning, bootstrapping is a resampling
technique that involves repeatedly drawing samples from our source
data with replacement, often to estimate a population parameter. By
“with replacement”, we mean that the same data point may be
included in our resampled dataset multiple times. It is used to
determine various parameters of a population.
• A bootstrap plot is a graphical representation of the distribution of a
statistic calculated from a sample of data. It is often used to visualize
the variability and uncertainty of a statistic, such as the mean or
standard deviation, by showing the distribution of the statistic over
many bootstrapped samples of the data.
• The bootstrap plot is a powerful tool for understanding the
uncertainty in a statistic, especially when the underlying distribution
of the data is unknown or complex. It can also be used to generate
confidence intervals for a statistic and to compare the distributions of
different statistics.
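A minimal bootstrap sketch with NumPy, estimating a 95% confidence interval for the mean of a small invented sample:

import numpy as np

rng = np.random.default_rng(0)
data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4])  # invented sample

# Resample with replacement many times and record the statistic each time.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])

print("mean estimate:", data.mean())
print("95% CI:", np.percentile(boot_means, [2.5, 97.5]))   # percentile interval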
Bootstrapping

Advantages of bootstrap:
• It is a non-parametric method, which means it does not require
any assumptions about the underlying distribution of the data.
• It can be used to estimate standard errors and confidence intervals
for a wide range of statistics.
• It can be used to estimate the uncertainty of a statistic even when
the sample size is small.
• It can be used to perform hypothesis tests and compare the
distributions of different statistics.
• It is widely used in many fields such as statistics, finance, and
machine learning.
Bootstrapping

Disadvantages of bootstrap:
• It can be computationally intensive, especially when working with
large datasets.
• It may not be appropriate for all types of data, such as highly
skewed or heavy-tailed distributions.
• It may not be appropriate for estimating the uncertainty of
statistics that have very large variances.
• It may not be appropriate for estimating the uncertainty of
statistics that are not smooth or have very different variances.
• It may not always be a good substitute for other statistical
methods when large sample sizes are available.
Bagging

• Bagging, also known as bootstrap aggregation, is an ensemble
learning method that is commonly used to reduce variance within a
noisy data set. In bagging, random samples of the training set are
selected with replacement, meaning that individual data points can be
chosen more than once.

• Bootstrap aggregating is a machine learning ensemble meta-algorithm
designed to improve the stability and accuracy of machine learning
algorithms used in statistical classification and regression. It
decreases the variance and helps to avoid overfitting. It is usually
applied to decision tree methods.
Bagging
Implementation Steps of Bagging
Step 1: Multiple subsets are created from the original data set with an
equal number of tuples, selecting observations with replacement.
Step 2: A base model is created on each of these subsets.
Step 3: Each model is learned in parallel on its own training set,
independently of the others.
Step 4: The final prediction is determined by combining the predictions
from all the models.
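A minimal bagging sketch with scikit-learn, using decision trees as the base model (the breast-cancer dataset is a toy stand-in; the estimator keyword assumes a recent scikit-learn, older versions call it base_estimator):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 trees, each trained on a bootstrap sample; predictions are combined by vote.
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))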

Aggregation

In its simplest form, data aggregation is the process of compiling
typically large amounts of information from a given database and
organizing it into a more consumable and comprehensible medium.
Data aggregation can be applied at any scale, from pivot tables to data
lakes, in order to summarize information and draw conclusions from
data-rich findings.
Because of the growing accessibility of information and the importance
of personalization metrics across the enterprise, the application of data
aggregation has become extremely relevant.
Aggregation

In our technologically advanced world, data is constantly evolving,
expanding, and becoming more convoluted with each actioned input
and output. Data is one of the most valuable currencies of our time, but
data without organization, segmentation, and understanding is
essentially useless.
What makes data valuable is the extraction of insights that point to
key trends and results and give a better understanding of the information
at hand. As a process in which data is searched, gathered, and presented
in a summarized, report-based form, data aggregation helps organizations
achieve specific business objectives or conduct process and human
analysis at almost any scale.
Performance Measures in Machine Learning

Evaluating the performance of a machine learning model is one of the
important steps in building an effective ML model. To evaluate the
performance or quality of a model, different metrics are used, and these
metrics are known as performance metrics or evaluation metrics. These
performance metrics help us understand how well our model has performed
on the given data, and using them we can improve the model's performance
by tuning the hyper-parameters. Every ML model aims to generalize well on
unseen/new data, and performance metrics help determine how well the
model generalizes to a new dataset.
To evaluate the performance of a classification model, different metrics
are used, some of which are as follows:
• Accuracy
• Confusion Matrix
• Precision
• Recall
• F-Score
• AUC-ROC (Area Under the ROC Curve)
Performance Measures in Machine Learning

I. Accuracy
The accuracy metric is one of the simplest classification metrics to
implement. It is defined as the number of correct predictions divided by
the total number of predictions.

It can be formulated as:

Accuracy = (Number of correct predictions) / (Total number of predictions)

To implement an accuracy metric, we can compare the ground-truth and
predicted values in a loop (see the sketch below).
It is good to use the accuracy metric when the target variable classes
in the data are approximately balanced. For example, if 60% of the images
in a fruit dataset are of apples and 40% are of mangoes, the classes are
reasonably balanced, and a high accuracy score (say, 97%) is a meaningful
summary of how well the model distinguishes apples from mangoes.
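A minimal sketch of computing accuracy by comparing ground truth and predictions in a loop (the two label lists are invented for illustration):

# Invented ground-truth labels and model predictions.
y_true = ["apple", "mango", "apple", "apple", "mango", "apple"]
y_pred = ["apple", "apple", "apple", "apple", "mango", "mango"]

correct = 0
for truth, pred in zip(y_true, y_pred):
    if truth == pred:
        correct += 1                 # count correct predictions

accuracy = correct / len(y_true)     # correct / total
print(f"Accuracy: {accuracy:.2f}")   # 4 of 6 correct -> 0.67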
Performance Measures in Machine Learning

II. Confusion Matrix

A confusion matrix is a tabular representation of the prediction
outcomes of a binary classifier, used to describe the performance of the
classification model on a set of test data for which the true values are
known. The confusion matrix is simple to implement, but the terminology
used in it might be confusing for beginners.
A typical confusion matrix for a binary classifier has the layout below
(it can be extended to classifiers with more than two classes):

                     Predicted Positive    Predicted Negative
Actual Positive      True Positive (TP)    False Negative (FN)
Actual Negative      False Positive (FP)   True Negative (TN)
Performance Measures in Machine Learning

II. Confusion Matrix

In general, the table is divided into four terms, which are as follows:
True Positive (TP) - the model predicted positive, and the actual value
is also positive.
True Negative (TN) - the model predicted negative, and the actual value
is also negative.
False Positive (FP) - the model predicted positive, but the actual value
is negative.
False Negative (FN) - the model predicted negative, but the actual value
is positive.
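A minimal confusion matrix sketch with scikit-learn (the label arrays are invented for illustration):

from sklearn.metrics import confusion_matrix

# Invented binary labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))   # [[3 1] [1 3]] for these labels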
Performance Measures in Machine Learning

III. Precision

The precision metric is used to overcome a limitation of accuracy.
Precision determines the proportion of positive predictions that were
actually correct. It is calculated as the number of true positives
divided by the total number of positive predictions (true positives plus
false positives):

Precision = TP / (TP + FP)
Performance Measures in Machine Learning

IV. Recall or Sensitivity

Recall is similar to the precision metric; however, it aims to calculate
the proportion of actual positives that were identified correctly. It is
calculated as the number of true positives divided by the total number of
actual positives, whether correctly predicted as positive or incorrectly
predicted as negative (true positives plus false negatives).
The formula for calculating recall is:

Recall = TP / (TP + FN)

In simple words, if we maximize precision, it will minimize the FP errors,
and if we maximize recall, it will minimize the FN errors.
Performance Measures in Machine Learning

V. F-Score

The F-score, or F1 score, is a metric to evaluate a binary classification
model on the basis of the predictions made for the positive class. It is
calculated from precision and recall and provides a single score that
represents both. The F1 score is the harmonic mean of precision and
recall, assigning equal weight to each of them.
The formula for calculating the F1 score is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
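A minimal sketch computing precision, recall, and F1 by hand from the confusion-matrix counts above, checked against scikit-learn (same invented labels as before):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# From the confusion matrix above: TP = 3, FP = 1, FN = 1.
tp, fp, fn = 3, 1, 1
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)
print(precision_score(y_true, y_pred),              # matches the hand computation
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))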
Performance Measures in Machine Learning

VI. AUC-ROC
Sometimes we need to visualize the performance of a classification
model on a chart; for this, we can use the AUC-ROC curve. It is one of
the most popular and important metrics for evaluating the performance of
a classification model.
Firstly, let's understand the ROC (Receiver Operating Characteristic)
curve. ROC is a graph showing the performance of a classification model
at different threshold levels. The curve is plotted between two
parameters:
• True Positive Rate (TPR)
• False Positive Rate (FPR)
TPR, or True Positive Rate, is a synonym for recall, and hence can be
calculated as:

TPR = TP / (TP + FN)

Similarly, the False Positive Rate is FPR = FP / (FP + TN). AUC, the Area
Under the ROC Curve, summarizes the whole curve as a single number.
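A minimal AUC-ROC sketch with scikit-learn (the true labels and predicted positive-class probabilities are invented for illustration):

from sklearn.metrics import roc_auc_score, roc_curve

# Invented true labels and predicted positive-class probabilities.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))      # area under that curve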
Performance Metrics for Regression
Regression is a supervised learning technique that aims to find the
relationships between the dependent and independent variables. A
predictive regression model predicts a numeric, continuous value. The
metrics used for regression are different from the classification
metrics: we cannot use the accuracy metric (explained above) to evaluate
a regression model. Instead, the performance of a regression model is
reported as errors in the prediction. The following popular metrics are
used to evaluate the performance of regression models:
• Mean Absolute Error
• Mean Squared Error
• R2 Score
• Adjusted R2
Performance Metrics for Regression
I. Mean Absolute Error (MAE)
Mean Absolute Error, or MAE, is one of the simplest metrics. It measures
the absolute difference between actual and predicted values, where
absolute means taking each difference as a positive number.
To understand MAE, let's take the example of linear regression, where
the model draws a best-fit line between the dependent and independent
variables. To measure the error in prediction, we calculate the
difference between the actual and predicted values. To obtain the error
for the complete dataset, we take the mean of the absolute differences
over the whole dataset.
The formula used to calculate MAE is:

MAE = (1/N) × Σ |Y − Y'|
Performance Metrics for Regression
I. Mean Absolute Error (MAE)

Here, Y is the actual outcome, Y' is the predicted outcome, and N is the
total number of data points.
MAE is much more robust to outliers. One limitation of MAE is that it is
not differentiable at zero, which complicates gradient-based optimizers
such as gradient descent. To overcome this limitation, another metric can
be used: Mean Squared Error, or MSE.
Performance Metrics for Regression
II. Mean Squared Error
Mean Squared Error, or MSE, is one of the most suitable metrics for
regression evaluation. It measures the average of the squared differences
between the values predicted by the model and the actual values. Since
the errors in MSE are squared, it only takes non-negative values, and it
is usually positive and non-zero.
Moreover, because the differences are squared, it penalizes large errors
heavily, and hence it can over-estimate how bad the model is.
MSE is a much-preferred metric compared to other regression metrics, as
it is differentiable and hence easier to optimize.
The formula for calculating MSE is:

MSE = (1/N) × Σ (Y − Y')²

Here, Y is the actual outcome, Y' is the predicted outcome, and N is the
total number of data points.
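A minimal sketch computing MAE and MSE with NumPy (the actual and predicted values are invented for illustration):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # invented actual outcomes
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # invented model predictions

mae = np.mean(np.abs(y_true - y_pred))    # mean of absolute differences
mse = np.mean((y_true - y_pred) ** 2)     # mean of squared differences
print(f"MAE = {mae:.3f}, MSE = {mse:.3f}")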
Performance Metrics for Regression
III. R Squared Error
R-squared error is also known as the Coefficient of Determination, and
it is another popular metric for regression model evaluation. The
R-squared metric enables us to compare our model with a constant
baseline and thereby judge the performance of the model. To create the
constant baseline, we take the mean of the target values and draw a
horizontal line at that mean.
The R-squared score is always less than or equal to 1, regardless of
whether the values are large or small.
Performance Metrics for Regression
IV. Adjusted R Squared
Adjusted R-squared, as the name suggests, is an improved version of
R-squared error. R-squared has the limitation that its score improves as
more terms are added, even if the model is not actually improving, which
may mislead data scientists.
To overcome this issue, adjusted R-squared is used, which is always lower
than R². This is because it adjusts for the number of predictors and only
shows an improvement when there is a real improvement.
We can calculate the adjusted R-squared as follows:

Adjusted R² = 1 − [(1 − R²)(N − 1) / (N − k − 1)]

where N is the number of data points and k is the number of independent
variables (predictors).
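A minimal sketch computing R² and adjusted R² (scikit-learn for R²; the adjusted value follows the formula above; the data and the assumed predictor count are invented for illustration):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.2, 6.1])  # invented actual values
y_pred = np.array([2.8, 5.3, 2.9, 6.5, 4.0, 5.8])  # invented predictions

n, k = len(y_true), 2          # sample size and (assumed) number of predictors
r2 = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)       # always <= R²
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")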
