Machine Learning unit 3
UNIT - III
Prepared by-
Srishyla K
Asst. Professor
Dept. of CSE(AI&ML)
Instance-based learning
• Instance-based learning (also known as memory-based learning or lazy learning) involves
memorizing training data in order to make predictions about future data points.
• This approach doesn’t require any prior knowledge or assumptions about the data, which makes
it easy to implement and understand.
• However, it can be computationally expensive since all of the training data needs to be stored in
memory before making a prediction.
• Additionally, this approach doesn’t generalize well to unseen data sets because its predictions are
based on memorized examples rather than a learned model.
• At prediction time, the system uses a similarity measure to compare new cases with the stored
training data. K-nearest neighbors (KNN) is an algorithm that belongs to the instance-based
learning class of algorithms.
• KNN is a non-parametric algorithm because it does not assume any specific form or underlying
structure in the data. Instead, it relies on a measure of similarity between each pair of data
points. Generally speaking, this measure is based on either Euclidean or Manhattan distance.
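As a small illustration (not from the original slides), here is a minimal Python sketch of the two distance measures, assuming each point is given as a plain list of numbers:

import math

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

# Example with two hypothetical 3-dimensional points
print(euclidean_distance([1, 2, 3], [4, 6, 3]))   # 5.0
print(manhattan_distance([1, 2, 3], [4, 6, 3]))   # 7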
Advantages of Instance-Based Learning
1. No need for model creation: Instance-based learning doesn’t require creating a model, which can
be an advantage if you don’t have the expertise to create the model.
2. Can handle small datasets: Instance-based learning can handle small datasets because it doesn’t
require a large dataset to create a model.
3. More flexibility: Instance-based learning can be more flexible than model-based learning because
the machine stores all instances of data and can use this data to make predictions.
Disadvantages of Instance-Based Learning
1. Slower predictions: Instance-based learning is typically slower than model-based learning
because the machine has to compare the new data to all instances of data in order to make a
prediction.
2. Less accurate predictions: Instance-based learning can often make less accurate predictions than
model-based learning because it doesn’t have a mathematical model to generalize from.
3. Limited understanding of data: Instance-based learning doesn’t provide as much insight into the
relationships between input and output variables as model-based learning does.
• In nearest-neighbor learning the target function may be either discrete-valued or real-valued.
• Let us first consider learning discrete-valued target functions of the form f : ℝⁿ → V, where V is
the finite set {v₁, ..., vₛ}.
• The value 𝑓̂(xq) returned by this algorithm as its estimate of f(xq) is just the most common value of f
among the k training examples nearest to xq.
• If k = 1, then the 1-Nearest Neighbor algorithm assigns to 𝑓̂(xq) the value f(xi), where xi is the
training instance nearest to xq. For larger values of k, the algorithm assigns the most common value
among the k nearest training examples.
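A minimal sketch of this majority-vote rule in Python, assuming the training data is a small in-memory list of (point, label) pairs (the function and variable names are illustrative, not from the slides):

from collections import Counter
import math

def knn_classify(training_data, x_query, k):
    # training_data: list of (point, label) pairs; each point is a list of numbers
    # Sort training examples by Euclidean distance to the query point
    by_distance = sorted(training_data, key=lambda pair: math.dist(pair[0], x_query))
    # Collect the labels of the k nearest neighbors and return the most common one
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical 2-D training set with labels 'A' and 'B'
data = [([1, 1], 'A'), ([1, 2], 'A'), ([2, 1], 'A'), ([5, 5], 'B'), ([6, 5], 'B')]
print(knn_classify(data, [1.5, 1.5], k=3))   # 'A'

For a real-valued target, the same sketch would return the mean of the k nearest neighbors' values instead of the most common label.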
Knn contd.
Example for discrete & real-valued target functions
Distance-Weighted Nearest Neighbor Algorithm
One refinement to the k-Nearest Neighbor algorithm is to weight the contribution of each of the
k neighbors according to its distance from the query point xq, giving greater weight to closer
neighbors. For example, when approximating discrete-valued target functions, we might weight the
vote of each neighbor according to the inverse square of its distance from xq.
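A sketch of this distance-weighted variant (again assuming an in-memory list of (point, label) pairs; the names are illustrative):

import math
from collections import defaultdict

def weighted_knn_classify(training_data, x_query, k):
    # Weight each of the k nearest neighbors by 1 / d(x_query, x_i)^2
    by_distance = sorted(training_data, key=lambda pair: math.dist(pair[0], x_query))
    votes = defaultdict(float)
    for point, label in by_distance[:k]:
        d = math.dist(point, x_query)
        if d == 0:
            return label            # exact match: return its label directly
        votes[label] += 1.0 / d ** 2
    # Return the label with the largest total weighted vote
    return max(votes, key=votes.get)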
Curse of Dimensionality in KNN
2. Equal Distances: In high-dimensional spaces, the concept of distance becomes less meaningful.
As the number of dimensions increases, the distance between any two points tends to become
more uniform, or equidistant. This phenomenon occurs because the influence of any single
dimension diminishes as the number of dimensions grows, leading to points being distributed
more uniformly across the space.
3. Degraded Performance: KNN relies on the assumption that nearby points in the feature space are
likely to have similar labels. However, in high-dimensional spaces, this assumption may no longer
hold true due to the increased sparsity and equalization of distances. As a result, KNN may
struggle to accurately classify data points, leading to degraded performance.
4. Increased Computational Complexity: With higher dimensionality, the computational cost of KNN
increases significantly. The algorithm needs to compute distances in a high-dimensional space,
which involves more calculations. This can make the KNN algorithm slower and less efficient,
especially when dealing with large datasets.
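A small experiment (illustrative only) that shows this equalisation of distances, assuming points drawn uniformly at random from the unit hypercube:

import random, math

def distance_contrast(n_points, n_dims):
    # For one random query point, compare the farthest and nearest distances
    points = [[random.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [random.random() for _ in range(n_dims)]
    dists = [math.dist(p, query) for p in points]
    return max(dists) / min(dists)

for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(500, d), 2))
# The ratio shrinks toward 1 as the number of dimensions grows,
# i.e. all points look roughly equidistant from the query.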
BAYES THEOREM
Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior
probability, the probabilities of observing various data given the hypothesis, and the observed
data itself. Bayes theorem is one of the most popular machine learning concepts; it helps to
calculate the probability of one event occurring, under uncertain knowledge, given that another
event has already occurred.
The theorem can be mathematically expressed as:
P(X|Y) = P(Y|X) * P(X) / P(Y)
Here, X and Y are two events with P(Y) ≠ 0, and P(X|Y) is the probability of event X occurring
given that event Y has already occurred.
P(X|Y) is called the posterior. It is the updated probability of the hypothesis after considering the evidence.
P(Y|X) is called the likelihood. It is the probability of the evidence given that the hypothesis is true.
P(X) is called the prior probability, the probability of the hypothesis before considering the evidence.
P(Y) is called the marginal probability. It is the probability of the evidence under all possible
hypotheses.
Hence, Bayes Theorem can be written as:
posterior = likelihood * prior / evidence
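A small worked example with illustrative numbers (not from the slides): suppose a disease affects 1% of a population, so the prior is P(D) = 0.01; a test detects it with probability P(Pos|D) = 0.9 and gives a false positive with probability P(Pos|not D) = 0.05. The evidence is P(Pos) = 0.9 × 0.01 + 0.05 × 0.99 = 0.0585, so the posterior is P(D|Pos) = (0.9 × 0.01) / 0.0585 ≈ 0.15. Even after a positive test, the probability of the disease is only about 15%, because the prior is so low.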
Naive Bayes Classifier Algorithm
The Naive Bayes classifier algorithm is a machine learning technique used for classification tasks. It
is based on Bayes’ theorem and assumes that features are conditionally independent of each other
given the class label. The algorithm calculates the probability of a data point belonging to each class
and assigns it to the class with the highest probability.
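A minimal sketch using scikit-learn's GaussianNB classifier, assuming numeric features; the tiny dataset below is purely illustrative:

from sklearn.naive_bayes import GaussianNB

# Hypothetical training data: 2 numeric features, binary class labels
X_train = [[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 3.9]]
y_train = [0, 0, 1, 1]

model = GaussianNB()
model.fit(X_train, y_train)

# Probability of each class for a new point, and the predicted class
print(model.predict_proba([[1.1, 2.0]]))
print(model.predict([[1.1, 2.0]]))   # expected: class 0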
• https://www.javatpoint.com/machine-learning-naive-bayes-classifier
Logistic Regression
• Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for predicting the categorical dependent variable
using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False,
etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie
between 0 and 1.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is
called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which decides between
the classes 0 and 1. Values above the threshold tend to class 1, and values below the
threshold tend to class 0.
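A minimal sketch of the sigmoid function and the threshold rule described above (the 0.5 threshold is an illustrative default):

import math

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    # Probabilities above the threshold map to class 1, below to class 0
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))         # 0.5
print(predict_class(2))   # 1
print(predict_class(-3))  # 0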
Differences between linear and logistic regression
Feature Selection
A feature is an attribute that has an impact on a problem or is useful for the problem, and choosing the important
features for the model is known as feature selection. "It is a process of automatically or manually selecting the subset
of most appropriate and relevant features to be used in model building”. Role of feature selection:
1. To reduce the dimensionality of feature space.
2. To speed up a learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
4. To improve the comprehensibility of the learning results.
There are mainly two types of Feature Selection techniques, which are:
• Supervised Feature Selection technique: It considers the target variable and can be used for
labelled datasets.
• Unsupervised Feature Selection technique: It ignores the target variable and can be used for
unlabeled datasets.
1. Wrapper Methods
In the wrapper methodology, feature selection is treated as a search problem, in which different
combinations of features are made, evaluated, and compared with other combinations. The algorithm
is trained iteratively using subsets of features.
On the basis of the model's output, features are added or removed, and the model is trained again
with the new feature set.
Wrapper Methods
1. Forward selection
Forward Feature Selection is a feature selection technique that iteratively builds a model by adding
one feature at a time, selecting the feature that maximizes model performance.
It starts with an empty set of features and adds the most predictive feature in each iteration until a
stopping criterion is met.
This method is particularly useful when dealing with a large number of features, as it incrementally
builds the model based on the most informative features.
This process involves assessing new features, evaluating combinations of features, and selecting the
optimal subset of features that best contribute to model accuracy.
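A simplified forward-selection sketch using cross-validated accuracy as the score, assuming scikit-learn and NumPy are available; the logistic-regression estimator and the stopping rule are illustrative choices:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features):
    # X: 2-D NumPy array of features, y: class labels
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Try adding each remaining feature and keep the one that helps most
        scores = {
            f: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [f]], y, cv=5).mean()
            for f in remaining
        }
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:
            break                      # stop once no feature improves the score
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected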
Wrapper Methods
2. Backward elimination
This method is also an iterative approach: we initially start with all features and, after each
iteration, remove the least significant feature. The process stops when no improvement in model
performance is observed after removing a feature.
3. Exhaustive selection – This technique is considered the brute-force approach for the
evaluation of feature subsets. It creates all possible subsets, trains a model on each subset, and
selects the subset whose model performance is best.
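Backward elimination can be sketched with scikit-learn's SequentialFeatureSelector (available in newer versions of the library); the estimator, dataset, and target feature count below are illustrative:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Start from all features and drop the least useful one at each step
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction='backward',
    cv=5,
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the retained features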
3. Fisher’s score: It may be used for continuous features in a classification problem. Fisher's score
is calculated as the ratio of between-class variance to within-class variance. A higher Fisher's score
implies the feature is more discriminative and valuable for classification. Fisher's score is one of the
popular supervised techniques for feature selection. It ranks the variables according to Fisher's
criterion in descending order; we can then select the variables with the largest scores.
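A rough per-feature Fisher score can be computed as the between-class variance divided by the within-class variance; the exact weighting differs between textbooks, so this NumPy sketch is only indicative:

import numpy as np

def fisher_scores(X, y):
    # X: (n_samples, n_features) array, y: class labels
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    # Higher score = feature separates the classes better
    return between / within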
4. Missing Value Ratio: The missing value ratio can be used to evaluate each feature against a
threshold value. It is computed as the number of missing values in a column divided by the total
number of observations. Variables whose missing value ratio exceeds the threshold can be dropped.
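With pandas, the missing value ratio per column and a threshold-based drop might look like this; the file name and the 30% threshold are illustrative choices:

import pandas as pd

df = pd.read_csv("data.csv")   # hypothetical dataset

# Missing values in each column divided by the total number of observations
missing_ratio = df.isnull().sum() / len(df)

threshold = 0.30
keep_columns = missing_ratio[missing_ratio <= threshold].index
df_reduced = df[keep_columns]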
Univariate feature selection
• Univariate feature selection is a method used to select the most important features in a dataset.
The idea behind this method is to evaluate each individual feature’s relationship with the target
variable and select the ones that have the strongest correlation. This process is repeated for each
feature and the best ones are selected based on defined criteria, such as the highest correlation
or statistical significance.
• In univariate feature selection, the focus is on individual features and their contribution to the
target variable, rather than considering the relationships between features. This method is simple
and straightforward, but it does not take into account any interactions or dependencies between
features.
• Univariate feature selection is useful when working with a large number of features and the goal
is to reduce the dimensionality of the data and simplify the modeling process. It is also useful for
feature selection in cases where the relationship between the target variable and individual
features is not complex and can be understood through a simple statistical analysis.
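Univariate selection is available in scikit-learn as SelectKBest, which scores every feature independently against the target; the ANOVA F-test (f_classif) and the iris dataset used here are illustrative choices:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Score each feature on its own against the target and keep the best 2
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)   # univariate score for each original feature
print(X_reduced.shape)    # (150, 2)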
Multivariate feature selection
Multivariate feature selection refers to the process of selecting subsets of
features from multivariate data (data with multiple variables) to improve the
performance of a machine learning model or to gain insights into the underlying
data structure. Unlike univariate feature selection, which considers each feature
individually, multivariate feature selection methods take into account the
relationships between features.
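One example of a multivariate approach is recursive feature elimination (RFE), where a model is trained on all features together and the weakest features are pruned iteratively; the estimator and dataset below are illustrative choices:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Fit a model on the full feature set and repeatedly drop the weakest feature
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # which features were kept
print(rfe.ranking_)   # rank 1 = selected; larger ranks were eliminated earlier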