Module-3
Nearest-Neighbor Learning
Definition:
o k-Nearest Neighbors (k-NN) is a supervised, similarity-based learning algorithm that predicts the output for a test instance from the 'K' training instances closest to it in the feature space.
Working:
o Classification:
The algorithm determines the class of a test instance by considering
the ‘K’ nearest neighbors and selecting the class with the majority vote.
o Regression:
The output is the mean of the target variable values of the ‘K’ nearest
neighbors.
Assumption:
o k-NN relies on the assumption that similar objects are closer to each other in the
feature space.
Instance-Based Learning:
o Memory-Based: The algorithm does not build a prediction model ahead of time; it stores the training data and uses it directly when a test instance must be classified.
o Lazy Learning: No model is constructed during training; the learning process
happens only during testing when predictions are required.
Distance Metric:
o The most common distance metric used is Euclidean distance to measure the
closeness of training data instances to the test instance.
Choosing ‘K’:
o The value of ‘K’ determines how many neighbors should be considered for the
prediction. It is typically selected by experimenting with different values of K to find
the optimal one that produces the most accurate predictions.
Classification Process:
o For a discrete target variable (classification): The class of the test instance is
determined by the majority vote of the 'K' nearest neighbors.
o For a continuous target variable (regression): The output is the mean of the output
variable values of the ‘K’ nearest neighbors.
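The whole procedure can be summarized in a short sketch. The following is a minimal Python illustration (the function name and toy data are our own, not from the text), assuming numeric features and Euclidean distance:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, mode="classification"):
    # Euclidean distance from the test instance to every training instance
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Indices of the 'K' closest training instances
    nearest = np.argsort(distances)[:k]
    if mode == "classification":
        # Majority vote among the K nearest labels
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Regression: mean of the K nearest target values
    return y_train[nearest].mean()

# Toy example: two clusters, the query sits near class 0
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 0.9]), k=3))  # -> 0
```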
Advantages:
o Simple to understand and implement.
o No training phase: new data can be added at any time without rebuilding a model.
o Works for both classification and regression.
Disadvantages:
o Prediction is slow on large datasets, since distances to all stored instances must be computed.
o Sensitive to irrelevant features and to differences in attribute scales.
o Accuracy depends heavily on the choice of 'K' and the distance metric.
Weighted k-Nearest-Neighbor Algorithm
Overview:
o Weighted k-NN is a refinement of k-NN in which each of the 'K' nearest neighbors contributes to the prediction in proportion to its closeness to the test instance.
Motivation:
o Traditional k-NN assigns equal importance to all the ‘k’ nearest neighbors, which can
lead to poor performance when:
Neighbors are at varying distances.
The nearest instances are more relevant than the farther ones.
Working Principle:
o Closer neighbors get higher weights, while farther neighbors get lower weights.
o The final prediction is based on the weighted majority vote (classification) or the
weighted average (regression) of the k nearest neighbors.
Weight Assignment:
o Uniform Weighting: All neighbors are given the same weight (as in standard k-NN).
o Distance-Based Weighting: Weights are computed based on the inverse distance,
giving closer neighbors more influence.
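A minimal sketch of distance-based weighting in Python (the function name and the small epsilon constant are our own choices, assuming inverse-distance weights):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_test, k=3):
    # Distances from the test instance to every training instance
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Inverse-distance weights; a small epsilon avoids division by zero
    # when a neighbor coincides exactly with the test instance
    weights = 1.0 / (distances[nearest] + 1e-8)
    # Weighted majority vote: accumulate weight per class, pick the heaviest
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# For regression, replace the vote with the weighted mean:
#   np.average(y_train[nearest], weights=weights)
```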
Advantages:
o Reduces the influence of distant, less relevant neighbors, often improving accuracy over standard k-NN.
o Less sensitive to the exact choice of 'K', since far-away neighbors receive little weight.
Applications:
o Classification: Predict the class of the test instance by weighted voting of the k
nearest neighbors.
o Regression: Predict the output value by computing the weighted mean of the k
nearest neighbors.
Limitations:
o Still stores all training data and computes distances to every instance at prediction time.
o The weighting function must be chosen carefully; inverse-distance weights grow without bound when a neighbor coincides with the test instance.
Nearest Centroid Classifier
The Nearest Centroid Classifier (also known as the Mean Difference Classifier) is a simple alternative to k-Nearest Neighbors (k-NN) for similarity-based classification.
The idea of this classifier is to assign a test instance to the class whose centroid/mean is closest to that instance.
Algorithm
1. Compute the mean/centroid of each class from its training instances.
2. Compute the distance between the test instance and the mean/centroid of each class (Euclidean distance).
3. Predict the class whose centroid has the smallest distance.
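The three steps translate directly into a short sketch (assuming numeric features and Euclidean distance; the function name is illustrative):

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x_test):
    # Step 1: mean/centroid of each class
    classes = np.unique(y_train)
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in classes}
    # Steps 2-3: Euclidean distance to every centroid; pick the closest class
    return min(classes, key=lambda c: np.linalg.norm(x_test - centroids[c]))
```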
Locally Weighted Regression
Using the nearest-neighbors algorithm, we find the instances that are closest to a test instance and fit a linear function to those 'K' nearest instances in the local regression model.
The key idea is to approximate a linear function from the 'K' neighbors of each query point so as to minimize the error; because a fresh local fit is made at every point, the overall prediction is no longer a straight line but a curve.
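A minimal sketch of this idea, assuming an unweighted least-squares line fitted to the K neighbors of each query point (distance-based weighting of the neighbors is a common refinement not shown here):

```python
import numpy as np

def local_linear_predict(X_train, y_train, x_test, k=5):
    # Find the K training instances nearest to the query point
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Fit a linear function (least squares with an intercept column) to them
    A = np.hstack([np.ones((k, 1)), X_train[nearest]])
    coef, *_ = np.linalg.lstsq(A, y_train[nearest], rcond=None)
    # Evaluate the local line at the query; refitting at every query point
    # makes the overall prediction a curve rather than a single line
    return float(np.concatenate([[1.0], x_test]) @ coef)
```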
Chapter – 02
Regression Analysis
Introduction to Regression
Definition:
Regression analysis is a supervised learning technique used to model the relationship
between one or more independent variables (x) and a dependent variable (y).
Objective:
The goal is to predict or forecast the dependent variable (y) from the independent variables (x), which are also called explanatory or predictor variables.
Mathematical Representation:
The relationship is represented by a function:
y = f(x)
where y is the dependent (response) variable and x denotes the independent (predictor) variables.
Purpose:
Regression analysis helps to determine how the dependent variable changes when an
independent variable is varied while others remain constant.
Applications:
Sales forecasting
Bond values in portfolio management
Insurance premiums
Agricultural yield predictions
Real estate pricing
Prediction Focus:
Regression is primarily used for predicting continuous or quantitative variables, such as
price, revenue, and other measurable factors.
Definition:
Linear Regression is a fundamental supervised learning algorithm used to model the
relationship between one or more independent variables (predictors) and a dependent
variable (target).
Objective:
The primary goal of linear regression is to find a linear equation that best fits the data
points. This equation is used to predict the dependent variable based on the values of
the independent variables.
Mathematical Representation:
The relationship is represented as:
y = a0 + a1x
where y is the dependent variable, x is the independent variable, a0 is the intercept, and a1 is the slope (regression coefficient).
Assumptions:
o The relationship between x and y is linear.
o Observations are independent of each other.
o Residuals have constant variance (homoscedasticity) and are approximately normally distributed.
Applications:
o Forecasting a quantity from a single explanatory factor, e.g., sales from advertising spend.
Advantages:
o Simple, fast to fit, and easy to interpret.
Limitations:
o Cannot capture non-linear relationships.
o Sensitive to outliers, which can strongly distort the fitted line.
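A worked sketch of the least-squares estimates for y = a0 + a1x (the toy data below are illustrative):

```python
import numpy as np

def fit_simple_linear(x, y):
    # Least-squares estimates of intercept a0 and slope a1 for y = a0 + a1*x
    a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a0 = y.mean() - a1 * x.mean()
    return a0, a1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x
a0, a1 = fit_simple_linear(x, y)
print(a0, a1)   # intercept near 0, slope near 2
```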
A multiple regression model involves multiple predictors or independent variables and one dependent variable.
Definition:
Multiple Linear Regression (MLR) is an extension of simple linear regression, where
multiple independent variables (predictors) are used to model the relationship with a
single dependent variable (target).
Mathematical Representation:
The relationship is represented as:
y = a0 + a1x1 + a2x2 + ... + anxn
where y is the dependent variable, x1, ..., xn are the independent variables, a0 is the intercept, and a1, ..., an are the regression coefficients.
Assumptions:
o Normality of Residuals: The residuals (errors) should be normally distributed for valid inference and hypothesis testing.
o Linearity: The relationship between each independent variable and the dependent variable should be linear.
o Independence of Errors: Observations should be independent of each other.
o Homoscedasticity: The variance of residuals should be constant across all levels of the independent variables.
Applications:
o Predicting house prices based on multiple features (size, location, number of rooms,
etc.).
o Estimating the sales of a product based on various factors (price, advertising budget,
competition, etc.).
o Modeling health outcomes based on multiple risk factors (age, BMI, physical activity,
etc.).
Advantages:
o Can model the relationship between multiple predictors and a single outcome.
o Provides insights into how different predictors influence the dependent variable.
Limitations:
o Multicollinearity among predictors can make coefficient estimates unstable and hard to interpret.
o Adding many predictors increases the risk of overfitting.
o Sensitive to outliers.
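A sketch using the same least-squares machinery with an intercept column (the feature names in the example are hypothetical):

```python
import numpy as np

def fit_multiple_linear(X, y):
    # Prepend an intercept column, then solve the least-squares problem;
    # lstsq is numerically safer than forming the normal equations directly
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef   # [a0, a1, ..., an]

# Hypothetical example: price from size (sq. m) and number of rooms
X = np.array([[50, 2], [80, 3], [120, 4], [65, 2]], dtype=float)
y = np.array([150.0, 230.0, 330.0, 185.0])
print(fit_multiple_linear(X, y))
```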
Polynomial Regression
Definition:
Polynomial Regression is a form of regression analysis that models the relationship
between the independent variable(s) and the dependent variable as a polynomial
function.
It is used when the relationship between variables is non-linear and cannot be effectively
modeled using linear regression.
Purpose:
When the data exhibits a non-linear trend, linear regression may result in large errors.
Polynomial regression overcomes this limitation by fitting a curved line to the data.
Applications:
o Modeling non-linear trends such as growth curves, dose-response relationships, and trajectories.
Advantages:
o Can fit a wide range of curved relationships while remaining linear in its parameters, so it is estimated with the same least-squares machinery as linear regression.
Limitations:
Increasing the polynomial degree can lead to overfitting the training data.
Sensitive to outliers, which can significantly distort the fitted curve.
May require careful tuning of the degree n to balance bias and variance.
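A minimal sketch using NumPy's polynomial fitting (the toy data are illustrative):

```python
import numpy as np

# Toy data with a clear curved trend (roughly y = x^2 + 1)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.0, 5.2, 10.1, 16.9])

coeffs = np.polyfit(x, y, deg=2)   # least-squares fit of a degree-2 polynomial
predict = np.poly1d(coeffs)        # callable polynomial built from the coefficients
print(predict(2.5))                # prediction on the fitted curve
```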
Logistic Regression
Definition:
Logistic Regression is a supervised learning algorithm used for classification problems,
particularly binary classification, where the output is a categorical variable with two
possible outcomes (e.g., yes/no, pass/fail, spam/not spam).
Purpose:
Logistic Regression predicts the probability of a categorical outcome and maps the
prediction to a value between 0 and 1. It works well when the dependent variable is
binary.
Applications:
Spam detection (spam / not spam).
Medical diagnosis (disease present / absent).
Credit scoring (default / no default).
Core Concept:
o Logistic Regression models the probability that a test instance belongs to a particular class.
o For instance, if the predicted probability of an email being spam is 0.7, there is a 70% chance the email is spam.
o Linear regression can predict values outside the range of 0 to 1, which is unsuitable
for probabilities.
o Logistic Regression overcomes this by using a sigmoid function to map values to the
range [0, 1].
Sigmoid Function:
The sigmoid function (also called the logistic function; its inverse is the logit) is used to map any real number to the range [0, 1]. It is mathematically represented as:
σ(z) = 1 / (1 + e^(-z))
where z is the linear combination of the input features and model coefficients.
For example:
If the probability of an event is 0.75, the odds are:
odds = p / (1 - p) = 0.75 / 0.25 = 3
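A short sketch tying the sigmoid and the odds together (the numeric values are illustrative):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

p = 0.75
print(p / (1 - p))          # odds = 0.75 / 0.25 = 3.0
print(sigmoid(0.0))         # 0.5, the usual decision threshold
print(sigmoid(np.log(3)))   # 0.75: the sigmoid maps log-odds back to probability
```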
Advantages:
Simple, efficient to train, and outputs well-calibrated class probabilities.
Coefficients are interpretable in terms of log-odds.
Limitations:
Struggles with non-linear decision boundaries (can be addressed with extensions like
polynomial logistic regression).
Sensitive to outliers in the dataset.
Chapter – 03
Decision Tree Learning
Overview:
A decision tree is a tree-structured model for classification, built from the following components:
Root Node: The topmost node that represents the entire dataset.
Internal/Decision Nodes: These are nodes that perform tests on input attributes
and split the dataset based on test outcomes.
Branches: Represent the outcomes of a test condition at a decision node.
Leaf Nodes/Terminal Nodes: Represent the target labels or output of the decision
process.
Path: A path from root to leaf node represents a logical rule for classification.
Tree Construction:
o Start from the root and recursively find the best attribute for splitting.
o This process continues until the tree reaches leaf nodes that cannot be
further split.
o The tree represents all possible hypotheses about the data.
Output: A fully constructed decision tree that represents the learned model.
Inference or Classification:
Goal: For a given test instance, classify it into the correct target class.
Classification:
o Start at the root node and traverse the tree based on the test conditions for
each attribute.
o Continue evaluating test conditions until reaching a leaf node, which provides
the target class label for the instance.
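A minimal sketch of this root-to-leaf traversal on a tiny hand-built tree (the attributes and values below are illustrative, in the style of the classic play-tennis example, not from the text):

```python
# Internal nodes test an attribute; leaves hold a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": "no",
    },
}

def classify(node, instance):
    # Leaf node: return the stored target label
    if not isinstance(node, dict):
        return node
    # Decision node: follow the branch matching the instance's attribute value
    value = instance[node["attribute"]]
    return classify(node["branches"][value], instance)

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> "yes"
```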
Advantages:
1. Easy to understand and interpret; paths from root to leaf read as explicit rules.
2. Handles both categorical and continuous input attributes.
3. Requires relatively little data preparation.
4. Can capture non-linear relationships between attributes and the target.
5. Fast to train.
Disadvantages:
1. It is difficult to determine how deep the tree should grow and when to stop.
2. Sensitive to errors and missing attribute values in training data.
3. Computational complexity in handling continuous attributes, requiring
discretization.
4. Risk of overfitting with complex trees.
5. Not well suited to problems that require predicting multiple output variables.
6. Learning an optimal decision tree is an NP-complete problem.
Several decision tree algorithms are widely used in classification tasks, including ID3,
C4.5, and CART, among others.
These algorithms differ in their splitting criteria, handling of attributes, and robustness
to data characteristics.
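A minimal sketch of the two most common splitting criteria, information gain (used by ID3) and gain ratio (used by C4.5), assuming categorical attributes (the function names and toy data are our own):

```python
import numpy as np
from collections import Counter

def entropy(values):
    # Shannon entropy of a discrete distribution of values
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, attribute):
    # ID3's criterion: reduction in label entropy from splitting on an attribute
    gain = entropy(labels)
    for v in set(attribute):
        subset = [l for l, a in zip(labels, attribute) if a == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def gain_ratio(labels, attribute):
    # C4.5's criterion: gain normalized by the split's own entropy, which
    # penalizes attributes with many distinct values
    split_info = entropy(attribute)
    return information_gain(labels, attribute) / split_info if split_info else 0.0

labels  = ["yes", "yes", "no", "no", "yes"]
outlook = ["sunny", "rain", "sunny", "rain", "overcast"]
print(information_gain(labels, outlook), gain_ratio(labels, outlook))
```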
C4.5:
Algorithm