ML Notes-1
UNIT‐I
Introduction: Machine learning, terminologies in machine learning, Perspectives and issues in machine
learning, application of Machine learning, Types of machine learning: supervised, unsupervised, semi-
supervised learning. Review of probability, Basic Linear Algebra in Machine Learning Techniques, Dataset
and its types, Data preprocessing, Bias and Variance in Machine learning, Function approximation,
Overfitting
UNIT‐II
Regression Analysis in Machine Learning: Introduction to regression and its terminologies, Types of
regression, Logistic Regression
Simple Linear regression: Introduction to Simple Linear Regression and its assumption, Simple Linear
Regression Model Building, Ordinary Least square estimation, Properties of the least-squares estimators
and the fitted regression model, Interval estimation in simple linear regression, Residuals
Multiple Linear Regression: Multiple linear regression model and its assumption.
Interpret Multiple Linear Regression Output (R-Square, Standard error, F, Significance F, Coefficient P
values)
Assess the fit of the multiple linear regression model (R-squared, Standard error)
Feature Selection and Dimensionality Reduction: PCA, LDA, ICA
UNIT‐III
Introduction to Classification and Classification Algorithms: What is Classification? General Approach to
Classification, k-Nearest Neighbour Algorithm, Random Forests, Fuzzy Set Approaches
Support Vector Machine: Introduction, Types of support vector kernel – (Linear kernel, polynomial kernel,
and Gaussian kernel), Hyperplane – (Decision surface), Properties of SVM, and Issues in SVM.
Decision Trees: Decision tree learning algorithm, ID3 algorithm, Inductive bias, Entropy and information
theory, Information gain, Issues in Decision tree learning.
Bayesian Learning - Bayes theorem, Concept learning, Bayes Optimal Classifier, Naïve Bayes classifier,
Bayesian belief networks, EM algorithm.
Ensemble Methods: Bagging, Boosting, AdaBoost and XGBoost
Classification Model Evaluation and Selection: Sensitivity, Specificity, Positive Predictive Value, Negative
Predictive Value, Lift Curves and Gain Curves, ROC Curves, Misclassification Cost Adjustment to Reflect
Real-World Concerns, Decision Cost/Benefit Analysis
UNIT – IV
Introduction to Cluster Analysis and Clustering Methods: The Clustering Task and the Requirements for
Cluster Analysis.
Overview of Some Basic Clustering Methods: k-Means Clustering, k-Medoids Clustering,
Density-Based Clustering: DBSCAN - Density-Based Clustering Based on Connected Regions with High Density,
Gaussian Mixture Model algorithm, Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH),
Affinity Propagation clustering algorithm, Mean-Shift clustering algorithm, Ordering Points to Identify the
Clustering Structure (OPTICS) algorithm, Agglomerative Hierarchy clustering algorithm, Divisive
Hierarchical Clustering, Measuring Clustering Goodness
UNIT 1
➢ Machine Learning (ML)
Machine Learning (ML) is a branch of artificial intelligence (AI) that enables systems to automatically
learn and improve from experience without being explicitly programmed.
In simpler terms, it allows machines to make decisions or predictions based on data. The core concept
revolves around the idea that systems can learn from data, identify patterns, and make decisions with
minimal human intervention.
➢ Key Terminologies:
1. Model:
A mathematical representation of a process that the machine learning algorithm tries to learn from
data.
Example: A linear regression model that predicts house prices based on features like size and
location.
2. Algorithm:
The method or procedure used to train the model from data. It defines the logic and rules by which the
model makes predictions.
Example: Decision Trees, Support Vector Machines (SVM), K-Nearest Neighbours.
3. Training:
The process of feeding data into a machine learning algorithm to build a model.
Example: Training a neural network on labelled images to classify them.
4. Training Data:
The dataset used to teach the model. The model learns patterns, relationships, and trends from this
data.
Example: A dataset containing labelled data of houses with their features and corresponding
prices.
5. Test Data:
The dataset used to evaluate the performance of a trained model. This data has not been used during
the training phase and is meant to test the model’s generalization ability.
Example: A separate set of house prices that the model has not seen during training.
6. Feature:
An individual measurable property or characteristic of the data. Features are the input variables that
help the model make predictions.
Example: In a house price prediction model, features could include the number of bedrooms,
location, and size of the house.
7. Label:
The output or result that the model is trying to predict. In supervised learning, labels are known and
used to train the model.
Example: The actual price of a house in the house price prediction model.
8. Overfitting:
When a model learns the training data too well, including noise and irrelevant details, causing it to
perform poorly on new, unseen data.
Example: A decision tree that perfectly predicts the training data but performs badly on test
data.
9. Underfitting:
When a model is too simple and fails to capture the underlying trends in the data, leading to poor
performance on both training and test data.
Example: A linear model trying to fit complex, non-linear data and failing to capture the data's
nuances.
10. Recall (Sensitivity):
The ratio of correctly predicted positive observations to all actual positives. It shows how well the
model identifies positive cases.
Formula: Recall = TP / (TP + FN) {TP: True Positive // FN: False Negative}
11. F1 Score:
The harmonic mean of precision and recall. It provides a balance between precision and recall,
especially when dealing with imbalanced datasets.
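As a quick illustration of these metrics, here is a minimal Python sketch (assuming scikit-learn is installed; the labels are made up) that computes precision, recall, and the F1 score:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(precision, recall, f1)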
2. Unsupervised Learning
Definition:
In unsupervised learning, the algorithm is trained on data that does not have any labelled output. The
goal is to discover hidden patterns, structures, or relationships in the data.
Key Characteristics:
Unlabeled Data: The model is provided with input data without corresponding output labels.
Goal: Find patterns, groupings, or structure in the data.
Applications: Primarily used for clustering, association, and dimensionality reduction.
How it works:
1. The algorithm explores the input data and tries to learn the underlying patterns.
2. The model groups similar data points together or identifies hidden relationships between data
features.
Examples of Unsupervised Learning:
Clustering: Grouping data into clusters where points in the same group are more similar to each
other than to those in other groups.
Association: Discovering relationships or associations between variables in large datasets.
Dimensionality Reduction: Reducing the number of features in the data while preserving the
key information.
Algorithms Used in Unsupervised Learning:
• K-Means Clustering
• Hierarchical Clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Apriori Algorithm (for association rule learning)
• Principal Component Analysis (PCA)
• t-SNE (t-Distributed Stochastic Neighbor Embedding)
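As a minimal illustration (assuming scikit-learn; the data is synthetic and unlabelled), the sketch below applies two of the algorithms listed above: PCA for dimensionality reduction and k-Means for clustering:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic unlabelled data: 300 points with 5 features and 3 hidden groups
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

X_2d = PCA(n_components=2).fit_transform(X)  # reduce 5 features to 2 components
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])  # cluster index assigned to the first 10 points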
3. Semi‐Supervised Learning
Definition:
Semi‐supervised learning is a hybrid approach that combines both labeled and unlabeled data. It lies
between supervised and unsupervised learning. In many real-world applications, obtaining labeled data is
expensive or time-consuming, while unlabeled data is abundant.
Semi‐supervised learning leverages a small amount of labeled data with a large amount of unlabeled data
to improve learning accuracy.
Key Characteristics:
Combination of Labeled and Unlabeled Data: A small portion of the data is labeled, and a large
portion is unlabeled.
Goal: Use labeled data to guide the learning process, but also leverage the unlabeled data to
uncover additional patterns or relationships.
Applications: Often used in situations where labeled data is scarce or expensive to obtain.
How it works:
1. The algorithm starts by learning from the small set of labeled data.
2. Then, it uses the patterns learned from the labeled data to label the unlabeled data or learn
hidden structures.
3. The model improves its performance by incorporating both labeled and unlabeled data in its
training process.
Examples of Semi‐Supervised Learning:
Image Classification: Labeling thousands of images manually can be labor-intensive, so a small
set of labeled images is used along with a large set of unlabeled images.
Speech Recognition: Manually labeling vast amounts of speech data is costly. Semi-supervised
learning can be used to improve speech recognition systems with minimal labeled data.
Algorithms Used in Semi‐Supervised Learning:
• Self-training
• Co-training
• Generative Models (such as Variational Autoencoders or Gaussian Mixture Models)
• Graph-Based Methods (such as Label Propagation)
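A minimal self-training sketch is shown below (assuming scikit-learn 0.24 or newer, where unlabelled samples are marked with -1; the data is synthetic):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1  # keep only 50 labelled examples; -1 marks "unlabelled"

# Self-training: the base classifier pseudo-labels confident unlabelled points and retrains
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)
print(np.mean(model.predict(X) == y))  # accuracy against the held-back labels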
➢ Review of Probability
Experiment: Any process that leads to a well-defined outcome. For example: rolling a die or flipping a coin.
Outcome: A possible result of an experiment.
Sample Space (S): The set of all possible outcomes of an experiment.
Event (E): A subset of the sample space. It represents one or more outcomes that are of interest.
Probability (P): A numerical value between 0 and 1 that represents the likelihood of an event occurring.
Basic Probability Concepts in Machine Learning
1. Random Variable:
o A random variable is a variable whose possible values are outcomes of a random
phenomenon.
o Types:
▪ Discrete Random Variable: Takes on distinct values (e.g., number of heads in coin
tosses).
▪ Continuous Random Variable: Takes on any value within a range (e.g., temperature).
2. Probability Distribution:
o Describes how the probabilities are distributed over the values of a random variable.
o For Discrete Random Variables: Probability Mass Function (PMF) gives the probability of
each specific value.
o For Continuous Random Variables: Probability Density Function (PDF) gives the probability of
values in a range.
3. Joint Probability:
o The probability of two or more events occurring together.
o Example: The probability that a student is both a high scorer and attends all classes.
4. Marginal Probability:
o The probability of a single event occurring, irrespective of other events.
o Example: The probability that a student is a high scorer, ignoring their class attendance.
5. Conditional Probability:
o The probability of an event occurring given that another event has already occurred.
o Formula: P(A∣B) = P(A∩B) / P(B)
o Example: The probability that a student is a high scorer given that they attend all classes.
6. Independence:
o Two events are independent if the occurrence of one event does not affect the probability of
the other.
o Formula: P(A∩B)=P(A)×P(B)
o Example: Tossing two coins; the outcome of one toss doesn't affect the other.
7. Bayes’ Theorem:
o A method to calculate the conditional probability of an event based on prior knowledge of
related events.
o Formula: P(A∣B) = P(B∣A) × P(A) / P(B) (a worked numerical sketch appears at the end of this list)
o Example: Given the probability of having a disease and the probability of testing positive,
Bayes’ theorem helps find the probability of having the disease given a positive test result.
8. Expectation (Expected Value):
o The expected value of a random variable is the long-term average value of repetitions of the
experiment.
o Formula: E[X] = Σ x·P(x) for a discrete random variable (the probability-weighted average of its values).
o Example: The expected number of heads in 10 coin tosses (each with a 50% chance of heads)
is 10 × 0.5 = 5.
9. Variance and Standard Deviation:
o Variance measures how much the values of a random variable differ from the expected
value.
o Formula: Var(X) = E[(X − E[X])²]
o Standard Deviation is the square root of the variance, giving the spread of the data.
o Example: In coin tosses, variance tells us how far the actual number of heads will typically be
from the expected value.
10. Probability in ML Models:
o Classification: Models like Naive Bayes or logistic regression use probabilities to classify data.
o Generative vs Discriminative Models:
▪ Generative Models: Learn the joint probability distribution P(X,Y) and then predict
P(Y∣X). Example: Naive Bayes.
▪ Discriminative Models: Learn the conditional probability distribution P(Y∣X).
Example: Logistic regression.
▪ Formula:
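To make the formulas above concrete, here is a short worked sketch in plain Python (the disease-testing numbers are hypothetical) covering conditional probability via Bayes' theorem and the expectation and variance of a coin-toss experiment:

from math import comb

# --- Bayes' theorem with hypothetical disease-testing numbers ---
p_disease = 0.01             # prior: 1% of people have the disease
p_pos_given_disease = 0.95   # sensitivity of the test
p_pos_given_healthy = 0.05   # false-positive rate
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # about 0.161: P(disease | positive test)

# --- Expectation and variance: X = number of heads in 10 fair coin tosses ---
n, p = 10, 0.5
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
expected = sum(k * pk for k, pk in pmf.items())                    # E[X] = n*p = 5
variance = sum((k - expected) ** 2 * pk for k, pk in pmf.items())  # Var(X) = n*p*(1-p) = 2.5
print(expected, variance, variance ** 0.5)                         # mean, variance, standard deviation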
Matrices:
• Definition: A matrix is a 2D array of numbers. It is used to represent multiple data points.
o Example: a dataset of 100 houses described by 3 features can be stored as a 100 × 3 matrix.
o Each row can represent a data point and each column a feature.
• Operations:
o Matrix Multiplication: Used to transform data or compute weighted sums (e.g., multiplying the data matrix X by a weight vector w gives one prediction per row).
• Geometrical Interpretation: multiplying by a matrix applies a linear transformation (rotation, scaling, projection) to the data points in feature space.
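A small NumPy sketch (made-up numbers) shows a data matrix, a weight vector, and the weighted sums produced by matrix multiplication:

import numpy as np

# Each row is a data point (a house), each column a feature: [size in m2, bedrooms]
X = np.array([[120, 3],
              [80, 2],
              [150, 4]])

w = np.array([1000, 5000])  # hypothetical weights: price per m2 and price per bedroom
prices = X @ w              # matrix-vector multiplication = one weighted sum per row
print(prices)               # [135000  90000 170000]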
3. Visual Representation
The relationship between bias, variance, and the error can often be visualized by plotting them against model complexity:
Total Error = Bias² + Variance + Irreducible Error
Graph: as model complexity increases, bias decreases while variance increases, so the total error traces a U-shaped curve whose minimum marks the best bias-variance trade-off.
➢ Function Approximation
Function approximation in machine learning is essentially about finding a function that can predict outputs
(like labels or values) based on given inputs (features).
Key Concepts of Function Approximation in Machine Learning:
1. Inputs (Features):
o These are the data points or variables we have.
For example, in predicting house prices, features could be the size of the house, the number
of bedrooms, the location, etc.
2. Outputs (Targets/Labels):
o These are the actual values we want to predict, such as the price of a house in our example.
In supervised learning, these values are known during the training phase.
3. Hypothesis or Function:
o The hypothesis is the learned function, denoted as h(x), that tries to approximate the true
function f(x) which maps inputs x to outputs y. In practice, we don’t know the true function,
so we create models to approximate it.
4. Learning Process:
o The machine learning model (function approximator) learns from the training data by
adjusting its internal parameters to minimize the difference between its predictions and the
actual outputs. This process is done through training the model using various algorithms like
gradient descent.
1. Linear Function Approximation:
o Definition: The output is modelled as a weighted sum of the input features.
o Equation: h(x) = w1·x1 + w2·x2 + … + wn·xn + b
where w_i are the weights for each feature, and b is the bias term.
o Example: In linear regression, the model tries to find the best line that fits the data by
adjusting the weights w (a small gradient-descent sketch appears at the end of this section).
2. Non‐Linear Function Approximation:
o Definition: In many cases, the relationship between inputs and outputs is not linear, and
non-linear models like neural networks, decision trees, or polynomial regression are used to
capture complex patterns.
o Equation (for a simple neural network unit with a sigmoid output): h(x) = 1 / (1 + e^-(w·x + b)),
where the function returns a probability between 0 and 1, which is used to make a
classification decision.
3. Neural Networks (Complex Non-linear Approximation):
o Used for complex tasks like image recognition or natural language processing. Neural
networks with multiple layers can approximate very complex functions by stacking layers of
non-linear functions.
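As referenced under linear function approximation above, here is a minimal NumPy sketch (synthetic data; the true function y = 3x + 5 is made up) that learns the weights of h(x) = w·x + b by gradient descent:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 * X + 5.0 + rng.normal(0, 1, 100)  # noisy samples of the true function

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):                       # minimize mean squared error by gradient descent
    error = (w * X + b) - y
    w -= lr * (2 / len(y)) * np.sum(error * X)
    b -= lr * (2 / len(y)) * np.sum(error)

print(round(w, 2), round(b, 2))             # should be close to 3 and 5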
➢ Overfitting
Overfitting in machine learning occurs when a model performs well on the training data but poorly on
unseen data because it has learned the specific patterns and noise of the training data instead of general
patterns. Here is a concise explanation:
Causes of Overfitting:
1. Too complex model: Models with many parameters (e.g., deep neural networks, decision trees) can
overfit by learning noise.
2. Small dataset: With limited data, the model learns details that do not generalize well to new data.
3. Too many features: The model may find relationships between irrelevant features, leading to
overfitting.
Symptoms of Overfitting:
1. High accuracy on training data but low accuracy on test data.
2. Large gap between training and validation performance.
How to Prevent Overfitting:
1. Simplify the model: Use fewer parameters or features.
2. Regularization: Techniques like L1 (Lasso) or L2 (Ridge) add penalties for complexity.
3. Cross‐validation: Use k-fold cross-validation to ensure the model generalizes well.
4. Early stopping: Stop training when performance on the validation set starts to decline.
5. More training data: Adding more data helps the model generalize better.
6. Dropout (for neural networks): Randomly ignore some neurons during training to prevent over-reliance on specific patterns.
Example:
If a model predicting house prices perfectly fits the training data but performs poorly on new, unseen data,
it likely overfitted to the unique details in the training set (e.g., specific houses with unusual features).
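A minimal sketch (assuming scikit-learn; the data is synthetic) shows the classic overfitting symptom, a near-perfect training score with a much lower test score, and how L2 regularization typically narrows that gap:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1, 30)  # noisy quadratic relationship

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

overfit = make_pipeline(PolynomialFeatures(15), StandardScaler(), LinearRegression()).fit(X_tr, y_tr)
ridge = make_pipeline(PolynomialFeatures(15), StandardScaler(), Ridge(alpha=1.0)).fit(X_tr, y_tr)

print("degree-15 :", overfit.score(X_tr, y_tr), overfit.score(X_te, y_te))  # large train/test gap
print("with ridge:", ridge.score(X_tr, y_tr), ridge.score(X_te, y_te))      # usually a smaller gap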
UNIT 2
Regression Analysis in Machine Learning:
➢ Introduction to Regression Analysis
Regression analysis is a statistical technique used in machine learning and data science to model the
relationship between a dependent variable and one or more independent variables. The goal of
regression analysis is to predict the output variable (also known as the target or response variable)
based on the input features (also known as predictors or explanatory variables).
1. What is Regression?
Regression can be defined as a method for predicting a continuous outcome based on the values of
one or more input variables. It provides insights into the relationships between variables and helps
identify trends and patterns in data.
Dependent Variable (Response Variable): The variable we are trying to predict or explain (e.g.,
house prices, sales revenue).
Independent Variables (Predictors): The variables used to predict the dependent variable (e.g.,
square footage, number of bedrooms).
R‐squared (R2): A statistical measure that represents the proportion of variance for the dependent variable
that is explained by the independent variables in the model. An R2 value close to 1 indicates a good fit.
Adjusted R‐squared: A modified version of R2 that adjusts for the number of predictors in the model. It
provides a more accurate measure when comparing models with different numbers of predictors.
Overfitting: A scenario where a model captures noise in the training data rather than the underlying
pattern, leading to poor generalization on unseen data.
Multicollinearity: A situation where two or more independent variables are highly correlated, making it
difficult to determine the individual effect of each variable on the dependent variable.
➢ Type of Regression
1. Linear Regression
A. Simple Linear Regression
Description: Models the relationship between a single independent variable and a dependent
variable using a linear equation.
Equation: Y = β0 + β1X + ε
Use Case: Predicting a continuous outcome like house prices based on one feature, such as square
footage.
B. Multiple Linear Regression
Description: Extends simple linear regression to multiple independent variables.
Equation: Y = β0 + β1X1 + β2X2 + … + βnXn + ε
Use Case: Predicting a continuous outcome based on several predictors, such as predicting salaries
based on education, experience, and location.
2. Polynomial Regression
Description: Models the relationship between the dependent variable and the independent
variable as an nth-degree polynomial.
Equation: Y = β0 + β1X + β2X² + … + βnXⁿ + ε
Use Case: Suitable for modelling nonlinear relationships, such as predicting sales based on
advertising spend when the relationship is quadratic.
3. Logistic Regression
Description: A classification algorithm that predicts the probability of a binary outcome based on
one or more predictor variables. It uses the logistic function to constrain predictions to the (0, 1)
interval.
Equation: P(Y=1∣X) = 1 / (1 + e^-(β0 + β1X1 + … + βnXn))
Use Case: Predicting whether a customer will buy a product (yes/no) based on features like age and
income.
4. Ridge Regression (L2 Regularization)
Description: A type of linear regression that includes L2 regularization, which adds a penalty equal
to the square of the coefficients' magnitude. It helps to prevent overfitting by discouraging large
coefficients.
Objective Function: minimize Σ(yi − ŷi)² + λ Σ βj²
Use Case: Suitable when dealing with multicollinearity or when the number of predictors is large
compared to the number of observations.
5. Lasso Regression (L1 Regularization)
Description: Like Ridge regression, but it adds an L1 penalty, which can shrink some coefficients to
zero, effectively performing variable selection.
Objective Function: minimize Σ(yi − ŷi)² + λ Σ |βj|
Use Case: Useful when you want to identify and retain only the most important predictors in your
model.
6. Elastic Net Regression
Description: Combines the penalties of both Ridge and Lasso regression, allowing for both feature
selection and regularization.
Objective Function: minimize Σ(yi − ŷi)² + λ1 Σ |βj| + λ2 Σ βj²
Use Case: Effective in scenarios with highly correlated features and when there are more predictors
than observations.
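The sketch below (scikit-learn, synthetic data) fits the three regularized regressions side by side; Lasso typically drives the coefficients of uninformative features exactly to zero, while Ridge only shrinks them:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# 100 samples, 10 features, only 3 of which actually influence the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=5, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    n_zero = sum(abs(c) < 1e-6 for c in model.coef_)
    print(type(model).__name__, "coefficients set to zero:", n_zero)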
➢ Simple Linear Regression Model Building
Equation of Simple Linear Regression: The equation of a simple linear regression line is given by:
Y = β0 + β1X + ε
where β0 is the intercept, β1 the slope, and ε the error term.
Evaluate the Model:
1. Coefficient of Determination (R²):
• R² = 1 − SSres / SStot
Where:
• SSres = Sum of squared residuals (errors).
• SStot = Total sum of squares (variance in Y).
2. Mean Squared Error (MSE):
• MSE measures the average of the squared differences between the actual and predicted values.
A lower MSE indicates a better-fitting model.
3. Residual Analysis:
• Analyze the residuals (differences between actual and predicted values). Residuals should be
randomly distributed and have constant variance (homoscedasticity).
• A residual plot can help in checking if the errors are randomly distributed around zero (a short computational sketch of these evaluation metrics follows step 7 below).
6. Make Predictions
Once the model is evaluated, you can use it to make predictions for new data points.
• Prediction Formula: Ŷ = β0^ + β1^ · X_new, i.e., the estimated intercept and slope applied to a new value of X.
7. Interpretation of Results
• Intercept (β0): The value of Y when X is zero. It may or may not have a meaningful interpretation
depending on the context.
• Slope (β1): Indicates how much the dependent variable changes for each unit change in the
independent variable. A positive slope suggests a direct relationship, while a negative slope
suggests an inverse relationship.
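The evaluation quantities above (R², MSE, and the residuals) can be computed directly; a minimal sketch with made-up prices, assuming scikit-learn:

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_actual = np.array([200, 250, 300, 350, 400])  # e.g. actual house prices (in $1000s)
y_pred = np.array([210, 240, 310, 330, 410])    # predictions from a fitted model

print("R-squared:", r2_score(y_actual, y_pred))            # 1 - SSres/SStot
print("MSE      :", mean_squared_error(y_actual, y_pred))  # average squared error
print("Residuals:", y_actual - y_pred)                     # should be centred around zero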
Objective of OLS:
The goal of OLS estimation is to find the values of the coefficients (β0 and β1) that minimize the sum of the
squared differences between the observed values and the predicted values. These squared differences are
referred to as residuals.
The general linear regression model is:
Y = β0 + β1X + ε
and OLS chooses the estimates β0^ and β1^ that minimize the sum of squared residuals Σ(yi − β0 − β1xi)².
➢ Properties of the Least-Squares Estimators
1. Unbiasedness:
o On average, the OLS estimators equal the true parameters: E[β0^] = β0 and E[β1^] = β1, so they neither systematically over- nor under-estimate the coefficients.
2. Efficiency:
o Among the class of linear, unbiased estimators, OLS estimators are the most efficient (i.e.,
they have the smallest variance). This property is known as Gauss‐Markov theorem, which
states that OLS estimators are the Best Linear Unbiased Estimators (BLUE) when certain
assumptions hold (such as homoscedasticity and no correlation among errors).
3. Consistency:
o As the sample size n increases, the OLS estimators β0^ and β1^ converge to the true
population parameters β0 and β1. This means that with larger data, the estimators become
more accurate.
4. Normality:
o If the error terms ϵ are normally distributed, the OLS estimators will also follow a normal
distribution. This is particularly useful for hypothesis testing and confidence interval
estimation.
5. Independence:
o The OLS estimators β0^ and β1^ are independent if the errors are homoscedastic and
uncorrelated.
Fitted Regression Model:
The fitted regression line is Ŷ = β0^ + β1^X, and the residual for observation i is ei = yi − ŷi.
Properties of Residuals:
1. Sum of Residuals: The sum of residuals is always zero: Σ ei = 0
Residual Plots:
Plotting the residuals can help assess the assumptions of the regression model:
• A scatter plot of residuals versus the fitted values (predicted Y) should show no discernible pattern
if the model is appropriate.
• A histogram of residuals should ideally show a normal distribution if the assumption of normality is
met.
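The closed-form OLS estimates and the zero-sum property of the residuals can be checked directly with NumPy (a minimal sketch on synthetic data with true coefficients β0 = 2 and β1 = 3):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 50)  # true line plus noise

# Closed-form OLS estimators
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

residuals = y - (beta0 + beta1 * x)
print(beta0, beta1)       # close to 2 and 3
print(np.sum(residuals))  # approximately 0, as required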
For example, a multiple regression model for house prices can be written as:
Y = β0 + β1X1 + β2X2 + β3X3 + ϵ
Where:
• Y is the house price,
• X1, X2, X3 are the predictors (e.g., size, number of bedrooms, and location score),
• β0 is the base price (intercept),
• β1, β2, β3 are the coefficients (the impact of each factor on the price),
• ϵ is the error term (things we can’t measure perfectly).
➢ F‐Statistic:
• What it is: The F-statistic tests whether your model is useful overall.
• Easy interpretation:
o A high F‐statistic means your model does a good job at predicting the data.
o A low F‐statistic means your independent variables might not help much in predicting the
outcome.
➢ Coefficient P‐Values:
• What it is: Each independent variable (like size of the house or number of bedrooms) has a p-value,
which shows if that variable is helping to predict the outcome.
• Easy interpretation:
o If a p-value for a variable (e.g., size) is less than 0.05, it’s important for the prediction.
o If a p-value is greater than 0.05, that variable might not be significant and can be ignored or
removed from the model.
➢ Adjusted R‐Squared
What it is: Adjusted R² is a modified version of R² that takes into account how many predictors (variables)
are in the model. It helps you compare models with different numbers of predictors.
Easy Example:
• Suppose you have two models predicting ice cream sales:
o Model A uses temperature and sunny days as predictors.
o Model B uses temperature, sunny days, and humidity.
• Adjusted R² will tell you if adding humidity (a new variable) to the model actually improves
predictions or if it just complicates things.
• If Adjusted R² increases after adding humidity, it means the new variable is useful. If it decreases, it
means the new variable is not helping much and might even hurt the model.
Key Point: Adjusted R² helps prevent overfitting by penalizing models with too many unnecessary variables.
➢ F‐Test
What it is: The F-test checks whether your model is useful. It tells you if at least one of your predictors (like
temperature or sunny days) is significantly helping to predict the dependent variable (like ice cream sales).
Easy Example:
• You run a model using temperature and sunny days to predict ice cream sales.
• If the F-test gives a small p‐value (less than 0.05), it means at least one of these predictors is
significantly helping to predict sales.
• If the F-test p-value is large (greater than 0.05), your model might not be useful, and you should
rethink your predictors.
Key Point: A small p‐value from the F-test means the model is working well. A large p‐value means it is not
useful.
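All of these quantities (R², Adjusted R², the F-statistic with its p-value, and the coefficient p-values) can be read off a fitted OLS model; the sketch below is only illustrative, assumes the statsmodels package is available, and uses synthetic ice-cream-sales data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
temperature = rng.uniform(15, 35, 50)
sunny_days = rng.integers(0, 2, 50)
sales = 20 + 3 * temperature + 10 * sunny_days + rng.normal(0, 5, 50)  # hypothetical sales

X = sm.add_constant(np.column_stack([temperature, sunny_days]))
results = sm.OLS(sales, X).fit()

print(results.rsquared, results.rsquared_adj)  # R-squared and Adjusted R-squared
print(results.fvalue, results.f_pvalue)        # F-statistic and its p-value (Significance F)
print(results.pvalues)                         # p-values for the intercept and each coefficient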
Feature Selection and Dimensionality Reduction:
Introduction
In machine learning, Feature Selection and Dimensionality Reduction are techniques used to improve
model performance by simplifying the data. This makes the model more efficient, accurate, and
interpretable.
• Feature Selection: Involves selecting only the most important features (variables) from the dataset
to improve model performance. It eliminates irrelevant or redundant features.
• Dimensionality Reduction: Refers to reducing the number of input variables (features) in the
dataset, transforming the data into a lower-dimensional space without losing essential information.
Both techniques help deal with large datasets (high-dimensional data) and avoid problems like overfitting.
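A minimal feature-selection sketch (scikit-learn, built-in iris data) keeps only the two features most strongly related to the label; note that, unlike PCA, this retains original columns rather than creating transformed ones:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the strongest statistical link (ANOVA F-score) to the class label
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_selected = selector.transform(X)

print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
print(selector.get_support())           # mask showing which original features were kept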
• Example (k-Nearest Neighbour): Classify whether a fruit is an apple or an orange based on features like color and size.
4. Random Forest
• Definition: An ensemble method that combines multiple decision trees to improve
classification performance.
• Key Points:
o Builds many decision trees during training.
o Combines the output of all trees (majority voting) for the final classification.
• Advantages:
o Handles large datasets efficiently.
o Reduces overfitting compared to a single decision tree.
• Disadvantages:
o Requires more computational resources.
o Less interpretable than a single decision tree.
• Example:
o Predict whether a loan applicant is "Creditworthy" or "Not Creditworthy" based on
features like income, credit score, and employment history.
• Numerical Aspect:
o Decision Tree Splitting: each tree in the forest chooses its splits to maximize a purity measure such as information gain or Gini reduction (see the sketch below).
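A minimal Random Forest sketch (scikit-learn; the loan-applicant data and the "creditworthy" rule are synthetic and purely illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical applicants: [income in $1000s, credit score, years employed]
X = np.column_stack([rng.uniform(20, 120, 200),
                     rng.uniform(300, 850, 200),
                     rng.uniform(0, 20, 200)])
y = (X[:, 1] > 600).astype(int)  # toy rule: creditworthy if credit score > 600

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(forest.score(X_te, y_te))     # accuracy on unseen applicants (majority vote of the trees)
print(forest.feature_importances_)  # how much each feature contributes to the splits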
5. Fuzzy Set Approaches
• Definition: Classifies instances using fuzzy membership functions, so an instance can belong to several classes with different degrees of membership rather than a single crisp label.
• Advantages:
o Effective for complex problems with overlapping classes.
o Provides a degree of confidence for each class.
• Disadvantages:
o Requires careful design of membership functions.
o Computationally intensive.
• Example:
o Classify the "risk level" of patients (Low, Medium, High) based on fuzzy inputs like
blood pressure and heart rate.
Recommended Resources
o "k-Nearest Neighbour Algorithm" by Simplilearn
o "Random Forest Algorithm Explained" by StatQuest
o "Fuzzy Logic with Examples" by Neso Academy
▪ Advantages of SVM
• Effective for high-dimensional datasets.
• Works well for both linear and non-linear classification.
• Robust to overfitting, especially in high-dimensional spaces.
▪ Disadvantages of SVM
• Computationally expensive for large datasets.
• Requires careful selection of kernel functions and parameters.
• Can be sensitive to outliers.
Recommended Resources
1. YouTube:
o "Support Vector Machine Explained" by StatQuest
o "SVM Kernels - Linear, Polynomial, RBF" by Great Learning
❖ Hyperplane – Decision Surface
Definition:
• A hyperplane is a decision surface that separates data points of different classes in the
feature space. In a 2D space, it is a line; in 3D, it is a plane; and in higher dimensions, it is an
n-dimensional flat surface.
• SVM determines the optimal hyperplane that maximizes the margin between classes.
Key Characteristics:
1. Separation: The hyperplane divides the feature space such that data points from different
classes are on opposite sides.
2. Optimality: SVM chooses the hyperplane that has the largest margin, ensuring better
generalization to unseen data.
Example: For a dataset with two classes (e.g., cats and dogs), the hyperplane is the decision
boundary that separates the feature representations of cats from those of dogs.
➢ Properties of SVM
1. Margin Maximization:
o SVM seeks to maximize the margin between the hyperplane and the nearest data
points (support vectors).
o Larger margins reduce overfitting and improve model generalization.
2. Support Vectors:
o Only the data points closest to the hyperplane (support vectors) are used to define
the decision boundary.
o These points are critical for training the SVM.
3. Kernel Trick:
o SVM can handle non-linearly separable data by using kernel functions to transform it
into a higher-dimensional space where it becomes linearly separable.
4. Dual Representation:
o The optimization problem in SVM can be expressed in terms of Lagrange multipliers,
allowing efficient computation.
5. Robustness to High Dimensions:
o SVM performs well in datasets with many features (e.g., text classification with
thousands of words).
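The sketch below (scikit-learn, synthetic two-class data) trains SVMs with the three kernel types mentioned earlier and inspects the support vectors that define the hyperplane:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):  # "rbf" is the Gaussian kernel
    clf = SVC(kernel=kernel, C=1.0).fit(X_tr, y_tr)
    print(kernel, "accuracy:", clf.score(X_te, y_te),
          "support vectors:", clf.n_support_.sum())  # only these points define the margin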
➢ Issues in SVM
1. High Computational Cost:
o Training an SVM can be computationally expensive for large datasets, especially with
non-linear kernels.
2. Choice of Kernel:
o Selecting the appropriate kernel function (e.g., linear, polynomial, or Gaussian) and
tuning its parameters can be challenging and critical for model performance.
3. Sensitivity to Outliers:
o SVM is sensitive to noise and outliers, as they can affect the position of the hyperplane.
4. Imbalanced Data:
o SVM struggles with imbalanced datasets, as it assumes equal importance for all classes.
This may result in a biased hyperplane.
5. Interpretability:
o Compared to simpler models like decision trees, SVM is less interpretable, especially
when using complex kernels.
Recommended Resources
o "SVM Explained Visually" by StatQuest
o "Understanding the SVM Hyperplane and Support Vectors" by Edureka
2. Bayes’ Theorem
Formula: P(H∣D) = P(D∣H) × P(H) / P(D)
where P(H) is the prior probability of hypothesis H, P(D∣H) the likelihood of the data under H, P(D) the evidence, and P(H∣D) the posterior probability.
Key Points:
• Prior probability is updated using new evidence to compute the posterior probability.
• The posterior becomes the new prior as more evidence accumulates.
❖ Concept Learning
• Definition: Concept learning involves finding a hypothesis H that best explains the
observed data D.
• Bayesian Perspective:
o All possible hypotheses are considered.
o The best hypothesis is the one with the highest posterior probability P(H∣D).
• Key Equation: h_MAP = argmax over H of P(D∣H) × P(H) (the maximum a posteriori hypothesis, since P(D) is the same for every hypothesis).
Bayes Optimal Classifier
• Definition: A Bayes Optimal Classifier combines all hypotheses weighted by their posterior
probabilities to make the most accurate prediction.
• Formula: P(v∣D) = Σ over hypotheses h of P(v∣h) × P(h∣D); the classifier predicts the class v with the highest value.
❖ Naïve Bayes Classifier
Steps:
1. Compute the prior probability P(C) for each class.
2. Compute the likelihood P(X∣C) for each feature, assuming the features are conditionally independent.
3. Use Bayes’ theorem to compute the posterior probability for each class.
4. Choose the class with the highest posterior probability.
Example: Email Spam Classification:
• Features: Words in the email (e.g., "money," "free").
• Class: Spam or not spam.
• Assumes the presence of "money" and "free" are independent indicators.
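A minimal Naïve Bayes sketch for the spam example (scikit-learn; the tiny corpus below is made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win free money now", "free money offer today",
          "meeting schedule for monday", "project report attached"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)  # word-count features, treated as independent by Naive Bayes

model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["free money for you"])))  # likely predicts 1 (spam)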
Example (EM algorithm):
Clustering customer data based on purchase behavior where some features are missing.
Boosting
• Concept: Trains models sequentially, where each subsequent model focuses on correcting
the errors of the previous ones.
• Key Points:
o Reduces bias and improves accuracy.
o Can be sensitive to noise and outliers.
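A minimal AdaBoost sketch (scikit-learn, synthetic data), where each new weak learner concentrates on the examples the previous ones misclassified:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

boosted = AdaBoostClassifier(n_estimators=100, random_state=0)  # weak learners are shallow trees by default
print(cross_val_score(boosted, X, y, cv=5).mean())              # average accuracy across 5 folds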
7. Recommended Resources
1. YouTube:
o "Clustering and K-Means Algorithm Explained" by StatQuest.
o "Understanding DBSCAN Clustering" by Data School.
o "Hierarchical Clustering Tutorial" by Simplilearn.
Overview of Some Basic Clustering Methods
Clustering is an unsupervised learning technique that groups similar data points together. Here’s an
overview of some widely used clustering algorithms:
1. k‐Means Clustering
Definition:
k-Means is a partitioning-based clustering algorithm that divides the data into k distinct clusters,
where each data point belongs to the cluster whose center (centroid) is closest.
Advantages:
• Simple and easy to implement.
• Scalable to large datasets.
• Works well when the clusters are spherical and evenly sized.
Disadvantages:
• The number of clusters k must be pre-defined.
• Sensitive to initial centroid placement.
• Assumes clusters are spherical, which might not be true for all datasets.
• Sensitive to outliers.
Real‐World Example:
In customer segmentation, k-Means can be used to group customers based on purchasing behavior
(e.g., frequent buyers vs. occasional buyers).
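A minimal k-Means sketch (scikit-learn; the two features stand in for purchasing-behaviour measurements and the data is synthetic):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic "customers": 2 features, e.g. purchase frequency and average spend
X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # the 3 learned centroids
print(kmeans.inertia_)          # within-cluster sum of squared distances (lower means tighter clusters)
print(kmeans.labels_[:10])      # cluster assigned to the first 10 customers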
2. k‐Medoids Clustering
Definition: k-Medoids is similar to k-Means, but instead of using the mean of the points to
represent the centroid of a cluster, it uses the most centrally located point (medoid). It minimizes
the sum of dissimilarities between points and the representative medoid.
3. Density-Based Clustering: DBSCAN
Definition: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that lie in dense regions (at least MinPts points within a radius ε of each other) and labels points in sparse regions as noise/outliers.
Advantages:
• Can discover clusters of arbitrary shape.
• Does not require the number of clusters to be specified in advance.
• Can handle noise and outliers effectively.
• Works well with datasets containing clusters of varying shapes and densities.
Disadvantages:
• Sensitive to the choice of the ε (radius) and MinPts parameters.
• Struggles with datasets of varying density, where some clusters may be harder to identify.
• Computationally expensive for large datasets.
Real‐World Example:
DBSCAN is widely used in spatial data clustering, such as identifying areas of high customer activity
in retail sales, or in geographic data analysis, where it helps to find densely populated regions in a
map.
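A minimal DBSCAN sketch (scikit-learn; the ε and MinPts values are arbitrary choices for this synthetic two-moons data, which k-Means would split poorly):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two interleaving half-circles

db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # eps = radius ε, min_samples = MinPts
print(set(db.labels_))                      # cluster ids; the label -1 marks noise points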
4. Gaussian Mixture Model (GMM) Clustering
Definition: A GMM assumes the data is generated from a mixture of several Gaussian distributions and assigns each point a probability of belonging to each cluster (soft clustering); the parameters are typically estimated with the EM algorithm.
Advantages:
• Can model clusters of elliptical shapes, unlike k-Means (which assumes spherical clusters).
• Provides probabilities for cluster membership, which can be useful for decision-making.
• Can model complex data distributions.
Disadvantages:
• Computationally intensive and requires careful initialization.
• Assumes data is generated from Gaussian distributions, which may not always be the case.
• The number of clusters k must be specified.
Real‐World Example:
GMM can be used in image segmentation, where the algorithm assigns pixels in an image to
different regions based on color distributions, modeling the color distribution as a mixture of
Gaussians.
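A minimal Gaussian Mixture sketch (scikit-learn, synthetic data) showing the soft cluster-membership probabilities that distinguish GMM from k-Means:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict(X[:5]))        # hard cluster assignments for the first 5 points
print(gmm.predict_proba(X[:5]))  # soft membership probabilities over the 3 Gaussians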
5. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Definition: BIRCH incrementally summarizes the data into a compact CF-tree so that very large datasets can be clustered efficiently, often in a single pass and with limited memory.
Real-World Example:
BIRCH is often used in large-scale data analysis like customer segmentation in large retail stores,
where millions of customer records need to be processed quickly.
6. Agglomerative Hierarchical Clustering
Definition: A bottom-up hierarchical method that starts with every point as its own cluster and repeatedly merges the two closest clusters until a single cluster (or a chosen number of clusters) remains; the merge history is visualized as a dendrogram.
Advantages:
• Does not require the number of clusters to be specified in advance.
• Produces a hierarchical tree (dendrogram) that provides insight into the data structure.
• Can handle clusters of arbitrary shapes.
Disadvantages:
• Computationally expensive for large datasets (especially when the number of data points is
large).
• Sensitive to noise and outliers.
Real‐World Example:
Agglomerative hierarchical clustering is used in gene expression analysis, where the goal is to
group similar genes based on their expression patterns across multiple conditions.
7. Divisive Hierarchical Clustering
Definition: A top-down hierarchical method that starts with all points in one cluster and recursively splits clusters until each point stands alone or a stopping criterion is met.
Real-World Example:
Divisive hierarchical clustering can be used in document classification, where initially, all
documents are in one cluster, and the task is to split them based on the topic until each document
is in its own topic-based cluster.
➢ Measuring Clustering Goodness
Silhouette Coefficient: for each point i,
s(i) = (b(i) − a(i)) / max(a(i), b(i))
Where:
o a(i) is the average distance between point i and all other points in the same cluster.
o b(i) is the average distance between point i and all points in the nearest other cluster.
Values of s(i) close to +1 indicate a point that is well matched to its own cluster; the average s(i) over all points summarizes overall clustering quality.
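The silhouette score can be computed directly with scikit-learn; the sketch below compares agglomerative clusterings with different numbers of clusters on synthetic data:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in (2, 3, 4, 5):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # average s(i); higher means better-separated clusters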
Recommended Resources
1. YouTube:
o "Agglomerative Clustering - Machine Learning" by StatQuest.
o "Divisive Hierarchical Clustering" by Data Science Society.
o "Measuring Clustering Performance - Machine Learning" by Simplilearn.