ML Notes All
● Human learning refers to the process through which individuals acquire new
knowledge, skills, behaviors, or attitudes, resulting in a relatively permanent
change in their capabilities.
Types of Human Learning
1. Classical Conditioning:
➔ Involves learning associations between stimuli and responses.
2. Operant Conditioning:
➔ Learning occurs through reinforcement (increasing behavior) or punishment (decreasing behavior).
3. Observational Learning:
➔ Learning by observing others.
4. Cognitive Learning:
➔ Emphasizes mental processes like thinking, memory, problem-solving.
5. Associative Learning:
➔ Involves forming associations or connections between stimuli or events. Classical and operant conditioning are examples of associative learning.
6. Insight Learning:
➔ Sudden realization or understanding of a problem without prior experience. Often associated with problem-solving.
7. Experiential Learning:
➔ Learning through direct experience, reflection, and active engagement.
8. Vicarious Learning:
➔ Learning by experiencing the consequences of others' actions. Linked to observational learning and empathy.
9. Rote Learning:
➔ Memorization of information through repetition. Common in early education for basic facts and figures.
10. Latent Learning:
➔ Latent learning highlights the idea that individuals can acquire information without immediately displaying it in their behavior.
11. Social Learning:
➔ Emphasizes the role of social interactions in learning.
12. Habituation and Sensitization:
➔ Habituation involves a decrease in response to a repeated stimulus, while sensitization involves an increase in response, often due to the intensity or repeated exposure to a stimulus.
Machine Learning
Arthur Samuel described Machine Learning as:
“the field of study that gives computers the ability to learn without being explicitly programmed.”
Tom Mitchell gave a more formal definition:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Types of Machine Learning
● Machine learning is a subset of AI, which enables the machine to automatically learn from data, improve
performance from past experiences, and make predictions.
Well Posed Learning Problem
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
A problem can be described as a well-posed learning problem if it has three traits –
1. Task
2. Performance Measure
3. Experience
Example 1:
Task – Classifying emails as spam or not, Performance
Measure – The fraction of emails accurately classified as spam or not spam
Experience – Observing you label emails as spam or not spam
Example 2:
Task – predicting different types of faces
Performance Measure – able to predict maximum types of faces
Experience – training machine with maximum amount of datasets of different face images
Application of Machine Learning
Healthcare:
Finance:
Retail:
Marketing:
Education:
Manufacturing:
➔ Chatbots and Virtual Assistants: Using NLP for natural and interactive
communication with users.
➔ Language Translation: Translating text from one language to another
with improved accuracy.
Issues in Machine Learning
Poor Quality of Data
➔ Discrete data type: Numeric data that takes discrete values (whole numbers). Expressing such values in decimal format has no proper meaning; their values can be counted.
➔ Categorical data type: Data that cannot be expressed in numbers. It describes categories or groups, e.g., sunny = 1, cloudy = 2, windy = 3, or binary-form data such as 0/1 and good/bad.
➔ Unstructured data: Data that does not have a proper format and is therefore known as unstructured data.
[https://www.heavy.ai/learn/data-exploration]
[https://www.simplilearn.com/tutorials/statistics-tutorial/what-is-normal-distribution].
Nominal Data and Ordinal Data
The main differences between Nominal Data and Ordinal Data are:
While Nominal Data is classified without any intrinsic ordering or rank, Ordinal Data has some
predetermined or natural order.
Nominal data is qualitative or categorical data, while Ordinal data is considered “in-between”
qualitative and quantitative data.
Nominal data do not provide any quantitative value, and you cannot perform numeric operations
with them or compare them with one another. However, Ordinal data provide sequence, and it is
possible to assign numbers to the data. No numeric operations can be performed. But ordinal
data makes it possible to compare one item with another in terms of ranking.
Example of Nominal Data – Eye color, Gender; Example of Ordinal data – Customer Feedback,
Economic Status
Exploratory Data Analysis
Data exploration steps to follow before building a machine learning model include:
➔ Variable identification: define each variable and its role in the dataset
➔ Univariate analysis: for continuous variables, build box plots or histograms for each variable independently;
for categorical variables, build bar charts to show the frequencies
➔ Bi-variate analysis - determine the interaction between variables by building visualization tools
➔ Continuous and Continuous: scatter plots
➔ Categorical and Categorical: stacked column chart
➔ Categorical and Continuous: boxplots combined with swarmplots
➔ Detect and treat missing values
➔ Detect and treat outliers
The ultimate goal of data exploration machine learning is to provide data insights that will inspire subsequent
feature engineering and the model-building process.
Ref. https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/
https://www.simplilearn.com/tutorials/data-analytics-tutorial/exploratory-data-analysis
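The exploration steps above can be sketched in Python with pandas, matplotlib, and seaborn. This is only a minimal illustration, not the full EDA workflow from the referenced tutorials; the file name data.csv and the column names ("age", "gender", "income") are placeholders.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("data.csv")          # placeholder file name

    # Variable identification
    print(df.dtypes)

    # Univariate analysis: histogram for a continuous variable, bar chart for a categorical one
    df["age"].plot(kind="hist"); plt.show()
    df["gender"].value_counts().plot(kind="bar"); plt.show()

    # Bi-variate analysis: continuous vs continuous -> scatter, categorical vs continuous -> box plot
    df.plot.scatter(x="age", y="income"); plt.show()
    sns.boxplot(x="gender", y="income", data=df); plt.show()

    # Detect missing values and simple outliers (values beyond 1.5 * IQR)
    print(df.isnull().sum())
    q1, q3 = df["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
    print(len(outliers))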
Data Types
Nominal Data: Categories without any inherent order (e.g., colors, gender).
Ordinal Data: Categories with a specific order (e.g., education levels, satisfaction ratings).
Exploration Techniques:
Frequency Distribution: Count the occurrences of each category. Helps understand the distribution of data.
Bar Charts: Visual representation of frequency distribution. Useful for comparing the frequency of different categories.
Pie Charts: Represents proportions of the whole. Suitable for displaying the contribution of each category.
Central Tendency: Mode is often used for central tendency in categorical data. Mode is the category with the highest
frequency.
Exploring Relationship between variables
Correlation is very useful in data analysis and modelling to better
understand the relationships between variables. The statistical
relationship between two variables is referred to as their correlation.
[https://www.geeksforgeeks.org/exploring-correlation-in-python/]
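A minimal pandas sketch of computing correlations between variables; the DataFrame and column names below are made up for illustration.

    import pandas as pd

    df = pd.DataFrame({
        "hours_studied": [1, 2, 3, 4, 5],
        "exam_score":    [52, 55, 61, 68, 74],
    })

    # Pairwise Pearson correlation matrix for all numeric columns
    print(df.corr())

    # Correlation between two specific variables
    print(df["hours_studied"].corr(df["exam_score"]))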
Box Plot (https://youtu.be/GMb6HaLXmjY)
https://www.six-sigma-material.com/Box-Plot.html
Data issues and Remediation in machine learning
Insufficient Data: Lack of high-quality and representative data can hinder the training process and lead
to poor model performance.
Overfitting: Occurs when a model learns the training data too well, capturing noise and outliers,
resulting in poor generalization to new, unseen data.
Underfitting: The opposite of overfitting, where the model is too simple and fails to capture the
underlying patterns in the data.
Feature Engineering Challenges: Selecting and creating relevant features is crucial; improper feature
selection or extraction can lead to suboptimal model performance.
Data Leakage: Accidental inclusion of information from the test set in the training process, leading to
overly optimistic performance estimates.
Data issues and Remediation in machine learning
Data Augmentation: Generate additional training data by applying transformations (e.g., rotation,
cropping) to existing data, helping to address issues related to insufficient data.
Model Complexity Adjustment: Experiment with model architectures and complexity, ensuring a
balance between simplicity and capturing essential patterns to mitigate underfitting and overfitting.
Feature Scaling and Normalization: Standardize and normalize features to a similar scale, preventing
certain features from dominating the learning process and aiding in better convergence.
Data Cleaning: Identify and address data quality issues through thorough data cleaning, handling
outliers, and addressing missing or inaccurate values.
https://www.javatpoint.com/issues-in-machine-learning
Data Preprocessing
➔ Data preprocessing is a process of preparing the raw data and making it
suitable for a machine learning model.
➔ It is the first and crucial step while creating a machine learning model.
➔ Data preprocessing is the task of cleaning the data and making it suitable for a machine learning model; it also increases the accuracy and efficiency of the model.
Data Preprocessing
➔ Getting the dataset
➔ Importing libraries
➔ Importing datasets
➔ Finding Missing Data
➔ Encoding Categorical
Data
➔ Splitting dataset into
training and test set
➔ Feature scaling
Data Preprocessing
➔ Real-world data generally contains noise and missing values, and may be in an unusable format which cannot be directly used for machine learning models.
➔ Data preprocessing is the task of cleaning the data and making it suitable for a machine learning model; it also increases the accuracy and efficiency of the model.
➔ After cleaning and properly formatting the data, we need to scale the data.
➔ Scaling bounds all features of a dataset to a fixed range that is the same for all features; this can improve the accuracy of the machine learning model by a large margin.
➔ For performing data preprocessing using Python we need to import some predefined Python libraries.
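A sketch of the preprocessing steps listed earlier (missing data, categorical encoding, train/test split, feature scaling) using pandas and scikit-learn. The CSV file name, the columns "salary" and "country", and the target "purchased" are assumptions for illustration only.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    # Importing the dataset
    df = pd.read_csv("dataset.csv")                    # placeholder file name

    # Finding and handling missing data (mean imputation for a numeric column)
    df["salary"] = df["salary"].fillna(df["salary"].mean())

    # Encoding categorical data
    df["country"] = LabelEncoder().fit_transform(df["country"])

    # Splitting the dataset into training and test sets
    X = df.drop(columns=["purchased"])                 # assumed target column
    y = df["purchased"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Feature scaling
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)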
k-fold cross validation and bootstrap sampling
What is K-Fold Cross Validation?
➔ Let’s take a general K value. If K = 5, it means we split the given dataset into 5 folds and run the train and test cycle 5 times.
➔ During each run, one fold is used for testing and the rest are used for training; moving on with the iterations, the pictorial representation below gives an idea of the flow for the chosen fold size.
https://www.analyticsvidhya.com/blog/2022/02/k-fold-cross-validation-technique-and-its-essentials/
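A minimal scikit-learn sketch of 5-fold cross-validation; the iris dataset and logistic regression model are illustrative choices, not part of the notes.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # K = 5: each fold is used once as the test set, the other 4 folds for training
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kfold)

    print(scores)          # accuracy of each of the 5 runs
    print(scores.mean())   # average accuracy across folds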
Bootstrap Sampling
https://www.analyticsvidhya.com/blog/2020/02/what-is-bootstrap-sampling-in-statistics-and-machine-learning/
Bootstrap Sampling
What is bootstrap sampling?
➔ The bootstrap sampling method is
a resampling method that uses
random sampling with
replacement.
➔ Bootstrap sampling is used in a
machine learning ensemble
algorithm called bootstrap
aggregating (also called bagging).
➔ It helps in avoiding overfitting and
improves the stability of machine
learning algorithms.
Bootstrap Sampling
What is the advantage of bootstrap sampling?
➔ The advantage of bootstrap sampling is that it allows for robust statistical inference
without relying on strong assumptions about the underlying data distribution.
➔ By repeatedly resampling from the original data, it provides an estimate of the
sampling distribution of a statistic, helping to quantify its uncertainty.
➔ This method is particularly useful when the data is limited or when traditional
parametric methods are not appropriate.
➔ Bootstrap sampling is used in a machine learning ensemble algorithm called
bootstrap aggregating (also called bagging).
➔ It helps in avoiding overfitting and improves the stability of machine learning
algorithms.
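A small NumPy sketch of bootstrap sampling (random sampling with replacement) used to estimate the sampling distribution of the mean; the data values are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.array([12, 15, 14, 10, 18, 20, 11, 16])   # made-up sample

    # Draw many bootstrap samples (same size as the data, with replacement)
    boot_means = []
    for _ in range(1000):
        sample = rng.choice(data, size=len(data), replace=True)
        boot_means.append(sample.mean())

    boot_means = np.array(boot_means)
    print(boot_means.mean())                        # bootstrap estimate of the mean
    print(np.percentile(boot_means, [2.5, 97.5]))   # rough 95% confidence interval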
Overfitting & Underfitting, Bias & Variance
https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/
Overfitting & Underfitting, Bias & Variance
Bias:
https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/
Overfitting & Underfitting, Bias & Variance
Variance:
https://www.javatpoint.com/bias-and-variance-in-machine-learning
https://censius.ai/wiki/overfitting-vs-underfitting
Model Performance & Evaluation
➔ Model performance in general refers to how well a model accomplishes its
intended task, but it is important to define exactly what element of a model
is being considered, and what “doing well” means for that element.
➔ Performance evaluation is the quantitative measure of how well a trained
model performs on specific model evaluation metrics in machine learning.
➔ Two of the most important categories of evaluation methods are
classification and regression model performance metrics.
Classification metrics/ a Confusion matrix
➔ True Positive: You predicted positive, and it’s true(TP).
➔ True Negative: You predicted negative, and it’s true(TN).
➔ False Positive: (Type 1 Error): You predicted positive, and it’s false(FP).
➔ False Negative: (Type 2 Error): You predicted negative, and it’s false(FN).
➔ Sensitivity or Recall: The proportion of actual positive cases which are
correctly identified.
➔ Specificity: The proportion of actual negative cases which are correctly
identified.
Classification metrics/ a Confusion matrix
➔ Precision - percentage of positive cases that
were true positives as opposed to false
positives. Use the formula Precision = TP /
(TP+FP)
➔ Recall - percentage of actual positive cases
that were predicted as positives, as
opposed to those classified as false
negatives. Use the formula Recall =
TP/(TP+FN)
➔ Accuracy - percentage of the total cases that were correctly classified. Use the formula Accuracy = (TP+TN) / (TP+TN+FP+FN)
● Logarithmic loss - measure of how many total errors a model has. The closer to zero, the more correct predictions a model makes in classifications.
● Area under curve - method of visualizing true and false positive rates against each other.
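These metrics can be computed by hand from the confusion-matrix counts or with scikit-learn; the label vectors and scores below are made up for illustration.

    from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, roc_auc_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]      # made-up actual labels
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]      # made-up predicted labels

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("Accuracy   :", (tp + tn) / (tp + tn + fp + fn))
    print("Precision  :", tp / (tp + fp))
    print("Recall     :", tp / (tp + fn))
    print("Specificity:", tn / (tn + fp))

    # The same values via scikit-learn, plus AUC from predicted scores
    print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred), recall_score(y_true, y_pred))
    y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]   # made-up predicted probabilities
    print("AUC:", roc_auc_score(y_true, y_score))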
Regression Metrics
We decompose variability into the sum of squares total (SST), the sum of squares regression (SSR), and the sum of squares error (SSE).
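For an ordinary least-squares fit the decomposition satisfies SST = SSR + SSE, and R² = SSR / SST. A minimal NumPy illustration with made-up numbers:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # made-up predictor
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # made-up response

    # Ordinary least-squares fit y = a + b*x
    b, a = np.polyfit(x, y, 1)
    y_pred = a + b * x

    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
    ssr = np.sum((y_pred - y.mean()) ** 2)   # regression (explained) sum of squares
    sse = np.sum((y - y_pred) ** 2)          # error (residual) sum of squares

    print(sst, ssr + sse)        # equal (up to rounding) for a least-squares fit
    print("R^2:", ssr / sst)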
➔ The unwanted presence of missing and outlier values in machine learning training data often
reduces the accuracy of a trained model or leads to a biased model.
➔ It leads to inaccurate predictions.
➔ Missing: In the case of continuous variables, you can impute the missing values with mean, median,
or mode.
➔ Outlier: You can delete the observations and perform transformations, binning, or imputation (same
as missing values). Alternatively, you can also treat outlier values separately.
Performance Improvement
3. Feature Engineering
➔ This step helps to extract more information from existing data. New information is
extracted in terms of new features.
➔ Feature Transformation: Changing the scale of a variable from the original
scale to a scale between zero and one is a common practice in machine
learning, known as data normalization.
➔ Feature Creation: Deriving new variable(s) from existing variables is known as
feature creation. It helps to unleash the hidden relationship of a data set. Let’s
say we want to predict the number of transactions in a store based on
transaction dates.
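A brief pandas/scikit-learn sketch of both ideas: normalizing a variable to the 0–1 range and deriving new features (for example, day of week) from a transaction-date column. The column names and values are assumptions used only for illustration.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.DataFrame({
        "transaction_date": ["2024-01-05", "2024-01-06", "2024-01-07"],
        "amount": [120.0, 45.0, 300.0],
    })

    # Feature transformation: rescale "amount" to the range [0, 1] (normalization)
    df["amount_scaled"] = MinMaxScaler().fit_transform(df[["amount"]]).ravel()

    # Feature creation: derive new variables from the transaction date
    df["transaction_date"] = pd.to_datetime(df["transaction_date"])
    df["day_of_week"] = df["transaction_date"].dt.dayofweek   # 0 = Monday
    df["is_weekend"]  = (df["day_of_week"] >= 5).astype(int)

    print(df)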
Performance Improvement
4. Feature Selection
➔ a) Feature Selection is a process of finding out the best subset of attributes that
better explains the relationship of independent variables with the target variable.
➔ b) Domain Knowledge: Based on domain experience, we select feature(s) which
may have a higher impact on the target variable.
➔ c) Visualization: As the name suggests, it helps to visualize the relationship between
variables, which makes your variable selection process easier.
➔ d) Statistical Parameters: We also consider the p-values, information values, and
other statistical metrics to select the right features.
➔ e) PCA: It helps to represent training data into lower dimensional spaces but still
characterizes the inherent relationships in the data. It is a type of dimensionality
reduction technique.
Performance Improvement
5. Multiple Algorithms
➔ There are many different algorithms in machine learning, but hitting the right
machine learning algorithm is the ideal approach to achieve higher accuracy.
But, it is easier said than done.
Performance Improvement
6. Algorithm Tuning
Ref. https://www.analyticsvidhya.com/blog/2015/12/improve-machine-learning-results/
Performance Improvement
8. Cross Validation
➔ To find the right answer to this question, we must use the cross-validation
technique.
➔ Cross Validation is one of the most important concepts in data modeling.
➔ It says to try to leave a sample on which you do not train the model and test
the model on this sample before finalizing the model.
Ref. https://www.analyticsvidhya.com/blog/2015/12/improve-machine-learning-results/
Feature Construction and Extraction
➔ Feature extraction/construction is a process through which a set of new features is
created.
What is Feature Extraction?
➔ Feature extraction is the process of identifying and selecting the most important
information or characteristics from a data set.
a. Feature Selection
b. Feature Extraction:
Feature Selection
Feature Selection: Feature selection is a way of selecting the subset of the most
relevant features from the original features set by removing the redundant,
irrelevant, or noisy features.
● Lasso regression performs L1 regularization which adds penalty equivalent to absolute value of
the magnitude of coefficients.
● Ridge regression performs L2 regularization which adds penalty equivalent to square of the
magnitude of coefficients.
For more details and implementation of LASSO and RIDGE regression, you can refer to this article.
Lasso regression performs L1 regularization
Ridge regression performs L2 regularization
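A minimal scikit-learn sketch contrasting L1 (Lasso) and L2 (Ridge) regularization; note how Lasso can drive some coefficients exactly to zero, which is why it is often used for feature selection. The diabetes dataset and alpha values are illustrative choices, not part of the notes.

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.preprocessing import StandardScaler

    X, y = load_diabetes(return_X_y=True)
    X = StandardScaler().fit_transform(X)

    lasso = Lasso(alpha=1.0).fit(X, y)     # L1 penalty: sum of |coefficients|
    ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty: sum of squared coefficients

    print("Lasso coefficients:", lasso.coef_)   # several are exactly 0 -> those features dropped
    print("Ridge coefficients:", ridge.coef_)   # shrunk toward 0 but rarely exactly 0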
Random Forest Importance
Feature importance in Random Forest is determined through the following process:
Gini Importance:
Random Forest calculates the importance of a feature by looking at how much the tree nodes that use
that feature reduce impurity across all trees in the forest. This is known as Gini importance.
○ The Gini importance of a feature is the average of how much each tree node that uses that
feature reduces impurity across all trees in the forest.
The mean decrease in impurity (MDI) is another method used to determine feature importance in
Random Forest.
○ For each tree in the forest, the algorithm records how much each feature decreases the
impurity (typically measured by Gini impurity or entropy) as the data is split across the nodes.
The average decrease in impurity across all trees is then used as the feature importance.
Random Forest Importance
Feature Importance Calculation:
○ Once the Gini importance or MDI is calculated for each feature, the importances are
normalized so that they sum to 1. This helps in comparing the relative importance of
different features.
Application:
○ Feature importance scores can then be used to identify the most influential features
in the model, allowing for insights into which features are most informative for making
predictions.
In conclusion, Random Forest determines feature importance by evaluating how much each
feature contributes to reducing impurity across the ensemble of trees, and then normalizing
these importances to provide a relative ranking of feature importance.
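A short scikit-learn sketch of reading the impurity-based (Gini/MDI) importances from a fitted random forest; the iris dataset is just an illustrative choice.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(data.data, data.target)

    # Impurity-based importances, normalized so they sum to 1
    for name, importance in zip(data.feature_names, model.feature_importances_):
        print(f"{name}: {importance:.3f}")
    print("sum =", model.feature_importances_.sum())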
Features Selection Techniques
Dimensionality Reduction:
➔ Principal Component Analysis (PCA): Reduces the dimensionality of the feature
space while retaining most of the variation in the data.
➔ Linear Discriminant Analysis (LDA): Maximizes the separation between classes by
finding the linear combinations of features that best represent the classes.
Hybrid Methods:
➔ Recursive Feature Elimination (RFE): Combines both wrapper and embedded
methods by recursively fitting a model and removing the least important feature until
the desired number of features is reached.
Each technique has its strengths and weaknesses, and the choice of method depends on
the specific characteristics of the dataset and the machine learning task at hand.
Features Selection Techniques
Feature Selection Techniques
Feature selection is a critical step in the machine learning pipeline that involves choosing a subset of
relevant features for model training. Here are several popular feature selection techniques:
Filter Methods:
➔ Variance Threshold: Remove features with low variance as they contain little information.
➔ Correlation-based Feature Selection: Identify and remove highly correlated features to reduce
redundancy.
Wrapper Methods:
➔ Forward Selection: Start with an empty set of features and iteratively add the feature that improves
the model performance the most.
➔ Backward Elimination: Begin with all features and remove the least significant feature at each
iteration.
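A compact sketch of a filter method (variance threshold) and a wrapper-style method (recursive feature elimination) in scikit-learn; the dataset and parameter values are illustrative assumptions.

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE, VarianceThreshold
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # Filter method: drop features whose variance is below a threshold
    X_filtered = VarianceThreshold(threshold=0.05).fit_transform(X)
    print("after variance threshold:", X_filtered.shape)

    # Wrapper-style method: recursively eliminate the least important features
    X_scaled = StandardScaler().fit_transform(X)
    rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
    rfe.fit(X_scaled, y)
    print("selected features:", rfe.support_.sum())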
Binomial Distribution
Binomial Distribution:
● A discrete probability distribution that gives the probability of only two possible outcomes in n independent trials is known as a Binomial Distribution.
Example:
● Number of tails in flipping a coin n times.
● The number of times getting 1 on throwing a dice.
Conditions of Binomial Distribution:
● The experiment consists of n identical trials.
● Each trial is independent.
● Each trial results in one of the two possible outcomes, i.e., success or failure.
● The probability of success remains constant throughout the experiment.
https://www.shiksha.com/online-courses/articles/binomial-distribution-definition-and-examples/
Mathematical Definition: Binomial Distribution
If X is the number of successes in n independent trials, each with probability of success p, then
P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), k = 0, 1, 2, …, n, where C(n, k) = n! / (k!(n − k)!).
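A quick check of this formula in Python with scipy.stats; the n, p, and k values below are arbitrary examples (p = 0.65 and n = 10 echo the survival example that follows).

    from math import comb
    from scipy.stats import binom

    n, p = 10, 0.65          # e.g., 10 trials, probability of success 0.65
    k = 7

    # PMF directly from the formula and via scipy - they agree
    manual = comb(n, k) * p**k * (1 - p)**(n - k)
    print(manual, binom.pmf(k, n, p))

    # Probability of at least 7 successes: P(X >= 7) = 1 - P(X <= 6)
    print(1 - binom.cdf(6, n, p))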
Mathematical Definition: Binomial Distribution Examples
1. The probability that a man aged 60 will live up to 70 is 0.65. Out of 10 men now aged 60, find the probability:
I. 3 boys
II. 5 girls
III. Either 2 or 3 boys
IV. At least 2 girls
Ref. https://www.youtube.com/watch?v=Nhw5qGKZm5o
Mathematical Definition: Binomial Distribution Examples
3. 4 coins are tossed 100 times and the following frequencies of the number of heads are obtained:
Number of heads | Frequency
0 | 5
1 | 29
2 | 36
3 | 25
4 | 5
● can be used to model the number of times an event occurs in an interval of time
or space
● is a discrete probability distribution
● expresses the probability of a given number of events occurring in a fixed interval of
time or space
● events occur at a constant mean rate
● events occur independently of the time since the last event
● can be applied to measures such as distance, area, and volume
Mathematical Definition: Poisson Distribution
A Poisson distribution with parameter λ (lambda), where λ is known as the mean and λ = np, gives the probability of k events as:
P(X = k) = (λ^k · e^(−λ)) / k!, k = 0, 1, 2, …
Mathematical Definition: Poisson Distribution example
A Poisson distribution:
Ref. https://www.youtube.com/watch?v=raySzcgcBFk
Bernoulli Distribution:
Bernoulli distribution is a special case of the Binomial Distribution where the random experiment is performed only once (n = 1).
● Success (1)
● Failure (0)
In summary, the Central Limit Theorem provides theoretical support for the
effectiveness of Monte Carlo approximation techniques in machine learning,
enabling practitioners to leverage random sampling to estimate solutions to
challenging problems accurately.
Bayes Theorem, Prior and Posterior Probability, Likelihood
Bayes Theorem:
● One of the most recent developments in artificial intelligence is machine learning.
● The Bayes theorem is frequently referred to as the Bayes rule or Bayes Law.
● One of the most well-known theories in machine learning, the Bayes theorem helps
determine the likelihood that one event will occur with unclear information while
another has already happened.
Bayes Theorem, Prior and Posterior Probability, Likelihood
● Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B).
● Posterior, P(A|B), is described as a revised probability that takes the available evidence into account.
● Likelihood, P(B|A), is the probability of observing the evidence B given that the hypothesis A is true.
● P(A) is referred to as the prior probability, or the probability of a hypothesis before taking the evidence into account.
● P(B) is referred to as the marginal probability. It is described as the probability of the evidence taken into account.
● The Bayes theorem has a wide range of uses in machine learning, making it one of the most
popular approaches among all algorithms for classification-related issues.
● Evidence: P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
Bayes Theorem, Prior and Posterior Probability, Likelihood
B = Contains offer
P(contains offer) =
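Continuing the "contains offer" idea, a tiny sketch of applying Bayes' theorem to estimate P(spam | contains "offer"); all of the probabilities below are made-up values used only to show the calculation.

    # Made-up probabilities for illustration
    p_spam = 0.3                    # prior: P(spam)
    p_offer_given_spam = 0.8        # likelihood: P(contains "offer" | spam)
    p_offer_given_not_spam = 0.1    # P(contains "offer" | not spam)

    # Evidence: P(contains "offer") = P(B|A)P(A) + P(B|not A)P(not A)
    p_offer = p_offer_given_spam * p_spam + p_offer_given_not_spam * (1 - p_spam)

    # Posterior: P(spam | contains "offer")
    p_spam_given_offer = p_offer_given_spam * p_spam / p_offer
    print(p_offer, p_spam_given_offer)   # 0.31, ~0.774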
BBNs provide a compact and interpretable way to encode complex probabilistic dependencies and
perform probabilistic inference.
Key Points:
Diagnosis and Decision Support: BBNs are used for medical diagnosis, fault diagnosis, and decision
support systems.
Risk Analysis: Assessing risks and uncertainties in various domains such as finance, engineering,
and environmental science.
Natural Language Processing: BBNs are applied in tasks like language modeling, information
retrieval, and sentiment analysis.
Advantages:
Modularity: BBNs provide a modular representation that allows easy incorporation of domain
knowledge.
Uncertainty Handling: BBNs naturally handle uncertainty and missing data through probabilistic
inference.
Interpretability: BBNs offer intuitive graphical representations that facilitate understanding and
interpretation of complex probabilistic relationships.
Bayesian Belief Network
Bayesian Belief Network
✔ A Bayesian belief network is a key method for dealing with probabilistic events and for solving problems that involve uncertainty.
✔ A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph.
✔ It is also called a Bayes network, belief network, decision network, or Bayesian model.
✔ Bayesian networks are probabilistic, because these networks are built from a probability
distribution.
Application
✔ Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network.
✔ It can also be applied in various tasks including prediction, anomaly detection, diagnostics,
reasoning, time series prediction, and decision making under uncertainty.
Bayesian Belief Network cont..
Bayesian Belief Network(Directed Acyclic Graph or DAG)
✔ Each node corresponds to the random variables, and a variable can be
continuous or discrete.
✔ Arc or directed arrows represent the causal relationship or conditional
probabilities between random variables.
✔ These directed links or arrows connect
the pair of nodes in the graph.
✔ In the diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.
✔ If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.
✔ Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)), which determines the effect of the parent on that node.
✔ Bayesian network is based on Joint probability distribution and conditional
probability.
Bayesian Belief Network cont..
Joint probability distribution
✔ If we have variables x1, x2, x3,....., xn, then the probabilities of different combination of
x1, x2, x3.. xn, are known as Joint probability distribution.
✔ The joint probability distribution of the variables is written as
P[x1, x2, x3,....., xn]
= P[x1| x2, x3,....., xn]P[x2, x3,....., xn]
= P[x1| x2, x3,....., xn]P[x2|x3,....., xn]....P[xn-1|xn]P[xn].
✔ In general, P(Xi|Xi-1,........., X1) = P(Xi |Parents(Xi ))
● Supervised learning uses a training set to teach models to yield the desired output. This training
dataset includes inputs and correct outputs, which allow the model to learn over time.
● The algorithm measures its accuracy through the loss function, adjusting until the error has been
sufficiently minimized.
● Classification and
● Regression: https://www.ibm.com/topics/supervised-learning
Classification
Classification
● linear classifiers,
● support vector machines (SVM),
● decision trees,
● k-nearest neighbor, and
● random forest
https://www.ibm.com/topics/supervised-learning
Regression Supervised learning algorithms
Regression
https://www.ibm.com/topics/supervised-learning
Supervised learning algorithms
Neural networks:
● Primarily leveraged for deep learning algorithms, neural networks process
training data by mimicking the interconnectivity of the human brain through
layers of nodes.
● Each node is made up of inputs, weights, a bias (or threshold), and an
output.
● If that output value exceeds a given threshold, it “fires” or activates the
node, passing data to the next layer in the network.
● Neural networks learn this mapping function through supervised learning,
adjusting based on the loss function through the process of gradient
descent.
● When the cost function is at or near zero, we can be confident in the
model’s accuracy to yield the correct answer.
https://www.ibm.com/topics/supervised-learning
Supervised learning algorithms
Naive bayes:
● Naive Bayes is a classification approach that adopts the principle of class conditional independence from the Bayes Theorem.
● This means that the presence of one feature does not impact the presence of
another in the probability of a given outcome, and each predictor has an equal
effect on that result.
● There are three types of Naïve Bayes classifiers: Multinomial Naïve Bayes,
Bernoulli Naïve Bayes, and Gaussian Naïve Bayes.
● This technique is primarily used in text classification, spam identification, and
recommendation systems.
https://www.ibm.com/topics/supervised-learning
Supervised learning algorithms
Support vector machines (SVM):
● A support vector machine is a popular supervised learning model developed
by Vladimir Vapnik, used for both data classification and regression.
● That said, it is typically leveraged for classification problems, constructing a
hyperplane where the distance between two classes of data points is at its
maximum.
● This hyperplane is known as the decision boundary, separating the classes of
data points (e.g., oranges vs. apples) on either side of the plane.
https://www.ibm.com/topics/supervised-learning
Supervised learning algorithms
K-nearest neighbor:
● K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm
that classifies data points based on their proximity and association to other available
data.
● This algorithm assumes that similar data points can be found near each other.
● As a result, it seeks to calculate the distance between data points, usually through
Euclidean distance, and then it assigns a category based on the most frequent
category or average.
● Its ease of use and low calculation time make it a preferred algorithm by data
scientists, but as the test dataset grows, the processing time lengthens, making it
less appealing for classification tasks.
● KNN is typically used for recommendation engines and image recognition.
https://www.ibm.com/topics/supervised-learning
Business applications of Supervised learning
● Image- and object-recognition: Supervised learning algorithms can be used to locate,
isolate, and categorize objects out of videos or images, making them useful when applied to
various computer vision techniques and imagery analysis.
● Predictive analytics: A widespread use case for supervised learning models is in creating
predictive analytics systems to provide deep insights into various business data points. This
allows enterprises to anticipate certain results based on a given output variable, helping
business leaders justify decisions or pivot for the benefit of the organization.
● Customer sentiment analysis: Using supervised machine learning algorithms,
organizations can extract and classify important pieces of information from large volumes of
data—including context, emotion, and intent—with very little human intervention. This can be
incredibly useful when gaining a better understanding of customer interactions and can be
used to improve brand engagement efforts.
● Spam detection: Spam detection is another example of a supervised learning model. Using
supervised classification algorithms, organizations can train databases to recognize patterns
or anomalies in new data to organize spam and non-spam-related correspondences
effectively.
https://www.ibm.com/topics/supervised-learning
Supervised learning challenges:
The following are some of these challenges:
● Supervised learning models can require certain levels of expertise to structure
accurately.
● Training supervised learning models can be very time intensive.
● Datasets can have a higher likelihood of human error, resulting in algorithms
learning incorrectly.
● Unlike unsupervised learning models, supervised learning cannot cluster or
classify data on its own.
https://www.ibm.com/topics/supervised-learning
Introduction to K-Nearest Neighbor (KNN) Algorithm
Lesson Title: Introduction to K-Nearest Neighbor (KNN) Algorithm
Objective:
https://towardsdatascience.com/how-to-find-the-optimal-value-of-k-in-knn-35d936e554eb
Distance Metrics of K-Nearest Neighbor
https://towardsdatascience.com/how-to-find-the-optimal-value-of-k-in-knn-35d936e554eb
Selection of Hyperparameter(k) of KNN algorithm
● The value of k in KNN determines the number of neighbors
considered for classification.
● Choosing the appropriate value of k is crucial and depends on
factors such as the dataset size, complexity, and noise level.
● A smaller value of k may lead to a more flexible decision
boundary but may be sensitive to noise, while a larger value of k
may lead to smoother decision boundaries but could over-smooth
the classification.
● The intuition behind KNN revolves around the idea of classifying
instances based on the collective information provided by their
nearest neighbors.
How to select the optimal K value
● Derive a plot between error rate and K denoting values in a
defined range. Then choose the K value as having a minimum
error rate.
KNN algorithm
1. Load the data
2. Set the value for K which equals to number of
neighbors
3. Calculate the distance between input sample and
training samples
4. Add distances to data structure
5. Sort the data structure including all the distances
in ascending order (smallest value first)
6. Select the corresponding label according to
majority of K nearest neighbors
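A minimal from-scratch sketch of the steps listed above, using Euclidean distance and majority voting; the training points, labels, and K value are made up for illustration.

    from collections import Counter
    import math

    # 1-2. Load (made-up) labelled data and set K
    train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((3.0, 4.0), "B"), ((5.0, 7.0), "B")]
    k = 3
    query = (2.0, 3.0)

    # 3-4. Compute the Euclidean distance from the query to every training sample
    distances = []
    for point, label in train:
        d = math.dist(query, point)
        distances.append((d, label))

    # 5. Sort the distances in ascending order and keep the K nearest neighbours
    distances.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in distances[:k]]

    # 6. Majority vote among the K nearest neighbours
    print(Counter(nearest).most_common(1)[0][0])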
Decision Tree
Objective
https://www.youtube.com/watch?v=coOTEc-0OGw
Decision Tree: ID3
https://www.youtube.com/watch?v=coOTEc-0OGw
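ID3 chooses the attribute with the highest information gain, i.e., the largest reduction in entropy after a split. A small sketch of those two quantities on a made-up weather-style dataset (not the example from the referenced video):

    from collections import Counter
    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def information_gain(rows, labels, attr_index):
        base = entropy(labels)
        remainder = 0.0
        for value in set(row[attr_index] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr_index] == value]
            remainder += (len(subset) / len(labels)) * entropy(subset)
        return base - remainder

    # Made-up data: each row is (outlook, windy), label is play yes/no
    rows   = [("sunny", "false"), ("sunny", "true"), ("rain", "false"), ("rain", "true"), ("overcast", "false")]
    labels = ["no", "no", "yes", "no", "yes"]

    print(information_gain(rows, labels, 0))   # gain of splitting on outlook
    print(information_gain(rows, labels, 1))   # gain of splitting on windy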
CART Algorithm and Example
Similarly…
https://www.youtube.com/watch?v=xyDv3DLYjfM
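CART picks splits that minimize the weighted Gini impurity of the child nodes. A minimal sketch of that computation; the class counts in the two child nodes are made up for illustration.

    def gini(labels):
        total = len(labels)
        if total == 0:
            return 0.0
        counts = {}
        for lab in labels:
            counts[lab] = counts.get(lab, 0) + 1
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    def weighted_gini(left_labels, right_labels):
        n = len(left_labels) + len(right_labels)
        return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

    # Made-up split: 6 samples go left, 4 go right
    left  = ["yes", "yes", "yes", "yes", "no", "no"]
    right = ["no", "no", "no", "yes"]

    print(gini(left), gini(right))        # impurity of each child node
    print(weighted_gini(left, right))     # the value CART minimizes over candidate splits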
Introduction to Ensemble Learning
Ensemble learning refers to algorithms that combine the predictions from two or more models.
Key principles of ensemble methods:
● Diversity: Ensemble models should have different biases and learn from different parts of
the data to reduce errors.
● Independence: The models in the ensemble should be trained independently to ensure
diversity.
● Aggregation: Combining the predictions of individual models to make a final prediction
using techniques like averaging or voting.
“Standard” ensemble learning strategies:
1) Bagging.
2) Stacking.
3) Boosting.
Ref. https://machinelearningmastery.com/tour-of-ensemble-learning-algorithms/
Introduction to Bagging
Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly
used to reduce variance within a noisy data set.
1. Banking: Banking sector mostly uses this algorithm for the identification of
loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Ref.
https://www.javatpoint.com/machine-learning-random-forest-algorithm
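A brief scikit-learn sketch of bagging (bootstrap aggregating) with decision trees, which is essentially what a random forest builds on; the dataset and hyperparameters are illustrative assumptions.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Bagging: many trees, each trained on a bootstrap sample, predictions combined by voting
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
    forest  = RandomForestClassifier(n_estimators=100, random_state=0)

    print("bagging      :", cross_val_score(bagging, X, y, cv=5).mean())
    print("random forest:", cross_val_score(forest, X, y, cv=5).mean())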
Introduction to Support Vector Machine
● A support vector machine (SVM) is a supervised
machine learning algorithm that classifies data by
finding an optimal line or hyperplane that
maximizes the distance between each class in an
N-dimensional space.
● SVMs were developed in the 1990s by Vladimir N. Vapnik and his colleagues, who published this work in a 1995 paper titled "Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing".
● The SVM algorithm is widely used in machine learning as it can handle both linear and nonlinear classification tasks.
● The importance of SVM in machine learning is that it has the ability to handle high-dimensional data and complex decision boundaries.
Mathematical Foundations of SVM
● Suppose there are n-dimensional sample
vectors in a region, then there is
Ref.
https://www.ibm.com/topics/support-vector-machine
https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f
SVM Types
Non-Linear SVM:
● Polynomial kernel
● Radial basis function kernel (also known as a Gaussian or RBF
kernel)
● Sigmoid kernel
SVM: kernel trick
● The kernel trick provides a
solution to this problem. The “trick”
is that kernel methods represent the
data only through a set of pairwise
similarity comparisons between the
original data observations x (with
the original coordinates in the lower
dimensional space), instead of
explicitly applying the
transformations ϕ(x) and
representing the data by these
transformed coordinates in the
higher dimensional feature space.
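A compact scikit-learn sketch comparing a linear SVM with an RBF-kernel SVM on data that is not linearly separable; the synthetic dataset and hyperparameters are illustrative choices.

    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Two concentric circles: impossible to separate with a straight line
    X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X_train, y_train)
    rbf_svm    = SVC(kernel="rbf", gamma=2.0).fit(X_train, y_train)   # kernel trick: implicit feature map

    print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
    print("RBF kernel accuracy   :", rbf_svm.score(X_test, y_test))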
Application of SVM:
Some common applications of SVM are-
● Face detection – SVMs classify parts of the image as a face and non-face and create a square
boundary around the face.
● Text and hypertext categorization – SVMs allow Text and hypertext categorization for both
inductive and transductive models. They use training data to classify documents into different
categories. It categorizes on the basis of the score generated and then compares with the threshold
value.
● Classification of images – Use of SVMs provides better search accuracy for image classification. It
provides better accuracy in comparison to the traditional query-based searching techniques.
● Bioinformatics – It includes protein classification and cancer classification. We use SVM for
identifying the classification of genes, patients on the basis of genes and other biological problems.
● Protein fold and remote homology detection – Apply SVM algorithms for protein remote
homology detection.
● Handwriting recognition – SVMs are widely used to recognize handwritten characters.
● Generalized predictive control(GPC) – Use SVM based GPC to control chaotic dynamics with
useful parameters.
Simple Linear Regression
● Simple linear regression aims to
find a linear relationship to describe
the correlation between an
independent and possibly
dependent variable.
● The regression line can be used to
predict or estimate missing values,
this is known as interpolation.
Simple Linear Regression
● Linear regression is defined as a
statistical method used for modeling
the relationship between a dependent
variable (target) and one or more
independent variables (features).
● The equation of simple linear regression is y = a + bx + ε.
● The intercept is represented by a.
● The slope is represented by b.
● The error term ε is assumed to be normally distributed with mean zero and constant variance.
Example of Linear Regression
Assumptions of simple linear regression:
● Linearity: The relationship between the dependent and independent variables is linear.
● Independence: Observations are independent of each other.
● Homoscedasticity: The variance of the error term is constant across all levels of the independent variable.
● Normality: The error term follows a normal distribution.
Real-world applications of simple linear regression in various domains include:
● Predicting sales based on advertising expenditure
● Estimating house prices based on square footage
● Analyzing the relationship between temperature and energy consumption
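A minimal sketch fitting y = a + bx with scikit-learn and using the fitted line for interpolation; the advertising-spend/sales numbers are made up for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up data: advertising spend (x) vs. sales (y)
    x = np.array([[10], [20], [30], [40], [50]])
    y = np.array([25, 45, 62, 83, 101])

    model = LinearRegression().fit(x, y)
    print("intercept a:", model.intercept_)
    print("slope b    :", model.coef_[0])

    # Interpolation: estimate sales for an unseen spend of 35
    print(model.predict([[35]]))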
Multiple linear regression (MLR)
● Multiple linear regression (MLR) is used to determine a mathematical relationship among several random variables.
● In other terms, MLR examines how multiple independent variables are related to one dependent variable.
● X1, X2, X3, etc. are the values of independent predictor variables (i.e., risk factors), and
● b1, b2, b3, etc. are the coefficients for each risk factor.
● Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable; the model can be written as y = b0 + b1X1 + b2X2 + … + bnXn + ε, where ε is the error term.
Ref. https://tutorialforbeginner.com/linear-regression-vs-logistic-regression-in-machine-learning
Logistic Regression
● Logistic regression models the probability of a binary outcome by passing a linear combination of the inputs through the sigmoid function σ(x) = 1 / (1 + e^(−x)), where e is the base of the natural log and x corresponds to the real numerical value you want to transform.
Basics of Unsupervised Learning
● Unsupervised learning, also known
as unsupervised machine learning,
uses machine learning (ML)
algorithms to analyze and cluster
unlabeled data sets.
● These algorithms discover hidden
patterns or data groupings without
the need for human intervention.
● Unsupervised learning models are
utilized for three main
tasks—clustering, association, and
dimensionality reduction.
Basics of Unsupervised Learning
● Unsupervised learning models are
utilized for three main
tasks—clustering, association, and
dimensionality reduction.
● Clustering is a data mining
technique which groups unlabeled
data based on their similarities or
differences. Clustering algorithms
are used to process raw,
unclassified data objects into
groups represented by structures
or patterns in the information.
K-means clustering
● K-means clustering is a common example of an
exclusive clustering method where data points are
assigned into K groups, where K represents the
number of clusters based on the distance from
each group’s centroid.
● The data points closest to a given centroid will be
clustered under the same category.
● A larger K value will be indicative of smaller
groupings with more granularity whereas a smaller
K value will have larger groupings and less
granularity.
● K-means clustering is commonly used in market
segmentation, document clustering, image
segmentation, and image compression.
Ref. https://www.ibm.com/topics/unsupervised-learning
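A short scikit-learn sketch of K-means with K = 3 on synthetic data; the data generation and the choice of K are illustrative, not part of the notes.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic unlabeled data drawn around 3 centres
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)          # cluster index assigned to each point

    print(kmeans.cluster_centers_)          # the 3 learned centroids
    print(labels[:10])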
Hierarchical clustering(HC)
There are two types of HC: agglomerative and divisive.
In the agglomerative (bottom-up) approach, we start by considering each data point as a cluster and, in each iteration, merge the closest clusters until we have one cluster. In other words, it starts clustering by treating the individual data points as single clusters, which are then merged continuously based on similarity until one big cluster containing all objects is formed.
Ref. https://medium.com/leukemiaairesearch/clustering-techniques-with-gene-expression-data-4b35a04f87d5
Hierarchical clustering(HC)
● Agglomerative clustering requires a
definition of linkage, i.e. how to calculate
the distance between two clusters in the
case where a cluster contains more than
one sequence.
● The most commonly used definitions are
minimum distance (single linkage),
maximum distance (complete linkage),
and average linkage (also called UPGMA
(Unweighted Pair Group Method with
Arithmetic Mean)).
Ref. https://medium.com/leukemiaairesearch/clustering-techniques-with-gene-expression-data-4b35a04f87d5
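A small SciPy sketch of agglomerative clustering with a chosen linkage definition; the points are made up, and single or complete linkage can be swapped in via the method argument.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    # Made-up 2-D observations
    X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.2], [5.1, 4.9], [9.0, 9.1]])

    # Build the merge tree with average linkage ("single" / "complete" work the same way)
    Z = linkage(X, method="average")

    # Cut the tree into 2 flat clusters
    print(fcluster(Z, t=2, criterion="maxclust"))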
Hierarchical clustering(HC)
In divisive clustering(top-down
approach) all the data points
belong to a single cluster, at each
iteration we split the farthest point
until each cluster contains a
unique observation. In other
words,
Ref. https://medium.com/leukemiaairesearch/clustering-techniques-with-gene-expression-data-4b35a04f87d5
Agglomerative clustering Examples ..
Ref. https://medium.com/leukemiaairesearch/clustering-techniques-with-gene-expression-data-4b35a04f87d5
Agglomerative clustering Examples ..
Ref. https://medium.com/leukemiaairesearch/clustering-techniques-with-gene-expression-data-4b35a04f87d5
Understanding Association in Machine Learning
The importance of association analysis in real-world applications are market
basket analysis, recommendation systems, and healthcare analytics.
Apriori algorithm is one of the most popular algorithms for association rule mining.
Association Rule Mining – Apriori Algorithm
https://www.youtube.com/watch?v=43CMKRHdH30
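The core quantities behind Apriori are support and confidence. A tiny sketch computing them for one candidate rule on a made-up set of transactions (not the example from the referenced video):

    transactions = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"milk", "butter", "bread"},
        {"milk"},
        {"bread", "milk", "butter"},
    ]

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Rule: {bread, milk} -> {butter}
    antecedent = {"bread", "milk"}
    both = antecedent | {"butter"}

    print("support({bread, milk}) =", support(antecedent))
    print("support(rule)          =", support(both))
    print("confidence             =", support(both) / support(antecedent))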
Activation Functions
● The biological equivalent of a neuron becomes a node, connected at its input
end and output end to other nodes, just like synapses in a brain.
● The chemical signals become mathematical values given to the input, and the
output chemical signals become mathematical in nature as well.
● As for the threshold that holds back the neuron firing until it receives the
appropriate input, that is converted into a bias.
● The bias is a numerical value that all of the inputs must exceed in order for the
node to output, and if so, how large the value of that output.
● As the network “learns” each input to a node is weighted with different
modifying values that increase or decrease how each individual input is
calculated when combined then compared to the bias.
● These weights change each time the network learns, as well as the bias they
are compared to.
Activation Functions
Why we use Activation functions with Neural Networks?
● It is used to determine the output of neural network like yes or no. It maps the
resulting values in between 0 to 1 or -1 to 1 etc. (depending upon the function).
Ref. https://medium.com/@BenDosch/ml-activation-functions-f851fd6334d2
Types of Activation Functions
ReLU (Rectified Linear Unit)
Activation Function
Ref. https://medium.com/@BenDosch/ml-activation-functions-f851fd6334d2
Types of Activation Functions
Leaky ReLU
● It is an attempt to solve the dying ReLU problem
Ref. https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
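A minimal NumPy sketch of the two functions described above; the negative-slope value 0.01 for Leaky ReLU is a common default, not a requirement.

    import numpy as np

    def relu(x):
        # Outputs 0 for negative inputs, x otherwise
        return np.maximum(0, x)

    def leaky_relu(x, alpha=0.01):
        # Small slope alpha for negative inputs helps avoid "dead" neurons
        return np.where(x > 0, x, alpha * x)

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(relu(x))         # [0.  0.  0.  0.5 3. ]
    print(leaky_relu(x))   # [-0.03  -0.005  0.  0.5  3. ]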
PERCEPTRON and Its Application in Machine Learning
● In Machine Learning,
PERCEPTRON is considered as a
single-layer neural network
1. What is perceptron?
a) a single layer feed-forward neural network with
pre-processing
b) an auto-associative neural network
c) a double layer auto-associative neural network
d) a neural network that contains feedback
Test Your Skills: MCQ on PERCEPTRON
2. A perceptron is a _________
A) Backtracking algorithm
B) Backpropagation algorithm
C) Feed-forward neural network
D) Feed Forward-backward algorithm
Test Your Skills: MCQ on PERCEPTRON
● McCulloch–Pitt neuron
allows binary activation (1
ON or 0 OFF), i.e., it either
fires with an activation 1 or
does not fire with an
activation of 0.
McCulloch-Pitts Neuron: AND function
https://medium.com/analytics-vidhya/mp-neuron-and-perceptron-model-with-sample-code-c2189edebd3f
McCulloch-Pitts Neuron: OR function
McCulloch-Pitts Neuron: OR function
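A tiny sketch of a McCulloch-Pitts neuron with binary inputs and a threshold: the AND function fires only when the sum of inputs reaches 2, the OR function when it reaches 1. This mirrors the idea above in code rather than reproducing the referenced figures.

    def mp_neuron(inputs, threshold):
        # Fires (outputs 1) only if the sum of binary inputs meets the threshold
        return 1 if sum(inputs) >= threshold else 0

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2,
                  "AND:", mp_neuron((x1, x2), threshold=2),
                  "OR:",  mp_neuron((x1, x2), threshold=1))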
ANN Architecture
● Interconnection can be defined as the way processing elements (Neuron) in
ANN are connected to each other. Hence, the arrangements of these
processing elements and geometry of interconnections are very essential
in ANN.
● These arrangements always have two layers that are common to all
network architectures, the Input layer and output layer where the input
layer buffers the input signal, and the output layer generates the output of
the network.
● The third layer is the Hidden layer, in which neurons are neither kept in the
input layer nor in the output layer. These neurons are hidden from the
people who are interfacing with the system and act as a black box to them.
● By increasing the hidden layers with neurons, the system’s computational
and processing power can be increased but the training phenomena of the
system get more complex at the same time.
https://www.geeksforgeeks.org/introduction-to-ann-set-4-network-architectures/
ANN Architecture
There exist five basic types of neuron connection architecture:
1. Single-layer feed-forward network
2. Multilayer feed-forward network
3. Single node with its own feedback
4. Single-layer recurrent network
5. Multilayer recurrent network
https://www.geeksforgeeks.org/introduction-to-ann-set-4-network-architectures/
ANN Architecture
Ref. https://www.linkedin.com/pulse/demystifying-forward-propagation-neural-networks-real-world-v/
Learning Process in ANN: Forward Propagation Steps
Forward propagation involves several key steps:
1. Input Layer: The process begins with the input layer, where data is fed into
the neural network.
2. Weighted Sum: Each connection between neurons in adjacent layers has
an associated weight. Forward propagation computes the weighted sum of
inputs.
3. Bias Addition: A bias term is added to the weighted sum. This helps in
shifting the activation function's input and introducing non-linearity.
4. Activation Function: The weighted sum plus bias is passed through an
activation function, which introduces non-linearity into the network and
determines the neuron's output.
5. Output Layer: This process is repeated for each layer in the network until
the final output layer is reached.
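A compact NumPy sketch of one forward pass through a single hidden layer, following the steps above (weighted sum, bias addition, activation, output layer); the weights and inputs are random placeholders, not a trained network.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)

    x  = rng.random(3)            # 1. input layer: 3 features
    W1 = rng.random((4, 3))       # weights: input -> hidden (4 neurons)
    b1 = rng.random(4)            # hidden-layer biases
    W2 = rng.random((1, 4))       # weights: hidden -> output
    b2 = rng.random(1)            # output bias

    # 2-4. weighted sum + bias, then activation for the hidden layer
    hidden = sigmoid(W1 @ x + b1)

    # 5. repeat for the output layer
    output = sigmoid(W2 @ hidden + b2)
    print(output)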
Learning Process in ANN: Backward Propagation
Backpropagation is an algorithm used in artificial intelligence (AI) to
fine-tune mathematical weight functions and improve the accuracy of an
artificial neural network’s outputs.
Optimization Algorithms
Adadelta:
● Adadelta is an extension of AdaGrad that addresses its aggressive,
monotonically decreasing learning rates. It uses a sliding window of
gradients to compute an exponentially decaying average, and it
updates the parameters based on the root mean square of recent
updates.
Nadam (Nesterov-accelerated Adaptive Moment Estimation):
● Nadam is an extension of Adam that incorporates Nesterov
momentum into its optimization process. It combines the advantages
of Nesterov momentum and Adam's adaptive learning rate
Optimization Algorithms
● The fundamental assumption behind using HMMs for modeling sequential data is
that there exists an underlying sequence of hidden states that generates the
observed data.
● In HMMs, observations are the data points that we can directly observe or
measure. These observations could be discrete symbols (e.g., words in a
sentence, nucleotides in a DNA sequence) or continuous values (e.g., sensor
readings, stock prices).
● Hidden Markov Models assume the existence of a sequence of hidden states that
are not directly observable. These hidden states represent the underlying structure
or dynamics of the system generating the observed data.
HMM: Markov chain
● Sunny – Rainy (Tuesday) – Cloudy (Wednesday): The probability of a cloudy Wednesday along this path can be calculated as 0.1 x 0.3 = 0.03.
● Sunny – Cloudy (Tuesday) – Cloudy (Wednesday): The probability of a cloudy Wednesday along this path can be calculated as 0.4 x 0.1 = 0.04.
As shown above, the Markov chain is a process with a known finite number of states in which the probability of being in a particular state is determined only by the previous state.