ML Notes All

Human Learning

● Human learning refers to the process through which individuals acquire new
knowledge, skills, behaviors, or attitudes, resulting in a relatively permanent
change in their capabilities.

Types of Human Learning

1. Classical Conditioning:
➔ Involves learning associations between stimuli and responses.
2. Operant Conditioning:
➔ Learning occurs through reinforcement (increasing behavior) or punishment (decreasing behavior).
3. Observational Learning:
➔ Learning by observing others.
4. Cognitive Learning:
➔ Emphasizes mental processes like thinking, memory, problem-solving.
5. Associative Learning:
➔ Involves forming associations or connections between stimuli or events. Classical and operant conditioning are examples of associative learning.
6. Insight Learning:
➔ Sudden realization or understanding of a problem without prior experience. Often associated with problem-solving.
7. Experiential Learning:
➔ Learning through direct experience, reflection, and active engagement.
8. Vicarious Learning:
➔ Learning by experiencing the consequences of others' actions. Linked to observational learning and empathy.
9. Rote Learning:
➔ Memorization of information through repetition. Common in early education for basic facts and figures.
10. Latent Learning:
➔ Latent learning highlights the idea that individuals can acquire information without immediately displaying it in their behavior.
11. Social Learning:
➔ Emphasizes the role of social interactions in learning.
12. Habituation and Sensitization:
➔ Habituation involves a decrease in response to a repeated stimulus, while sensitization involves an increase in response, often due to the intensity of or repeated exposure to a stimulus.
Machine Learning
Arthur Samuel described Machine Learning as:

“the field of study that gives computers the ability to learn without being explicitly programmed.”

Tom Mitchell provides a more modern definition:

“A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience
E.”
Types of Machine Learning
● Machine learning is a subset of AI, which enables the machine to automatically learn from data, improve
performance from past experiences, and make predictions.
Well Posed Learning Problem
A computer program is said to learn from experience E in the context of some task T and some performance measure
P, if its performance on T, as measured by P, improves with experience E.

A problem can be regarded as a well-posed learning problem if it has three traits –

1. Task
2. Performance Measure
3. Experience

Example 1:
Task – Classifying emails as spam or not, Performance
Measure – The fraction of emails accurately classified as spam or not spam
Experience – Observing you label emails as spam or not spam

Example 2:
Task – recognizing different types of faces
Performance Measure – the fraction of face types correctly recognized
Experience – training the machine with a large dataset of different face images
Application of Machine Learning
Healthcare:

➔ Disease Prediction and Diagnosis: Machine learning algorithms can


analyze patient data, such as medical records and imaging, to predict and
diagnose diseases.
➔ Personalized Medicine: Tailoring treatment plans based on individual
patient characteristics and genetic information.

Finance:

➔ Credit Scoring: Machine learning models help assess credit risk by


analyzing customer data.
➔ Fraud Detection: Identifying fraudulent transactions by recognizing patterns
that deviate from normal behavior.

Retail:

➔ Recommendation Systems: Providing personalized product


recommendations based on user behavior and preferences.
➔ Demand Forecasting: Predicting product demand to optimize inventory
management.
Application of Machine Learning cont...
Automotive:

➔ Autonomous Vehicles: Machine learning is crucial for self-driving cars


to perceive and respond to their environment.
➔ Predictive Maintenance: Anticipating when vehicle components may
fail to schedule timely maintenance.

Marketing:

➔ Customer Segmentation: Grouping customers based on behavior for


targeted marketing campaigns.
➔ Sentiment Analysis: Analyzing customer reviews and social media
data to gauge public sentiment towards a product or brand.

Education:

➔ Adaptive Learning Platforms: Personalizing educational content


based on individual student performance.
➔ Student Outcome Prediction: Identifying students at risk of
underperforming based on historical data.
Application of Machine Learning cont...
Cybersecurity:

➔ Anomaly Detection: Identifying unusual patterns or behavior that may


indicate a security threat.
➔ Malware Detection: Using machine learning to recognize patterns
associated with malicious software.

Manufacturing:

➔ Quality Control: Inspecting and identifying defects in manufacturing


processes using image recognition.
➔ Predictive Maintenance: Anticipating equipment failures to reduce
downtime.

Natural Language Processing (NLP):

➔ Chatbots and Virtual Assistants: Using NLP for natural and interactive
communication with users.
➔ Language Translation: Translating text from one language to another
with improved accuracy.
Issues in Machine Learning
Poor Quality of Data

➔ Incomplete Data: Poor data quality often stems from


incomplete datasets, where crucial information is missing or
not collected uniformly.
➔ Noisy Data: Noise in data refers to irrelevant or erroneous
information that can mislead machine learning algorithms.

Underfitting of Training Data


➔ Insufficient Model Complexity: Underfitting occurs when
the chosen machine learning model is too simplistic to
capture the underlying patterns in the data.
➔ Limited Training Iterations: Inadequate training iterations or
epochs can prevent the model from fully converging to a
suitable solution.
Issues in Machine Learning cont…
Overfitting of Training Data

➔ Complex Model: Overfitting occurs when a machine learning


model is excessively complex, fitting not only the
underlying patterns in the data but also the noise and
randomness present in the training dataset.
➔ Limited Data: When the training dataset is small, complex
models can effectively memorize the training examples
rather than learn the actual patterns. As a result, they
struggle to make accurate predictions on new data.

Lack of Training Data


➔ Limited Representativeness: Insufficient training data may
lead to models that fail to capture the true diversity and
complexity of real-world scenarios.
➔ Reduced Generalization: Inadequate training data can
hinder a model's ability to generalize well beyond the data it
was trained on
Basic Data Types in Machine Learning
Quantitative data type: This type of data type consists of
numerical values. Anything which is measured by numbers.

A.) Discrete data type: Numeric data that takes only discrete
values or whole numbers. Values of this type have no meaningful
decimal (fractional) part, and their values can be counted.

E.g.: No. of cars you have, no. of marbles in containers,
students in a class, etc.

B.) Continuous data type: Numerical measures that can
take any value within a certain range. Decimal (fractional) values
of this type are meaningful. Their values cannot be counted,
only measured, and the number of possible values can be infinite.

E.g.: height, weight, time, area, distance, measurement of
rainfall, etc.
Basic Data Types in Machine Learning cont…
Qualitative data type:

These are the data types that cannot be expressed in numbers. This
describes categories or groups and is hence known as the categorical
data type.

a. Structured Data: This type of data consists of numbers or words. It
can take numerical values, but mathematical operations cannot be
performed on those values. This type of data is expressed in tabular format.

E.g.) Sunny=1, cloudy=2, windy=3 or binary form data like 0 or1, Good
or bad, etc.

b. Unstructured data: This type of data does not have a predefined
format and is therefore known as unstructured data.

This comprises textual data, sounds, images, videos, etc.


Exploring Numerical Data
➔ Data exploration refers to the initial step in data analysis in which data analysts use
data visualization and statistical techniques to describe dataset characterizations, such
as size, quantity, and accuracy, in order to better understand the nature of the data.

[https://www.heavy.ai/learn/data-exploration]

What is Exploratory Data Analysis ?

➔ Exploratory Data Analysis (EDA) is understanding data sets by summarizing
their main characteristics, often by plotting them visually. This step is very important,
especially when we arrive at modeling the data in order to apply machine learning.
Plotting in EDA consists of histograms, box plots, scatter plots, and many more.

[https://www.simplilearn.com/tutorials/statistics-tutorial/what-is-normal-distribution].
Nominal Data and Ordinal Data
The main differences between Nominal Data and Ordinal Data are:
While Nominal Data is classified without any intrinsic ordering or rank, Ordinal Data has some
predetermined or natural order.
Nominal data is qualitative or categorical data, while Ordinal data is considered “in-between”
qualitative and quantitative data.
Nominal data do not provide any quantitative value, and you cannot perform numeric operations
with them or compare them with one another. However, Ordinal data provide sequence, and it is
possible to assign numbers to the data. No numeric operations can be performed. But ordinal
data makes it possible to compare one item with another in terms of ranking.
Example of Nominal Data – Eye color, Gender; Example of Ordinal data – Customer Feedback,
Economic Status
Exploratory Data Analysis
Data exploration steps to follow before building a machine learning model include:
➔ Variable identification: define each variable and its role in the dataset
➔ Univariate analysis: for continuous variables, build box plots or histograms for each variable independently;
for categorical variables, build bar charts to show the frequencies
➔ Bi-variate analysis - determine the interaction between pairs of variables by building visualizations
➔ Continuous and Continuous: scatter plots
➔ Categorical and Categorical: stacked column chart
➔ Categorical and Continuous: boxplots combined with swarmplots
➔ Detect and treat missing values
➔ Detect and treat outliers
The ultimate goal of data exploration in machine learning is to provide data insights that will inspire subsequent
feature engineering and the model-building process; a minimal Python sketch of these steps follows the references below.
Ref. https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/
https://www.simplilearn.com/tutorials/data-analytics-tutorial/exploratory-data-analysis
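A minimal sketch of the exploration steps above, assuming a hypothetical CSV with a continuous column `age`, a categorical column `gender`, and a continuous target `income` (the file and column names are illustrative, not from the source):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Variable identification: inspect column names, types and basic statistics
df = pd.read_csv("data.csv")          # hypothetical dataset
print(df.dtypes)
print(df.describe(include="all"))

# Univariate analysis: histogram/box plot for a continuous variable,
# bar chart for a categorical variable
df["age"].plot(kind="hist"); plt.show()
df["age"].plot(kind="box"); plt.show()
df["gender"].value_counts().plot(kind="bar"); plt.show()

# Bi-variate analysis: continuous vs continuous (scatter),
# categorical vs continuous (box plot)
df.plot(kind="scatter", x="age", y="income"); plt.show()
sns.boxplot(x="gender", y="income", data=df); plt.show()

# Detect missing values and take a simple look at extreme values
print(df.isnull().sum())
print(df["income"].quantile([0.01, 0.99]))
```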
Data Types

Quantitative Data: Discrete Data, Continuous Data
Qualitative Data: Structured Data, Unstructured Data


Exploring Categorical Data
Definition: Categorical data represents categories or groups. It can be nominal (no inherent order) or ordinal (has a
defined order).

Types of Categorical Data:

Nominal Data: Categories without any inherent order (e.g., colors, gender).

Ordinal Data: Categories with a specific order (e.g., education levels, satisfaction ratings).

Exploration Techniques:

Frequency Distribution: Count the occurrences of each category. Helps understand the distribution of data.

Bar Charts: Visual representation of frequency distribution. Useful for comparing the frequency of different categories.

Pie Charts: Represent proportions of the whole. Suitable for displaying the contribution of each category.

Central Tendency: Mode is often used for central tendency in categorical data. Mode is the category with the highest
frequency.
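A small illustration of these techniques (frequency distribution, mode, bar and pie charts) in Python, using a made-up list of colour labels:

```python
import pandas as pd
import matplotlib.pyplot as plt

# A made-up nominal variable
colors = pd.Series(["red", "blue", "red", "green", "blue", "red"])

# Frequency distribution: count occurrences of each category
freq = colors.value_counts()
print(freq)               # red 3, blue 2, green 1

# Central tendency for categorical data: the mode
print(colors.mode()[0])   # 'red' (highest frequency)

# Bar chart and pie chart of the frequency distribution
freq.plot(kind="bar"); plt.show()
freq.plot(kind="pie"); plt.show()
```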
Exploring Relationship between variables
Correlation is very useful in data analysis and modelling to better
understand the relationships between variables. The statistical
relationship between two variables is referred to as their correlation.

A correlation could be presented in different ways:

Positive Correlation: both variables change in the same direction.

Neutral Correlation: No relationship in the change of the variables.

Negative Correlation: variables change in opposite directions.

[https://www.geeksforgeeks.org/exploring-correlation-in-python/]
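A short sketch of computing correlations in Python with pandas; the variables here are synthetic examples constructed to show positive, negative, and neutral correlation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "pos": x * 2 + rng.normal(scale=0.5, size=200),   # positively correlated with x
    "neg": -x + rng.normal(scale=0.5, size=200),      # negatively correlated with x
    "none": rng.normal(size=200),                     # roughly uncorrelated with x
})

# Pearson correlation matrix; values near +1, -1 and 0 illustrate
# positive, negative and neutral correlation respectively
print(df.corr(method="pearson").round(2))
```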
Box Plot (https://youtu.be/GMb6HaLXmjY)
https://www.six-sigma-material.com/Box-Plot.html
Data issues and Remediation in machine learning
Insufficient Data: Lack of high-quality and representative data can hinder the training process and lead
to poor model performance.

Overfitting: Occurs when a model learns the training data too well, capturing noise and outliers,
resulting in poor generalization to new, unseen data.

Underfitting: The opposite of overfitting, where the model is too simple and fails to capture the
underlying patterns in the data.

Feature Engineering Challenges: Selecting and creating relevant features is crucial; improper feature
selection or extraction can lead to suboptimal model performance.

Data Leakage: Accidental inclusion of information from the test set in the training process, leading to
overly optimistic performance estimates.
Data issues and Remediation in machine learning
Data Augmentation: Generate additional training data by applying transformations (e.g., rotation,
cropping) to existing data, helping to address issues related to insufficient data.

Regularization Techniques: Implement regularization methods (e.g., L1 or L2 regularization) to prevent


overfitting by penalizing overly complex models.

Model Complexity Adjustment: Experiment with model architectures and complexity, ensuring a
balance between simplicity and capturing essential patterns to mitigate underfitting and overfitting.

Feature Scaling and Normalization: Standardize and normalize features to a similar scale, preventing
certain features from dominating the learning process and aiding in better convergence.

Data Cleaning: Identify and address data quality issues through thorough data cleaning, handling
outliers, and addressing missing or inaccurate values.

https://www.javatpoint.com/issues-in-machine-learning
Data Preprocessing
➔ Data preprocessing is a process of preparing the raw data and making it
suitable for a machine learning model.
➔ It is the first and crucial step while creating a machine learning model.

➔ When creating a machine learning project, we do not always come across
clean and formatted data.
➔ Before doing any operation with data, it is mandatory to clean it and put it in
a formatted way.
Data Preprocessing
Why do we need Data Preprocessing?

➔ Real-world data generally contains noise and missing values, and may be in an
unusable format which cannot be used directly for machine learning models.

➔ Data preprocessing comprises the tasks required for cleaning the data and making it
suitable for a machine learning model, which also increases the accuracy and
efficiency of the model.
Data Preprocessing
➔ Getting the dataset
➔ Importing libraries
➔ Importing datasets
➔ Finding missing data
➔ Encoding categorical data
➔ Splitting the dataset into training and test sets
➔ Feature scaling
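A minimal end-to-end sketch of the steps listed above with pandas and scikit-learn; the DataFrame columns (`age`, `country`, `purchased`) are hypothetical, used only for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Importing the dataset (hypothetical columns)
df = pd.DataFrame({
    "age": [25, 30, None, 40, 35],
    "country": ["IN", "US", "IN", "UK", "US"],
    "purchased": [0, 1, 0, 1, 1],
})

# Finding and handling missing data (mean imputation for the numeric column)
df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

# Encoding categorical data (one-hot encoding via pandas)
X = pd.concat([df[["age"]], pd.get_dummies(df["country"], prefix="country")], axis=1)
y = df["purchased"]

# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling: fit on the training set only, then apply to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```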
Data Preprocessing
➔ After cleaning and properly formatting the data, we need to scale the data.
➔ Scaling bounds all features of a dataset to a fixed range that is the same for
every feature; this can improve the accuracy of our machine learning model
by a good margin.
➔ For performing data preprocessing using Python we need to import some
predefined Python libraries.
k-fold cross validation and bootstrap sampling
What is K-Fold Cross Validation?

➔ K-fold cross-validation is a technique for


evaluating predictive models. The dataset is
divided into k subsets or folds.
➔ The model is trained and evaluated k times,
using a different fold as the validation set
each time.
➔ Performance metrics from each fold are
averaged to estimate the model’s
generalization performance.
➔ This method aids in model assessment,
selection, and hyperparameter tuning,
providing a more reliable measure of a
model’s effectiveness.
k-fold cross validation and bootstrap sampling
k-fold cross validation and bootstrap sampling
What is K-Fold Cross Validation?

➔ Let’s have a generalised K value. If K=5, it means, in the given dataset and
we are splitting into 5 folds and running the Train and Test.
➔ During each run, one fold is considered for testing and the rest will be for
training and moving on with iterations, the below pictorial representation
would give you an idea of the flow of the fold-defined size.

https://www.analyticsvidhya.com/blog/2022/02/k-fold-cross-validation-technique-and-its-essentials/
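A minimal scikit-learn sketch of K-fold cross-validation with K=5; the classifier and dataset are illustrative choices, not prescribed by the source:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K=5: the data is split into 5 folds; each fold is used once for testing
# while the remaining 4 folds are used for training
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")

print("Per-fold accuracy:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))   # averaged generalization estimate
```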
Bootstrap Sampling

● Sampling: With respect to statistics, sampling is the process of selecting a subset of


items from a vast collection of items (population) to estimate a certain characteristic of the
entire population
● Sampling with replacement: It means a data point in a drawn sample can reappear in
future drawn samples as well
● Parameter estimation: It is a method of estimating parameters for the population using
samples. A parameter is a measurable characteristic associated with a population. For
example, the average height of residents in a city, the count of red blood cells, etc.
Bootstrap Sampling
➔ Instead of measuring the heights of all the students, we can draw a random
sample of 5 students and measure their heights.
➔ We would repeat this process 20 times and then average the collected
height data of 100 students (5 x 20).
➔ This average height would be an estimate of the mean height of all the
students of the school.

https://www.analyticsvidhya.com/blog/2020/02/what-is-bootstrap-sampling-in-statistics-and-machine-learning/
Bootstrap Sampling
What is bootstrap sampling?
➔ The bootstrap sampling method is
a resampling method that uses
random sampling with
replacement.
➔ Bootstrap sampling is used in a
machine learning ensemble
algorithm called bootstrap
aggregating (also called bagging).
➔ It helps in avoiding overfitting and
improves the stability of machine
learning algorithms.
Bootstrap Sampling
What is the advantage of bootstrap sampling?
➔ The advantage of bootstrap sampling is that it allows for robust statistical inference
without relying on strong assumptions about the underlying data distribution.
➔ By repeatedly resampling from the original data, it provides an estimate of the
sampling distribution of a statistic, helping to quantify its uncertainty.
➔ This method is particularly useful when the data is limited or when traditional
parametric methods are not appropriate.
➔ Bootstrap sampling is used in a machine learning ensemble algorithm called
bootstrap aggregating (also called bagging).
➔ It helps in avoiding overfitting and improves the stability of machine learning
algorithms.
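A small NumPy sketch of bootstrap sampling (random sampling with replacement) used to estimate a mean and its uncertainty; the height data here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(42)
heights = rng.normal(loc=160, scale=8, size=50)   # synthetic "observed" heights (cm)

# Draw many bootstrap samples (same size as the original, with replacement)
# and record the mean of each resample
boot_means = np.array([
    rng.choice(heights, size=heights.size, replace=True).mean()
    for _ in range(1000)
])

print("Original sample mean:", heights.mean().round(2))
print("Bootstrap estimate of the mean:", boot_means.mean().round(2))
# A simple 95% confidence interval from the bootstrap distribution
print("95% CI:", np.percentile(boot_means, [2.5, 97.5]).round(2))
```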
Overfitting & Underfitting, Bias & Variance

https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/
Overfitting & Underfitting, Bias & Variance
Bias:

➔ Bias refers to the error due to overly simplistic assumptions


in the learning algorithm. These assumptions make the
model easier to comprehend and learn but might not
capture the underlying complexities of the data.

➔ While making predictions, a difference occurs between


prediction values made by the model and actual
values/expected values, and this difference is known as
bias errors or Errors due to bias.

https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/
Overfitting & Underfitting, Bias & Variance
Variance:

➔ Variance specifies the amount of variation in the prediction if different
training data were used.
➔ In simple words, variance tells how much a random variable differs from its
expected value.
Overfitting & Underfitting, Bias & Variance
What is underfitting

➔ Underfitting occurs when a model is not able to make accurate


predictions based on training data and hence, doesn’t have the
capacity to generalize well on new data.

➔ Machine learning models with underfitting tend to have poor


performance both in training and testing sets(like the child who
learned only addition and was not able to solve problems
related to other basic arithmetic operations both from his math
problem book and during the math exam).

➔ Underfitting models usually have high bias and low variance.


Overfitting & Underfitting, Bias & Variance
What is overfitting
➔ A model is considered overfitting when it
does extremely well on training data but
fails to perform on the same level on the
validation data (like the child who
memorized every math problem in the
problem book and would struggle when
facing problems from anywhere else).

➔ Models that are overfitting usually have
low bias and high variance
Overfitting & Underfitting, Bias & Variance
Bias-Variance Trade-Off
➔ While building the machine learning model, it is
really important to take care of bias and variance in
order to avoid overfitting and underfitting in the
model.
➔ If the model is very simple with fewer parameters,
it may have low variance and high bias.
➔ Whereas, if the model has a large number of
parameters, it will have high variance and low bias.
➔ So, it is required to make a balance between bias
and variance errors, and this balance between the
bias error and variance error is known as the
Bias-Variance trade-off.

https://www.javatpoint.com/bias-and-variance-in-machine-learning
https://censius.ai/wiki/overfitting-vs-underfitting
Model Performance & Evaluation
➔ Model performance in general refers to how well a model accomplishes its
intended task, but it is important to define exactly what element of a model
is being considered, and what “doing well” means for that element.
➔ Performance evaluation is the quantitative measure of how well a trained
model performs on specific model evaluation metrics in machine learning.
➔ Two of the most important categories of evaluation methods are
classification and regression model performance metrics.
Classification metrics/ a Confusion matrix
➔ True Positive: You predicted positive, and it’s true(TP).
➔ True Negative: You predicted negative, and it’s true(TN).
➔ False Positive: (Type 1 Error): You predicted positive, and it’s false(FP).
➔ False Negative: (Type 2 Error): You predicted negative, and it’s false(FN).
➔ Sensitivity or Recall: The proportion of actual positive cases which are
correctly identified.
➔ Specificity: The proportion of actual negative cases which are correctly
identified.
Classification metrics/ a Confusion matrix
➔ Precision - percentage of positive cases that
were true positives as opposed to false
positives. Use the formula Precision = TP /
(TP+FP)
➔ Recall - percentage of actual positive cases
that were predicted as positives, as
opposed to those classified as false
negatives. Use the formula Recall =
TP/(TP+FN)
➔ Accuracy - percentage of the total cases that were correctly classified. Use the
formula Accuracy = (TP+TN) / (TP+TN+FP+FN)
● Logarithmic loss - measure of how many total errors a model has. The closer to
zero, the more correct predictions a model makes in classifications.
● Area under curve - method of visualizing true and false positive rates against each other.
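These quantities can be computed directly from a confusion matrix; a small scikit-learn sketch with made-up labels and scores:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]          # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]          # predicted classes
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]  # predicted scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)

print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("Accuracy: ", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("AUC:      ", roc_auc_score(y_true, y_prob))     # area under the ROC curve
```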
Regression Metrics
We decompose variability into the sum of squares total (SST), the sum of squares regression (SSR), and the sum of squares error (SSE):

SST = SSR + SSE

One of the most common metrics of model prediction accuracy is the mean absolute percentage error (MAPE).
Regression Metrics

Some of the most useful regression metrics include:


➔ Coefficient of determination (or R-squared) - measures the proportion of variance in the
observed data that is explained by the model.
➔ Mean squared error (MSE) - measures the average squared divergence of the
model from the observed data.
➔ Mean absolute error (MAE) - measures the average absolute (vertical) distance between
data points and the regression line, illustrating how much the model deviates from
the observed data.
➔ Mean absolute percentage error (MAPE) - shows mean absolute error as a
percentage.
➔ Weighted mean absolute percentage error (WMAPE) - weights the percentage errors by the
actual values.
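A short sketch computing several of these regression metrics with scikit-learn on made-up predictions (WMAPE is computed by hand in one common form, as an assumption about the intended definition):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 12.6])

print("R-squared:", r2_score(y_true, y_pred))
print("MSE:      ", mean_squared_error(y_true, y_pred))
print("MAE:      ", mean_absolute_error(y_true, y_pred))
print("MAPE:     ", mean_absolute_percentage_error(y_true, y_pred))

# Weighted MAPE (one common form): total absolute error divided by total actuals
wmape = np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()
print("WMAPE:    ", wmape)
```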
Evaluation Metrics for Clustering
Performance Improvement
1. Add More Data:
➔ Having more data is always a good idea.
➔ It allows the “data to tell for itself” instead of relying on assumptions and weak
correlations.
➔ Presence of more data results in better and more accurate machine-learning
models.
Performance Improvement
2. Treat Missing and Outlier Values

➔ The unwanted presence of missing and outlier values in machine learning training data often
reduces the accuracy of a trained model or leads to a biased model.
➔ It leads to inaccurate predictions.

➔ Missing: In the case of continuous variables, you can impute the missing values with mean, median,
or mode.
➔ Outlier: You can delete the observations and perform transformations, binning, or imputation (same
as missing values). Alternatively, you can also treat outlier values separately.
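A brief pandas sketch of these treatments — median imputation for missing values and IQR-based capping for outliers — on a synthetic column:

```python
import pandas as pd

s = pd.Series([12, 15, 14, None, 13, 16, 150, 14, None, 15])  # 150 is an outlier

# Missing values: impute with the median of the observed values
s = s.fillna(s.median())

# Outliers: cap values outside 1.5 * IQR beyond the quartiles (one common rule)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
s_capped = s.clip(lower=lower, upper=upper)

print(s_capped.tolist())
```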
Performance Improvement
3. Feature Engineering
➔ This step helps to extract more information from existing data. New information is
extracted in terms of new features.
➔ Feature Transformation: Changing the scale of a variable from the original
scale to a scale between zero and one is a common practice in machine
learning, known as data normalization.
➔ Feature Creation: Deriving new variable(s) from existing variables is known as
feature creation. It helps to unleash the hidden relationships of a data set. Let's
say we want to predict the number of transactions in a store based on
transaction dates; new features such as the day of the week or the month can be derived from the dates.
Performance Improvement
4. Feature Selection
➔ a) Feature Selection is a process of finding out the best subset of attributes that
better explains the relationship of independent variables with the target variable.
➔ b) Domain Knowledge: Based on domain experience, we select feature(s) which
may have a higher impact on the target variable.
➔ c) Visualization: As the name suggests, it helps to visualize the relationship between
variables, which makes your variable selection process easier.
➔ d) Statistical Parameters: We also consider the p-values, information values, and
other statistical metrics to select the right features.
➔ e) PCA: It helps to represent training data into lower dimensional spaces but still
characterizes the inherent relationships in the data. It is a type of dimensionality
reduction technique.
Performance Improvement
5. Multiple Algorithms

➔ There are many different algorithms in machine learning, but hitting the right
machine learning algorithm is the ideal approach to achieve higher accuracy.
But, it is easier said than done.
Performance Improvement
6. Algorithm Tuning

➔ We know that machine learning algorithms are driven by hyperparameters.


These hyperparameters majorly influence the outcome of the learning
process.
➔ The objective of hyperparameter tuning is to find the optimum value for each
hyperparameter to improve the accuracy of the model.
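A minimal grid-search sketch of hyperparameter tuning with scikit-learn; the model and hyperparameter grid are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 0.01]}

# Grid search evaluates every combination with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy:   ", round(search.best_score_, 3))
```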
Performance Improvement
7. Ensemble Methods
➔ This is the most common approach that
you will find in most winning solutions
of data science competitions. This
technique simply combines the results of
multiple weak models and produces
better results. You can achieve this in the
following ways:
➔ Bagging (Bootstrap Aggregating): combines the results of models trained on
bootstrap samples of the data (see Bootstrap Sampling above).
➔ Boosting: Boosting is an ensemble
modeling technique that attempts to
build a strong classifier from a
number of weak classifiers.

Ref. https://www.analyticsvidhya.com/blog/2015/12/improve-machine-learning-results/
Performance Improvement
8. Cross Validation

➔ To judge how well a model will perform on unseen data before finalizing it, we must use the cross-validation
technique.
➔ Cross Validation is one of the most important concepts in data modeling.
➔ It says to try to leave a sample on which you do not train the model and test
the model on this sample before finalizing the model.

Ref. https://www.analyticsvidhya.com/blog/2015/12/improve-machine-learning-results/
Feature Construction and Extraction
➔ Feature extraction/construction is a process through which a set of new features is
created.
What is Feature Extraction?
➔ Feature extraction is the process of identifying and selecting the most important
information or characteristics from a data set.

Common Feature Extraction Techniques: Dimensionality Reduction- Dimensionality


reduction can be done in 2 ways:

a. Feature Selection

b. Feature Extraction:
Feature Selection
Feature Selection: Feature selection is a way of selecting the subset of the most
relevant features from the original features set by removing the redundant,
irrelevant, or noisy features.

Benefits of feature selection in machine learning:

➔ It helps in avoiding the curse of dimensionality.


➔ It helps in the simplification of the model so that it can be easily interpreted
by the researchers.
➔ It reduces the training time.
➔ It reduces overfitting and hence enhances generalization.
Feature Selection
Supervised Feature Selection Technique
Wrapper Methods:
➔ In wrapper methodology, selection of features is done
by considering it as a search problem, in which different
combinations are made, evaluated, and compared with
other combinations.

➔ It trains the algorithm by using the subset of features


iteratively.
Supervised Feature Selection Technique
Some techniques of wrapper methods are:
Forward selection -
➔ Forward selection is an iterative process, which
begins with an empty set of features. After each
iteration, it keeps adding on a feature and evaluates
the performance to check whether it is improving the
performance or not.
➔ The process continues until the addition of a new
variable/feature does not improve the performance
of the model.
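A small sketch of forward selection using scikit-learn's SequentialFeatureSelector (one possible implementation of the idea described above; the estimator, dataset, and number of features are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Start from an empty set and greedily add the feature that most improves
# cross-validated performance, until 5 features are selected
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",
    cv=5,
)
selector.fit(X, y)

print("Selected features:", list(X.columns[selector.get_support()]))
```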
Supervised Feature Selection Technique
Some techniques of wrapper methods are:

Exhaustive Feature Selection-

➔ Exhaustive feature selection is one of the best


feature selection methods, which evaluates each
feature set as brute-force.

➔ It means this method tries each possible
combination of features and returns the best
performing feature set.
Supervised Feature Selection Technique
Some techniques of wrapper methods are:
Backward elimination -
➔ Backward elimination is also an iterative approach,
but it is the opposite of forward selection. This
technique begins the process by considering all the
features and removes the least significant feature.
➔ This elimination process continues until removing
the features does not improve the performance of
the model.
Supervised Feature Selection Technique
Some techniques of wrapper methods are:
Recursive Feature Elimination-
➔ Recursive feature elimination is a recursive greedy
optimization approach, where features are selected
by recursively taking a smaller and smaller subset of
features.
➔ Now, an estimator is trained with each set of
features, and the importance of each feature is
determined
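A brief sketch of recursive feature elimination with scikit-learn's RFE, using a logistic regression estimator as an illustrative choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Repeatedly fit the estimator and drop the least important feature
# until only 5 features remain
rfe = RFE(estimator=LogisticRegression(max_iter=5000),
          n_features_to_select=5, step=1)
rfe.fit(X, y)

print("Selected features:", list(X.columns[rfe.support_]))
print("Feature ranking (1 = selected):", rfe.ranking_[:10])
```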
Supervised Feature Selection Technique
Filter Methods
➔ In Filter Method, features are selected on the basis
of statistics measures. This method does not depend
on the learning algorithm and chooses the features
as a pre-processing step.
➔ The filter method filters out the irrelevant feature and
redundant columns from the model by using different
metrics through ranking.
➔ The advantage of using filter methods is that they need
low computational time and do not overfit the data.
Supervised Feature Selection Technique
Filter Methods
Some common techniques of Filter methods are as follows:
Information Gain:
➔ Information gain determines the reduction in entropy while transforming the dataset.
➔ It can be used as a feature selection technique by calculating the information gain of
each variable with respect to the target variable.
Chi-square Test:
➔ Chi-square test is a technique to determine the relationship between the categorical
variables.
➔ The chi-square value is calculated between each feature and the target variable,
and the desired number of features with the best chi-square value is selected.
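A short filter-method sketch using the chi-square test via scikit-learn's SelectKBest; the dataset is an illustrative choice whose features are all non-negative, as chi2 requires:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Score each feature against the target with the chi-square statistic
# and keep the 5 best-scoring features
selector = SelectKBest(score_func=chi2, k=5)
selector.fit(X, y)

print("Chi-square scores (first 5):", selector.scores_[:5].round(1))
print("Selected features:", list(X.columns[selector.get_support()]))
```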
Supervised Feature Selection Technique
Filter Methods
Some common techniques of Filter methods are as follows:
Fisher's Score:
➔ Fisher's score is one of the popular supervised techniques of feature selection.
➔ It returns the rank of each variable on Fisher's criterion in descending order.
➔ Then we can select the variables with a large Fisher's score.
Missing Value Ratio:
➔ The value of the missing value ratio can be used for evaluating the feature set
against the threshold value.
➔ The formula for obtaining the missing value ratio is the number of missing values in
each column divided by the total number of observations. Variables having a ratio
higher than the threshold value can be dropped.
Supervised Feature Selection Technique
Embedded Methods
➔ These methods are also iterative, which
evaluates each iteration, and optimally
finds the most important features that
contribute the most to training in a
particular iteration.
➔ Embedded methods combine the
advantages of both filter and wrapper
methods by considering the interaction of
features along with low computational
cost.
➔ These are fast processing methods
similar to the filter method but more
accurate than the filter method.
➔ Some techniques of embedded methods
are:
Supervised Feature Selection Technique
Embedded Methods
Some of the most popular examples of these methods are LASSO and RIDGE regression which have
inbuilt penalization functions to reduce overfitting.

● Lasso regression performs L1 regularization which adds penalty equivalent to absolute value of
the magnitude of coefficients.
● Ridge regression performs L2 regularization which adds penalty equivalent to square of the
magnitude of coefficients.

For more details and implementation of LASSO and RIDGE regression, you can refer to this article.
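A compact sketch contrasting the L1 (Lasso) and L2 (Ridge) penalties on an illustrative dataset; note how Lasso drives some coefficients exactly to zero, which is what makes it usable as an embedded feature selector:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # penalties assume comparable feature scales

lasso = Lasso(alpha=1.0).fit(X, y)      # L1: penalty on the absolute value of coefficients
ridge = Ridge(alpha=1.0).fit(X, y)      # L2: penalty on the square of coefficients

print("Lasso coefficients:", np.round(lasso.coef_, 1))   # several are exactly 0
print("Ridge coefficients:", np.round(ridge.coef_, 1))   # shrunk but non-zero
```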
Random Forest Importance
Feature importance in Random Forest is determined through the following process:

Gini Importance:
Random Forest calculates the importance of a feature by looking at how much the tree nodes that use
that feature reduce impurity across all trees in the forest. This is known as Gini importance.

○ The Gini importance of a feature is the average of how much each tree node that uses that
feature reduces impurity across all trees in the forest.

Mean Decrease in Impurity (MDI):

The mean decrease in impurity (MDI) is another method used to determine feature importance in
Random Forest.

○ For each tree in the forest, the algorithm records how much each feature decreases the
impurity (typically measured by Gini impurity or entropy) as the data is split across the nodes.
The average decrease in impurity across all trees is then used as the feature importance.
Random Forest Importance
Feature Importance Calculation:

○ Once the Gini importance or MDI is calculated for each feature, the importances are
normalized so that they sum to 1. This helps in comparing the relative importance of
different features.

Application:

○ Feature importance scores can then be used to identify the most influential features
in the model, allowing for insights into which features are most informative for making
predictions.

In conclusion, Random Forest determines feature importance by evaluating how much each
feature contributes to reducing impurity across the ensemble of trees, and then normalizing
these importances to provide a relative ranking of feature importance.
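A short sketch of extracting impurity-based (Gini/MDI) feature importances from a trained Random Forest with scikit-learn; the dataset is an illustrative choice, and `feature_importances_` is already normalized to sum to 1:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Fit a forest; each split's impurity decrease is accumulated per feature
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

importances = pd.Series(forest.feature_importances_, index=X.columns)
print("Importances sum to:", round(importances.sum(), 2))      # 1.0
print(importances.sort_values(ascending=False).head(5))        # most influential features
```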
Feature Selection Techniques
Dimensionality Reduction:
➔ Principal Component Analysis (PCA): Reduces the dimensionality of the feature
space while retaining most of the variation in the data.
➔ Linear Discriminant Analysis (LDA): Maximizes the separation between classes by
finding the linear combinations of features that best represent the classes.
Hybrid Methods:
➔ Recursive Feature Elimination (RFE): Combines both wrapper and embedded
methods by recursively fitting a model and removing the least important feature until
the desired number of features is reached.
Each technique has its strengths and weaknesses, and the choice of method depends on
the specific characteristics of the dataset and the machine learning task at hand.
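A minimal PCA sketch with scikit-learn, reducing a standardized feature matrix to two principal components and reporting how much variance they retain (dataset and component count are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print("Reduced shape:", X_reduced.shape)                       # (150, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```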
Feature Selection Techniques

Feature selection is a critical step in the machine learning pipeline that involves choosing a subset of
relevant features for model training. Here are several popular feature selection techniques:

Filter Methods:

➔ Variance Threshold: Remove features with low variance as they contain little information.
➔ Correlation-based Feature Selection: Identify and remove highly correlated features to reduce
redundancy.

Wrapper Methods:

➔ Forward Selection: Start with an empty set of features and iteratively add the feature that improves
the model performance the most.
➔ Backward Elimination: Begin with all features and remove the least significant feature at each
iteration.
Binomial Distribution
Binomial Distribution:
● A discrete probability distribution that gives the probability of only two possible
outcomes in n independent trials is known as a Binomial Distribution.

Example:
● Number of tails in flipping a coin n times.
● The number of times of getting a 1 on throwing a dice.

Conditions of a Binomial Distribution:
● The experiment consists of n identical trials.
● Each trial is independent.
● Each trial results in one of two possible outcomes, i.e. Success or Failure.
● The probability of success remains constant throughout the experiment.
https://www.shiksha.com/online-courses/articles/binomial-distribution-definition-and-examples/
Mathematical Definition: Binomial Distribution
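For reference, the standard binomial probability mass function, with n trials, success probability p and q = 1 - p, is:

P(X = x) = C(n, x) * p^x * q^(n-x),   x = 0, 1, ..., n

where C(n, x) = n! / (x!(n-x)!); the mean is np and the variance is npq.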
Mathematical Definition: Binomial Distribution Examples
1. The probability that a man aged 60 will
live up to 70 is 0.65. Out of 10 men, now
aged 60, find the probability that:

a) At least 7 will live up to 70.

b) Exactly 9 will live up to 70.

c) At most 9 will live up to 70.

Sol. n=10, p=0.65,


q=1-p=0.35

N.B: np is the frequency
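A quick check of these three probabilities with SciPy (a sketch; scipy.stats.binom is assumed to be available):

```python
from scipy.stats import binom

n, p = 10, 0.65

# a) P(at least 7 live up to 70) = P(X >= 7) = 1 - P(X <= 6)
print("P(X >= 7) =", round(1 - binom.cdf(6, n, p), 4))

# b) P(exactly 9 live up to 70) = P(X = 9)
print("P(X = 9)  =", round(binom.pmf(9, n, p), 4))

# c) P(at most 9 live up to 70) = P(X <= 9)
print("P(X <= 9) =", round(binom.cdf(9, n, p), 4))
```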


Mathematical Definition: Binomial Distribution Examples
2. Out of 800 families with 5 children each.
How many families could be expected to have

I. 3 boys
II. 5 girls
III. Either 2 or 3 boys
IV. At least 2 girls

Sol. N=800, n=5, p=0.5, q=1-p=0.5

Ref. https://www.youtube.com/watch?v=Nhw5qGKZm5o
Mathematical Definition: Binomial Distribution Examples
3. Four coins are tossed 100 times and the
following results are obtained.

No. of Heads    Frequency
0               5
1               29
2               36
3               25
4               5

Fit a binomial distribution for data and


calculate theoretical frequency.
Ref. https://www.youtube.com/watch?v=Nhw5qGKZm5o
Poisson Distribution
A Poisson distribution:

● can be used to model the number of times an event occurs in an interval of time
or space
● is a discrete probability distribution
● expresses the probability of a given number of events occurring in a fixed interval of
time or space
● events occur at a constant mean rate
● events occur independently of the time since the last event
● can be applied to measures such as distance, area, and volume
Mathematical Definition: Poisson Distribution
A Poisson distribution:

λ (lambda) is known as the mean: λ = np
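For reference, the standard Poisson probability mass function with mean λ is:

P(X = x) = (e^(-λ) * λ^x) / x!,   x = 0, 1, 2, ...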
Mathematical Definition: Poisson Distribution example
A Poisson distribution:

1. Given that 2% of the fuses
manufactured by a firm are defective,
find the probability that a box containing
200 fuses has I) at least one defective
fuse, II) 3 or more defective fuses, III)
no defective fuses.

Sol. n=200, p=0.02, λ = np = 4

Ref. https://www.youtube.com/watch?v=raySzcgcBFk
Bernoulli Distribution:
Bernoulli Distribution:
Bernoulli distribution is a special case of Binomial Distribution when the random experiment is done just only one
time.

Similar to binomial distribution it has only two possible outcomes:

● Success (1)
● Failure (0)

Note: The sum of the probability of success and failure is equal to 1.

Example:

● India will win the cricket world cup or not


● You will pass the exam or not
Bernoulli Distribution:
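For reference, the standard Bernoulli probability mass function with success probability p is:

P(X = x) = p^x * (1 - p)^(1 - x),   x ∈ {0, 1}

with mean p and variance p(1 - p).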
Continuous Distribution: Uniform
● Mathematically, a continuous uniform
distribution is defined by two parameters:
the minimum value ‘a’ and the maximum
value ‘b’.

Example:

● Consider a fair six-sided die. When you roll


the die, the outcome can be any of the
numbers 1, 2, 3, 4, 5, or 6, and each
outcome is equally likely.

● The probability of getting any specific


number is 1/6 making it a uniform
distribution.
Continuous Distribution: Normal
● Normal distribution, also known as the
Gaussian distribution, is a probability
distribution that is symmetric about the mean,
showing that data near the mean are more
frequent in occurrence than data far from the
mean.

● The normal distribution appears as a "bell


curve" when graphed.

● The standard deviation measures how data


values deviate from the mean.
Continuous Distribution: Uniform vs. Normal
Continuous Distribution: Laplace
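For reference, the standard Laplace probability density function with location (mean) μ and scale parameter b is:

f(x) = (1 / (2b)) * e^(-|x - μ| / b)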
Example:

Suppose we have a class of 30 students, and we want to understand


the distribution of their heights. Let's say that the heights of these
students follow a Laplace distribution with a mean (average) height
of 160 cm and a scale parameter of 5 cm.

In this case:

● μ=160 cm (mean height)


● b=5 cm (scale parameter)

With this information, we can understand the probabilities associated


with different height ranges within the class.
Continuous Distribution: Laplace
For example:

● There's a high probability that most students will have heights


close to the mean height of 160 cm.
● There's a lower but still significant probability that some
students will have heights slightly taller or shorter than 160
cm.
● There's a smaller probability that some students will have
heights significantly taller or shorter than 160 cm.
● Heights that are very far from the mean have increasingly
lower probabilities of occurring.
Central Limit Theorem in Machine Learning
● The Central Limit Theorem in statistics states that whenever we
take a sufficiently large sample from a population, the distribution
of the sample mean approximates a normal distribution.

● The CLT can be used in machine learning to draw conclusions
about the performance of a model.

● In mathematical terms, this can be understood in terms of
the number of random samples, the sample mean, the standard
normal distribution, and the standard deviation.
Central Limit Theorem in Machine Learning
● The Central Limit Theorem states that the sum (or average)
of a large number of independent and identically distributed
random variables, regardless of their underlying
distribution, will approximately follow a normal (Gaussian)
distribution.
● This theorem holds true even if the individual random
variables are not normally distributed themselves.
● In the context of machine learning, this theorem is crucial
because it allows practitioners to make statistical
inferences about population parameters based on sample
data, assuming certain conditions are met.
Central Limit Theorem in Machine Learning
Use of Central Limit Theorem(CLT)
Population Parameter Estimation – We can use CLT to estimate the parameters of the
population like population mean or population proportion based on a sampled data.
Hypothesis testing – The CLT can be used for various hypothesis tests, as it
helps in constructing test statistics, such as the z-test or t-test, by assuming that the
sampling distribution of the test statistic is approximately normal.
Confidence interval – Confidence interval plays a very important role in defining the
range in which the population parameter lies. CLT plays a very crucial role in determining
the confidence interval of these population parameter.
Sampling Techniques – sampling technique help in collecting representative samples and
generalize the findings to the larger population. The CLT supports various sampling
techniques used in survey sampling and experimental design.
Simulation and Monte Carlo Methods – These methods involve generating random
samples from known distributions to approximate the behavior of complex systems or
estimate statistical quantities. The CLT plays a key role in simulation and Monte
Carlo methods.
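A small NumPy sketch illustrating the CLT: sample means drawn from a clearly non-normal (exponential) population still form an approximately normal distribution, with spread close to sigma/sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(0)

# Population that is far from normal (exponential, mean = 1)
population = rng.exponential(scale=1.0, size=100_000)

# Draw many samples of size 50 and record each sample's mean
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(2000)
])

# The distribution of sample means is approximately normal with
# mean ~ population mean and std ~ population std / sqrt(n)
print("Population mean:", population.mean().round(3))
print("Mean of sample means:", sample_means.mean().round(3))
print("Std of sample means:", sample_means.std().round(3))
print("Predicted std (sigma/sqrt(n)):", (population.std() / np.sqrt(50)).round(3))
```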
Monte Carlo Approximation in Machine Learning
● A Monte Carlo simulation is
defined as a computational
technique that uses random
sampling to model and
analyze complex systems or
processes.
● The method derives its
name from Monaco’s
renowned Monte Carlo
Casino, which is
synonymous with games of
chance and randomness.
Monte Carlo Approximation in Machine Learning
● Monte Carlo approximation, on the other hand, is a technique used to
estimate complex mathematical expressions or solve intricate
problems through random sampling.
● In machine learning, Monte Carlo methods are often employed in
situations where exact analytical solutions are impractical or infeasible.
● By generating a large number of random samples from a probability
distribution, Monte Carlo methods approximate the desired solution by
averaging or integrating over these samples.
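A classic Monte Carlo sketch: estimating π by random sampling. The exact answer is known here, but the same averaging idea carries over to integrals and expectations that are hard to compute analytically:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 1_000_000

# Draw random points uniformly in the unit square
x = rng.random(n_samples)
y = rng.random(n_samples)

# Fraction of points falling inside the quarter circle of radius 1
inside = (x**2 + y**2) <= 1.0

# Area of the quarter circle is pi/4, so pi ~ 4 * fraction inside
pi_estimate = 4 * inside.mean()
print("Monte Carlo estimate of pi:", round(pi_estimate, 4))
```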
Central Limit Theorem and Monte Carlo Approximation
The relationship between the Central Limit Theorem and Monte Carlo approximation
in machine learning lies in the latter's utilization of the former's principles. Monte
Carlo methods often rely on the Central Limit Theorem to justify their effectiveness.
Specifically, when a large number of random samples are drawn from a distribution,
the resulting estimates tend to converge to the true value as predicted by the Central
Limit Theorem. This convergence behavior allows Monte Carlo approximation to
provide reliable estimates of complex quantities or solutions.

In summary, the Central Limit Theorem provides theoretical support for the
effectiveness of Monte Carlo approximation techniques in machine learning,
enabling practitioners to leverage random sampling to estimate solutions to
challenging problems accurately.
Bayes Theorem, Prior and Posterior Probability, Likelihood
Bayes Theorem:
● One of the most recent developments in artificial intelligence is machine learning.

● The Bayes Theorem is the key idea in machine learning.

● The Bayes theorem is frequently referred to as the Bayes rule or Bayes Law.

● One of the most well-known theorems in machine learning, the Bayes theorem helps
determine the likelihood that one event will occur, given uncertain information, when another
event has already happened.
Bayes Theorem, Prior and Posterior Probability, Likelihood

Bayes Theorem: The mathematical formulation of the Bayes theorem is given below.
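In its standard form, with A the hypothesis and B the observed evidence, the theorem reads:

P(A|B) = P(B|A) * P(A) / P(B)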
Bayes Theorem includes two conditional
probabilities for the events, say A and B.

● Posterior, P(A|B), is the revised probability of the hypothesis that takes the available evidence into account.
● Likelihood, P(B|A), is the probability of observing the evidence given that the hypothesis is true.
● P(A) is referred to as the prior probability, or the probability of the hypothesis before taking the data
into account.
● P(B) is referred to as the marginal probability. It is described as the probability of the evidence taken into
account.
● The Bayes theorem has a wide range of uses in machine learning, making it one of the most
popular approaches among all algorithms for classification-related issues.
● Evidence: P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
Bayes Theorem, Prior and Posterior Probability, Likelihood

Bayes Theorem Example:

Assume that the word ‘offer’


occurs in 80% of the spam
messages in my account. Also,
let’s assume ‘offer’ occurs in 10%
of my desired e-mails. If 30% of
the received emails are
considered as a spam, and I will
receive a new message which
contains ‘offer’, what is the
probability that it is spam?
Bayes Theorem, Prior and Posterior Probability, Likelihood
Bayes Theorem Example:
A = Spam,

B = Contains offer

P( contains offer|spam) = 0.8 (given)

P(spam) = 0.3 (given)

P(contains offer) =
P(Contains offer|Spam) * P(Spam) + P(Contains offer|not Spam) * P(not Spam)
= 0.8*0.3 + 0.1*0.7
= 0.31

P(Spam|Contains offer) = (0.8*0.3)/0.31 = 0.774
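The same calculation as a few lines of Python (a sketch of the arithmetic above, with the given probabilities hard-coded):

```python
# Given values from the example
p_offer_given_spam = 0.8      # P(contains offer | spam)
p_offer_given_not_spam = 0.1  # P(contains offer | not spam)
p_spam = 0.3                  # P(spam)

# Evidence: total probability of seeing the word 'offer'
p_offer = p_offer_given_spam * p_spam + p_offer_given_not_spam * (1 - p_spam)

# Posterior: Bayes theorem
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer

print(round(p_offer, 2))             # 0.31
print(round(p_spam_given_offer, 3))  # 0.774
```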
Concept Learning, Bayesian Belief Network
Concept Learning is a fundamental task in machine learning that involves learning
to categorize objects or instances into different classes or categories based on
their features or attributes. It is the process of inferring a general function from
specific instances.
Concept Learning, Bayesian Belief Network
Bayesian Belief Networks (BBNs):
Bayesian Belief Networks, also known as Bayesian Networks or Probabilistic Graphical Models, are
graphical models that represent probabilistic relationships among a set of variables using directed
acyclic graphs (DAGs).

BBNs provide a compact and interpretable way to encode complex probabilistic dependencies and
perform probabilistic inference.

Key Points:

Nodes: Represent random variables or features in the domain.


Edges: Directed edges between nodes indicate probabilistic dependencies.
Conditional Probability Distributions (CPDs): Quantify the conditional probabilities of each variable
given its parents in the graph.
Inference: BBNs facilitate efficient probabilistic inference, allowing queries about the probabilities
of variables given evidence.
Learning: BBNs can be learned from data using techniques like parameter estimation and
structure learning.
Concept Learning, Bayesian Belief Network
Applications of Bayesian Belief Networks:

Diagnosis and Decision Support: BBNs are used for medical diagnosis, fault diagnosis, and decision
support systems.
Risk Analysis: Assessing risks and uncertainties in various domains such as finance, engineering,
and environmental science.
Natural Language Processing: BBNs are applied in tasks like language modeling, information
retrieval, and sentiment analysis.

Advantages:

Modularity: BBNs provide a modular representation that allows easy incorporation of domain
knowledge.
Uncertainty Handling: BBNs naturally handle uncertainty and missing data through probabilistic
inference.
Interpretability: BBNs offer intuitive graphical representations that facilitate understanding and
interpretation of complex probabilistic relationships.
Bayesian Belief Network
Bayesian Belief Network
✔ Bayesian belief network is key method for dealing with probabilistic events and to solve a
problem which has uncertainty.
✔ A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph.
✔ It is also called a Bayes network, belief network, decision network, or Bayesian model.

✔ Bayesian networks are probabilistic, because these networks are built from a probability
distribution.
Application
✔ Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network.
✔ It can also be applied in various tasks including prediction, anomaly detection, diagnostics,
reasoning, time series prediction, and decision making under uncertainty.
Bayesian Belief Network cont..
Bayesian Belief Network(Directed Acyclic Graph or DAG)
✔ Each node corresponds to the random variables, and a variable can be
continuous or discrete.
✔ Arc or directed arrows represent the causal relationship or conditional
probabilities between random variables.
✔ These directed links or arrows connect
the pair of nodes in the graph.
✔ In the diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.
✔ If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.
✔ Each node in the Bayesian network has condition probability distribution
P(Xi |Parent(Xi) ), which determines the effect of the parent on that node.
✔ Bayesian network is based on Joint probability distribution and conditional
probability.
Bayesian Belief Network cont..
Joint probability distribution
✔ If we have variables x1, x2, x3,....., xn, then the probabilities of different combination of
x1, x2, x3.. xn, are known as Joint probability distribution.
✔ The joint probability distribution of the variables are written as,
P[x1, x2, x3,....., xn]
= P[x1| x2, x3,....., xn]P[x2, x3,....., xn]
= P[x1| x2, x3,....., xn]P[x2|x3,....., xn]....P[xn-1|xn]P[xn].
✔ In general, P(Xi|Xi-1,........., X1) = P(Xi |Parents(Xi ))

✔ The joint probability distribution of the DAG,


P[A, B, C, D]
= P[A].P[B|A]. P[D|A]. P[C|B,D]
Illustration of Bayesian Belief Network
Example
✔ The joint probability distribution of the DAG,
▪ P[D, S, A, B, E]
= P[D | A ]. P[S | A]. P[A| B, E]. P[B]. P[E]

✔ Calculate the probability that alarm has sounded, but


there is neither a burglary, nor an earthquake occurred.
✔ Now, the joint probability distribution,
P(S, D, A, ¬B, ¬E)
= P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).
= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
✔ Hence, a Bayesian network can answer any query
about the domain by using Joint distribution.

Basics of Supervised Learning and Classification
Supervised learning

● Supervised learning, also known as supervised machine learning, is a subcategory of machine


learning and artificial intelligence.
● It is defined by its use of labeled data sets to train algorithms to classify data or predict
outcomes accurately.

How supervised learning works

● Supervised learning uses a training set to teach models to yield the desired output. This training
dataset includes inputs and correct outputs, which allow the model to learn over time.
● The algorithm measures its accuracy through the loss function, adjusting until the error has been
sufficiently minimized.

Supervised learning can be separated into two types of problems

● Classification and
● Regression: https://www.ibm.com/topics/supervised-learning
Classification
Classification

● Classification uses an algorithm to accurately


assign test data into specific categories.
● It recognizes specific entities within the dataset
and attempts to draw some conclusions on how
those entities should be labeled or defined.

Common classification algorithms are

● linear classifiers,
● support vector machines (SVM),
● decision trees,
● k-nearest neighbor, and
● random forest

https://www.ibm.com/topics/supervised-learning
Regression
Regression

● Regression is used to understand


the relationship between dependent
and independent variables.
● It is commonly used to make
projections, such as for sales
revenue for a given business.
● Linear regression, logistic
regression, and polynomial
regression are popular regression
algorithms.

https://www.ibm.com/topics/supervised-learning
Supervised learning algorithms
Neural networks:
● Primarily leveraged for deep learning algorithms, neural networks process
training data by mimicking the interconnectivity of the human brain through
layers of nodes.
● Each node is made up of inputs, weights, a bias (or threshold), and an
output.
● If that output value exceeds a given threshold, it “fires” or activates the
node, passing data to the next layer in the network.
● Neural networks learn this mapping function through supervised learning,
adjusting based on the loss function through the process of gradient
descent.
● When the cost function is at or near zero, we can be confident in the
model’s accuracy to yield the correct answer.

https://www.ibm.com/topics/supervised-learning
Supervised learning algorithms
Naive Bayes:
● Naive Bayes is a classification approach that adopts the principle of class
conditional independence from Bayes' Theorem.
● This means that the presence of one feature does not impact the presence of
another in the probability of a given outcome, and each predictor has an equal
effect on that result.
● There are three types of Naïve Bayes classifiers: Multinomial Naïve Bayes,
Bernoulli Naïve Bayes, and Gaussian Naïve Bayes.
● This technique is primarily used in text classification, spam identification, and
recommendation systems.

https://www.ibm.com/topics/supervised-learning
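As an illustration (not part of the original slides), a Gaussian Naïve Bayes classifier can be trained in a few lines with scikit-learn; the Iris data set is used purely as a stand-in.

```python
# Minimal sketch: Gaussian Naive Bayes with scikit-learn on a toy data set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()          # assumes features are conditionally independent and Gaussian
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```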
Supervised learning algorithms
Support vector machines (SVM):
● A support vector machine is a popular supervised learning model developed
by Vladimir Vapnik, used for both data classification and regression.
● That said, it is typically leveraged for classification problems, constructing a
hyperplane where the distance between two classes of data points is at its
maximum.
● This hyperplane is known as the decision boundary, separating the classes of
data points (e.g., oranges vs. apples) on either side of the plane.

https://www.ibm.com/topics/supervised-learning
Supervised learning algorithms
K-nearest neighbor:
● K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm
that classifies data points based on their proximity and association to other available
data.
● This algorithm assumes that similar data points can be found near each other.
● As a result, it seeks to calculate the distance between data points, usually through
Euclidean distance, and then it assigns a category based on the most frequent
category or average.
● Its ease of use and low calculation time make it a preferred algorithm for data
scientists, but as the test dataset grows, the processing time lengthens, making it
less appealing for classification tasks.
● KNN is typically used for recommendation engines and image recognition.

https://www.ibm.com/topics/supervised-learning
Business applications of Supervised learning
● Image- and object-recognition: Supervised learning algorithms can be used to locate,
isolate, and categorize objects out of videos or images, making them useful when applied to
various computer vision techniques and imagery analysis.
● Predictive analytics: A widespread use case for supervised learning models is in creating
predictive analytics systems to provide deep insights into various business data points. This
allows enterprises to anticipate certain results based on a given output variable, helping
business leaders justify decisions or pivot for the benefit of the organization.
● Customer sentiment analysis: Using supervised machine learning algorithms,
organizations can extract and classify important pieces of information from large volumes of
data—including context, emotion, and intent—with very little human intervention. This can be
incredibly useful when gaining a better understanding of customer interactions and can be
used to improve brand engagement efforts.
● Spam detection: Spam detection is another example of a supervised learning model. Using
supervised classification algorithms, organizations can train databases to recognize patterns
or anomalies in new data to organize spam and non-spam-related correspondences
effectively.

https://www.ibm.com/topics/supervised-learning
Supervised learning challenges:
The following are some of these challenges:
● Supervised learning models can require certain levels of expertise to structure
accurately.
● Training supervised learning models can be very time intensive.
● Datasets can have a higher likelihood of human error, resulting in algorithms
learning incorrectly.
● Unlike unsupervised learning models, supervised learning cannot cluster or
classify data on its own.

https://www.ibm.com/topics/supervised-learning
Introduction to K-Nearest Neighbor (KNN) Algorithm
Lesson Title: Introduction to K-Nearest Neighbor (KNN) Algorithm

Objective:

● Introduction to the concept of K-Nearest Neighbor (KNN) algorithm.


● Understand the working principle of KNN.
● Learn how to implement KNN for classification tasks.
Concept of K-Nearest Neighbor (KNN) algorithm
● K-Nearest Neighbor (KNN) is a simple and intuitive supervised learning
algorithm used for classification and regression tasks.
● In KNN, predictions are made based on the majority class or average value of
the k-nearest data points in the feature space.
● The algorithm operates on the principle that similar instances are likely to
belong to the same class or have similar values.
● KNN requires no explicit training phase, making it easy to implement and
understand.
Concept of K-Nearest Neighbor (KNN) algorithm
KNN is non-parametric
● It predicts the categorization of a new
sample point using data from many
classes. KNN is non-parametric since it
makes no assumptions about the data it is
analyzing.

KNN is a lazy learning algorithm


● KNN is a typical example of a lazy learner.
It is called lazy not because of its
apparent simplicity, but because it doesn't
learn a discriminative function from the
training data but memorizes the training
dataset instead.

https://towardsdatascience.com/how-to-find-the-optimal-value-of-k-in-knn-35d936e554eb
Distance Metrics of K-Nearest Neighbor

https://towardsdatascience.com/how-to-find-the-optimal-value-of-k-in-knn-35d936e554eb
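The distance formulas themselves appear as a figure in the original slides; as a stand-in, here is a short sketch of the three metrics most commonly used with KNN (Euclidean, Manhattan, and Minkowski).

```python
# Common KNN distance metrics, written out with NumPy.
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))          # L2 distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))                  # L1 distance

def minkowski(a, b, p=3):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)  # generalizes L1 (p=1) and L2 (p=2)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, p=2))
```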
Selection of Hyperparameter(k) of KNN algorithm
● The value of k in KNN determines the number of neighbors
considered for classification.
● Choosing the appropriate value of k is crucial and depends on
factors such as the dataset size, complexity, and noise level.
● A smaller value of k may lead to a more flexible decision
boundary but may be sensitive to noise, while a larger value of k
may lead to smoother decision boundaries but could over-smooth
the classification.
● The intuition behind KNN revolves around the idea of classifying
instances based on the collective information provided by their
nearest neighbors.
How to select the optimal K value
● Plot the error rate against K for values in a defined range, then choose the K
value with the minimum error rate.
KNN algorithm
1. Load the data
2. Set the value of K, which equals the number of
neighbors
3. Calculate the distance between the input sample and
the training samples
4. Add the distances to a data structure
5. Sort the data structure containing all the distances
in ascending order (smallest value first)
6. Assign the label according to the
majority of the K nearest neighbors
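A minimal from-scratch sketch of these six steps (illustrative only; Euclidean distance and a majority vote are assumed):

```python
# From-scratch KNN classifier following the six steps above.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 3: distance between the query sample and every training sample
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Steps 4-5: sort distances in ascending order and keep the k nearest indices
    nearest = np.argsort(distances)[:k]
    # Step 6: majority vote among the K nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5, 6]), k=3))  # -> 1
```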
Decision Tree
Objective

● To introduce undergraduate students to Decision Trees as a fundamental


machine learning algorithm and
● To impart understanding of its basic concepts, construction, and
applications.
Introduction to Decision Tree
Overview
● A decision tree falls under
supervised machine learning
● A decision tree consists of nodes, each
of which tests the data on an attribute
● The data is split at each node, so
these nodes are also called
"decision nodes"
● At the bottom of the tree are leaves,
which are the final outcomes
of the decision path
Introduction to Decision Tree
What is a Decision Tree? Example

● A decision tree is a tree-based supervised learning


method used to predict the output of a target variable.
Supervised learning uses labeled data (data with known
output variables) to make predictions with the help of
regression and classification algorithms.

● Supervised learning algorithms act as a supervisor for


training a model with a defined output variable. It learns
from simple decision rules using the various data
features.
Decision Tree: ID3 Algorithm
(Worked ID3 example shown as step-by-step figures in the original slides.)
https://www.youtube.com/watch?v=coOTEc-0OGw
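ID3 chooses the attribute to split on by maximizing information gain, i.e. the reduction in entropy. A minimal sketch of those two quantities (the toy labels and the "Outlook"-style attribute are hypothetical):

```python
# Entropy and information gain, the quantities ID3 uses to pick a split attribute.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return total - weighted

play = np.array(["yes", "yes", "no", "no", "yes", "no"])
outlook = np.array(["sunny", "rain", "sunny", "rain", "rain", "sunny"])
print(information_gain(play, outlook))
```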
CART Algorithm and Example
(Worked CART example shown as step-by-step figures in the original slides; the walkthrough
continues similarly for the remaining splits.)
https://www.youtube.com/watch?v=xyDv3DLYjfM
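CART selects splits using the Gini impurity rather than entropy. A short sketch of the weighted Gini index for a candidate split (labels and feature values are made up for illustration):

```python
# Gini impurity, the split criterion used by CART.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_of_split(labels, attribute_values):
    # Weighted Gini impurity of the partitions induced by an attribute.
    total = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        total += len(subset) / len(labels) * gini(subset)
    return total

labels = np.array(["yes", "yes", "no", "no", "yes", "no"])
feature = np.array(["a", "a", "a", "b", "b", "b"])
print(gini_of_split(labels, feature))   # lower is better
```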
Introduction to Ensemble Learning
Ensemble learning refers to algorithms that combine the predictions from two or more models.
Key principles of ensemble methods:
● Diversity: Ensemble models should have different biases and learn from different parts of
the data to reduce errors.
● Independence: The models in the ensemble should be trained independently to ensure
diversity.
● Aggregation: Combining the predictions of individual models to make a final prediction
using techniques like averaging or voting.
“Standard” ensemble learning strategies:
1) Bagging.
2) Stacking.
3) Boosting.
Ref. https://machinelearningmastery.com/tour-of-ensemble-learning-algorithms/
Introduction to Bagging
Bagging, also known as bootstrap aggregation, is an ensemble learning method that is commonly
used to reduce variance within a noisy data set.

● The name Bagging comes from the abbreviation of Bootstrap AGGregatING.
● As the name implies, the two key ingredients of Bagging are bootstrap sampling and
aggregation.
● Many popular ensemble algorithms are based on this approach, including Bagged Decision
Trees (canonical bagging), Random Forest, and Extra Trees.
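As an illustration (not from the original slides), bagged decision trees can be built with scikit-learn's BaggingClassifier; the synthetic data set below is a stand-in.

```python
# Minimal sketch: bagged decision trees with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The default base learner is a decision tree; each one is fit on a bootstrap sample.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bagging, X, y, cv=5).mean())
```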
Introduction to Stacking
Stacking is a general procedure where a learner is trained to combine the
individual learners. Here, the individual learners are called the first-level learners,
while the combiner is called the second-level learner, or meta-learner.

● Stacking has its own nomenclature where ensemble members


are referred to as level-0 models and the model that is used to
combine the predictions is referred to as a level-1 model.
● Stacking is probably the most-popular meta-learning
technique. By using a meta-learner, this method tries to induce
which classifiers are reliable and which are not.
● Using trainable combiners, it is possible to determine which
classifiers are likely to be successful in which part of the
feature space and combine them accordingly.
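A minimal scikit-learn sketch of stacking, with two level-0 learners and a logistic-regression level-1 meta-learner (purely illustrative):

```python
# Minimal sketch: stacking two level-0 learners with a level-1 meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],  # level-0 models
    final_estimator=LogisticRegression(),                         # level-1 meta-learner
)
print(cross_val_score(stack, X, y, cv=5).mean())
```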
Introduction to Boosting
In boosting, the training dataset for each subsequent classifier increasingly
focuses on instances misclassified by previously generated classifiers.

● The term boosting refers to a


family of algorithms that are able
to convert weak learners to strong
learners.
● Popular ensemble algorithms are
based on this approach, including:
● AdaBoost (canonical boosting)
● Gradient Boosting Machines
● Stochastic Gradient Boosting (XGBoost and
similar)
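As a hedged illustration, canonical boosting (AdaBoost) with scikit-learn; the default weak learner is a depth-1 decision tree ("stump"), and the data set is synthetic.

```python
# Minimal sketch: AdaBoost, which reweights training instances that earlier
# weak learners misclassified.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

boost = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())
```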
Introduction to Random Forest
● Random Forest is a classifier that
contains a number of decision trees on
various subsets of the given dataset
and takes the average to improve the
predictive accuracy of that dataset.

● The greater number of trees in the


forest leads to higher accuracy and
prevents the problem of overfitting.
Introduction to Random Forest
The working process can be explained in the
following steps (a code sketch follows the application list below):

● Step-1: Select random K data points from the training set.
● Step-2: Build the decision trees associated with the selected data points (subsets).
● Step-3: Choose the number N of decision trees that you want to build.
● Step-4: Repeat Steps 1 & 2.
● Step-5: For new data points, find the predictions of each decision tree, and
assign the new data points to the category that wins the majority vote.

Ref. https://www.javatpoint.com/machine-learning-random-forest-algorithm


Introduction to Random Forest
There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of
loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Ref.
https://www.javatpoint.com/machine-learning-random-forest-algorithm
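A minimal scikit-learn sketch of the Random Forest workflow described above (synthetic data used only for illustration):

```python
# Minimal sketch: Random Forest classification with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # N: number of decision trees in the forest
    max_features="sqrt",  # each split considers a random subset of features
    random_state=0,
)
forest.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```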
Introduction to Support Vector Machine
● A support vector machine (SVM) is a supervised
machine learning algorithm that classifies data by
finding an optimal line or hyperplane that
maximizes the distance between each class in an
N-dimensional space.
● SVMs were developed in the 1990s by Vladimir N.
Vapnik and his colleagues, who published this
work in a 1995 paper titled "Support Vector Method for
Function Approximation, Regression Estimation,
and Signal Processing".
● The SVM algorithm is widely used in machine
learning as it can handle both linear and nonlinear
classification tasks.
● The importance of SVM in machine learning is that it
has the ability to handle high-dimensional data and
complex decision boundaries.
Mathematical Foundations of SVM
● Suppose there are n-dimensional sample
vectors in a region; then there is a
hyperplane which divides the samples into
two categories.
● The hyperplane may exist in different
forms, and the one that maximizes the
minimum distance to the two types of
samples is called the optimal hyperplane.
Mathematical Foundations of SVM
● Figure 1 shows that the sum of two types
of sample distances from the hyperplane is
2/||w|| and the hyperplane margin is equal
to 2/||w||.
● Also, any training tuples that fall on
hyperplanes H1 or H2, i.e., the sides
defining the margin, are the support
vectors, as shown in Fig. 1.
● Thus, the problem is the maximization of the
margin, which is achieved by minimizing ||w||/2
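For reference, the standard hard-margin primal formulation that this corresponds to (not written out explicitly on the slide) is:

```latex
% Hard-margin SVM primal problem (standard formulation)
\min_{\mathbf{w},\, b} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n
```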
SVM Types
Linear SVM:

● When the data is perfectly linearly separable,
only then can we use Linear SVM. Perfectly
linearly separable means that the data points
can be classified into 2 classes by using a single
straight line (in 2D).

Ref.
https://www.ibm.com/topics/support-vector-machine

https://towardsdatascience.com/the-kernel-trick-c98cd
bcaeb3f
SVM Types
Non-Linear SVM:

● Much of the data in real-world scenarios are not linearly separable,


and that’s where nonlinear SVMs come into play. In order to make
the data linearly separable, preprocessing methods are applied to
the training data to transform it into a higher-dimensional feature
space.
● The “kernel trick” helps to reduce some of that complexity, making
the computation more efficient, and it does this by replacing dot
product calculations with an equivalent kernel function.

Some popular kernel functions include:

● Polynomial kernel
● Radial basis function kernel (also known as a Gaussian or RBF
kernel)
● Sigmoid kernel
SVM: kernel trick
● The kernel trick provides a
solution to this problem. The “trick”
is that kernel methods represent the
data only through a set of pairwise
similarity comparisons between the
original data observations x (with
the original coordinates in the lower
dimensional space), instead of
explicitly applying the
transformations ϕ(x) and
representing the data by these
transformed coordinates in the
higher dimensional feature space.
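A minimal scikit-learn sketch of a nonlinear SVM using the RBF kernel; the kernel trick is applied internally by the library, and the two-moons data set is purely illustrative.

```python
# Minimal sketch: nonlinear SVM with the RBF (Gaussian) kernel.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Two interleaving half-moons: not linearly separable in the original space.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # kernel trick: no explicit phi(x) is computed
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```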
Application of SVM:
Some common applications of SVM are-

● Face detection – SVMs classify parts of the image as a face and non-face and create a square
boundary around the face.
● Text and hypertext categorization – SVMs allow Text and hypertext categorization for both
inductive and transductive models. They use training data to classify documents into different
categories. It categorizes on the basis of the score generated and then compares with the threshold
value.
● Classification of images – Use of SVMs provides better search accuracy for image classification. It
provides better accuracy in comparison to the traditional query-based searching techniques.
● Bioinformatics – It includes protein classification and cancer classification. We use SVM for
identifying the classification of genes, patients on the basis of genes and other biological problems.
● Protein fold and remote homology detection – Apply SVM algorithms for protein remote
homology detection.
● Handwriting recognition – SVMs are widely used to recognize handwritten characters.
● Generalized predictive control(GPC) – Use SVM based GPC to control chaotic dynamics with
useful parameters.
Simple Linear Regression
● Simple linear regression aims to
find a linear relationship to describe
the correlation between an
independent variable and a
dependent variable.
● The regression line can be used to
predict or estimate missing values;
this is known as interpolation.
Simple Linear Regression
● Linear regression is defined as a
statistical method used for modeling
the relationship between a dependent
variable (target) and one or more
independent variables (features).
● The intercept is represented by a
● The slope is represented by b
● The error term ε is assumed to be
normally distributed with mean zero
and constant variance.
The equation of simple linear regression is:
y = a + b·x + ε
Example of Linear Regression

Assumptions of simple linear regression:
● Linearity: The relationship between the dependent and independent
variables is linear.
● Independence: Observations are independent of each other.
● Homoscedasticity: The variance of the error term is constant across all
levels of the independent variable.
● Normality: The error term follows a normal distribution.

Real-world applications of simple linear regression in various domains include:
● Predicting sales based on advertising expenditure
● Estimating house prices based on square footage
● Analyzing the relationship between temperature and energy consumption
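A minimal sketch of fitting the model y = a + b·x by ordinary least squares (the data values are made up for illustration):

```python
# Simple linear regression y = a + b*x fitted by ordinary least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])   # dependent variable

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
a = y.mean() - b * x.mean()                                                # intercept
print(f"y = {a:.2f} + {b:.2f} x")

# Interpolation: predict y at an unseen x value within the observed range.
print("prediction at x = 2.5:", a + b * 2.5)
```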
Multiple linear regression (MLR)
● Multiple linear regression (MLR) is used to determine a mathematical relationship among several random variables.

● In other terms, MLR examines how multiple independent variables are related to one dependent variable.

Y = b0 + b1X1 + b2X2 + b3X3 + ... + bnXn

● Where Y is a continuous measurement outcome (e.g., BMI),

● b0 is the "intercept" or starting value.

● X1, X2, X3, etc. are the values of independent predictor variables (i.e., risk factors), and

● b1, b2, b3, etc. are the coefficients for each risk factor.

● Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses

several explanatory variables to predict the outcome of a response variable.


Multiple linear regression (MLR)
Logistic Regression
● Logistic regression is
named for the function
used at the core of the
method, the logistic
function.

● The logistic function, also


called the sigmoid
function is an S-shaped
curve that can take any
real-valued number and
map it into a value
between 0 and 1, but
never exactly at those
limits.
Logistic Regression
The Sigmoid function in a Logistic Regression Model is formulated as

σ(x) = 1 / (1 + e^(−x))

where e is the base of the natural log and x corresponds to the real numerical value you
want to transform.

Ref. https://tutorialforbeginner.com/linear-regression-vs-logistic-regression-in-machine-learning
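An illustrative sketch of the sigmoid and a scikit-learn logistic regression fit (the breast-cancer data set is a stand-in):

```python
# The sigmoid function and a logistic regression classifier.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # maps any real number into (0, 1)

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # ~[0.018, 0.5, 0.982]

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=5000)      # outputs class probabilities via the sigmoid
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```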
Logistic Regression
Basics of Unsupervised Learning
● Unsupervised learning, also known
as unsupervised machine learning,
uses machine learning (ML)
algorithms to analyze and cluster
unlabeled data sets.
● These algorithms discover hidden
patterns or data groupings without
the need for human intervention.
● Unsupervised learning models are
utilized for three main
tasks—clustering, association, and
dimensionality reduction.
Basics of Unsupervised Learning
● Unsupervised learning models are
utilized for three main
tasks—clustering, association, and
dimensionality reduction.
● Clustering is a data mining
technique which groups unlabeled
data based on their similarities or
differences. Clustering algorithms
are used to process raw,
unclassified data objects into
groups represented by structures
or patterns in the information.
K-means clustering
● K-means clustering is a common example of an
exclusive clustering method where data points are
assigned into K groups, where K represents the
number of clusters based on the distance from
each group’s centroid.
● The data points closest to a given centroid will be
clustered under the same category.
● A larger K value will be indicative of smaller
groupings with more granularity whereas a smaller
K value will have larger groupings and less
granularity.
● K-means clustering is commonly used in market
segmentation, document clustering, image
segmentation, and image compression.

Ref. https://www.ibm.com/topics/unsupervised-learning
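A minimal scikit-learn sketch of K-means on synthetic data (K = 3 is chosen purely for illustration):

```python
# Minimal sketch: K-means clustering with scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # each point is assigned to the nearest centroid
print("Centroids:\n", kmeans.cluster_centers_)
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)
```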
Hierarchical clustering(HC)
There are two type of HC:
agglomerative and divisive.

In agglomerative clustering (bottom-up
approach), we start by considering
each data point as its own cluster and,
in each iteration, merge the
closest clusters until only one
cluster remains. In other words,
it starts by treating the individual data points as single clusters and then
merges them continuously based on similarity until one big cluster containing all
objects is formed.
Ref. https://medium.com/leukemiaairesearch/clustering-techniques-with-gene-expression-data-4b35a04f87d5
Hierarchical clustering(HC)
● Agglomerative clustering requires a
definition of linkage, i.e. how to calculate
the distance between two clusters in the
case where a cluster contains more than
one sequence.
● The most commonly used definitions are
minimum distance (single linkage),
maximum distance (complete linkage),
and average linkage (also called UPGMA
(Unweighted Pair Group Method with
Arithmetic Mean)).
Ref. https://medium.com/leukemiaairesearch/clustering-techniques-with-gene-expression-data-4b35a04f87d5
Hierarchical clustering(HC)
In divisive clustering (top-down
approach), all the data points initially
belong to a single cluster; at each
iteration we split off the farthest points
until each cluster contains a
unique observation.
Ref. https://medium.com/leukemiaairesearch/clustering-techniques-with-gene-expression-data-4b35a04f87d5
Agglomerative clustering Examples

(Worked dendrogram examples are shown as figures in the original slides.)
Ref. https://medium.com/leukemiaairesearch/clustering-techniques-with-gene-expression-data-4b35a04f87d5
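As an illustrative sketch (not from the slides), agglomerative clustering with the different linkage definitions mentioned above in scikit-learn:

```python
# Minimal sketch: agglomerative clustering with different linkage criteria.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

for linkage in ("single", "complete", "average"):   # min, max, and UPGMA-style linkage
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, "-> cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```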
Understanding Association in Machine Learning
Important real-world applications of association analysis include market
basket analysis, recommendation systems, and healthcare analytics.

Apriori algorithm is one of the most popular algorithms for association rule mining.
Association Rule Mining – Apriori Algorithm
(Worked Apriori example shown as figures in the original slides.)
https://www.youtube.com/watch?v=43CMKRHdH30
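The core quantities Apriori works with are support and confidence. A minimal from-scratch sketch, with made-up transactions used only for illustration:

```python
# Support and confidence, the measures used by the Apriori algorithm.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # confidence(A -> B) = support(A union B) / support(A)
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print("support({bread, milk}) =", support({"bread", "milk"}))        # 0.6
print("confidence(bread -> milk) =", confidence({"bread"}, {"milk"}))  # 0.75
```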
Activation Functions
● The biological equivalent of a neuron becomes a node, connected at its input
end and output end to other nodes, just like synapses in a brain.
● The chemical signals become mathematical values given to the input, and the
output chemical signals become mathematical in nature as well.
● As for the threshold that holds back the neuron firing until it receives the
appropriate input, that is converted into a bias.
● The bias is a numerical value that the combined, weighted inputs must exceed in order for the
node to produce an output, and it also determines how large that output is.
● As the network “learns” each input to a node is weighted with different
modifying values that increase or decrease how each individual input is
calculated when combined then compared to the bias.
● These weights change each time the network learns, as well as the bias they
are compared to.
Activation Functions
Why we use Activation functions with Neural Networks?
● It is used to determine the output of neural network like yes or no. It maps the
resulting values in between 0 to 1 or -1 to 1 etc. (depending upon the function).

The Activation Functions can be basically divided into 2 types-


● Linear Activation Function
● Non-linear Activation Functions
Types of Activation Functions

Ref. https://medium.com/@BenDosch/ml-activation-functions-f851fd6334d2
Types of Activation Functions
ReLU (Rectified Linear Unit)
Activation Function

● The ReLU is the most used
activation function in the world
right now.
● It is used in almost all
convolutional neural networks and
deep learning models.
● That means any negative input given to the ReLU
activation function turns the value into zero
immediately, which in turn affects the
resulting graph by not mapping the negative values
appropriately.
● But the issue is that all the
negative values become zero
immediately, which decreases the
ability of the model to fit or train
from the data properly.

Ref. https://medium.com/@BenDosch/ml-activation-functions-f851fd6334d2
Types of Activation Functions
Leaky ReLU
● It is an attempt to solve the dying ReLU problem

● The leak helps to increase the range of the


ReLU function. Usually, the value of a is 0.01 or
so.

● When a is not 0.01 then it is called Randomized


ReLU.

● Therefore the range of the Leaky ReLU is


(-infinity to infinity).

● Both Leaky and Randomized ReLU functions are


monotonic in nature, and their derivatives are
also monotonic.

Ref. https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
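A short NumPy sketch of the activation functions discussed above (the leak coefficient a = 0.01 follows the slide):

```python
# Common activation functions, including ReLU and Leaky ReLU.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # negative inputs become 0 (the "dying ReLU" issue)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)  # small slope a keeps gradients alive for x < 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:   ", sigmoid(x))
print("relu:      ", relu(x))
print("leaky relu:", leaky_relu(x))
```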
PERCEPTRON and Its Application in Machine Learning

Lesson_4 Objectives are to understand


★ Biological Neuron Vs Artificial Neuron
★ PERCEPTRON
★ PERCEPTRON Learning Rule
★ Application of PERCEPTRON Learning Rule
★ Single-layer PERCEPTRON
★ Multi-layer PERCEPTRON
★ Test Your Skills: MCQ
Biological Neuron Vs Artificial Neuron

Biological Neuron Vs Artificial Neuron


● Biological neurons are cells that receive
input signals from other neurons through
dendrites.

● Artificial neurons are mathematical


models that are inspired by biological
neurons.
Biological Neuron Vs Artificial Neuron cont..
Biological Neuron Vs Artificial Neuron

● Biological neurons have a complex,


organic structure, while artificial neurons
have a simple, mathematical structure.

● Biological neurons receive input signals


from other neurons through dendrites,
while artificial neurons receive inputs that
can be excitatory or inhibitory.
PERCEPTRON

● In Machine Learning,
PERCEPTRON is considered as a
single-layer neural network

● It consists of four main


parameters named input values
(Input nodes), weights & Bias, net
sum, and an activation function.
PERCEPTRON Learning Rule
● The PERCEPTRON receives multiple
input signals, and if the sum of the input
signals exceeds a certain threshold, it
either outputs a signal or does not return
an output.
● In the context of supervised learning and
classification, this can then be used to
predict the class of a sample.
● Perceptron convergence theorem can
only be applied, if and only if two
classes are linearly separable.
PERCEPTRON Learning Rule cont..
Application of PERCEPTRON Learning Rule
● A single-layer PERCEPTRON can solve only
linearly separable patterns (e.g. solving the
2-input AND and OR functions).
Just take b = -1.5 for AND and b = -0.5 for OR,
with w1 = 1 & w2 = 1.
● A single-layer perceptron model consists of a
feed-forward network and also includes a
threshold transfer function inside the model.
The main objective of the single-layer
perceptron model is to analyze linearly
separable objects with binary outcomes.
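A minimal sketch of a perceptron with the weights quoted above computing AND and OR on binary inputs (a step activation is assumed):

```python
# Single-layer perceptron with fixed weights w1 = w2 = 1 and a step activation,
# using b = -1.5 for AND and b = -0.5 for OR (as quoted on the slide).
def perceptron(x1, x2, w1=1.0, w2=1.0, b=-1.5):
    net = w1 * x1 + w2 * x2 + b
    return 1 if net > 0 else 0        # step (threshold) activation

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2,
          "AND:", perceptron(x1, x2, b=-1.5),
          "OR:", perceptron(x1, x2, b=-0.5))
```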
Application of PERCEPTRON Learning Rule cont..
● A multi-layer PERCEPTRON can solve non-linearly separable
patterns (e.g. solving the 2-input XOR function)
Test Your Skills: MCQ on PERCEPTRON

1. What is perceptron?
a) a single layer feed-forward neural network with
pre-processing
b) an auto-associative neural network
c) a double layer auto-associative neural network
d) a neural network that contains feedback
Test Your Skills: MCQ on PERCEPTRON

2. A perceptron is a _________
A) Backtracking algorithm
B) Backpropagation algorithm
C) Feed-forward neural network
D) Feed Forward-backward algorithm
Test Your Skills: MCQ on PERCEPTRON

3. What is the objective of perceptron learning?


a) class identification
b) weight adjustment
c) adjust weight along with class identification
d) none of the mentioned
Test Your Skills: MCQ on PERCEPTRON

4. In perceptron learning, what happens when


input vector is correctly classified?
a) small adjustments in weight is done
b) large adjustments in weight is done
c) no adjustments in weight is done
d) weight adjustments doesn’t depend on
classification of input vector
Test Your Skills: MCQ on PERCEPTRON

5. When two classes can be separated by a separate


line, they are known as?
a) linearly separable
b) linearly inseparable classes
c) may be separable or inseparable, it depends on
system
d) none of the mentioned
McCulloch-Pitts Neuron
● The McCulloch–Pitt neural
network is considered to be
the first neural network.

● The neurons are connected


by directed weighted paths.

● McCulloch–Pitt neuron
allows binary activation (1
ON or 0 OFF), i.e., it either
fires with an activation 1 or
does not fire with an
activation of 0.
McCulloch-Pitts Neuron: AND function

https://medium.com/analytics-vidhya/mp-neuron-and-perceptron-model-with-sample-code-c2189edebd3f
McCulloch-Pitts Neuron: OR function
ANN Architecture
● Interconnection can be defined as the way processing elements (Neuron) in
ANN are connected to each other. Hence, the arrangements of these
processing elements and geometry of interconnections are very essential
in ANN.
● These arrangements always have two layers that are common to all
network architectures, the Input layer and output layer where the input
layer buffers the input signal, and the output layer generates the output of
the network.
● The third layer is the Hidden layer, in which neurons are neither kept in the
input layer nor in the output layer. These neurons are hidden from the
people who are interfacing with the system and act as a black box to them.
● By increasing the hidden layers with neurons, the system’s computational
and processing power can be increased but the training phenomena of the
system get more complex at the same time.
https://www.geeksforgeeks.org/introduction-to-ann-set-4-network-architectures/
ANN Architecture
There exist five basic types of neuron connection architecture :

1. Single-layer feed-forward network


2. Multilayer feed-forward network
3. Single node with its own feedback
4. Single-layer recurrent network
5. Multilayer recurrent network

https://www.geeksforgeeks.org/introduction-to-ann-set-4-network-architectures/
ANN Architecture: Single-layer feed-forward network

In this type of network, we have only


two layers input layer and the output
layer but the input layer does not
count because no computation is
performed in this layer. The output
layer is formed when different
weights are applied to input nodes
and the cumulative effect per node is
taken. After this, the neurons
collectively give the output layer to
compute the output signals.

https://www.geeksforgeeks.org/introduction-to-ann-set-4-network-architectures/
ANN Architecture: Multilayer feed-forward network

This type of network also has a hidden layer that is
internal to the network and has no direct
contact with the external layer. The
existence of one or more hidden layers
enables the network to be
computationally stronger. It is a feed-forward
network because information flows from
the input, through the intermediate
computations, to determine the output Z.
There are no feedback connections in
which outputs of the model are fed back
into itself.

https://www.geeksforgeeks.org/introduction-to-ann-set-4-network-architectures/
ANN Architecture: Single node with its own feedback

When outputs can be directed back as


inputs to the same layer or preceding
layer nodes, then it results in feedback
networks. Recurrent networks are
feedback networks with closed loops.
The above figure shows a single
recurrent network having a single neuron
with feedback to itself.

https://www.geeksforgeeks.org/introduction-to-ann-set-4-network-architectures/
ANN Architecture: Single-layer recurrent network

The above network is a single-layer


network with a feedback connection in
which the processing element’s output can
be directed back to itself or to another
processing element or both. A recurrent
neural network is a class of artificial neural
networks where connections between
nodes form a directed graph along a
sequence. This allows it to exhibit dynamic
temporal behavior for a time sequence.
Unlike feedforward neural networks, RNNs
can use their internal state (memory) to
process sequences of inputs.

https://www.geeksforgeeks.org/introduction-to-ann-set-4-network-architectures/
ANN Architecture: Multilayer recurrent network

In this type of network, processing


element output can be directed to the
processing element in the same layer and
in the preceding layer forming a
multilayer recurrent network. They
perform the same task for every element
of a sequence, with the output being
dependent on the previous computations.
Inputs are not needed at each time step.
The main feature of a Recurrent Neural
Network is its hidden state, which
captures some information about a
sequence.
https://www.geeksforgeeks.org/introduction-to-ann-set-4-network-architectures/
Learning Process in ANN: Forward Propagation

Ref. https://www.linkedin.com/pulse/demystifying-forward-propagation-neural-networks-real-world-v/
Learning Process in ANN: Forward Propagation Steps
Forward propagation involves several key steps:

1. Input Layer: The process begins with the input layer, where data is fed into
the neural network.
2. Weighted Sum: Each connection between neurons in adjacent layers has
an associated weight. Forward propagation computes the weighted sum of
inputs.
3. Bias Addition: A bias term is added to the weighted sum. This helps in
shifting the activation function's input and introducing non-linearity.
4. Activation Function: The weighted sum plus bias is passed through an
activation function, which introduces non-linearity into the network and
determines the neuron's output.
5. Output Layer: This process is repeated for each layer in the network until
the final output layer is reached.
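A minimal NumPy sketch of these steps for one hidden layer (the weights, biases, and layer sizes are arbitrary illustrations):

```python
# Forward propagation through a tiny network: input -> hidden layer -> output.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))                        # Step 1: input layer (3 features)

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer: 1 neuron

z1 = W1 @ x + b1                                 # Steps 2-3: weighted sum plus bias
a1 = sigmoid(z1)                                 # Step 4: activation function
z2 = W2 @ a1 + b2                                # Step 5: repeat for the output layer
output = sigmoid(z2)
print("Network output:", output)
```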
Learning Process in ANN: Backward Propagation
Backpropagation is an algorithm used in artificial intelligence (AI) to
fine-tune mathematical weight functions and improve the accuracy of an
artificial neural network’s outputs.
Optimization Algorithms
Optimization Algorithms

Gradient Descent (GD):


● Gradient Descent is a first-order optimization algorithm used to
minimize the cost function by iteratively moving in the direction
of the steepest descent of the cost function's gradient.
Stochastic Gradient Descent (SGD):
● Stochastic Gradient Descent is an extension of Gradient
Descent where the gradient is computed using a single training
example randomly selected from the dataset. It is faster but
noisier compared to batch Gradient Descent.
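As a hedged illustration of the update rule (not from the slides), gradient descent minimizing a one-parameter quadratic cost:

```python
# Gradient descent on a simple quadratic cost J(w) = (w - 3)^2.
def gradient(w):
    return 2.0 * (w - 3.0)     # dJ/dw

w = 0.0                        # initial parameter value
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * gradient(w)   # move against the gradient (steepest descent)

print("w after gradient descent:", round(w, 4))   # converges toward the minimum at w = 3

# Stochastic Gradient Descent would instead estimate the gradient from a single
# randomly chosen training example at each step, trading noise for speed.
```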
Optimization Algorithms

Mini-batch Gradient Descent:


● Mini-batch Gradient Descent is a compromise between batch
Gradient Descent and Stochastic Gradient Descent, where the
gradient is computed using a small subset of the training data
(mini-batch).
Adam (Adaptive Moment Estimation):
● Adam is an adaptive learning rate optimization algorithm that
combines the advantages of both AdaGrad and RMSProp. It
computes individual adaptive learning rates for different
parameters by storing exponentially decaying averages of past
gradients and squared gradients.
Optimization Algorithms

AdaGrad (Adaptive Gradient Algorithm):


AdaGrad is an optimization algorithm that adapts the learning rate for each
parameter by scaling it inversely proportional to the square root of the sum
of the squares of the gradients accumulated over all the previous time
steps.
RMSProp (Root Mean Square Propagation):
RMSProp is an adaptive learning rate optimization algorithm that divides
the learning rate by an exponentially decaying average of squared gradients.
It helps to normalize the gradients and prevent the learning rate from
decreasing too rapidly for frequently occurring features.
Optimization Algorithms

Adadelta:
● Adadelta is an extension of AdaGrad that addresses its aggressive,
monotonically decreasing learning rates. It uses a sliding window of
gradients to compute an exponentially decaying average, and it
updates the parameters based on the root mean square of recent
updates.
Nadam (Nesterov-accelerated Adaptive Moment Estimation):
● Nadam is an extension of Adam that incorporates Nesterov
momentum into its optimization process. It combines the advantages
of Nesterov momentum and Adam's adaptive learning rate
Optimization Algorithms

Adaptive Moment Estimation (AMSGrad):


● AMSGrad is a modification of Adam that addresses its tendency
to converge to suboptimal points by updating the moving
average of squared gradients more conservatively.
LBFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno):
● LBFGS is a quasi-Newton optimization algorithm used for
optimizing smooth non-linear functions. It approximates the
inverse Hessian matrix without explicitly computing it, making it
memory efficient for large-scale problems.
HMM

● The fundamental assumption behind using HMMs for modeling sequential data is
that there exists an underlying sequence of hidden states that generates the
observed data.
● In HMMs, observations are the data points that we can directly observe or
measure. These observations could be discrete symbols (e.g., words in a
sentence, nucleotides in a DNA sequence) or continuous values (e.g., sensor
readings, stock prices).
● Hidden Markov Models assume the existence of a sequence of hidden states that
are not directly observable. These hidden states represent the underlying structure
or dynamics of the system generating the observed data.
HMM: Markov chain

A Markov chain consists of three important


components:
● Initial probability distribution: An initial
probability distribution over states, πi is the
probability that the Markov chain will start
in a certain state i. Some states j may have
πj = 0, meaning that they cannot be initial
states
● One or more states
● Transition probability distribution: A
transition probability matrix A where
each aij represents the probability of
moving from state i to state j
HMM

HMM is a statistical model in which the system being
modeled is assumed to be a Markov process with unobserved or hidden
states. It is a hidden variable model which can give an
observation of another hidden state with the help of the
Markov assumption. The hidden state is the term given to
the next possible variable which cannot be directly
observed but can be inferred by observing one or more
states according to Markov’s assumption. Markov
assumption is the assumption that a hidden variable is
dependent only on the previous hidden
state. Mathematically, the probability of being in a state at
a time t depends only on the state at the time (t-1). It is
termed a limited horizon assumption. Another Markov
assumption states that the conditional distribution over
the next state, given the current state, doesn’t change
over time. This is also termed a stationary process
assumption.
HMM Example

The diagram below represents a Markov chain where there are three states representing the
weather of the day (cloudy, rainy, and sunny), and there are transition probabilities
representing the weather of the next day given the weather of the current day.

The following are the transition probabilities based on the diagram:

● If sunny today, then tomorrow:
  ● 50% probability for sunny
  ● 10% probability for rainy
  ● 40% probability for cloudy
● If rainy today, then tomorrow:
  ● 10% probability for sunny
  ● 60% probability for rainy
  ● 30% probability for cloudy
● If cloudy today, then tomorrow:
  ● 40% probability for sunny
  ● 50% probability for rainy
  ● 10% probability for cloudy
HMM Example
Using this Markov chain, what is the probability that Wednesday will be cloudy if today
(Monday) is sunny? The following are the different transitions that can result in a cloudy
Wednesday given that Monday is sunny:

● Sunny – Sunny (Tuesday) – Cloudy (Wednesday): the probability of a cloudy Wednesday
  via this path is 0.5 x 0.4 = 0.2
● Sunny – Rainy (Tuesday) – Cloudy (Wednesday): the probability of a cloudy Wednesday
  via this path is 0.1 x 0.3 = 0.03
● Sunny – Cloudy (Tuesday) – Cloudy (Wednesday): the probability of a cloudy Wednesday
  via this path is 0.4 x 0.1 = 0.04

The total probability of a cloudy Wednesday = 0.2 + 0.03 + 0.04 = 0.27.

As shown above, the Markov chain is a process with a known finite number of states in which
the probability of being in a particular state is determined only by the previous state.
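The same calculation can be done with a transition matrix. A minimal sketch, with the state order sunny, rainy, cloudy and the probabilities quoted above:

```python
# Two-step transition probability for the weather Markov chain above.
import numpy as np

states = ["sunny", "rainy", "cloudy"]
# Rows: today's state; columns: tomorrow's state (same numbers as the slide).
P = np.array([
    [0.5, 0.1, 0.4],   # sunny  -> sunny / rainy / cloudy
    [0.1, 0.6, 0.3],   # rainy  -> sunny / rainy / cloudy
    [0.4, 0.5, 0.1],   # cloudy -> sunny / rainy / cloudy
])

two_step = P @ P       # probabilities two days ahead
print("P(cloudy on Wednesday | sunny on Monday) =", round(two_step[0, 2], 2))  # 0.27
```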
