AIML
What is Machine Learning?
In the real world, we are surrounded by humans who can learn from their experiences, and we have computers or machines that simply follow our instructions. But can a machine also learn from experience or past data the way a human does? This is where Machine Learning comes in.
Machine learning algorithms build a mathematical model from sample historical data, known as training data, that helps make predictions or decisions without being explicitly programmed. Machine learning brings together statistics and computer science for the purpose of developing predictive models. Algorithms that learn from historical data are either constructed or applied in machine learning, and their performance improves in proportion to the amount of data we provide.
A machine can learn if it can improve its performance by gaining more data.
Let's say we have a complex problem in which we need to make predictions. Instead of writing code by hand, we simply feed the data to generic algorithms, which build the logic from the data and predict the output. Machine learning has changed our perspective on such problems. The operation of a machine learning algorithm is depicted in the following block diagram:
Features of Machine Learning:
o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining in that both deal with huge amounts of data.
We can train machine learning algorithms by providing them with a large amount of data and allowing them to automatically explore the data, build models, and predict the required output. A cost function measures how well the model fits the data, so performance depends on both the amount of data and that measure. We can save both time and money by using machine learning.
Machine learning is broadly classified into the following types:
1) Supervised Learning
In supervised learning, sample labeled data are provided to the machine learning
system for training, and the system then predicts the output based on the training data.
The system uses labeled data to build a model that understands the datasets and
learns about each one. After the training and processing are done, we test the model
with sample data to see if it can accurately predict the output.
The mapping of the input data to the output data is the objective of supervised learning.
Supervised learning relies on supervision, much as a student learns under the guidance of a teacher. Spam filtering is an example of supervised learning. Supervised learning problems fall into two categories:
o Classification
o Regression
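As a minimal illustration of this supervised workflow (a sketch assuming scikit-learn is available; the Iris dataset and the k-nearest-neighbours classifier are just convenient choices, not prescribed by the text):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# labeled examples: features X and correct labels y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                             # learn from the labeled training data
print("test accuracy:", model.score(X_test, y_test))    # evaluate on held-out labeled data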
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision. The goal of unsupervised learning is to restructure the input data into new
features or a group of objects with similar patterns.
o Clustering
o Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to collect the maximum reward, and it improves its performance accordingly.
A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
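A minimal sketch of this reward-driven loop, using an epsilon-greedy multi-armed bandit (the reward probabilities below are made-up values for illustration only):

import random

true_reward_prob = [0.2, 0.5, 0.8]      # hidden quality of each action (assumed values)
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1                            # how often the agent explores a random action

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)                          # explore
    else:
        action = max(range(3), key=lambda a: estimates[a])    # exploit the best-known action
    reward = 1 if random.random() < true_reward_prob[action] else 0   # reward or penalty
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]  # running average

print("learned value estimates:", [round(e, 2) for e in estimates])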
Note: We will learn about the above types of machine learning in detail in later chapters.
o 1834: In 1834, Charles Babbage, the father of the computer, conceived a device that could be programmed with punch cards. Although the machine was never built, all modern computers rely on its logical structure.
o 1936: In 1936, Alan Turing put forward a theory of how a machine can determine and execute a set of instructions.
o 1940: In 1940, the first manually operated computer, "ENIAC" was invented,
which was the first electronic general-purpose computer. After that stored
program computer such as EDSAC in 1949 and EDVAC in 1951 were invented.
o 1943: In 1943, a human neural network was first modeled with an electrical circuit. In 1950, scientists began putting this idea to work and analyzed how human neurons might function.
o 1952: Arthur Samuel, a pioneer of machine learning, created a program that helped an IBM computer play checkers. It performed better the more it played.
o 1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.
o The period from 1974 to 1980 was a tough time for AI and ML researchers, and it came to be called the AI winter.
o During this period, machine translation failed to deliver on its promises, people lost interest in AI, and government funding for research was cut.
o 1959: In 1959, the first neural network was applied to a real-world problem to
remove echoes over phone lines using an adaptive filter.
o 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural
network NETtalk, which was able to teach itself how to correctly pronounce
20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match against world champion Garry Kasparov, becoming the first computer to beat a reigning human world chess champion.
Machine Learning in the 21st century
2006:
o Geoffrey Hinton and his group presented the idea of deep learning using deep belief networks.
o The Elastic Compute Cloud (EC2) was launched by Amazon to provide scalable
computing resources that made it easier to create and implement machine
learning models.
2016-2017:
o Explainable AI, which focuses on making machine learning models easier to understand, began to receive significant attention.
o Google's DeepMind created AlphaGo Zero, which achieved superhuman Go-playing ability without any human game data, using only reinforcement learning.
Present-day AI models can be used to make many different predictions, including weather prediction, disease prediction, stock market analysis, and so on.
1. Supervised Learning
Supervised learning is applicable when a machine has sample data, i.e., input as well as output data with correct labels. These correct labels are used to check the correctness of the model. The supervised learning technique helps us predict future events with the help of past experience and labeled examples. The algorithm first analyses the known training dataset and then produces an inferred function that makes predictions about output values. It also measures its errors during the learning process and corrects them as learning proceeds.
Example: Let's assume we have a set of images tagged as "dog". A machine learning algorithm trained on these dog images can then determine whether a new image is of a dog or not.
2. Unsupervised Learning
In unsupervised learning, a machine is trained with input samples only, and the corresponding outputs are not known. Because the training information is neither classified nor labeled, the machine may not always produce output as accurate as in supervised learning.
Example: Let's assume a machine is trained on a set of documents belonging to different categories (Type A, B, and C), and we have to organize them into appropriate groups. Because the machine is provided only with the input samples and no outputs, it can organize these documents into type A, type B, and type C groups, but there is no guarantee that the grouping is correct.
3. Reinforcement Learning
Reinforcement Learning is a feedback-based machine learning technique. In this type of learning, agents (computer programs) need to explore the environment, perform actions, and receive rewards as feedback for those actions. For each good action, they get a positive reward, and for each bad action, they get a negative reward.
The goal of a Reinforcement learning agent is to maximize the positive rewards. Since
there is no labeled data, the agent is bound to learn by its experience only.
4. Semi-supervised Learning
Semi-supervised learning is an intermediate technique between supervised and unsupervised learning. It operates on datasets that contain a small amount of labeled data together with a large amount of unlabeled data. Because labels are costly to obtain, this reduces the cost of building the machine learning model while still exploiting the few labels that are available, and it can increase the accuracy and performance of the model.
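A minimal semi-supervised sketch using label propagation (assuming scikit-learn; hiding most of the Iris labels is just an illustrative setup): unlabeled points are marked with -1 and the algorithm spreads the few known labels through the data.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_partial = y.copy()
mask = rng.rand(len(y)) < 0.8            # hide 80% of the labels
y_partial[mask] = -1                      # -1 marks an unlabeled example

model = LabelPropagation()
model.fit(X, y_partial)                   # learns from few labels plus many unlabeled points
print("accuracy on the hidden labels:", (model.transduction_[mask] == y[mask]).mean())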
Marketing:
Machine learning helps marketers create hypotheses, run tests, and evaluate and analyze datasets. It helps us quickly make predictions based on large volumes of data. It is also helpful in stock trading, where most of the trading is done by bots whose decisions are based on calculations from machine learning algorithms. Deep learning architectures such as Convolutional Neural Networks, Recurrent Neural Networks, and Long Short-Term Memory networks help build trading models.
Self-driving cars:
This is one of the most exciting applications of machine learning in today's world. It
plays a vital role in developing self-driving cars. Various automobile companies like Tesla, Tata, etc., are continuously working on the development of self-driving cars. This is made possible by machine learning methods (supervised learning) in which a machine is trained to detect people and objects while driving.
Speech Recognition:
Speech Recognition is one of the most popular applications of machine learning.
Nowadays, almost every mobile application comes with a voice search facility. This "Search by Voice" facility is part of speech recognition. In this method, voice instructions are converted into text, which is known as "speech to text" or "computer speech recognition".
Google assistant, SIRI, Alexa, Cortana, etc., are some famous applications of speech
recognition.
Traffic Prediction:
Machine learning also helps us find the shortest route to our destination using Google Maps. It also helps predict traffic conditions, whether clear or congested, using the real-time locations and sensors of devices running the Google Maps app.
Image Recognition:
Image recognition is also an important application of machine learning for identifying
objects, persons, places, etc. Face detection and auto friend-tagging suggestions are among the most famous applications of image recognition, used by Facebook, Instagram, etc. Whenever we upload photos with our Facebook friends, it automatically suggests their names through image recognition technology.
Product Recommendations:
Machine Learning is widely used in business industries for the marketing of various
products. Almost all big and small companies like Amazon, Alibaba, Walmart, Netflix,
etc., use machine learning techniques for product recommendations to their users. Whenever we search for a product on their websites, we start seeing advertisements for similar products. This is made possible by machine learning algorithms that learn users' interests and, based on past data, suggest products to the user.
Automatic Translation:
Automatic language translation is also one of the most significant applications of machine learning. It is based on sequence-to-sequence algorithms that translate text from one language into another. Google's GNMT (Google Neural Machine Translation) provides this feature using neural machine translation. You can also translate text selected from images, as well as complete documents, through Google Lens.
Virtual Assistant:
A virtual personal assistant is also one of the most popular applications of machine learning. It first records our voice and sends it to a cloud-based server, where it is decoded with the help of machine learning algorithms. Big companies like Amazon, Google, etc., use these features for playing music, calling someone, opening an app, searching for data on the internet, and so on.
Linear Regression
Linear Regression is one of the simplest and most popular machine learning algorithms used by data scientists. It is used for predictive analysis, making predictions for real-valued variables such as experience, salary, cost, etc.
It is a statistical approach that represents the linear relationship between a dependent variable and one or more independent variables, hence the name Linear Regression. It shows how the value of the dependent variable changes with respect to the independent variable, and the fitted straight line is called the regression line.
Linear Regression can be expressed mathematically as follows:
y = a0 + a1x + ε
where
y = dependent variable
x = independent variable
a0 = intercept of the line
a1 = linear regression coefficient (slope)
ε = random error
The observed values of the x and y variables form the training dataset for the Linear Regression model.
Linear Regression is helpful for evaluating business trends and forecasts, such as predicting a person's salary based on their experience or predicting crop production based on the amount of rainfall.
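A minimal sketch of fitting y = a0 + a1x by least squares (assuming scikit-learn; the experience and salary numbers are made-up illustrative values, not real data):

import numpy as np
from sklearn.linear_model import LinearRegression

experience = np.array([[1], [2], [3], [5], [7], [10]])   # x, years of experience (assumed)
salary = np.array([30, 35, 42, 55, 68, 90])              # y, salary in thousands (assumed)

model = LinearRegression().fit(experience, salary)
print("slope a1:", model.coef_[0], " intercept a0:", model.intercept_)
print("predicted salary for 6 years of experience:", model.predict([[6]])[0])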
Logistic Regression
Logistic Regression is a supervised learning technique. It helps us predict a categorical dependent variable from a given set of independent variables. The output can be binary (0 or 1) or Boolean (true/false), but instead of giving an exact value, the model gives a probabilistic value between 0 and 1. It is similar to Linear Regression in how it is used within a machine learning model: just as Linear Regression is used for solving regression problems, Logistic Regression is used for solving classification problems.
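A minimal logistic regression sketch on a binary task (assuming scikit-learn; the breast-cancer dataset is just a convenient built-in example): the model outputs a probability between 0 and 1 that is then thresholded into a class label.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("P(class = 1) for the first test sample:", clf.predict_proba(X_test[:1])[0, 1])
print("test accuracy:", clf.score(X_test, y_test))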
K-Nearest Neighbors (KNN)
The K-Nearest Neighbors (KNN) algorithm assigns a new data point to a class based on its similarity to the available data points. Beyond general machine learning tasks, KNN is used in many fields, such as search relevance ranking, pattern recognition, finance, and healthcare (discussed in more detail later in this document).
K-Means Clustering
K-Means Clustering is an unsupervised learning technique. It helps us solve clustering problems by grouping unlabeled datasets into different clusters. Here K defines the number of pre-defined clusters to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
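A minimal K-Means sketch (assuming scikit-learn and NumPy; the two blobs of 2-D points are made-up illustrative data): unlabeled points are grouped into K clusters.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])   # unlabeled points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", kmeans.cluster_centers_)
print("first five cluster labels:", kmeans.labels_[:5])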
Decision Tree
Decision Tree is another machine learning technique that comes under supervised learning. Like KNN, the decision tree helps us solve classification as well as regression problems, but it is mostly preferred for classification. The name comes from its tree-structured classifier, in which internal nodes represent attributes, branches represent decision rules, and each leaf represents the outcome of the model. The tree starts from the decision node, also known as the root node, and ends with leaf nodes.
Decision nodes help us to make any decision, whereas leaves are used to determine
the output of those decisions.
A Decision Tree is a graphical representation for getting all the possible outcomes to a
problem or decision depending on certain given conditions.
Random Forest
Random Forest is also one of the most preferred machine learning algorithms that come
under the supervised learning technique. Like KNN and Decision Tree, it allows us to solve classification as well as regression problems, and it is preferred whenever we need to solve a complex problem or improve the performance of the model.
Naïve Bayes
The naïve Bayes algorithm is one of the simplest and most effective machine learning
algorithms that come under the supervised learning technique. It is based on the
concept of the Bayes Theorem, used to solve classification-related problems. It helps to
build fast machine learning models that can make quick predictions with greater
accuracy and performance. It is mostly preferred for text classification having high-
dimensional training datasets.
It is based on the concept of the Bayes Theorem, which is also known as Bayes' Rule or Bayes' law. Mathematically, Bayes' Theorem can be expressed as follows:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is the posterior probability of class A given evidence B, P(B|A) is the likelihood of the evidence given the class, P(A) is the prior probability of the class, and P(B) is the probability of the evidence.
According to Arthur Samuel, "Machine Learning enables a machine to automatically learn from data, improve performance from experience, and predict things without being explicitly programmed."
In simple words, when we feed training data to a machine learning algorithm, the algorithm produces a mathematical model, and with the help of that model the machine makes predictions and takes decisions without being explicitly programmed. Moreover, the more the machine works with the training data, the more experience it gains and the more efficient its results become.
The following are the different aspects of developing a learning system. Let us consider designing a
program to learn to play checkers, with the goal of entering it in the world checkers tournament.
Example: In a driverless car, the training data fed to the algorithm covers how to drive the car on highways and on busy and narrow streets, with factors like speed limits, parking, stopping at signals, etc. A logical and mathematical model is then created on that basis, and the car drives according to that model. The more data is fed, the more efficient the output becomes.
According to Tom Mitchell, "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
In order to complete the design of the learning system, we must now choose the type of training experience, the target function, a representation for that target function, and a learning algorithm.
The first design choice is to choose the type of training experience from which the system
will learn.
The type of training experience available can have a significant impact on success or failure
of the learner.
There are three attributes which impact the success or failure of the learner:
1. Whether the training experience provides direct or indirect feedback regarding the
choices made by the performance system
2. The degree to which the learner controls the sequence of training examples
3. How well it represents the distribution of examples over which the final system
performance P must be measured
The next important step is choosing the target function. Based on the knowledge fed to the algorithm, the system will use a NextMove function that describes which legal move should be made. For example, while playing chess against an opponent, after the opponent moves, the machine learning algorithm decides which of the possible legal moves to make in order to succeed.
NextMove: B-->M
This function accepts as input any board from the set of legal board states B and produces as output
some move from the set of legal moves M.
An alternative target function and one that will turn out to be easier to learn in this setting is an
evaluation function that assigns a numerical score to any given board state.
V: B-->R
Denote that V maps any legal board state from the set B to some real value
Let us therefore define the target value V(b) for an arbitrary board state b in B as follows:
1. If b is a final board state that is won, then V(b) = 100.
2. If b is a final board state that is lost, then V(b) = -100.
3. If b is a final board state that is drawn, then V(b) = 0.
4. If b is not a final state, then V(b) = V(b'), where b' is the best final board state that can be reached starting from b and playing optimally until the end of the game.
We need to choose a representation that the learning algorithm will use to describe the learned evaluation function. Here it is calculated as a linear combination of board features:
V(b) = u0 + u1x1 + u2x2 + u3x3 + u4x4 + u5x5 + u6x6
where x1 through x6 are numerical features of the board state (for example, the number of pieces and kings of each colour and the number of pieces threatened on each side), and u0, u1, ..., u6 are the coefficients that will be chosen (learned) by the learning algorithm.
To learn the target function, we require a set of training examples, each describing a specific board state b and its training value y for b. The training algorithm learns (approximates) the coefficients u0, u1, ..., u6 from these training examples by estimating and adjusting the weights.
For example, when training data from played games is fed to the algorithm, the algorithm does not know in advance whether a move will lead to failure or success; from each failure or success it adjusts the weights and estimates which next move gives the best chance of success.
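A minimal sketch of adjusting the coefficients u0..u6 with an LMS-style (least mean squares) update rule; the board features and training values below are made-up numbers used only to show the mechanics:

def v_hat(u, x):
    # linear evaluation: u0 + u1*x1 + ... + u6*x6
    return u[0] + sum(ui * xi for ui, xi in zip(u[1:], x))

u = [0.0] * 7                     # coefficients u0..u6, initially zero
eta = 0.01                        # learning rate

training_examples = [             # (six board features, training value) - illustrative only
    ([3, 0, 1, 0, 0, 0], 100.0),
    ([0, 3, 0, 1, 0, 0], -100.0),
    ([2, 2, 0, 0, 1, 1], 0.0),
]

for _ in range(1000):
    for x, v_train in training_examples:
        error = v_train - v_hat(u, x)
        u[0] += eta * error                      # adjust the constant term
        for i in range(6):
            u[i + 1] += eta * error * x[i]       # adjust each feature weight
print("learned coefficients:", [round(w, 2) for w in u])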
5. The final design
The final design emerges once the system has worked through many examples, failures and successes, and correct and incorrect decisions, and has learned how to choose its next step.
Overview Of classification
Classification is a supervised machine learning method where the model tries to predict
the correct label of a given input data. In classification, the model is fully trained using
the training data, and then it is evaluated on test data before being used to perform
prediction on new unseen data.
For instance, an algorithm can learn to predict whether a given email is spam or ham
(no spam), as illustrated below.
Before diving into the classification concept, we will first understand the difference between the
two types of learners in classification: lazy and eager learners. Then we will clarify the
misconception between classification and regression.
There are two types of learners in machine learning classification: lazy and eager
learners.
Eager learners are machine learning algorithms that first build a model from the
training dataset before making any prediction on future datasets. They spend more time during training, learning the model weights in order to generalize better, but they require less time to make predictions.
Most machine learning algorithms are eager learners, and below are some examples:
Logistic Regression.
Support Vector Machine.
Decision Trees.
Artificial Neural Networks.
Lazy learners or instance-based learners, on the other hand, do not create any
model immediately from the training data, and this is where the lazy aspect comes from.
They just memorize the training data, and each time there is a need to make a
prediction, they search for the nearest neighbor from the whole training data, which
makes them very slow during prediction. Some examples of this kind are:
K-Nearest Neighbor.
Case-based reasoning.
However, data structures such as Ball Trees and KD-Trees can be used to improve the prediction latency.
Even though classification and regression are both from the category of supervised
learning, they are not the same.
Healthcare
Training a machine learning model on historical patient data can help healthcare specialists analyze their diagnoses more accurately.
Education
Education is one of the domains dealing with the most textual, video, and audio data. This unstructured information can be analyzed with the help of natural language technologies to perform different classification tasks.
Binary Classification
In a binary classification task, the goal is to classify the input data into two mutually
exclusive categories. The training data in such a situation is labeled in a binary format: true and false; positive and negative; 0 and 1; spam and not spam, etc., depending on the problem being tackled. For instance, we might want to detect whether a given image is of a truck or a boat.
Logistic Regression and Support Vector Machines algorithms are natively designed for
binary classifications. However, other algorithms such as K-Nearest Neighbors and
Decision Trees can also be used for binary classification.
Multi-Class Classification
Multi-class classification, on the other hand, has more than two mutually exclusive class labels, where the goal is to predict which class a given input example belongs to. In the following case, the model correctly classified the image as a plane.
Most of the binary classification algorithms can be also used for multi-class
classification. These algorithms include but are not limited to:
Random Forest
Naive Bayes
K-Nearest Neighbors
Gradient Boosting
SVM
Logistic Regression.
But wait! Didn't you say that SVM and Logistic Regression do not support multi-class classification by default? That's right, but they can be extended using strategies such as one-vs-one, which trains a separate binary classifier for every pair of classes.
In general, for N labels, we will have N x (N-1)/2 classifiers. Each classifier is trained on a single binary dataset, and the final class is predicted by a majority vote among all the classifiers. The one-vs-one approach works best for SVM and other kernel-based algorithms.
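A minimal one-vs-one sketch (assuming scikit-learn; the Iris dataset is just a convenient 3-class example): a binary SVM is trained for every pair of classes and the pairwise classifiers vote on the final label.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                 # 3 classes -> 3*(3-1)/2 = 3 pairwise classifiers
clf = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print("number of pairwise classifiers:", len(clf.estimators_))
print("predicted class of the first sample:", clf.predict(X[:1])[0])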
Imbalanced Classification
For the imbalanced classification, the number of examples is unevenly distributed in
each class, meaning that we can have more of one class than the others in the training
data. Let’s consider the following 3-class classification scenario where the training data
contains: 60% of trucks, 25% of planes, and 15% of boats.
The imbalanced classification problem could occur in the following scenario:
So, does that mean that such problems are left behind?
Of course not! We can use multiple approaches to tackle the imbalance problem in a
dataset. The most commonly used approaches include sampling techniques or
harnessing the power of cost-sensitive algorithms.
Sampling Techniques
Cluster-based oversampling.
Random undersampling: random elimination of examples from the majority class.
Random oversampling: random replication of examples from the minority class.
SMOTE oversampling: generation of synthetic minority-class examples by interpolating between existing ones.
Cost-Sensitive Algorithms
These algorithms take into consideration the cost of misclassification. They aim to
minimize the total cost generated by the models.
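A minimal cost-sensitive sketch (assuming scikit-learn; the imbalanced dataset is generated artificially for illustration): class_weight="balanced" makes mistakes on the rare class more costly during training.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# a made-up imbalanced dataset: roughly 90% of one class, 10% of the other
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

print(classification_report(y_test, plain.predict(X_test)))     # minority recall tends to be low
print(classification_report(y_test, weighted.predict(X_test)))  # minority recall usually improves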
To learn why, let's pretend that we have a dataset of two types of pets: cats and dogs.
Each pet in our dataset has two features: weight and fluffiness.
Our goal is to identify and evaluate suitable models for classifying a given pet as either a cat or a dog. We'll use train/validation/test splits to do this!
Training Set: The dataset that we feed our model to learn potential underlying patterns and
relationships.
Validation Set: The dataset that we use to understand our model's performance across different model types and hyperparameter choices.
Test Set: The dataset that we use to approximate our model's unbiased accuracy in the wild.
The training set should be as representative as possible of the population that we are trying to
model. Additionally, we need to be careful and ensure that it is as unbiased as possible, as any
bias at this stage may be propagated downstream during inference.
We could compare the accuracy of each model on the training set, but if we use the same exact
dataset for both training and tuning, the model will overfit and won't generalize well.
This is where the validation set comes in — it acts as an independent, unbiased dataset for
comparing the performance of different algorithms trained on our training set.
We should never, under any circumstance, look at the test set's performance before
selecting a model.
Peeking at our test set performance ahead of time is a form of overfitting, and will likely lead to
unreliable performance expectations in production. It should only be checked as the final form
of evaluation, after the validation set has been used to identify the best model.
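A minimal sketch of this workflow (assuming scikit-learn; the Iris data and the choice of k-nearest neighbours are illustrative): the validation set is used to compare hyperparameter choices, and the test set is scored exactly once at the end.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# first carve off a held-out test set, then split the rest into train and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

best_k, best_acc = None, 0.0
for k in (1, 3, 5, 7):   # compare hyperparameter choices on the validation set only
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print("chosen k:", best_k, " test accuracy (checked once):", final.score(X_test, y_test))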
The illustration below depicts how an optimal model fits the data compared to an overfitted one.
Imagine predicting that a customer will pay back a loan, and the
customer defaults. Not just one customer but thousands of
customers. This can cause a crisis for any financial institution.
Causes of Overfitting
Noisy data
Noise in data often appears as errors, fluctuations, or outliers in
the data. This can be caused by data entry errors, data aging,
data transmission errors, and so on.
Too much noise in data can cause the model to think these are
valid data points. Fitting the noise pattern in the training dataset
will cause poor performance on the new dataset.
A dataset may, for instance, contain some blurry images that cannot be labelled as either cat or dog. In these instances, the model could learn such noise patterns alongside the relevant features. Removing these images can reduce overfitting.
But this flexibility can pose a problem, since the model can start capturing noise, fluctuations, or outliers. Let's look at a decision tree model, how it works, and how overfitting can happen when it becomes too complex.
To make a prediction, the tree starts from the root node and follows the branches down, splitting on feature after feature until it reaches a leaf node. The prediction is then made based on the value associated with that leaf node.
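A minimal sketch of this kind of overfitting (assuming scikit-learn; the breast-cancer dataset is just a convenient example): an unconstrained tree memorizes the training data, while a depth-limited tree generalizes better.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)             # grows until leaves are pure
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("deep tree    - train:", deep.score(X_train, y_train), " test:", deep.score(X_test, y_test))
print("shallow tree - train:", shallow.score(X_train, y_train), " test:", shallow.score(X_test, y_test))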
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis can be used to project features from a higher-dimensional space into a lower-dimensional space in order to reduce resource and dimensionality costs. In this topic, "Linear Discriminant Analysis (LDA) in machine learning", we will discuss the LDA algorithm for classification predictive modeling problems, the limitations of logistic regression, the representation of the LDA model, how to make a prediction using LDA, how to prepare data for LDA, extensions to LDA, and much more. So, let's start with a quick introduction to Linear Discriminant Analysis (LDA) in machine learning.
Note: Before starting this topic, it is recommended to learn the basics of Logistic Regression
algorithms and a basic understanding of classification problems in machine learning as a
prerequisite
To overcome the overlapping issue in the classification process, we must increase the
number of features regularly.
Example:
Let's assume we have to classify two different classes having two sets of data points in
a 2-dimensional plane as shown below image:
However, it is impossible to draw a straight line in the 2-D plane that separates these data points efficiently. Using Linear Discriminant Analysis, we can reduce the 2-D plane to a 1-D line; this technique also maximizes the separability between the classes.
Let's consider an example where we have two classes in a 2-D plane having an X-Y
axis, and we need to classify them efficiently. As we have already seen in the above
example that LDA enables us to draw a straight line that can completely separate the
two classes of the data points. Here, LDA uses an X-Y axis to create a new axis by
separating them using a straight line and projecting data onto a new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D
plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
o Maximize the distance between the means of the two classes.
o Minimize the variation (scatter) within each class.
Using these two conditions, LDA generates a new axis that maximizes the distance between the class means and minimizes the variation within each class.
In other words, we can say that the new axis will increase the separation between the
data points of the two classes and plot them onto the new axis.
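A minimal sketch of LDA as a supervised dimensionality-reduction step (assuming scikit-learn and NumPy; the two overlapping 2-D classes are made-up illustrative data): the points are projected onto the single axis that best separates the classes.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)              # the 2-D plane reduced to a 1-D axis
print("shape after projection:", X_1d.shape)
print("classification accuracy on the training data:", lda.score(X, y))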
Why LDA?
o Logistic Regression is one of the most popular classification algorithms that perform well
for binary classification but falls short in the case of multiple classification problems with
well-separated classes. At the same time, LDA handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as
PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.
Drawbacks of Linear Discriminant Analysis (LDA)
Although LDA is specifically used to solve supervised classification problems for two or more classes, which is not always possible using logistic regression, LDA also fails in some cases, for example when the means of the class distributions are shared. In that case, LDA cannot create a new axis that makes the classes linearly separable.
1. Quadratic Discriminant Analysis (QDA): for multiple input variables, each class uses its own estimate of variance (covariance).
2. Flexible Discriminant Analysis (FDA): used when non-linear combinations of inputs, such as splines, are required.
3. Regularized Discriminant Analysis (RDA): introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.
o Face Recognition
Face recognition is the popular application of computer vision, where each face is
represented as the combination of a number of pixel values. In this case, LDA is used to
minimize the number of features to a manageable number before going through the
classification process. It generates a new template in which each dimension consists of
a linear combination of pixel values. If a linear combination is generated using Fisher's
linear discriminant, then it is called Fisher's face.
o Medical
In the medical field, LDA has a great application in classifying a patient's disease on the basis of various health parameters and the medical treatment that is under way. On such parameters, it classifies the disease as mild, moderate, or severe. This classification helps doctors either increase or decrease the pace of the treatment.
o Customer Identification
LDA is also applied in customer identification. With its help, we can identify and select the features that characterize the group of customers most likely to purchase a specific product in a shopping mall. This is helpful when we want to target the customers who most often buy a given product.
o For Predictions
LDA can also be used for making predictions and hence in decision making. For example, "will you buy this product?" will give a predicted result of one of two possible classes: buying or not buying.
o In learning
Nowadays, robots are being trained to learn and talk in order to simulate human work, and this can also be treated as a classification problem. In this case, LDA builds similar groups on the basis of different parameters, including pitch, frequency, sound, tune, etc.
o PCA is an unsupervised algorithm that does not care about classes and labels and only
aims to find the principal components to maximize the variance in the given dataset. At
the same time, LDA is a supervised algorithm that aims to find the linear discriminants to
represent the axes that maximize separation between different classes of data.
o LDA is much more suitable for multi-class classification tasks than PCA. However, PCA can perform comparatively well when the sample size is small.
o Both LDA and PCA are used as dimensionality reduction techniques, where PCA is first
followed by LDA.
How to Prepare Data for LDA
Below are some suggestions that one should always consider while preparing the data
to build the LDA model:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the values of the sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such a final node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root node
(Salary attribute by ASM). The root node splits further into the next decision node (distance from
the office) and one leaf node based on the corresponding labels. The next decision node further
gets split into one decision node (Cab facility) and one leaf node. Finally, the decision node splits
into two leaf nodes (Accepted offers and Declined offer). Consider the below diagram:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula below:
Information Gain = Entropy(S) - [(Weighted Avg) x Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
S = the total number of samples, P(yes) = the probability of yes, and P(no) = the probability of no.
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o The Gini index can be calculated using the formula below:
Gini Index = 1 - Σj (Pj)²
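A minimal sketch of both impurity measures applied to a node's class labels (the 9 "yes" / 5 "no" split is a made-up example):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

node = ["yes"] * 9 + ["no"] * 5
print("entropy:", round(entropy(node), 3))   # about 0.940
print("gini:", round(gini(node), 3))         # about 0.459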
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning
t "user_data.csv," which we have used in previous classification models. By using the same
dataset, we can compare the Decision tree classifier with other classification models such
as KNN SVM, LogisticRegression, etc.
Steps will also remain the same, which are given below:
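A hedged sketch of those steps (the column names Age, EstimatedSalary, and Purchased are assumptions about user_data.csv, not confirmed by the text): load the data, split and scale it, fit the classifier, and evaluate with a confusion matrix.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

data = pd.read_csv("user_data.csv")
X = data[["Age", "EstimatedSalary"]].values      # assumed feature columns
y = data["Purchased"].values                      # assumed target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))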
Generative models are a class of statistical models that generate new data instances. These
models are used in unsupervised machine learning to perform tasks.
Naive Bayes
• Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying
Bayes' theorem with strong independence assumptions between the features. It is highly
scalable, requiring a number of parameters linear in the number of variables
(features/predictors) in a learning problem.
• A Naive Bayes Classifier is a program which predicts a class value given a set of attributes.
• For each known class value,
1. Calculate probabilities for each attribute, conditional on the class value.
2. Use the product rule to obtain a joint conditional probability for the attributes.
3. Use Bayes rule to derive conditional probabilities for the class variable.
• Once this has been done for all class values, output the class with the highest probability.
• Naive bayes simplifies the calculation of probabilities by assuming that the probability of
each attribute belonging to a given class value is independent of all other attributes. This is a
strong assumption but results in a fast and effective method.
• The probability of a class value given a value of an attribute is called the conditional
probability. By multiplying the conditional probabilities together for each attribute for a
given class value, we have a probability of a data instance belonging to that class.
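A minimal Naive Bayes sketch (assuming scikit-learn; Gaussian Naive Bayes on the Iris dataset is just a convenient choice): per-class attribute probabilities are combined with Bayes' rule and the most probable class is returned.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
print("class probabilities for the first test sample:", nb.predict_proba(X_test[:1])[0])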
Conditional Probability
• Let A and B be two events such that P(A) > 0. We denote by P(B | A) the probability of B given that A has occurred. Since A is known to have occurred, it becomes the new sample space, replacing the original S. From this, the definition is
P(B | A) = P(A ∩ B) / P(A)
or
P(A ∩ B) = P(A) P(B | A)
• The notation P(B | A) is read "the probability of event B given event A". It is the probability
of an event B given the occurrence of the event A.
• We say that, the probability that both A and B occur is equal to the probability that A
occurs times the probability that B occurs given that A has occurred. We call P(B | A) the
conditional probability of B given A, i.e., the probability that B will occur given that A has
occurred.
• Similarly, the conditional probability of an event A given B is
P(A | B) = P(A ∩ B) / P(B)
• The probability P(A | B) simply reflects the fact that the probability of an event A may depend on a second event B. If A and B are mutually exclusive, then A ∩ B = ∅ and P(A | B) = 0.
• Another way to look at the conditional probability formula is:
P(Second | First) = P(First choice and Second choice) / P(First choice)
• Conditional probability is a defined quantity and cannot be proven.
• The key to solving conditional probability problems is to:
1. Define the events.
2. Express the given information and question in probability notation.
3. Apply the formula.
Joint Probability
• A joint probability is a probability that measures the likelihood that two or more events
will happen concurrently.
• If there are two independent events A and B, the probability that A and B will occur is
found by multiplying the two probabilities. Thus for two events A and B, the special rule of
multiplication shown symbolically is :
P(A and B) = P(A) P(B).
• The general rule of multiplication is used to find the joint probability that two events will
occur. Symbolically, the general rule of multiplication is,
P(A and B) = P(A) P(B | A).
• The probability P(A ∩ B) is called the joint probability of two events A and B which intersect in the sample space. A Venn diagram readily shows that
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Equivalently:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) ≤ P(A) + P(B)
• The probability of the union of two events never exceeds the sum of the event
probabilities.
• A tree diagram is very useful for portraying conditional and joint probabilities. A tree
diagram portrays outcomes that are mutually exclusive.
Bayes Theorem
• Bayes' theorem is a method to revise the probability of an event given additional
information. Bayes's theorem calculates a conditional probability called a posterior or
revised probability.
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A
and B denote two events, P(A | B) denotes the conditional probability of A occurring, given
that B occurs. The two conditional probabilities P(A | B) and P(B | A) are in general different.
• Bayes' theorem gives a relation between P(A | B) and P(B | A). An important application of Bayes' theorem is that it provides a rule for updating or revising the strengths of evidence-based beliefs in light of new evidence, a posteriori.
• A prior probability is an initial probability value originally obtained before any additional
information is obtained.
• A posterior probability is a probability value that has been revised by using additional
information that is later obtained.
• Suppose that B1, B2, B3, ..., Bn partition the outcomes of an experiment and that A is another event. Then Bayes' theorem gives the posterior probability of each Bi as
P(Bi | A) = P(A | Bi) P(Bi) / [P(A | B1) P(B1) + P(A | B2) P(B2) + ... + P(A | Bn) P(Bn)]
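A minimal numeric sketch of this partition form of Bayes' theorem (the priors and likelihoods are made-up illustrative values):

priors = [0.5, 0.3, 0.2]            # P(B1), P(B2), P(B3)  (assumed values)
likelihoods = [0.9, 0.5, 0.1]       # P(A | B1), P(A | B2), P(A | B3)  (assumed values)

evidence = sum(p * l for p, l in zip(priors, likelihoods))          # P(A) by total probability
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]

print("P(A) =", evidence)
print("posterior P(Bi | A):", [round(v, 3) for v in posteriors])    # the posteriors sum to 1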
The kNN (k-nearest neighbors) algorithm was first proposed in 1951 by Evelyn Fix and Joseph Hodges as a non-parametric classification method. In 1967, Thomas Cover and Peter Hart expanded on the non-parametric classification method and published their "Nearest Neighbor Pattern Classification" paper. Almost 20 years later, the algorithm was refined by James Keller, who developed a "fuzzy kNN" that produces lower error rates. Today, the kNN algorithm is one of the most widely used algorithms due to its adaptability to most fields, from genetics to finance and customer service.
How does kNN work?
The kNN algorithm works as a supervised learning algorithm, meaning it is fed training datasets it
memorizes. It relies on this labeled input data to learn a function that produces an appropriate output when
given new unlabeled data.
This enables the algorithm to solve classification or regression problems. While kNN's computation occurs
during a query and not during a training phase, it has important data storage requirements and is therefore
heavily reliant on memory.
For classification problems, the KNN algorithm will assign a class label based on a majority, meaning that it
will use the label that is most frequently present around a given data point. In other words, the output of a
classification problem is the mode of the nearest neighbors.
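A minimal kNN classification sketch (assuming scikit-learn; the Iris dataset is just a convenient example): "training" only stores the labeled data, and the neighbors are consulted at prediction time.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)   # lazy learner: fit just memorizes
print("test accuracy:", knn.score(X_test, y_test))                # each prediction queries the 5 nearest points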
A distinction: majority voting vs. plurality voting
Majority voting requires a label to win more than 50% of the votes, which makes sense when only two class labels are in consideration. When more than two class labels are considered, plurality voting applies: the label with the largest share of the votes wins even if it falls below 50% (with three classes, for example, just over 33.3% can be enough). Plurality voting is therefore the more accurate term for how kNN takes the mode.
If we were to illustrate this distinction with five neighbours: in a binary prediction with labels [A, A, A, A, B], both the majority vote and the plurality vote pick A. In a multi-class setting with labels [A, A, B, C, D], no label has a majority (none exceeds 50%), but the plurality vote still picks A.
Regression problems use the mean of the nearest neighbors to make a prediction. A regression problem produces real numbers as the query output.
For example, if you were making a chart to predict someone's weight based on their height, the values
denoting height would be independent, while the values for weight would be dependent. By performing a
calculation of the average height-to-weight ratio, you could estimate someone's weight (the dependent
variable) based on their height (the independent variable).
4 types of computing kNN distance metrics
The key to the kNN algorithm is determining the distance between the query point and the other data points.
Determining distance metrics enables decision boundaries. These boundaries create different data point
regions. There are different methods used to calculate distance:
Euclidean distance is the most common distance measure, which measures a straight line between the query point
and the other point being measured.
Manhattan distance is also a popular distance measure; it sums the absolute differences between the coordinates of two points. It is often visualized on a grid and referred to as taxicab geometry: how do you travel from point A (your query point) to point B (the point being measured)?
Minkowski distance is a generalization of Euclidean and Manhattan distance metrics, which enables the creation
of other distance metrics. It is calculated in a normed vector space. In the Minkowski distance, p is the parameter
that defines the type of distance used in the calculation. If p=1, then the Manhattan distance is used. If p=2, then
the Euclidean distance is used.
Hamming distance, also referred to as the overlap metric, is a technique used with Boolean or string vectors to
identify where vectors do not match. In other words, it measures the distance between two strings of equal length.
It is especially useful for error detection and error correction codes.
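A minimal sketch of these metrics (the points and strings are made-up examples): Minkowski distance with p=1 reduces to Manhattan distance and with p=2 to Euclidean distance, while Hamming distance counts mismatching positions.

def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))   # positions where equal-length sequences differ

a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print("Manhattan (p=1):", minkowski(a, b, 1))              # 3 + 4 + 0 = 7
print("Euclidean (p=2):", minkowski(a, b, 2))              # sqrt(9 + 16 + 0) = 5
print("Hamming('karolin', 'kathrin'):", hamming("karolin", "kathrin"))   # 3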
The kNN algorithm, popular for its simplicity and accuracy, has a variety of applications, especially when
used for classification analysis.
Relevance ranking: kNN uses natural language processing (NLP) algorithms to determine which results are most
relevant to a query.
Similarity search for images or videos: image similarity search uses natural language descriptions to find images matching text queries.
Pattern recognition: kNN can be used to identify patterns in text or digit classification.
Finance: In the financial sector, kNN can be used for stock market forecasting, currency exchange rates, etc.
Product recommendations and recommendation engines: Think Netflix! "If you liked this, we think you'll also
like…" Any site that uses a version of that sentence, overtly or not, is likely using a kNN algorithm to power its
recommendation engine.
Healthcare: In the field of medicine and medical research, the kNN algorithm can be used in genetics to calculate
the probability of certain gene expressions. This allows doctors to predict the likelihood of cancer, heart attacks, or
any other hereditary conditions.
Data preprocessing: The kNN algorithm can be used to estimate missing values in datasets.