Unit 9 - Classification & Clustering

The document discusses classification and clustering in artificial intelligence, focusing on the concept of classification as a supervised learning method. It provides examples of classification problems, differentiates between binary and multi-class classification, and explains the use of algorithms like logistic regression. Additionally, it covers the confusion matrix, its importance in evaluating classification models, and the implications of false positives and negatives in medical testing.

Uploaded by

51554
Copyright
© © All Rights Reserved

INDIAN SCHOOL MUSCAT

CLASS XI

ARTIFICIAL INTELLIGENCE

UNIT 9: Classification & Clustering


PART I - Classification
A few interesting examples to illustrate the widespread applications of classification
problems.
Case 1:
A credit card company typically receives hundreds of applications for a new credit card. Each
application contains information on several attributes, such as annual salary, outstanding
debt and age. The problem is to categorize applicants into those with good credit, bad
credit or somewhere in between. Categorizing the applications in this way is a
classification problem.
Case 2:
You may want to own a dog, but which kind of dog? This is the beginning of a classification
problem. Dogs can be classified in a number of different ways. For example, they can be
classified by breed (examples include beagles, hounds, pugs and countless others). They can
also be classified by their role in the lives of their masters and the work they do (for example,
a dog might be a family pet, a working dog, a show dog, or a hunting dog). In many
cases, dogs are defined both by their breed and their role. Based on different classification
criteria, you eventually decide which one you want to own.
A technical example to explain classification problem
Case 3:
A common example of classification comes with detecting spam emails. To write a
program to filter out spam emails, a computer programmer can train a machine
learning algorithm with a set of spam-like emails labeled as “spam” and regular
emails labeled as “not-spam”. The idea is to make an algorithm that can learn
characteristics of spam emails from this training set so that it can filter out spam
emails when it encounters new emails.
Activity-1
Look at the pictures below and tell
whether the fruit seller knows the
art of classification or not.
Justify your answer.
Classification is a type of supervised learning

Supervised learning, as the name indicates, involves a supervisor acting as a
teacher. In supervised learning we teach or train the machine
using data which is well labeled, meaning that the data is already tagged with
the correct answer. The supervised learning algorithm analyses this training data
(the set of training examples). After that, the machine is provided with a new set of
examples (data) and uses what it has learnt to predict the correct outcome for them.
Suppose you are given a basket filled with different kinds of fruits. Now the first
step is to train the machine with all the different fruits one by one, like below:
• If shape of object is rounded with a depression at top
and Red in colour, then it will be labeled as – Apple.
• If shape of object is long curving cylinder and
green in colour, then it will be labeled as – Banana.

Now suppose that, after training, you present a new fruit (say Banana) from the
basket and ask the machine to identify it. Since the machine has already learnt
from previous data, it will use the learning wisely this time to classify the fruit
based on its shape and color and would confirm the fruit as BANANA and place it
in Banana category. Thus the machine learns the things from training data (basket
containing fruits) and then applies the knowledge to test data (new fruit).
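The fruit-basket idea can be sketched in a few lines of Python. This is only an illustration: the feature wording comes from the two rules above, while the names `training_data`, `train` and `classify` and the use of a dictionary as the "model" are assumptions made for this sketch, not a real learning algorithm.

```python
# Labelled training data: (shape, colour) features tagged with the correct answer.
# The structure below is a hypothetical illustration of "well labeled" data.
training_data = [
    (("rounded with a depression at top", "red"), "Apple"),
    (("long curving cylinder", "green"), "Banana"),
]

def train(examples):
    """'Learn' by remembering which label goes with each feature combination."""
    model = {}
    for features, label in examples:
        model[features] = label
    return model

def classify(model, shape, colour):
    """Return the learnt label, or 'Unknown' if these features were never seen."""
    return model.get((shape, colour), "Unknown")

model = train(training_data)                                      # training step
new_fruit = classify(model, "long curving cylinder", "green")    # test step
print(new_fruit)  # Banana
```

A real classifier generalises to examples that are similar to, but not identical with, the training data; this lookup table only shows the train-then-test flow.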
Supervised learning is further classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category,
such as “red” or “blue”, or “disease” or “no disease”.
Regression: A regression problem is when the output variable
is a real value, such as an amount in INR, a weight in kilograms or a temperature in Fahrenheit.
What is classification in Artificial Intelligence/Machine Learning (AI/ML)?
Classification is the process of categorizing a set of data (structured data or
unstructured data) into different categories or classes, where we can assign a label
to each class. Classification problems normally have a categorical output like
‘yes’ or ‘no’, ‘1’ or ‘0’, ‘true’ or ‘false’.
In the picture, we are assigning the
labels ‘paper’, ‘metal’, ‘plastic’, and
so on to different types of waste.
Question 1: What is the difference between classification and regression?
The main difference between Regression and Classification algorithms is that
Regression algorithms are used to predict continuous values such as price,
salary, age, etc., while Classification algorithms are used to predict/classify
discrete values such as Male or Female, True or False, Spam or Not Spam, etc.

Question 2: Look at the two graphs given
and suggest which graph represents the
classification problem.

Question 3: “Predicting stock price of a company on a particular day” - is it a
classification problem? Justify your answer.
Question 4: “Predicting whether India will lose or win a cricket match “- is it a
regression problem? Justify your answer.
Examples of Classification Problems
Example 1: In the banking industry, where you would like to know whether a
transaction is fraudulent or otherwise violating some regulation.
Example 2: Speech Understanding
Example 3: Face Detection
Activity:
Form a group of 5 students. Each group should think and come up with one case
from the classroom environment or their home/society, where they would like to
apply classification algorithm to solve the problem
Examples of classification problems include:
• Given an email, classify if it is spam or not.
• Given a handwritten character, classify it as one of the known characters.
• Given recent user behavior, classify as churn or not
Types of Classification Algorithm
There are two main types of classification tasks that you may encounter, they are:
i) Binary Classification: Classification with only 2 distinct classes or with 2
possible outcomes
Example: Male and Female
Example: Classification of spam email and non-spam email
Example: Results of an exam: pass/fail
Example: Positive and Negative sentiment

ii) Multi Class Classification: Classification with more than two distinct classes.
Example: classification of types of soil
Example: classification of types of crops
Example: classification of mood/feelings in songs/music
Binary Classification
Binary Classification refers to those classification tasks that have two class labels
i.e. two possible outcomes. (normal state and abnormal state)
For example, “not spam” is the normal state and “spam” is the abnormal state.
Another example is “cancer not detected” is the normal state of a task that
involves a medical test and “cancer detected” is the abnormal state.
The class for the normal state is assigned the class label 0 and the class with the
abnormal state is assigned the class label 1.
Popular algorithms that can be used for binary classification include:
• Logistic Regression
• k-Nearest Neighbors
• Decision Trees
• Support Vector Machine
Logistic Regression
• Binomial classification algorithm
• Logistic Regression works with binary data, where either the event happens (1)
or the event does not happen (0).
So given some feature x it tries to find out whether some event y happens or not.
So y can either be 0 or 1.
In the case where the event happens, y is given the value 1.
If the event does not happen, then y is given the value of 0.
(For example, if y represents whether a sports team wins a match, then y will be
1 if they win the match or y will be 0 if they do not.)

A Logistic Curve is an example of a curve where the values of y
cannot be less than 0 or greater than 1.
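A curve with this property is the standard logistic (sigmoid) function, 1 / (1 + e^(-z)). The handout does not give the formula, so the short sketch below is added as an assumption about which curve is meant; the function name `logistic` is chosen for illustration.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs give values near 0, large positive inputs give values near 1.
for z in (-5, -1, 0, 1, 5):
    print(z, round(logistic(z), 3))
```

An input of 0 gives exactly 0.5, which is why 0.5 is a natural cut-off between the two classes.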
A few more examples to understand logistic regression.
Example 1: Spam detection is a binary classification problem where we are given
an email and we need to classify whether or not it is spam. If the email is spam,
we label it 1; if it is not spam, we label it 0. In order to apply Logistic Regression to
the spam detection problem, the following features of the email are extracted:
• Sender of the email
• Number of typos in the email
• Occurrence of words/phrases like “offer”, “prize”, “free gift”, “lottery”, “you
won cash” and more

The resulting feature vector is then used to train a Logistic classifier which emits a
score in the range 0 to 1. If the score is more than 0.5, we label the email as
spam. Otherwise, we don’t label it as spam.
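The scoring step described above can be sketched as follows. The three features and the 0.5 threshold come from the text; the weights, the bias and all function names are hypothetical values picked for illustration, whereas a trained Logistic classifier would learn its weights from labelled emails.

```python
import math

SPAM_PHRASES = ("offer", "prize", "free gift", "lottery", "you won cash")

# Hypothetical weights and bias (assumptions); training would normally set these.
WEIGHTS = {"unknown_sender": 1.5, "typos": 0.4, "spam_phrases": 1.2}
BIAS = -3.0

def extract_features(sender_known, typo_count, text):
    """Turn an email into the numeric feature vector described in the text."""
    lowered = text.lower()
    phrase_hits = sum(1 for phrase in SPAM_PHRASES if phrase in lowered)
    return {
        "unknown_sender": 0 if sender_known else 1,
        "typos": typo_count,
        "spam_phrases": phrase_hits,
    }

def spam_score(features):
    """Weighted sum of the features passed through the logistic function (0 to 1)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def label(score, threshold=0.5):
    """1 = spam, 0 = not spam."""
    return 1 if score > threshold else 0

promo = extract_features(False, 3, "You won cash! Claim your free gift and prize now")
note = extract_features(True, 0, "Minutes of today's staff meeting are attached")
promo_label = label(spam_score(promo))
note_label = label(spam_score(note))
print(promo_label, note_label)  # 1 0
```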
Example 2: A Logistic Regression classifier may be used to identify whether a
tumor is malignant or if it is benign. Several medical imaging techniques are used
to extract various features of tumors.
For instance, the size of the tumor, the affected body area, etc. These features are
then fed to a Logistic Regression classifier to identify if the tumor is malignant or
if it is benign.

The two problems above are solved using the logistic regression algorithm because
there are only two possible labels in each case (spam/not spam,
malignant/benign), i.e. binomial classification.
Confusion Matrix
A confusion matrix is a matrix (an NxN table, where N is the number of target classes)
which is used to evaluate how successful a classification model is in the field of ML/AI.
The confusion matrix compares the actual target values with those predicted by
the classification model. This gives how well the classification model is performing
and what kind of error it is making.
For a binary classification problem, we would have a 2x2 matrix.

A true positive is an outcome where the model correctly predicts the positive class. Similarly, a
true negative is an outcome where the model correctly predicts the negative class.
A false positive is an outcome where the model incorrectly predicts the positive class. And a
false negative is an outcome where the model incorrectly predicts the negative class.
True Positive (TP)
• The predicted value matches the actual value
• The actual value was positive and classification model also predicts positive
• There is no error
True Negative (TN)
• The predicted value matches the actual value
• The actual value was negative and classification model also forecasts negative
• There is no error
False Positive (FP)
• The predicted value doesn’t match the actual value
• The actual value was negative but the model predicted a positive value
• This is Type 1 Error
False Negative (FN)
• The predicted value doesn’t match the actual value
• The actual value was positive but the model predicted a negative value
• This is Type 2 Error
An example from cricket to better explain this, taking NOT OUT as the positive class.
True Positive (TP) - Umpire gives a batsman NOT OUT when he is NOT OUT.
True Negative (TN) - Umpire gives a batsman OUT when he is OUT.
False Positive (FP) - Umpire gives a batsman NOT OUT when he is OUT.
False Negative (FN) - Umpire gives a batsman OUT when he is NOT OUT.
Question 1:
Assume there are 100 images, 30 of them depict a cat, the rest do not. A machine learning
model predicts the occurrence of a cat in 25 of 30 cat images. It also predicts absence of a cat in
50 of the 70 no cat images.
In this case, what are the true positive, false positive, true negative and false negative?
Solution: Assuming cat as a positive class.
Confusion Matrix:

| Predicted cat | Predicted not-cat
|--------------- |------------------
Actual cat | 25 | 5
Actual not-cat | 20 | 50

• True Positive (TP): Images that are cat and are predicted as cat, i.e. 25
• True Negative (TN): Images that are not-cat and are predicted as not-cat, i.e. 50
• False Positive (FP): Images that are not-cat but are predicted as cat, i.e. 20
• False Negative (FN): Images that are cat but are predicted as not-cat, i.e. 5
Precision: Precision is defined as the fraction of true positive cases among all the cases
where the prediction is positive.
Precision: TP/(TP+FP) = 25/(25+20) ≈ 0.556
Recall (also known as Sensitivity): It is defined as the fraction of positive cases that are
correctly identified.
Recall: TP/(TP+FN) = 25/(25+5) ≈ 0.833
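The two formulas can be checked with a short calculation. This is a minimal sketch using the counts from the cat example (TP = 25, TN = 50, FP = 20, FN = 5); the function names are chosen for illustration.

```python
def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that the model found (sensitivity)."""
    return tp / (tp + fn)

tp, tn, fp, fn = 25, 50, 20, 5   # counts from the cat example
p = precision(tp, fp)
r = recall(tp, fn)
print(round(p, 3), round(r, 3))  # 0.556 0.833
```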
Confusion Matrix Example 1: Do you still remember the shepherd boy story?
“A shepherd boy used to take his herd of sheep across the fields to the lawns near the forest.
One day he felt very bored. He wanted to have fun. So he cried aloud "Wolf, Wolf. The wolf is
carrying away a lamb". Farmers working in the fields came running and asked, "Where is the
wolf?". The boy laughed and replied "It was just for fun. Now get going all of you". The boy
played the trick for quite a number of times in the next few days. After some days, as the boy
was perched on a tree, singing a song, there came a wolf. The boy cried loudly "Wolf, Wolf, the
wolf is carrying a lamb away." There was no one to the rescue. The boy shouted "Help! Wolf!
Help!" Still no one came to his help. The villagers thought that the boy was playing mischief
again. The wolf carried a lamb away.”
Let us work on arriving at a confusion matrix for the above situation:
• "Wolf" is a positive class.
• "No wolf" is a negative class.
Question 2:
Assume there are 100 images, 30 of them depict a cat, the rest do not. A machine learning
model predicts the occurrence of a cat in 25 of 30 cat images. It also predicts absence of a cat in
50 of the 70 no cat images. Create the confusion matrix.
Question 3:
Above is a confusion matrix prepared for a binary classifier to detect email as Spam and Not
Spam. What is your interpretation of the above matrix?

Why do you need a confusion matrix?
The benefits of using a confusion matrix:
• It shows how a classification model is confused when it makes predictions
• It gives insight not only into the errors being made by the classifier but also into the
types of errors that are being made
• This helps overcome the limitations of using classification accuracy alone
• Every column of the confusion matrix represents the instances of the predicted class
• Each row of the confusion matrix represents the instances of the actual class
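The row/column convention (rows = actual class, columns = predicted class) can be illustrated by counting pairs of actual and predicted labels. The sketch below rebuilds the cat example from label lists; the function name `confusion_matrix` and the ordering of the lists are assumptions made for this illustration.

```python
def confusion_matrix(actual, predicted, classes):
    """Count (actual, predicted) pairs: rows are actual classes, columns are predicted."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# 30 cat images (25 predicted as cat) and 70 not-cat images (50 predicted as not-cat).
actual = ["cat"] * 30 + ["not-cat"] * 70
predicted = ["cat"] * 25 + ["not-cat"] * 5 + ["not-cat"] * 50 + ["cat"] * 20

matrix = confusion_matrix(actual, predicted, ["cat", "not-cat"])
print(matrix)  # [[25, 5], [20, 50]]
```

Reading the result: the first row holds TP and FN for the cat class, the second row holds FP and TN.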
Confusion Matrix Quiz!
https://www.inabia.com/learning/quiz/confusion-matrix-quiz/

https://quizizz.com/admin/presentation/5f3f4683ae1779001b12c14a/confusion-matrix-final

False Positive or False Negative in Medical Science


In medical testing, (in binary classification), a false positive is an error in data reporting in which a test result
improperly indicates presence of a condition, such as a disease (the result is positive), when in reality it is not
present, while a false negative is an error in which a test result improperly indicates absence of a disease,
when in reality it is present. These are the two kinds of errors in a binary test.
These errors (false positives and false negatives) can have severe implications for patients, their families and
society.
A false positive prompts patients to take medication or treatment they do not really need.
Even more dangerous is the false negative: a test that says you do not have a condition that you
actually have.
Example: home pregnancy tests are more prone to giving false negatives than false
positives. When it comes to screening for more serious conditions like HIV or cancer, however, a false
negative can have dire repercussions.
Case 1:
Consider a health prediction case, where one wants to diagnose cancer. Imagine that detecting
cancer will trigger further analysis (the patient will not be immediately treated), whereas if you
don't detect cancer, the patient is sent home without further examination.
This case is thus asymmetric, since you definitely would like to avoid sending home a sick patient
(False Negative). You can, however, afford to make a healthy patient wait a little longer by asking him/her to take
more tests when the initial result wrongly shows cancer (False Positive). In this
situation, you would prefer a False Positive over a False Negative.
Case 2:
Imagine a patient taking an HIV test. The impacts of a false positive on the patient would at first
be heartbreaking; to have to deal with the trauma of facing this news and telling your family and
friends. But on further examination, the doctors will find out that person in question does not
have the virus. Again, this would not be a particularly pleasant experience. But not having HIV is
ultimately a good thing. On the other hand, a false negative would mean that the patient has
HIV but the test shows a negative result. The implications of this are terrifying: the patient would
be missing out on crucial treatment and would run the risk of spreading the virus.
Without much doubt, the false negative here is the bigger problem. Both for the person and for
society.
Practice exercise on simple binary classification models
Q 1: A binary classifier was evaluated using a set of 1,000 test examples in which 50% of the
examples are negative. It was found that the classifier has 60 % sensitivity and 70 % accuracy.
Write the confusion matrix for this case.
Ans:
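Of the 1,000 examples, 500 are positive and 500 are negative. A sensitivity of 60 % means that
300 of the 500 positive examples are correctly identified (TP = 300, FN = 200). An accuracy of
70 % means that 700 of the 1,000 examples are classified correctly, so TN = 700 - 300 = 400 and
FP = 500 - 400 = 100.
Based on this information, we can construct the following confusion matrix:

| Predicted Positive | Predicted Negative
|------------------- |-------------------
Actual Positive | 300 | 200
Actual Negative | 100 | 400

This confusion matrix indicates that the classifier correctly identified 300 of the 500 positive
samples (60 % sensitivity) and correctly classified 700 of the 1,000 samples overall (70 %
accuracy).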
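The four cells can also be derived from the figures in the question with a short calculation, using sensitivity = TP/(TP+FN) and accuracy = (TP+TN)/total. The function name and parameter names below are assumptions made for this sketch.

```python
def confusion_from_rates(total, negative_share, sensitivity, accuracy):
    """Derive TP, FN, FP, TN from the class split, sensitivity and accuracy."""
    negatives = round(total * negative_share)
    positives = total - negatives
    tp = round(sensitivity * positives)   # sensitivity = TP / (TP + FN)
    fn = positives - tp
    tn = round(accuracy * total) - tp     # accuracy = (TP + TN) / total
    fp = negatives - tn
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

counts = confusion_from_rates(total=1000, negative_share=0.5,
                              sensitivity=0.6, accuracy=0.7)
print(counts)  # {'TP': 300, 'FN': 200, 'FP': 100, 'TN': 400}
```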
Classification problem using TensorFlow playground
TensorFlow Playground is a tool that helps you grasp the idea of neural networks and how they are
trained on problems such as classification and regression. It is a web app written in JavaScript that lets you play
with a real neural network running in your browser: you click buttons and tweak parameters to
see how it works.
http://playground.tensorflow.org/

https://www.youtube.com/watch?v=rti0Ozfeqn8

https://www.youtube.com/watch?v=g60uieh32iM

https://www.youtube.com/watch?v=ru9dXF04iSE
CLUSTERING
Consider that you have a large collection of books that you have to arrange according to categories on a
bookshelf. For example, you would arrange books like the “Harry Potter” series in one corner and
the “Famous Five” series in another.
There could be many other criteria for clustering,
such as clustering based on author, genre, year of
publication, hardcover vs. paperback, etc.

Harry Potter Series (Cluster -1) Famous Five series collection (Cluster – 2)
When I visit a city, I like to walk as much as possible, but I also want to optimize my time to see as many
attractions as possible. I am planning my next trip to Mumbai for four days. I have researched online and
made a list of 20 places that I would like to visit during this trip. In order to optimize time and cover all the
shortlisted places, I will need to bucket (“cluster”) the places based on proximity to each other. Creating the buckets
is in fact a method of clustering. In this sense, we perform the process of clustering almost every day in some
way or the other.
What is Clustering
Clustering is unsupervised learning which deals with finding a pattern in the collection of
unlabeled data. It is a technique of grouping similar data in such a way that data/objects in a
group are more similar to each other than the data/ objects in the other groups.
a simple graphical example:

Another example to understand clustering. Imagine X owns a chain of flavored milk parlors. The
parlor sells milk in 2 flavors – Strawberry (S) and Chocolate (C) across 8 outlets. In the below
table, you see the sales of both strawberry and chocolate flavored milk across the eight outlets.
To get a better understanding of the sales data, you can
plot it on a graph. Below we have plotted the sales of
both strawberry and chocolate. The eight dots in
this graph represent the 8 stores; the Y-axis
indicates the strawberry sales and the X-axis indicates the
chocolate sales.
After analysing this graph, you can:
• gain a better insight into the sales data
• see a pattern emerging: two groups of
stores behave slightly differently in terms of their
strawberry and chocolate sales. This is essentially
how clustering works.
Clustering algorithms can be applied in many fields, for instance:
Marketing: If you are a business, it is crucial that you target the right people. Clustering algorithms are
able to group together people with similar traits and likelihood to purchase your product/service. Once
you have the groups identified, target your messaging to them to increase sales probability.
Biology: Classification of plants and animals given their features
Libraries: Book ordering
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; Identifying
frauds
City-planning: Identifying groups of houses according to their house type, value and geographical location
Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones
WWW: Document classification; clustering weblog data to discover groups of similar access patterns
Identifying Fake News: Fake news is being created and spread at a rapid rate due to technology
innovations such as social media. Clustering algorithms are being used to identify fake news based on
the news content. The algorithm takes in the content of a news article,
examines the words used and then clusters them. These clusters help the algorithm
determine which pieces are genuine and which ones are fake. Certain words are found more commonly in
fake articles, and the more such words an article contains, the higher the probability of the material
being fake news.
Clustering Workflow
The following steps are required to cluster the data
1. Prepare the data: Data preparation means choosing the set of features that will be available to the
clustering algorithm.
2. Create similarity metrics: To calculate the similarity between two examples, you need to
combine all the feature data for the two examples into a single numeric value.
For instance, consider a shoe data set with only one feature – “shoe size”. You can quantify how similar two shoes
are by calculating the difference between their sizes. The smaller the numerical difference between sizes, the
greater the similarity between shoes. Such a handcrafted similarity measure is called a manual similarity measure.
The similarity measure is critical to any clustering technique and it must be chosen carefully.
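The shoe-size similarity measure can be written out directly. This is a minimal sketch of the single-feature case described above; the function names and the sample sizes are hypothetical.

```python
def size_difference(size_a, size_b):
    """Manual similarity measure: a smaller difference means more similar shoes."""
    return abs(size_a - size_b)

def most_similar(target, others):
    """Return the shoe size from 'others' that is closest to 'target'."""
    return min(others, key=lambda size: size_difference(target, size))

closest = most_similar(8, [5, 7.5, 11])   # hypothetical shoe sizes
print(closest)  # 7.5
```

With more than one feature, the per-feature differences would have to be combined into one number, which is where the choice of measure becomes important.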
3. Run the clustering algorithm: There are many different approaches to clustering data.
Two broad types of clustering algorithm are hierarchical and partitioning.
4. Interpret the results: Because clustering is unsupervised, no “truth” is available to verify
results. The absence of truth complicates assessing quality. In this situation, interpretation of
results becomes crucial.
Types of Clustering
1. Centroid-based clustering organizes the data into non-hierarchical
clusters. k-means is the most widely-used
centroid-based clustering algorithm. Centroid-based
algorithms are efficient but sensitive to initial
conditions and outliers. We focus on k-means
because it is an efficient, effective, and simple
clustering algorithm.

2. Density-based clustering connects areas of high
example density into clusters. This allows for arbitrary-shaped
distributions as long as dense areas can be
connected. These algorithms have difficulty with data
of varying densities and high dimensions. Further, by
design, these algorithms do not assign outliers to
clusters.
3. Distribution-based clustering assumes data is
composed of distributions, such as Gaussian distributions.
In the figure below, the distribution-based algorithm
clusters data into three Gaussian distributions. As distance
from the distribution's center increases, the probability
that a point belongs to the distribution decreases. The
bands show that decrease in probability.
Distribution-based Clustering

4. Hierarchical clustering creates a tree of clusters.
Hierarchical clustering is well suited to hierarchical data,
such as taxonomies.

Out of the several approaches to clustering mentioned above,
the most widely used clustering algorithm is “centroid-based
clustering using k-means”.
K-MEANS CLUSTERING
K-means is an iterative clustering algorithm that moves towards a local optimum with each iteration,
reducing the distance between the data points and their cluster centroids. Each iteration has two steps:
Step 1: Cluster Assignment
In this step, the algorithm goes to each of the data points and assigns the data point to one of
the cluster centroids. The assignment of data point to a particular cluster is determined by how
close the data point is from the particular centroid.
Step 2: Move centroid
In the move centroid step, K-means moves the centroids to the average of the points in a cluster. In
other words, the algorithm calculates the average of all the points in a cluster and moves the
centroid to that average location. The two steps are repeated until no data point changes cluster, so that
there is no further opportunity for change in the clusters. The number of clusters (k) is chosen
before the algorithm starts, and the starting positions of the centroids are usually chosen randomly.
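The two steps can be sketched in plain Python for points in two dimensions. This is an illustrative sketch, not a production implementation: the sample points are hypothetical, the starting centroids are passed in explicitly (rather than picked at random) so that the run is repeatable, and the function names are assumptions.

```python
def distance_sq(p, q):
    """Squared straight-line distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def assign(points, centroids):
    """Step 1 (cluster assignment): index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda i: distance_sq(p, centroids[i]))
            for p in points]

def move(points, labels, centroids):
    """Step 2 (move centroid): move each centroid to the average of its points."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == i]
        if not members:              # an empty cluster keeps its old centroid
            new_centroids.append(old)
            continue
        new_centroids.append((sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members)))
    return new_centroids

def k_means(points, centroids, max_iterations=100):
    """Repeat both steps until no point changes cluster."""
    labels = assign(points, centroids)
    for _ in range(max_iterations):
        centroids = move(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:     # assignments are stable: stop
            break
        labels = new_labels
    return labels, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]   # hypothetical data
labels, centroids = k_means(points, centroids=[(1, 1), (8, 8)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Different starting centroids can lead to different final clusters, which is why centroid-based algorithms are described as sensitive to initial conditions.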
https://www.youtube.com/watch?v=4b5d3muPQmA

https://www.koshegio.com/k-means-clustering-calculator

http://alekseynp.com/viz/k-means.html
