Unit 9 - Classification & Clustering
CLASS XI
ARTIFICIAL INTELLIGENCE
Now suppose that, after training on the data, you present a new fruit (say, a banana) from
the basket and ask the machine to identify it. Since the machine has already learnt
from the previous data, it will use that learning to classify the fruit
based on its shape and color, confirm the fruit as BANANA, and place it
in the Banana category. Thus the machine learns from the training data (the basket
containing fruits) and then applies that knowledge to the test data (the new fruit).
Supervised learning is further classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category,
such as “Red” or “blue” or “disease” and “no disease”.
Regression: A regression problem is when the output variable
is a real value, such as “INR” or “Kilograms”, “Fahrenheit” etc.
What is classification in Artificial Intelligence/Machine Learning (AI/ML)?
Classification is the process of categorizing a set of data (structured data or
unstructured data) into different categories or classes where we can assign label
to each class. Classification problems normally have a categorical output like a
‘yes’ or ‘no’, ‘1’ or ‘0’, ‘True’ or ‘false’.
In the picture, we are assigning the
labels ‘paper’, ‘metal’, ‘plastic’, and
so on to different types of waste.
Question 1: What is the difference between classification and regression?
The main difference between Regression and Classification algorithms is that
Regression algorithms are used to predict continuous values such as price,
salary, age, etc., whereas Classification algorithms are used to predict/classify
discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
ii) Multi Class Classification: Classification with more than two distinct classes.
The resulting feature vector is then used to train a Logistic classifier which emits a
score in the range 0 to 1. If the score is more than 0.5, we label the email as
spam. Otherwise, we don’t label it as spam.
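The scoring and thresholding step described above can be sketched in a few lines of Python. This is only an illustration: the three features, the weights, and the bias are made-up values (they are not learned from real emails), and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def spam_score(features, weights, bias):
    """Weighted sum of the features passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def label_email(score, threshold=0.5):
    """Label as spam only when the score is more than the threshold."""
    return "spam" if score > threshold else "not spam"

# Hypothetical feature vector:
# [count of "free", count of "winner", number of links]
weights = [0.8, 1.2, 0.5]   # made-up values, not learned from data
bias = -2.0                 # made-up value

suspicious = spam_score([2, 1, 3], weights, bias)
ordinary = spam_score([0, 0, 1], weights, bias)
print(label_email(suspicious), "/", label_email(ordinary))
```

In a real classifier the weights and bias would be found by training on labelled emails rather than being chosen by hand.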
Logistic Regression
Example 2: A Logistic Regression classifier may be used to identify whether a
tumor is malignant or if it is benign. Several medical imaging techniques are used
to extract various features of tumors.
For instance, the size of the tumor, the affected body area, etc. These features are
then fed to a Logistic Regression classifier to identify if the tumor is malignant or
if it is benign.
The above two problems are solved using the logistic regression algorithm because
there are only two possible labels in each case: Spam / Not spam and
malignant / benign, i.e. binomial (binary) classification.
Confusion Matrix
A confusion matrix is a matrix (an NxN table) that is used to evaluate how well
a classification model performs in the field of ML/AI,
where N is the number of target classes.
The confusion matrix compares the actual target values with those predicted by
the classification model. This gives how well the classification model is performing
and what kind of error it is making.
For a binary classification problem, we would have a 2x2 matrix.
A true positive is an outcome where the model correctly predicts the positive class. Similarly, a
true negative is an outcome where the model correctly predicts the negative class.
A false positive is an outcome where the model incorrectly predicts the positive class. And a
false negative is an outcome where the model incorrectly predicts the negative class.
True Positive (TP)
• The predicted value matches the actual value
• The actual value was positive and classification model also predicts positive
• There is no error
True Negative (TN)
• The predicted value matches the actual value
• The actual value was negative and classification model also forecasts negative
• There is no error
False Positive (FP)
• The predicted value doesn’t match the actual value
• The actual value was negative but the model predicted a positive value
• This is Type 1 Error
False Negative (FN)
• The predicted value doesn’t match the actual value
• The actual value was positive but the model predicted a negative value
• This is Type 2 Error
An example from cricket helps to explain this, taking NOT OUT as the positive class.
True Positive (TP) - Umpire gives a batsman NOT OUT when he is NOT OUT.
True Negative (TN) - Umpire gives a batsman OUT when he is OUT.
False Positive (FP) - Umpire gives a batsman NOT OUT when he is OUT.
False Negative (FN) - Umpire gives a batsman OUT when he is NOT OUT.
Question 1:
Assume there are 100 images, 30 of them depict a cat, the rest do not. A machine learning
model predicts the occurrence of a cat in 25 of 30 cat images. It also predicts absence of a cat in
50 of the 70 no cat images.
In this case, what are the true positive, false positive, true negative and false negative?
Solution: Assuming cat as a positive class.
Confusion Matrix:
• True Positive (TP): Images which are cat and were predicted as cat, i.e. 25
• True Negative (TN): Images which are not-cat and were predicted as not-cat, i.e. 50
• False Positive (FP): Images which are not-cat but were predicted as cat, i.e. 70 - 50 = 20
• False Negative (FN): Images which are cat but were predicted as not-cat, i.e. 30 - 25 = 5
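The four counts above can also be obtained by comparing the actual and predicted labels one image at a time. The sketch below rebuilds the 100 images from the question as two Python lists; the function name and the label strings are illustrative choices.

```python
def confusion_counts(actual, predicted, positive="cat"):
    """Count TP, FP, TN and FN by comparing each actual/predicted pair."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Model said "cat": correct only if the image really is a cat.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Model said "not-cat": correct only if the image is not a cat.
            counts["FN" if a == positive else "TN"] += 1
    return counts

# 30 cat images followed by 70 not-cat images, as in the question.
actual = ["cat"] * 30 + ["not-cat"] * 70
# 25 of the cat images are detected, 5 are missed;
# 50 of the not-cat images are rejected, 20 are wrongly called cat.
predicted = ["cat"] * 25 + ["not-cat"] * 5 + ["not-cat"] * 50 + ["cat"] * 20

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 25, 'FP': 20, 'TN': 50, 'FN': 5}
```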
Precision: Precision is defined as the percentage of true positive cases out of
all the cases that the model predicted as positive.
Precision = TP/(TP+FP) = 25/(25+20) = 0.56
Recall (also known as Sensitivity): It is defined as the fraction of positive
cases that are correctly identified.
Recall = TP/(TP+FN) = 25/(25+5) = 0.833
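The two formulas can be checked with a short Python sketch using the counts from the cat example (TP = 25, FP = 20, FN = 5). The function names are illustrative.

```python
def precision(tp, fp):
    """Share of predicted positives that are really positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)

p = precision(tp=25, fp=20)
r = recall(tp=25, fn=5)
print(f"Precision: {p:.3f}")  # Precision: 0.556
print(f"Recall:    {r:.3f}")  # Recall:    0.833
```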
Confusion Matrix Example 1: Do you still remember the shepherd boy story?
“A shepherd boy used to take his herd of sheep across the fields to the lawns near the forest.
One day he felt very bored. He wanted to have fun. So he cried aloud "Wolf, Wolf. The wolf is
carrying away a lamb". Farmers working in the fields came running and asked, "Where is the
wolf?". The boy laughed and replied "It was just for fun. Now get going all of you". The boy
played the trick for quite a number of times in the next few days. After some days, as the boy
was perched on a tree, singing a song, there came a wolf. The boy cried loudly "Wolf, Wolf, the
wolf is carrying a lamb away." There was no one to the rescue. The boy shouted "Help! Wolf!
Help!" Still no one came to his help. The villagers thought that the boy was playing mischief
again. The wolf carried a lamb away.”
Let us work on arriving at a confusion matrix for the above situation:
• "Wolf" is a positive class.
• "No wolf" is a negative class.
Question 2:
Assume there are 100 images, 30 of them depict a cat, the rest do not. A machine learning
model predicts the occurrence of a cat in 25 of 30 cat images. It also predicts absence of a cat in
50 of the 70 no cat images. Create the confusion matrix.
Question 3:
Above is a confusion matrix prepared for a binary classifier to detect email as Spam and Not
Spam. What is your interpretation of the above matrix?
Why do you need a Confusion matrix? The benefits of using a confusion matrix:
• It shows how any classification model is confused when it makes predictions
• A confusion matrix gives insight not only into how many errors the classifier makes but also
into the types of errors that are being made
• This helps overcome the limitations of using classification accuracy alone
• Every column of the confusion matrix represents the instances of the predicted class
• Each row of the confusion matrix represents the instances of the actual class
• It provides insight not only into the errors which are made by a classifier but also errors that
are being made in general
Confusion Matrix Quiz!
https://www.inabia.com/learning/quiz/confusion-matrix-quiz/
https://quizizz.com/admin/presentation/5f3f4683ae1779001b12c14a/confusion-matrix-final
                | Predicted Positive | Predicted Negative
--------------- |------------------- |-------------------
Actual Positive | 350                | 150
Actual Negative | 30                 | 470
This confusion matrix indicates that the classifier correctly classified 350 out of the 500 positive
samples (70% sensitivity), and correctly identified 470 out of the 500 negative samples (94%
specificity).
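The rates for this matrix can be worked out directly from its four cells. The sketch below assumes, following the sentence above, that the first row holds the 500 actual positive samples and the second row the 500 actual negative samples; the variable names are illustrative.

```python
# Cell values read from the matrix (assumed layout: rows = actual class).
tp, fn = 350, 150   # actual positives: predicted positive / predicted negative
fp, tn = 30, 470    # actual negatives: predicted positive / predicted negative

sensitivity = tp / (tp + fn)                 # positives correctly found
specificity = tn / (tn + fp)                 # negatives correctly rejected
accuracy = (tp + tn) / (tp + tn + fp + fn)   # all correct predictions

print(f"Sensitivity: {sensitivity:.0%}")  # Sensitivity: 70%
print(f"Specificity: {specificity:.0%}")  # Specificity: 94%
print(f"Accuracy:    {accuracy:.0%}")     # Accuracy:    82%
```

Accuracy combines both rows, which is why it lies between the sensitivity and the specificity here.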
Classification problem using TensorFlow playground
TensorFlow Playground is a tool that helps you grasp the idea of neural networks and how they are
trained on problems such as classification and regression. It is a web app written in JavaScript that
lets you play with a real neural network running in your browser: you click buttons and tweak
parameters to see how it works.
http://playground.tensorflow.org/
https://www.youtube.com/watch?v=rti0Ozfeqn8
https://www.youtube.com/watch?v=g60uieh32iM
https://www.youtube.com/watch?v=ru9dXF04iSE
CLUSTERING
Suppose you have a large collection of books that you have to arrange by category on a
bookshelf. For example, you would arrange books like the “Harry Potter” series in one corner and
the “Famous Five” series in another.
There could be many other criteria for clustering,
such as author, genre, year of
publication, hardcover vs. paperback, etc.
[Figure: Harry Potter series (Cluster 1) and Famous Five series collection (Cluster 2)]
When I visit a city, I like to walk as much as possible, but I also want to optimize my time to see as many
attractions as possible. I am planning my next trip to Mumbai for four days. I have researched online and
made a list of 20 places that I would like to visit during this trip. In order to optimize time and cover all the
shortlisted places, I will need to bucket (“cluster”) the places based on their proximity to each other. Creating the
buckets is in fact a method of clustering. In this sense, we perform clustering almost every day in some
way or the other.
What is Clustering?
Clustering is unsupervised learning which deals with finding a pattern in the collection of
unlabeled data. It is a technique of grouping similar data in such a way that data/objects in a
group are more similar to each other than the data/ objects in the other groups.
a simple graphical example:
Another example to understand clustering. Imagine X owns a chain of flavored milk parlors. The
parlor sells milk in 2 flavors – Strawberry (S) and Chocolate (C) across 8 outlets. In the below
table, you see the sales of both strawberry and chocolate flavored milk across the eight outlets.
To get a better understanding of the sales data, you can
plot it on a graph. Below we have plotted the sales of
both strawberry and chocolate. The eight dots in
this graph represent the 8 stores. The Y-axis
indicates the strawberry sales and the X-axis indicates the
chocolate sales.
After analysing this graph, you can:
• Get a better insight into the sales data
• See a pattern emerging: two groups of
stores behave slightly differently in terms of their
strawberry and chocolate sales. This is essentially
how clustering works.
Clustering algorithms can be applied in many fields, for instance:
Marketing: If you are a business, it is crucial that you target the right people. Clustering algorithms are
able to group together people with similar traits and likelihood to purchase your product/service. Once
you have the groups identified, target your messaging to them to increase sales probability.
Biology: Classification of plants and animals given their features
Libraries: Book ordering
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; Identifying
frauds
City-planning: Identifying groups of houses according to their house type, value and geographical location
Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones
WWW: Document classification; clustering weblog data to discover groups of similar access patterns
Identifying Fake News: Fake news is being created and spread at a rapid rate due to technology
innovations such as social media. Clustering algorithms are being used to identify fake news based on
the news content. The algorithm takes in the content of a news article,
examines the words used, and then clusters them. These clusters help the algorithm
determine which pieces are genuine and which ones are fake. Certain words are found more commonly in
fake articles, so the more such words appear in an article, the higher the probability that the material
is fake news.
Clustering Workflow
The following steps are required to cluster the data
1. Prepare the data: Data preparation means deciding on the set of features that will be made
available to the clustering algorithm.
2. Create similarity metrics: To calculate the similarity between two examples, you need to
combine all the feature data for those two examples into a single numeric value.
For instance, consider a shoe data set with only one feature – “shoe size”. You can quantify how similar two shoes
are by calculating the difference between their sizes. The smaller the numerical difference between sizes, the
greater the similarity between shoes. Such a handcrafted similarity measure is called a manual similarity measure.
The similarity measure is critical to any clustering technique and it must be chosen carefully.
3. Run the clustering algorithm: There are many different approaches to clustering data.
There are two types of clustering algorithms: hierarchical and partitioning.
4. Interpret the results: Because clustering is unsupervised, no “truth” is available to verify
results. The absence of truth complicates assessing quality. In this situation, interpretation of
results becomes crucial.
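The manual similarity measure from step 2 (the shoe-size example) can be sketched as follows. The shoe sizes and function names are hypothetical; the only idea taken from the text is that a smaller difference in size means greater similarity.

```python
def size_difference(size_a, size_b):
    """Manual similarity measure: smaller difference = more similar."""
    return abs(size_a - size_b)

def most_similar(target, candidates):
    """Return the candidate shoe size closest to the target size."""
    return min(candidates, key=lambda size: size_difference(target, size))

sizes = [6, 7.5, 9, 11]          # hypothetical shoe sizes
print(size_difference(8, 9))     # 1
print(most_similar(8, sizes))    # 7.5
```

With more than one feature, the differences for all features would have to be combined into a single number, which is why the choice of similarity measure needs care.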
Types of Clustering
1. Centroid-based clustering organizes the data into
non-hierarchical clusters. k-means is the most widely used
centroid-based clustering algorithm. Centroid-based
algorithms are efficient but sensitive to initial
conditions and outliers. We focus on k-means
because it is an efficient, effective, and simple
clustering algorithm.
https://www.koshegio.com/k-means-clustering-calculator
http://alekseynp.com/viz/k-means.html
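To see what a centroid-based algorithm does, here is a minimal k-means sketch in plain Python. It is a teaching illustration under simple assumptions, not a production implementation: the six 2-D points and the two starting centroids are made up, and the starting centroids are chosen by hand, which is exactly the "initial conditions" the algorithm is sensitive to.

```python
import math

def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to the point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def mean_point(points):
    """Average a non-empty list of points, coordinate by coordinate."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, max_iterations=100):
    """Repeat the assign and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[closest_centroid(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

Choosing different starting centroids, or adding an outlier far from both groups, is a quick way to observe the sensitivity mentioned above.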