Unsupervised Learning Final
Wei Wu
May 11, 2022
Contents
1 Overview of machine learning
2 Unsupervised learning
3 Types of algorithms (commonly used)
4 Applications
5 Summary
6 Sources
1 Overview of machine learning
Supervised learning: The input is provided as a labelled dataset, from which a model can learn to solve the problem directly.
Unsupervised learning: There is no complete and cleanly labelled dataset.
Reinforcement learning: The algorithms learn to react to an environment on their own.
Semi-supervised learning: An approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training.
2 Unsupervised learning
Consider, for example, the task of grouping a company's customers. Here the output will not be a label, since we cannot model our needs that way. Instead, our program should group the customers according to what makes them similar or unique. This grouping is derived from the features learned during the training phase, and it is an unsupervised learning approach, since there is no supervisor to provide labels for the inputs that map them to outputs.
2.1 Definition
Unsupervised machine learning is the process of inferring underlying hidden
patterns from historical data. Within such an approach, a machine learning
model tries to find any similarities, differences, patterns, and structure in
data by itself. No prior human intervention is needed.
2.2 Examples
1. Picture a toddler. The child knows what the family cat looks like (provided they have one) but has no idea that there are a lot of other cats in the world that are all different. The thing is, if the kid sees another cat, he or she will still be able to recognize it as a cat through a set of features such as two ears, four legs, a tail, fur, whiskers, etc.
2. We have the following items and want to cluster them with unsupervised learning. They are clustered by different characteristics.
3. We have the following animals and try to cluster them with the help of unsupervised learning.
2.4 Advantages
Why are we using Unsupervised Learning?
• Unsupervised learning is helpful for data science teams that don’t know
what they’re looking for in data. It can be used to search for unknown
similarities and differences in data and create corresponding groups.
For example, user categorization by their social media activity.
• It reduces the chance of human error and bias, which could occur during
manual labeling processes.
3 Types of algorithms (commonly used)
3.1 Clustering
Clustering automatically categorizes data into groups according to similarity
criteria.
For example, we need to arrange the following blocks of shapes and colors. According to different characteristics of these blocks, we could have two different results.
K-means clustering, for example, first picks k initial centroids, then assigns every data point to its nearest centroid and recomputes each centroid as the mean of the points assigned to it. It repeats the process until no centroid moves more than a given threshold.
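A minimal Python sketch of this loop (NumPy only; the toy data, the choice of k, and the tolerance are illustrative assumptions, not from the original text):

import numpy as np

def kmeans(X, k, tol=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop once no centroid moves more than the threshold
        if np.linalg.norm(new_centroids - centroids, axis=1).max() <= tol:
            break
        centroids = new_centroids
    return labels, centroids

# toy data: two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)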
(1) Agglomerative Hierarchical Clustering: starts with every data point as its own cluster and merges the closest clusters step by step.
Example:
The example shows how six different clusters (data points) are merged step by step, based on their distance, until they all form one large cluster.
How do we compute the distance between clusters that contain more than one data object? We have the four methods below (sketched in code after the divisive variant):
Centroid: Distance between the centroids of the two clusters.
Average Linkage: Average Distance between all pairs of points of the two
clusters.
Single Linkage: Distance between the two most similar data objects of the
two clusters.
Complete Linkage: Distance between the two most dissimilar data objects of
the two clusters.
(2) Divisive Hierarchical Clustering: starts with the whole data set as a single cluster and then divides it step by step into smaller clusters.
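As a minimal illustration of the agglomerative variant and the four linkage methods above, SciPy's hierarchy module can be used (the toy points are an assumption):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 2-D points forming two loose groups (illustrative only)
X = np.array([[0.0, 0.0], [0.3, 0.1], [0.1, 0.4],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# 'centroid', 'average', 'single' and 'complete' correspond to the four
# inter-cluster distance methods described above
for method in ["centroid", "average", "single", "complete"]:
    Z = linkage(X, method=method)                    # the merge history (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
    print(method, labels)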
The advantage of hierarchical clustering is that we do not have to pre-specify the number of clusters. Its main disadvantage is that it does not scale well: it is not suitable for vast amounts of data or huge datasets.
k-Nearest Neighbors (KNN) classifies a sample by a majority vote of its k nearest neighbors. In the classic example, the test sample (green dot) should be classified either as a blue square or as a red triangle.
If k = 3 (solid line circle) it is assigned to the red triangles because there are
2 triangles and only 1 square inside the inner circle.
If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs.
2 triangles inside the outer circle).
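A minimal KNN sketch in Python; the coordinates are made up so that, as in the example above, the vote flips between k = 3 and k = 5:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # distances from the test sample to every stored training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # labels of the k nearest neighbors, then a majority vote
    nearest = y_train[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# made-up points: two triangles and one square near the test sample,
# two more squares and one far triangle beyond them
X_train = np.array([[0.5, 0.0], [0.0, 0.6], [2.0, 2.0],
                    [0.7, 0.0], [1.0, 0.0], [0.0, 1.1]])
y_train = np.array(["triangle", "triangle", "triangle",
                    "square", "square", "square"])
x_test = np.array([0.0, 0.0])

print(knn_predict(X_train, y_train, x_test, k=3))  # triangle (2 vs. 1)
print(knn_predict(X_train, y_train, x_test, k=5))  # square (3 vs. 2)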
Advantages of KNN:
1. No training period: KNN is called a lazy learner; there is no training period. It stores the training dataset and learns from it only at the time of making real-time predictions, so it is much faster to set up than most other algorithms.
2. Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly without impacting the accuracy of the algorithm.
Disadvantages of KNN:
1. Does not work well with large datasets: in large datasets, the cost of calculating the distance between the new point and each existing point is huge, which degrades the performance of the algorithm.
2. Does not work well with high dimensions: the KNN algorithm does not work well with high-dimensional data, because with a large number of dimensions it becomes difficult for the algorithm to calculate a meaningful distance in each dimension.
An association rule A → B asserts that if a transaction contains A, it is also likely to contain B.
Definition 3: The support of an association rule A → B is supp(A ∪ B), the fraction of all transactions that contain both A and B.
Definition 4: The confidence is the percentage of all transactions satisfying A that also satisfy B.
Definition 5: The confidence of an association rule A → B is
conf(A → B) = supp(A ∪ B) / supp(A) = (number of transactions containing A and B) / (number of transactions containing A).
Example:
We have the itemsets A = {apple}, B = {peach, apple}, C = {peach, banana}, D = {strawberry}, E = {banana}.
Dataset: T = {A, B, C, D, E}
Then we have the following information:
support(A) = 2/5
conf(A → B) = 1/2
Frequent itemset
A frequent itemset is an itemset whose support is greater than or equal to a threshold value minsup, which means: support(X) ≥ minsup.
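A small Python sketch of these definitions, applied to the fruit transactions above (the helper names support and confidence are ours, not from the text):

from itertools import combinations

# the five transactions A-E from the example above
T = [
    {"apple"},            # A
    {"peach", "apple"},   # B
    {"peach", "banana"},  # C
    {"strawberry"},       # D
    {"banana"},           # E
]

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in T) / len(T)

def confidence(antecedent, consequent):
    # conf(A -> B) = supp(A U B) / supp(A)
    return support(antecedent | consequent) / support(antecedent)

print(support({"apple"}))                # 2/5 = 0.4
print(confidence({"apple"}, {"peach"}))  # 1/2 = 0.5, matching conf(A -> B) above

# frequent itemsets: support(X) >= minsup
items = {"apple", "peach", "banana", "strawberry"}
minsup = 0.4
frequent = [set(c) for n in (1, 2) for c in combinations(items, n)
            if support(set(c)) >= minsup]
print(frequent)  # the single items apple, peach and banana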
Consider the given dataset with given transactions.
We can try to avoid considering all the items based on Apriori Principle:
(1) If an itemset is infrequent then all its supersets are infrequent.
(2) If an itemset is frequent then all its subsets are frequent.
(What is a superset? The opposite of a subset: A is a superset of another set B if all elements of B are elements of A. The superset relationship is denoted A ⊇ B. Conversely, we say B is a subset of A.)
With the help of these two principles, we can simplify the algorithm. For example:
Itemset D is infrequent, so all its supersets are infrequent. (This means all itemsets containing D are infrequent, so we can ignore them and keep looking for frequent itemsets.)
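As a tiny illustration of principle (1), candidate itemsets can be pruned in Python before their supports are ever counted (the sets here are illustrative):

from itertools import combinations

infrequent = [frozenset({"D"})]                        # known infrequent itemsets
candidates = [frozenset(c) for c in combinations("ABCDE", 2)]

# Apriori principle (1): every superset of an infrequent itemset is
# infrequent, so candidates containing D can be skipped entirely
pruned = [c for c in candidates if not any(inf <= c for inf in infrequent)]
print(pruned)  # all 2-itemsets except those containing D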
Besides frequent itemsets, two more important itemsets will be introduced
below:
Closed itemset
An itemset X is closed if X is frequent and no immediate superset of X has the same support as X.
Maximal itemset
An itemset X is maximal if X is frequent and no immediate superset of X is frequent.
In this example we can find out that, for the frequent itemset C with supp(C) = 4/5, none of its supersets has the same support as C, so itemset C is closed.
And the frequent itemset {A, B, C, E} is the only maximal itemset, because it has no frequent supersets.
In conclusion, |F| = 15, |C| = 5, |M| = 1.
So the relationship between these three families of itemsets is M ⊂ C ⊂ F.
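The dataset behind this example lives in a figure that is not reproduced here, so the following sketch checks closedness and maximality on a small placeholder transaction set (its counts, 9/4/2, differ from the 15/5/1 above):

from itertools import combinations

T = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
minsup = 2  # absolute support count

def supp(s):
    # number of transactions containing every item of s
    return sum(s <= t for t in T)

items = sorted(set().union(*T))
frequent = [frozenset(c) for n in range(1, len(items) + 1)
            for c in combinations(items, n) if supp(frozenset(c)) >= minsup]

# checking all proper supersets is equivalent to checking immediate ones,
# since support can only shrink as itemsets grow
closed = [f for f in frequent
          if not any(f < g and supp(g) == supp(f) for g in frequent)]
maximal = [f for f in frequent if not any(f < g for g in frequent)]
print(len(frequent), len(closed), len(maximal))  # 9 4 2 for this toy data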
Step 1: Scan T and calculate the support of each candidate; this gives C1 (the candidate set of level 1) and L1 (the frequent itemsets of level 1).
Step 2: Combine the itemsets in L1 to generate C2.
Step 3: Scan T to compute the support of each candidate in C2 and obtain L2.
Step 4: Combine the itemsets in L2 to generate C3.
Step 5: Scan T to compute the support of each candidate in C3 and obtain L3.
⇒ STOP: the algorithm cannot run any further because no new frequent itemsets are identified.
⇒ {B, C, E} is the final frequent itemset.
⇒ Calculate the confidence of each candidate rule and compare it with minconf to find out which association rules are acceptable.
(1) conf(B ∧ C → E) = 2/2 = 1, conf(E → B ∧ C) = 2/3
(2) conf(B ∧ E → C) = 2/3, conf(C → B ∧ E) = 2/3
(3) conf(C ∧ E → B) = 2/2 = 1, conf(B → C ∧ E) = 2/3
All confidences satisfy the minconf threshold, so all of these association rules are acceptable.
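Putting the whole walkthrough together, here is a compact level-wise Apriori sketch in Python. The four transactions are an assumption: the original table is in a figure, so they are reconstructed to be consistent with the supports and confidences computed above (minsup = 2):

from itertools import combinations

# assumed transactions, reconstructed to match the walkthrough above
T = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
minsup = 2  # absolute support count

def supp(s):
    return sum(s <= t for t in T)

# level 1: frequent single items
L = [{frozenset({i}) for i in set().union(*T) if supp({i}) >= minsup}]
while L[-1]:
    prev = L[-1]
    k = len(next(iter(prev))) + 1
    # join step: combine frequent itemsets into candidates of the next level
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # count step: keep the candidates whose support reaches minsup
    L.append({c for c in candidates if supp(c) >= minsup})

frequent = [s for level in L for s in level]
best = max(frequent, key=len)
print(set(best))  # {'B', 'C', 'E'}, the final frequent itemset

# rule generation: confidence of every rule built from the final itemset
for r in range(1, len(best)):
    for ante in map(frozenset, combinations(best, r)):
        conf = supp(best) / supp(ante)
        print(set(ante), "->", set(best - ante), "conf =", round(conf, 2))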
4 Applications
Applications of the Apriori algorithm:
1. In the education field: extracting association rules by mining the characteristics and specialties of admitted students.
2. In the medical field: analyzing patients' databases.
3. In forestry: analyzing the probability and intensity of forest fires with forest fire data.
4. Apriori is used by many companies, like Amazon in recommender systems and Google for the auto-complete feature.
5 Summary
• Unsupervised learning is a machine learning technique where you do not need to supervise the model.
• Unsupervised machine learning helps you find all kinds of unknown patterns in data.
• Clustering and Association are two types of Unsupervised learning.
• Three types of clustering methods are 1) k-means, 2) agglomerative clustering, and 3) k-nearest neighbors.
• Association rules allow you to establish associations amongst data objects
inside large databases.
• In supervised learning, algorithms are trained using labelled data, while in unsupervised learning, algorithms are used on data that is not labelled.
• The biggest drawback of Unsupervised learning is that you cannot get pre-
cise information regarding data sorting.
6 Sources
1. https://www.altexsoft.com/blog/unsupervised-machine-learning/
2. https://en.wikipedia.org/wiki/K-means_clustering
3. https://en.wikipedia.org/wiki/Hierarchical_clustering
4. https://www.youtube.com/watch?v=FvFsjdMwANU
5. https://www.guru99.com/unsupervised-machine-learning.html
6. https://medium.com/@manilwagle/association-rules-unsupervised-learning-in-retail-69791aef99a
7. https://datascienceguide.github.io/association-rule-mining
8. https://en.wikipedia.org/wiki/Association_rule_learning
9. https://www.softwaretestinghelp.com/apriori-algorithm/
10. https://www.youtube.com/watch?v=nSpajfE5Ujc
11. https://bainingchao.github.io/2018/09/27/Apriori/
12. https://www.youtube.com/watch?v=rZRxCpLdNrg
13. https://www.aitude.com/supervised-vs-unsupervised-vs-reinforcement/
14. http://theprofessionalspoint.blogspot.com/2019/02/advantages-and-disadvantages-of-knn.html
15. https://www.iguazio.com/glossary/unsupervised-ml/
16. https://www.baeldung.com/cs/examples-supervised-unsupervised-learning