ML Unit 3
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in a tree is called the root node.
Decision tree learning is one of the most widely used and practical methods for inductive inference. It is a method
for approximating discrete-valued functions that is robust to noisy data and capable of learning disjunctive
expressions. The most popular family of decision tree learning algorithms includes algorithms such as ID3,
ASSISTANT, and C4.5. These decision tree learning methods search a completely expressive hypothesis space and
thus avoid the difficulties of restricted hypothesis spaces. Their inductive bias is a preference for small trees over
large trees.
INTRODUCTION
Decision tree learning is a method for approximating discrete-valued target functions, in which the learned
function is represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to
improve human readability. These learning methods are among the most popular inductive inference algorithms and have been successfully applied to a broad range of tasks, from learning to diagnose medical cases to learning to assess the credit risk of loan applicants.
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which
provides the classification of the instance. Each node in the tree specifies a test of some attribute of the
instance, and each branch descending from that node corresponds to one of the possible values for this
attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this
node, then moving down the tree branch corresponding to the value of the attribute in the given example. This
process is then repeated for the sub-tree rooted at the new node.
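As a rough sketch (not part of the original notes), the Python fragment below walks an instance down a small tree represented as nested dictionaries; the tree and attribute names assume the standard PlayTennis example used in the figures that follow.

# Hypothetical nested-dict representation of the PlayTennis tree of Figure 3.1
tree = {
    'Outlook': {
        'Sunny':    {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
        'Overcast': 'Yes',
        'Rain':     {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
    }
}

def classify(node, instance):
    # A leaf node is just a class label (a string)
    if not isinstance(node, dict):
        return node
    # An internal node tests one attribute; follow the branch for the instance's value
    attribute = next(iter(node))
    return classify(node[attribute][instance[attribute]], instance)

instance = {'Outlook': 'Sunny', 'Temperature': 'Hot', 'Humidity': 'High', 'Wind': 'Strong'}
print(classify(tree, instance))   # -> 'No'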
Figure 3.1 illustrates a typical learned decision tree. This decision tree classifies Saturday mornings according
to whether they are suitable for playing tennis. For example, the instance
(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative
instance (i.e., the tree predicts that PlayTennis = no).
In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of
instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree
itself to a disjunction of these conjunctions. For example, the decision tree shown in Figure 3.1 corresponds to
the expression

   (Outlook = Sunny ∧ Humidity = Normal)
   ∨ (Outlook = Overcast)
   ∨ (Outlook = Rain ∧ Wind = Weak)
● Given a tuple X for which the associated class label is unknown, the attribute values of the tuple are
tested against the decision tree.
● Decision tree classifiers have good accuracy and are used, for example, in the areas of medicine,
manufacturing and production, and financial analysis.
● During tree construction, attribute selection measures are used to select the attribute that best partitions
the tuples into distinct classes.
● When decision trees are built, many branches may reflect noise or outliers in the training data.
● Tree pruning attempts to identify and remove such branches with the goal of improving classification
accuracy on unseen data.
Most algorithms that have been developed for learning decision trees are variations on a core algorithm that
employs a top-down, greedy search through the space of possible decision trees. This approach is exemplified
by the ID3 algorithm (Quinlan 1986) and its successor C4.5 (Quinlan 1993).
The ID3 algorithm learns decision trees by constructing them top-down, beginning with the attribute that best
classifies the given data. To find the best attribute, each instance attribute is evaluated using a statistical test to
determine how well it alone classifies the training examples. The best attribute is selected and used as the test
at the root node of the tree. A descendant of the root node is then created for each possible value of this
attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch
corresponding to the example's value for this attribute).
The entire process is then repeated using the training examples associated with each descendant node to select
the best attribute to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in
which the algorithm never backtracks to reconsider earlier choices.
The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree. The
attribute that is most useful for classifying examples is selected. A good quantitative measure of the worth of
an attribute is defined by a statistical property, called information gain. It measures how well a given attribute
separates the training examples according to their target classification. ID3 uses this information gain measure
to select among the candidate attributes at each step while growing the tree.
In order to define information gain precisely, we begin by defining a measure commonly used in information
theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples. Given a
collection S, containing positive and negative examples of some target concept, the entropy of S relative to this
boolean classification is

   Entropy(S) = −p⊕ log2(p⊕) − p⊖ log2(p⊖)

where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S.
To illustrate, suppose S is a collection of 14 examples of some Boolean concept, including 9 positive and 5
negative examples. Then the entropy of S relative to this boolean classification is

   Entropy([9+, 5−]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Notice that the entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p⊕ = 1), then p⊖ is 0, and Entropy(S) = −1·log2(1) − 0·log2(0) = −1·0 − 0 = 0. Note that the entropy is 1 when the collection contains an equal number of positive and negative examples. If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1.
Figure 3.2 shows the form of the entropy function relative to a boolean classification, as p⊕ varies between 0
and 1. One interpretation of entropy from information theory is that it specifies the minimum number of bits of
information needed to encode the classification of an arbitrary member of S (i.e., a member of S drawn at
random with uniform probability).
More generally, if the target attribute can take on c different values, then the entropy of S relative to this c-wise
classification is defined as

   Entropy(S) = Σ (i = 1 to c) −p_i log2(p_i)

Where, p_i is the proportion of S belonging to class i. The logarithm is base 2 because entropy is a measure of the expected encoding length measured in bits. Note also that if the target attribute can take on c possible values, the entropy can be as large as log2(c).
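A quick Python check of this entropy formula (a sketch, not part of the original notes), using the [9+, 5−] collection from the example above:

import math

def entropy(proportions):
    # Entropy(S) = sum over classes of -p_i * log2(p_i), treating 0*log2(0) as 0
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(round(entropy([9/14, 5/14]), 3))   # 0.94 for the [9+, 5-] collection
print(entropy([0.5, 0.5]))               # 1.0 for an evenly split collection
print(entropy([1.0]))                    # 0.0 when all members belong to one class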
Information gain is the expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as

   Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|S_v| / |S|) · Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
For example, suppose S is a collection of training-example days described by attributes including Wind, which
can have the values Weak or Strong. As before, assume S is a collection containing 14 examples, [9+, 5-]. Of
these 14 examples, suppose 6 of the positive and 2 of the negative examples have Wind = Weak, and the remainder have Wind = Strong. The information gain due to sorting the original 14 examples by the attribute Wind may then be calculated as

   Values(Wind) = {Weak, Strong}, S = [9+, 5−], S_Weak = [6+, 2−], S_Strong = [3+, 3−]

   Gain(S, Wind) = Entropy(S) − (8/14)·Entropy(S_Weak) − (6/14)·Entropy(S_Strong)
                 = 0.940 − (8/14)(0.811) − (6/14)(1.000)
                 = 0.048
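The same calculation can be reproduced in Python (a small sketch assuming the class counts above):

import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * math.log2(p) for p in (pos / total, neg / total) if p > 0)

# S = [9+, 5-]; Wind = Weak gives [6+, 2-]; Wind = Strong gives [3+, 3-]
gain = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain, 3))   # 0.048, matching Gain(S, Wind)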
Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the
tree. The use of information gain to evaluate the relevance of attributes is summarized in Figure 3.3. In this
figure the information gain of two different attributes, Humidity and Wind, is computed in order to determine
which is the better attribute for classifying the training examples shown in Table 3.2.
Here the target attribute PlayTennis, which can have values yes or no for different Saturday mornings, is to be
predicted based on other attributes of the morning in question. Consider the first step through the algorithm, in
which the topmost node of the decision tree is created.
ID3 determines the information gain for each candidate attribute (i.e., Outlook, Temperature, Humidity, and
Wind), then selects the one with highest information gain. The computation of information gain for two of
these attributes is shown in Figure 3.3. The information gain values for all four attributes are
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
where S denotes the collection of training examples from Table 3.2.
According to the information gain measure, the Outlook attribute provides the best prediction of the target
attribute, PlayTennis, over the training examples. Therefore, Outlook is selected as the decision attribute for
the root node, and branches are created below the root for each of its possible values (i.e., Sunny, Overcast,
and Rain). The resulting partial decision tree is shown in Figure 3.4, along with the training examples sorted to
each new descendant node. Note that every example for which Outlook = Overcast is also a positive example
of PlayTennis. Therefore, this node of the tree becomes a leaf node with the classification PlayTennis = Yes.
In contrast, the descendants corresponding to Outlook = Sunny and Outlook = Rain still have nonzero entropy,
and the decision tree will be further elaborated below these nodes.
The process of selecting a new attribute and partitioning the training examples is now repeated for each
nonterminal descendant node, this time using only the training examples associated with that node. Attributes
that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once
along any path through the tree. This process continues for each new leaf node until either of two conditions is
met: (1) every attribute has already been included along this path through the tree, or (2) the training examples
associated with this leaf node all have the same target attribute value (i.e., their entropy is zero). Figure 3.4
illustrates the computations of information gain for the next step in growing the decision tree. The final
decision tree learned by ID3 from the 14 training examples of Table 3.2 is shown in the figure below:
A decision tree classifier can also be built with scikit-learn; calling predict on a fitted classifier (e.g. clf.predict(...)) returns the predicted class label for a new instance, as in the sketch below.
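A hypothetical reconstruction of such an example: it trains scikit-learn's DecisionTreeClassifier with the entropy criterion (so splits are chosen by information gain, as in ID3) on the PlayTennis data of Table 3.2; the 0/1/2 numeric encoding of the attribute values is an assumption made here.

from sklearn.tree import DecisionTreeClassifier

# PlayTennis data from Table 3.2, numerically encoded (encoding assumed here):
# Outlook: Sunny=0, Overcast=1, Rain=2; Temperature: Hot=0, Mild=1, Cool=2
# Humidity: High=0, Normal=1; Wind: Weak=0, Strong=1; PlayTennis: No=0, Yes=1
X = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [2, 1, 0, 0], [2, 2, 1, 0],
     [2, 2, 1, 1], [1, 2, 1, 1], [0, 1, 0, 0], [0, 2, 1, 0], [2, 1, 1, 0],
     [0, 1, 1, 1], [1, 1, 0, 1], [1, 0, 1, 0], [2, 1, 0, 1]]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

# Note: scikit-learn treats the encoded attributes as numeric thresholds, not categorical tests
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X, y)

# Classify (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong)
print(clf.predict([[0, 0, 0, 1]]))   # expected [0], i.e. PlayTennis = No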
ID3 searches a complete hypothesis space (i.e., one capable of expressing any finite discrete-valued function).
It searches incompletely through this space, from simple to complex hypotheses, until its termination condition
is met (e.g., until it finds a hypothesis consistent with the data). Its inductive bias is solely a consequence of the
ordering of hypotheses by its search strategy. Its hypothesis space introduces no additional bias.
There are two common approaches for avoiding overfitting in decision tree learning: stopping the tree growth early, or growing the full tree and then post-pruning it. The second approach, post-pruning overfit trees, has been found to be more successful in practice.
Pruning a decision node consists of removing the sub-tree rooted at that node, making it a leaf node, and
assigning it the most common classification of the training examples affiliated with that node. Nodes are
removed only if the resulting pruned tree performs no worse than the original over the validation set. This has
the effect that any leaf node added due to coincidental regularities in the training set is likely to be pruned
because these same coincidences are unlikely to occur in the validation set. Nodes are pruned iteratively,
always choosing the node whose removal most increases the decision tree accuracy over the validation set.
Pruning of nodes continues until further pruning is harmful.
RULE POST-PRUNING:
In rule post-pruning, one rule is generated for each leaf node in the tree. Each attribute test along the path from
the root to the leaf becomes a rule antecedent (precondition) and the classification at the leaf node becomes the
rule consequent (post-condition). For example, the leftmost path of the tree in Figure 3.1 is translated into the rule

   IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
Next, each such rule is pruned by removing any antecedent, or precondition, whose removal does not worsen
its estimated accuracy. Given the above rule, for example, rule post-pruning would consider removing the
preconditions (Outlook = Sunny) and (Humidity = High).
It would select whichever of these pruning steps produced the greatest improvement in estimated rule
accuracy, then consider pruning the second precondition as a further pruning step. No pruning step is
performed if it reduces the estimated rule accuracy.
To deal with continuous variables, the continuous variables are discretized. For a continuous variable there is not just one place to split it: the variable can be broken between any pair of data points, as shown in the figure. Choosing splits is therefore more expensive for continuous variables than for discrete ones. In general, only one split is made to a continuous variable, rather than allowing multi-way branching.
Classification and Regression Trees (CART) is a decision tree algorithm that is used for both classification and
regression tasks. It is a supervised learning algorithm that learns from labelled data to predict unseen data.
Classification Trees: The tree is used to determine which “class” the target variable is most likely to fall into, when the target is categorical.
Regression trees: These are used to predict a continuous variable’s value.
Tree structure:
CART builds a tree-like structure consisting of nodes and branches. The nodes represent different decision
points, and the branches represent the possible outcomes of those decisions. The leaf nodes in the tree contain a
predicted class label or value for the target variable.
Pruning:
To prevent overfitting of the data, pruning is a technique used to remove the nodes that contribute little to the
model accuracy. Cost complexity pruning and information gain pruning are two popular pruning techniques.
Cost complexity pruning involves calculating the cost of each node and removing nodes that have a negative
cost. Information gain pruning involves calculating the information gain of each node and removing nodes that
have a low information gain. The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does that by searching for the best homogeneity of the sub-nodes, with the help of the Gini index criterion.
   Gini = 1 − Σ_i (p_i)²
Where, pi is the probability of an object being classified to a particular class.
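A small Python sketch of this Gini computation (not from the original notes):

def gini(proportions):
    # Gini = 1 - sum of squared class proportions
    return 1 - sum(p ** 2 for p in proportions)

print(gini([9/14, 5/14]))   # ~0.459 for the [9+, 5-] PlayTennis collection
print(gini([1.0]))          # 0.0 for a pure node
print(gini([0.5, 0.5]))     # 0.5, the maximum impurity for two classes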
A small regression-tree example in Python; the fitting step here assumes scikit-learn's DecisionTreeRegressor and made-up test inputs:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
# Dataset
data = pd.DataFrame({
    'House_Size': [750, 800, 850, 900, 950, 1000],
    'Price': [150, 180, 200, 220, 240, 260]  # Prices in $1000s
})
# Fit a regression tree on house size vs. price
model = DecisionTreeRegressor()
model.fit(data[['House_Size']], data['Price'])
test_data = np.array([[775], [925]])  # unseen house sizes (assumed test inputs)
predictions = model.predict(test_data)
# Display results
for size, price in zip(test_data.flatten(), predictions):
    print(f"Predicted price for house size {size} sq ft: ${price * 1000:.2f}")
Most of our real-world problems are similar in nature: a combination of methods or models is used to solve them.
Ensemble-based learning works on a similar idea. If we want to benefit from the performance of more than one machine learning algorithm, we have to build a model from a combination of algorithms. A machine learning ensemble consists of a concrete, finite set of alternative models, combined with a flexible structure, that produces better solutions.
Thus ensemble methods combine multiple machine learning models to obtain better predictive performance than can be obtained from any of the constituent models alone.
· Combining the predictions of an ensemble is often more accurate than the individual classifiers that make it up.
· An accurate classifier is one that has an error rate better than random guessing.
Random forest:
Let us think about what happens when there are several decision trees, each contributing to the final result. In this case we consider what the majority of the trees are voting for. For example, when 7 trees out of 10 say ‘yes’ and the other 3 say ‘no’, the final result is decided to be ‘yes’. In this manner, majority voting leads us to better accuracy in the final result.
The main aim of constructing a random forest is to arrive at better accuracy in the predictions. When a machine learning method uses a group of other models, or repeats a process several times, it is called an iterative model. Random forest is called an iterative model since it involves a group of decision trees in arriving at the final result. In the case of random forest, how the data is distributed among the various models is the main concern.
Bootstrapping:
Let us imagine a dataset that contains a group of rows. Each row contains some columns, from which we take only those columns that contribute to our analysis. When we represent the relation between these columns on a graph, it is shown as a data point. Thus a dataset contains many data points.
Suppose we want to create a subset of the data points (samples) from the main set of data points. How can we do this? We have to collect 5 data points from the main set of data and put them into the subset. This can be done in two ways:
· In the first way, we actually ‘remove’ the data points from the main set and put them into the subset. That means, when a data point enters the subset, it is removed from the main set and hence is no longer available there.
· The data points removed from the main set are not replaced by any other data. This is called creating the subset without replacement.
· The data points that were put into subset1 were removed from the main dataset, and hence they cannot appear again in subset2.
There is another way of creating a subset of data points. Here, we do not actually remove the data points from the main dataset. We copy the data points from the main dataset and put them into a subset. That means the original data points are still available in the main dataset and only their copies are used in creating the subsets.
In this case, the same data points can be used to create various subsets of data.
The first subset is created by copying 5 data points from the main set. Even after the subset is created, the same data points are still available in the main set. Hence, they can be used either fully or partially in creating the second subset. This is known as creating the subset with replacement of data.
It is possible to create subsets of data from the main dataset, either with replacement or without replacement. This process of creating subsets (bootstrap samples) from the main set of data is called ‘bootstrapping’.
These subsets of data can be fed to the various machine learning models to observe their outputs.
Eg: we have several decision trees in the random forest machine learning model. Each decision tree needs a
subset of data on which it will act and provide the result. The results obtained from all the decision trees are
weighed to arrive at final conclusion.
Through bootstrapping, it is possible to create various subsets of data. Each subset of data is fed to one
decision tree in the random forest. Hence every tree receives different data. So, bootstrapping is the technique used
to create subsets of data that are used by decision trees in the random forest model.
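A minimal numpy sketch of drawing bootstrap subsets, with and without replacement (the toy dataset of 10 points is made up):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1, 11)          # a toy "main dataset" of 10 data points

# With replacement: copies are drawn, so the same point can appear in several subsets
subset1 = rng.choice(data, size=5, replace=True)
subset2 = rng.choice(data, size=5, replace=True)

# Without replacement: each point can be drawn at most once within a subset
subset3 = rng.choice(data, size=5, replace=False)

print(subset1, subset2, subset3)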
# Train a random forest on the scikit-learn digits dataset and check its accuracy
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
dir(digits)

# Look at the first ten digit images
for i in range(10):
    plt.matshow(digits.images[i])

df = pd.DataFrame(digits.data)
df.head()   # in this output, the first row is digit 0, the second row is digit 1, ...

df['target'] = digits.target
df.head()

x = df.drop(['target'], axis='columns')
y = df['target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# The word "ensemble" indicates using multiple algorithms (decision trees) to predict the output
model = RandomForestClassifier()
model.fit(x_train, y_train)      # fit only on the training split
model.score(x_test, y_test)      # accuracy on the held-out test split

# Make a prediction: find the handwritten digit contained in row 12 of the data
model.predict([digits.data[12]])
plt.matshow(digits.images[12])
print(digits.target[12])
Boosting:
It is one of the most popular ensemble methods. Here a collection of very poor learners, each performing only
just better than chance, are put together to make an ensemble learner that can perform arbitrarily well. The
principal algorithm of boosting is named AdaBoost (Adaptive Boosting). The algorithm was proposed as an
improvement on the original 1990 boosting algorithm, which was rather data hungry. In that algorithm, the
training set was split into three. A classifier was trained on the first third, and then tested on the second third.
All of the data that was misclassified during that testing was used to form a new dataset, along with an equally
sized random selection of the data that was correctly classified. A second classifier was trained on this new
dataset, and then both of the classifiers were tested on the final third of the dataset. If they both produced the
same output, then that data point was ignored, otherwise the data point was added to yet another new dataset,
which formed the training set for a third classifier. There are various sorts of boosting algorithms that can be
employed in machine learning. A few of the most well-known are AdaBoost, Gradient Boosting, Stochastic Gradient Boosting, Linear Programming Boosting (LPBoost), and TotalBoost.
AdaBoost:
The innovation that AdaBoost uses is to give weights to each data point according to how difficult previous classifiers have found it to classify correctly. These weights are given to the classifier as part of the input when it is
trained. At each iteration a new classifier is trained on the training set, with the weights that are applied to the
training set for each data point being modified at each iteration according to how successfully that data point
has been classified in the past. The weights are initially all set to the same value, 1/N, where N is the number of
data points in the training set. Then, at each iteration, the error (ϵ ) is computed as the sum of the weights of the
misclassified points, and the weights for incorrect examples are updated by being multiplied by α = (1 − ϵ)/ϵ. Weights for correct examples are left alone, and then the whole set is normalized so that it sums to 1
(which is effectively a reduction in the importance of the correctly classified data points). Training terminates
after a set number of iterations, or when either all of the data points are classified correctly, or one point
contains more than half of the available weight.
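The weight update described above can be sketched in a few lines of Python (illustrative only; the misclassification pattern is made up):

import numpy as np

def update_weights(weights, misclassified):
    # weights: array of length N summing to 1; misclassified: boolean array
    eps = np.sum(weights[misclassified])           # error = total weight of the misclassified points
    alpha = (1 - eps) / eps                        # scaling factor for the incorrect examples
    new_w = weights.copy()
    new_w[misclassified] *= alpha                  # boost the weight of the misclassified points
    return new_w / new_w.sum()                     # renormalize so the weights sum to 1

N = 5
w = np.full(N, 1 / N)                              # initially all weights are 1/N
miss = np.array([False, True, False, False, True]) # hypothetical classifier mistakes
print(update_weights(w, miss))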
Bagging:
The simplest method of combining classifiers is known as bagging, which stands for bootstrap aggregating.
A bootstrap sample is a sample taken from the original dataset with replacement, so that we may get some data
several times and others not at all. A bootstrap dataset is a random sample of the original dataset, created by
sampling with replacement. This means that some samples from the original dataset can appear multiple times
in the bootstrap sample, while others might not appear at all. Having taken a set of bootstrap samples, the
bagging method simply requires that we fit a model to each dataset, and then combine them by taking the
output to be the majority vote of all the classifiers.
For example, a particular sample (say Sample 3) may appear twice in a bootstrap dataset, while another (say Sample 4) does not appear at all.
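A hand-rolled sketch of bagging (assuming scikit-learn and its built-in iris data): draw bootstrap samples, fit one decision tree per sample, and combine the trees by majority vote.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Fit one tree per bootstrap sample of the training data
trees = []
for _ in range(10):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine the trees by majority vote over their predictions
votes = np.array([t.predict(X[:5]) for t in trees])        # predictions for 5 example points
majority = [np.bincount(col).argmax() for col in votes.T]
print(majority, y[:5])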
Boosting is exhaustive: it searches over the whole set of features at each stage, and each stage depends on
the previous one. Boosting has to run sequentially, and the individual steps can be expensive to run. By way
of contrast, the parallelism of the random forest and the fact that it only searches over a fairly small set of
features at each stage speed the algorithm up a lot. Since the algorithm only searches a small subset of the data
at each stage, it cannot be expected to be as good as boosting for the same number of trees. However, since
the trees are cheaper to train, we can make more of them in the same computational time, and often the results
are amazingly good even on very large and complicated datasets. The most amazing thing about random
forests is that they seem to deal very well with really big datasets. It is fairly clear that they should do well
computationally, since both the reduced number of features to search over and the ability to parallelize should speed the training up.
If each of the T classifiers in an ensemble is independently correct with probability p, then the probability that a majority of them is correct is

   Σ (k = ⌊T/2⌋ + 1 to T) C(T, k) p^k (1 − p)^(T − k)

where C(T, k) is the binomial coefficient. This sum computes the probability of having more than half of the trials result in success. If p > 0.5, then this sum approaches 1 as T → ∞. This is a lot of the power behind ensemble methods: even if each classifier only gets
about half the answers right, if we use a decent number of classifiers (maybe 100), then the probability of the
ensemble being correct gets close to 1. In fact, even with less than 50% chance of success for each individual
classifier, the ensemble can often do very well indeed. For regression problems, rather than taking the majority
vote, it is common to take the mean of the outputs. However, the mean is heavily affected by outliers, with the
result that the median is a more common average to use. It is the use of the median that produces the bragging
algorithm, which is meant to imply ‘robust bagging’. A further refinement is to weight each classifier's output according to how reliable it is for the current input; there is an algorithm that does precisely this, known as the mixture of experts. Inputs are presented to the network, and each individual classifier makes an assessment.
These outputs from the classifiers are then weighted by the relevant gate, which produces a weight w using the
current inputs, and this is propagated further up the hierarchy.
1. Averages:
The mean is the most commonly used average of a set of data, and is the value that is found by adding up
all the points in the dataset and dividing by the number of points. There are two other averages that are
used: the median and the mode. The median is the middle value, so the most common way to find it is to
sort the dataset according to size and then find the point that is in the middle. If there is an even number of
data points then there is no exact middle, so take the value halfway between the two points that are closest
to the middle. The mode is the most common value, so it just requires counting how many times each
element appears and picking the most frequent one.
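For instance (a made-up list of numbers):

import statistics

data = [2, 3, 3, 5, 7, 10]
print(statistics.mean(data))     # 5: the sum 30 divided by the 6 points
print(statistics.median(data))   # 4.0: halfway between the two middle values 3 and 5
print(statistics.mode(data))     # 3: the most frequently occurring value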
The square root of the variance is known as the standard deviation. The variance looks at the variation in
one variable compared to its mean. This can be generalized to look at how two variables vary together,
which is known as the covariance. It is a measure of how dependent the two variables are. It is computed by:
   cov({x_i}, {y_i}) = E[(x_i − μ)(y_i − ν)] = (1/N) Σ_i (x_i − μ)(y_i − ν)

where μ is the mean of the set {x_i} and ν is the mean of the set {y_i}.
If two variables are independent, then the covariance is 0 (the variables are then known as uncorrelated). If
they both increase and decrease at the same time, then the covariance is positive, and if one goes up while
the other goes down, then the covariance is negative. The covariance can be used to look at the correlation
between all pairs of variables within a set of data. We need to compute the covariance of each pair, and
these are then put together into what is imaginatively known as the covariance matrix.
This can be used to construct a distance measure called the Mahalanobis distance after the person who
described it in 1936, and is written as:

   D_M(x) = sqrt( (x − μ)^T Σ^(−1) (x − μ) )

where μ is the mean and Σ^(−1) is the inverse of the covariance matrix. If the covariance matrix is set to the identity matrix, then the Mahalanobis distance reduces to the Euclidean distance. Computing the Mahalanobis distance requires heavier computational machinery than the Euclidean distance, since the covariance matrix has to be computed and inverted.
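Both the covariance matrix and the Mahalanobis distance are easy to compute with numpy; the sketch below uses made-up data for two correlated variables.

import numpy as np

# Made-up observations of two correlated variables, one row per observation
data = np.array([[1.0, 2.1], [2.0, 3.2], [3.0, 6.5], [4.0, 7.0], [5.0, 9.8]])
cov = np.cov(data, rowvar=False)      # 2 x 2 covariance matrix of the two variables
cov_inv = np.linalg.inv(cov)

x = data[0]
mu = data.mean(axis=0)
# Mahalanobis distance of x from the mean: sqrt((x - mu)^T Sigma^-1 (x - mu))
print(np.sqrt((x - mu) @ cov_inv @ (x - mu)))

# With the identity matrix in place of Sigma^-1 it reduces to the Euclidean distance
print(np.sqrt((x - mu) @ np.eye(2) @ (x - mu)), np.linalg.norm(x - mu))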
The Gaussian distribution turns up in many problems because of the Central Limit Theorem, which says
that lots of small random numbers will add up to something Gaussian. In higher dimensions it looks like:
   p(x) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − μ)^T Σ^(−1) (x − μ) )

Where, Σ is the n × n covariance matrix (with |Σ| being its determinant and Σ^(−1) being its inverse) and μ is the mean. Figure 2.15 shows the appearance in two dimensions of three different cases:
When the covariance matrix is the identity; when there are only numbers on the leading diagonal of the
matrix; and the general case. The first case is known as a spherical covariance matrix, and has only 1
parameter. The second and third cases define ellipses in two dimensions, either aligned with the axes (with
n parameters) or, more generally, with n² parameters.
The mixture model is then a weighted sum of M Gaussians,

   p(x) = α_1 φ(x; μ_1, Σ_1) + α_2 φ(x; μ_2, Σ_2) + ... + α_M φ(x; μ_M, Σ_M)

where φ(x; μ_m, Σ_m) is a Gaussian function with mean μ_m and covariance matrix Σ_m, and the α_m are weights with the constraint that α_1 + α_2 + ... + α_M = 1.
The problem is how to choose the weights α_m. The common approach is to aim for the maximum likelihood
solution. The likelihood is the conditional probability of the data given the model, and the maximum
likelihood solution varies the model to maximize this conditional probability. In fact, it is common to
compute the log likelihood and then to maximize that; it is guaranteed to be negative, since probabilities
are all less than 1, and the logarithm spreads out the values, making the optimization more effective. The
algorithm that is used is an example of a very general one known as the expectation-maximization (or
more compactly, EM) algorithm.
Where, D denotes the data. Note that since we have set f = 1 this means that we are choosing Gaussian
two.
Computing the value of this expectation is known as the E-step. Then this estimate of the expectation
is maximized over the model parameters (the parameters of the two Gaussians and the mixing
parameter), the M-step. This requires differentiating the expectation with respect to each of the model
parameters. These two steps are simply iterated until the algorithm converges.
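A sketch of fitting a two-component Gaussian mixture by EM, using scikit-learn's GaussianMixture (the synthetic data and its parameters are made up; the E-step and M-step are iterated internally by fit):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (made-up means and spreads)
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.5, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2)    # EM is iterated until the log likelihood converges
gmm.fit(data)

print(gmm.means_.ravel())                # estimated component means (near 0 and 5)
print(gmm.weights_)                      # estimated mixing weights alpha_m (near 0.4 and 0.6)
print(gmm.lower_bound_)                  # final per-sample log-likelihood lower bound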
There are two such information criteria that are commonly used to identify how well we can expect the
trained model to perform.
• Akaike Information Criterion: AIC = ln(L) − k
• Bayesian Information Criterion: BIC = 2 ln(L) − k ln(N)
In these equations, k is the number of parameters in the model, N is the number of training examples, and
L is the best (largest) likelihood of the model. In both cases, based on the way that they are written here,
the model with the largest value is taken. Both of the measures will favor simple models, which is a form
of Occam’s razor.
The way to interpret this is that when k is small, so that there are few neighbors considered, the model has
flexibility and can represent the underlying model well, but that it makes mistakes (has high variance)
because there is relatively little data. As ‘k’ increases, the variance decreases, but at the cost of less
flexibility and so more bias.
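This trade-off is easy to see empirically; the sketch below (assuming scikit-learn and its iris data) trains k-nearest-neighbour classifiers for a few values of k and compares training and test accuracy.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 25, 75):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Small k: flexible, low bias, high variance; large k: smoother, more bias, less variance
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))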
The first thing to be done is to label the leaf found as a potential nearest neighbor and compute the distance between the test point and this point, since any better candidate has to be closer than this. Now, check any
other boxes that could contain something closer. From figure 7.8 it is observed that point (3, 7) is
closer, and that is the label of the leaf for the sibling box to the one that was returned, so the algorithm
also needs to check the sibling box. However, suppose that (4.5, 2) is used as the test point. In that
case the sibling is too far away, but another point (6, 1) is closer. So just checking the sibling is not enough: the siblings of the parent node must also be checked, together with their descendants.
Distance Measures:
The most common measure of distance between points is the Euclidean distance. It is a special case of a more general family of metrics,

   d_k(x, y) = ( Σ_i |x_i − y_i|^k )^(1/k)

If k = 1 then we get the city-block distance and k = 2 gives the Euclidean distance. The Euclidean
metric is written as the L2 norm and the city-block distance as the L1 norm. These can define different
averages of a set of numbers. If we define the average as the point that minimizes the sum of the
distance to every data point, then it turns out that the mean minimizes the Euclidean distance (the sum-
of-squares distance), and the median minimizes the L1 metric. A common invariant metric in use for
images is the tangent distance, which is an approximation to the Taylor expansion in first derivatives,
and works very well for small rotations and scalings.
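A small sketch of these metrics for illustration:

import numpy as np

def minkowski(x, y, k):
    # d_k(x, y) = (sum_i |x_i - y_i|^k)^(1/k)
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** k) ** (1 / k)

x, y = [0, 0], [3, 4]
print(minkowski(x, y, 1))   # 7.0  (city-block / L1)
print(minkowski(x, y, 2))   # 5.0  (Euclidean / L2)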
Unsupervised Learning:
Many of the learning algorithms that we have seen till now have made use of a training set that
consists of a collection of labelled target data. Targets are obviously useful, since they enable us to
show the algorithm the correct answer to possible inputs, but in many circumstances they are difficult
to obtain—they could, for instance, involve somebody labelling each instance by hand.
In addition, it doesn’t seem to be very biologically plausible: most of the time when we are learning,
we don’t get told exactly what the right answer should be. In this chapter we will consider exactly the
opposite case, where there is no information about the correct outputs available at all, and the
algorithm is left to spot some similarity between different inputs for itself. Unsupervised learning is a
conceptually different problem to supervised learning. If the algorithm can exploit similarities
between inputs in order to cluster inputs that are similar together, this might perform classification
automatically. So the aim of unsupervised learning is to find clusters of similar inputs in the data
without being explicitly told that these data points belong to one class and those to a different class.
Instead, the algorithm has to discover the similarities for itself. The supervised learning algorithms
that we have discussed so far have aimed to minimize some external error criterion—mostly the sum-
of-squares error—based on the difference between the targets and the outputs.
Calculating and minimizing this error was possible because we had target data to calculate it from,
which is not true for unsupervised learning. If two inputs are close together then it means that their
vectors are similar, and so the distance between them is small (distance measures were discussed in