Exercise Solutions

1. Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?

(a) Compute the Gini index for the overall collection of training examples.
Answer: Gini = 1 − 2 × 0.5² = 0.5.
(b) Compute the Gini index for the Customer ID attribute.
Answer: The Gini for each Customer ID value is 0. Therefore, the overall Gini for Customer ID is 0.
(c) Compute the Gini index for the Gender attribute.
Answer: The Gini for Male is 1 − 0.4² − 0.6² = 0.48. The Gini for Female is also 1 − 0.4² − 0.6² = 0.48. Therefore, the overall Gini for Gender is 0.5 × 0.48 + 0.5 × 0.48 = 0.48.
(d) Compute the Gini index for the Car Type attribute using multiway split.
Answer: The Gini for Family car is 0.375, Sports car is 0, and Luxury car is 0.2188. The overall Gini is 0.1625.
(e) Compute the Gini index for the Shirt Size attribute using multiway split.
Answer: The Gini for Small shirt size is 0.48, Medium shirt size is 0.4898, Large shirt size is 0.5, and Extra Large shirt size is 0.5. The overall Gini for the Shirt Size attribute is 0.4914.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
Answer: Car Type, because it has the lowest Gini among the three attributes.
(g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
Answer: The attribute has no predictive power, since new customers are assigned to new Customer IDs.

(c) For a3, which is a continuous attribute, compute the information gain for every possible split.
(d) What is the best split (among a1, a2, and a3) according to the information gain?
Answer: According to information gain, a1 produces the best split.
(e) What is the best split (between a1 and a2) according to the classification error rate?
Answer: For attribute a1: error rate = 2/9. For attribute a2: error rate = 4/9. Therefore, according to error rate, a1 produces the best split.
(f) What is the best split (between a1 and a2) according to the Gini index?
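The Gini calculations above all follow the same two-step pattern: compute the Gini of each partition and take the size-weighted average. A minimal Python sketch; the class counts below are chosen only to reproduce the Gini values quoted above (for example, a 4/6 class split within each gender) and are not copied from the textbook's table:

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(partition):
    """Weighted Gini of a split; partition is a list of per-child class-count lists."""
    total = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / total * gini(counts) for counts in partition)

print(gini([10, 10]))                   # overall collection: 0.5
print(weighted_gini([[4, 6], [6, 4]]))  # a Gender-style 50/50 split with 4/6 children: 0.48
```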
4. Show that the entropy of a node never increases after splitting it into
smaller successor nodes.
6. Consider splitting a parent node P into two child nodes, C1 and C2, using some attribute test condition. The composition of labeled training instances at every node is summarized in the Table below.
(a) Calculate the Gini index and misclassification error rate of the parent node P.

(a) Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
(c) How many instances are misclassified by the resulting decision tree?
Answer: 20 instances are misclassified. (The error rate is 20/100.)
(d) Repeat parts (a), (b), and (c) using C as the splitting attribute.
Answer: For the C = T child node, the error rate before splitting is:

(a) Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?
Answer: Splitting Attribute at Level 1. To determine the test condition at the root node, we need to compute the error rates for attributes X, Y, and Z. For attribute X, the corresponding counts are:
Therefore, the error rate using attribute Y is (40 + 40)/200 = 0.4. For attribute Z, the corresponding counts are:
Splitting Attribute at Level 2. After splitting on attribute Z, the subsequent test condition may involve either attribute X or Y. This depends on the training examples distributed to the Z = 0 and Z = 1 child nodes. For Z = 0, the corresponding counts for attributes X and Y are the same, as shown in the table below. For Z = 1, although the counts are somewhat different, their error rates remain the same, (15 + 15)/100 = 0.3. The corresponding two-level decision tree is shown below.
(b) Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?
Answer: After choosing attribute X to be the first splitting attribute, the subsequent test condition may involve either attribute Y or attribute Z. For X = 0, the corresponding counts for attributes Y and Z are shown in the table below. The error rates using attributes Y and Z are 10/120 and 30/120, respectively. Since attribute Y leads to a smaller error rate, it provides a better split. For X = 1, the error rates using attributes Y and Z are 10/80 and 30/80, respectively. Since attribute Y leads to a smaller error rate, it provides a better split. The corresponding two-level decision tree is shown below. The overall error rate of the induced tree is (10 + 10)/200 = 0.1.
(c) Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.
Answer: From the preceding results, the error rate for part (a) is significantly larger than that for part (b). This example shows that a greedy heuristic does not always produce an optimal solution.
(a) Compute the generalization error rate of the tree using the optimistic approach.
Answer: According to the optimistic approach, the generalization error rate is 3/10 = 0.3.
(b) Compute the generalization error rate of the tree using the pessimistic approach. (For simplicity, use the strategy of adding a factor of 0.5 to each leaf node.)
Answer: According to the pessimistic approach, the generalization error rate is (3 + 4 × 0.5)/10 = 0.5.
(c) Compute the generalization error rate of the tree using the validation set shown above. This approach is known as reduced error pruning.
Answer: According to the reduced error pruning approach, the generalization error rate is 4/5 = 0.8.

10. Consider the decision trees shown in Figure 3.3. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3. Compute the total description length of each decision tree according to the minimum description length principle.

11. This exercise, inspired by the discussions in [6], highlights one of the known limitations of the leave-one-out model evaluation procedure. Let us consider a data set containing 50 positive and 50 negative instances, where the attributes are purely random and contain no information about the class labels. Hence, the generalization error rate of any classification model learned over this data is expected to be 0.5. Let us consider a classifier that assigns the majority class label of training instances (ties resolved by using the positive label as the default class) to any test instance, irrespective of its attribute values. We call this approach the majority inducer classifier. Determine the error rate of this classifier using the following methods.
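The leave-one-out behavior that Exercise 11 is driving at can be checked directly: whenever one instance is held out of a balanced 50/50 data set, the opposite class becomes the majority, so the majority inducer misclassifies every held-out instance. A minimal sketch (labels only, since this classifier ignores the attributes):

```python
labels = ["+"] * 50 + ["-"] * 50

errors = 0
for i, true_label in enumerate(labels):
    training = labels[:i] + labels[i + 1:]
    pos, neg = training.count("+"), training.count("-")
    # Majority inducer: predict the majority class, ties broken in favor of "+".
    prediction = "+" if pos >= neg else "-"
    errors += (prediction != true_label)

print(errors / len(labels))  # 1.0, even though the expected generalization error is 0.5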
(a) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and P(C|−).
Answer: P(A = 1|−) = 2/5 = 0.4, P(B = 1|−) = 2/5 = 0.4, P(C = 1|−) = 1, P(A = 0|−) = 3/5 = 0.6, P(B = 0|−) = 3/5 = 0.6, P(C = 0|−) = 0; P(A = 1|+) = 3/5 = 0.6, P(B = 1|+) = 1/5 = 0.2, P(C = 1|+) = 2/5 = 0.4, P(A = 0|+) = 2/5 = 0.4, P(B = 0|+) = 4/5 = 0.8, P(C = 0|+) = 3/5 = 0.6.
(b) Use the estimate of conditional probabilities given in the previous question to predict the class label for a test sample (A = 0, B = 1, C = 0) using the naive Bayes approach.
Answer:
(c) Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.
Answer:
P(A = 0|+) = (2 + 2)/(5 + 4) = 4/9,
P(A = 0|−) = (3 + 2)/(5 + 4) = 5/9,
P(B = 1|+) = (1 + 2)/(5 + 4) = 3/9,
P(B = 1|−) = (2 + 2)/(5 + 4) = 4/9,
P(C = 0|+) = (3 + 2)/(5 + 4) = 5/9,
P(C = 0|−) = (0 + 2)/(5 + 4) = 2/9.
(d) Repeat part (b) using the conditional probabilities given in part (c).
Answer: Let P(A = 0, B = 1, C = 0) = K.

(a) Estimate the conditional probabilities for P(A = 1|+), P(B = 1|+), P(C = 1|+), P(A = 1|−), P(B = 1|−), and P(C = 1|−) using the same approach as in the previous problem.
Answer: P(A = 1|+) = 0.6, P(B = 1|+) = 0.4, P(C = 1|+) = 0.8, P(A = 1|−) = 0.4, P(B = 1|−) = 0.4, and P(C = 1|−) = 0.2.
(b) Use the conditional probabilities in part (a) to predict the class label for a test sample (A = 1, B = 1, C = 1) using the naive Bayes approach.
Answer:
Let R: (A = 1, B = 1, C = 1) be the test record. To determine its class, we need to compute P(+|R) and P(−|R). Using Bayes theorem, P(+|R) = P(R|+)P(+)/P(R) and P(−|R) = P(R|−)P(−)/P(R). Since P(+) = P(−) = 0.5 and P(R) is constant, R can be classified by comparing P(+|R) and P(−|R).
For this question,
P(R|+) = P(A = 1|+) × P(B = 1|+) × P(C = 1|+) = 0.192
P(R|−) = P(A = 1|−) × P(B = 1|−) × P(C = 1|−) = 0.032
Since P(R|+) is larger, the record is assigned to the (+) class.
(c) Compare P(A = 1), P(B = 1), and P(A = 1, B = 1). State the relationship between A and B.
Answer: P(A = 1) = 0.5, P(B = 1) = 0.4 and P(A = 1, B = 1) = P(A) × P(B) = 0.2. Therefore, A and B are independent.
(d) Repeat the analysis in part (c) using P(A = 1), P(B = 0), and P(A = 1, B = 0).
Answer: P(A = 1) = 0.5, P(B = 0) = 0.6, and P(A = 1, B = 0) = P(A = 1) × P(B = 0) = 0.3. A and B are still independent.
(e) Compare P(A = 1, B = 1|Class = +) against P(A = 1|Class = +) and P(B = 1|Class = +). Are the variables conditionally independent given the class?
Answer: Compare P(A = 1, B = 1|+) = 0.2 against P(A = 1|+) = 0.6 and P(B = 1|+) = 0.4. Since the product of P(A = 1|+) and P(B = 1|+) is not the same as P(A = 1, B = 1|+), A and B are not conditionally independent given the class.
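The class assignment in part (b) above is just a comparison of two products of the estimated conditional probabilities; a short sketch using the values quoted in part (a) of the second exercise:

```python
# Conditional probability estimates from part (a): P(attr = 1 | class).
p_pos = {"A": 0.6, "B": 0.4, "C": 0.8}
p_neg = {"A": 0.4, "B": 0.4, "C": 0.2}

# Test record R: A = 1, B = 1, C = 1, with equal priors P(+) = P(-) = 0.5.
likelihood_pos = p_pos["A"] * p_pos["B"] * p_pos["C"]   # 0.192
likelihood_neg = p_neg["A"] * p_neg["B"] * p_neg["C"]   # 0.032

print("+" if likelihood_pos > likelihood_neg else "-")  # "+"
```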
10. Figure 4.3 illustrates the Bayesian belief network for the data set
shown in Table 4.3. (Assume that all the attributes are binary).
(a) Draw the probability table for each node in the network.
P(Mileage=Hi) = 0.5
P(Air Cond=Working) = 0.625
P(Engine=Good|Mileage=Hi) = 0.5
P(Engine=Good|Mileage=Lo) = 0.75
P(Value=High|Engine=Good, Air Cond=Working) = 0.750
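As a quick consistency check on these tables, the marginal probability of a good engine follows from the law of total probability; a minimal sketch using only the numbers listed above:

```python
p_mileage_hi = 0.5
p_engine_good_given_hi = 0.5
p_engine_good_given_lo = 0.75

# P(Engine = Good) = sum over Mileage of P(Engine = Good | Mileage) * P(Mileage)
p_engine_good = (p_engine_good_given_hi * p_mileage_hi
                 + p_engine_good_given_lo * (1 - p_mileage_hi))
print(p_engine_good)  # 0.625
```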
14. For each of the Boolean functions given below, state whether the problem is linearly separable.
(a) A AND B AND C
Answer: Yes
(b) NOT A AND B
Answer: Yes
(c) (A OR B) AND (A OR C)
Answer: Yes
(d) (A XOR B) AND (A OR B)
Answer: No

15. (a) Demonstrate how the perceptron model can be used to represent the AND and OR functions between a pair of Boolean variables.
Answer: Let x1 and x2 be a pair of Boolean variables and y be the output. For the AND function, a possible perceptron model is:
(b) Comment on the disadvantage of using linear functions as activation functions for multilayer neural networks.
Answer: Multilayer neural networks are useful for modeling nonlinear relationships between the input and output attributes. However, if linear functions are used as activation functions (instead of sigmoid or hyperbolic tangent functions), the output is still a linear combination of its input attributes. Such a network is just as expressive as a perceptron.

M1 is better, since its area under the ROC curve is larger than the area under the ROC curve for M2.
(b) For model M1, suppose you choose the cutoff threshold to be t = 0.5. In other words, any test instance whose posterior probability is greater than t will be classified as a positive example. Compute the precision, recall, and F-measure for the model at this threshold value. When t = 0.5, the confusion matrix for M1 is shown below.
When t = 0.1: Precision = 5/9 = 55.6%, Recall = 5/5 = 100%, and F-measure = (2 × 0.556 × 1)/(0.556 + 1) = 0.715. According to the F-measure, t = 0.1 is better than t = 0.5.
When t = 0.1, FPR = 0.8 and TPR = 1. On the other hand, when t = 0.5, FPR = 0.2 and TPR = 0.6. Since (0.2, 0.6) is closer to the point (0, 1), we favor t = 0.5. This result is inconsistent with the result obtained using the F-measure. We can also show this by computing the area under the ROC curve:
For t = 0.5, area = 0.6 × (1 − 0.2) = 0.6 × 0.8 = 0.48.
For t = 0.1, area = 1 × (1 − 0.8) = 1 × 0.2 = 0.2.
Since the area for t = 0.5 is larger than the area for t = 0.1, we prefer t = 0.5.

17. Following is a data set that contains two attributes, X and Y, and two class labels, “+” and “−”. Each attribute can take three different values: 0, 1, or 2.
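The precision, recall, F-measure, and "area" comparisons in the threshold-selection answer above reduce to a few lines of arithmetic. A sketch; the confusion-matrix counts are inferred from the quoted precision/recall and TPR/FPR values, assuming 5 positive and 5 negative test instances, so treat them as illustrative:

```python
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

print(prf(tp=5, fp=4, fn=0))   # t = 0.1 -> (0.556, 1.0, 0.715)
print(prf(tp=3, fp=1, fn=2))   # t = 0.5 -> (0.75, 0.6, 0.667)

# The solution's simplified "area" comparison, TPR * (1 - FPR), for each threshold point.
print(0.6 * (1 - 0.2))  # t = 0.5 -> 0.48
print(1.0 * (1 - 0.8))  # t = 0.1 -> 0.20
```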
(b) Use the results in part (a) to compute the confidence for the association rules {b, d} → {e} and {e} → {b, d}. Is confidence a symmetric measure?
Answer:
(c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise.)
Answer:

Answer: (Beer, Cookies) or (Bread, Butter).

(c) Repeat the analysis in part (b) assuming that the rules have identical support. Which rule has the highest confidence?
Answer: Considering s(p ∪ q) = s(p ∪ q ∪ r) but s(p) ≥ s(p ∪ r), we have c3 ≥ (c1 = c2). Either all rules have the same confidence or c3 has the highest confidence.
(d) Transitivity: Suppose the confidence of the rules A → B and B → C are larger than some threshold, minconf. Is it possible that A → C has a confidence less than minconf?
Answer: Yes, it depends on the support of items A, B, and C. For example:
s(A,B) = 60%, s(A,C) = 20%, s(B,C) = 50%, s(A) = 90%, s(B) = 70%, s(C) = 60%.
Let minconf = 50%. Then:
c(A → B) = 66% > minconf
c(B → C) = 71% > minconf
But c(A → C) = 22% < minconf
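The transitivity counterexample above can be verified mechanically from the listed supports; a short sketch:

```python
s = {"A": 0.9, "B": 0.7, "C": 0.6,
     "AB": 0.6, "AC": 0.2, "BC": 0.5}

def conf(joint, antecedent):
    """Confidence of the rule antecedent -> consequent, given itemset supports."""
    return s[joint] / s[antecedent]

minconf = 0.5
print(conf("AB", "A"))  # 0.667 > minconf
print(conf("BC", "B"))  # 0.714 > minconf
print(conf("AC", "A"))  # 0.222 < minconf
```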
4. For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone).
(a) A characteristic rule is a rule of the form {p} → {q1, q2, . . . , qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset:
(b) A discriminant rule is a rule of the form {p1, p2, . . . , pn} → {q}, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let η be the minimum confidence of all discriminant rules generated from a given itemset:
(c) Repeat the analysis in parts (a) and (b) by replacing the min function with a max function.

7. Show that if a candidate k-itemset X has a subset of size less than k − 1 that is infrequent, then at least one of the (k − 1)-size subsets of X is necessarily infrequent.
14. Given the rankings you had obtained in Exercise 13, compute the
correlation between the rankings of confidence and the other five
measures. Which measure is most highly correlated with confidence?
Which measure is least correlated with confidence?
Answer:
Correlation(Confidence, Support) = 0.97.
Correlation(Confidence, Interest) = 1.
Correlation(Confidence, IS) = 1.
Correlation(Confidence, Klosgen) = 0.7.
Correlation(Confidence, Odds Ratio) = −0.606.
Interest and IS are the most highly correlated with confidence, while odds ratio is the least correlated.

16. (a) Prove that the ϕ coefficient is equal to 1 if and only if f11 = f1+ = f+1.
15. Answer the following questions using the data sets shown in Figure
5.6. Note that each data set contains 1000 items and 10,000
transactions. Dark cells indicate the presence of items and white cells
indicate the absence of items. We will apply the Apriori algorithm to
extract frequent itemsets with minsup = 10% (i.e., itemsets must be
contained in at least 1000 transactions).
(a) Which data set(s) will produce the largest number of frequent
itemsets?
Answer: Data set (e) because it has to generate the longest frequent
itemset along with its subsets.
(b) Which data set(s) will produce the fewest number of frequent
itemsets?
Answer: Data set (d) which does not produce any frequent itemsets
at 10% support threshold.
(c) Which data set(s) will produce the longest frequent itemset?
Answer: Data set (e).
(d) Which data set(s) will produce frequent itemsets with highest
maximum support?
Answer: Data set (b).
(e) Which data set(s) will produce frequent itemsets containing items
with wide-varying support levels (i.e., items with mixed support, ranging
from less than 20% to more than 70%).
Answer: Data set (e).
(c) How does M behave when P(A) is increased while P(A,B) and P(B) remain unchanged?
Answer: M decreases. Proof: The only term in M that changes is P(B|A) = P(A,B)/P(A), which decreases as P(A) is increased. This decreases the numerator and thus decreases M.
(d) How does M behave when P(B) is increased while P(A,B) and P(A) remain unchanged?
Answer: M decreases when P(B) increases. Proof: Let a = P(A,B)/P(A) = P(B|A) and y = P(B). Since P(A,B) and P(A) remain unchanged, a = P(B|A) is a constant. We treat the measure M as a function f(y) = (a − y)/(1 − y). The derivative of f with respect to y is df(y)/dy = (a − 1)/(1 − y)². Since a = P(B|A) ≤ 1 and y = P(B) ≤ 1, the derivative is negative, except in the special case where P(B|A) = 1 or P(B) = 1, when the derivative is 0. Thus, M decreases as y = P(B) increases.
(f) What is the value of the measure when A and B are statistically independent?
Answer: The value is 0. Proof: The numerator of M, P(B|A) − P(B), is 0, since the independence of A and B implies P(B|A) = P(B).
(h) Does the measure remain invariant under row or column scaling operations?
Answer: No. It is straightforward to find a counterexample.
(i) How does the measure behave under the inversion operation?
Answer: Inversion is equivalent to switching the roles of A and B. Proof: Computing the value of M under inversion (call it M′) shows that, for the measure M, inversion is equivalent to switching the order (role) of A and B.

(d) Write a simplified expression for the value of each measure shown in Tables 6.11 and 6.12 when the variables are statistically independent.
Answer:
18. Suppose we have market basket data consisting of 100 transactions
and 20 items. If the support for item a is 25%, the support for item b is
90% and the support for itemset {a, b} is 20%. Let the support and
confidence thresholds be 10% and 60%, respectively.
(a) Compute the confidence of the association rule {a} → {b}. Is the rule interesting according to the confidence measure?
Answer: Confidence is 0.2/0.25 = 80%. The rule is interesting because it exceeds the confidence threshold.
(b) Compute the interest measure for the association pattern {a, b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.
Answer: The interest measure is 0.2/(0.25 × 0.9) = 0.889. The items are negatively correlated according to the interest measure.
(c) What conclusions can you draw from the results of parts (a) and (b)?
Answer: High-confidence rules may not be interesting.

20. Consider the contingency tables shown in Table 5.6.
(a) Compute the odds ratios for both tables.
Answer: For Table 1, odds ratio = 1.4938.
For Table 2, the odds ratios are 0.8333 and 0.98.
(b) Compute the ϕ-coefficient for both tables.
Answer: For table 1, ϕ = 0.098.
For Table 2, the ϕ-coefficients are -0.0233 and -0.0047.
(c) Compute the interest factor for both tables.
Answer: For Table 1, I = 1.0784.
For Table 2, the interest factors are 0.88 and 0.9971.
For each of the measures given above, describe how the direction of
association changes when data is pooled together instead of being
stratified.
Answer: The direction of association changes sign (from negatively to positively correlated) when the data is pooled together.
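The odds ratio, ϕ-coefficient, and interest factor used in Exercise 20 are all functions of the four cells of a 2×2 contingency table. A generic sketch (the counts below are placeholders, since Table 5.6 itself is not reproduced here):

```python
import math

def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01          # row total for A=1, column total for B=1
    odds_ratio = (f11 * f00) / (f10 * f01)
    phi = (n * f11 - f1p * fp1) / math.sqrt(f1p * (n - f1p) * fp1 * (n - fp1))
    interest = n * f11 / (f1p * fp1)
    return odds_ratio, phi, interest

# Hypothetical counts, just to show the calculation.
print(measures(f11=60, f10=40, f01=40, f00=60))
```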
(a) For table I, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.
Answer: s(A) = 0.1, s(B) = 0.1, s(A,B) = 0.09. I(A,B) = 9, ϕ(A,B) = 0.89. c(A → B) = 0.9, c(B → A) = 0.9.
(b) For table II, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.
Answer: s(A) = 0.9, s(B) = 0.9, s(A,B) = 0.89. I(A,B) = 1.09, ϕ(A,B) = 0.89. c(A → B) = 0.98, c(B → A) = 0.98.
(c) What conclusions can you draw from the results of (a) and (b)?
Answer: Interest, support, and confidence are non-invariant, while the ϕ-coefficient is invariant under the inversion operation. This is because the ϕ-coefficient takes into account the absence as well as the presence of an item in a transaction.

19. Table 5.5 shows a 2×2×2 contingency table for the binary variables A and B at different values of the control variable C.
iii. When C = 0 or C = 1, ϕ = 0.
(b) What conclusions can you draw from the above result?
Answer: The result shows that some interesting relationships may disappear if the confounding factors are not taken into account.

21. Consider the relationship between customers who buy high-definition televisions and exercise machines as shown in Tables 5.17 and 5.18.

CHAPTER – 7 (CLUSTER ANALYSIS)

1. Consider a data set consisting of 2^20 data vectors, where each vector has 32 components and each component is a 4-byte value. Suppose that vector quantization is used for compression and that 2^16 prototype vectors are used. How many bytes of storage does that data set take before and after compression and what is the compression ratio?
Answer:

2. Find all well-separated clusters in the set of points shown in Figure 7.35.
Answer:
Observations from Figure 7.1:
1. Cluster 1: The leftmost circle contains four closely grouped points.
2. Cluster 2: The middle group consists of a set of smaller, individual
clusters, each containing three or four points.
3. Cluster 3: The rightmost circle contains another group of four
points.
Solution:
The clusters are defined by the points enclosed within the circles. In this case,
the well-separated clusters are:
• Cluster 1: Points in the leftmost circle.
• Cluster 2: Points in each middle circle, potentially forming sub-
clusters within the larger middle group.
• Cluster 3: Points in the rightmost circle.
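For Exercise 1 above (vector quantization), the storage before and after compression follows directly from the stated sizes. A sketch, assuming each compressed vector is represented by a 16-bit index into the prototype table:

```python
n_vectors = 2 ** 20        # number of data vectors
dims, bytes_per_value = 32, 4
n_prototypes = 2 ** 16     # prototype vectors, addressable with a 16-bit index

before = n_vectors * dims * bytes_per_value            # 2^27 bytes
after = (n_vectors * 2                                  # 16-bit index per vector
         + n_prototypes * dims * bytes_per_value)       # plus the prototype table
print(before, after, before / after)                    # ratio is roughly 12.8
```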
3. Many partitional clustering algorithms that automatically determine the number of clusters claim that this is an advantage. List two situations in which this is not the case.
(a) When there is hierarchical structure in the data. Most algorithms that automatically determine the number of clusters are partitional, and thus, ignore the possibility of subclusters.
(b) When clustering for utility. If a certain reduction in data size is needed, then it is necessary to specify how many clusters (cluster centroids) are produced.

(a) Plot the probability of obtaining one point from each cluster in a sample of size K for values of K between 2 and 100.
P = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids); for example, for K = 4 this gives 4!/4^4 = 0.0938. The solution is shown in Figure 4. Note that the probability is essentially 0 by the time K = 10.

5. Identify the clusters in Figure 7.36 using the center-, contiguity-, and density-based definitions. Also indicate the number of clusters for each case and give a brief indication of your reasoning. Note that darkness or the number of dots indicates density. If it helps, assume center-based means K-means, contiguity-based means single link, and density-based means DBSCAN.
Answer:
(a) Center-based: 2 clusters. The rectangular region will be split in half. Note that the noise is included in the two clusters. Contiguity-based: 1 cluster, because the two circular regions will be joined by noise. Density-based: 2 clusters, one for each circular region. Noise will be eliminated.
(b) Center-based: 1 cluster that includes both rings. Contiguity-based: 2 clusters, one for each ring. Density-based: 2 clusters, one for each ring.
(c) Center-based: 3 clusters, one for each triangular region. One cluster is also an acceptable answer. Contiguity-based: 1 cluster. The three triangular regions will be joined together because they touch. Density-based: 3 clusters, one for each triangular region. Even though the three triangles touch, the density in the region where they touch is lower than throughout the interior of the triangles.
(d) Center-based: 2 clusters. The two groups of lines will be split in two. Contiguity-based: 5 clusters. Each set of lines that intertwines becomes a cluster. Density-based: 2 clusters. The two groups of lines define two regions of high density separated by a region of low density.

6. For the following sets of two-dimensional points, (1) provide a sketch of how they would be split into clusters by K-means for the given number of clusters and (2) indicate approximately where the resulting centroids would be. Assume that we are using the squared error objective function. If you think that there is more than one possible solution, then please indicate whether each solution is a global or local minimum. Note that the label of each diagram in Figure 7.37 matches the corresponding part of this question, e.g., Figure 7.37(a) goes with part (a).
Answer:
(a) K = 2. Assuming that the points are uniformly distributed in the circle, how many possible ways are there (in theory) to partition the points into two clusters? What can you say about the positions of the two centroids? (Again, you don't need to provide exact centroid locations, just a qualitative description.)
In theory, there are an infinite number of ways to split the circle into two clusters: just take any line that bisects the circle. This line can make any angle 0° ≤ θ ≤ 180° with the x axis. The centroids will lie on the perpendicular bisector of the line that splits the circle into two clusters and will be symmetrically positioned. All these solutions will have the same, globally minimal, error.
(b) K = 3. The distance between the edges of the circles is slightly greater than the radii of the circles.
If you start with initial centroids that are real points, you will necessarily get this solution because of the restriction that the circles are more than one radius apart. Of course, the bisector could have any angle, as above, and it could be the other circle that is split. All these solutions have the same globally minimal error.
(c) K = 3. The distance between the edges of the circles is much less than the radii of the circles.
The three boxes show the three clusters that will result in the realistic case that the initial centroids are actual data points.
(d) K = 2.
In both cases, the rectangles show the clusters. In the first case, the two clusters are only a local minimum, while in the second case the clusters represent a globally minimal solution.
(e) K = 3. Hint: Use the symmetry of the situation and remember that we are looking for a rough sketch of what the result would be.
For the solution shown in the top figure, the two top clusters are enclosed in two boxes, while the third cluster is enclosed by the regions defined by a triangle and a rectangle. (The two smaller clusters in the drawing are supposed to be symmetrical.) I believe that the second solution (suggested by a student) is also possible, although it is a local minimum and might rarely be seen in practice for this configuration of points. Note that while the two pie-shaped cuts out of the larger circle are shown as meeting at a point, this is not necessarily the case; it depends on the exact positions and sizes of the circles. There could be a gap between the two pie-shaped cuts which is filled by the third (larger) cluster. (Imagine the small circles on opposite sides.) Or the boundary between the two pie-shaped cuts could actually be a line segment.

7. Suppose that for a data set
there are m points and K clusters,
half the points and clusters are in "more dense" regions,
half the points and clusters are in "less dense" regions, and
the two regions are well-separated from each other.
For the given data set, which of the following should occur in order to minimize the squared error when finding K clusters:
(a) Centroids should be equally distributed between more dense and less dense regions.
(b) More centroids should be allocated to the less dense region.
(c) More centroids should be allocated to the denser region.
Note: Do not get distracted by special cases or bring in factors other than density. However, if you feel the true answer is different from any given above, justify your response.
The correct answer is (b). Less dense regions require more centroids if the squared error is to be minimized.
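For the sampling question above (the probability of drawing one point from each of K equal-sized clusters in a sample of size K), the quoted ratio works out to K!/K^K, consistent with the 4!/4^4 = 0.0938 example. A sketch that evaluates it for a few values of K:

```python
import math

def prob_one_per_cluster(k):
    # K!/K^K: ways to pick one point from each of K equal-sized clusters,
    # divided by the ways to pick K points overall.
    return math.factorial(k) / k ** k

for k in (2, 4, 10, 100):
    print(k, prob_one_per_cluster(k))
# 2 -> 0.5, 4 -> 0.094, 10 -> 0.00036, 100 -> ~1e-43 (essentially zero, as noted above)
```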
8. Consider the mean of a cluster of objects from a binary transaction dataset. What are the minimum and maximum values of the components of the mean? What is the interpretation of the components of the cluster mean? Which components most accurately characterize the objects in the cluster?
(a) The components of the mean range between 0 and 1.
(b) For any specific component, its value is the fraction of the objects in the cluster that have a 1 for that component. If we have asymmetric binary data, such as market basket data, then this can be viewed as the probability that, for example, a customer in the group represented by the cluster buys that particular item.
(c) This depends on the type of data. For binary asymmetric data, the components with higher values characterize the data, since, for most clusters, the vast majority of components will have values of zero. For regular binary data, such as the results of a true-false test, the significant components are those that are unusually high or low with respect to the entire set of data.

9. Give an example of a data set consisting of three natural clusters, for which (almost always) K-means would likely find the correct clusters, but bisecting K-means would not.
Consider a data set that consists of three circular clusters that are identical in terms of the number and distribution of points, and whose centers lie on a line and are located such that the center of the middle cluster is equally distant from the other two. Bisecting K-means would always split the middle cluster during its first iteration, and thus could never produce the correct set of clusters. (Postprocessing could be applied to address this.)
10. Would the cosine measure be the appropriate similarity measure to
use with K-means clustering for time series data? Why or why not? If
not, what similarity measure would be more appropriate?
Time series data is dense high-dimensional data, and thus, the cosine
measure would not be appropriate since the cosine measure is appropriate
for sparse data. If the magnitude of a time series is important, then Euclidean
distance would be appropriate. If only the shapes of the time series are
important, then correlation would be appropriate. Note that if the comparison
of the time series needs to take in account that one time series might lead or
lag another or only be related to another during specific time periods, then
more sophisticated approaches to modeling time series similarity must be
used.
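The magnitude-versus-shape distinction drawn above can be seen on a toy pair of series. A sketch comparing Euclidean distance, cosine similarity, and correlation for two series with the same shape but a different scale and offset:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 10 * x + 100          # same shape, different magnitude and offset

euclidean = np.linalg.norm(x - y)
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
correlation = np.corrcoef(x, y)[0, 1]

print(euclidean)    # large: the magnitudes differ
print(cosine)       # high, but still affected by the offset
print(correlation)  # exactly 1.0: the shapes match
```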
11. Total SSE is the sum of the SSE for each separate attribute. What
does it mean if the SSE for one variable is low for all clusters? Low for
just one cluster? High for all clusters? High for just one cluster? How
could you use the per variable SSE information to improve your
clustering?
(a) If the SSE of one attribute is low for all clusters, then the variable is
essentially a constant and of little use in dividing the data into groups.
(b) If the SSE of one attribute is relatively low for just one cluster, then this attribute helps define the cluster.
(c) If the SSE of an attribute is relatively high for all clusters, then it could well mean that the attribute is noise.
(d) If the SSE of an attribute is relatively high for one cluster, then it is at odds with the information provided by the attributes with low SSE that define the cluster. It could merely be the case that the clusters defined by this attribute are different from those defined by the other attributes, but in any case, it means that this attribute does not help define the cluster.
(e) The idea is to eliminate attributes that have poor distinguishing power between clusters, i.e., low or high SSE for all clusters, since they are useless for clustering. Note that attributes with high SSE for all clusters are particularly troublesome if they have a relatively high SSE with respect to other attributes (perhaps because of their scale), since they introduce a lot of noise into the computation of the overall SSE.
12. The leader algorithm (Hartigan [533]) represents each cluster using a point, known as a leader, and assigns each point to the cluster corresponding to the closest leader, unless this distance is above a user-specified threshold. In that case, the point becomes the leader of a new cluster.
a. What are the advantages and disadvantages of the leader algorithm as compared to K-means?
The leader algorithm requires only a single scan of the data and is thus more computationally efficient, since each object is compared to the final set of centroids at most once. Although the leader algorithm is order dependent, for a fixed ordering of the objects it always produces the same set of clusters. However, unlike K-means, it is not possible to set the number of resulting clusters for the leader algorithm, except indirectly. Also, the K-means algorithm almost always produces better quality clusters as measured by SSE.
b. Suggest ways in which the leader algorithm might be improved.
Use a sample to determine the distribution of distances between the points. The knowledge gained from this process can be used to more intelligently set the value of the threshold. The leader algorithm could be modified to cluster for several thresholds during a single pass.

13. The Voronoi diagram for a set of K points in the plane is a partition of all the points of the plane into K regions, such that every point (of the plane) is assigned to the closest point among the K specified points; see Figure 7.38. What is the relationship between Voronoi diagrams and K-means clusters? What do Voronoi diagrams tell us about the possible shapes of K-means clusters?
(a) If we have K K-means clusters, then the plane is divided into K Voronoi regions that represent the points closest to each centroid.
(b) The boundaries between clusters are piecewise linear. It is possible to see this by drawing a line connecting two centroids and then drawing a perpendicular to the line halfway between the centroids. This perpendicular line splits the plane into two regions, each containing points that are closest to the centroid the region contains.

14. You are given a data set with 100 records and are asked to cluster the data. You use K-means to cluster the data, but for all values of K, the K-means algorithm returns only one non-empty cluster. You then apply an incremental version of K-means, but obtain exactly the same result. How is this possible? How would single link or DBSCAN handle such data?
(a) The data consists completely of duplicates of one object.
(b) Single link (and many of the other agglomerative hierarchical schemes) would produce a hierarchical clustering, but which points appear in which cluster would depend on the ordering of the points and the exact algorithm. However, if the dendrogram were plotted showing the proximity at which each object is merged, then it would be obvious that the data consisted of duplicates. DBSCAN would find that all points were core points connected to one another and produce a single cluster.

15. Traditional agglomerative hierarchical clustering routines merge two clusters at each step. Does it seem likely that such an approach accurately captures the (nested) cluster structure of a set of data points? If not, explain how you might postprocess the data to obtain a more accurate view of the cluster structure.
(a) Such an approach does not accurately capture the nested cluster structure of the data. For example, consider a set of three clusters, each of which has two, three, and four subclusters, respectively. An ideal hierarchical clustering would have three branches from the root, one to each of the three main clusters, and then two, three, and four branches from each of these clusters, respectively. A traditional agglomerative approach cannot produce such a structure.
(b) The simplest type of postprocessing would attempt to flatten the hierarchical clustering by moving clusters up the tree.

16. Use the similarity matrix in Table 7.13 to perform single and complete link hierarchical clustering. Show your results by drawing a dendrogram. The dendrogram should clearly show the order in which the points are merged.

17. Hierarchical clustering is sometimes used to generate K clusters by taking the clusters at the Kth level of the dendrogram. (Root is at level 1.) By looking at the clusters produced in this way, we can evaluate the behavior of hierarchical clustering on different types of data and clusters, and also compare hierarchical approaches to K-means. The following is a set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}.
a. For each of the following sets of initial centroids, create two clusters by assigning each point to the nearest centroid, and then calculate the total squared error for each set of two clusters. Show both the clusters and the total squared error for each set of centroids.
i. {18, 45}
ii. {15, 40}
(b) Do both sets of centroids represent stable solutions; i.e., if the K-means algorithm was run on this set of points using the given centroids as the starting centroids, would there be any change in the clusters generated?
Yes, both centroids are stable solutions.
(c) What are the two clusters produced by single link?
The two clusters are {6, 12, 18, 24, 30} and {42, 48}.
(d) Which technique, K-means or single link, seems to produce the "most natural" clustering in this situation? (For K-means, take the clustering with the lowest squared error.)
MIN produces the most natural clustering.
(e) What definition(s) of clustering does this natural clustering correspond to? (Well-separated, center-based, contiguous, or density.)
MIN produces contiguous clusters. However, density is also an acceptable answer. Even center-based is acceptable, since one set of centers gives the desired clusters.
(f) What well-known characteristic of the K-means algorithm explains the previous behavior?
K-means is not good at finding clusters of different sizes, at least when they are not well separated. The reason for this is that the objective of minimizing squared error causes it to "break" the larger cluster. Thus, in this problem, the low-error clustering solution is the "unnatural" one.
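Part (a) of Exercise 17 can be worked out in a few lines. A sketch that assigns each point to its nearest centroid and reports the clusters and total squared error for both sets of initial centroids (squared error is measured to the assigned centroid, which here coincides with the cluster mean):

```python
points = [6, 12, 18, 24, 30, 42, 48]

def assign_and_score(centroids):
    clusters = {c: [] for c in centroids}
    for p in points:
        nearest = min(centroids, key=lambda c: abs(p - c))
        clusters[nearest].append(p)
    sse = sum((p - c) ** 2 for c, members in clusters.items() for p in members)
    return clusters, sse

for centroids in ([18, 45], [15, 40]):
    print(centroids, assign_and_score(centroids))
```

Because each centroid equals the mean of its cluster, neither assignment changes on a subsequent K-means pass, which is consistent with the answer to part (b).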
18. Suppose we find K clusters using Ward's method, bisecting K-means, and ordinary K-means. Which of these solutions represents a local or global minimum? Explain.
Although Ward's method picks a pair of clusters to merge based on minimizing SSE, there is no refinement step as in regular K-means. Likewise, bisecting K-means has no overall refinement step. Thus, unless such a refinement step is added, neither Ward's method nor bisecting K-means produces a local minimum. Ordinary K-means produces a local minimum, but like the other two algorithms, it is not guaranteed to produce a global minimum.

(a) Data with very different sized clusters.
This can be a problem, particularly if the number of points in a cluster is small. For example, if we have a thousand points, with two clusters, one of size 900 and one of size 100, and take a 5% sample, then we will, on average, end up with 45 points from the first cluster and 5 points from the second cluster. Five points is much easier to miss or cluster improperly than 50. Also, the second cluster will sometimes be represented by fewer than 5 points, just by the nature of random samples.
(b) High-dimensional data.
This can be a problem because data in high-dimensional space is typically sparse and more points may be needed to define the structure of a cluster in high-dimensional space.
(c) Data with outliers, i.e., atypical points.
By definition, outliers are not very frequent and most of them will be omitted when sampling. Thus, if finding the correct clustering depends on having the outliers present, the clustering produced by sampling will likely be misleading. Otherwise, it is beneficial.
(d) Data with highly irregular regions.
This can be a problem because the structure of the border can be lost when sampling unless a large number of points are sampled.
(e) Data with globular clusters.
This is typically not a problem since not as many points need to be sampled to retain the structure of a globular cluster as an irregular one.
(f) Data with widely different densities.
In this case the data will tend to come from the denser region. Note that the effect of sampling is to reduce the density of all clusters by the sampling factor, e.g., if we take a 10% sample, then the density of the clusters is decreased by a factor of 10. For clusters that aren't very dense to begin with, this may mean that they are now treated as noise or outliers.
(g) Data with a small percentage of noise points.
Sampling will not cause a problem. Actually, since we would like to exclude noise, and since the amount of noise is small, this may be beneficial.
(h) Non-Euclidean data.
This has no particular impact.
(i) Euclidean data.
This has no particular impact.
(j) Data with many and mixed attribute types.
The case of many attributes was discussed under high-dimensional data. Mixed attributes have no particular impact.

20. Consider the following four faces shown in Figure 7.39. Again, darkness or number of dots represents density. Lines are used only to distinguish regions and do not represent points.
(a) For each figure, could you use single link to find the patterns represented by the nose, eyes, and mouth? Explain.
Only for (b) and (d). For (b), the points in the nose, eyes, and mouth are much closer together than the points between these areas. For (d) there is only space between these regions.
(b) For each figure, could you use K-means to find the patterns represented by the nose, eyes, and mouth? Explain.
Only for (b) and (d). For (b), K-means would find the nose, eyes, and mouth, but the lower density points would also be included. For (d), K-means would find the nose, eyes, and mouth straightforwardly as long as the number of clusters was set to 4.
(c) What limitation does clustering have in detecting all the patterns formed by the points in Figure 7.7(c)?
Clustering techniques can only find patterns of points, not of empty spaces.

21. Compute the entropy and purity for the confusion matrix in Table 7.14.
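Entropy and purity in Exercise 21 are both row-wise functions of the confusion matrix. A generic sketch (the matrix below is a hypothetical placeholder, since Table 7.14 is not reproduced here):

```python
import math

def entropy_and_purity(matrix):
    total = sum(sum(row) for row in matrix)
    entropy = purity = 0.0
    for row in matrix:                      # one row per cluster, one column per class
        n = sum(row)
        h = -sum((c / n) * math.log2(c / n) for c in row if c > 0)
        entropy += (n / total) * h          # size-weighted average of cluster entropies
        purity += (n / total) * (max(row) / n)
    return entropy, purity

example = [[30, 5, 5], [2, 40, 8], [1, 4, 55]]   # hypothetical cluster-by-class counts
print(entropy_and_purity(example))
```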
22. You are given two sets of 100 points that fall within the unit square. One set of points is arranged so that the points are uniformly spaced. The other set of points is generated from a uniform distribution over the unit square.
a. Is there a difference between the two sets of points?
Yes. The random points will have regions of lesser or greater density, while the uniformly distributed points will, of course, have uniform density throughout the unit square.
(b) If so, which set of points will typically have a smaller SSE for K = 10 clusters?
The random set of points will have a lower SSE.
(c) What will be the behavior of DBSCAN on the uniform data set? The random data set?
DBSCAN will merge all points in the uniform data set into one cluster or classify them all as noise, depending on the threshold. There might be some boundary issues for points at the edge of the region. However, DBSCAN can often find clusters in the random data, since it does have some variation in density.

24. Given the set of cluster labels and similarity matrix shown in Tables 7.15 and 7.16, respectively, compute the correlation between the similarity matrix and the ideal similarity matrix, i.e., the matrix whose entry is 1 if two objects belong to the same cluster, and 0 otherwise.
Answer:
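Exercise 24 asks for the correlation between the given similarity matrix and the "ideal" 0/1 similarity matrix induced by the cluster labels. A sketch of the computation over the off-diagonal entries (the labels and matrix below are placeholders for Tables 7.15 and 7.16):

```python
import numpy as np

labels = np.array([1, 1, 2, 2, 2])                     # placeholder cluster labels
sim = np.array([[1.0, 0.8, 0.3, 0.2, 0.1],
                [0.8, 1.0, 0.2, 0.3, 0.2],
                [0.3, 0.2, 1.0, 0.7, 0.6],
                [0.2, 0.3, 0.7, 1.0, 0.5],
                [0.1, 0.2, 0.6, 0.5, 1.0]])            # placeholder similarity matrix

ideal = (labels[:, None] == labels[None, :]).astype(float)  # 1 if same cluster, else 0
mask = ~np.eye(len(labels), dtype=bool)                     # ignore the diagonal
corr = np.corrcoef(sim[mask], ideal[mask])[0, 1]
print(corr)
```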
23. Using the data in Exercise 24 , compute the silhouette coefficient for
each point, each of the two clusters, and the overall clustering.
25. Compute the hierarchical F-measure for the eight objects {p1, p2, p3,
p4, p5, p6, p7, and p8} and hierarchical clustering shown in Figure 7.40 .
Class A contains points p1, p2, and p3, while p4, p5, p6, p7, and p8
belong to class B.
This can be easily computed using a package, e.g., MATLAB. The answers
are single link, 0.8116, and complete link, 0.7480.
27. Prove Equation 7.14.

28. Prove Equation 7.16.

31. We can represent a data set as a collection of object nodes and a collection of attribute nodes, where there is a link between each object and each attribute, and where the weight of that link is the value of the object for that attribute. For sparse data, if the value is 0, the link is omitted. Bipartite clustering attempts to partition this graph into disjoint clusters, where each cluster consists of a set of object nodes and a set of attribute nodes. The objective is to maximize the weight of links between the object and attribute nodes of a cluster, while minimizing the weight of links between object and attribute nodes in different clusters. This type of clustering is also known as co-clustering because the objects and attributes are clustered at the same time.
a. How is bipartite clustering (co-clustering) different from clustering the sets of objects and attributes separately?
In regular clustering, only one set of constraints, related either to objects or attributes, is applied. In co-clustering both sets of constraints are applied simultaneously. Thus, partitioning the objects and attributes independently of one another typically does not produce the same results.
b. Are there any cases in which these approaches yield the same clusters?
Yes. For example, if a set of attributes is associated only with the objects in one particular cluster, i.e., has 0 weight for objects in all other clusters, and conversely, the set of objects in a cluster has 0 weight for all other attributes, then the clusters found by co-clustering will match those found by clustering the objects and attributes separately. To use documents as an example, this would correspond to a document data set that consists of groups of documents that only contain certain words and groups of words that only appear in certain documents.
c. What are the strengths and weaknesses of co-clustering as compared to ordinary clustering?
Co-clustering automatically provides a description of a cluster of objects in terms of attributes, which can be more useful than a description of clusters as a partitioning of objects. However, the attributes that distinguish different clusters of objects may overlap significantly, and in such cases co-clustering will not work well.

32. In Figure 7.41, match the similarity matrices, which are sorted according to cluster labels, with the sets of points. Differences in shading and marker shape distinguish between clusters, and each set of points contains 100 points and three clusters. In the set of points labeled 2, there are three very tight, equal-sized clusters.
Answers: 1 - D, 2 - C, 3 - A, 4 - B