Exercise Solutions

The document discusses various methods for computing Gini indices, information gain, and entropy in the context of decision tree classification. It covers the evaluation of attributes for splitting, the impact of different impurity measures, and the generalization error rates of decision trees using optimistic, pessimistic, and reduced error pruning approaches. Additionally, it highlights the limitations of certain evaluation methods and the importance of avoiding overfitting in model selection.

CHAPTER – 3 (CLASSIFICATION)

1. Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?
Answer: The resulting tree cannot be simplified.

2. Consider the training examples shown in Table 3.1 for a binary classification problem.
(a) Compute the Gini index for the overall collection of training examples.
Answer: Gini = 1 − 2 × 0.5² = 0.5.
(b) Compute the Gini index for the Customer ID attribute.
Answer: The gini for each Customer ID value is 0. Therefore, the overall gini for Customer ID is 0.
(c) Compute the Gini index for the Gender attribute.
Answer: The gini for Male is 1 − 0.4² − 0.6² = 0.48. The gini for Female is also 1 − 0.4² − 0.6² = 0.48. Therefore, the overall gini for Gender is 0.5 × 0.48 + 0.5 × 0.48 = 0.48.
(d) Compute the Gini index for the Car Type attribute using multiway split.
Answer: The gini for Family car is 0.375, Sports car is 0, and Luxury car is 0.2188. The overall gini is 0.1625.
(e) Compute the Gini index for the Shirt Size attribute using multiway split.
Answer: The gini for Small shirt size is 0.48, Medium shirt size is 0.4898, Large shirt size is 0.5, and Extra Large shirt size is 0.5. The overall gini for the Shirt Size attribute is 0.4914.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
Answer: Car Type, because it has the lowest gini among the three attributes.
(g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
Answer: The attribute has no predictive power since new customers are assigned to new Customer IDs.

3. Consider the training examples shown in Table 3.2 for a binary classification problem.
(a) What is the entropy of this collection of training examples with respect to the positive class?
Answer: There are four positive examples and five negative examples. Thus, P(+) = 4/9 and P(−) = 5/9. The entropy of the training examples is −4/9 log2(4/9) − 5/9 log2(5/9) = 0.9911.
(b) What are the information gains of a1 and a2 relative to these training examples?
(c) For a3, which is a continuous attribute, compute the information gain for every possible split.
(d) What is the best split (among a1, a2, and a3) according to the information gain?
Answer: According to information gain, a1 produces the best split.
(e) What is the best split (between a1 and a2) according to the classification error rate?
Answer: For attribute a1: error rate = 2/9. For attribute a2: error rate = 4/9. Therefore, according to error rate, a1 produces the best split.
(f) What is the best split (between a1 and a2) according to the Gini index?

4. Show that the entropy of a node never increases after splitting it into smaller successor nodes.
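As a quick check of the arithmetic in exercises 2 and 3 (and a one-off illustration of exercise 4's claim), here is a small Python sketch. It is not part of the original solutions; the helper names are mine, and the class counts used below are read off the answers quoted above (e.g. the 0.4/0.6 split per Gender value, taken here as 4 vs 6 records).

```python
from math import log2

def gini(counts):
    """Gini index of a node, given a list of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy (base 2) of a node, given a list of class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def weighted(impurity, children):
    """Weighted impurity of the child nodes of a split."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * impurity(c) for c in children)

# Exercise 2(c): each Gender value has a 0.4/0.6 class split (e.g. 4 vs 6 records),
# so each child has Gini 0.48 and the weighted Gini is also 0.48.
print(gini([4, 6]))                         # 0.48
print(weighted(gini, [[4, 6], [4, 6]]))     # 0.48

# Exercise 3(a): 4 positive and 5 negative examples.
print(round(entropy([4, 5]), 4))            # 0.9911

# Exercise 4 (single illustrative split, not a proof): splitting [4, 5]
# into [4, 1] and [0, 4] lowers the weighted entropy.
print(round(weighted(entropy, [[4, 1], [0, 4]]), 4))   # about 0.4011 <= 0.9911
```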

5. Consider the following data set for a binary class problem.
(a) Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
Answer: The algorithm chooses attribute A because it has the highest gain.
(b) Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
(c) Figure 4.13 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and they are both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.
Answer: Yes. Even though these measures have a similar range and monotonic behavior, their respective gains, Δ, which are scaled differences of the measures, do not necessarily behave in the same way, as illustrated by the results in parts (a) and (b).

6. Consider splitting a parent node P into two child nodes, C1 and C2, using some attribute test condition. The composition of labeled training instances at every node is summarized in the Table below.
(a) Calculate the Gini index and misclassification error rate of the parent node P.
(b) Calculate the weighted Gini index of the child nodes. Would you consider this attribute test condition if Gini is used as the impurity measure?
(c) Calculate the weighted misclassification rate of the child nodes. Would you consider this attribute test condition if misclassification rate is used as the impurity measure?
Answer: Since there is no drop in misclassification error rate, we would not consider this attribute test condition for splitting.

7. Consider the following set of training examples.
(a) Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?
Answer: Splitting Attribute at Level 1. To determine the test condition at the root node, we need to compute the error rates for attributes X, Y, and Z.
For attribute X, the corresponding counts are:
Therefore, the error rate using attribute X is (60 + 40)/200 = 0.5.
For attribute Y, the corresponding counts are:
Therefore, the error rate using attribute Y is (40 + 40)/200 = 0.4.
For attribute Z, the corresponding counts are:
Therefore, the error rate using attribute Z is (30 + 30)/200 = 0.3.
Since Z gives the lowest error rate, it is chosen as the splitting attribute at level 1.
Splitting Attribute at Level 2. After splitting on attribute Z, the subsequent test condition may involve either attribute X or Y. This depends on the training examples distributed to the Z = 0 and Z = 1 child nodes. For Z = 0, the corresponding counts for attributes X and Y are the same, as shown in the table below. The error rate in both cases (X and Y) is (15 + 15)/100 = 0.3. For Z = 1, the corresponding counts for attributes X and Y are shown in the tables below. Although the counts are somewhat different, their error rates remain the same, (15 + 15)/100 = 0.3.
The corresponding two-level decision tree is shown below. The overall error rate of the induced tree is (15 + 15 + 15 + 15)/200 = 0.3.
(b) Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?
Answer: After choosing attribute X to be the first splitting attribute, the subsequent test condition may involve either attribute Y or attribute Z.
For X = 0, the corresponding counts for attributes Y and Z are shown in the table below. The error rates using attributes Y and Z are 10/120 and 30/120, respectively. Since attribute Y leads to a smaller error rate, it provides a better split.
For X = 1, the error rates using attributes Y and Z are 10/80 and 30/80, respectively. Since attribute Y leads to a smaller error rate, it provides a better split.
The corresponding two-level decision tree is shown below. The overall error rate of the induced tree is (10 + 10)/200 = 0.1.
(c) Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.
Answer: From the preceding results, the error rate for part (a) is significantly larger than that for part (b). This example shows that a greedy heuristic does not always produce an optimal solution.

8. The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree.
(a) According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.
Answer: The error rate for the data without partitioning on any attribute is:
Therefore, A is chosen as the splitting attribute.
(b) Repeat for the two children of the root node.
Answer: Because the A = T child node is pure, no further splitting is needed. For the A = F child node, the distribution of training instances is:
(c) How many instances are misclassified by the resulting decision tree?
Answer: 20 instances are misclassified. (The error rate is 20/100.)
(d) Repeat parts (a), (b), and (c) using C as the splitting attribute.
Answer: For the C = T child node, the error rate before splitting is:
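The split choices in exercises 7 and 8 all reduce to comparing weighted misclassification error rates. The sketch below (my addition, not from the manual) shows that computation; the contingency tables themselves are not reproduced in this document, so the demo split uses hypothetical counts chosen only so that the misclassified totals match the quoted (30 + 30)/200 figure.

```python
def weighted_error(children):
    """Weighted misclassification error of a split.

    `children` is a list of per-child class counts, e.g. [(n_plus, n_minus), ...].
    Each child predicts its majority class, so it misclassifies min(counts) records.
    """
    total = sum(sum(c) for c in children)
    return sum(min(c) for c in children) / total

# Hypothetical two-way split of 200 records in which each child misclassifies
# 30 records, matching the (30 + 30)/200 = 0.3 quoted for attribute Z at level 1.
print(weighted_error([(70, 30), (30, 70)]))   # 0.3

# Overall error rates quoted for the induced trees in exercise 7:
print((15 + 15 + 15 + 15) / 200)   # 0.3  (greedy tree of part (a))
print((10 + 10) / 200)             # 0.1  (tree of part (b))
```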
(e) Use the results in parts (c) and (d) to conclude about the greedy nature of the decision tree induction algorithm.
Answer: The greedy heuristic does not necessarily lead to the best tree.

9. Consider the decision tree shown in Figure 3.2.
(a) Compute the generalization error rate of the tree using the optimistic approach.
Answer: According to the optimistic approach, the generalization error rate is 3/10 = 0.3.
(b) Compute the generalization error rate of the tree using the pessimistic approach. (For simplicity, use the strategy of adding a factor of 0.5 to each leaf node.)
Answer: According to the pessimistic approach, the generalization error rate is (3 + 4 × 0.5)/10 = 0.5.
(c) Compute the generalization error rate of the tree using the validation set shown above. This approach is known as reduced error pruning.
Answer: According to the reduced error pruning approach, the generalization error rate is 4/5 = 0.8.

10. Consider the decision trees shown in Figure 3.3. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3. Compute the total description length of each decision tree according to the minimum description length principle.
. The total description length of a tree is given by: Cost(tree, data) = Cost(tree) + Cost(data|tree).
. Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2 m bits.
. Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2 k bits.
. Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.
. Cost(data|tree) is encoded using the classification errors the tree commits on the training set. Each error is encoded by log2 n bits, where n is the total number of training instances.
Which decision tree is better, according to the MDL principle?
Answer: Because there are 16 attributes, the cost for each internal node in the decision tree is log2(m) = log2(16) = 4. Furthermore, because there are 3 classes, the cost for each leaf node is ⌈log2(k)⌉ = ⌈log2(3)⌉ = 2. The cost for each misclassification error is log2(n). The overall cost for decision tree (a) is 2 × 4 + 3 × 2 + 7 × log2 n = 14 + 7 log2 n and the overall cost for decision tree (b) is 4 × 4 + 5 × 2 + 4 × log2 n = 26 + 4 log2 n. According to the MDL principle, tree (a) is better than (b) if n < 16 and is worse than (b) if n > 16.

11. This exercise, inspired by the discussions in [6], highlights one of the known limitations of the leave-one-out model evaluation procedure. Let us consider a data set containing 50 positive and 50 negative instances, where the attributes are purely random and contain no information about the class labels. Hence, the generalization error rate of any classification model learned over this data is expected to be 0.5. Let us consider a classifier that assigns the majority class label of training instances (ties resolved by using the positive label as the default class) to any test instance, irrespective of its attribute values. We can call this approach the majority inducer classifier. Determine the error rate of this classifier using the following methods.
(b) 2-fold stratified cross-validation, where the proportion of class labels at every fold is kept the same as that of the overall data.
Answer: If we divide the data set into two folds such that both folds have an equal number of positives and negatives, the majority inducer trained over any of the two folds will face a tie and thus assign test instances to the default class, which is positive. Since the default class will be correct 50% of the time on any fold, the error rate of the majority inducer using 2-fold stratified cross-validation will be 0.5.
(c) From the results above, which method provides a more reliable evaluation of the classifier's generalization error rate?
Answer: Cross-validation provides a more reliable estimate of the generalization error rate of the majority inducer classifier on this dataset, which is expected to be 0.5. Leave-one-out is quite susceptible to changes in the number of positive and negative instances in the training set, even by a single count, leading to a high error rate of 1 for the majority inducer. As another example, if we consider the minority inducer classifier, which labels every test instance with the minority class in the training set, we would find that the leave-one-out method would result in an error rate of 0 for the minority inducer, which is quite misleading since the attributes contain no information about the classes and any classifier is expected to have an error rate of 0.5.

12. Consider a labeled data set containing 100 data instances, which is randomly partitioned into two sets A and B, each containing 50 instances. We use A as the training set to learn two decision trees, T10 with 10 leaf nodes and T100 with 100 leaf nodes. The accuracies of the two decision trees on data sets A and B are shown in Table 3.3.
Answer: Tree T10 is expected to show better generalization performance on unseen instances than T100. We can see that the training accuracy of T100 on dataset A is very high, but its test accuracy on dataset B that was not used for training is low. This gap between training and test accuracies implies that T100 suffers from overfitting. Hence, even if the training accuracy of T100 is very high on dataset A, it is not representative of generalization performance on unseen instances in dataset B. On the other hand, tree T10 has moderately low training accuracy on dataset A, but the test accuracy of T10 on dataset B is not very different. This implies that T10 does not suffer from overfitting and the training performance is indeed indicative of generalization performance.
(b) Now, you tested T10 and T100 on the entire data set (A + B) and found that the classification accuracy of T10 on data set (A + B) is 0.85, whereas the classification accuracy of T100 on the data set (A + B) is 0.87. Based on this new information and your observations from Table 3.3, which classification model would you finally choose for classification?
Answer: We would still choose T10 over T100 for classification. The high accuracy of T100 on dataset (A + B) can be attributed to the high training accuracy of T100 on dataset A, which is an artifact of overfitting. Note that the performance of a classifier on the combined dataset (A + B) cannot be viewed as an estimate of generalization performance, since it contains instances used for training (from dataset A). Hence, the final decision of choosing T10 over T100 is still motivated solely by its superior accuracy on unseen test instances in B.

13. Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, pA be the accuracy of classifier A, pB be the accuracy of classifier B, and p = (pA + pB)/2 be the average accuracy for both classifiers. To test whether classifier A is significantly better than B, the following Z-statistic is used:
Classifier A is assumed to be better than classifier B if Z > 1.96. Table 3.4 compares the accuracies of three different classifiers, decision tree classifiers, naive Bayes classifiers, and support vector machines, on various data sets. (The latter two classifiers are described in Chapter 5.)

14. Let X be a binomial random variable with mean Np and variance Np(1 − p). Show that the ratio X/N also has a binomial distribution with mean p and variance p(1 − p)/N.
Answer: Let r = X/N. Since X has a binomial distribution, r also has the same distribution. The mean and variance for r can be computed as follows:
Mean, E[r] = E[X/N] = E[X]/N = (Np)/N = p;
Variance, Var[r] = Var[X/N] = Var[X]/N² = Np(1 − p)/N² = p(1 − p)/N.
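The MDL comparison in exercise 10 above is just arithmetic on the node counts. The short sketch below (my addition) evaluates the two cost expressions and confirms the crossover at n = 16.

```python
from math import log2, ceil

m, k = 16, 3                   # number of attributes and classes
node_cost = log2(m)            # 4 bits per internal node
leaf_cost = ceil(log2(k))      # 2 bits per leaf node

def tree_cost(internal, leaves, errors, n):
    """Total description length: Cost(tree) + Cost(data|tree)."""
    return internal * node_cost + leaves * leaf_cost + errors * log2(n)

for n in (8, 16, 32):
    a = tree_cost(2, 3, 7, n)  # tree (a): 14 + 7 log2 n
    b = tree_cost(4, 5, 4, n)  # tree (b): 26 + 4 log2 n
    print(n, round(a, 1), round(b, 1))
# Tree (a) has the lower cost for n < 16, the two trees tie at n = 16,
# and tree (b) wins for n > 16, as stated above.
```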

CHAPTER – 4 (CLASSIFICATION – ALTERNATIVE TECHNIQUES)

1. Consider a binary classification problem with the following set of attributes and attribute values:
. Air Conditioner = {Working, Broken}
. Engine = {Good, Bad}
. Mileage = {High, Medium, Low}
. Rust = {Yes, No}
Suppose a rule-based classifier produces the following rule set:
Mileage = High −→ Value = Low
Mileage = Low −→ Value = High
Air Conditioner = Working, Engine = Good −→ Value = High
Air Conditioner = Working, Engine = Bad −→ Value = Low
Air Conditioner = Broken −→ Value = Low
(a) Are the rules mutually exclusive?
Answer: No
(b) Is the rule set exhaustive?
Answer: Yes
(c) Is ordering needed for this set of rules?
Answer: Yes, because a test instance may trigger more than one rule.
(d) Do you need a default class for the rule set?
Answer: No, because every instance is guaranteed to trigger at least one rule.

2. The RIPPER algorithm (by Cohen [1]) is an extension of an earlier algorithm called IREP (by Fürnkranz and Widmer [3]). Both algorithms apply the reduced-error pruning method to determine whether a rule needs to be pruned. The reduced error pruning method uses a validation set to estimate the generalization error of a classifier. Consider the following pair of rules:
R1: A −→ C
R2: A ∧ B −→ C
R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For this question, you will be asked to determine whether R2 is preferred over R1 from the perspectives of rule-growing and rule-pruning. To determine whether a rule should be pruned, IREP computes the following measure:
where P is the total number of positive examples in the validation set, N is the total number of negative examples in the validation set, p is the number of positive examples in the validation set covered by the rule, and n is the number of negative examples in the validation set covered by the rule. vIREP is actually similar to classification accuracy for the validation set. IREP favors rules that have higher values of vIREP. On the other hand, RIPPER applies the following measure to determine whether a rule should be pruned:
(a) Suppose R1 is covered by 350 positive examples and 150 negative examples, while R2 is covered by 300 positive examples and 50 negative examples. Compute the FOIL's information gain for the rule R2 with respect to R1.
Answer: For this problem, p0 = 350, n0 = 150, p1 = 300, and n1 = 50. Therefore, the FOIL's information gain for R2 with respect to R1 is:
(b) Consider a validation set that contains 500 positive examples and 500 negative examples. For R1, suppose the number of positive examples covered by the rule is 200, and the number of negative examples covered by the rule is 50. For R2, suppose the number of positive examples covered by the rule is 100 and the number of negative examples is 5. Compute vIREP for both rules. Which rule does IREP prefer?
Answer:
(c) Compute vRIPPER for the previous problem. Which rule does RIPPER prefer?
Answer:
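The vIREP and vRIPPER formulas themselves did not survive in this copy, so the sketch below relies on the standard textbook definitions as an assumption: FOIL's gain = p1 (log2(p1/(p1 + n1)) − log2(p0/(p0 + n0))), vIREP = (p + (N − n))/(P + N), and vRIPPER = (p − n)/(p + n). It plugs in the counts given in parts (a)-(c); treat the printed values as my computation rather than the manual's official answers.

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    # FOIL's information gain of the more specific rule relative to the general one.
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

def v_irep(p, n, P, N):
    # Accuracy-like pruning measure used by IREP (assumed standard form).
    return (p + (N - n)) / (P + N)

def v_ripper(p, n):
    # Pruning measure used by RIPPER (assumed standard form).
    return (p - n) / (p + n)

# (a) R1 covers 350+/150-, R2 covers 300+/50-.
print(round(foil_gain(350, 150, 300, 50), 2))   # about 87.65

# (b) Validation set with 500+/500-; R1 covers 200+/50-, R2 covers 100+/5-.
print(v_irep(200, 50, 500, 500))                # 0.650 -> IREP prefers R1
print(v_irep(100, 5, 500, 500))                 # 0.595

# (c) Same coverage counts with RIPPER's measure.
print(round(v_ripper(200, 50), 3))              # 0.6
print(round(v_ripper(100, 5), 3))               # 0.905 -> RIPPER prefers R2
```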
3. C4.5rules is an implementation of an indirect method for generating rules from a decision tree. RIPPER is an implementation of a direct method for generating rules directly from data.
(a) Discuss the strengths and weaknesses of both methods.
Answer: The C4.5rules algorithm generates classification rules from a global perspective. This is because the rules are derived from decision trees, which are induced with the objective of partitioning the feature space into homogeneous regions, without focusing on any classes. In contrast, RIPPER generates rules one-class-at-a-time. Thus, it is more biased towards the classes that are generated first.
(b) Consider a data set that has a large difference in the class size (i.e., some classes are much bigger than others). Which method (between C4.5rules and RIPPER) is better in terms of finding high accuracy rules for the small classes?
Answer: The class-ordering scheme used by C4.5rules has an easier interpretation than the scheme used by RIPPER.

4. Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,
R1: A −→ + (covers 4 positive and 1 negative examples),
R2: B −→ + (covers 30 positive and 10 negative examples),
R3: C −→ + (covers 100 positive and 90 negative examples),
determine which is the best and worst candidate rule according to:
(a) Rule accuracy.
Answer: The accuracies of the rules are 80% (for R1), 75% (for R2), and 52.6% (for R3), respectively. Therefore R1 is the best candidate and R3 is the worst candidate according to rule accuracy.
(b) FOIL's information gain.
Answer: Assume the initial rule is ∅ −→ +. This rule covers p0 = 100 positive examples and n0 = 400 negative examples. The rule R1 covers p1 = 4 positive examples and n1 = 1 negative example. Therefore, the FOIL's information gain for this rule is:
(c) The likelihood ratio statistic.
Answer: For R1, the expected frequency for the positive class is 5 × 100/500 = 1 and the expected frequency for the negative class is 5 × 400/500 = 4. Therefore, the likelihood ratio for R1 is:
(d) The Laplace measure.
Answer: The Laplace measure of the rules are 71.43% (for R1), 73.81% (for R2), and 52.6% (for R3), respectively. Therefore R2 is the best candidate and R3 is the worst candidate according to the Laplace measure.
(e) The m-estimate measure (with k = 2 and p+ = 0.2).
Answer: The m-estimate measure of the rules are 62.86% (for R1), 73.38% (for R2), and 52.3% (for R3), respectively. Therefore R2 is the best candidate and R3 is the worst candidate according to the m-estimate measure.

5. Figure 4.1 illustrates the coverage of the classification rules R1, R2, and R3. Determine which is the best and worst rule according to:
(b) The Laplace measure.
Answer: The Laplace measure for the rules are 76.47% (for R1), 66.67% (for R2), and 64.29% (for R3), respectively. Therefore R1 is the best rule and R3 is the worst rule according to the Laplace measure.
(c) The m-estimate measure (with k = 2 and p+ = 0.58).
Answer: The m-estimate measure for the rules are 77.41% (for R1), 68.0% (for R2), and 65.43% (for R3), respectively. Therefore R1 is the best rule and R3 is the worst rule according to the m-estimate measure.
(d) The rule accuracy after R1 has been discovered, where none of the examples covered by R1 are discarded.
Answer: If the examples for R1 are not discarded, then R2 will be chosen because it has a higher accuracy (70%) than R3 (66.7%).
(e) The rule accuracy after R1 has been discovered, where only the positive examples covered by R1 are discarded.
Answer: If the positive examples covered by R1 are discarded, the new accuracies for R2 and R3 are 70% and 60%, respectively. Therefore R2 is preferred over R3.
(f) The rule accuracy after R1 has been discovered, where both positive and negative examples covered by R1 are discarded.
Answer: If the positive and negative examples covered by R1 are discarded, the new accuracies for R2 and R3 are 70% and 75%, respectively. In this case, R3 is preferred over R2.

6. (a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?
(b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?
Answer: An undergraduate student, because P(UG) > P(G).
(c) Repeat part (b) assuming that the student is a smoker.
Answer: An undergraduate student, because P(UG|S) > P(G|S).
(d) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.
Answer: First, we need to estimate all the probabilities.
P(D|UG) = 0.1, P(D|G) = 0.3.
P(D) = P(UG)·P(D|UG) + P(G)·P(D|G) = 0.8 × 0.1 + 0.2 × 0.3 = 0.14.
P(S) = P(S|UG)P(UG) + P(S|G)P(G) = 0.15 × 0.8 + 0.23 × 0.2 = 0.166.
P(DS|G) = P(D|G) × P(S|G) = 0.3 × 0.23 = 0.069 (using the conditional independence assumption).
P(DS|UG) = P(D|UG) × P(S|UG) = 0.1 × 0.15 = 0.015.
We need to compute P(G|DS) and P(UG|DS).
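The answer to part (a) and the final comparison in part (d) are not filled in above. The sketch below completes the arithmetic using only the probabilities already listed in the answer to part (d); treat the printed values as a check, since they are my computation rather than the manual's.

```python
# Priors and conditionals taken from the text above.
P_G, P_UG = 0.2, 0.8
P_S_G, P_S_UG = 0.23, 0.15
P_D_G, P_D_UG = 0.30, 0.10

# (a) P(G|S) via Bayes' theorem.
P_S = P_S_G * P_G + P_S_UG * P_UG          # 0.166
print(round(P_S_G * P_G / P_S, 3))         # about 0.277

# (d) Compare P(G|D,S) with P(UG|D,S); D and S are assumed conditionally
# independent given the student type, as stated in the question.
num_G = P_D_G * P_S_G * P_G                # 0.069 * 0.2 = 0.0138
num_UG = P_D_UG * P_S_UG * P_UG            # 0.015 * 0.8 = 0.0120
print(num_G, num_UG)
print("graduate" if num_G > num_UG else "undergraduate")   # graduate
```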

7. Consider the data set shown in Table 4.1.
(a) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and P(C|−).
Answer:
P(A = 1|−) = 2/5 = 0.4, P(B = 1|−) = 2/5 = 0.4,
P(C = 1|−) = 1, P(A = 0|−) = 3/5 = 0.6,
P(B = 0|−) = 3/5 = 0.6, P(C = 0|−) = 0;
P(A = 1|+) = 3/5 = 0.6, P(B = 1|+) = 1/5 = 0.2, P(C = 1|+) = 2/5 = 0.4,
P(A = 0|+) = 2/5 = 0.4, P(B = 0|+) = 4/5 = 0.8, P(C = 0|+) = 3/5 = 0.6.
(b) Use the estimate of conditional probabilities given in the previous question to predict the class label for a test sample (A = 0, B = 1, C = 0) using the naïve Bayes approach.
Answer:
(c) Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.
Answer:
P(A = 0|+) = (2 + 2)/(5 + 4) = 4/9,
P(A = 0|−) = (3 + 2)/(5 + 4) = 5/9,
P(B = 1|+) = (1 + 2)/(5 + 4) = 3/9,
P(B = 1|−) = (2 + 2)/(5 + 4) = 4/9,
P(C = 0|+) = (3 + 2)/(5 + 4) = 5/9,
P(C = 0|−) = (0 + 2)/(5 + 4) = 2/9.
(d) Repeat part (b) using the conditional probabilities given in part (c).
Answer: Let P(A = 0, B = 1, C = 0) = K.
(e) Compare the two methods for estimating probabilities. Which method is better and why?
Answer: When one of the conditional probabilities is zero, the estimate of conditional probabilities using the m-estimate approach is better, since we do not want the entire expression to become zero.

8. Consider the data set shown in Table 4.2.
(a) Estimate the conditional probabilities for P(A = 1|+), P(B = 1|+), P(C = 1|+), P(A = 1|−), P(B = 1|−), and P(C = 1|−) using the same approach as in the previous problem.
Answer: P(A = 1|+) = 0.6, P(B = 1|+) = 0.4, P(C = 1|+) = 0.8, P(A = 1|−) = 0.4, P(B = 1|−) = 0.4, and P(C = 1|−) = 0.2.
(b) Use the conditional probabilities in part (a) to predict the class label for a test sample (A = 1, B = 1, C = 1) using the naïve Bayes approach.
Answer: Let R : (A = 1, B = 1, C = 1) be the test record. To determine its class, we need to compute P(+|R) and P(−|R). Using Bayes theorem, P(+|R) = P(R|+)P(+)/P(R) and P(−|R) = P(R|−)P(−)/P(R). Since P(+) = P(−) = 0.5 and P(R) is constant, R can be classified by comparing P(+|R) and P(−|R).
For this question,
P(R|+) = P(A = 1|+) × P(B = 1|+) × P(C = 1|+) = 0.192
P(R|−) = P(A = 1|−) × P(B = 1|−) × P(C = 1|−) = 0.032
Since P(R|+) is larger, the record is assigned to the (+) class.
(c) Compare P(A = 1), P(B = 1), and P(A = 1, B = 1). State the relationships between A and B.
Answer: P(A = 1) = 0.5, P(B = 1) = 0.4 and P(A = 1, B = 1) = P(A) × P(B) = 0.2. Therefore, A and B are independent.
(d) Repeat the analysis in part (c) using P(A = 1), P(B = 0), and P(A = 1, B = 0).
Answer: P(A = 1) = 0.5, P(B = 0) = 0.6, and P(A = 1, B = 0) = P(A = 1) × P(B = 0) = 0.3. A and B are still independent.
(e) Compare P(A = 1, B = 1|Class = +) against P(A = 1|Class = +) and P(B = 1|Class = +). Are the variables conditionally independent given the class?
Answer: Compare P(A = 1, B = 1|+) = 0.2 against P(A = 1|+) = 0.6 and P(B = 1|Class = +) = 0.4. Since the product of P(A = 1|+) and P(B = 1|+) is not the same as P(A = 1, B = 1|+), A and B are not conditionally independent given the class.
9. (a) Explain how naïve Bayes performs on the data set shown in Figure 4.2.
Answer: NB will not do well on this data set because the conditional probabilities for each distinguishing attribute given the class are the same for both class A and class B.
(b) If each class is further divided such that there are four classes (A1, A2, B1, and B2), will naïve Bayes perform better?
Answer: The performance of NB will improve on the subclasses because the product of conditional probabilities among the distinguishing attributes will be different for each subclass.
(c) How will a decision tree perform on this data set (for the two-class problem)? What if there are four classes?
Answer: For the two-class problem, the decision tree will not perform well because the entropy will not improve after splitting the data using the distinguishing attributes. If there are four classes, then the decision tree will improve considerably.

10. Figure 4.3 illustrates the Bayesian belief network for the data set shown in Table 4.3. (Assume that all the attributes are binary.)
(a) Draw the probability table for each node in the network.
P(Mileage=Hi) = 0.5
P(Air Cond=Working) = 0.625
P(Engine=Good|Mileage=Hi) = 0.5
P(Engine=Good|Mileage=Lo) = 0.75
P(Value=High|Engine=Good, Air Cond=Working) = 0.750
P(Value=High|Engine=Good, Air Cond=Broken) = 0.667
P(Value=High|Engine=Bad, Air Cond=Working) = 0.222
P(Value=High|Engine=Bad, Air Cond=Broken) = 0
(b) Use the Bayesian network to compute P(Engine = Bad, Air Conditioner = Broken).

11. Given the Bayesian network shown in Figure 4.4, compute the following probabilities:
(a) P(B = good, F = empty, G = empty, S = yes).
Answer:
P(B = good, F = empty, G = empty, S = yes)
= P(B = good) × P(F = empty) × P(G = empty|B = good, F = empty) × P(S = yes|B = good, F = empty)
= 0.9 × 0.2 × 0.8 × 0.2 = 0.0288.

12. Consider the one-dimensional data set shown in Table 4.4.
(a) Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote).
Answer: 1-nearest neighbor: +, 3-nearest neighbor: −, 5-nearest neighbor: +, 9-nearest neighbor: −.
(b) Repeat the previous analysis using the distance-weighted voting approach described in Section 4.3.1.
Answer: 1-nearest neighbor: +, 3-nearest neighbor: +, 5-nearest neighbor: +, 9-nearest neighbor: +.

13. The nearest-neighbor algorithm described in Section 4.3 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Exemplar-Based Learning System) by Cost and Salzberg [2] measures the distance between two values of a nominal attribute using the modified value difference metric (MVDM). Given a pair of nominal attribute values, V1 and V2, the distance between them is defined as follows:
where nij is the number of examples from class i with attribute value Vj and nj is the number of examples with attribute value Vj. Consider the training set for the loan classification problem shown in Figure 4.9. Use the MVDM measure to compute the distance between every pair of attribute values for the Home Owner and Marital Status attributes.
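The MVDM formula in exercise 13 did not survive extraction; it is usually written as d(V1, V2) = Σ_i |n_i1/n_1 − n_i2/n_2|, summing over the classes i (this exact form is my assumption, based on the standard textbook definition). The sketch below implements that form and runs it on hypothetical class counts, since the loan data from Figure 4.9 is not reproduced here.

```python
def mvdm(counts_v1, counts_v2):
    """Modified value difference metric between two nominal attribute values.

    counts_v1[i] and counts_v2[i] are the numbers of training examples of
    class i that take value V1 and V2, respectively, for the attribute.
    """
    n1, n2 = sum(counts_v1), sum(counts_v2)
    return sum(abs(a / n1 - b / n2) for a, b in zip(counts_v1, counts_v2))

# Hypothetical counts (NOT the ones from Figure 4.9): suppose value "Yes"
# has class counts (0, 3) and value "No" has class counts (3, 4).
print(round(mvdm([0, 3], [3, 4]), 4))   # |0/3 - 3/7| + |3/3 - 4/7| = 6/7 = 0.8571
```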
14. For each of the Boolean functions given below, state whether the problem is linearly separable.
(a) A AND B AND C
Answer: Yes
(b) NOT A AND B
Answer: Yes
(c) (A OR B) AND (A OR C)
Answer: Yes
(d) (A XOR B) AND (A OR B)
Answer: No

15. (a) Demonstrate how the perceptron model can be used to represent the AND and OR functions between a pair of Boolean variables.
Answer: Let x1 and x2 be a pair of Boolean variables and y be the output. For the AND function, a possible perceptron model is:
(b) Comment on the disadvantage of using linear functions as activation functions for multilayer neural networks.
Answer: Multilayer neural networks are useful for modeling nonlinear relationships between the input and output attributes. However, if linear functions are used as activation functions (instead of sigmoid or hyperbolic tangent functions), the output is still a linear combination of its input attributes. Such a network is just as expressive as a perceptron.

16. You are asked to evaluate the performance of two classification models, M1 and M2. The test set you have chosen contains 26 binary attributes, labeled as A through Z. Table 4.5 shows the posterior probabilities obtained by applying the models to the test set. (Only the posterior probabilities for the positive class are shown.) As this is a two-class problem, P(−) = 1 − P(+) and P(−|A, . . . , Z) = 1 − P(+|A, . . . , Z). Assume that we are mostly interested in detecting instances from the positive class.
(a) Plot the ROC curve for both M1 and M2. (You should plot them on the same graph.) Which model do you think is better? Explain your reasons.
Answer: M1 is better, since its area under the ROC curve is larger than the area under the ROC curve for M2.
(b) For model M1, suppose you choose the cutoff threshold to be t = 0.5. In other words, any test instances whose posterior probability is greater than t will be classified as a positive example. Compute the precision, recall, and F-measure for the model at this threshold value.
Answer: When t = 0.5, the confusion matrix for M1 is shown below.
Precision = 3/4 = 75%.
Recall = 3/5 = 60%.
F-measure = (2 × .75 × .6)/(.75 + .6) = 0.667.
(c) Repeat the analysis for part (b) using the same cutoff threshold on model M2. Compare the F-measure results for both models. Which model is better? Are the results consistent with what you expect from the ROC curve?
Answer: When t = 0.5, the confusion matrix for M2 is shown below.
Precision = 1/2 = 50%.
Recall = 1/5 = 20%.
F-measure = (2 × .5 × .2)/(.5 + .2) = 0.2857.
Based on F-measure, M1 is still better than M2. This result is consistent with the ROC plot.
(d) Repeat part (c) for model M1 using the threshold t = 0.1. Which threshold do you prefer, t = 0.5 or t = 0.1? Are the results consistent with what you expect from the ROC curve?
Answer: When t = 0.1, the confusion matrix for M1 is shown below.
Precision = 5/9 = 55.6%.
Recall = 5/5 = 100%.
F-measure = (2 × .556 × 1)/(.556 + 1) = 0.715.
According to F-measure, t = 0.1 is better than t = 0.5.
When t = 0.1, FPR = 0.8 and TPR = 1. On the other hand, when t = 0.5, FPR = 0.2 and TPR = 0.6. Since (0.2, 0.6) is closer to the point (0, 1), we favor t = 0.5. This result is inconsistent with the results using F-measure. We can also show this by computing the area under the ROC curve:
For t = 0.5, area = 0.6 × (1 − 0.2) = 0.6 × 0.8 = 0.48.
For t = 0.1, area = 1 × (1 − 0.8) = 1 × 0.2 = 0.2.
Since the area for t = 0.5 is larger than the area for t = 0.1, we prefer t = 0.5.

17. Following is a data set that contains two attributes, X and Y, and two class labels, "+" and "−". Each attribute can take three different values: 0, 1, or 2. The concept for the "+" class is Y = 1 and the concept for the "−" class is X = 0 ∨ X = 2.
(a) Build a decision tree on the data set. Does the tree capture the "+" and "−" concepts?
Answer: There are 30 positive and 600 negative examples in the data. Therefore, at the root node, the error rate is Eorig = 1 − max(30/630, 600/630) = 30/630. If we split on X, the gain in error rate is:
(b) What are the accuracy, precision, recall, and F1-measure of the decision tree? (Note that precision, recall, and F1-measure are defined with respect to the "+" class.)
Answer: The confusion matrix on the training data:
(d) What are the accuracy, precision, recall, and F1-measure of the new decision tree?
Answer: The confusion matrix of the new tree:
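The precision, recall, and F-measure values in exercise 16 come from small confusion matrices that are not reproduced in this copy; the counts in the sketch below are reconstructed from the quoted fractions (3/4, 3/5, and so on), so treat them as an inference rather than the manual's tables. The same helper applies to exercise 17(b) and (d), whose confusion matrices are likewise missing.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# M1 at t = 0.5: precision 3/4, recall 3/5  -> TP = 3, FP = 1, FN = 2.
print([round(x, 3) for x in prf(3, 1, 2)])   # [0.75, 0.6, 0.667]
# M2 at t = 0.5: precision 1/2, recall 1/5  -> TP = 1, FP = 1, FN = 4.
print([round(x, 3) for x in prf(1, 1, 4)])   # [0.5, 0.2, 0.286]
# M1 at t = 0.1: precision 5/9, recall 5/5  -> TP = 5, FP = 4, FN = 0.
print([round(x, 3) for x in prf(5, 4, 0)])   # [0.556, 1.0, 0.714]
# (the 0.715 quoted above comes from rounding precision to .556 before combining)
```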
18. Consider the task of building a classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains records from two classes, "+" and "−." Half of the dataset is used for training while the remaining half is used for testing.
(a) Suppose there are an equal number of positive and negative records in the data and the decision tree classifier predicts every test record to be positive. What is the expected error rate of the classifier on the test data?
Answer: 50%.
(b) Repeat the previous analysis assuming that the classifier predicts
each test record to be positive class with probability 0.8 and negative
class with probability 0.2.
Answer: 50%.
(c) Suppose two-thirds of the data belong to the positive class and the
remaining one-third belong to the negative class. What is the expected
error of a classifier that predicts every test record to be positive?
Answer: 33%.
(d) Repeat the previous analysis assuming that the classifier predicts
each test record to be positive class with probability 2/3 and negative
class with probability 1/3.
Answer: 44.4%.
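The four expected error rates above follow from the same short calculation; here is a small sketch (my addition) that reproduces them.

```python
def expected_error(p_pos, predict_pos):
    """Expected error when the true positive fraction is p_pos and the classifier
    predicts 'positive' with probability predict_pos, independently of the record."""
    return p_pos * (1 - predict_pos) + (1 - p_pos) * predict_pos

print(expected_error(0.5, 1.0))              # (a) 0.5
print(expected_error(0.5, 0.8))              # (b) 0.5
print(round(expected_error(2/3, 1.0), 3))    # (c) 0.333
print(round(expected_error(2/3, 2/3), 3))    # (d) 0.444
```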
19. Derive the dual Lagrangian for the linear SVM with nonseparable data where the objective function is:

20. Consider the XOR problem where there are four training points:
(1, 1, −), (1, 0, +), (0, 1, +), (0, 0, −).
Transform the data into the following feature space:
Find the maximum margin linear decision boundary in the transformed space.
Answer: The decision boundary is f(x1, x2) = x1x2.

21. Given the data sets shown in Figure 4.6, explain how the decision tree, naïve Bayes, and k-nearest neighbor classifiers would perform on these data sets.
Answer:
(a) Both decision tree and NB will do well on this data set because the distinguishing attributes have better discriminating power than the noise attributes in terms of entropy gain and conditional probability. k-NN will not do as well due to the relatively large number of noise attributes.
(b) NB will not work at all with this data set due to attribute dependency. Other schemes will do better than NB.
(c) NB will do very well on this data set, because each discriminating attribute has higher conditional probability in one class over the other and the overall classification is done by multiplying these individual conditional probabilities. Decision tree will not do as well, due to the relatively large number of distinguishing attributes. It will have an overfitting problem. k-NN will do reasonably well.
(d) k-NN will do well on this data set. Decision trees will also work, but will result in a fairly large decision tree. The first few splits will be quite random, because it may not find a good initial split at the beginning. NB will not perform quite as well due to the attribute dependency.
(e) k-NN will do well on this data set. Decision trees will also work, but will result in a large decision tree. If the decision tree uses an oblique split instead of just vertical and horizontal splits, then the resulting decision tree will be more compact and highly accurate. NB will not perform quite as well due to attribute dependency.
(f) k-NN works the best. NB does not work well for this data set due to attribute dependency. Decision tree will have a large tree in order to capture the circular decision boundaries.

CHAPTER – 5 (ASSOCIATION ANALYSIS)

1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are subjectively interesting.
(a) A rule that has high support and high confidence.
Answer: Milk −→ Bread. Such an obvious rule tends to be uninteresting.
(b) A rule that has reasonably high support but low confidence.
Answer: Milk −→ Tuna. While the sale of tuna and milk may be higher than the support threshold, not all transactions that contain milk also contain tuna. Such a low-confidence rule tends to be uninteresting.
(c) A rule that has low support and low confidence.
Answer: Cooking oil −→ Laundry detergent. Such a low-confidence rule tends to be uninteresting.
(d) A rule that has low support and high confidence.
Answer: Vodka −→ Caviar. Such a rule tends to be interesting.
2. Consider the data set shown in Table 5.1.
(a) Compute the support for itemsets {e}, {b, d}, and {b, d, e} by treating each transaction ID as a market basket.
Answer:
(b) Use the results in part (a) to compute the confidence for the association rules {b, d} −→ {e} and {e} −→ {b, d}. Is confidence a symmetric measure?
Answer:
(c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise.)
Answer:
(d) Use the results in part (c) to compute the confidence for the association rules {b, d} −→ {e} and {e} −→ {b, d}.
Answer:
(e) Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.
Answer: There are no apparent relationships between s1, s2, c1, and c2.

3. (a) What is the confidence for the rules ∅ −→ A and A −→ ∅?
Answer:
c(∅ −→ A) = s(∅ −→ A).
c(A −→ ∅) = 100%.
(b) Let c1, c2, and c3 be the confidence values of the rules {p} −→ {q}, {p} −→ {q, r}, and {p, r} −→ {q}, respectively. If we assume that c1, c2, and c3 have different values, what are the possible relationships that may exist among c1, c2, and c3? Which rule has the lowest confidence?
(c) Repeat the analysis in part (b) assuming that the rules have identical support. Which rule has the highest confidence?
Answer:
Considering s(p ∪ q) = s(p ∪ q ∪ r) but s(p) ≥ s(p ∪ r),
thus c3 ≥ (c1 = c2).
Either all rules have the same confidence or c3 has the highest confidence.
(d) Transitivity: Suppose the confidence of the rules A −→ B and B −→ C are larger than some threshold, minconf. Is it possible that A −→ C has a confidence less than minconf?
Answer: Yes, it depends on the support of items A, B, and C.
For example:
s(A,B) = 60%   s(A) = 90%
s(A,C) = 20%   s(B) = 70%
s(B,C) = 50%   s(C) = 60%
Let minconf = 50%. Therefore:
c(A → B) = 66% > minconf
c(B → C) = 71% > minconf
But c(A → C) = 22% < minconf
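A quick numeric check of the counterexample in part (d), using the supports listed above (my addition).

```python
s = {"A": 0.9, "B": 0.7, "C": 0.6, "AB": 0.6, "BC": 0.5, "AC": 0.2}

def conf(joint, antecedent):
    """Rule confidence = support of the joint itemset / support of the antecedent."""
    return s[joint] / s[antecedent]

minconf = 0.5
print(round(conf("AB", "A"), 2))   # 0.67 > minconf
print(round(conf("BC", "B"), 2))   # 0.71 > minconf
print(round(conf("AC", "A"), 2))   # 0.22 < minconf, so confidence is not transitive
```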
4. For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone).
(a) A characteristic rule is a rule of the form {p} −→ {q1, q2, . . . , qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset:
(b) A discriminant rule is a rule of the form {p1, p2, . . . , pn} −→ {q}, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let η be the minimum confidence of all discriminant rules generated from a given itemset:
(c) Repeat the analysis in parts (a) and (b) by replacing the min function with a max function.
Answer: Since s(A,B,C) ≤ s(A,B) and min(s(A,B), s(A,C), s(B,C)) ≤ min(s(A), s(B), s(C)) ≤ min(s(A), s(B)), η′({A,B,C}) can be greater than or less than η′({A,B}). Hence, the measure is non-monotone.

5. Prove Equation 5.3. (Hint: First, count the number of ways to create an itemset that forms the left-hand side of the rule. Next, for each size k itemset selected for the left-hand side, count the number of ways to choose the remaining d − k items to form the right-hand side of the rule.)

6. Consider the market basket transactions shown in Table 5.2.
(a) What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?
Answer: There are six items in the data set. Therefore the total number of rules is 602.
(b) What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?
Answer: Because the longest transaction contains 4 items, the maximum size of frequent itemset is 4.
(c) Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.
(d) Find an itemset (of size 2 or larger) that has the largest support.
Answer: {Bread, Butter}.
(e) Find a pair of items, a and b, such that the rules {a} −→ {b} and {b} −→ {a} have the same confidence.
Answer: (Beer, Cookies) or (Bread, Butter).

7. Show that if a candidate k-itemset X has a subset of size less than k − 1 that is infrequent, then at least one of the (k − 1)-size subsets of X is necessarily infrequent.

8. Consider the following set of frequent 3-itemsets:
{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}, {2, 3, 5}, {3, 4, 5}.
Assume that there are only five items in the data set.
(a) List all candidate 4-itemsets obtained by a candidate generation procedure using the Fk−1 × F1 merging strategy.
Answer: {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5}.
(b) List all candidate 4-itemsets obtained by the candidate generation procedure in Apriori.
Answer: {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5}.
(c) List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm.
Answer: {1, 2, 3, 4}

9. The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k + 1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the data set shown in Table 5.3 with minsup = 30%, i.e., any itemset occurring in less than 3 transactions is considered to be infrequent.
(a) Draw an itemset lattice representing the data set given in Table 5.3. Label each node in the lattice with the following letter(s):
. N: If the itemset is not considered to be a candidate itemset by the Apriori algorithm. There are two reasons for an itemset not to be considered as a candidate itemset: (1) it is not generated at all during the candidate generation step, or (2) it is generated during the candidate generation step but is subsequently removed during the candidate pruning step because one of its subsets is found to be infrequent.
. F: If the candidate itemset is found to be frequent by the Apriori algorithm.
. I: If the candidate itemset is found to be infrequent after support counting.
Answer:
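The count of 602 in exercise 6(a) comes from the rule-count formula that exercise 5 asks you to prove, R = 3^d − 2^(d+1) + 1 for d items; the formula here is quoted from the textbook rather than from this copy, so treat it as an assumption. The sketch below evaluates it for d = 6.

```python
def max_rules(d):
    """Maximum number of association rules over d items (Equation 5.3):
    pick a non-empty antecedent, then a non-empty consequent from the rest."""
    return 3 ** d - 2 ** (d + 1) + 1

print(max_rules(6))   # 602, matching exercise 6(a)
```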
(b) What is the percentage of frequent itemsets (with respect to all itemsets in the lattice)?
Answer: Percentage of frequent itemsets = 16/32 = 50.0% (including the null set).
(c) What is the pruning ratio of the Apriori algorithm on this data set? (Pruning ratio is defined as the percentage of itemsets not considered to be a candidate because (1) they are not generated during candidate generation or (2) they are pruned during the candidate pruning step.)
Answer: Pruning ratio is the ratio of N to the total number of itemsets. Since the count of N = 11, the pruning ratio is 11/32 = 34.4%.
(d) What is the false alarm rate (i.e., the percentage of candidate itemsets that are found to be infrequent after performing support counting)?
Answer: The false alarm rate is the ratio of I to the total number of itemsets. Since the count of I = 5, the false alarm rate is 5/32 = 15.6%.

10. The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 5.2.
(a) Given a transaction that contains items {1, 3, 4, 5, 8}, which of the hash tree leaf nodes will be visited when finding the candidates of the transaction?
Answer: The leaf nodes visited are L1, L3, L5, L9, and L11.
(b) Use the visited leaf nodes in part (a) to determine the candidate itemsets that are contained in the transaction {1, 3, 4, 5, 8}.
Answer: The candidates contained in the transaction are {1, 4, 5}, {1, 5, 8}, and {4, 5, 8}.

11. Consider the following set of candidate 3-itemsets:
{1, 2, 3}, {1, 2, 6}, {1, 3, 4}, {2, 3, 4}, {2, 4, 5}, {3, 4, 6}, {4, 5, 6}
(a) Construct a hash tree for the above candidate 3-itemsets. Assume the tree uses a hash function where all odd-numbered items are hashed to the left child of a node, while the even-numbered items are hashed to the right child. A candidate k-itemset is inserted into the tree by hashing on each successive item in the candidate and then following the appropriate branch of the tree according to the hash value. Once a leaf node is reached, the candidate is inserted based on one of the following conditions:
Condition 1: If the depth of the leaf node is equal to k (the root is assumed to be at depth 0), then the candidate is inserted regardless of the number of itemsets already stored at the node.
Condition 2: If the depth of the leaf node is less than k, then the candidate can be inserted as long as the number of itemsets stored at the node is less than maxsize. Assume maxsize = 2 for this question.
Condition 3: If the depth of the leaf node is less than k and the number of itemsets stored at the node is equal to maxsize, then the leaf node is converted into an internal node. New leaf nodes are created as children of the old leaf node. Candidate itemsets previously stored in the old leaf node are distributed to the children based on their hash values. The new candidate is also hashed to its appropriate leaf node.
Answer:
(b) How many leaf nodes are there in the candidate hash tree? How many internal nodes are there?
Answer: There are 5 leaf nodes and 4 internal nodes.
(c) Consider a transaction that contains the following items: {1, 2, 3, 5, 6}. Using the hash tree constructed in part (a), which leaf nodes will be checked against the transaction? What are the candidate 3-itemsets contained in the transaction?
Answer: The leaf nodes L1, L2, L3, and L4 will be checked against the transaction. The candidate itemsets contained in the transaction include {1,2,3} and {1,2,6}.

12. Given the lattice structure shown in Figure 5.4 and the transactions given in Table 5.3, label each node with the following letter(s):
. M if the node is a maximal frequent itemset,
. C if it is a closed frequent itemset,
. N if it is frequent but neither maximal nor closed, and
. I if it is infrequent.
Assume that the support threshold is equal to 30%.

13. The original association rule mining formulation uses the support and confidence measures to prune uninteresting rules.
(a) Draw a contingency table for each of the following rules using the transactions shown in Table 5.4.
Rules: {b} −→ {c}, {a} −→ {d}, {b} −→ {d}, {e} −→ {c}, {c} −→ {a}.
(b) Use the contingency tables in part (a) to compute and rank the rules
in decreasing order according to the following measures.
i. Support.
Answer:

ii. Confidence.

14. Given the rankings you had obtained in Exercise 13, compute the
correlation between the rankings of confidence and the other five
measures. Which measure is most highly correlated with confidence?
Which measure is least correlated with confidence?
Answer:
Correlation(Confidence, Support) = 0.97.
Correlation(Confidence, Interest) = 1.
Correlation(Confidence, IS) = 1.
Correlation(Confidence, Klosgen) = 0.7.
Correlation(Confidence, Odds Ratio) = -0.606.
Interest and IS are the most highly correlated with confidence, while odds ratio is the least correlated.
15. Answer the following questions using the data sets shown in Figure
5.6. Note that each data set contains 1000 items and 10,000
transactions. Dark cells indicate the presence of items and white cells
indicate the absence of items. We will apply the Apriori algorithm to
extract frequent itemsets with minsup = 10% (i.e., itemsets must be
contained in at least 1000 transactions)?
(a) Which data set(s) will produce the most number of frequent
itemsets?
Answer: Data set (e) because it has to generate the longest frequent
itemset along with its subsets.
(b) Which data set(s) will produce the fewest number of frequent
itemsets?
Answer: Data set (d) which does not produce any frequent itemsets
at 10% support threshold.
(c) Which data set(s) will produce the longest frequent itemset?
Answer: Data set (e).
(d) Which data set(s) will produce frequent itemsets with highest
maximum support?
Answer: Data set (b).
(e) Which data set(s) will produce frequent itemsets containing items
with wide-varying support levels (i.e., items with mixed support, ranging
from less than 20% to more than 70%).
Answer: Data set (e).
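Several of the exercises below (16, 18, 20, 21) turn on the ϕ coefficient and the interest factor of a 2 × 2 contingency table. Here is a small helper (my addition); the demo counts are chosen to be consistent with the market-basket numbers quoted in exercise 18 below (100 transactions, s(a) = 0.25, s(b) = 0.90, s(a, b) = 0.20), and the ϕ value it prints is my computation rather than a number stated in the text.

```python
from math import sqrt

def phi_and_interest(f11, f10, f01, f00):
    """phi coefficient and interest factor from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01          # row and column totals
    f0p, fp0 = n - f1p, n - fp1
    phi = (f11 * f00 - f10 * f01) / sqrt(f1p * fp1 * f0p * fp0)
    interest = n * f11 / (f1p * fp1)
    return phi, interest

phi, interest = phi_and_interest(20, 5, 70, 5)
print(round(interest, 3))   # 0.889, as quoted in exercise 18(b)
print(round(phi, 3))        # about -0.192: negative, consistent with the
                            # "negatively correlated" conclusion there
```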
16. (a) Prove that the ϕ coefficient is equal to 1 if and only if f11 = f1+ = f+1.
(b) Show that if A and B are independent, then P(A,B) × P(Ā,B̄) = P(A,B̄) × P(Ā,B).
Answer:

(c) How does M behave when P(A) is increased while P(A,B) and P(B) remain unchanged?
Answer: M decreases. Proof: The only term in M that changes is P(B|A) = P(A,B)/P(A), which decreases as P(A) is increased. This decreases the numerator and thus decreases M.
(d) How does M behave when P(B) is increased while P(A,B) and P(A) remain unchanged?
Answer: M decreases when P(B) increases. Proof: Let a = P(A,B)/P(A) = P(B|A) and y = P(B). Since P(A,B) and P(A) remain unchanged, a is a constant, i.e., P(B|A) is a constant. We treat the measure M as a function, f(y) = (a − y)/(1 − y). The derivative of f with respect to y is df(y)/dy = (a − 1)/(1 − y)². Since a = P(B|A) ≤ 1 and y = P(B) ≤ 1, the derivative is negative, except in the special case where P(B|A) = 1 or P(B) = 1, when the derivative is 0. Thus, M decreases as y = P(B) increases.
(f) What is the value of the measure when A and B are statistically independent?
Answer: 0. Proof: The numerator of M, P(B|A) − P(B), is 0, since the independence of A and B implies P(B|A) = P(B).
(h) Does the measure remain invariant under row or column scaling operations?
Answer: No. It is straightforward to find a counterexample.
(i) How does the measure behave under the inversion operation?
Answer: Inversion is equivalent to switching the roles of A and B. Proof: We will compute the value of M under inversion, call it M′. Thus, for the measure M, inversion is equivalent to switching the order (role) of A and B.

(d) Write a simplified expression for the value of each measure shown in Tables 6.11 and 6.12 when the variables are statistically independent.
Answer:
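The answers above treat the measure as M = (P(B|A) − P(B))/(1 − P(B)), which matches the f(y) = (a − y)/(1 − y) form used in the proof for part (d); under that reading, the sketch below checks the claimed behavior numerically (the specific probability values are illustrative, not from the text).

```python
def M(p_ab, p_a, p_b):
    """M = (P(B|A) - P(B)) / (1 - P(B)), with P(B|A) = P(A,B)/P(A)."""
    return (p_ab / p_a - p_b) / (1 - p_b)

# (c) Increasing P(A) with P(A,B) and P(B) fixed decreases M.
print(round(M(0.2, 0.4, 0.3), 3), round(M(0.2, 0.5, 0.3), 3))   # 0.286 -> 0.143

# (d) Increasing P(B) with P(A,B) and P(A) fixed also decreases M.
print(round(M(0.2, 0.4, 0.3), 3), round(M(0.2, 0.4, 0.4), 3))   # 0.286 -> 0.167

# (f) Under independence, P(A,B) = P(A)P(B) and M = 0.
print(M(0.12, 0.4, 0.3))                                        # 0.0
```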
18. Suppose we have market basket data consisting of 100 transactions
and 20 items. If the support for item a is 25%, the support for item b is
90% and the support for itemset {a, b} is 20%. Let the support and
confidence thresholds be 10% and 60%, respectively.
(a) Compute the confidence of the association rule {a} → {b}. Is the rule
interesting according to the confidence measure?
Answer: Confidence is 0.2/0.25 = 80%. The rule is interesting because it
exceeds the confidence threshold. iii. When C = 0 or C = 1, ϕ = 0.
(b) Compute the interest measure for the association pattern {a, b}. (b) What conclusions can you draw from the above result?
Describe the nature of the relationship between item a and item b in Answer: The result shows that some interesting relationships may disappear
terms of the interest measure. if the confounding factors are not taken into account.
Answer: The interest measure is 0.2/(0.25 * 0.9) = 0.889. The items are 20. Consider the contingency tables shown in Table 5.6.
negatively correlated according to interest measure.
(c) What conclusions can you draw from the results of parts (a) and (b)?
Answer: High confidence rules may not be interesting.
(a) Compute the odds ratios for both tables.
Answer: For Table 1, odds ratio = 1.4938.
For Table 2, the odds ratios are 0.8333 and 0.98.
(b) Compute the ϕ-coefficient for both tables.
Answer: For table 1, ϕ = 0.098.
For Table 2, the ϕ-coefficients are -0.0233 and -0.0047.
(c) Compute the interest factor for both tables.
Answer: For Table 1, I = 1.0784.
For Table 2, the interest factors are 0.88 and 0.9971.
For each of the measures given above, describe how the direction of
association changes when data is pooled together instead of being
stratified.
Answer: The direction of association changes sign (from negatively to positively correlated) when the data is pooled together.
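For reference, the three measures used in this exercise can be computed from a 2×2 contingency table as follows (a sketch; the counts passed in at the end are placeholders, not the entries of Table 5.6):

```python
import math

def table_measures(f11, f10, f01, f00):
    """Odds ratio, phi-coefficient, and interest factor for a 2x2 contingency table.

    f11 = count(A=1, B=1), f10 = count(A=1, B=0),
    f01 = count(A=0, B=1), f00 = count(A=0, B=0).
    """
    n = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01          # marginal totals for A=1 and B=1
    f0p, fp0 = f01 + f00, f10 + f00          # marginal totals for A=0 and B=0
    odds = (f11 * f00) / (f10 * f01)
    phi = (n * f11 - f1p * fp1) / math.sqrt(f1p * fp1 * f0p * fp0)
    interest = (n * f11) / (f1p * fp1)
    return odds, phi, interest

# Placeholder counts, not the entries of Table 5.6:
print(table_measures(60, 40, 40, 60))
```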
(a) For table I, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.
Answer: s(A) = 0.1, s(B) = 0.1, s(A,B) = 0.09.
I(A,B) = 9, ϕ(A,B) = 0.89.
c(A → B) = 0.9, c(B → A) = 0.9.
(b) For table II, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.
Answer: s(A) = 0.9, s(B) = 0.9, s(A,B) = 0.89.
I(A,B) = 1.09, ϕ(A,B) = 0.89.
c(A → B) = 0.98, c(B → A) = 0.98.
(c) What conclusions can you draw from the results of (a) and (b)?
Answer: Interest, support, and confidence are non-invariant while the ϕ-coefficient is invariant under the inversion operation. This is because the ϕ-coefficient takes into account the absence as well as the presence of an item in a transaction.
21. Consider the relationship between customers who buy high-definition televisions and exercise machines as shown in Tables 5.17 and 5.18.
19. Table 5.5 shows a 2*2*2 contingency table for the binary variables A and B at different values of the control variable C.

CHAPTER – 7 (CLUSTER ANALYSIS)
1. Consider a data set consisting of 2^20 data vectors, where each vector has 32 components and each component is a 4-byte value. Suppose that vector quantization is used for compression and that 2^16 prototype vectors are used. How many bytes of storage does that data set take before and after compression and what is the compression ratio?
Answer: Before compression, the data set takes 2^20 * 32 * 4 = 2^27 bytes. After compression, each vector is replaced by a 16-bit (2-byte) index into the table of 2^16 prototype vectors, which takes 2^20 * 2 = 2^21 bytes, and the prototype vectors themselves take 2^16 * 32 * 4 = 2^23 bytes, for a total of 2^21 + 2^23 bytes. The compression ratio is therefore 2^27 / (2^21 + 2^23) = 12.8.
2. Find all well-separated clusters in the set of points shown in Figure 7.35.
Answer:
Observations from Figure 7.35:
1. Cluster 1: The leftmost circle contains four closely grouped points.
2. Cluster 2: The middle group consists of a set of smaller, individual
clusters, each containing three or four points.
3. Cluster 3: The rightmost circle contains another group of four
points.
Solution:
The clusters are defined by the points enclosed within the circles. In this case,
the well-separated clusters are:
• Cluster 1: Points in the leftmost circle.
• Cluster 2: Points in each middle circle, potentially forming sub-
clusters within the larger middle group.
• Cluster 3: Points in the rightmost circle.
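For concreteness, here is a small sketch of the well-separated definition used above (each point must be closer to every point in its own cluster than to any point in another cluster); the coordinates below are invented and are not read off Figure 7.35:

```python
import numpy as np

def is_well_separated(points, labels):
    """True if every point is closer to all points of its own cluster
    than to any point of a different cluster."""
    points, labels = np.asarray(points, float), np.asarray(labels)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        other = labels != labels[i]
        if same.any() and other.any() and d[i, same].max() >= d[i, other].min():
            return False
    return True

# Invented example: two tight groups that are far apart.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(is_well_separated(pts, [0, 0, 0, 1, 1, 1]))   # True
```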
3. Many partitional clustering algorithms that automatically determine the number of clusters claim that this is an advantage. List two situations in which this is not the case.
(a) When there is hierarchical structure in the data. Most algorithms that automatically determine the number of clusters are partitional, and thus, ignore the possibility of subclusters.
(b) When clustering for utility. If a certain reduction in data size is needed, then it is necessary to specify how many clusters (cluster centroids) are produced.
4. (a) Plot the probability of obtaining one point from each cluster in a sample of size K for values of K between 2 and 100.
p = number of ways to select one centroid from each cluster / number of ways to select K centroids (7.20); for example, 4!/4^4 = 0.0938 for K = 4, and the probability is far smaller for K = 10 or 100.
The solution is shown in Figure 4. Note that the probability is essentially 0 by the time K = 10.
5. Identify the clusters in Figure 7.36 using the center-, contiguity-, and density-based definitions. Also indicate the number of clusters for each case and give a brief indication of your reasoning. Note that darkness or the number of dots indicates density. If it helps, assume center-based means K-means, contiguity-based means single link, and density-based means DBSCAN.
Answer:
(a) center-based: 2 clusters. The rectangular region will be split in half. Note that the noise is included in the two clusters.
contiguity-based: 1 cluster, because the two circular regions will be joined by noise.
density-based: 2 clusters, one for each circular region. Noise will be eliminated.
(b) center-based: 1 cluster that includes both rings.
contiguity-based: 2 clusters, one for each ring.
density-based: 2 clusters, one for each ring.
(c) center-based: 3 clusters, one for each triangular region. One cluster is also an acceptable answer.
contiguity-based: 1 cluster. The three triangular regions will be joined together because they touch.
density-based: 3 clusters, one for each triangular region. Even though the three triangles touch, the density in the region where they touch is lower than throughout the interior of the triangles.
(d) center-based: 2 clusters. The two groups of lines will be split in two.
contiguity-based: 5 clusters. Each set of lines that intertwines becomes a cluster.
density-based: 2 clusters. The two groups of lines define two regions of high density separated by a region of low density.
6. For the following sets of two-dimensional points, (1) provide a sketch of how they would be split into clusters by K-means for the given number of clusters and (2) indicate approximately where the resulting centroids would be. Assume that we are using the squared error objective function. If you think that there is more than one possible solution, then please indicate whether each solution is a global or local minimum. Note that the label of each diagram in Figure 7.37 matches the corresponding part of this question, e.g., Figure 7.37(a) goes with part (a).
Answer:
(a) K = 2. Assuming that the points are uniformly distributed in the circle, how many possible ways are there (in theory) to partition the points into two clusters? What can you say about the positions of the two centroids? (Again, you don’t need to provide exact centroid locations, just a qualitative description.)
In theory, there are an infinite number of ways to split the circle into two clusters - just take any line that bisects the circle. This line can make any angle 0◦ ≤ θ ≤ 180◦ with the x axis. The centroids will lie on the perpendicular bisector of the line that splits the circle into two clusters and will be symmetrically positioned. All these solutions will have the same, globally minimal, error.
(b) K = 3. The distance between the edges of the circles is slightly greater than the radii of the circles.
If you start with initial centroids that are real points, you will necessarily get this solution because of the restriction that the circles are more than one radius apart. Of course, the bisector could have any angle, as above, and it could be the other circle that is split. All these solutions have the same globally minimal error.
(c) K = 3. The distance between the edges of the circles is much less than the radii of the circles.
The three boxes show the three clusters that will result in the realistic case that the initial centroids are actual data points.
(d) K = 2.
In both cases, the rectangles show the clusters. In the first case, the two clusters are only a local minimum while in the second case the clusters represent a globally minimal solution.
(e) K = 3. Hint: Use the symmetry of the situation and remember that we are looking for a rough sketch of what the result would be.
For the solution shown in the top figure, the two top clusters are enclosed in two boxes, while the third cluster is enclosed by the regions defined by a triangle and a rectangle. (The two smaller clusters in the drawing are supposed to be symmetrical.) I believe that the second solution—suggested by a student—is also possible, although it is a local minimum and might rarely be seen in practice for this configuration of points. Note that while the two pie shaped cuts out of the larger circle are shown as meeting at a point, this is not necessarily the case—it depends on the exact positions and sizes of the circles. There could be a gap between the two pie shaped cuts which is filled by the third (larger) cluster. (Imagine the small circles on opposite sides.) Or the boundary between the two pie shaped cuts could actually be a line segment.
7. Suppose that for a data set
there are m points and K clusters,
half the points and clusters are in “more dense” regions,
half the points and clusters are in “less dense” regions, and
the two regions are well-separated from each other.
For the given data set, which of the following should occur in order to minimize the squared error when finding K clusters:
(a) Centroids should be equally distributed between more dense and less dense regions.
(b) More centroids should be allocated to the less dense region.
(c) More centroids should be allocated to the denser region.
Note: Do not get distracted by special cases or bring in factors other than density. However, if you feel the true answer is different from any given above, justify your response.
The correct answer is (b). Less dense regions require more centroids if the squared error is to be minimized.
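To make the last point concrete, here is a small numeric sketch (synthetic blobs with invented parameters, not the data of Figure 7.37). Each region is clustered separately, which approximates allocating centroids between the two well-separated regions; the allocation that gives the extra centroids to the less dense, more spread-out region ends up with the lower total SSE.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def blobs(centers, std, n=200):
    return np.vstack([c + std * rng.standard_normal((n, 2)) for c in centers])

dense  = blobs([(0, 0), (3, 0), (0, 3)], std=0.1)       # tight, closely spaced clusters
sparse = blobs([(20, 0), (28, 0), (20, 8)], std=1.0)    # spread-out clusters, far apart

def total_sse(k_dense, k_sparse):
    """Total SSE when k_dense centroids are spent on the dense region and k_sparse on the sparse one."""
    return sum(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for X, k in [(dense, k_dense), (sparse, k_sparse)])

print("4 centroids dense / 2 sparse:", round(total_sse(4, 2)))
print("2 centroids dense / 4 sparse:", round(total_sse(2, 4)))
# The second allocation has the markedly lower total SSE, matching the conclusion
# that the less dense region needs more centroids.
```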
8. Consider the mean of a cluster of objects from a binary transaction order dependent, for a fixed ordering of the objects, it always produces 15. Traditional agglomerative hierarchical clustering routines merge two
dataset. What are the minimum and maximum values of the components the same set of clusters. However, unlike K-means, it is not possible clusters at each step. Does it seem likely that such an approach
of the mean? What is the interpretation of components of the cluster to set the number of resulting clusters for the leader algorithm, except accurately captures the (nested) cluster structure of a set of data
mean? Which components most accurately characterize the objects in indirectly. Also, the K-means algorithm almost always produces better
points? If not, explain how you might postprocess the data to obtain a
the cluster? quality clusters as measured by SSE.
(a) The components of the mean range between 0 and 1. (b) Suggest ways in which the leader algorithm might be improved. more accurate view of the cluster structure.
(b) For any specific component, its value is the fraction of the objects in Use a sample to determine the distribution of distances between the (a) Such an approach does not accurately capture the nested cluster
the cluster that have a 1 for that component. If we have asymmetric points. The knowledge gained from this process can be used to more structure of the data. For example, consider a set of three clusters, each of
binary data, such as market basket data, then this can be viewed as the intelligently set the value of the threshold. The leader algorithm could be which has two, three, and four subclusters, respectively. An ideal hierarchical
probability that, for example, a customer in group represented by the modified to cluster for several thresholds clustering would have three branches from the root—one to each of the three
the cluster buys that particular item. during a single pass. main clusters—and then two, three, and four branches from each of these
(c) This depends on the type of data. For binary asymmetric data, the 13. The Voronoi diagram for a set of K points in the plane is a partition
components with higher values characterize the data, since, for most clusters, respectively. A traditional agglomerative approach cannot produce
of all the points of the plane into K regions, such that every point (of the
clusters, the vast majority of components will have values of zero. For such a structure.
plane) is assigned to the closest point among the K specified points—
regular binary data, such as the results of a true-false test, the significant (b) The simplest type of postprocessing would attempt to flatten the
see Figure 7.38 . What is the relationship between Voronoi diagrams and
components are those that are unusually high or low with respect hierarchical clustering by moving clusters up the tree.
to the entire set of data. K-means clusters? What do Voronoi diagrams tell us about the possible
9. Give an example of a data set consisting of three natural clusters, for shapes of Kmeansclusters?
16. Use the similarity matrix in Table 7.13 to perform single and
which (almost always) K-means would likely find the correct clusters, complete link hierarchical clustering. Show your results by drawing a
but bisecting K-means would not.
dendrogram. The dendrogram should clearly show the order in which
Consider a data set that consists of three circular clusters, that are identical
in terms of the number and distribution of points, and whose centers lie on the points are merged.
a line and are located such that the center of the middle cluster is equally
distant from the other two. Bisecting K-means would always split the middle
cluster during its first iteration, and thus, could never produce the correct
set of clusters. (Postprocessing could be applied to address this.)
10. Would the cosine measure be the appropriate similarity measure to
use with K-means clustering for time series data? Why or why not? If
not, what similarity measure would be more appropriate?
Time series data is dense high-dimensional data, and thus, the cosine
measure would not be appropriate since the cosine measure is appropriate
for sparse data. If the magnitude of a time series is important, then Euclidean
distance would be appropriate. If only the shapes of the time series are
important, then correlation would be appropriate. Note that if the comparison
of the time series needs to take into account that one time series might lead or
lag another or only be related to another during specific time periods, then
more sophisticated approaches to modeling time series similarity must be
used.
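A small illustration of these distinctions (the two series below are invented; y has the same shape as x but a different magnitude and baseline):

```python
import numpy as np

t = np.linspace(0, 4 * np.pi, 100)
x = np.sin(t)
y = 3.0 * np.sin(t) + 10.0          # same shape, different magnitude and offset

cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
euclidean = np.linalg.norm(x - y)
correlation = np.corrcoef(x, y)[0, 1]

print(f"cosine      = {cosine:.3f}")       # low: distorted by the offset and magnitude
print(f"euclidean   = {euclidean:.3f}")    # large: magnitudes differ
print(f"correlation = {correlation:.3f}")  # 1.0: the shapes are identical
```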
11. Total SSE is the sum of the SSE for each separate attribute. What
does it mean if the SSE for one variable is low for all clusters? Low for
just one cluster? High for all clusters? High for just one cluster? How
could you use the per variable SSE information to improve your
clustering?
(a) If the SSE of one attribute is low for all clusters, then the variable is
essentially a constant and of little use in dividing the data into groups.
(b) If the SSE of one attribute is relatively low for just one cluster, then
this attribute helps define the cluster. regions that represent the points closest to each centroid.
(c) If the SSE of an attribute is relatively high for all clusters, then it could (b) The boundaries between clusters are piecewise linear. It is possible to see
well mean that the attribute is noise.
this by drawing a line connecting two centroids and then drawing a
(d) If the SSE of an attribute is relatively high for one cluster, then it is
at odds with the information provided by the attributes with low SSE perpendicular to the line halfway between the centroids. This perpendicular
that define the cluster. It could merely be the case that the clusters line splits the plane into two regions, each containing points that are closest to
defined by this attribute are different from those defined by the other the centroid the region contains.
attributes, but in any case, it means that this attribute does not help
define the cluster. 14. You are given a data set with 100 records and are asked to cluster 17. Hierarchical clustering is sometimes used to generate K clusters, by
(e) The idea is to eliminate attributes that have poor distinguishing power the data. You use K-means to cluster the data, but for all values of K, the taking the clusters at the level of the dendrogram. (Root is at level 1.) By
between clusters, i.e., low or high SSE for all clusters, since they are
K-means algorithm returns only one non-empty cluster. You then apply looking at the clusters produced in this way, we can evaluate the
useless for clustering. Note that attributes with high SSE for all clusters
are particularly troublesome if they have a relatively high SSE with an incremental version of K-means, but obtain exactly the same result. behavior of hierarchical clustering on different types of data and
respect to other attributes (perhaps because of their scale) since they How is this possible? How would single link or DBSCAN handle such clusters, and also compare hierarchical approaches to K-means. The
introduce a lot of noise into the computation of the overall SSE. data? following is a set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}.
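A sketch of how the per-variable SSE described above could be computed (the data and attribute roles below are invented placeholders):

```python
import numpy as np

def per_attribute_sse(X, labels):
    """SSE of each attribute (column of X), summed over all clusters."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    sse = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        sse += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return sse                      # the total SSE is sse.sum()

# Invented example: attribute 0 separates the clusters, attribute 1 is noise.
rng = np.random.default_rng(1)
X = np.column_stack([
    np.r_[rng.normal(0, 0.2, 50), rng.normal(5, 0.2, 50)],   # informative attribute
    rng.normal(0, 2.0, 100),                                  # noisy for all clusters
])
labels = np.r_[np.zeros(50, int), np.ones(50, int)]
print(per_attribute_sse(X, labels))  # low for attribute 0, high for attribute 1
```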
12. The leader algorithm (Hartigan [533]) represents each cluster using a (a) The data consists completely of duplicates of one object. a. For each of the following sets of initial centroids, create two clusters
point, known as a leader, and assigns each point to the cluster (b) Single link (and many of the other agglomerative hierarchical schemes) by assigning each point to the nearest centroid, and then calculate the
corresponding to the closest leader, unless this distance is above a
would produce a hierarchical clustering, but which points appear in which total squared error for each set of two clusters. Show both the clusters
user-specified threshold. In that case, the point becomes the leader of a
new cluster. cluster would depend on the ordering of the points and the exact algorithm. and the total squared error for each set of centroids.
a. What are the advantages and disadvantages of the leader algorithm However, if the dendrogram were plotted showing the proximity at which each i. {18, 45}
as compared to K-means? object is merged, then it would be obvious that the data consisted of ii. {15, 40}
The leader algorithm requires only a single scan of the data and is duplicates. DBSCAN would find that all points were core points connected to
thus more computationally efficient since each object is compared to one another and produce a single cluster.
the final set of centroids at most once. Although the leader algorithm is
20. Consider the following four faces shown in Figure 7.39 . Again,
darkness or number of dots represents density. Lines are used only to
distinguish regions and do not represent points.
(a) For each figure, could you use single link to find the patterns
represented by the nose, eyes, and mouth? Explain.
Only for (b) and (d). For (b), the points in the nose, eyes, and mouth are much
(b) Do both sets of centroids represent stable solutions; i.e., if the K- closer together than the points between these areas. For (d) there is only
means algorithm was run on this set of points using the given centroids space between these regions.
as the starting centroids, would there be any change in the clusters (b) For each figure, could you use K-means to find the patterns
generated? (a) Data with very different sized clusters.
represented by the nose, eyes, and mouth? Explain.
Yes, both centroids are stable solutions. This can be a problem, particularly if the number of points in a cluster is small.
Only for (b) and (d). For (b), K-means would find the nose, eyes, and mouth,
(c) What are the two clusters produced by single link? For example, if we have a thousand points, with two clusters, one of size 900
but the lower density points would also be included. For (d), K - means would
The two clusters are {6, 12, 18, 24, 30} and {42, 48}. and one of size 100, and take a 5% sample, then we will, on average, end up
find the nose, eyes, and mouth straightforwardly as long as the number of
(d) Which technique, K-means or single link, seems to produce the with 45 points from the first cluster and 5 points from the second cluster. Five
clusters was set to 4.
“most natural” clustering in this situation? (For K-means, take the points is much easier to miss or cluster improperly than 50. Also, the second
(c) What limitation does clustering have in detecting all the patterns
clustering with the lowest squared error.) cluster will sometimes be represented by fewer than 5 points, just by the
formed by the points in Figure 7.7(c)?
MIN produces the most natural clustering. nature of random samples.
Clustering techniques can only find patterns of points, not of empty spaces.
(e) What definition(s) of clustering does this natural clustering (b) High-dimensional data.
correspond to? (Well-separated, center-based, contiguous, or density.) This can be a problem because data in high-dimensional space is typically 21. Compute the entropy and purity for the confusion matrix in Table
MIN produces contiguous clusters. However, density is also an acceptable sparse and more points may be needed to define the structure of a cluster in 7.14 .
answer. Even center-based is acceptable, since one set of centers gives the high-dimensional space.
desired clusters. (c) Data with outliers, i.e., atypical points.
(f) What well-known characteristic of the K-means algorithm explains By definition, outliers are not very frequent and most of them will be omitted
the previous behavior? when sampling. Thus, if finding the correct clustering depends on having the
K-means is not good at finding clusters of different sizes, at least when they outliers present, the clustering produced by sampling will likely be misleading.
are not well separated. The reason for this is that the objective of minimizing Otherwise, it is beneficial.
squared error causes it to “break” the larger cluster. Thus, in this problem, the (d) Data with highly irregular regions.
low error clustering solution is the “unnatural” one. This can be a problem because the structure of the border can be lost when
sampling unless a large number of points are sampled.
18. Suppose we find K clusters using Ward’s method, bisecting K- (e) Data with globular clusters.
means, and ordinary K-means. Which of these solutions represents a This is typically not a problem since not as many points need to be sampled
local or global minimum? Explain. to retain the structure of a globular cluster as an irregular one.
Although Ward’s method picks a pair of clusters to merge based on (f) Data with widely different densities.
minimizing SSE, there is no refinement step as in regular K-means. Likewise, In this case the data will tend to come from the denser region. Note that the
bisecting K-means has no overall refinement step. Thus, unless such a effect of sampling is to reduce the density of all clusters by the sampling
refinement step is added, neither Ward’s method nor bisecting K-means factor, e.g., if we take a 10% sample, then the density of the clusters is
produces a local minimum. Ordinary K-means produces a local minimum, but decreased by a factor of 10. For clusters that aren’t very dense to begin with,
like the other two algorithms, it is not guaranteed to produce a global this may means that they are now treated as noise or outliers.
minimum. (g) Data with a small percentage of noise points.
Sampling will not cause a problem. Actually, since we would like to exclude
noise, and since the amount of noise is small, this may be beneficial.
(h) Non-Euclidean data.
This has no particular impact.
(i) Euclidean data.
This has no particular impact.
(j) Data with many and mixed attribute types.
Many attributes was discussed under high-dimensionality. Mixed attributes
have no particular impact.
22. You are given two sets of 100 points that fall within the unit square. 24. Given the set of cluster labels and similarity matrix shown in Tables
One set of points is arranged so that the points are uniformly spaced. 7.15 and 7.16 , respectively, compute the correlation between the
The other set of points is generated from a uniform distribution over the similarity matrix and the ideal similarity matrix, i.e., the matrix whose
unit square.
entry is 1 if two objects belong to the same cluster, and 0 otherwise.
a. Is there a difference between the two sets of points?
Yes. The random points will have regions of lesser or greater density,
while the uniformly distributed points will, of course, have uniform
density throughout the unit square.
(b) If so, which set of points will typically have a smaller SSE for K=10
clusters?
The random set of points will have a lower SSE.
(c) What will be the behavior of DBSCAN on the uniform data set? The
random data set?
DBSCAN will merge all points in the uniform data set into one cluster
or classify them all as noise, depending on the threshold. There might Answer:
be some boundary issues for points at the edge of the region. However,
DBSCAN can often find clusters in the random data, since it does have
some variation in density.
23. Using the data in Exercise 24 , compute the silhouette coefficient for
each point, each of the two clusters, and the overall clustering.
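The tables themselves are not reproduced here, but the following sketch shows how the silhouette coefficient could be computed from a precomputed dissimilarity matrix (the matrix and labels below are placeholders, not the values of Tables 7.15 and 7.16):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Placeholder dissimilarities and cluster labels for four objects:
D = np.array([
    [0.00, 0.10, 0.65, 0.55],
    [0.10, 0.00, 0.70, 0.60],
    [0.65, 0.70, 0.00, 0.30],
    [0.55, 0.60, 0.30, 0.00],
])
labels = np.array([0, 0, 1, 1])

s = silhouette_samples(D, labels, metric="precomputed")
print("per point  :", np.round(s, 3))
print("per cluster:", [round(s[labels == c].mean(), 3) for c in (0, 1)])
print("overall    :", round(silhouette_score(D, labels, metric="precomputed"), 3))
```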
25. Compute the hierarchical F-measure for the eight objects {p1, p2, p3,
p4, p5, p6, p7, and p8} and hierarchical clustering shown in Figure 7.40 .
Class A contains points p1, p2, and p3, while p4, p5, p6, p7, and p8
belong to class B.

26. Compute the cophenetic correlation coefficient for the hierarchical clusterings in Exercise 16. (You will need to convert the similarities into dissimilarities.)
This can be easily computed using a package, e.g., MATLAB. The answers
are single link, 0.8116, and complete link, 0.7480.
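An equivalent computation is easy to script in Python with SciPy (a sketch; the dissimilarity matrix below is a placeholder rather than the one derived from Table 7.13):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Placeholder dissimilarity matrix; with real data, convert similarities first,
# e.g., d = 1 - s.
D = np.array([
    [0.00, 0.24, 0.78, 0.90, 0.86],
    [0.24, 0.00, 0.62, 0.88, 0.80],
    [0.78, 0.62, 0.00, 0.35, 0.59],
    [0.90, 0.88, 0.35, 0.00, 0.61],
    [0.86, 0.80, 0.59, 0.61, 0.00],
])
y = squareform(D)                   # condensed distance vector
for method in ("single", "complete"):
    Z = linkage(y, method=method)   # hierarchical clustering
    c, _ = cophenet(Z, y)           # cophenetic correlation coefficient
    print(method, round(c, 4))
```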
27. Prove Equation 7.14 . 31. We can represent a data set as a collection of object nodes and a 32. In Figure 7.41 , match the similarity matrices, which are sorted
collection of attribute nodes, where there is a link between each object according to cluster labels, with the sets of points. Differences in
and each attribute, and where the weight of that link is the value of the shading and marker shape distinguish between clusters, and each set
object for that attribute. For sparse data, if the value is 0, the link is of points contains 100 points and three clusters. In the set of points
omitted. Bipartite clustering attempts to partition this graph into disjoint labeled 2, there are three very tight, equalsized clusters.
clusters, where each cluster consists of a set of object nodes and a set
of attribute nodes. The objective is to maximize the weight of links
between the object and attribute nodes of a cluster, while minimizing the
weight of links between object and attribute links in different clusters.
This type of clustering is also known as coclustering because the
objects and attributes are clustered at the same time.
a. How is bipartite clustering (co-clustering) different from clustering the
sets of objects and attributes separately?
In regular clustering, only one set of constraints, related either to objects
or attributes, is applied. In co-clustering both sets of constraints are applied
simultaneously. Thus, partitioning the objects and attributes independently of
one another typically does not produce the same results.
b. Are there any cases in which these approaches yield the same
clusters?
Yes. For example, if a set of attributes is associated only with the objects
28. Prove Equation 7.16. in one particular cluster, i.e., has 0 weight for objects in all other
clusters, and conversely, the set of objects in a cluster has 0 weight for
all other attributes, then the clusters found by co-clustering will match
those found by clustering the objects and attributes separately. To use
documents as an example, this would correspond to a document data
set that consists of groups of documents that only contain certain words
and groups of words that only appear in certain documents.
c. What are the strengths and weaknesses of co-clustering as compared
to ordinary clustering?
Co-clustering automatically provides a description of a cluster of objects
in terms of attributes, which can be more useful than a description
of clusters as a partitioning of objects. However, the attributes that
distinguish different clusters of objects, may overlap significantly, and
in such cases, co-clustering will not work well.
Answers: 1 - D, 2 - C, 3 - A, 4 - B
30. Clusters of documents can be summarized by finding the top terms
(words) for the documents in the cluster, e.g., by taking the most
frequent k terms, where k is a constant, say 10, or by taking all terms
that occur more frequently than a specified threshold. Suppose that K-
means is used to find clusters of both documents and words for a
document data set.
a. How might a set of term clusters defined by the top terms in a
document cluster differ from the word clusters found by clustering the
terms with Kmeans?
First, the top words clusters could, and likely would, overlap somewhat.
Second, it is likely that many terms would not appear in any of the
clusters formed by the top terms. In contrast, a K-means clustering of
the terms would cover all the terms and would not be overlapping.
(b) How could term clustering be used to define clusters of documents?
An obvious approach would be to take the top documents for a term
cluster; i.e., those documents that most frequently contain the terms in
the cluster.
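A minimal sketch of the “top terms per document cluster” idea discussed in this exercise (the toy corpus and parameters are invented):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus; in practice this would be the document data set.
docs = [
    "stock market trading prices", "market investors stock shares",
    "football match goal score", "soccer football league goal",
    "election vote government policy", "government election campaign vote",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
top_k = 3
for c in range(km.n_clusters):
    # Terms with the largest centroid weights summarize the document cluster.
    top = np.argsort(km.cluster_centers_[c])[::-1][:top_k]
    print(f"cluster {c}: {', '.join(terms[top])}")
```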