Data Mining Workbook Answers

Data Mining Workbook
4th year, 2019
Part 1: Pre-Processing
Suppose that the data for analysis includes the attribute age.
The age values are
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70.
a) Find the mean, median, and mode of the data.
b) Give the five-number summary of the data, and show a boxplot of the data.
c) Partition the data into three bins by each of equal-frequency and equal-width partitioning.
d) Use smoothing by bin boundaries to smooth these data.
e) Use min-max normalization to transform the value 30 for age onto the range [0.0, 1.0].
f) Use z-score normalization to transform the value 30 for age, where the SD of
age is 12.94 years.
g) Plot an equal-width histogram of width 10.
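Parts (a), (e), and (f) reduce to simple arithmetic on the 27 age values. The following Python snippet is a minimal sketch (a study aid, not part of the original workbook) that reproduces those numbers; the variable names are my own.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)            # 809 / 27 ~= 29.96
median = statistics.median(ages)        # 14th of the 27 sorted values = 25
modes = statistics.multimode(ages)      # bimodal: 25 and 35 (each appears four times)

v = 30
v_minmax = (v - min(ages)) / (max(ages) - min(ages))   # e) (30 - 13) / (70 - 13) ~= 0.30
v_zscore = (v - mean) / 12.94                          # f) (30 - 29.96) / 12.94 ~= 0.003

print(mean, median, modes, v_minmax, v_zscore)
```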
Part 2: Association Rules
Consider the data set shown in Table 1. Let min sup = 40% and min conf = 80%.
a. Find all frequent itemsets using the Apriori algorithm.
b. List all the strong association rules.
c. Find the correlation of the strong association rules using lift. What is the meaning of the computed value?
d. Find all strong association rules of the form X ∧ Y → Z and note their confidence values.
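The Table 1 transactions are not reproduced in this extract, so the sketch below (mine, not the workbook's solution) only illustrates the level-wise Apriori procedure asked for in part (a) on a placeholder transaction list; substitute the real Table 1 transactions before relying on the output.

```python
from itertools import chain

# Placeholder transactions: replace these with the Table 1 data.
transactions = [{'A', 'B', 'C'}, {'A', 'C'}, {'A', 'D'}, {'B', 'E', 'C'}]
min_sup = 0.40  # 40%

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# L1: frequent 1-itemsets
items = sorted(set(chain.from_iterable(transactions)))
frequent = [frozenset([i]) for i in items if support({i}) >= min_sup]
all_frequent = list(frequent)

k = 2
while frequent:
    # join step: combine frequent (k-1)-itemsets; prune step: keep those meeting min_sup
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_sup]
    all_frequent.extend(frequent)
    k += 1

for itemset in all_frequent:
    print(sorted(itemset), support(itemset))
```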
Consider the data set shown in Table 3. Let min sup = 30% and min conf = 75%.

Table 3: Transactions
TID     Items bought
1001    {i1, i4, i5}
1024    {i1, i2, i3, i5}
1012    {i1, i2, i4, i5}
1031    {i1, i3, i4, i5}
1015    {i2, i3, i5}
1022    {i2, i4, i5}
1029    {i1, i3, i4}
1040    {i1, i2, i3}
1033    {i1, i4, i5}
1038    {i1, i2, i5}

a. Construct the FP-tree for these transactions.
b. Compute the support for the item-sets {i1}, {i4}, {i5}, {i1, i4}, {i1, i5}, {i4, i5}, and {i1, i4, i5}.
c. Compute the confidence for the association rules {i1, i4} → {i5}, {i1, i5} → {i4}, and {i4, i5} → {i1}. Which one is a strong rule?
d. Compute the interest measure for the strong association rules in (c). What is the meaning of the computed value?
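As a cross-check for parts (b)-(d), the short sketch below (my own helper, not the official answer sheet) computes support, confidence, and lift directly from the Table 3 transactions; a rule is strong here when its support is at least 30% and its confidence at least 75%.

```python
transactions = [
    {'i1', 'i4', 'i5'}, {'i1', 'i2', 'i3', 'i5'}, {'i1', 'i2', 'i4', 'i5'},
    {'i1', 'i3', 'i4', 'i5'}, {'i2', 'i3', 'i5'}, {'i2', 'i4', 'i5'},
    {'i1', 'i3', 'i4'}, {'i1', 'i2', 'i3'}, {'i1', 'i4', 'i5'}, {'i1', 'i2', 'i5'},
]
n = len(transactions)  # 10 transactions in Table 3

def sup(itemset):
    return sum(itemset <= t for t in transactions) / n

def conf(lhs, rhs):
    return sup(lhs | rhs) / sup(lhs)

def lift(lhs, rhs):
    return conf(lhs, rhs) / sup(rhs)     # lift > 1: positive correlation, < 1: negative

for lhs, rhs in [({'i1', 'i4'}, {'i5'}), ({'i1', 'i5'}, {'i4'}), ({'i4', 'i5'}, {'i1'})]:
    print(sorted(lhs), '->', sorted(rhs),
          'sup =', sup(lhs | rhs), 'conf =', round(conf(lhs, rhs), 2),
          'lift =', round(lift(lhs, rhs), 2))
```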
Consider the data set shown in Table 4
a. Compute the support for the item-sets {i5}, {i2, i4}, and {i2, i4, i5} by treating each transaction ID as a market basket.
b. Use the results in (a) to compute the confidence for the association rules {i2, i4} → {i5} and {i5} → {i2, i4}.
c. Is confidence a symmetric measure?
d. Repeat part (a) by treating each customer ID as a market basket. Each
item should be treated as a binary variable (1 if an item appears in at least one
transaction bought by the customer, and 0 otherwise.)
e. Use the results in part (d) to compute the confidence for the association rules {i2, i4} → {i5} and {i5} → {i2, i4}.
f. Discuss whether there are any relationships between support and
confidence of parts {a, b} and {d, e}.
g. Compute the lift for the association rules {i2, i4} → {i5} and {i5} → {i2, i4} in parts {b, e}. What is the meaning of the computed value?
b. Based on the given data, is the purchase of hot dogs independent of the purchase of hamburgers? If not, what kind of correlation relationship exists between the two? Note: the chi-square value needed to reject the independence hypothesis (1 degree of freedom, 0.001 significance level) is 10.828.
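The 2x2 contingency table for the hot dog / hamburger question is not reproduced above, so the counts in the sketch below are placeholders (my assumption); replace them with the workbook's table before drawing any conclusion. The comparison against 10.828 follows the hint, since that is the chi-square critical value for 1 degree of freedom at the 0.001 significance level.

```python
# Rows: bought hamburgers / did not; columns: bought hot dogs / did not (placeholder counts).
observed = [[2000, 500],
            [1000, 1500]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
grand = sum(row_totals)

# chi-square = sum over cells of (observed - expected)^2 / expected
chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / grand) ** 2
           / (row_totals[i] * col_totals[j] / grand)
           for i in range(2) for j in range(2))

print('chi-square =', round(chi2, 1),
      '-> reject independence' if chi2 > 10.828 else '-> cannot reject independence')
```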
A database has four transactions. Let min sup = 60% and min
conf = 80%.
Part 3: Classification

Consider the sample data shown in Table 4.

Table 4
Instance   A1   A2   A3   class
1          T    T    P    +
2          T    T    P    +
3          T    F    N    -
4          F    F    N    +
5          F    T    N    -
6          F    T    N    -
7          F    F    P    -
8          T    F    P    +
9          F    T    N    -
10         T    F    N    -

a- What is the entropy of this collection of data?
b- What are the information gains of A1, A2, and A3 relative to this data?
c- What is the best split according to the information gain in part (b)?
d- Use the Naïve Bayesian Classification method to classify an object which has the following attributes: {A1: T, A2: F, A3: N}
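For parts (a) and (b), entropy and information gain can be checked mechanically. The snippet below is a sketch I added (not the workbook's worked solution) that reads the ten rows of Table 4 and prints the class entropy and the gain of each attribute.

```python
from math import log2
from collections import Counter

# (A1, A2, A3, class) rows copied from Table 4
data = [
    ('T', 'T', 'P', '+'), ('T', 'T', 'P', '+'), ('T', 'F', 'N', '-'),
    ('F', 'F', 'N', '+'), ('F', 'T', 'N', '-'), ('F', 'T', 'N', '-'),
    ('F', 'F', 'P', '-'), ('T', 'F', 'P', '+'), ('F', 'T', 'N', '-'),
    ('T', 'F', 'N', '-'),
]

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

labels = [row[-1] for row in data]
print('entropy(class) =', round(entropy(labels), 3))

for idx, name in enumerate(['A1', 'A2', 'A3']):
    # expected entropy after splitting on this attribute, weighted by partition size
    split = sum(len([r for r in data if r[idx] == v]) / len(data)
                * entropy([r[-1] for r in data if r[idx] == v])
                for v in {r[idx] for r in data})
    print('gain(' + name + ') =', round(entropy(labels) - split, 3))
```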
For the given data in Table 5, count represents the number of data tuples having the given values for department, status, age, and salary. Let status be the class label attribute.
a- What is the entropy of this collection of data?
b- What are the information gains of department, age, and salary?
c- What is the best split according to the information gain in (b)?
d- Use the Naïve Bayesian Classification method to classify an individual who has the following attributes: {department = "marketing", age = "youth", salary = "low"}
Table 5
department   age           salary   status   count
sales        middle_aged   medium   senior   30
sales        youth         low      junior   30
sales        middle_aged   low      junior   40
systems      youth         medium   junior   20
systems      middle_aged   high     senior   20
systems      senior        high     senior   10
marketing    senior        medium   senior   10
marketing    middle_aged   medium   junior   20
secretary    senior        medium   senior   10
secretary    youth         low      junior   10
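Because every row of Table 5 carries a count, the Naïve Bayes estimates in part (d) have to weight each row by that count. The sketch below is my own helper (not the workbook's answer) showing that computation for the individual {department = "marketing", age = "youth", salary = "low"}; note that P(age = youth | senior) is 0 on these counts, so a Laplacian correction could be applied in practice.

```python
from collections import defaultdict

# (department, age, salary, status, count) rows copied from Table 5
rows = [
    ('sales', 'middle_aged', 'medium', 'senior', 30),
    ('sales', 'youth', 'low', 'junior', 30),
    ('sales', 'middle_aged', 'low', 'junior', 40),
    ('systems', 'youth', 'medium', 'junior', 20),
    ('systems', 'middle_aged', 'high', 'senior', 20),
    ('systems', 'senior', 'high', 'senior', 10),
    ('marketing', 'senior', 'medium', 'senior', 10),
    ('marketing', 'middle_aged', 'medium', 'junior', 20),
    ('secretary', 'senior', 'medium', 'senior', 10),
    ('secretary', 'youth', 'low', 'junior', 10),
]

class_total = defaultdict(int)
for r in rows:
    class_total[r[3]] += r[4]
total = sum(class_total.values())   # 200 tuples in all

def p_attr(attr_index, value, status):
    # P(attribute = value | status), estimated from the count-weighted tuples
    return sum(r[4] for r in rows if r[3] == status and r[attr_index] == value) / class_total[status]

x = ('marketing', 'youth', 'low')   # the individual from part (d)
for status in class_total:
    p = class_total[status] / total
    for i, value in enumerate(x):
        p *= p_attr(i, value, status)
    print(status, round(p, 4))
```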
Part 4: Clustering
Consider the following nine points: A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9), C3(2, 8).
- Use the k-means algorithm to show the three cluster centers after the first-round execution.
Consider the following nine points: (2, 2), (2, 7), (3, 1), (3, 5), (4, 3), (4, 8), (5, 2), (6, 2), (6, 5). Assume that k = 2 and initially assign (2, 2) and (2, 7) as the centers of the two clusters.
- Apply the k-means algorithm using the Manhattan distance and show the new cluster centers after the first-round execution.
- Suppose that your result in (a) is the final clustering. How can you use it to detect outliers? Which object is most likely to be an outlier?
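For the second exercise the whole setup is given (k = 2, Manhattan distance, initial centers (2, 2) and (2, 7)), so one assignment step and one update step can be sketched directly; the same two steps apply to the first exercise once its initial centers are chosen. The code below is a study sketch of mine, not the model answer.

```python
points = [(2, 2), (2, 7), (3, 1), (3, 5), (4, 3), (4, 8), (5, 2), (6, 2), (6, 5)]
centers = [(2, 2), (2, 7)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Assignment step: each point joins the cluster of its nearest center.
clusters = [[], []]
for p in points:
    nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
    clusters[nearest].append(p)

# Update step: the new center of each cluster is the mean of its members.
new_centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
print(clusters)
print(new_centers)
```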
Consider the following confusion matrix for a classifier evaluated on 10,000 test tuples:

              Predicted yes   Predicted no   Total
Actual yes    6954 (TP)       46 (FN)        7000
Actual no     412 (FP)        2588 (TN)      3000
Total         7366            2634           10000

14. ……….
a. 0.99   b. 0.95   c. 0.86   d. 0.05
15. Specificity is ……….   [TN/N = 2588/3000]
a. 0.99   b. 0.95   c. 0.86   d. 0.05
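A short check of these measures (my sketch; TP and FN are taken from the table above):

```python
TP, FN, FP, TN = 6954, 46, 412, 2588
print('sensitivity =', TP / (TP + FN))             # TP / P
print('specificity =', TN / (FP + TN))             # TN / N = 2588 / 3000 ~= 0.86
print('accuracy    =', (TP + TN) / (TP + FN + FP + TN))
```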
Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering algorithm. After the first iteration, clusters C1, C2, and C3 have the following observations:
C1: {(2, 2), (4, 4), (6, 6)}   C2: {(0, 4), (4, 0)}   C3: {(5, 5), (9, 9)}
16. What will be the cluster centroids if you want to proceed for second
iteration?
a. C1: (4, 4), C2: (2, 2), C3: (7, 7) b. C1: (6, 6), C2: (4, 4), C3: (9, 9)
c. C1: (2, 2), C2: (0, 0), C3: (5, 5) d. None of these
17. What will be the Manhattan distance for observation (9, 9) from cluster centroid C1 in the second iteration? [(9 - 4) + (9 - 4) = 10]
a. 10 b. 5 c. 13 d. None of these
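Questions 16 and 17 can be verified by recomputing the centroids and the Manhattan distance; the snippet below is a quick check I added, not part of the original questions.

```python
clusters = {
    'C1': [(2, 2), (4, 4), (6, 6)],
    'C2': [(0, 4), (4, 0)],
    'C3': [(5, 5), (9, 9)],
}

# Centroid of each cluster = mean of its points: C1 -> (4, 4), C2 -> (2, 2), C3 -> (7, 7)
centroids = {name: (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
             for name, pts in clusters.items()}
print(centroids)

c1 = centroids['C1']
print(abs(9 - c1[0]) + abs(9 - c1[1]))   # Manhattan distance of (9, 9) from the new C1
```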
18. Consider the given data: {3, 4, 5, 10, 21, 32, 43, 44, 46, 52, 59, 67}. Using equal-width partitioning and four bins, how many values are there in the first bin?
a. 3 b. 4 c. 5 d. 6
19. If smoothing by median is applied to the previous bins, what is the new value of the data in the first bin?
a. 4 b. 4.5 c. 5 d. 7.5
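Questions 18 and 19 follow from equal-width binning with width (67 - 3) / 4 = 16; the sketch below (mine, for checking only) builds the four bins and smooths the first one by its median.

```python
import statistics

data = [3, 4, 5, 10, 21, 32, 43, 44, 46, 52, 59, 67]
n_bins = 4
width = (max(data) - min(data)) / n_bins     # (67 - 3) / 4 = 16

bins = [[] for _ in range(n_bins)]
for v in data:
    # values on the upper boundary fall into the last bin
    idx = min(int((v - min(data)) // width), n_bins - 1)
    bins[idx].append(v)

print(bins)                         # first bin holds the values in [3, 19)
print(statistics.median(bins[0]))   # smoothing by median replaces them with this value
```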
20. Which of the following lists all parts of the five-number summary?
a. Mean, Median, Mode, Range, and Total
b. Minimum, Quartile1, Median, Quartile3, and Maximum
c. Smallest, Q1, Q2, Q3, and Q4
d. Minimum, Maximum, Range, Mean, and Median
Answer the following questions:
1. Define: A centroid in k-means.
2. Define: A core point in DBSCAN
3. Define: association and correlation analysis. Give an example
4. Define: cluster analysis. Give an example
5. Define: Data Cleaning, Data integration, Data reduction, Data
transformation, Discretization
6. Define: outlier analysis. Give an example
7. Define: regression. Give an example
8. Give an example for nonparametric data reduction strategies.
9. Give an example for parametric data reduction strategies.
10. How does K-means differ from DBSCAN
11. How to assess the goodness of a rule?
12. How can you solve missing-value problems?
13. If a person's height is measured in inches, then what kind of attribute will you use? Ratio
14. If the correlation coefficient of the items bread and rice is equal to 1.5, what does this mean?
15. If the covariance of the items bread and rice is equal to 1, what does this mean? positive
16. If the information gains of the age and income attributes are 0.24 and 0.024 respectively, which one will you choose as the splitting attribute?
17. If the lift measure of the items bread and rice is equal to 0.5, what does this mean?
18. If the lift measure of the items bread and rice is equal to 1, what does this mean?
19. If the lift measure of the items bread and rice is equal to 1.5, what does this mean?
20. If the mean is equal to the median then this might be an indication that
the data is what?
21. If the mean is larger than the median then this might be an indication that
the data is what?
22. If the mean is smaller than the median, then this might be an indication that the data is what? negatively skewed
23. If you have 100 values in your data and you add 5.0 to all of the values, how will this change the median? the value increases by 5.0, but its position stays the same
24. If you have 100 values in your data and you add 5.0 to all of the values, how will this change the median?
25. List the Cluster Analysis Methods
26. List the Major Preprocessing Tasks That Improve Quality of Data
27. List the steps of knowledge discovery
28. List the transformation strategies. binning, regression
29. List the types of outliers. Give an example for each one.
30. The confidence for the association rule {bread} → {milk, diapers} was
determined to be 0.95. What does the value 0.95 mean?
31. The support for the association rule {bread} → {milk, diapers} was
determined to be 0.95. What does the value 0.95 mean?
32. What are rule conflicts? How can you solve them?
33. What are the data smoothing techniques? binning, regression, outlier analysis
61. Cluster is the process of finding a model that describes and distinguishes data classes or concepts. F (cluster --> classification)
62. Computing the total sales of a company is a data mining task. F
63. Correlation analysis divides data into groups that are meaningful, useful, or both. F (correlation analysis --> cluster analysis)
64. Database mining refers to the process of deriving high-quality information from text. F (database mining --> text mining)
65. A dissimilarity matrix stores n data objects that have p attributes as an n-by-p matrix. F (that is the data matrix)
66. Dividing the customers of a company according to their profitability is a data mining task. F
67. For an association rule, if we move one item from the right-hand side to the left-hand side of the rule, then the confidence will never change. F (because confidence is not symmetric)
68. If all the proper subsets of an itemset are frequent, then the itemset itself
must also be frequent. F
69. In decision tree algorithms, attribute selection measures are used to rank
attributes T
70. In decision tree algorithms, attribute selection measures are used to
reduce the dimensionality T
71. In a lazy learner we are interested in the largest distance. F (the smallest distance)
72. Intrinsic methods measure how well the clusters are separated. T
73. Multimedia Mining is the application of data mining techniques to discover patterns from the Web. F (multimedia mining --> web mining)
74. Regression is a method of integration. F (regression is not an integration method)
75. Strategies for data transformation include the chi-square test. F (the chi-square test belongs to data integration)
76. Pruning makes the decision tree more complex. F (complex --> simple)
77. An object is an outlier if its density is equal to the density of its neighbors. F (equal to --> less than)
78. A common weakness of association rule mining is that it does not produce enough interesting rules. F
79. Accuracy is an interestingness measure for association rules. F (accuracy --> support and confidence)
80. Binning is a method of reduction. F (binning belongs to transformation and cleaning)
86. The incomplete data problem can be solved by binning. F (incomplete data --> noisy data)
89. The mode is a middle value in a set of ordered values. F (mode --> median)
92. Predicting the outcome of tossing a (fair) pair of dice is a data mining task. F
93. Recall is an interestingness measure for association rules. F (recall --> lift)
94. Redundancy is an important issue in data cleaning. F (data cleaning --> data integration)
95. Sampling methods smooth noisy data. F (sampling --> regression, outlier analysis, or binning)
99. An object is a local outlier if it deviates significantly from the rest of the data set. F (local --> global outlier)
100. The silhouette coefficient is a method to determine the natural number of clusters for hierarchical algorithms. F (it evaluates cluster quality rather than determining the number of clusters)
My best wishes.