Mining Massive Data University of Primorska Fall 2020
Homework 1
(Frequent itemset mining and Finding similar items)
1. Suppose there are 100 items, numbered 1 to 100, and also 100 baskets, also numbered 1 to 100.
Item i is in basket b if and only if i divides b with no remainder. Thus, item 1 is in all the baskets,
item 2 is in all fifty of the even-numbered baskets, and so on. Basket 12 consists of items {1, 2, 3,
4, 6, 12}, since these are all the integers that divide 12. Answer the following questions:
(a) If the support threshold is 5, which items are frequent?
(b) If the support threshold is 5, which pairs of items are frequent?
(c) What is the sum of the sizes of all the baskets?
(d) Suppose the support threshold is 5. Find the maximal frequent itemsets.
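The divisibility data set in Problem 1 is small enough to build directly and check by machine. The following is a minimal sketch, not a reference solution; the helper names `build_baskets` and `support_counts` are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def build_baskets(n=100):
    """Basket b holds every item i in 1..n that divides b with no remainder."""
    return {b: [i for i in range(1, n + 1) if b % i == 0]
            for b in range(1, n + 1)}


def support_counts(baskets, size):
    """Count how many baskets contain each itemset of the given size.

    Itemsets are keyed as sorted tuples, e.g. (2, 3) for the pair {2, 3}.
    """
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), size))
    return counts


baskets = build_baskets()
item_support = support_counts(baskets, 1)
pair_support = support_counts(baskets, 2)
```

Filtering `item_support` or `pair_support` by a threshold then gives the frequent itemsets of that size; for instance, `baskets[12]` reproduces the basket listed in the problem statement.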
2. Consider the data of the previous problem. What is the confidence of the following association
rules?
(a) {5, 7} → 2.
(b) {2, 3, 4} → 5.
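The confidence of a rule I → j is the fraction of baskets containing I that also contain j. The sketch below computes it on the divisibility data of Problem 1; the function name `confidence` and the choice to return `None` when no basket contains I are assumptions made for illustration. It is demonstrated on a rule that is not part of the exercise.

```python
def confidence(baskets, lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs + rhs) / support(lhs)."""
    lhs = set(lhs)
    with_lhs = [b for b in baskets if lhs <= b]
    if not with_lhs:
        return None  # assumption: undefined when lhs never occurs
    with_both = [b for b in with_lhs if rhs in b]
    return len(with_both) / len(with_lhs)


# Baskets 1..100, each holding the divisors of its own number.
baskets = [{i for i in range(1, 101) if b % i == 0} for b in range(1, 101)]

# Example rule {2} -> 4: half of the even baskets are multiples of 4.
example = confidence(baskets, {2}, 4)
```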
3. Suppose that we use a triangular matrix to count pairs, and n, the number of items, is 20.
(a) What pair’s count is in a[100]?
(b) Suppose the support threshold is 5. Find the maximal frequent itemsets.
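The problem does not state which layout the triangular array uses. The sketch below assumes the common 1-based lexicographic layout, in which the pair {i, j} with 1 ≤ i < j ≤ n is stored at k = (i − 1)(n − i/2) + j − i. The function names are hypothetical, and a different layout would give different indices.

```python
def pair_to_index(i, j, n):
    """1-based position of pair {i, j}, 1 <= i < j <= n, in the triangular array.

    Assumes lexicographic order: (1,2), (1,3), ..., (1,n), (2,3), ...
    (i - 1) * (2n - i) is always even, so integer division is exact.
    """
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    return (i - 1) * (2 * n - i) // 2 + j - i


def index_to_pair(k, n):
    """Inverse of pair_to_index: scan the rows until k falls inside one."""
    for i in range(1, n):
        first = pair_to_index(i, i + 1, n)
        last = pair_to_index(i, n, n)
        if first <= k <= last:
            return i, i + 1 + (k - first)
    raise ValueError("k is outside 1..n(n-1)/2")


# Small check with n = 4: pairs are laid out as
# (1,2)=1, (1,3)=2, (1,4)=3, (2,3)=4, (2,4)=5, (3,4)=6.
small = [pair_to_index(i, j, 4) for i in range(1, 4) for j in range(i + 1, 5)]
```

Calling `index_to_pair(k, n)` with the values from the exercise answers part (a) under this layout assumption.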
4. Here is a collection of twelve baskets. Each contains three of the six items 1 through 6.
{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}
{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}
{3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}
Suppose the support threshold is 4. On the first pass of the PCY Algorithm we use a hash table
with 11 buckets, and the set {i, j} is hashed to bucket i × j mod 11.
(a) By any method, compute the support for each item and each pair of items.
(b) Which pairs hash to which buckets?
(c) Which buckets are frequent?
(d) Which pairs are counted on the second pass of the PCY Algorithm?
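The two passes described in Problem 4 can be sketched as follows, using the twelve baskets, the threshold of 4, and the hash function i × j mod 11 from the problem statement. The function names and the use of a list of booleans as the bitmap are illustrative assumptions, not a prescribed implementation.

```python
from collections import Counter
from itertools import combinations

BASKETS = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6},
    {1, 3, 5}, {2, 4, 6}, {1, 3, 4}, {2, 4, 5},
    {3, 5, 6}, {1, 2, 4}, {2, 3, 5}, {3, 4, 6},
]
SUPPORT = 4
NUM_BUCKETS = 11


def bucket(i, j):
    """Hash function given in the problem: i * j mod 11."""
    return (i * j) % NUM_BUCKETS


def pcy_first_pass(baskets):
    """Pass 1: count single items and hash every pair into a bucket count."""
    item_counts = Counter()
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        item_counts.update(basket)
        for i, j in combinations(sorted(basket), 2):
            bucket_counts[bucket(i, j)] += 1
    return item_counts, bucket_counts


def pcy_candidates(item_counts, bucket_counts):
    """Pairs counted on pass 2: both items frequent and the bucket frequent."""
    frequent_items = sorted(i for i, c in item_counts.items() if c >= SUPPORT)
    bitmap = [c >= SUPPORT for c in bucket_counts]
    return [(i, j) for i, j in combinations(frequent_items, 2)
            if bitmap[bucket(i, j)]]


item_counts, bucket_counts = pcy_first_pass(BASKETS)
candidates = pcy_candidates(item_counts, bucket_counts)
```

A pair survives to the second pass only if it passes both filters, which is why `pcy_candidates` checks the item counts before consulting the bitmap.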
5. Suppose we have a universal set U of n elements, and we choose two subsets S and T at random,
each with m of the n elements. What is the expected value of the Jaccard similarity of S and T?
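A closed-form answer to Problem 5 can be sanity-checked numerically. The sketch below is a Monte Carlo estimate, not a derivation; the function names, the default trial count, the seed, and the convention that two empty sets have similarity 1 are all assumptions made for illustration.

```python
import random


def jaccard(s, t):
    """Jaccard similarity |S ∩ T| / |S ∪ T| (taken as 1.0 if both are empty)."""
    union = s | t
    return len(s & t) / len(union) if union else 1.0


def estimate_expected_jaccard(n, m, trials=10_000, seed=0):
    """Average Jaccard similarity of two random m-subsets of an n-element set."""
    rng = random.Random(seed)
    universe = range(n)
    total = 0.0
    for _ in range(trials):
        s = set(rng.sample(universe, m))
        t = set(rng.sample(universe, m))
        total += jaccard(s, t)
    return total / trials
```

Comparing `estimate_expected_jaccard(n, m)` against a candidate formula for a few small values of n and m is a quick way to catch algebra mistakes.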
6. What is the largest number of k-shingles a document of n bytes can have? You may assume that
the size of the alphabet is large enough that the number of possible strings of length k is at least n.
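For reference, the set of k-shingles of a document is the set of all length-k substrings that occur in it. A minimal sketch, treating the document as a Python string (for a document of n bytes, a `bytes` object works the same way); the function name `k_shingles` is hypothetical:

```python
def k_shingles(text, k):
    """Return the set of distinct length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


# Repeated substrings collapse into one shingle: "ab" occurs twice here.
example = k_shingles("abcdabd", 2)
```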
(a) Compute the minhash signature for each column if we use the following three hash functions:
h1(x) = 2x + 1 mod 6; h2(x) = 3x + 2 mod 6; h3(x) = 5x + 2 mod 6.
(b) Which of these hash functions are true permutations?
(c) How close are the estimated Jaccard similarities for the six pairs of columns to the true
Jaccard similarities?
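The minhash computation can be sketched as follows. The three hash functions are the ones given above, read as (2x + 1) mod 6, (3x + 2) mod 6 and (5x + 2) mod 6, with rows numbered 0 to 5. The columns `A` and `B` are hypothetical placeholders chosen only to make the example runnable; they are not the matrix from the exercise.

```python
HASHES = [
    lambda x: (2 * x + 1) % 6,  # h1
    lambda x: (3 * x + 2) % 6,  # h2
    lambda x: (5 * x + 2) % 6,  # h3
]


def minhash_signature(rows):
    """Signature of a column, given the set of row numbers where it has a 1."""
    return [min(h(r) for r in rows) for h in HASHES]


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two columns agree."""
    agree = sum(a == b for a, b in zip(sig_a, sig_b))
    return agree / len(sig_a)


def true_jaccard(rows_a, rows_b):
    """Exact Jaccard similarity of two non-empty columns."""
    return len(rows_a & rows_b) / len(rows_a | rows_b)


# Hypothetical columns, NOT taken from the exercise.
A = {0, 3}
B = {1, 2, 4}
sig_a = minhash_signature(A)
sig_b = minhash_signature(B)
```

With only three hash functions the estimate can take just four values (0, 1/3, 2/3, 1), so a large gap between `estimated_similarity` and `true_jaccard` is to be expected for some pairs.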