Mining Massive Data University of Primorska Fall 2020

This document contains 8 questions about frequent itemset mining and finding similar items. Specifically, it asks about computing frequent items and itemsets given transaction data with different support thresholds, calculating confidence of association rules, using a triangular matrix to count item pairs, applying the PCY algorithm to transaction data hashed into buckets, estimating expected Jaccard similarity of random sets, computing the maximum number of shingles in a document, calculating minhash signatures to estimate Jaccard similarity between columns of a matrix, and evaluating an S-curve function.

Uploaded by

Đorđe Klisura

Mining Massive Data

University of Primorska
Fall 2020

Homework 1
(Frequent itemset mining and Finding similar items)

1. Suppose there are 100 items, numbered 1 to 100, and also 100 baskets, also numbered 1 to 100.
Item i is in basket b if and only if i divides b with no remainder. Thus, item 1 is in all the baskets,
item 2 is in all fifty of the even-numbered baskets, and so on. Basket 12 consists of items {1, 2, 3,
4, 6, 12}, since these are all the integers that divide 12. Answer the following questions:
(a) If the support threshold is 5, which items are frequent?
(b) If the support threshold is 5, which pairs of items are frequent?
(c) What is the sum of the sizes of all the baskets?
(d) Suppose the support threshold is 5. Find the maximal frequent itemsets.
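
The basket construction above can be set up in a few lines. The following is a minimal sketch, assuming that "support threshold 5" means an item is frequent when it appears in at least 5 baskets; the names `baskets`, `item_support`, and `frequent_items` are illustrative and not part of the assignment.

```python
from collections import Counter

N = 100  # items and baskets are both numbered 1..N

# Basket b holds every item i that divides b with no remainder.
baskets = {b: [i for i in range(1, N + 1) if b % i == 0] for b in range(1, N + 1)}

# Support of an item = number of baskets that contain it.
item_support = Counter(i for items in baskets.values() for i in items)


def frequent_items(threshold):
    """Items whose support is at least `threshold` (assumed meaning of the threshold)."""
    return sorted(i for i, s in item_support.items() if s >= threshold)


# Sanity check against the example given in the problem statement.
print(baskets[12])  # the divisors of 12
```

Summing `len(items)` over `baskets.values()` gives the total basket size asked for in part (c); pairs and larger itemsets can be counted the same way with `itertools.combinations`.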

2. Consider the data of the previous problem. What is the confidence of the following association
rules?
(a) {5, 7} → 2.
(b) {2, 3, 4} → 5.
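
One way to check such answers is to compute confidence directly from the data. This sketch assumes the usual definition, confidence(I → j) = support(I ∪ {j}) / support(I); the helper names `support` and `confidence` are hypothetical.

```python
N = 100

# Rebuild the baskets of Problem 1 as sets: basket b contains the divisors of b.
baskets = [{i for i in range(1, N + 1) if b % i == 0} for b in range(1, N + 1)]


def support(itemset):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(lhs, rhs_item):
    """Assumed definition: support(lhs plus rhs_item) / support(lhs)."""
    lhs = set(lhs)
    denom = support(lhs)
    if denom == 0:
        raise ValueError("left-hand side never occurs, confidence is undefined")
    return support(lhs | {rhs_item}) / denom
```

For example, `confidence({5, 7}, 2)` evaluates rule (a).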

3. Suppose that we use a triangular matrix to count pairs, and n, the number of items, is 20.
(a) What pair’s count is in a[100]?
(b) Suppose the support threshold is 5. Find the maximal frequent itemsets.
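
The triangular-matrix layout is not spelled out in the problem. The sketch below assumes the common textbook convention: pairs {i, j} with 1 ≤ i < j ≤ n are stored in lexicographic order in a 1-based array, so that k = (i − 1)(n − i/2) + j − i. Both function names are hypothetical.

```python
def pair_to_index(i, j, n):
    """1-based position of pair {i, j}, 1 <= i < j <= n, in the assumed layout."""
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    # (i - 1) * (2n - i) is always even, so integer division is exact.
    return (i - 1) * (2 * n - i) // 2 + (j - i)


def index_to_pair(k, n):
    """Inverse mapping: walk the rows of the triangle until k falls inside one."""
    if k < 1:
        raise ValueError("k must be at least 1")
    for i in range(1, n):
        row_len = n - i  # row i holds the pairs {i, i+1}, ..., {i, n}
        if k <= row_len:
            return i, i + k
        k -= row_len
    raise ValueError("k exceeds the number of pairs")
```

Calling `index_to_pair(100, 20)` answers part (a) under this layout; a different storage order would give a different pair.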

4. Here is a collection of twelve baskets. Each contains three of the six items 1 through 6.
{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}
{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}
{3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}
Suppose the support threshold is 4. On the first pass of the PCY Algorithm we use a hash table
with 11 buckets, and the set {i, j} is hashed to bucket (i × j) mod 11.
(a) By any method, compute the support for each item and each pair of items.
(b) Which pairs hash to which buckets?
(c) Which buckets are frequent?
(d) Which pairs are counted on the second pass of the PCY Algorithm?
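
The first pass of PCY on these baskets can be sketched as follows. This is an illustrative outline, not a reference implementation: it assumes a bucket is frequent when its count is at least the support threshold, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

BASKETS = [
    (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6),
    (1, 3, 5), (2, 4, 6), (1, 3, 4), (2, 4, 5),
    (3, 5, 6), (1, 2, 4), (2, 3, 5), (3, 4, 6),
]
NUM_BUCKETS = 11
SUPPORT_THRESHOLD = 4


def pcy_first_pass(baskets, num_buckets):
    """Count single items and hash every pair {i, j} to bucket (i * j) mod num_buckets."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        item_counts.update(basket)
        for i, j in combinations(sorted(basket), 2):
            bucket_counts[(i * j) % num_buckets] += 1
    return item_counts, bucket_counts


def frequent_bucket_bitmap(bucket_counts, threshold):
    """Between the passes, each bucket count is reduced to a single bit."""
    return [count >= threshold for count in bucket_counts]


item_counts, bucket_counts = pcy_first_pass(BASKETS, NUM_BUCKETS)
bitmap = frequent_bucket_bitmap(bucket_counts, SUPPORT_THRESHOLD)
```

On the second pass, a pair {i, j} is counted only if both items are frequent in `item_counts` and `bitmap[(i * j) % NUM_BUCKETS]` is true.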

5. Suppose we have a universal set U of n elements, and we choose two subsets S and T at random,
each with m of the n elements. What is the expected value of the Jaccard similarity of S and T?
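
A closed-form answer can be sanity-checked numerically. The sketch below is a simple Monte Carlo estimate, assuming S and T are drawn independently and uniformly among the m-element subsets of U; the function names, the default trial count, and the seed are arbitrary choices.

```python
import random


def jaccard(s, t):
    """Jaccard similarity |S ∩ T| / |S ∪ T|; two empty sets are treated as identical."""
    union = s | t
    return len(s & t) / len(union) if union else 1.0


def estimate_expected_jaccard(n, m, trials=10_000, seed=0):
    """Average Jaccard similarity over `trials` independent random pairs of m-subsets."""
    rng = random.Random(seed)
    universe = range(n)
    total = 0.0
    for _ in range(trials):
        s = set(rng.sample(universe, m))
        t = set(rng.sample(universe, m))
        total += jaccard(s, t)
    return total / trials
```

Comparing `estimate_expected_jaccard(n, m)` with a derived formula for a few small values of n and m is a quick way to catch algebra mistakes.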

6. What is the largest number of k-shingles a document of n bytes can have? You may assume that
the size of the alphabet is large enough that the number of possible strings of length k is at least n.
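
For experimenting with small inputs, the set of k-shingles can be enumerated directly. A minimal sketch, assuming a shingle is any contiguous run of k bytes and that duplicates are counted once; `k_shingles` is a hypothetical helper name.

```python
def k_shingles(document: bytes, k: int) -> set:
    """Return the set of distinct length-k byte substrings of `document`."""
    if k <= 0:
        raise ValueError("k must be positive")
    # One candidate shingle starts at each position that leaves room for k bytes;
    # a document shorter than k yields the empty set.
    return {document[i:i + k] for i in range(len(document) - k + 1)}
```

Trying documents with and without repeated substrings shows how repetition reduces the number of distinct shingles.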

7. Consider the matrix with six rows below.

(a) Compute the minhash signature for each column if we use the following three hash functions:
h1(x) = (2x + 1) mod 6; h2(x) = (3x + 2) mod 6; h3(x) = (5x + 2) mod 6.
(b) Which of these hash functions are true permutations?
(c) How close are the estimated Jaccard similarities for the six pairs of columns to the true
Jaccard similarities?
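
The minhash computation can be sketched as below. The matrix itself is not reproduced here, so the two columns in the example are hypothetical placeholders. The code also assumes that rows are numbered 0 through 5, that each column is given as the set of rows holding a 1, and that each hash function is read as (ax + c) mod 6.

```python
HASHES = [
    lambda x: (2 * x + 1) % 6,  # h1
    lambda x: (3 * x + 2) % 6,  # h2
    lambda x: (5 * x + 2) % 6,  # h3
]


def minhash_signature(rows_with_ones, hashes=HASHES):
    """For each hash function, the smallest hash value over the rows where the column has a 1."""
    return [min(h(r) for r in rows_with_ones) for h in hashes]


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two columns agree."""
    agree = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return agree / len(sig_a)


def is_permutation(h, n_rows=6):
    """True if h maps the rows 0..n_rows-1 onto themselves without collisions."""
    return sorted(h(x) for x in range(n_rows)) == list(range(n_rows))


# Hypothetical columns, NOT the ones from the assignment's matrix.
col_a = {0, 3}
col_b = {1, 2, 5}
sig_a = minhash_signature(col_a)
sig_b = minhash_signature(col_b)
```

Replacing `col_a` and `col_b` with the actual columns and comparing `estimated_similarity` with the true Jaccard similarity covers part (c).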

8. Evaluate the following S-curve:

1 − (1 − s^r)^b

for s = 0.1, 0.2, ..., 0.9, for the following values of r and b:

(a) r = 3 and b = 10.
(b) r = 6 and b = 20.
(c) r = 5 and b = 50.
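
The evaluation is easy to tabulate programmatically. This sketch assumes the S-curve is the standard banding function 1 − (1 − s^r)^b, with r rows per band and b bands; `s_curve` and `tabulate` are illustrative names.

```python
def s_curve(s, r, b):
    """Assumed S-curve: probability 1 - (1 - s**r)**b for similarity s."""
    return 1.0 - (1.0 - s ** r) ** b


def tabulate(r, b):
    """Evaluate the curve at s = 0.1, 0.2, ..., 0.9."""
    return [(k / 10, s_curve(k / 10, r, b)) for k in range(1, 10)]


for r, b in [(3, 10), (6, 20), (5, 50)]:
    print(f"r={r}, b={b}")
    for s, p in tabulate(r, b):
        print(f"  s={s:.1f}  value={p:.4f}")
```

Generating s as `k / 10` rather than by repeated addition avoids accumulating floating-point error in the grid.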
