Worksheet
SRN: PES1UG21CS487
Marking scheme
1. Problem 1: Preprocessing - 2 marks
2. Problem 2: Item set detection and analysis - 3 marks
3. Problem 3: Association rule mining - 5 marks
Context
Welcome to "FreshEats Superstore", a budding supermarket. As a data analyst working with "FreashEats" , your mission is to uncover meaningful patterns within customer
transactions to enhance their shopping experience and help us compete with our competitor "Not-So-FreshEats".
Out[3]:
         0          1        2               3             4  ...         19
0   shrimp    almonds  avocado  vegetables mix  green grapes  ...  olive oil
1  burgers  meatballs     eggs             NaN           NaN  ...        NaN
2  chutney        NaN      NaN             NaN           NaN  ...        NaN
3   turkey    avocado      NaN             NaN           NaN  ...        NaN

[4 rows x 20 columns]

(Row 0 is the only 20-item basket: shrimp, almonds, avocado, vegetables mix, green grapes, whole weat flour, yams,
cottage cheese, energy drink, tomato juice, low fat yogurt, green tea, honey, salad, mineral water, salmon,
antioxydant juice, frozen smoothie, spinach, olive oil.)
In [4]: df.shape
Out[4]: (7501, 20)
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 7501 non-null object
1 1 5747 non-null object
2 2 4389 non-null object
3 3 3345 non-null object
4 4 2529 non-null object
5 5 1864 non-null object
6 6 1369 non-null object
7 7 981 non-null object
8 8 654 non-null object
9 9 395 non-null object
10 10 256 non-null object
11 11 154 non-null object
12 12 87 non-null object
13 13 47 non-null object
14 14 25 non-null object
15 15 8 non-null object
16 16 4 non-null object
17 17 4 non-null object
18 18 3 non-null object
19 19 1 non-null object
dtypes: object(20)
memory usage: 1.1+ MB
In [6]: df.describe()
Out[6]:
                   0              1              2              3          4             5  ...         19
count           7501           5747           4389           3345       2529          1864  ...          1
top    mineral water  mineral water  mineral water  mineral water  green tea  french fries  ...  olive oil

(The most frequent value is green tea for columns 6-9 and 11-13 and low fat yogurt for column 10; the most frequent
values of the sparse trailing columns include magazines, salmon, frozen smoothie, protein bar, and spinach.)
In [7]: print(df.isnull().sum())
0 0
1 1754
2 3112
3 4156
4 4972
5 5637
6 6132
7 6520
8 6847
9 7106
10 7245
11 7347
12 7414
13 7454
14 7476
15 7493
16 7497
17 7497
18 7498
19 7500
dtype: int64
Out[8]: items
2 [chutney]
3 [turkey, avocado]
... ...
7498 [chicken]
Successfully dealt with NaN values and transformed the dataset into a list of lists.
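The cell that performs this step is not visible in this export; a minimal sketch of how the NaN handling and the list-of-lists conversion might look (variable names other than item_lists, which the next cell reuses, are illustrative) is:

import pandas as pd

# Hypothetical sketch: keep only the purchased items in each row, dropping the
# NaN padding, so every transaction becomes one Python list.
item_lists = df.apply(lambda row: row.dropna().tolist(), axis=1).tolist()

# Optionally wrap the lists in a DataFrame for display, as in Out[8] above.
items = pd.DataFrame({'items': item_lists})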
from mlxtend.preprocessing import TransactionEncoder

# One-hot encode the transactions: one boolean column per distinct item.
te = TransactionEncoder()
te_ary = te.fit(item_lists).transform(item_lists)
items_df = pd.DataFrame(te_ary, columns=te.columns_)
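The cell that produced frequent_itemsets is not shown above; a minimal sketch using mlxtend's apriori, assuming the min_support=0.01 threshold quoted later in Problem 3 (the worksheet's actual value is not visible), would be:

from mlxtend.frequent_patterns import apriori

# Mine frequent itemsets from the one-hot encoded basket matrix; use_colnames=True
# keeps item names instead of column indices in the 'itemsets' column.
frequent_itemsets = apriori(items_df, min_support=0.01, use_colnames=True)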
In [10]: frequent_itemsets.head(10)
Out[10]:
    support     itemsets
0  0.087188    (burgers)
1  0.081056       (cake)
2  0.046794  (champagne)
3  0.059992    (chicken)
4  0.163845  (chocolate)
5  0.080389    (cookies)
7  0.179709       (eggs)
8  0.079323   (escalope)
1. "FreshEats" wants to replenish its stocks, help find the top 5 most popular (higher buying frequency) items/item_sets to replenish. Explain and justify the process followed to come
to the conclusion.
1) The Apriori algorithm is used to identify the frequent item sets. 2) The frequent item sets are sorted in descending order of their support values. 3) The top 5 item sets with the
highest support values are chosen.
sorted_fi = frequent_itemsets.sort_values('support', ascending=False)
top_5 = sorted_fi.head(5)
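Printing the selected rows then lists each item set alongside its support value, which the explanation below interprets:

# Show the five most frequent item sets together with their support values.
print(top_5[['itemsets', 'support']])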
1) Mineral Water: The support value of 0.238368 indicates that mineral water appears in approximately 23.84% of all transactions. It is the most popular individual item in the dataset.
2) Eggs: The support value of 0.179709 indicates that eggs appear in approximately 17.97% of all transactions. It is the second most popular individual item in the dataset.
3) Spaghetti: The support value of 0.174110 indicates that spaghetti appears in approximately 17.41% of all transactions. It is the third most popular individual item in the dataset.
4) French Fries: A support value of 0.170911 indicates that french fries appear in approximately 17.09% of all transactions, making it the fourth most popular individual item.
5) Chocolate: A support value of 0.163845 indicates that chocolate appears in approximately 16.38% of all transactions, making it the fifth most popular individual item.
Higher support values suggest that the corresponding items are more frequently purchased by customers at FreshEats. Therefore, "FreshEats" may replenish its stocks with mineral
water, eggs, spaghetti, french fries and chocolate.
1. Print out the association rules along with their confidence and lift. (Analyse the output structure of apriori())
(min_support=0.01, min_confidence = 0.045, min_lift=1.5, min_length=2)
In [12]: items_df.head()
Out[12]: (one-hot encoded transaction matrix: one boolean column per distinct item)
0 False True True False True False False False False False ... False True False False True False False True False False
1 False False False False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False True False False False False False ... True False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False True False False False
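The cell that builds the rules table is not shown above; a minimal sketch using mlxtend's association_rules (long-standing signature), assuming the thresholds quoted in the problem statement (min_confidence=0.045, then a lift filter of 1.5), would be:

from mlxtend.frequent_patterns import association_rules

# Generate rules whose confidence meets the threshold, then keep those with
# sufficiently high lift. Every rule already has at least one antecedent and one
# consequent item, so its combined length is >= 2.
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.045)
rules = rules[rules['lift'] >= 1.5]

print(rules[['antecedents', 'consequents', 'confidence', 'lift']])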
1. As the holiday season is approaching, "FreshEats" is considering providing discounts and offers on some of their products. Help them identify the top 5 popular pairs/sets of
items/item_sets bought, considering the probability of the consequent item being purchased when the antecedent item is bought (i.e., rank the rules by confidence).
top_5_rules = rules.sort_values('confidence', ascending=False).head(5)
print(top_5_rules[['antecedents', 'consequents', 'confidence', 'lift']])
1. Also help them identify the top 5 popular pairs/sets of items/item_sets bought together, considering the popularity of the consequent and antecedent items taken together
(i.e., rank the rules by the support of the combined itemset; consequent and antecedent items together form the pairs/sets specified in the question).
top_5_rules = rules.sort_values('support', ascending=False).head(5)
print(top_5_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])