0% found this document useful (0 votes)
25 views

apriori algorithm or market basket analysis _ kaggle

Uploaded by

Yanet Cesaire
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views

apriori algorithm or market basket analysis _ kaggle

Uploaded by

Yanet Cesaire
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 30

menu

Skip to
content

 Search

menu
Skip to
content

Create

 Search

Home

Competitions

Datasets

Code

Discussions

Courses

More

Your Work
Recently Viewed
Apriori Algorithm or Market Basket Analysis

Diabetes prediction

120 years of Olympic history: athletes and results

42nd Solution - Nothing Special

birds_hd5
Recently Edited
Basket Ball Game Predictions


0
View Active Events

Rocky suven datascience · 1y ago · 4466 views


Apriori Algorithm or Market Basket Analysis
Python · Ecommerce Dataset , Basket Optimisation
Apriori Algorithm or Market Basket Analysis
Notebook Data Logs Comments (3)

 39

Copy & Edit 114

more_vert
Run
32.9s

Version 9 of 9
pandas Business Python Data Analytics Statistical Analysis
+2

Market Basket Analysis


using assocition
rules

In this Notebook we would


learn

1. What is Association Rule Learning?

2. How Apriori works ?

3. Implementing Apriori With Python - in


2 ways

This notebook is made by Rocky Jagtiani


Head Content Development & Training at
Suven Consultants & Technology Pvt Ltd.
- Training & Recruitment Company.

Problem Statement :

When we go grocery shopping, we often


have a standard list of things to buy.
Each shopper has a distinctive list,
depending on one’s needs and
preferences. A housewife might buy
healthy ingredients for a family dinner,
while a bachelor might buy beer and
chips. Understanding these buying
patterns can help to increase sales in
several ways. If there is a pair of items, X
and Y, that are frequently bought
together:

Both X and Y can be placed on


the same shelf, so that buyers of
one item would be prompted to
buy the other.

Promotional discounts could be


applied to just one out of the two
items.

Advertisements on X could be
targeted at buyers who
purchase Y.

X and Y could be combined into


a new product, such as having Y
in flavors of X.

While we may know that certain


items are frequently bought
together, the question is, how
do we uncover these
associations?

Besides increasing sales profits,


association rules can also be used in
other fields. In medical diagnosis for
instance, understanding which symptoms
tend to co-morbid can help to improve
patient care and medicine prescription.

What is Association Rule


Learning?
Association Rule Learning is rule-based
learning for identifying the association
between different variables in a
database. One of the best and
most popular examples of
Association Rule Learning is
the Market Basket Analysis . The
problem analyses the association
between various items that has the
highest probability of being bought
together by a customer.

For example, the association rule,


{onions, chicken masala} => {chicken}
says that a person who has got both
onions and chicken masala in his or her
basket has a high probability of buying
chicken also.

Apriori Algorithm
The algorithm was first proposed in 1994
by Rakesh Agrawal and Ramakrishnan
Srikant. Apriori algorithm finds the most
frequent itemsets or elements in a
transaction database and identifies
association rules between the items just
like the above-mentioned example.

How Apriori works ?


To construct association rules between
elements or items, the algorithm
considers 3 important factors which are,
support, confidence and lift. Each of
these factors is explained as follows:

Support:

The support of item I is defined as the


ratio between the number of transactions
containing the item I by the total number
of transactions expressed as :

Support indicates how popular an


itemset is, as measured by the proportion
of transactions in which an itemset
appears. In Table 1 below, the support of
{apple} is 4 out of 8, or 50%. Itemsets
can also contain multiple items. For
instance, the support of {apple, beer,
rice} is 2 out of 8, or 25%.
If you discover that sales of items
beyond a certain proportion tend to have
a significant impact on your profits, you
might consider using that proportion as
your support threshold. You may then
identify itemsets with support values
above this threshold as significant
itemsets.

Confidence:

This is measured by the proportion of


transactions with item I1, in which item I2
also appears. The confidence between
two items I1 and I2, in a transaction is
defined as the total number of
transactions containing both items I1 and
I2 divided by the total number of
transactions containing I1. ( Assume I1 as
X , I2 as Y )

Confidence says how likely item Y is


purchased when item X is purchased,
expressed as {X -> Y}. This is measured
by the proportion of transactions with
item X, in which item Y also appears. In
Table 1, the confidence of {apple ->
beer} is 3 out of 4, or 75%.

One drawback of the confidence


measure is that it might misrepresent the
importance of an association. This is
because it only accounts for how popular
apples are, but not beers. If beers are
also very popular in general, there will be
a higher chance that a transaction
containing apples will also contain beers,
thus inflating the confidence measure. To
account for the base popularity of both
constituent items, we use a third measure
called lift.

Lift:

Lift is the ratio between the confidence


and support.

Lift says how likely item Y is purchased


when item X is purchased, while
controlling for how popular item Y is. In
Table 1, the lift of {apple -> beer} is
1,which implies no association between
items. A lift value greater than 1 means
that item Y is likely to be bought if item X
is bought, while a value less than 1
means that item Y is unlikely to be
bought if item X is bought. ( here X
represents apple and Y represents beer )

for Extra Reading : refer this

Implementing Apriori With


Python - in 2 ways

In the first way , we use apyori


package.
In [1]:

#External package need to instal


l
!pip install apyori

Collecting apyori
Downloading apyori-1.1.2.tar.g
z (8.6 kB)
Building wheels for collected pa
ckages: apyori
Building wheel for apyori (set
up.py) ... -​ ​\​ ​done
Created wheel for apyori: file
name=apyori-1.1.2-py3-none-any.w
hl size=5974 sha256=2fe313e79a86
7e64d0c823a9302d128463a1b37af9fb
0d71ba92e025f6c211a6
Stored in directory: ∕root∕.ca
che∕pip∕wheels∕cb∕f6∕e1∕57973c63
1d27efd1a2f375bd6a83b2a616c4021f
24aab84080
Successfully built apyori
Installing collected packages: a
pyori
Successfully installed apyori-1.
1.2
WARNING: You are using pip versi
on 20.3.1; however, version 21.0
.1 is available.
You should consider upgrading vi
a the '∕opt∕conda∕bin∕python3.7
-m pip install --upgrade pip' co
mmand.

In [2]:

#import all required packages..


import pandas as pd
import numpy as np
from apyori import apriori

In [3]:

#loading market basket dataset..

df = pd.read_csv('..∕input∕baske
t-optimisation∕Market_Basket_Opt
imisation.csv',header=None)

In [4]:

df.head()

Out[4]:

0 1 2 3

vegetables
0 shrimp almonds avocado
mix

1 burgers meatballs eggs NaN

2 chutney NaN NaN NaN

3 turkey avocado NaN NaN

mineral energy whole


4 milk
water bar wheat rice

In [5]:

## Data Cleaning step

# replacing empty value with 0.


df.fillna(0,inplace=True)

In [6]:

df.head()
Out[6]:

0 1 2 3

vegetables
0 shrimp almonds avocado
mix

1 burgers meatballs eggs 0

2 chutney 0 0 0

3 turkey avocado 0 0

mineral energy whole


4 milk
water bar wheat rice

In [7]:

# Data Pre-processing step

# for using aprori , need to con


vert data in list format..
# transaction = [['apple','almon
ds'],['apple'],['banana','apple'
]]....

transactions = []

for i in range(0,len(df)):
transactions.append([str(df.
values[i,j]) for j in range(0,20
) if str(df.values[i,j])!='0'])

In [8]:

## verifying - by printing the 0


th transaction
transactions[0]

Out[8]:

['shrimp',
'almonds',
'avocado',
'vegetables mix',
'green grapes',
'whole weat flour',
'yams',
'cottage cheese',
'energy drink',
'tomato juice',
'low fat yogurt',
'green tea',
'honey',
'salad',
'mineral water',
'salmon',
'antioxydant juice',
'frozen smoothie',
'spinach',
'olive oil']

In [9]:

## verifying - by printing the 1


st transaction
transactions[1]

Out[9]:

['burgers', 'meatballs', 'eggs']


In [10]:

# Call apriori function which re


quires minimum support, confidan
ce and lift, min length is combi
nation of item default is 2".
rules = apriori(transactions, mi
n_support=0.003, min_confidance=
0.2, min_lift=3, min_length=2)

## min_support = 0.003 -> means


selecting items with min support
of 0.3%
## min_confidance = 0.2 -> means
min confidance of 20%
## min_lift = 3
## min_length = 2 -> means no. o
f items in the transaction shoul
d be 2

In [11]:

#it generates a set of rules in


a generator file...
rules

Out[11]:

<generator object apriori at 0x7


f17980b5350>

In [12]:

# all rules need to be converted


in a list..
Results = list(rules)
Results

In [13]:

# convert result in a dataframe


for further operation...
df_results = pd.DataFrame(Result
s)

In [14]:

# as we see "order_statistics" ,
is itself a list so need to be
converted in proper format..
df_results.head()

Out[14]:

items support ordered_statistics

(cottage [((brownies), (cottage


0 cheese, 0.003466 cheese),
brownies) 0.102766798418...

(light [((chicken), (light


1 cream, 0.004533 cream),
chicken) 0.0755555555555555...

(mushroom
[((escalope), (mushroom
cream
2 0.005733 cream sauce),
sauce,
0.072268...
escalope)

[((escalope), (pasta),
(pasta,
3 0.005866 0.07394957983193277,
escalope)
4....

(fresh
[((fresh bread), (tomato
bread,
4 0.004266 juice),
tomato
0.09907120743...
juice)

In [15]:

# keep support in a separate dat


a frame so we can use later..
support = df_results.support
In [16]:

'''
convert orderstatistic in a prop
er format.
order statistic has lhs => rhs a
s well rhs => lhs
we can choose any one for convie
nce.
Let's choose first one which is
'df_results['ordered_statistics'
][i][0]'
'''

#all four empty list which will


contain lhs, rhs, confidance and
lift respectively.
first_values = []
second_values = []
third_values = []
fourth_value = []

# loop number of rows time and a


ppend 1 by 1 value in a separate
list..
# first and second element was f
rozenset which need to be conver
ted in list..
for i in range(df_results.shape[
0]):
single_list = df_results['or
dered_statistics'][i][0]
first_values.append(list(sin
gle_list[0]))
second_values.append(list(si
ngle_list[1]))
third_values.append(single_l
ist[2])
fourth_value.append(single_l
ist[3])

In [17]:

# convert all four list into dat


aframe for further operation..
lhs = pd.DataFrame(first_values)
rhs = pd.DataFrame(second_values
)

confidance=pd.DataFrame(third_va
lues,columns=['Confidance'])

lift=pd.DataFrame(fourth_value,c
olumns=['lift'])

In [18]:

# concat all list together in a


single dataframe
df_final = pd.concat([lhs,rhs,su
pport,confidance,lift], axis=1)
df_final

In [19]:

'''
we have some of place only 1 it
em in lhs and some place 3 or mo
re so we need to a proper repres
enation for User to understand.
replacing none with ' ' and com
bining three column's in 1
example : coffee,none,none is c
onverted to coffee, ,
'''
df_final.fillna(value=' ', inpla
ce=True)
df_final.head()
Out[19]:

0 1 0 1 2 support

cottage
0 brownies 0.003466
cheese

light
1 chicken 0.004533
cream

mushroom
2 escalope cream 0.005733
sauce

3 escalope pasta 0.005866

fresh tomato
4 0.004266
bread juice

In [20]:

#set column name


df_final.columns = ['lhs',1,'rhs
',2,3,'support','confidance','li
ft']
df_final.head()

Out[20]:

lhs 1 rhs 2 3 support

cottage
0 brownies 0.003466
cheese

light
1 chicken 0.004533
cream

mushroom
2 escalope cream 0.005733
sauce

3 escalope pasta 0.005866

fresh tomato
4 0.004266
bread juice

In [21]:

# add all three column to lhs it


emset only
df_final['lhs'] = df_final['lhs'
] + str(", ") + df_final[1]

df_final['rhs'] = df_final['rhs'
]+str(", ")+df_final[2] + str(",
") + df_final[3]

In [22]:

df_final.head()

Out[22]:

lhs 1 rhs 2 3 support

cottage
0 brownies, 0.003466
cheese, ,

light
1 chicken, 0.004533
cream, ,

mushroom
2 escalope, cream 0.005733
sauce, ,

3 escalope, pasta, , 0.005866

fresh tomato
4 0.004266
bread, juice, ,

In [23]:

#drop columns 1,2 and 3 because


now we already appended to lhs c
olumn.

df_final.drop(columns=[1,2,3],in
place=True)
In [24]:

#this is final output. You can s


ort based on the support lift an
d confidance..
df_final.head()

Out[24]:

lhs rhs support confidance

cottage
0 brownies, 0.003466 0.102767
cheese, ,

light
1 chicken, 0.004533 0.075556
cream, ,

mushroom
2 escalope, cream 0.005733 0.072269
sauce, ,

3 escalope, pasta, , 0.005866 0.073950

fresh tomato
4 0.004266 0.099071
bread, juice, ,

In [25]:

## Showing top 10 items, based o


n lift. Sorting in desc order
df_final.sort_values('lift', asc
ending=False).head(10)

Out[25]:

lhs rhs support confidance

mineral
water,
58 olive oil, whole 0.003866 0.058704
wheat
pasta,

fromage
6 honey, , 0.003333 0.245098
blanc,

spaghetti,
ground
49 tomato 0.003066 0.031208
beef,
sauce,

light
1 chicken, 0.004533 0.075556
cream, ,

3 escalope, pasta, , 0.005866 0.073950

french
ground
28 fries, herb 0.003200 0.032564
beef,
& pepper,

11 pasta, shrimp, , 0.005066 0.322034

herb &
ground
23 pepper, 0.003999 0.040706
beef,
chocolate,

mineral
frozen water,
69 0.003200 0.033566
vegetables, chocolate,
shrimp

whole
10 olive oil, wheat 0.007999 0.121457
pasta, ,

Other way of doing Apriori in


Python

In the Second way we would use


mlxtend.frequent_patterns
package.

Why we doing it in this way -

1. Limitation of first approach was


need to convert data in a list fomat.
In real life a store has many
thousands of transactations hence it
is computationally expensive.

2. Apyori package is outdated .

3. Results are coming in less friendly


format so their is a need for pre-
processing.

4. Instead lets use mlxtend


package. It generate frequent
itemset and association rules both.
In [26]:

'''
load apriori and association mod
ules from mlxtend.frequent_patte
rns
Used different dataset because m
lxtend need data in below format
.

transaction_name apple ba
nana grapes
transaction 1 0
1 1
2 1
0 1
3 1
0 0
4 0
1 0

we could have used above data a


s well but need to perform opera
tion to bring in this format ins
tead of that used seperate data
only.
'''

import pandas as pd
from mlxtend.frequent_patterns i
mport apriori
from mlxtend.frequent_patterns i
mport association_rules

df1 = pd.read_csv('..∕input∕ecom
merce-dataset∕data-2.csv', encod
ing="ISO-8859-1")
df1.head()

Out[26]:

InvoiceNo StockCode Description Quantity

WHITE
HANGING
0 536365 85123A HEART T- 6
LIGHT
HOLDER

WHITE
1 536365 71053 METAL 6
LANTERN

CREAM
CUPID
2 536365 84406B HEARTS 8
COAT
HANGER

KNITTED
UNION
3 536365 84029G FLAG HOT 6
WATER
BOTTLE

RED
WOOLLY
4 536365 84029E HOTTIE 6
WHITE
HEART.

In [27]:

# data has many country choose a


ny one for check.
df1.Country.value_counts().head(5
)

Out[27]:

United Kingdom 495478


Germany 9495
France 8557
EIRE 8196
Spain 2533
Name: Country, dtype: int64
In [28]:

# using only France country data


for now; can check for other a
s well..
df1 = df1[df1.Country == 'France
']

In [29]:

# some spaces are there in descr


iption; need to remove else late
r operation it will create probl
em..
df1['Description'] = df1['Descri
ption'].str.strip()

In [30]:

#some of transaction quantity is


negative which can not be possi
ble remove them.
df1 = df1[df1.Quantity >0]

In [31]:

#df1[df1.Country == 'France'].he
ad(10)
df1.head(10)

Out[31]:

InvoiceNo StockCode Description Quantity

ALARM
CLOCK
26 536370 22728 24
BAKELIKE
PINK

ALARM
CLOCK
27 536370 22727 24
BAKELIKE
RED

ALARM
CLOCK
28 536370 22726 12
BAKELIKE
GREEN

PANDA AND
BUNNIES
29 536370 21724 12
STICKER
SHEET

STARS GIFT
30 536370 21883 24
TAPE

INFLATABLE
31 536370 10002 POLITICAL 48
GLOBE

VINTAGE
HEADS AND
32 536370 21791 24
TAILS CARD
GAME

SET/2 RED
RETROSPOT
33 536370 21035 18
TEA
TOWELS

ROUND
SNACK
34 536370 22326 BOXES SET 24
OF4
WOODLAND

SPACEBOY
35 536370 22629 24
LUNCH BOX

In [32]:

# convert data in format which i


s required
# converting using pivot table a
nd Quantity sum as values. fill
0 if any nan values

basket = pd.pivot_table(data=df1
,index='InvoiceNo',columns='Desc
ription',values='Quantity', aggf
unc='sum',fill_value=0)

In [33]:

basket.head()
Out[33]:

10 12 12 EGG
COLOUR COLOURED HOUSE
Description
SPACEBOY PARTY PAINTED
PEN BALLOONS WOOD

InvoiceNo
536370 0 0 0

536852 0 0 0

536974 0 0 0

537065 0 0 0

537463 0 0 0

5 rows × 1563 columns

In [34]:

#this to check correctness after


binning it to 1 ..
basket['10 COLOUR SPACEBOY PEN']
.head(10)

Out[34]:

InvoiceNo
536370 0
536852 0
536974 0
537065 0
537463 0
537468 24
537693 0
537897 0
537967 0
538008 0
Name: 10 COLOUR SPACEBOY PEN, dt
ype: int64

In [35]:

# we dont need quantity sum


# we need either has taken or no
t
# so if user has taken that item
mark as 1 else mark as 0.

def convert_into_binary(x):
if x > 0:
return 1
else:
return 0

In [36]:

basket_sets = basket.applymap(co
nvert_into_binary)

In [37]:

# check : has quantity now conve


rted to 1 or 0.
basket_sets['10 COLOUR SPACEBOY
PEN'].head(10)

Out[37]:

InvoiceNo
536370 0
536852 0
536974 0
537065 0
537463 0
537468 1
537693 0
537897 0
537967 0
538008 0
Name: 10 COLOUR SPACEBOY PEN, dt
ype: int64
In [38]:

# remove postage item as it is j


ust a seal which almost all tran
saction contains.
print(basket_sets['POSTAGE'].head
())

basket_sets.drop(columns=['POSTA
GE'],inplace=True)

InvoiceNo
536370 1
536852 1
chevron_right
License
Table of Contents
This Notebook has been released under the Apache 2.0 open source license.
Continue exploring
Market Basket Analysis using
assocition rules

Data
2 input and 0 output

Logs
32.9 second run - successful

Comments
3 comments

chevron_right
License
Table of Contents
This Notebook has been released under the Apache 2.0 open source license.
Continue exploring
Market Basket Analysis using
assocition rules

Data
2 input and 0 output

Logs
32.9 second run - successful

Comments
3 comments

You might also like