Running RFM in Python
#import modules
import pandas as pd # for dataframes
import matplotlib.pyplot as plt # for plotting graphs
import seaborn as sns # for plotting graphs
import datetime as dt
Loading Dataset
# raw string, so the backslashes in the Windows path are not read as escape sequences
data = pd.read_excel(r"C:\Users\siva\Desktop\Online_Retail.xlsx")
data.head()
data.tail()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo 541909 non-null object
StockCode 541909 non-null object
Description 540455 non-null object
Quantity 541909 non-null int64
InvoiceDate 541909 non-null datetime64[ns]
UnitPrice 541909 non-null float64
CustomerID 406829 non-null float64
Country 541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
This material is not original work. This compilation draws heavily from various sources
# keep only the rows that have a CustomerID
data = data[pd.notnull(data['CustomerID'])]
Removing Duplicates
Sometimes you get a messy dataset. You may have to deal with duplicates, which will skew your
analysis. In Python, pandas offers the drop_duplicates() function, which drops repeated or
duplicate records.
filtered_data=data[['Country','CustomerID']].drop_duplicates()
filtered_data.Country.value_counts()
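As a minimal sketch of what the two lines above do, the following uses a small made-up frame (the rows and the name `toy` are illustrative assumptions, not part of the dataset). Dropping duplicate (Country, CustomerID) pairs first means value_counts() counts customers per country rather than invoice lines.

```python
import pandas as pd

# Toy stand-in for the retail data: customer 1.0 appears on two invoice lines.
toy = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "France", "United Kingdom"],
    "CustomerID": [1.0, 1.0, 2.0, 3.0],
    "InvoiceNo": ["A1", "A2", "B1", "C1"],
})

# Keep one row per (Country, CustomerID) pair ...
unique_customers = toy[["Country", "CustomerID"]].drop_duplicates()

# ... so that each customer is counted once per country.
customers_per_country = unique_customers["Country"].value_counts()
print(customers_per_country)
```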
Brazil 1
European Community 1
filtered_data.Country.value_counts()[:10].plot(kind='bar')
filtered_data.Country.value_counts()[:5].plot(kind='bar')
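The code that follows works on a DataFrame named uk_data. One way to obtain it, assuming it holds only the United Kingdom rows of data, is a boolean filter on Country. The sketch uses a made-up three-row frame in place of the real data.

```python
import pandas as pd

# Toy stand-in for `data`; the real frame comes from Online_Retail.xlsx.
data = pd.DataFrame({
    "Country": ["United Kingdom", "France", "United Kingdom"],
    "CustomerID": [17850.0, 12583.0, 13047.0],
    "Quantity": [6, 24, 6],
})

# Assumption: uk_data is the subset of rows whose Country is "United Kingdom".
# .copy() returns an independent frame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning.
uk_data = data[data["Country"] == "United Kingdom"].copy()
print(uk_data)
```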
The describe() function in pandas is a convenient way to get summary statistics. For each
numeric column it returns the count, mean, standard deviation, minimum and maximum values, and
the 25th, 50th and 75th percentiles.
uk_data.describe()
min -80995.000000 0.000000 12346.000000
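To make the describe() output concrete, here is a minimal sketch on a made-up four-row frame (all values are invented for illustration). A negative minimum Quantity, like the -80995 in the output above, is often read as a return or cancellation. Dropping such rows is a common choice, but it is an assumption here and not a step this walkthrough performs.

```python
import pandas as pd

# Toy frame with one negative quantity, mimicking a returned item.
toy = pd.DataFrame({
    "Quantity": [6, 8, -2, 12],
    "UnitPrice": [2.55, 3.39, 2.55, 1.25],
})

# describe() returns one row per statistic and one column per numeric column.
summary = toy.describe()
print(summary.loc[["count", "mean", "min", "max"]])

# Hypothetical cleaning step: keep only rows with a positive quantity.
positive_only = toy[toy["Quantity"] > 0]
```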
Here, you can filter the columns needed for RFM analysis. You only need five columns:
CustomerID, InvoiceDate, InvoiceNo, Quantity, and UnitPrice. CustomerID uniquely identifies
your customers, InvoiceDate helps you calculate the recency of purchase, and InvoiceNo helps you
count the number of transactions performed (frequency). The Quantity purchased in each
transaction and the UnitPrice of each unit purchased by the customer help you calculate the
total purchase amount.
uk_data = uk_data[['CustomerID', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'UnitPrice']]
uk_data['TotalPrice'] = uk_data['Quantity'] * uk_data['UnitPrice']
uk_data['InvoiceDate'].min(),uk_data['InvoiceDate'].max()
PRESENT = dt.datetime(2011,12,10)
uk_data['InvoiceDate'] = pd.to_datetime(uk_data['InvoiceDate'])
uk_data.head()
4 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
RFM Analysis
For Recency, calculate the number of days between the present date and the date of each
customer's last purchase.
For Frequency, calculate the number of orders for each customer.
For Monetary, calculate the sum of the purchase prices for each customer.
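The three calculations above can be sketched with a single groupby on CustomerID. This is a minimal sketch on a made-up frame, not necessarily the exact code used to build rfm. Two assumptions are labeled in the comments: orders are counted as distinct InvoiceNo values, and the frame `toy` stands in for uk_data.

```python
import datetime as dt
import pandas as pd

PRESENT = dt.datetime(2011, 12, 10)  # reference date used in this walkthrough

# Toy stand-in for uk_data: customer 1.0 has two orders, customer 2.0 has one.
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0],
    "InvoiceDate": pd.to_datetime(["2011-12-01", "2011-12-08", "2011-11-10"]),
    "InvoiceNo": ["A1", "A2", "B1"],
    "TotalPrice": [10.0, 5.5, 20.0],
})

rfm = toy.groupby("CustomerID").agg({
    # Recency: days between PRESENT and the customer's latest purchase.
    "InvoiceDate": lambda dates: (PRESENT - dates.max()).days,
    # Monetary: total amount spent by the customer.
    "TotalPrice": "sum",
    # Frequency: number of orders (assumption: one order per distinct InvoiceNo).
    "InvoiceNo": "nunique",
})
print(rfm)
```

The result keeps the source column names, so the columns still need to be renamed before they read as recency, monetary and frequency.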
rfm.columns
Index(['InvoiceDate', 'TotalPrice', 'InvoiceNo'], dtype='object')
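The quartile code later in this section refers to columns named frequency and monetary, so the aggregated columns have to be renamed at some point. A minimal sketch of that step, assuming the mapping InvoiceDate to recency, InvoiceNo to frequency and TotalPrice to monetary (the toy values are invented):

```python
import pandas as pd

# Toy frame with the same column names that rfm.columns reports above.
rfm = pd.DataFrame(
    {"InvoiceDate": [2, 30], "TotalPrice": [15.5, 20.0], "InvoiceNo": [2, 1]},
    index=pd.Index([1.0, 2.0], name="CustomerID"),
)

# Rename by mapping, so the result does not depend on the column order.
rfm = rfm.rename(columns={
    "InvoiceDate": "recency",
    "InvoiceNo": "frequency",
    "TotalPrice": "monetary",
})
print(rfm.columns.tolist())
```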
Customers with the lowest recency and the highest frequency and monetary amounts are considered
top customers.
qcut() is a quantile-based discretization function: it bins the data based on sample quantiles.
For example, binning 1000 values into 4 quantiles produces a categorical object indicating the
quantile membership of each value.
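The 1000-value example can be sketched directly. The series below is made up (the integers 0 to 999), and the labels "1" to "4" are an illustrative choice.

```python
import numpy as np
import pandas as pd

# 1000 distinct values: the integers 0..999.
values = pd.Series(np.arange(1000))

# Split into 4 quantile-based bins; labels are assigned from lowest to highest bin.
quartiles = pd.qcut(values, 4, labels=["1", "2", "3", "4"])

# Each bin holds the same number of values because the bins follow the quantiles.
print(quartiles.value_counts().sort_index())
```

Because the labels are assigned from the lowest bin upward, reversing the label list reverses the scoring direction, which is how a high frequency or monetary value can be given the label "1".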
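# labels run from '4' (lowest quartile) to '1' (highest quartile), so '1' marks the best customers
rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, labels=['4', '3', '2', '1'])
rfm['m_quartile'] = pd.qcut(rfm['monetary'], 4, labels=['4', '3', '2', '1'])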
rfm.head()
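Putting the quartiles together, top customers are those in quartile '1' on all three measures. The sketch below is hedged in two ways: the eight-customer frame is invented, and the r_quartile column is an assumption that mirrors the frequency and monetary columns with the label order flipped, since a low recency is better.

```python
import pandas as pd

# Toy RFM table for eight hypothetical customers.
rfm = pd.DataFrame(
    {
        "recency": [2, 30, 90, 200, 5, 60, 150, 300],
        "frequency": [40, 12, 5, 1, 30, 8, 3, 2],
        "monetary": [900.0, 300.0, 120.0, 20.0, 700.0, 200.0, 60.0, 35.0],
    },
    index=pd.Index(range(1, 9), name="CustomerID"),
)

# Assumption: recency is scored the other way round, so the lowest quartile gets '1'.
rfm["r_quartile"] = pd.qcut(rfm["recency"], 4, labels=["1", "2", "3", "4"])
rfm["f_quartile"] = pd.qcut(rfm["frequency"], 4, labels=["4", "3", "2", "1"])
rfm["m_quartile"] = pd.qcut(rfm["monetary"], 4, labels=["4", "3", "2", "1"])

# Top customers: best quartile on recency, frequency and monetary at once.
top = rfm[
    (rfm["r_quartile"] == "1")
    & (rfm["f_quartile"] == "1")
    & (rfm["m_quartile"] == "1")
]
print(top)
```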