0% found this document useful (0 votes)
67 views

Running RFM in Python

This document discusses performing RFM (Recency, Frequency, Monetary) analysis in Python. It loads customer transaction data, filters to a specific country, and calculates RFM scores by quantiling customers based on days since last purchase (Recency), number of purchases (Frequency), and total spending (Monetary). It combines these into an RFM score and identifies top customers with scores of 111, representing the lowest Recency, highest Frequency, and highest Monetary values.
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
67 views

Running RFM in Python

This document discusses performing RFM (Recency, Frequency, Monetary) analysis in Python. It loads customer transaction data, filters to a specific country, and calculates RFM scores by quantiling customers based on days since last purchase (Recency), number of purchases (Frequency), and total spending (Monetary). It combines these into an RFM score and identifies top customers with scores of 111, representing the lowest Recency, highest Frequency, and highest Monetary values.
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 7

Running RFM in Python

Importing Required Library

#import modules
import pandas as pd # for dataframes
import matplotlib.pyplot as plt # for plotting graphs
import seaborn as sns # for plotting graphs
import datetime as dt

Loading Dataset

data = pd.read_excel("C:\Users\siva\Desktop\Online_Retail.xlsx")

data.head()

data.tail( )

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo 541909 non-null object
StockCode 541909 non-null object
Description 540455 non-null object
Quantity 541909 non-null int64
InvoiceDate 541909 non-null datetime64[ns]
UnitPrice 541909 non-null float64
CustomerID 406829 non-null float64
Country 541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB

This material is not original work. This compilation draws heavily from various sources
data= data[pd.notnull(data['CustomerID'])]

Removing Duplicates

Sometimes you get a messy dataset. You may have to deal with duplicates, which will skew your
analysis. In python, pandas offer function drop_duplicates(), which drops the repeated or
duplicate records.
filtered_data=data[['Country','CustomerID']].drop_duplicates()

filtered_data.Country.value_counts()

United Kingdom 3950


Germany 95
France 87
Spain 31
Belgium 25
Switzerland 21
Portugal 19
Italy 15
Finland 12
Austria 11
Norway 10
Denmark 9
Netherlands 9
Australia 9
Channel Islands 9
Sweden 8
Japan 8
Cyprus 8
Poland 6
Unspecified 4
Canada 4
Israel 4
Greece 4
USA 4
EIRE 3
Bahrain 2
United Arab Emirates 2
Malta 2
Lithuania 1
Singapore 1
Iceland 1
Lebanon 1
RSA 1
Saudi Arabia 1
Czech Republic 1

This material is not original work. This compilation draws heavily from various sources
Brazil 1
European Community 1

filtered_data.Country.value_counts()[:10].plot(kind='bar')

filtered_data.Country.value_counts()[:5].plot(kind='bar')

To Filter data for United Kingdom customer


uk_data=data[data.Country=='United Kingdom']

The describe() function in pandas is convenient in getting various summary statistics. This
function returns the count, mean, standard deviation, minimum and maximum values and the
quantiles of the data.

uk_data.describe()

Quantity UnitPrice CustomerID

count 361878.000000 361878.000000 361878.000000

mean 11.077029 3.256007 15547.871368

std 263.129266 70.654731 1594.402590

This material is not original work. This compilation draws heavily from various sources
min -80995.000000 0.000000 12346.000000

25% 2.000000 1.250000 14194.000000

50% 4.000000 1.950000 15514.000000

75% 12.000000 3.750000 16931.000000

max 80995.000000 38970.000000 18287.000000

To remove the negative quantity


uk_data = uk_data[(uk_data['Quantity']>0)]
uk_data.describe()

Filter required Columns

Here, you can filter the necessary columns for RFM analysis. You only need her five columns
CustomerID, InvoiceDate, InvoiceNo, Quantity, and UnitPrice. CustomerId will uniquely define
your customers, InvoiceDate help you calculate recency of purchase, InvoiceNo helps you to
count the number of time transaction performed(frequency). Quantity purchased in each
transaction and UnitPrice of each unit purchased by the customer will help you to calculate the
total purchased amount.
uk_data=uk_data[['CustomerID','InvoiceDate','InvoiceNo','Quantity','UnitPrice'
]]
uk_data['TotalPrice'] = uk_data['Quantity'] * uk_data['UnitPrice']

uk_data['InvoiceDate'].min(),uk_data['InvoiceDate'].max()

(Timestamp('2010-12-01 08:26:00'), Timestamp('2011-12-09 12:49:00'))

PRESENT = dt.datetime(2011,12,10)
uk_data['InvoiceDate'] = pd.to_datetime(uk_data['InvoiceDate'])
uk_data.head()

CustomerID InvoiceDate InvoiceNo Quantity UnitPrice TotalPrice

0 17850.0 2010-12-01 08:26:00 536365 6 2.55 15.30

1 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34

2 17850.0 2 010-12-01 08:26:00 536365 8 2.75 22.00

3 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34

This material is not original work. This compilation draws heavily from various sources
4 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34

RFM Analysis

Here, you are going to perform following opertaions:

 For Recency, Calculate the number of days between present date and date of last
purchase each customer.
 For Frequency, Calculate the number of orders for each customer.
 For Monetary, Calculate sum of purchase price for each customer.

rfm= uk_data.groupby('CustomerID').agg({'InvoiceDate': lambda date: (PRESENT -


date.max()).days,'InvoiceNo': lambda num: len(num),'TotalPrice': lambda price:
price.sum()})

rfm.columns
Index(['InvoiceDate', 'TotalPrice', 'InvoiceNo'], dtype='object')

# Change the name of columns


rfm.columns=['recency','frequency','monetary']
rfm['recency'] = rfm['recency'].astype(int)
rfm.head()

recency frequency monetary


CustomerID
12346.0 325 1 77183.60
12747.0 2 103 4196.01
12748.0 0 4596 33719.73
12749.0 3 199 4090.88
12820.0 3 59 942.34

Computing Quantile of RFM values

Customers with the lowest recency, highest frequency and monetary amounts considered as top
customers.

qcut() is Quantile-based discretization function. qcut bins the data based on sample quantiles. For
example, 1000 values for 4 quantiles would produce a categorical object indicating quantile
membership for each customer.

rfm['r_quartile'] = pd.qcut(rfm['recency'], 4, ['1','2','3','4'])

This material is not original work. This compilation draws heavily from various sources
rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, ['4','3','2','1'])
rfm['m_quartile'] = pd.qcut(rfm['monetary'], 4, ['4','3','2','1'])
rfm.head()

Recency frequency monetary r_quartile f_quartile m_quartile


CustomerID
12346.0 325 1 77183.60 4 4 1
12747.0 2 103 4196.01 1 1 1
12748.0 0 4596 33719.73 1 1 1
12749.0 3 199 4090.88 1 1 1
12820.0 3 59 942.34 1 2 2

RFM Result Interpretation

Combine all three quartiles(r_quartile,f_quartile,m_quartile) in a single column, this rank will


help you to segment the customers well group.

rfm['RFM_Score'] = rfm.r_quartile.astype(str)+ rfm.f_quartile.astype(str) +


rfm.m_quartile.astype(str)
rfm.head()

# Filter out Top/Best cusotmers


rfm[rfm['RFM_Score']=='111'].sort_values('monetary', ascending=False).head()

This material is not original work. This compilation draws heavily from various sources
This material is not original work. This compilation draws heavily from various sources

You might also like