
Technical Article

Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining
Received (in revised form): 18th July 2012

Daqing Chen
is a senior lecturer in the Department of Informatics, Faculty of Business, London South Bank University, London, UK. He mainly
lectures in data mining and business intelligence on BSc and MSc courses. His research interests include data mining, data-driven
marketing and customer-centric business intelligence. In recent years he has been engaged in several business-oriented data
mining projects across various business sectors.

Sai Laing Sain
is currently a BSc student in the Department of Informatics, Faculty of Business, London South Bank University, London, UK.

Kun Guo
is currently a PhD student in the Department of Civil Engineering, Faculty of Engineering, Science, and the Built Environment at
London South Bank University, London, UK. His academic interests include numerical modelling, artificial intelligence algorithms
and data mining.

ABSTRACT Many small online retailers and new entrants to the online retail sector are
keen to practice data mining and consumer-centric marketing in their businesses yet
technically lack the necessary knowledge and expertise to do so. In this article a case
study of using data mining techniques in customer-centric business intelligence for an
online retailer is presented. The main purpose of this analysis is to help the business
better understand its customers and therefore conduct customer-centric marketing more
effectively. On the basis of the Recency, Frequency, and Monetary model, customers of
the business have been segmented into various meaningful groups using the k-means
clustering algorithm and decision tree induction, and the main characteristics of the
consumers in each segment have been clearly identified. Accordingly a set of
recommendations is further provided to the business on consumer-centric marketing.
SAS Enterprise Guide and SAS Enterprise Miner are used in the present study.
Journal of Database Marketing & Customer Strategy Management (2012) 19, 197–208.
doi:10.1057/dbm.2012.17; published online 27 August 2012

Keywords: online retail; customer-centric marketing; data mining; customer segmentation; RFM model; k-means clustering

Correspondence:
Daqing Chen
Department of Informatics,
Faculty of Business, London South Bank University,
London, UK
E-mail: [email protected]

© 2012 Macmillan Publishers Ltd. 1741-2439 Database Marketing & Customer Strategy Management Vol. 19, 3, 197–208
www.palgrave-journals.com/dbm/

INTRODUCTION
For the past 10 years, we have witnessed a steady and strong increase in online retail sales. According to the Interactive Media in Retail Group (IMRG), online shoppers in the United Kingdom spent an estimated £50 billion in 2011, a more than 5000 per cent increase compared with 2000 [1]. This remarkable increase in online sales indicates that the way consumers shop for and use financial services has fundamentally changed.

Compared with traditional shopping in retail stores, online shopping has some unique characteristics: each customer’s shopping process and activities can be tracked instantaneously and accurately, each customer’s order is usually associated with a delivery address and a billing address, and each customer has an online store account with essential contact and payment information. These desirable characteristics have enabled online retailers to treat each customer as an individual, with a personalized understanding of each customer, and to build customer-centric business intelligence on that understanding.

In relation to customer-centric business intelligence, online retailers usually face the following common business concerns:

• Which items/products’ web pages has a customer visited? How long has a customer stayed with each web page, and in which sequence has a customer visited a set of products’ web pages?
• Who are the most/least valuable customers to the business? What are their distinct characteristics?
• Who are the most/least loyal customers, and how are they characterized?
• What are customers’ purchase behaviour patterns? Which products/items have customers often purchased together? In which sequence have the products been purchased?
• Which types of customers are more likely to respond to a certain promotion mailing?
• What are the sales patterns in terms of various perspectives such as products/items, regions and time (weekly, monthly, quarterly, yearly and seasonally), and so on?

In order to address these business concerns, data mining techniques have been widely adopted across the online retail sector, coupled with a set of well-known business metrics about customers’ profitability and value, for instance, the recency, frequency and monetary (RFM) model [2] and the customer life value model [3]. For many online retailers in the United Kingdom and internationally alike, especially the leading companies including Amazon, Walmart, Tesco, Sainsbury’s, Argos, Marks and Spencer, John Lewis, and EasyJet, data mining has now become a common practice and an integral part of the business processes in creating customer-centric business intelligence and supporting customer-centric marketing [4,5].

Although many famous online retail brands are embracing data mining techniques as crucial tools to gain competitive advantages on the market, there are still many smaller ones and new entrants that are keen to practise consumer-centric marketing yet technically lack the necessary knowledge and expertise to do so.

In this article a case study of using data mining techniques in customer-centric business intelligence for an online retailer is presented. The online retailer considered here is a typical one: a small business and a relatively new entrant to the online retail sector, aware of the growing importance of being analytical in today’s online businesses and of data mining techniques, but lacking technical awareness and resources. The main purpose of this analysis is to help the business better understand its customers and therefore conduct customer-centric marketing more effectively. On the basis of the RFM model, customers of the business have been segmented into various meaningful groups using the k-means clustering algorithm and decision tree induction, and the main characteristics of the consumers in each segment have been clearly identified. Accordingly, a set of recommendations is provided to the business on customer-centric marketing and further data analysis tasks. The analysis is developed in a step-by-step way. SAS Enterprise Guide and SAS Enterprise Miner [6–9] have been employed in this study.

The rest of this article is organized as follows. The next section provides the background information about the online retailer studied in the article along with the associated dataset to be explored. The section after that discusses in detail the main steps and tasks for data pre-processing in order to create an appropriate target dataset for the required further analyses. In the subsequent section the k-means clustering analysis is performed and a set of meaningful clusters and segments of the target dataset is identified. A detailed discussion on each of the clusters is given, and the segmentation is further refined by using decision tree induction. The penultimate section summarizes the essential consumer-centric business intelligence based on the analysis results, and provides some concrete recommendations to the online retailer aiming at maximizing profits for the business. Finally the concluding remarks are given in the last section.

BUSINESS BACKGROUND AND THE ASSOCIATED DATA
The online retailer under consideration in this article is a UK-based and registered non-store business with some 80 members of staff. The company was established in 1981, mainly selling unique all-occasion gifts. For years in the past, the merchant relied heavily on direct mailing catalogues, and orders were taken over phone calls. It was only 2 years ago that the company launched its own web site and shifted completely to the Web. Since then the company has maintained a steady and healthy number of customers from all parts of the United Kingdom and Europe, and has accumulated a huge amount of data about many customers. The company also uses Amazon.co.uk to market and sell its products.

The customer transaction dataset held by the merchant has 11 variables as shown in Table 1, and it contains all the transactions occurring in years 2010 and 2011. It should be noted that the variable PostCode is essential for the business as it provides vital information that makes each individual consumer recognizable and trackable, and therefore it makes some in-depth analyses possible in the present study.

Table 1: Variables in the customer transaction dataset (4381 instances)

Variable name | Data type | Description; typical values and meanings
Invoice | Nominal | Invoice number; a 6-digit integral number uniquely assigned to each transaction
StockCode | Nominal | Product (item) code; a 5-digit integral number uniquely assigned to each distinct product
Description | Nominal | Product (item) name; CARD I LOVE LONDON
Quantity | Numeric | The quantities of each product (item) per transaction
Price | Numeric | Product price per unit in sterling; £45.23
InvoiceDate | Numeric | The day and time when each transaction was generated; 31/05/2011 15:59
Address Line 1 | Nominal | Delivery address line 1; 103 Borough Road
Address Line 2 | Nominal | Delivery address line 2; Elephant and Castle
Address Line 3 | Nominal | Delivery address line 3; London
PostCode | Nominal | Delivery address postcode, mainly for consumers from the UK; SE1 0AA
Country | Nominal | Delivery address country; England

As the first ever pilot study for the business to generate sensible customer intelligence, only the transactions created from 1 January 2011 to 31 December 2011 are explored in this article. Over that particular period, there were 22 190 valid transactions in total, associated with 4381 valid distinct postcodes. Corresponding to these transactions, there are 406 830 instances (record rows) in the dataset, each for a particular item contained in a transaction. On average, each postcode is associated with five transactions, that is, each customer has purchased a product from the online retailer about once every 2 months. In addition, only consumers from the United Kingdom are analysed.

It is interesting to notice that the average number of distinct products (items) contained in each transaction occurring in 2011 was 18.3 ( = 406 830/22 190). This seems to suggest that many of the consumers of the business were organizational customers rather than individual customers.

DATA PRE-PROCESSING
In order to conduct the required RFM model-based clustering analysis, the original dataset needs to be pre-processed. The main steps and relevant tasks involved in the data preparation are as follows:

1. Select appropriate variables of interest from the given dataset. In our case the following six variables have been chosen: Invoice, StockCode, Quantity, Price, InvoiceDate and PostCode.
2. Create an aggregated variable named Amount, by multiplying Quantity with Price, which gives the total amount of money spent per product/item in each transaction.
3. Separate the variable InvoiceDate into two variables Date and Time. This allows different transactions created by the same consumer on the same day but at different times to be treated separately.
4. Filter out any transactions that do not have an associated postcode. This resolves any missing value issues in relation to the variable PostCode. In addition, filter out any transactions that are not associated with a United Kingdom postcode.
5. Sort the dataset by PostCode and create three essential aggregated variables Recency, Frequency and Monetary. Calculate the values of these variables per postcode.

Following these steps a target dataset for the analysis has been generated. The original dataset was in MS Excel format, and was transformed into the final target dataset in SAS format in SAS Enterprise Guide 4.2. Part of the target dataset is shown in Figure 1, and the variables in the target dataset and their statistics are described in Tables 2 and 3. The SAS procedures proc means and proc sql were used to transform the dataset and to calculate the values for the variables Recency, Frequency and Monetary, for each given postcode, respectively. As an example, Table 4 gives the relevant SAS code utilized to calculate the values for Monetary. Finally the target dataset was uploaded into SAS Enterprise Miner 6.2 for analysis.

Figure 1: Samples of the target dataset.

Table 2: Variables in the target dataset

Variable name | Data type | Description
Buyer | Nominal | Corresponding to each distinct postcode
Recency | Numeric | Recency in month
First_Purchase | Numeric | Time in month since the first purchase in 2011
Frequency | Numeric | Frequency of purchase per postcode
Monetary | Numeric | Total amount spent per postcode
Minimum | Numeric | Minimum spending per postcode
Maximum | Numeric | Maximum spending per postcode
Mean | Numeric | Median spending per postcode

Table 3: Summary of the target dataset (3799 instances)

Variable name | Minimum | Median | Maximum
Recency | 0 | 3.2 | 12
Frequency | 1 | 4.9 | 169
Monetary | 3.75 | 1586.63 | 88 125.38
First_Purchase | 0 | 7.5 | 12

Table 4: Sample SAS codes for calculating values of monetary

    proc means data=YourLibraryName.SortedOriginalDataset n sum min max mean;
        var Amount;
        by Postcode;
        output out=YourLibraryName.TargetDatasetMonetary (drop=_type_ _freq_)
            n=n sum=sum min=min max=max mean=mean;
    run;

RFM MODEL-BASED CLUSTERING ANALYSIS

Clustering
With the prepared target dataset we intended to identify whether consumers can be segmented meaningfully in terms of recency, frequency and monetary values. The k-means clustering algorithm was employed for this purpose, and it can be easily performed by using the Cluster node in SAS Enterprise Miner.

As is well known, the k-means clustering algorithm is very sensitive to a dataset that contains outliers (anomalies) or variables that are of incomparable scales or magnitudes. Examining the histograms of the variables Recency, Frequency and Monetary of the target dataset in SAS Enterprise Miner, as illustrated in Figure 2, it is evident that there are a few instances having quite different monetary and frequency values compared to the majority of the instances in the dataset. These instances are valid from the business point of view as they are genuine transaction records; however, they are outliers from the data analysis point of view. Therefore, these instances should be isolated from the majority and treated separately. In addition, the three variables are not on comparable scales, and the value ranges are quite different: Recency [0, 12], Frequency [1, 169] and Monetary [3, 88 125], respectively. As such, these variables should be normalized before the clustering analysis.

Figure 2: Distribution of the variables Recency, Frequency and Monetary.

On the basis of the initial insight into the dataset, a project diagram has been set up in SAS Enterprise Miner for the clustering analysis as depicted in Figure 3. There are four nodes in the diagram. In the Data Sources (Target Dataset) node, the three variables Recency, Frequency and Monetary were chosen as input for the clustering analysis. The Filter node was set to exclude from the analysis any instances having a rare value for any variables involved, and the minimum cutoff value for rare values was set to 1 per cent of the total number of instances under consideration. For example, out of the total 3799 instances, there was only one instance taking a monetary value of more than £87 684, and therefore, that instance was excluded from the analysis. Overall, a total of 73 instances were excluded by the Filter node, and the summary of the resultant filtered target dataset is given in Table 5. In the Cluster node, the standard range transformation for normalization was used with the number of clusters specified as 3, 4 and 5, respectively, and finally, the Segment Profile node was used to help interpret each cluster found.

Figure 3: Project diagram in SAS Enterprise Miner 6.2.

Table 5: Summary of the filtered target dataset (3726 instances)

Variable name | Minimum | Median | Maximum
Recency | 0 | 3.2 | 12
Frequency | 1 | 4.1 | 28
Monetary | 3.75 | 1565.70 | 13 110.02
First_Purchase | 0 | 7.5 | 12

The clustering and segment results with five clusters are shown in Tables 6 and 7, and the distribution of the instances within each cluster is detailed in Figures 4 and 5. This segmentation by five clusters seems to have a clearer interpretation of the target dataset than the ones by three and four clusters.

Table 6: Instances in each cluster

Cluster | Frequency of cluster | Percentage
1 | 527 | 14.14
2 | 636 | 17.07
3 | 1748 | 46.91
4 | 627 | 16.83
5 | 188 | 5.05

Table 7: Statistics of each cluster

Cluster | Variable | Minimum | Median | Maximum
1 | Recency | 8 | 9.8 | 12
1 | Frequency | 1 | 1.3 | 4
1 | Monetary | 3.75 | 361.20 | 7741.47
1 | First_Purchase | 8 | 11.1 | 12
2 | Recency | 4 | 5.4 | 7
2 | Frequency | 1 | 2.3 | 13
2 | Monetary | 15 | 586.19 | 3906.27
2 | First_Purchase | 4 | 7.7 | 12
3 | Recency | 0 | 1.5 | 3
3 | Frequency | 1 | 2.6 | 7
3 | Monetary | 20.8 | 685.71 | 4314.72
3 | First_Purchase | 0 | 5.3 | 12
4 | Recency | 0 | 1.0 | 5
4 | Frequency | 3 | 8.3 | 16
4 | Monetary | 191.17 | 2425.09 | 7330.8
4 | First_Purchase | 1 | 1.0 | 12
5 | Recency | 0 | 0.7 | 6
5 | Frequency | 3 | 17.7 | 28
5 | Monetary | 1641.48 | 5962.85 | 13 110.02
5 | First_Purchase | 0 | 11.1 | 12

Understanding the clusters
Interpreting and understanding each cluster identified is crucial in generating customer-centric business intelligence. Examining Table 7 and Figures 4 and 5, it is interesting to see that each cluster indeed contains a group of consumers that have certain distinct and intrinsic features as detailed below.


Figure 4: (a) Distribution of all instances coloured for different clusters. (b) Distribution of the instances in
cluster 1. (c) Distribution of the instances in cluster 2. (d) Distribution of the instances in cluster 3. (e) Distribution
of the instances in cluster 4. (f) Distribution of the instances in cluster 5.

Cluster 1 relates to some 527 consumers, making up 14.14 per cent of the whole population. This group seems to be the least profitable group as none of the customers in this group purchased anything in the second half of the year. Even in the first half of the year, these consumers did not shop often, and the average value of frequency was only 1.3.

In contrast to the customers in cluster 1, the 188 customers in cluster 5 mainly started shopping with the online retailer at the beginning of the year, and continued to the end of the year with an average recency value of 0.7. They purchased quite often and, as a result, spent quite a high amount of money. This group of consumers can be categorized as very high recency, very high frequency and very high monetary with a high spending per consumer. In fact, those 188 consumers contributed 25.5 per cent of the total sales in the year. This group, although the smallest (making up only 5.05 per cent of the whole population), seems to be the most profitable group.

Cluster 4 contains some 627 consumers with a very high value for frequency and monetary, although lower than those of cluster 5. This group seems to be the second most profitable group.

There are some 636 consumers in cluster 2. Compared with clusters 4 and 5, this group of customers has a lower frequency throughout the year and a significantly smaller average value of monetary, indicating a much smaller amount of spending per consumer. This group can be categorized as low recency, high frequency and medium monetary with a medium spending per consumer.

Cluster 3 is the largest group with 1748 consumers. Consumers in this group have a reasonable value of frequency. Compared with clusters 2 and 4, this group has a lower but reasonable value of monetary as the group includes many newly registered consumers who started shopping with the retailer very recently. This group seems to represent ordinary consumers and therefore has a certain level of uncertainty in terms of profitability. In the long-term view, some of these consumers might potentially become very highly profitable, or not profitable at all.

Figure 5: (a) Distribution of recency by cluster. (b) Distribution of frequency by cluster. (c) Distribution of monetary by cluster. (d) Distribution of first purchase by cluster.

We use Figure 6 to summarize the analysis made so far: in the whole population of the consumers, 47 per cent were ordinary shoppers with reasonable spending and frequency, about 34 per cent were medium to high profit, 5 per cent were extremely high profit, and the remaining 14 per cent were extremely low profit. About 22 per cent of the consumers contributed roughly 60 per cent of the total sales. Overall the business seems to be quite healthy in terms of profitability.

Figure 6: Customer segmentation (left) and associated sales (right) by cluster.

Enhancing clustering analysis using decision tree
As discussed above, cluster 3 is the most diverse cluster among the five identified clusters in the sense that it contains both newly registered and old customers. To refine the segmentation of the instances in this cluster, a decision tree has been used to create some nested segments inside the cluster, as shown in Figure 7. In other words, these nested segments form some sub-clusters inside cluster 3, and make it possible to categorize the consumers concerned into some sensible sub-categories. For example, as shown in Figure 7, the customers can be divided into such categories as frequency more than 2.5 with an average monetary value of 990.66; and frequency more than 2.5 and less than 3.5 with an average monetary value of 1056.70 and so on. Also, it is interesting to note that the relationship between frequency and monetary seems to be a monotonic linear relationship.

Figure 7: Refined segmentation of the instances in cluster 3 using decision tree induction.

CUSTOMER-CENTRIC BUSINESS INTELLIGENCE AND RECOMMENDATIONS
The most valuable consumers of the business have contributed more than 60 per cent of the total sales in year 2011, whereas the least valuable ones only made up 4 per cent of the total sales. For each of these consumer groups, it is essential to further find out which products the customers in each group have purchased, which products have been purchased together most frequently and in which sequence the products have been purchased.

The business can gain a better understanding of the consumers by exploring the associations among consumer groups and the products they have purchased. The association can be examined at the product/item level and at the product category level as well.

Many of the consumers of the business were organizational consumers with a high quantity of a product per transaction. Examining at which specific times (seasons), what products and which types of products they have purchased frequently will be beneficial to the business. It will also be interesting to see if there are any differences between different types of customers, that is, organizational and individual customers, in terms of their shopping patterns.

Monitoring the diversity of the most diverse customer group and predicting which customers will potentially become affiliated to the most or the least profitable group is very useful for the business in the long term. Identifying appropriate predictors or indicators for such predictions is invaluable.

Another aspect worth further investigation is to link consumer groups to geographical locations. This correlation, if it exists, may help the business look into other factors, such as culture, customs, and economics, that may affect a consumer’s buying intention and preferences.

CONCLUDING REMARKS
A case study has been presented in this article to demonstrate how customer-centric business intelligence for online retailers can be created by means of data mining techniques. The distinct customer groups characterized in the case study can help the business better understand its customers in terms of their profitability, and accordingly, adopt appropriate marketing strategies for different consumers.

It has been shown in this analysis that there are two steps in the whole data mining process that are very crucial and the most time-consuming: data preparation, and model interpretation and evaluation.

Further research for the business includes: conducting association analysis to establish customer buying patterns with regard to which products have been purchased together frequently by which customers and which customer groups; enhancing the merchant’s web site to enable a consumer’s shopping activities to be captured and tracked instantaneously and accurately; and predicting each customer’s lifecycle value to quantify the level of diversity of each customer.
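Of the two time-consuming steps named above, data preparation is the easier one to illustrate in code. The following Python sketch, which is an illustration and not the SAS code used in the study, aggregates item-level rows into Recency, Frequency and Monetary per postcode along the lines of pre-processing steps 2, 4 and 5. The function name, the sample rows and the definition of Recency as whole calendar months before a reference date are assumptions; the article only states that Recency is measured in months.

```python
from collections import defaultdict
from datetime import date


def rfm_by_postcode(rows, reference_date):
    """Aggregate item-level rows into one RFM record per postcode.

    rows: iterable of (invoice, quantity, price, invoice_date, postcode).
    """
    invoices = defaultdict(set)   # postcode -> distinct invoice numbers
    spend = defaultdict(float)    # postcode -> summed Amount
    last_seen = {}                # postcode -> most recent purchase date
    for invoice, quantity, price, invoice_date, postcode in rows:
        if not postcode:          # step 4: drop rows without a postcode
            continue
        amount = quantity * price  # step 2: Amount = Quantity * Price
        invoices[postcode].add(invoice)
        spend[postcode] += amount
        if postcode not in last_seen or invoice_date > last_seen[postcode]:
            last_seen[postcode] = invoice_date

    result = {}
    for postcode in invoices:     # step 5: one aggregated row per postcode
        last = last_seen[postcode]
        # Assumed definition: whole calendar months since the last purchase.
        months = ((reference_date.year - last.year) * 12
                  + (reference_date.month - last.month))
        result[postcode] = {
            "Recency": months,
            "Frequency": len(invoices[postcode]),
            "Monetary": round(spend[postcode], 2),
        }
    return result


# Made-up rows; invoice numbers and the second postcode are hypothetical.
rows = [
    ("536365", 6, 2.55, date(2011, 12, 1), "SE1 0AA"),
    ("536365", 2, 3.39, date(2011, 12, 1), "SE1 0AA"),
    ("540001", 1, 10.00, date(2011, 6, 15), "SE1 0AA"),
    ("541000", 4, 1.25, date(2011, 3, 20), "ZZ9 9ZZ"),
    ("542000", 1, 5.00, date(2011, 5, 2), ""),  # no postcode: filtered out
]
rfm = rfm_by_postcode(rows, reference_date=date(2011, 12, 31))
for postcode, values in rfm.items():
    print(postcode, values)
```

Frequency here counts distinct invoices per postcode, so two item rows on the same invoice count as one purchase. A real pipeline would also need the United Kingdom filter from step 4, which is omitted because the article does not say how UK postcodes were identified.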


ACKNOWLEDGEMENTS
The authors thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of this article.

REFERENCES
1 Interactive Media in Retail Group (IMRG). (2012) Press archive, http://www.imrg.com, accessed January 2012.
2 Kumar, V. and Reinartz, W.J. (2006) Customer Relationship Management: A Databased Approach. Hoboken, NJ: John Wiley & Sons.
3 Hughes, A.M. (2012) Strategic Database Marketing 4e: The Masterplan for Starting and Managing a Profitable, Customer-based Marketing Program. McGraw-Hill Professional, USA.
4 Davenport, T.H. (2009) Realizing the Potential of Retail Analytics: Plenty of Food for Those with the Appetite. Working Knowledge Report, Babson Executive Education.
5 Fuloria, S. (2011) How Advanced Analytics Will Inform and Transform U.S. Retail. Cognizant Reports, July, http://www.cognizant.com/InsightsWhitepapers/How-Advanced-Analytics-Will-Inform-and-Transform-US-Retail.pdf, accessed January 2012.
6 Collica, R.S. (2007) CRM Segmentation and Clustering Using SAS Enterprise Miner. Cary, NC: SAS Institute.
7 Cerrito, P.B. (2007) Introduction to Data Mining Using SAS Enterprise Miner. Cary, NC: SAS Institute.
8 Sarma, K.S. (2007) Predictive Modeling with SAS Enterprise Miner. Cary, NC: SAS Institute.
9 Thompson, W. (2008) Understanding Your Customer: Segmentation Techniques for Gaining Customer Insight and Predicting Risk in the Telecom Industry. Paper 154-2008, SAS Global Forum, 16–19 March, San Antonio, TX.
