Market Basket Analysis With Association Rules

This document summarizes a research article published in the journal Communications in Statistics - Theory and Methods. The article discusses using market basket analysis and association rule mining to analyze customer purchase data from a supermarket. Specifically, it analyzes purchase data from 225 products to identify rules about which products customers tend to buy together. It finds the top 10 rules using the FP Growth algorithm and provides an example of the best rule, which is that customers who buy milk, sweet relish, and frozen pizza also tend to buy eggs. The article concludes the analysis can help the supermarket with product placement and promotions to increase sales and revenue.

Communications in Statistics - Theory and Methods

ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: https://www.tandfonline.com/loi/lsta20

Market basket analysis with association rules

Yüksel Akay Ünvan

To cite this article: Yüksel Akay Ünvan (2020): Market basket analysis with association rules,
Communications in Statistics - Theory and Methods, DOI: 10.1080/03610926.2020.1716255

To link to this article: https://doi.org/10.1080/03610926.2020.1716255

Published online: 29 Jan 2020.


Full Terms & Conditions of access and use can be found at https://www.tandfonline.com/action/journalInformation?journalCode=lsta20

Market basket analysis with association rules



Yüksel Akay Ünvan
Faculty of Management, Banking and Finance Department, Ankara Yıldırım Beyazıt University,
Ankara, Turkey

ABSTRACT
This study was conducted to perform a Market Basket Analysis using Association Rules. The data used in the study are the sales data of a supermarket, obtained from the Vancouver Island University website. The data were analyzed in the Weka program using a data set containing 225 different products. Apriori and FP Growth, which are Association Rules algorithms, were tried in order. Since the data set is categorical, the Apriori algorithm did not yield any results. Therefore, the FP Growth algorithm was used and the top 10 rules were given according to the conviction value. According to the best rule, a customer who buys Milk, Sweet Relish and Pepperoni Pizza (Frozen) also buys eggs. This rule has a conviction value of 21.06 and a confidence value of 1 (100%): the 24 customers in the data set who bought these 3 products also bought eggs. The other rules were interpreted similarly in this study. As a result, product placement in the supermarket can be made according to these rules. Thus, sales of these products will increase and supermarket revenue will increase directly.

ARTICLE HISTORY
Received 25 October 2019
Accepted 7 January 2020

KEYWORDS
Market basket analysis; association rules; Apriori; FP Growth

1. Introduction
Information technologies continue to develop rapidly, and companies can now obtain, store, analyze and interpret the data they are interested in much more easily and at a lower cost. With the development of data mining, data sets have gained a great deal of value. Association rules, which can reveal the buying behavior of customers who shop at retail stores or e-commerce sites, are one of the most widely used data mining techniques; they identify products or product groups that are highly interdependent. For this purpose, market basket analysis using association rules is one of the most popular methods. Association rule algorithms such as Apriori and FP Growth are used to obtain the top 10 rules.
In this study, the aim is to perform a market basket analysis using association rules. For this purpose, a supermarket data set taken from the Vancouver Island University website was used (Vancouver Island University 2019). This data set contains information on purchases made by customers for 255 different products. After the necessary examinations and preparation of the data, they will be analyzed with the Weka program and the necessary interpretations will be made.

CONTACT Yüksel Akay Ünvan [email protected] Faculty of Management, Banking and Finance Department, Ankara Yıldırım Beyazıt University, Ankara, Turkey.
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/lsta.
© 2020 Taylor & Francis Group, LLC

2. Literature
Ulaş et al. (2001) applied basket analysis to some of the sales data obtained from various stores of Gima Türk A.Ş. The database used contained 756,868 transactions, 140,610 items and 7,237 kinds of goods. Each item was on average 105 items. There were 9,985 registered customers in the database. Since the data were collected in the summer, the best-selling products were tomatoes, bread, and cucumbers. In addition, products such as eggs and watermelons were also identified. Based on these data, it can be said that people tend to eat light meals, especially in the summer months. “77% of customers who buy cucumbers also buy tomatoes” and “55% of customers who buy purslane and tomatoes also buy parsley” can be cited as examples of the rules.
Chen et al. (2005) studied market basket analysis and stated that the proposed
method is efficient in terms of calculations. They also identified that the stores vary in
size and that this method is advantageous over the traditional method when more stores
and periods are used.
Using Association Rules, Timor and Şimşek (2008) analyzed the shopping data of customers of a large supermarket chain operating in the retail sector in Turkey. With Association Rules and basket analysis, it was determined which products the customers purchased together. Then, the variables that affect the purchasing behavior of the customers were determined by decision trees. When the values obtained from the analyses were examined, it was seen that customers who buy a certain product X also buy product Y at a certain rate, but the reverse does not hold at the same rate: those who buy product Y do not buy product X as often. It was noted that this and similar information may be used in campaign or shelf arrangements and in the promotion and sale of products not associated with related products.
Song et al. (2009) conducted a competitive structure analysis of the Chinese soybean import market. The study revealed that Chinese soybean importers may have stronger market power in this market. It also developed a two-country partial equilibrium trade model to test market power in US-China soybean trade and estimated it simultaneously. The results support the hypothesis that Chinese soybean importers have stronger market power than US soybean exporters.
Musalem, Aburto, and Bosch (2018) presented an approach to identify the relationships between product categories used to divide a retailer’s business into category subsets. Browser data were used to reveal product category dependencies. Since the number of possible relationships between categories may be very large, the authors provided an approach that produces an intuitive graphical representation of these relationships using data analysis techniques found in standard statistical packages, such as multidimensional scaling and clustering. The analysis revealed four groups of product categories purchased by customers. Each of these groups was analyzed by treating it as a small sub-group of the retail store under consideration. The study showed that retailers can potentially benefit if they switch from the traditional category management approach, in which product categories are managed separately, to a customer management approach that identifies relationships between product categories.
Setiabudi et al. (2011) implemented another MBA method in their paper. They analyzed the buying habits of shoppers using MBA and evaluated the implemented MBA on minimarket X. They used the well-known Apriori method to discover sets of items that frequently appear together in the transaction history database. The itemsets exceeding the minimum support threshold were selected as frequent itemsets. The selected itemsets were then used to generate association rules, followed by decoding. Every selected frequent itemset was able to generate association rules, and the confidence was computed using hybrid-dimension association rules. The experimental results indicated that the implemented MBA is able to generate knowledge about the kinds of items that are frequently purchased in a similar time frame by customers, using the criteria of hybrid-dimension association rules. The mining outcomes showed the correlation among association rules and confidence that can be analyzed.
Raeder and Chawla (2011) took a different approach to mining transaction data. By modeling the data as a product network, they discovered impressive communities (clusters) in the data. With the network-based approach, they showed that they can isolate inferences between products precisely and reduce the need to search through a list of aggregate rules. First, they examined the characteristics of product networks and showed that identifying communities within these networks could reveal meaningful relationships between association rules and products that may be difficult to find.
Dogan, Erol, and Buldu (2014) used the Apriori algorithm to analyze customer data from a leading insurance company operating in Turkey. The analysis shows which product groups the customers prefer to buy together. With the minimum support value set to 4% and the minimum confidence value set to 15%, the five associations with the highest support and confidence values were as follows. 64% of the customers who purchased Casco Insurance also purchased Traffic Insurance, and these customers constitute 17% of all customers. 55% of those who purchased fire insurance also purchased DASK (Compulsory Earthquake Insurance), and they constitute 9% of all customers. 47% of those who have Compulsory Earthquake Insurance also purchased fire insurance, and they make up 9% of all customers. 34% of those who purchased fire insurance also purchased Traffic Insurance, and they account for 6% of all customers. 33% of those who purchased fire insurance also purchased Casco Insurance, and they make up 5% of all customers.
Roodpishia and Nashtaei (2015) noted that many organizations today focus on discovering hidden patterns in their customers' behavior through customer analysis in order to maintain their competitive position. Organizations are now aware that customers are their most valuable resource. Their research used data on 300 customers of an insurance company in the city of Anzali, Iran, and applied the K-Means clustering method. Using demographic variables such as gender, age, occupation, education level, marital status, place of residence and income, the optimum number of clusters was determined in order to group the customers. The researchers then used the association rules method to find hidden patterns in the insurance industry.
In a study by Dogan (2015), some statistical inferences regarding the password structures of users of an e-commerce site were revealed. User password lengths ranged from 4 to 12 characters. The average password length was 7.1, and 53% of the passwords were generated using only one type of character. Based on this, it was found that the majority of passwords do not provide sufficient security. The analysis used a data set of 9997 people with all variables available. After an elimination process applied to the rules obtained from the association analysis, nine meaningful and useful rules remained. In rule 1, persons living in the Southeastern Anatolia Region with a password complexity value of 2 and a short password length were in the 25–44 age group with 98% accuracy. In rule 2, male users who live in Central Anatolia and are between the ages of 45 and 64 have a short password length (1). All 12 people who met the conditions in the premise of the rule had the same attribute (password complexity value 1) in the consequent of the rule. The other seven rules can be interpreted in the same way.
Kaur and Kang (2016) stated that the data mining technique of association rule mining is presented as a supporting technique for examining customer behavior and increasing sales. Association rule mining is said to be useful for discovering interesting relationships hidden in large data sets. Their research discussed different mining types such as association rule mining, classification, clustering, and other techniques. The study also discussed the two basic measures of association rules, namely support and confidence. The association rule mining technique, the rule induction technique, and the Apriori algorithm were examined after application. The analysis found a strong relationship between milk and butter and concluded that many customers bought milk and butter together. It is said that these rules can help retailers understand customers’ purchasing behavior.

Özçalıcı (2017), in his research on the secondhand vehicle market, put together a data set consisting of 73 different variables for over two hundred thousand cars from an e-commerce site using web scraping. The values of the variables were rearranged so that the Apriori algorithm could be used, and 100 rules with a support value of 10% and a confidence value of over 70% were identified. For example, vehicles worth between 30,000 and 50,000 Turkish Liras will most likely include ABS, electric windscreen, central lock, CD player, electric mirror and manual gear.
Gangurde, Kumar, and Gore (2017) designed an optimized technique for Market Basket Analysis (MBA) to estimate and analyze customers’ buying behavior. The study faced two difficulties in the analysis. The first challenge was data cleanup, since none of the available techniques considered the possibility of raw or noisy data in the transaction history. The second challenge was that customers’ demands constantly change with season and time. Because the output of market basket analysis depends entirely on time and season, the analysis has to be performed repeatedly, so a dynamic and automated MBA framework was needed. To address these challenges, they employed new algorithms based on data cleaning, Apriori and FFNN. The performance of the proposed approach was evaluated against existing methods. It was stated that the results were significant and promising in demonstrating the effectiveness of the proposed approach.
Moodley et al. (2018) stated that in recent years (especially after the 2008 recession in the UK) competition in the shopping sector has intensified, shopping habits and demographic characteristics have changed, and price sensitivity has become increasingly important. Numerous studies have been undertaken to understand the items that are often purchased together (association rule mining/frequent product sets), with a few measures proposed to collect substantial support and to establish confidence at different levels of accuracy, as these criteria are highly content-dependent. Uninorms were used as an alternative measure to increase support and confidence in the analysis of market basket data, using the UK grocery retail sector as a case study. Experiments were conducted on consumer panel data to compare Uninorm with three other popular measures (Jaccard, Cosine, and Conviction). Uninorm was found to outperform the other measures when it complies with the basic monotonic characteristics of support in market basket analysis.
Liew (2018) performed a study to reveal the basic nutritional habits related to physical activity, together with the students’ opinions about the food quality in the school cafeteria and vending machines. The empirical analysis was based on the 2011 Healthy School Program (HSP) Assessment. HSP assesses the demographic characteristics, nutritional habits, and exercise patterns of a representative sample of primary, secondary, and high school students in the United States. The findings showed that students assigned to different clusters have different eating habits, exercise models, weight status, weight management, and opinions about the quality of food in the school cafeteria and vending machines. Also, there were great differences in diet profiles and lifestyle behaviors among students who were not sure of their overweight or weight status.
Bilgiç (2019) stated that market basket analysis is very important for understanding consumers’ preferences and purchasing behaviors in the retail sector and for developing the most suitable production and marketing strategies. The study not only provided useful results for the company that was the focus of the research but also encouraged researchers to use the R programming language, whose code was shared in detail throughout the study, so that both researchers and retailers can analyze their data with advanced algorithms. In addition, unlike many other studies, relationships were found between individual products rather than between more general product groups. The analysis of the strongest buying behavior determined that customers who buy eggs also shop from the grocery store.
Yulianto and Heryanto (2019) conducted research on e-commerce application software that uses market basket analysis to market handmade products produced by the handicraft industry. Two main results were obtained from the analysis. 1. The established e-commerce application is expected to assist the handicraft industry's business process in marketing handmade products. 2. The use of the market basket analysis method can improve the quality of service to customers, particularly in providing product selection information, and can directly help owners decide on innovation.
Raja et al. (2019) collected market-based data and determined the frequent and non-frequent item sets in order to uncover the reasons behind the sales data. In this way, the least preferred product of the market was identified. In addition, a visualization technique was used to make the sales data more comprehensible. The authors also claimed that their proposed research, in which the profit and loss of each product was examined, can be used for supermarket-based organizations in the future.
Rezende and Ladeira (2019) studied the Market Basket Analysis of a financial institution and derived association rules for individual consumers in São Paulo state. Three association algorithms were presented in the article, but only one of them was applied. The data used were explained in detail, with all the filters and treatments that were applied. The modeling of association rule algorithms and examples of these algorithms were also described in the paper. Based on the results obtained, they were able to determine the shopping basket of the financial institution and tested the results under changing rules and conditions.

3. Market basket analysis


Market Basket Analysis (MBA) is a set of statistical affinity calculations that help managers better understand, and ultimately serve, their customers by highlighting purchasing patterns in the retail and restaurant businesses. Briefly, MBA shows which combinations of products most frequently occur together in orders. These relationships can be used to increase profitability through recommendations, promotions, cross-selling or even the placement of items on a menu or in a store. Applied more deeply, Market Basket Analysis allows companies to identify keystone products, those that differentiate them in the market and could potentially hurt business if they were unavailable or more expensive. Gourmet or other specialty items in a grocery store might have limited appeal, but the customers they attract (and their subsequent spending) could justify high-visibility placement. Customers ordering through the company’s app could be interested in items, discounts, campaigns or combinations that offer extra loyalty points (Smartbridge 2019).
In retailing, most purchases are made on impulse. Market basket analysis gives hints as to what a customer might have bought if the idea had occurred to them. Therefore, as a first step, Market Basket Analysis (MBA) can be used in deciding the location and promotion of goods inside a store. But this is only the first step of the analysis. Differential Market Basket Analysis can find interesting results and can also eliminate the problem of a potentially high volume of unimportant results. In this analysis, results are compared between different stores, between customers in different demographic groups, between different seasons of the year, between different days of the week, etc. (Albion Research Ltd. 2019). Other application areas include:

• Analysis of credit card purchases.
• Analysis of telephone calling patterns.
• Identification of fraudulent medical insurance claims.
• Analysis of telecom service purchases.

4. Association rules analysis


One of the methods used for Market Basket Analysis is Association Rules. It is necessary to make some definitions before proceeding to a detailed examination of Association Rules Analysis. D is the database of sales transactions of a supermarket and T is the sales transaction (sales receipt) of each customer in this database. If I = {I1, I2, …, Im} is a product set, then T ⊆ I. An association rule is shown as A → B. The product set to the left of the arrow is called the left side of the rule and the one to the right is called the right side. When this statement is simply considered as Vegetable → Fruit, the expression means that customers who buy vegetables also buy fruits. More than one product may appear on the left side of the rule. In the rule A → B, A ⊂ I, B ⊂ I and A ∩ B = ∅. A rule must satisfy minimum support and minimum confidence. Each rule has three values called support, confidence, and leverage. Of these, the support value (s) is the ratio of receipts containing A ∪ B among the sales receipts in database D, that is, the ratio of receipts containing both A and B to all receipts. Thus, the support value indicates how frequently the association between the respective products occurs. Confidence (c) indicates how many of the receipts containing A also include B. This value is a conditional probability. The confidence value indicates the strength of the rule. Apart from these two values, a leverage value is also calculated (Bilgiç 2019, 91). Figure 1 shows the details of the data mining models and functions.

Figure 1. Data mining models and functions (Dunham 2002).
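\[
\begin{aligned}
\text{Support}(A \rightarrow B) &= P(A \cup B) \\
\text{Confidence}(A \rightarrow B) &= P(B \mid A) \\
\text{Leverage}(A \rightarrow B) &= \frac{\text{Confidence}(A \rightarrow B)}{\text{Support}(B)} = \frac{\text{Support}(A \rightarrow B)}{\text{Support}(A)\,\text{Support}(B)}
\end{aligned}
\tag{1}
\]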
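As a minimal sketch of Equation (1), the three values can be computed directly from a list of receipts. The receipts below are hypothetical toy data, not the data set used in the study, and the function names are illustrative assumptions. The ratio that the text calls leverage is computed here exactly as written in Equation (1).

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy receipts (assumption: not taken from the study's data set).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"milk", "eggs", "bread"}),
    frozenset({"milk", "eggs"}),
    frozenset({"milk", "bread"}),
    frozenset({"eggs"}),
    frozenset({"bread", "butter"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Share of receipts that contain every product in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(a: Iterable[str], b: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Share of the receipts containing A that also contain B, i.e. P(B | A)."""
    both = frozenset(a) | frozenset(b)
    return support(both, transactions) / support(a, transactions)


def leverage_ratio(a: Iterable[str], b: Iterable[str],
                   transactions: List[FrozenSet[str]]) -> float:
    """Confidence(A -> B) / Support(B), the third value of Equation (1)."""
    return confidence(a, b, transactions) / support(b, transactions)


if __name__ == "__main__":
    print(support({"milk", "eggs"}, TRANSACTIONS))
    print(confidence({"milk"}, {"eggs"}, TRANSACTIONS))
    print(leverage_ratio({"milk"}, {"eggs"}, TRANSACTIONS))
```

For the rule milk → eggs in this toy data, two of five receipts contain both products and three contain milk, so the support is 2/5 and the confidence is 2/3.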
Support, confidence, and leverage values measure the quality of a rule in terms of its usability and accuracy. The predictive accuracy value, obtained by using confidence and support values together, was first introduced by Scheffer (2001). This value can be called interestingness or prediction accuracy and is used as an indicator of the reliability of a rule in some algorithms, especially the PredictiveApriori algorithm. As mentioned earlier, the support value indicates how many transactions in a data set contain both the left and the right side of a rule, while the confidence value indicates how many receipts containing the products on the left side of the rule also contain the right-side product (Cios et al. 2007, 290). The confidence value also measures how good a rule is at predicting which element will appear on its right-hand side. But if the items on the right side are commonly purchased items, the rule may not be interesting. Therefore, the leverage value, which is a measure of whether the rule is interesting or not, compares the entire rule with the randomly selected right-side elements. Thus, the leverage of a rule should be considered along with its confidence.
If the leverage value is greater than 1, the association between the products in the rule is positive, meaning that products A and B appear together more often than expected and the rule is interesting. If the value is less than 1, there is a negative correlation between them; therefore, rules with a leverage value of less than 1 are ignored. If the value is exactly equal to 1, there is no correlation, indicating independence. The leverage value also determines whether the rule arises by chance or, on the contrary, is a genuinely expected and good rule. The name leverage can be explained as follows: if the leverage value of a two-product rule is greater than 1, the sale of one product may increase the sale of the other. Finally, the threshold values for these three measures should be determined by analysts or experts (Bilgiç 2019, 92).
The conviction value is also used in forming association rules. When calculating the conviction value, the probability that A is seen without B is taken into account. If the conviction is 1, A and B are independent of each other. If the conviction value is greater than 1, the related rule can be established (Şeker 2011).
\[
\text{Conviction}(A \rightarrow B) = \frac{1 - \text{Support}(B)}{1 - \text{Confidence}(A \rightarrow B)} \tag{2}
\]
Since high values of all these measures do not guarantee that consistently interesting and important rules will be obtained, the degree to which a rule is interesting is determined using the lift value (Ateş and Karabatak 2017). A lift value less than or greater than 1 indicates that the interest increases, while Lift = 1 indicates that there is no interest (Jabbour, Mazouri, and Sais 2018).
\[
\text{Lift}(A \rightarrow B) = \frac{P(A \cap B)}{P(A) \times P(B)} \tag{3}
\]
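The conviction and lift formulas in Equations (2) and (3) can be sketched as two small functions. The input probabilities in the usage lines are hypothetical numbers chosen for illustration. Returning infinity when the confidence equals 1 is an assumption of this sketch, made because the denominator of Equation (2) is then zero; it is not a statement about how any particular tool reports the value.

```python
import math


def conviction(support_b: float, confidence_ab: float) -> float:
    """Equation (2): (1 - Support(B)) / (1 - Confidence(A -> B))."""
    if confidence_ab >= 1.0:
        # Assumption of this sketch: a zero denominator is reported as infinity.
        return math.inf
    return (1.0 - support_b) / (1.0 - confidence_ab)


def lift(p_a_and_b: float, p_a: float, p_b: float) -> float:
    """Equation (3): P(A and B) / (P(A) * P(B))."""
    return p_a_and_b / (p_a * p_b)


if __name__ == "__main__":
    # Hypothetical values: B is in 60% of receipts, the rule has 80% confidence.
    print(conviction(support_b=0.6, confidence_ab=0.8))
    # Independent products: P(A and B) equals P(A) * P(B), so lift is 1.
    print(lift(p_a_and_b=0.4, p_a=0.5, p_b=0.8))
```

When the confidence of a rule equals the support of its right side, the conviction is 1, which matches the independence case described in the text.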
In general, Association Rule analysis consists of two stages (Aggarwal 2015, 98; Han, Pei, and Kamber 2011, 231).
1. All frequent product clusters above a support level (minimum support value) previously determined by the user are detected.
2. Strong association rules are established from the frequent product clusters; the rules must be above the minimum support and confidence values set by the user.

4.1. Apriori algorithm


Agrawal’s studies started in 1993 and culminated in the Apriori algorithm, published in 1994 with Srikant. The algorithm is a stepwise technique that starts with the simplest rule and builds product sets of size k + 1 by adding individual products to the product sets of size k. It starts by selecting the subsets of a given product set whose support is above a predetermined support value (usually single products are selected first) and ignoring the products below that support value. Products that pass the first step form two-item product clusters. The calculated support values of these product clusters are also compared with the initially determined support value, and the product clusters below it are again ignored. But these ignored product clusters are candidates for two-product rules in the future (for the right-hand side of the rules). This process keeps merging frequent products, checking each result against the specified support value, until no more frequent clusters can be found. After the frequent product clusters are detected, the rule-finding process starts: association rules that are above the predetermined minimum support value and above a confidence value are generated (Aggarwal 2015, 100; Giudici 2005, 93).
This algorithm takes advantage of the previous step through the use of prior knowledge of frequently repeating objects (Agrawal and Srikant 1994, 487–99).
The Apriori algorithm is based on the rule that all subsets of a frequently repeating object set must themselves be frequently repeating sets, and it uses an iterative approach. First, the frequently repeating sets with one element are found. This set is called L1 (frequently repeating 1-element sets). L1 is used to obtain L2 (frequently repeating 2-element sets). The algorithm works iteratively until no larger frequently repeating sets can be obtained. Finding each Lk requires scanning the entire database. The database is therefore scanned many times to find frequently occurring items, and these scans involve the concatenation (join), pruning, and minimum support criteria of the Apriori algorithm (Han and Kamber 2000).
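The level-wise search described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used in Weka: candidates are formed as unions of two frequent (k-1)-sets rather than by the prefix-based join, the function name is hypothetical, and the database is rescanned once per level as the text describes.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def apriori_frequent_itemsets(
    transactions: Iterable[Iterable[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Return every frequent itemset with its support, found level by level."""
    receipts: List[FrozenSet[str]] = [frozenset(t) for t in transactions]
    n = len(receipts)

    def keep_frequent(candidates: Set[FrozenSet[str]]) -> Dict[FrozenSet[str], float]:
        # One scan of the whole database per level.
        kept = {}
        for c in candidates:
            s = sum(c <= t for t in receipts) / n
            if s >= min_support:
                kept[c] = s
        return kept

    # L1: frequent single products.
    level = keep_frequent({frozenset([item]) for t in receipts for item in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Candidate k-sets: unions of two frequent (k-1)-sets that have k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        level = keep_frequent(candidates)  # Lk
        result.update(level)
        k += 1
    return result


if __name__ == "__main__":
    toy = [  # hypothetical receipts, for illustration only
        {"milk", "eggs", "bread"},
        {"milk", "eggs"},
        {"milk", "bread"},
        {"eggs"},
        {"bread", "butter"},
    ]
    for itemset, s in sorted(apriori_frequent_itemsets(toy, 0.4).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), s)
```

With a minimum support of 0.4 on the toy receipts, the candidate {milk, eggs, bread} is discarded without counting because its subset {eggs, bread} is not frequent.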

4.1.1. Merge process


In order to find Lk, Lk-1 is joined with itself to create a candidate set whose elements have length k. The candidate set is denoted by the symbol Ck. In order to perform the Lk-1 ⋈ Lk-1 join, the first (k-2) elements of the two members being joined must be the same. The joining process thus yields the candidate set Ck, containing elements of length k, from members of Lk-1 whose first (k-2) elements are equal (Dogan, Erol, and Buldu 2014, 108).
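The join described above can be illustrated with itemsets stored as sorted tuples, so that "the first (k-2) elements" is a plain prefix comparison. The function name and the sample sets are hypothetical and serve only as an illustration.

```python
from typing import Iterable, List, Tuple

Itemset = Tuple[str, ...]


def join_step(prev_frequent: Iterable[Iterable[str]], k: int) -> List[Itemset]:
    """Build the candidate set Ck by joining L(k-1) with itself."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    candidates: List[Itemset] = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            # Two members are joined only if their first k-2 elements agree.
            if a[:k - 2] == b[:k - 2]:
                # a sorts before b, so a[-1] < b[-1] and the result stays sorted.
                candidates.append(a[:k - 2] + (a[-1], b[-1]))
    return candidates


if __name__ == "__main__":
    l2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]  # hypothetical L2
    print(join_step(l2, 3))
```

Here only (A, B) with (A, C) and (B, C) with (B, D) share their first element, so C3 contains (A, B, C) and (B, C, D).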

4.1.2. Pruning
Ck denotes the candidate set whose members have length k. Each member of Ck must be examined to determine whether it is actually frequent. Because every subset of a frequent itemset must be frequent, the subsets of length (k-1) of each candidate are checked first: any candidate that has a (k-1)-element subset whose support is below the minimum support value (that is, a subset not in Lk-1) is discarded from Ck. The support of each remaining candidate in Ck is then counted by scanning the entire data set, and the candidates that do not reach the minimum support value are also removed from Ck. The result is the set Lk, in which all (k-1)-element subsets of every member are frequent and every member meets the minimum support value. Joining and pruning continue until the Lk-1 set to be joined is the empty set (Agrawal and Srikant 1994, 487–99; Han and Kamber 2000).
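The join, pruning, and counting steps can be combined into a level-wise loop. The sketch below is a simplified illustration under stated assumptions, not the implementation used in the study: the function name apriori, the small transaction list, and the minimum support count of 3 are all hypothetical, and support is counted by a plain scan without any optimization.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {itemset as sorted tuple: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        # One full scan of the data per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support_count}

    level = count_frequent({(item,) for t in transactions for item in t})  # L1
    frequent = dict(level)
    while level:
        prev = sorted(level)
        # Join: merge (k-1)-itemsets that share their first k-2 elements.
        candidates = {
            a + (b[-1],)
            for i, a in enumerate(prev)
            for b in prev[i + 1:]
            if a[:-1] == b[:-1]
        }
        # Prune: drop candidates that have an infrequent (k-1)-element subset.
        candidates = {
            c for c in candidates
            if all(s in level for s in combinations(c, len(c) - 1))
        }
        level = count_frequent(candidates)  # Lk
        frequent.update(level)
    return frequent


# Hypothetical transactions; not the supermarket data analysed in the article.
example = [
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
    {"milk", "bread"},
    {"eggs", "bread"},
    {"milk", "eggs", "bread"},
]
result = apriori(example, min_support_count=3)
print(sorted(result.items()))
```

With these made-up transactions the three single items and the three pairs are frequent, while the candidate {bread, eggs, milk} survives pruning but is removed after counting because it occurs in only 2 transactions.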
10 Y. A. ÜNVAN

Table 1. Apriori vs. FP Growth.

| Algorithm | Technique | Runtime | Memory usage | Parallelizability |
| --- | --- | --- | --- | --- |
| Apriori | Generate singletons, pairs, triplets, etc. | Candidate generation is extremely slow. Runtime increases exponentially depending on the number of different items. | Saves singletons, pairs, triplets, etc. | Candidate generation is very parallelizable |
| FP Growth | Insert sorted items by frequency into a pattern tree | Runtime increases linearly, depending on the number of transactions and items | Stores a compact version of the database | The data are very interdependent, each node needs the root |

Source: https://www.singularities.com/blog/our-blog-1/post/apriori-vs-fp-growth-for-frequent-item-set-mining-11.

4.1.3. Creating rule
Once all frequent itemsets have been found, the rules are extracted from them. For a frequent itemset l and a subset a of l, if the rule a → (l - a) cannot be created, then no rule a' → (l - a') with a' a subset of a can be created either. The algorithm is based on this idea. For example, let X = {A, B, C} and Y = {D}. If the rule {A, B, C} → {D} cannot be extracted, the rule {A, B} → {C, D} cannot be extracted either, because the confidence of the first rule is always at least as high as that of the second (Ulaş et al. 2001, 4).
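The reason is that both rules have the same numerator, the support of {A, B, C, D}, while the shorter antecedent {A, B} occurs at least as often as {A, B, C}. The following small numeric check uses made-up support counts and a hypothetical helper named confidence; it is only meant to make the inequality concrete.

```python
def confidence(support, antecedent, consequent):
    """confidence(X -> Y) = support(X and Y together) / support(X)."""
    x = frozenset(antecedent)
    return support[x | frozenset(consequent)] / support[x]


# Made-up support counts, chosen so that larger itemsets are never more frequent.
support = {
    frozenset("AB"): 50,
    frozenset("ABC"): 20,
    frozenset("ABCD"): 10,
}

conf_abc_d = confidence(support, "ABC", "D")   # 10 / 20
conf_ab_cd = confidence(support, "AB", "CD")   # 10 / 50
print(conf_abc_d, conf_ab_cd)
```

If the first value is already below the minimum confidence, the second one must be below it as well, so the rule with the smaller antecedent does not need to be evaluated.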

4.2. FP Growth
FP Growth is an improvement on Apriori designed to eliminate some of its heaviest bottlenecks. According to Singularities (2019), the algorithm was planned with the benefits of MapReduce in mind, so it works well with distributed systems built around MapReduce. FP Growth addresses the main problems of Apriori by using a structure called an FP Tree. In an FP Tree, each node represents an item and its current count, and each branch represents a different association (Singularities 2019).
The biggest advantage of FP Growth is that the algorithm needs to read the data only twice, whereas Apriori reads it once for every iteration. This also reduces costs. Another major advantage is that, because it uses the FP Tree, it removes the need to generate and count candidate pairs, which is very processing-heavy. As a result, its runtime is roughly linear, O(n), in the number of transactions, which is much faster than the Apriori algorithm. The FP Growth algorithm also stores a compact version of the database in memory. Its drawback is the interdependence of the data: to parallelize the algorithm, some data still need to be shared, which creates a bottleneck in the shared memory.
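The two passes over the data can be illustrated with a minimal FP Tree construction. This sketch only builds the tree and does not mine patterns from it; the class name FPNode, the function build_fp_tree, the grocery transactions, and the minimum support count of 2 are hypothetical choices made for illustration.

```python
from collections import Counter


class FPNode:
    """One tree node: an item, its count, and its child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support_count):
    # Scan 1: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}

    root = FPNode(None)
    # Scan 2: insert each transaction's frequent items, most frequent first.
    for t in transactions:
        ordered = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),  # ties are broken alphabetically
        )
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
            node.count += 1
    return root, frequent


# Hypothetical baskets; not the data set used in the article.
baskets = [
    ["milk", "eggs", "bread"],
    ["milk", "eggs"],
    ["milk"],
    ["milk", "bread"],
    ["eggs"],
]
tree, item_counts = build_fp_tree(baskets, min_support_count=2)
milk = tree.children["milk"]
print(milk.count, milk.children["eggs"].count, tree.children["eggs"].count)
```

Transactions that begin with the same frequent items share a path from the root, which is why the tree can be much smaller than the original data. It is also why every node depends on the root, the interdependence noted above.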

4.3. Apriori vs. FP Growth
FP Growth uses less memory and has a shorter runtime than Apriori. FP Growth is also more scalable because of its linear running time. If a choice must be made between these algorithms, FP Growth is the more suitable one (Singularities 2019). The details of the comparison between Apriori and FP Growth are given in Table 1.
COMMUNICATIONS IN STATISTICS—THEORY AND METHODS 11

5. Methodology
Weka is software developed at the University of Waikato for the purpose of machine learning; its name consists of the initials of the words “Waikato Environment for Knowledge Analysis”. Weka is one of the 10 most used software packages in the field of business intelligence, and it ranks in the top 3 among the most used free software in that field (Vohra 2012). Weka, which is widely used today, includes machine learning algorithms and methods such as Filtered Associator, FP Growth, Generalized Sequential Patterns, Predictive Apriori and Tertius.
Because it is developed in Java and its libraries come in jar files, it can be easily integrated into projects written in Java, which makes its use more widespread (Şeker 2013).
Weka has a completely modular design and, with the features it contains, can perform visualization, data analysis, business intelligence applications, and data mining on data sets. Weka works with files that have the .arff extension, and it also has tools for converting CSV files to ARFF format. Basically, the following 3 data mining operations can be done with Weka:

• Classification
• Clustering
• Association

In addition to the above operations, pre-processing and post-processing operations can be performed on the data sets:

• Data Pre-Processing
• Visualization

Weka initially reads all attributes as numeric. In fact, they are all binary, taking the value 0 (not purchased) or 1 (purchased) in the dataset. Therefore, the dataset was converted with the NumericToNominal filter and the "No class" option was selected. Because the Apriori algorithm did not handle the numeric data set and did not give good results, this study was carried out with the FP Growth algorithm in Weka. The FP Growth algorithm expects its input as nominal attributes with only 2 values (i.e., 0 and 1).
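To make the expected input format concrete, the following sketch writes a small 0/1 purchase matrix as ARFF text in which every attribute is declared as nominal with the two values 0 and 1. It is an illustration only: the helper name to_nominal_arff, the relation name baskets, and the three example rows are hypothetical, and the study itself performed this conversion inside Weka rather than with a script.

```python
def to_nominal_arff(relation, item_names, rows):
    """Render 0/1 purchase rows as ARFF text with nominal {0,1} attributes."""
    lines = [f"@relation {relation}", ""]
    for name in item_names:
        # Names are quoted so that spaces are allowed; embedded quotes are not handled.
        lines.append(f"@attribute '{name}' {{0,1}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(int(v)) for v in row))
    return "\n".join(lines) + "\n"


items = ["Eggs", "2pct. Milk", "Sweet Relish"]
rows = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]  # made-up purchases, one row per transaction
arff_text = to_nominal_arff("baskets", items, rows)
print(arff_text)
```

Declaring the attributes as {0,1} in the file header describes the same nominal, two-valued structure that the NumericToNominal step produces.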
The dataset consists of 1361 transactions. Because the algorithm did not produce any rules at first, the minMetric and lowerBoundMinSupport parameters were adjusted. The algorithm was run with delta 0.05, lowerBoundMinSupport 0.01, minMetric 0.7, and upperBoundMinSupport 1.0. The FP Growth algorithm produced 286,304 rules, and the first 10 rules with the highest conviction value are given.

6. Results
The higher the conviction value, the more decisive the rule is. If the conviction value is 1, the products in the rule are independent of each other, so it cannot be taken as a rule. However, the dependence increases as the conviction value moves away from 1. The rule with the highest conviction value can therefore be regarded as the most decisive rule.
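The measures reported with the rules in this section can be re-derived from the rule counts. The sketch below assumes the usual definitions of confidence, lift, and leverage, and it recovers the support of the consequent from the reported lift (an inference, since that support is not stated in the text). It also shows that the textbook conviction, (1 - P(Y)) / (1 - confidence), is unbounded when the confidence is 1. The finite values reported for the rules are reproduced by a variant that adds 1 to the number of counterexamples; that this is the form used by the software is an assumption inferred from the numbers, not something stated in the article. The function name rule_measures is hypothetical.

```python
import math

N = 1361  # number of transactions in the dataset (from the text)


def rule_measures(premise_count, joint_count, reported_lift, n=N):
    """Re-derive measures for a rule X ==> Y from its counts and reported lift."""
    confidence = joint_count / premise_count
    # Implied support of the consequent, because lift = confidence / P(Y).
    p_y = confidence / reported_lift
    leverage = joint_count / n - (premise_count / n) * p_y
    # Textbook conviction; unbounded when the confidence is exactly 1.
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = (1 - p_y) / (1 - confidence)
    # ASSUMPTION: smoothed variant with 1 added to the counterexample count.
    smoothed = premise_count * (1 - p_y) / (premise_count - joint_count + 1)
    return confidence, leverage, conviction, smoothed


# Rule 1 of this section: 24 premise transactions, 24 joint transactions, lift 8.15.
conf, lev, conv, conv_smoothed = rule_measures(24, 24, 8.15)
print(conf, round(lev, 2), conv, round(conv_smoothed, 2))
```

Under these assumptions the first rule gives a confidence of 1, a leverage of about 0.02, and a smoothed conviction of about 21.06, in agreement with the values reported for it.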
1. [2pct. Milk = 1, Sweet Relish = 1, Pepperoni Pizza - Frozen = 1]: 24 ==> [Eggs = 1]: 24 conf:(1) lift:(8.15) lev:(0.02) <conv:(21.06)>
Customers who buy 2pct. Milk, Sweet Relish and Pepperoni Pizza - Frozen together are 8.15 times as likely to buy Eggs as customers in general. The conviction value is 21.06, so this rule ranks first as the best rule. Since the confidence value is 1, 100% of the customers who buy 2pct. Milk, Sweet Relish and Pepperoni Pizza - Frozen also buy Eggs: 24 of the 24 customers who bought these products also bought Eggs. If the confidence value were 0.9, 90% of the customers who bought these products would also buy Eggs.
2. [Onions = 1, Wheat Bread = 1, Apples = 1]: 22 ==> [2pct. Milk = 1]: 22 conf:(1) lift:(9.13) lev:(0.01) <conv:(19.59)>
Customers who buy Onions, Wheat Bread, and Apples together are 9.13 times as likely to buy 2pct. Milk as customers in general. The conviction value is 19.59, so this rule is the second-best rule. Since the confidence value is 1, 100% of the customers who buy Onions, Wheat Bread and Apples also buy 2pct. Milk: 22 of the 22 customers who bought these products also bought 2pct. Milk.
3. [White Bread = 1, 2pct. Milk = 1, Plain Bagels = 1]: 22 ==> [Eggs = 1]: 22 conf:(1) lift:(8.15) lev:(0.01) <conv:(19.3)>
Customers who buy White Bread, 2pct. Milk, and Plain Bagels together are 8.15 times as likely to buy Eggs as customers in general. The conviction value is 19.3, so this rule is the third-best rule. Since the confidence value is 1, 100% of the customers who buy White Bread, 2pct. Milk and Plain Bagels also buy Eggs: 22 of the 22 customers who bought these products also bought Eggs.
4. [98pct. Fat Free Hamburger = 1, Toothpaste = 1, Garlic = 1]: 20 ==> [White Bread = 1, Potato Chips = 1]: 20 conf:(1) lift:(19.44) lev:(0.01) <conv:(18.97)>
Customers who buy 98pct. Fat Free Hamburger, Toothpaste and Garlic together are 19.44 times as likely to buy White Bread and Potato Chips together as customers in general. The conviction value is 18.97, so this rule is the fourth-best rule. Since the confidence value is 1, 100% of the customers who buy 98pct. Fat Free Hamburger, Toothpaste and Garlic also buy White Bread and Potato Chips: 20 of the 20 customers who bought these products also bought White Bread and Potato Chips.
The other 6 rules can be interpreted in the same way as the first 4 rules above. Since the remaining 6 rules also have a confidence value of 1, 100% of the customers who bought the products on the left-hand side of each rule also bought the products on the right-hand side. It can also be said that the dependence decreases as the conviction value decreases.
5. [Eggs = 1, 2pct. Milk = 1, 98pct. Fat Free Hamburger = 1, Onions = 1]: 21 ==> [Potato Chips = 1]: 21 conf:(1) lift:(10.23) lev:(0.01) <conv:(18.95)>
6. [Eggs = 1, White Bread = 1, Wheat Bread = 1, Bananas = 1]: 21 ==> [2pct. Milk = 1]: 21 conf:(1) lift:(9.13) lev:(0.01) <conv:(18.7)>
7. [Popcorn Salt = 1, Apple Fruit Roll = 1]: 21 ==> [Eggs = 1]: 21 conf:(1) lift:(8.15) lev:(0.01) <conv:(18.42)>
8. [White Bread = 1, 2pct. Milk = 1, Toothpaste = 1, Pepperoni Pizza - Frozen = 1]: 21 ==> [Eggs = 1]: 21 conf:(1) lift:(8.15) lev:(0.01) <conv:(18.42)>
9. [98pct. Fat Free Hamburger = 1, Toothpaste = 1, Garlic = 1]: 20 ==> [Potato Chips = 1]: 20 conf:(1) lift:(10.23) lev:(0.01) <conv:(18.05)>
10. [Eggs = 1, 2pct. Milk = 1, 98pct. Fat Free Hamburger = 1, Apples = 1]: 20 ==> [Potato Chips = 1]: 20 conf:(1) lift:(10.23) lev:(0.01) <conv:(18.05)>
These 10 rules show which products customers buy together with other products. The main purpose of market basket analysis with association rules is to find the purchase relationships between products and to arrange sales and product placement accordingly. According to these rules, products that are related to each other can be placed closer together in the market. For example, under the first rule, 2pct. Milk, Sweet Relish and Pepperoni Pizza - Frozen should be kept close to each other, and Eggs should also be close to these products. This makes shopping more attractive to the customer and increases the market's earnings, since other customers who buy these products will most likely buy Eggs as well. If the market operator interprets market basket analysis correctly, it will be productive and profitable.

References
Aggarwal, C. C. 2015. Data mining: The textbook. New York: Springer; IBM T.J. Watson Research Center.
Agrawal, R., and R. Srikant. 1994. Fast algorithms for mining association rules. In Proceedings of 20th International Conference on Very Large Data Bases, VLDB, Vol. 1215, 487–99. Santiago, Chile: IBM Almaden Research Center.
Albion Research Ltd. 2019. Market basket analysis. https://www.albionresearch.com/data_mining/market_basket.php (accessed August 10, 2019).
Ateş, Y., and M. Karabatak. 2017. Multiple minimum support value for quantitative association rules. Fırat University Journal of Engineering Sciences 29 (2):57–65.
Bilgiç, E. 2019. Market basket analysis with R programming language: An application on consumer purchasing behavior of a supermarket in Muş. Journal of Social Sciences of Mus Alparslan University 7 (3):89–97.
Chen, Y.-L., K. Tang, R.-J. Shen, and Y.-H. Hu. 2005. Market basket analysis in a multiple store environment. Decision Support Systems 40 (2):339–54. doi:10.1016/j.dss.2004.04.009.
Cios, K. J., W. Pedrycz, R. W. Swiniarski, and L. Kurgan. 2007. Data mining - A knowledge discovery approach. New York: Springer.
Dogan, B., B. Erol, and A. Buldu. 2014. Using association rule mining for customer relationship management in insurance sector. Marmara Journal of Natural and Applied Sciences 3:105–14.
Dogan, O. 2015. The analysis of passwords structures in an E-commerce site user accounts by using association rules. Journal of Internet Applications and Management 6 (2):49–61. doi:10.5505/iuyd.2015.29491.
Dunham, M. H. 2002. Data mining: Introductory and advanced topics. USA: Pearson Education.
Gangurde, R., B. Kumar, and S. D. Gore. 2017. Optimized predictive model using artificial neural network for market basket analysis. Computer Science and Electronics Journals 9 (1):42–52.
Giudici, P. 2005. Applied data mining: Statistical methods for business and industry. USA: John Wiley & Sons.
Han, J., and M. Kamber. 2000. Data mining concept and techniques. 1st ed. USA: Morgan Kaufmann Publishers.
Han, J., J. Pei, and M. Kamber. 2011. Data mining: Concepts and techniques. USA: Elsevier.
Jabbour, S., F. Mazouri, and L. Sais. 2018. Mining negatives association rules using constraints. Procedia Computer Science 127:481–8. doi:10.1016/j.procs.2018.01.146.
Kaur, M., and S. Kang. 2016. Market basket analysis: Identify the changing trends of market data using association rule mining. Procedia Computer Science 85:78–85. doi:10.1016/j.procs.2016.05.180.
Liew, H. 2018. Dietary habits and physical activity: Results from cluster analysis and market basket analysis. Nutrition and Health 24 (2):83–92. doi:10.1177/0260106018770942.
Moodley, R., F. Chiclana, F. Caraffini, and J. Carter. 2018. Application of uninorms to market basket analysis. International Journal of Intelligent Systems 34 (1):1–14. doi:10.1002/int.22039.
Musalem, A., L. Aburto, and M. Bosch. 2018. Market basket analysis insights to support category management. European Journal of Marketing 52 (7/8):1550–73. doi:10.1108/EJM-06-2017-0367.
Ozçalıcı, M. 2017. Predicting second-hand car sales price using decision trees and genetic algorithms. Alphanumeric Journal 5 (1):103–14.
Raeder, T., and N. V. Chawla. 2011. Market basket analysis with networks. Social Network Analysis and Mining 1 (2):97–113. doi:10.1007/s13278-010-0003-7.
Raja, B., J. Pamina, P. Madhavan, and A. S. Kumar. 2019. Market behavior analysis using descriptive approach. https://ssrn.com/abstract=3330017; http://dx.doi.org/10.2139/ssrn.3330017 (accessed February 6, 2019).
Rezende, F., and M. Ladeira. 2019. Market basket analysis in a financial institution. Singular Engenheria 1 (1):6–12. doi:10.33911/singular-etg.v1i1.18.
Roodpishia, M. V., and R. A. Nashtaei. 2015. Market basket analysis in insurance industry. Management Science Letters 5:393–400.
Scheffer, T. 2001. Finding association rules that trade support optimally against confidence. In Proceedings of the 5th European Conference on Principles and Practice of Knowledge Discovery in Databases, 424–35. Berlin Heidelberg: Springer.
Setiabudi, D. H., G. S. Budhi, I. W. J. Purnama, and A. Noertjahyana. 2011. Data mining market basket analysis using hybrid-dimension association rules, case study in Minimarket X. 2011 International Conference on Uncertainty Reasoning and Knowledge Engineering, IEEE, Bali, Indonesia.
Singularities. 2019. Apriori vs FP-growth for frequent item set mining. https://www.singularities.com/blog/our-blog-1/post/apriori-vs-fp-growth-for-frequent-item-set-mining-11 (accessed August 17, 2019).
Smartbridge. 2019. Market basket analysis 101: Anticipating customer behavior. https://smartbridge.com/market-basket-analysis-101/ (accessed August 10, 2019).
Song, B., M. A. Marchant, M. R. Reed, and S. Xu. 2009. Competitive analysis and market power of China’s soybean import market. International Food and Agribusiness Management Review 12 (1):21–42.
Şeker, S. E. 2011. Interest measures for association rules. http://bilgisayarkavramlari.sadievrenseker.com/2011/09/09/birliktelik-kurallarinin-pay-olcumleri-interest-measures-for-association-rules/ (accessed August 13, 2019).
Şeker, S. E. 2013. İş Zekası ve Veri Madenciliği (Weka ile). Turkey: Cinius Publications. ISBN: 9786051276717.
Timor, M., and T. U. Şimşek. 2008. Customer behavior modeling by using market basket analysis in data mining. İstanbul Üniversitesi İşletme Fakültesi İşletme İktisadı Enstitüsü Dergisi 19 (59):3–10.
Ulaş, M. A., E. Alpaydın, N. Sönmez, and A. Kalkan. 2001. Veri Madenciliğinde Sepet Analizi Uygulamaları, IT Summit 2001, TBD 18. Informatics Congress, 4–7 September, İstanbul.
Vancouver Island University. 2019. Marina Barsky, Computing Science, Vancouver Island University. http://csci.viu.ca/barskym/teaching/DM2012/labs/LAB7/PartII.html (accessed August 11, 2019).
Vohra, G. 2012. 10 Most popular analytics tools in business. https://analyticstraining.com/10-most-popular-analytic-tools-in-business/ (accessed August 12, 2019).
Yulianto, E., and H. Heryanto. 2019. Rancang Bangun Perangkat Lunak E-commerce Menggunakan Metode market basket analysis. Media Informatika 18 (1):21–41.
