Business Analytics Summary
Contents
Pareto Chart
Variation in business process
Special and common cause variation
Structural variation
Robustness, an underused concept
Process stability and capability tools
Statistical inference, accuracy and precision
Statistical inference
Accuracy and precision
Control chart to evaluate stability
Rule of thumb to define out of control process
Phases of control charting
Generality to construct control charts
Type of control charts
Control charts constants
Control Charts Formula
Deciding which charts to use
Multivariate process control and monitoring
Case study: on-time delivery
Process capability analysis to evaluate capability
Process capability ratios
Deming's funnel experiment
Overall business process improvement strategy
Tools to Analyze common cause variation step
Tools of the Study cause-and-effect relationships
Tools of the 'Document the problem' step
Tool of the 'Identify potential root causes' step
Tools of the 'Choose best solutions' step
Process control charts and continuous improvement
Tools to explore observational business data
What are observational studies
Affinity grouping association rules
Cooccurrences, Association rules, Sequential patterns
Association rule mining
Support of a rule
Confidence of a rule
Frequent items or itemset
A priori principle
Basic a priori algorithm
Leverage
Recommendation engines or systems
Building a recommendation system
Introduction
Definition
Business analytics refers to the methodology an organization employs to enhance its business and make optimized, data-driven decisions, using 'statistical thinking' to improve, for example, its products, services, supply chain and operations, human resources, financial management and marketing.
Types of analytics
Descriptive analytics
'What happened?' or 'What is happening now?' (that is, describe, inform, or sense -> hindsight)
Descriptive analytics is also known as 'Business Intelligence' (BI).
Predictive analytics
'What will happen?' (that is, predict or forecast -> foresight)
Prescriptive analytics
'What to do?', 'How to make it happen?', 'What is the best that could happen?' or 'How to optimize what happens?' (optimize, advise, recommend, act -> insight)
Statistical thinking
What is it?
Statistical thinking is the philosophy of learning and acting based on the following fundamental principles:
- All work occurs in a system of interconnected processes - a process being a chain of activities
that turns inputs into outputs;
- Variation, which gives rise to uncertainty, exists in all processes;
- Understanding and reducing (unintended or unwanted) variation are keys to success.
Using statistical thinking
Statistical thinking provides a theory and a methodology for improvement:
All three principles work together to create the power of statistical thinking.
- A process is one or more connected activities in which inputs are transformed into outputs for
a specific purpose.
o Basically, any activity in which a change of state takes place is a process.
- We cannot improve a process that we do not understand.
SIPOC elements
Suppliers: the individual or group that provides the inputs to the process.
Inputs: the materials, information or other items that flow from the supplier, through the process, into outputs, and eventually to the customer(s).
Process steps: the specific activities that transform the process inputs into outputs.
Outputs: the product or service produced by the process.
Customers: the individual or group that utilize(s) the process outputs.
- Using a BPM model allows us to understand how work moves through the organization from inputs to outputs.
A BPM flowchart is used as a working document for identifying the steps in a business process and for highlighting the places where variation is most likely to occur.
- Joseph Juran formulated the '85/15 rule', which states that 85% of the problems are in the process and the remaining 15% are due to the people who operate the process.
- W. Edwards Deming stated that the true ratio is more like '96/4'.
Although the two quality gurus give different ratios, the vast majority of problems clearly lie in the process.
- An operational definition is a set of specific instructions about when, where and how to obtain
data.
- Operational definitions can be loosely described as descriptions that allow two people to look
at the same thing at the same time and record the same measurement.
- This does not make one definition right and the other wrong. It simply emphasizes the need
to communicate what definition is being used and to select a definition that is appropriate to
the situation.
- The best course of action is to anticipate and define as much as possible in advance.
Understanding and reducing variation (Third Principle)
The focus is on unintended variation and how it is analyzed to improve performance. First, we must
identify, characterize and quantify variation to understand both the variation and the process that
produced it.
- With this knowledge we work to change the process, e.g. operate it more consistently, to
reduce its variation.
The performance of a process is influenced by its average performance and the amount of variation
around the average performance.
The average performance of any process is a function of various factors involved in the operation of
the process, e.g. average time in days to get bills out, average waiting time in minutes to be served in
a restaurant, or average pounds of waste of a printing process.
When we understand the variation in the output of the process, we can determine which factors within
the process influence the average performance.
We can then attempt to modify these factors to move the average to a more desirable level.
Unfortunately, many managers today are still concerned only with average process performance and have not yet paid much attention to performance variation.
- This fundamental problem can be readily rectified by applying the principles of statistical
thinking.
Businesses, as well as customers, are interested in the variation of the process output around its
average value.
- Customers of today are more concerned with the consistency of performance and not about
how often we can hit the target value of product or service performance.
- Typically, consistency of a product or a service is a key customer requirement.
- Using a problem-solving approach, we identify what was different in the process when it
produced the usual result.
- Using a process improvement approach, we study the common cause variation and try to
discover the input and process factors that are the largest sources of variation.
Structural variation
Structural variation is due to causes that operate as an inherent part of the system, as common causes do. On a control chart, however, they appear to be due to special causes, yet their occurrence is predictable ('predictably unstable').
The experiment shows that even though `willing workers' want to do good jobs, their success is directly
tied to and limited by the nature of the system (i.e. the process) they are working within.
Real and sustainable improvement on the part of the `willing workers' is achieved only when the
`management' (i.e. the owner of the system) is able to improve the process.
Statistical thinking strategy
Broad use of statistical thinking can help organizations improve operations and their management
system. Statistical thinking can be used in all parts of an organization and job functions.
Overall approach
We begin by identifying, documenting and understanding the business process itself. We almost
always begin with some subject matter knowledge (`theory').
- Subject matter knowledge is everything we know about the process under study.
- This could be derived from experience or from an academic study.
- This guides us in planning which data from the process would be most helpful to validate or
refine our ideas (`hypotheses').
- Subject matter knowledge helps us to determine where biases are likely to occur and to avoid
or minimize them.
Once the data are obtained, statistical techniques and tools are used to analyze the data and to
account for the variation in the data.
- Although the analysis may confirm our ideas (`hypotheses'), additional questions almost
always arise, or new ideas may be generated.
- This leads to a desire to obtain additional data to validate or further refine the revised
hypotheses, and the whole cycle is repeated.
Most business applications are sequential studies involving a series of such steps.
- Fortunately, our knowledge about the process, and therefore our ability to improve it,
increases with each step.
A complicating issue is that business processes are not static over time but dynamic.
- Statistical methods can be very helpful in determining whether the process has undergone
`significant' change.
Statistical thinking uses the scientific method to develop subject matter knowledge and to gather
data to evaluate and revise hypotheses (`ideas'). There are some important differences:
- Statistical thinking recognizes that results are produced by a process and that the process must
be understood and improved to improve the results; and
- the emphasis in statistical thinking is on variation; the scientific method can be applied without any awareness of the concept of variation, which may lead to misinterpretation of the results.
Key similarities are that both are sequential approaches that integrate data and subject matter
knowledge.
Statistical engineering
Definition
Statistical engineering provides the needed frameworks: it supplies the tactics, or the specific plan of attack, for a given problem.
Statistical engineering is the study of how to best utilize statistical concepts, methods and tools, and
integrate them with information technology and other relevant sciences to generate improved results.
- We need to think carefully about the problem and our overall approach to solving it prior to
getting carried away with detailed analytics.
The impact of statistical thinking and methods can be increased by integrating several statistical tools,
and even non-statistical tools, enabling practitioners to deal with issues that cannot be addressed with
any one method.
Viewing statistical thinking and methods from an engineering perspective provides a clear focus on problem solving.
Statistical engineering emphasizes having a plan of attack for complex problems and linking various
tools together in an overall approach to improvement.
Causes (inputs) are often shown as groupings of related sub-factors that act in concert to form the overall 'effect' of the group.
Pareto Chart
The Pareto chart is a bar chart that ranks categories (items) by
how important they are with respect to their frequency of
occurrence. Each block represents an item; the bigger the block,
the more important it is. For example, when the items are
problems, the Pareto chart helps to identify which problems
need immediate attention and which can be looked at later.
While there are many things that cause a large problem, it is
frequently found that a few specific factors (about 20%) account
for the bulk of the observed occurrences of that problem (about
80%). This phenomenon was called the `Pareto principle'.
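To make this concrete, here is a minimal sketch of a Pareto chart in Python with pandas and matplotlib; the problem categories and counts are invented for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical complaint categories and how often each occurred
counts = pd.Series({"Late delivery": 87, "Wrong item": 34, "Damaged pack.": 21,
                    "Billing error": 12, "Other": 6}).sort_values(ascending=False)
cum_pct = counts.cumsum() / counts.sum() * 100      # cumulative percentage line

fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)                 # bars ranked by frequency
ax.set_ylabel("Frequency")
ax2 = ax.twinx()                                    # secondary axis for the cumulative %
ax2.plot(counts.index, cum_pct, marker="o", color="red")
ax2.set_ylabel("Cumulative %")
plt.tight_layout()
plt.show()
```

The cumulative line makes the '80/20' pattern visible: the first one or two bars typically account for most of the occurrences.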
Variation in business process
Special and common cause variation
Differences between common and special cause variation
- A process is said to be in (statistical) control if the only source of variation is common cause
variation.
- A process is said to be out of control if special cause variation is present.
These distinctions between special cause and common cause variation are very important because the
proper approach for improving processes with common causes of variation (i.e. stable processes) is
different from the approach used with special causes of variation (i.e. unstable processes).
If a process is unstable (or `out of control'), eliminating the special causes will help to stabilize it.
- Because special causes are `assignable' to a specific event and are not a fundamental part of
the process, they can often be addressed without spending a great deal of time, effort or
money.
- Special causes tend to have the most impact, so addressing them first will typically result in dramatic improvement.
Once the process is stabilized, we can begin to study and understand it, allowing us to make
fundamental improvements.
- Eliminating special causes is really fixing problems, or bringing the process back to where it
should have been in the first place.
- True improvement comes by fundamentally changing the process.
- To do this, we must address the common causes.
If the process is stable (or `in control'), improvement generally requires careful study of the whole
process to identify and prioritize the common causes and their relationships.
Structural variation
There are some `predictably unstable' processes, i.e. some processes are technically unstable, but in
a predictable way.
In general, eliminating structural variation requires fundamental change to the process, much like
common cause variation.
Robustness, an underused concept
- Besides controlling the process by eliminating special cause variation and improving the system by reducing common cause variation, robustness is another way to reduce variation. It can be built into:
- Process design
- Product design
- Management practices
We want to build robust processes by anticipating variation and reducing its effects.
A robust process is insensitive to uncontrolled factors both internal and external to it. It is like installing
shock absorbers on a car. Shock absorbers enable passengers to enjoy a smooth ride despite the
uncontrollable conditions of the road. Similarly, robustness provides shock absorbers for the work
process and its output.
Without a control chart, managers risk two kinds of error:
- They could fail to identify a performance change as a special cause and thereby lose valuable information about the process;
- They could treat a change as a special cause when it is, in fact, a part of 'typical' process variation; this error is called 'tampering'.
A control chart consists of:
- A Center Line that represents the average value of the process characteristic corresponding to the in-control state
- An Upper Control Limit (UCL)
- A Lower Control Limit (LCL)
As long as the points plot within the control limits, the process is assumed to be `in control', and no
action is necessary.
However, a point that plots outside of the control limits is interpreted as evidence that the process is
`out of control'.
- Investigation and corrective action are required to find and eliminate the assignable causes responsible for this behavior.
Even if all the points plot inside the control limits, if they behave in a systematic or nonrandom manner,
then this is an indication that the process is `out of control'.
Phases of control charting
Phase I
The phase I (or `set-up' phase) is a retrospective analysis of process data to construct `trial control
limits'
- To determine if the process has been in control over the period of time where the data were
collected; and
- To see if reliable control limits can be established to monitor future production.
Control charts are used primarily in phase I to assist in bringing the process into a state of statistical control: by differentiating common cause from special cause variation over time, they allow us to evaluate process stability and to eliminate special causes.
Phase II
Phase II begins after we have a `clean' set of process data gathered under stable conditions and
representative of the in-control process performance.
- In phase II, we use control charts (with the projected phase I control limits) to monitor the
process for special cause variation.
- Special causes can be detected rapidly using these charts, and problems can be diagnosed and
eliminated.
- The control limits will not be updated unless there has been a substantial change in the
process.
- The control chart itself does not eliminate special causes, it `only' detects them and provides
clues to help us understand the process.
A control chart is also of little value in improving a stable process because it will only continue to tell
us that the process is stable.
- Control charts are not designed for reducing common cause variation.
- To reduce common cause variation quite different tools and ways of thinking are required, e.g.
modify internal parts of the process using `planned' investigation.
There is a close connection between control charts and statistical hypothesis testing as the control
chart tests a hypothesis repeatedly at different points in time.
Generality to construct control charts
Usually, the true process parameters (e.g. the process mean µ and standard deviation σ) are not known. Therefore, we need to estimate them. We usually do so by using samples that are taken while we believe the process is in control.
- These estimates should usually be based on at least 20 to 25 samples, i.e. subgroups of data.
- Suppose that m samples are available, each containing n observations on the process
characteristic. Typically, n will be small, often either 4, 5 or 6.
- X1 bar, X2 bar, …, Xm bar are the averages of the individual samples.
The best estimator of µ, the process average, is the grand average, i.e. the mean of the sample means, denoted X bar bar:
$$\bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_m}{m}$$
To construct the control limits, we need to estimate the standard deviation σ. We can estimate σ from
either the ranges or the standard deviations of the m samples.
$$R = X_{\max} - X_{\min}$$
If $R_1, R_2, \ldots, R_m$ are the ranges of the m samples, then
$$\bar{R} = \frac{R_1 + R_2 + \cdots + R_m}{m}$$
An unbiased estimator of σ is then
$$\hat{\sigma} = \frac{\bar{R}}{d_2}, \quad \text{where } d_2 \text{ is a constant.}$$
The standard deviation method
This method is preferable to the range method when either the sample size n is moderately large -say,
n > 10 - or the sample size n is variable. Moreover, this method uses all the data, not just the largest
and smallest observations.
It might also be useful to define S bar, where S1, S2, …, Sm are the sample standard deviations:
$$\bar{S} = \frac{S_1 + S_2 + \cdots + S_m}{m}$$
Moving Range
We can use the moving range of two successive observations, i.e. the unsigned difference between consecutive measurements, as the basis for estimating the process variation:
$$MR_i = |X_i - X_{i-1}|$$
$$\overline{MR} = \frac{MR_2 + MR_3 + \cdots + MR_m}{m - 1}$$
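A minimal sketch of these estimates in Python (made-up subgroup data; d2 = 2.326 is the commonly tabulated constant for samples of size n = 5):

```python
import numpy as np

# Hypothetical phase I data: m = 25 subgroups of n = 5 observations each
rng = np.random.default_rng(0)
samples = rng.normal(loc=50.0, scale=2.0, size=(25, 5))

xbar_i = samples.mean(axis=1)                      # sample averages
R_i = samples.max(axis=1) - samples.min(axis=1)    # sample ranges

xbarbar = xbar_i.mean()    # grand average: estimate of the process mean mu
Rbar = R_i.mean()          # average range
d2 = 2.326                 # tabulated constant for n = 5
sigma_hat = Rbar / d2      # estimate of the process standard deviation sigma
```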
Type of control charts
There are many types of control charts, but they all plot data (or statistics calculated from data) over
time (or sample number) and have statistical control limits (UCL and LCL) representing the range of
common cause variation to be expected.
C-Chart
C-Chart or `count chart': plot of the number of defects (or nonconforming items). Assumes that defects
(or nonconforming items) are rare and control limits are computed based on the Poisson distribution
(`distribution of rare events'); see `Practical 8'.
U-Chart
U-Chart or `unitized count chart': plot of the fraction (or proportion) of defects (or nonconforming
items), i.e. the number of defects (or nonconforming items) divided by the sample size.
- The u chart simply converts the data to a common scale when the opportunity for defects
varies from data point to data point.
- Assumes that defects are rare and control limits are computed based on the Poisson
distribution.
- Does not require constant number of units (unlike the c chart). Can be used, for example, when
samples are of different sizes.
- These data are better treated as if they were continuous data, even though they technically
are not continuous.
Consider discrete data, i.e. attributes, that fall in only one of two categories, such as `defective' or `not
defective', or `nonconforming' or `conforming'.
p-Chart
It is a plot of the fraction or proportion of defectives, i.e. the ratio of the number of defectives to the
sample size.
- Control limits are based on the binomial distribution; see `Practical 8'.
- For defectives (binomial) data, we may have equal or unequal subgroup sizes.
To estimate p bar:
$$\bar{p} = \frac{\sum_{i=1}^{m} D_i}{mn} = \frac{\sum_{i=1}^{m} \hat{p}_i}{m}$$
and the variance of p hat is
$$\operatorname{var}(\hat{p}) = \frac{p(1 - p)}{n}$$
The p chart has the advantage that groups of data with different sample sizes can still be plotted on
the same chart.
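As a sketch, three-sigma p-chart limits implied by the variance formula above could be computed as follows; the defective counts and sample sizes are invented, and the limits vary with each sample size:

```python
import numpy as np

# Hypothetical data: D[i] defectives found in a sample of size n[i]
D = np.array([12, 8, 15, 10, 9, 14, 7, 11])
n = np.array([500, 480, 520, 500, 490, 510, 470, 500])

p_hat = D / n                                  # per-sample fraction defective
p_bar = D.sum() / n.sum()                      # centre line

sigma_p = np.sqrt(p_bar * (1 - p_bar) / n)     # square root of var(p_hat) for each sample
UCL = p_bar + 3 * sigma_p
LCL = np.maximum(0.0, p_bar - 3 * sigma_p)     # a proportion cannot be negative

out_of_control = (p_hat > UCL) | (p_hat < LCL)
```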
np-Chart
The only difference from the p-Chart is that it is not the proportions that are plotted but the numbers of occurrences (or nonoccurrences).
- It is necessary to monitor both the mean value of the process characteristic and its
variation!
Individuals Chart
The individual chart (plot of individual data points): used when the nature of the process is such that
it is difficult or impossible to group measurements into subgroups so that an estimate of the process
variation can be determined.
- The sample size used for process monitoring is n = 1; that is, the sample consists of an individual unit - a single process measurement.
- The individuals chart is a generic chart, and it can be used in virtually any situation, including
with discrete data.
- The individuals chart is therefore a good default chart when you are uncertain about the most
appropriate chart to use.
X bar Chart
In X bar Charts the sample means are plotted in order to control the mean value of a variable.
The control limits can be based on either the ranges or the standard deviations of the m samples as the measure of 'within-sample variation':
𝑈𝐶𝐿 = 𝑋̿ + 𝐴2 𝑅̅
𝐶𝐿 = 𝑋̿
𝐿𝐶𝐿 = 𝑋̿ − 𝐴2 𝑅̅
Where A2 is a constant
𝑈𝐶𝐿 = 𝑋̿ + 𝐴3 𝑆̅
𝐶𝐿 = 𝑋̿
𝐿𝐶𝐿 = 𝑋̿ − 𝐴3 𝑆̅
Typically, you use this formula when the sample size n is moderately large (n > 10) or when the sample
size n is variable.
R Chart
In this type of chart, the sample ranges are plotted and the R chart with three-sigma control limits is
defined as:
𝑈𝐶𝐿 = 𝐷4 𝑅̅
𝐶𝐿 = 𝑅̅
𝐿𝐶𝐿 = 𝐷3 𝑅̅
S Chart
In this type of chart, the sample standard deviations are plotted and the S chart with three-sigma
control limits is defined as:
𝑈𝐶𝐿 = 𝐵4 𝑆̅
𝐶𝐿 = 𝑆̅
𝐿𝐶𝐿 = 𝐵3 𝑆̅
S charts are preferable to R charts when either the sample size n is moderately large - say, n > 10 - or
the sample size n is variable.
Note that, if you make changes to the process you have to recalculate the control limits.
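A minimal sketch of X bar and R chart limits in Python, using the commonly tabulated constants for subgroups of size n = 5 (A2 = 0.577, D3 = 0, D4 = 2.114); the data are simulated for the example:

```python
import numpy as np

# Hypothetical data: m = 20 subgroups of n = 5 measurements each
data = np.random.default_rng(1).normal(loc=10.0, scale=0.2, size=(20, 5))

xbar = data.mean(axis=1)                       # sample means (plotted on the X bar chart)
R = data.max(axis=1) - data.min(axis=1)        # sample ranges (plotted on the R chart)
xbarbar, Rbar = xbar.mean(), R.mean()          # centre lines

A2, D3, D4 = 0.577, 0.0, 2.114                 # constants for n = 5

xbar_UCL, xbar_LCL = xbarbar + A2 * Rbar, xbarbar - A2 * Rbar
R_UCL, R_LCL = D4 * Rbar, D3 * Rbar
```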
Control charts constants
Control Charts Formula
Deciding which charts to use
- Answering the question `Can we count nonoccurrence?' determines whether the data are like
`defectives' (i.e. something is either defective or not -> `Yes') or like `defects' (wherein one
item can have numerous defects -> `No');
- Answering the question `Are occurrences rare?' determines whether the data, although very similar to defects data, satisfy the statistical requirement that the defects are rare, which is needed for use of the Poisson distribution.
- The term rare is intended to be interpreted at a particular location and point in time.
The gap between the target and the overall average is called the off-target deviation.
Process capability analysis to evaluate capability
Capability refers to the ability of a process to meet its business requirements.
Capable means that the process variation is low enough so that if the process average is properly set,
virtually all process measurements will meet its business requirements.
- The emphasis of statistical capability analysis is therefore primarily on the process variation,
as opposed to the process average.
- A process could be capable, but not currently performing well relative to its requirements
because the process average is off-target.
Process capability analysis is the primary statistical tool used to evaluate capability - once the process
is stabilized!
For processes that are stable, we often wish to evaluate the degree to which the current process will
satisfy business requirements, assuming the process continues to produce stable output. That is
exactly what we do with Process capability analysis.
The business requirements - the so-called specifications or targets - are externally imposed on the process, e.g. by technical specifications, customer expectations or requirements.
Such business requirements determine the specification limits, i.e. the `Upper Specification Limit'
(USL) and the `Lower Specification Limit' (LSL). The specification limits indicate the “acceptable” range
over which the process should perform.
A process in control is stable over time. However, there is no guarantee that a process in control
produces products (or services) of satisfactory quality, i.e. as externally imposed by specifications.
Capability has nothing to do with control - except for the very important point that if a process is not in control, it is hard to tell if it is capable or not.
If a process that is in control, i.e. a process that is stable, does not have adequate capability,
fundamental changes in the process are needed.
- Capability relates the actual performance of a process in control, i.e. after special causes have
been removed, to the desired performance.
There is no mathematical or statistical relationship between specifications limits and control limits.
- By plotting the data of key variables versus their specifications, we can determine graphically
whether we have a problem.
- Calculation of capability ratios supplies a number to document how well we are, or can be,
doing.
- A benefit of process capability ratios is that they make discussion of process capability more
rational and less dependent on opinion (HIPPO)
Most capability ratios are more strongly affected by non-normality than control charts are, and can be quite inaccurate when based on small numbers of observations.
Cp ratio
The process capability ratio is defined as
$$C_p = \frac{USL - LSL}{6\sigma}$$
The range USL – LSL is also known as the specification width or interval.
Cpk ratio
This is a logical extension of Cp, because one could have a Cp greater than 1 and yet have data outside the specifications because the average is off-center. It is defined as
$$C_{pk} = \min(C_{pu}, C_{pl})$$
with
$$C_{pu} = \frac{USL - \mu}{3\sigma}$$
and
$$C_{pl} = \frac{\mu - LSL}{3\sigma}$$
In order to obtain the estimate of this ratio, we replace the parameters with their estimates.
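For illustration, a minimal sketch of the estimated ratios in Python; the specification limits and data are made up, and the overall sample standard deviation is used here as the estimate of σ (the within-subgroup estimate R bar / d2 could be substituted):

```python
import numpy as np

LSL, USL = 9.0, 11.0                                   # hypothetical specification limits
x = np.random.default_rng(2).normal(10.1, 0.25, 200)   # stable-process measurements

mu_hat, sigma_hat = x.mean(), x.std(ddof=1)            # estimates of mu and sigma

Cp = (USL - LSL) / (6 * sigma_hat)
Cpu = (USL - mu_hat) / (3 * sigma_hat)
Cpl = (mu_hat - LSL) / (3 * sigma_hat)
Cpk = min(Cpu, Cpl)                                    # capability allowing for off-centring
```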
Relationship between Cp and Cpk
- It is easy to forget that, for example, Cpk is simply a point estimate and, as such, is subject to statistical fluctuation.
- It should become standard practice to report confidence intervals for process capability ratios.
$$\hat{C}_{pk} \pm z_{1-\alpha/2} \sqrt{\frac{1}{9n} + \frac{\hat{C}_{pk}^2}{2(n - 1)}}$$
In general, z = 1.96 when alpha = 0.05 or 2.5758 when alpha = 0.01. This works when the sample size
n is at least 25.
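A quick sketch of that confidence interval in Python (the Cpk estimate and sample size are made up; scipy is assumed available for the normal quantile):

```python
import numpy as np
from scipy.stats import norm

Cpk_hat, n, alpha = 1.2, 100, 0.05          # hypothetical estimate, sample size, level

z = norm.ppf(1 - alpha / 2)                 # 1.96 for alpha = 0.05
half_width = z * np.sqrt(1 / (9 * n) + Cpk_hat**2 / (2 * (n - 1)))
ci = (Cpk_hat - half_width, Cpk_hat + half_width)
```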
Deming's funnel experiment
The funnel experiment clearly shows that one should not react before scientifically studying a process.
The lessons learnt (or 'rules') of the experiment can be applied to many different types of management practice, all of which are impediments to effective management and continuous improvement.
Overall business process improvement strategy
As the analysis of special cause variation differs from that of common cause variation, the emphasis should be on removing special cause variation first.
- Use tools of improvement to study all the data (i.e. not just the ‘good’
or the ‘bad’ points) and identify factors that cause variation.
- Determine what needs to be permanently changed to achieve a
different level of quality.
Tools to Analyze common cause variation step:
- Stratification: define a ‘stratification factor’ such as the day of the week, the machine or the
business unit.
o Partition the factor into logical categories.
o Compare the data for each category to highlight differences and so that patterns can
be seen;
- Disaggregation: define measures for sub processes or individual process steps.
o Study the variation in the individual sub processes.
o How does it contribute to the overall process variation?
- Regression analysis: existing process knowledge might suggest one or more variables (‘inputs’)
that influence the process measure (‘output’).
o A regression analysis might ‘verify’ this opinion or indicate that these variables have
negligible ‘effect’.
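As a sketch of the regression idea in the last bullet, a simple least-squares fit of one hypothetical input against the process output in Python:

```python
import numpy as np

# Hypothetical observations: batch size (input) vs. processing time (output)
batch_size = np.array([10, 20, 25, 30, 40, 50, 60, 75])
proc_time = np.array([12.1, 18.4, 20.9, 24.8, 30.2, 37.5, 44.1, 55.0])

slope, intercept = np.polyfit(batch_size, proc_time, deg=1)   # least-squares line
fitted = intercept + slope * batch_size
r_squared = 1 - ((proc_time - fitted) ** 2).sum() / ((proc_time - proc_time.mean()) ** 2).sum()
# A small r_squared would suggest the input has a negligible 'effect' on the output.
```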
Tools of the Study cause-and-effect relationships
- (statistical) Experimental design: strategic and systematically planned variation of input factors for an actual process, also known as 'Design Of Experiments' (DOE); see references.
o The experimenter observes the ‘effect’ of these variations on important process
characteristics;
- Interrelationship digraph: evaluate the cause-and-effect relationships between issues to
identify which are the ‘drivers’ and which are the ‘effects’.
o This provides a way to process raw ideas that have been generated in ‘brainstorming’
(which was used to rapidly generate a diverse list of ideas or potential root causes);
- Model building: construct a (statistical) model of a process that predicts (not explains!)
process performance based on input variables
- Scatter plot: plot of a process characteristic (output) versus a potential explanatory variable
(input);
- Box plot: a boxplot depicts the relationship between a discrete variable, such as the region of a country, and the distribution of a continuous variable, such as profitability; see the sketch below.
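For example, a minimal box-plot-by-group sketch in Python (the regions and profitability figures are invented):

```python
import matplotlib.pyplot as plt

# Hypothetical profitability (%) observed in three regions
profit = {"North": [4.1, 5.2, 3.8, 6.0, 4.9],
          "South": [2.9, 3.4, 2.2, 4.1, 3.0],
          "West":  [5.5, 6.3, 4.8, 7.1, 6.0]}

plt.boxplot(list(profit.values()), labels=list(profit.keys()))  # one box per region
plt.ylabel("Profitability (%)")
plt.show()
```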
o The process of continuous improvement should be charted over time and
adjustments made to the control charts in use to reflect the improvements made (in
reducing variation):
- So-called data mining (rebranded nowadays as data science) usually involves analyses of
observational data to find ‘novel’ patterns or relationships, i.e. to create or generate new
ideas.
Well-conducted observational studies often suggest patterns, relationships or provide clues that -
when followed up by more rigorous investigations - can lead to important findings to help improve a
business process.
Observational studies have particular applicability at the starting point of an investigation, as well as
in situations for which more disciplined studies are impractical - or even unethical - and for which the
main goal is prediction rather than gaining understanding (explanation).
Observational data have an important role in pointing the way forward, but they should not be a
primary ingredient of making data-driven decisions!
- Basket analysis gives insights into the merchandise by telling a business which products tend
to be purchased together and which are most amenable to promotion.
- This information is actionable
o It can suggest new store layouts
o It can determine which products to put on special
o It can indicate when to issue coupons and so on.
Although the roots of association rules are in analyzing point-of-sale transactions, association rules can
be applied outside the retail industry.
- Whenever a customer purchases multiple products at the same time or does multiple things in close proximity, or whenever one wants to discover sequences of events that commonly occur together, there is a potential application.
- Some examples include:
o items purchased on a credit card, such as rental cars and hotel rooms, give insights
into the next product that customers are likely to purchase;
o optional services purchased by telecommunications customers (e.g. call waiting, call
forwarding, ISDN, speed call, UMTS) help determine how to bundle these services
together to maximize revenue;
o banking services used by retail customers (e.g. money market accounts, investment services, car loans) identify customers likely to want other services;
o unusual combinations of insurance claims can be a sign of fraud and can spark further
investigation
o hidden relationships between financial transactions based on their cooccurrences can
be a sign of money laundering;
o medical patient histories can give indications of complications based on certain
combinations of treatments.
Often, basket analysis is used as a starting point when transaction data are available, and a business
does not know what specific patterns to look for.
- Note that association rule mining does not consider the order of transactions (unlike 'sequential pattern analysis or mining').
Support of a rule
It is defined as:
$$X \Rightarrow Y \text{ has support } s \text{ if } P(X \text{ and } Y) = s$$
Support denotes the proportion of transactions in the data set which contain the itemset (i.e. the cooccurrence).
A high value means that the rule involves a large part of the database.
Confidence of a rule
It is defined as:
$$X \Rightarrow Y \text{ has confidence } c \text{ if } P(Y \mid X) = \frac{\text{support of } X \text{ and } Y}{\text{support of } X} = c$$
Confidence denotes the proportion of transactions containing X which also contain Y.
Frequent items or itemsets
Items or itemsets with support ≥ smin are called frequent items or itemsets.
A priori principle
It is: any subset of a frequent itemset must be frequent. Hence it is sufficient to mine only the maximal frequent itemsets.
The 'a priori' principle is very powerful and can greatly reduce the search space.
Thus, by applying the 'a priori' principle, many itemsets can potentially be avoided when exploring the search space.
In practice, however, we fix not only a minimal support smin but also a minimal confidence cmin.
- The rules hold, i.e. are valid, if their support is ≥ smin and their confidence is ≥ cmin.
- If smin is high, then we get few frequent itemsets and few valid rules, which occur very often.
- If smin is low, then we get many valid rules, which occur rarely.
- If cmin is high, we get few rules, but all are 'almost logically true'.
- If cmin is low, we get many rules, but many of them are very 'uncertain'.
In practice, typical values are smin = 2-10% (rarely occurring rules) and cmin = 70-90% ('certain' rules).
Several optimizations for the basic a priori algorithm exist, as well as problem extensions, e.g. using
interestingness measures beyond support and confidence
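To make the idea concrete, here is a minimal, unoptimized sketch of the basic a priori algorithm in Python; the transactions and the support threshold are invented for the example.

```python
# Hypothetical transactions (one set of items per basket)
transactions = [{"bread", "milk"},
                {"bread", "beer", "eggs"},
                {"milk", "beer", "cola"},
                {"bread", "milk", "beer"},
                {"bread", "milk", "cola"}]
s_min = 0.4  # minimal support

def support(itemset):
    """Proportion of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent individual items
items = {i for t in transactions for i in t}
levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= s_min}]

# Level k+1: candidates are built only from frequent k-itemsets (a priori principle)
k = 1
while levels[-1]:
    candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k + 1}
    levels.append({c for c in candidates if support(c) >= s_min})
    k += 1

frequent_itemsets = set().union(*levels)
print(sorted(sorted(s) for s in frequent_itemsets))
```

Because candidates are generated only from itemsets already found frequent at the previous level, large parts of the search space are never examined.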
Lift
Lift is defined by the combination of the support and confidence:
$$X \Rightarrow Y \text{ has lift } l \text{ if } \frac{\text{confidence of } X \Rightarrow Y}{\text{support of } Y} = \frac{\text{support of } X \text{ and } Y}{\text{support of } X \cdot \text{support of } Y} = l$$
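A tiny numerical illustration of the three measures (all counts are invented):

```python
# 1,000 transactions: 300 contain X, 200 contain Y, 120 contain both X and Y
n_total, n_X, n_Y, n_XY = 1000, 300, 200, 120

support = n_XY / n_total             # support of X => Y: 0.12
confidence = n_XY / n_X              # P(Y | X) = 0.40
lift = confidence / (n_Y / n_total)  # 0.40 / 0.20 = 2.0 (> 1: X and Y co-occur more than expected)
```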
Leverage
It is an alternative to the lift; it uses the difference of the quantities defining lift rather than their ratio. It is defined as ρ:
$$\rho = \frac{\text{support of } X \Rightarrow Y}{\sqrt{\text{support of } X \cdot \text{support of } Y}}$$
In practice, in addition to fixing a minimal support smin and a minimal confidence cmin, we could also fix a minimal 'correlation' value ρmin, say, and retain only the rules whose values exceed these minima.
Recommendation engines or systems
'Offline' recommendation engines or systems: think of the people around you as recommendation engines.
Collect preferences
This step builds user profiles. Different approaches are generally used to measure users' tastes and interests:
- Personal preferences or ‘tell the business what you like’; Example. The book you just read
changed your life, give it a five-star review on Amazon. Or, you liked that article and shared it
on Twitter.
- Collaborative Filtering (CF) or ‘people like you tell the business what you may like’.
Example: Businesses know what your preferences are, businesses find people with ‘similar’
tastes, businesses look at what they have purchased and recommend you the items they liked
and that you have not purchased yet.
Example: Patterns found with association rule mining could be used when recommending new
products or services to others based on what others have bought before (or based on which
products or services are bought together).
Note that CF does not need an ‘understanding’ of the item, e.g. of the book or the movie, itself.
- Content based filtering methods are based on a description, i.e. an ‘understanding’, of the
item and a profile of the user’s preferences.
o In a content based filtering recommendation system, keywords are used to describe
the items; besides, a user profile is built to indicate the type of items this user likes.
o These methods recommend items that are ‘similar’ to those that a user liked in the
past (or is examining in the present).
- Hybrid recommendation systems: a hybrid approach, combining CF and content based
filtering could be more effective in some cases;
Example. A possible hybrid approach implementation where recommendations are done
separately and then combined:
Example. Netflix is a good example of a hybrid system. They make movie recommendations by
comparing the watching and searching habits of similar users (i.e. CF) as well as by offering
movies that share characteristics with movies that a user has rated highly (i.e. content based
filtering).
- Collective intelligence or ‘it is the consensus that tells the business what you like’;
Example. If 99% of the people who have seen this new movie thought it was terrible, it is
unlikely that the recommendation system recommends it to you.
- Discovery: the recommendation system experiments with presenting you new things — based
on your history — and you tell whether you like it or not.
o In addition to getting to know you better, this stimulates novelty and creates surprise.
People are unique and there is no single approach to recommendation that will work for everyone. The challenge is to find the metrics that are relevant and to combine them in a clever way to make the recommendation as personal as possible.
Find similarities
This step consists of finding people that are similar to you. Similarity indicates the strength of the relationship between two users, but it is often hard to define; 'similarity is subjective':
- Different measures of similarity calculated from the same set of users (or items) can, and often
will, lead to different solutions.
- Associated with similarity is dissimilarity = 1 − similarity, e.g. two users (or items) are ‘close’
when their dissimilarity is small or their similarity large.
- The term distance is often used informally to refer to a dissimilarity measure derived from the
characteristics describing the users (or items).
- Experiment with the simplest measures first, since this is likely to ease the possibly difficult
task of the interpretation of results.
Two common dissimilarity measures are the Manhattan distance
$$d(x, y) = \sum_{k=1}^{n} |x_k - y_k|$$
and the Euclidean distance
$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$
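A minimal sketch of these measures for two users' hypothetical ratings of the same items, together with one common way of turning a distance into a similarity score:

```python
import numpy as np

# Hypothetical ratings of the same five items by two users
user_a = np.array([5, 3, 4, 4, 1])
user_b = np.array([4, 3, 5, 3, 2])

manhattan = np.abs(user_a - user_b).sum()            # sum of absolute differences
euclidean = np.sqrt(((user_a - user_b) ** 2).sum())  # square root of the summed squares

similarity = 1 / (1 + euclidean)                     # small distance -> similarity close to 1
```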
Make Recommendations
To recommend similar items (or users), simply find the ones that are most similar to the item (or user) you like, i.e. the ones with the lowest distance or the highest 'correlation' score.
Useful formulas
Variance
$$V(X) = E(X^2) - [E(X)]^2$$
Sample Variance
$$S^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$
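A quick numerical check of the sample-variance formula with numpy (the sample values are invented):

```python
import numpy as np

x = np.array([4.0, 7.0, 6.0, 5.0, 8.0])                # hypothetical sample

s2 = ((x - x.mean()) ** 2).sum() / (len(x) - 1)        # N - 1 in the denominator
assert np.isclose(s2, np.var(x, ddof=1))               # ddof=1 gives the sample variance
```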
Comparing 2 groups – test of proportions