
Business Analytics Summary

Table of Contents


Introduction
    Definition
    Types of analytics
        Descriptive analytics
        Explanatory (or diagnostic) analytics
        Predictive analytics
        Prescriptive analytics
Need for business improvement
Statistical thinking
    What is it?
    The meaning of quality and quality improvement
    Data-driven decision making
    Using statistical thinking
Principles of statistical thinking
    Process thinking (First Principle)
        SIPOC (Suppliers, Inputs, Process steps, Outputs, Customers)
        Business Process Management (BPM)
        Blame the process not the people
        Process measurement and operational definitions
        Process measurement tracked over time
    Variation exists in all processes (Second Principle)
    Understanding and reducing variation (Third Principle)
        Two types of variation in business processes
    Three key business analytics questions
    Deming's red bead experiment
Statistical thinking strategy
    Relationship to the scientific method
    Relationship to the "Plan Do Check Act" cycle
    Statistical engineering
        Definition
        Key principles of statistical engineering
        The overall statistics system as a discipline
        Fishbone diagram or cause-and-effect diagram
        Pareto chart
Variation in business processes
    Special and common cause variation
    Structural variation
    Robustness, an underused concept
Process stability and capability tools
    Statistical inference, accuracy and precision
        Statistical inference
        Accuracy and precision
    Control chart to evaluate stability
        Rules of thumb to define an out-of-control process
        Phases of control charting
        Generalities for constructing control charts
        Types of control charts
        Control chart constants
        Control chart formulas
        Deciding which charts to use
        Multivariate process control and monitoring
        Case study: on-time delivery
    Process capability analysis to evaluate capability
        Process capability ratios
        Deming's funnel experiment
Overall business process improvement strategy
    Tools of the 'Analyze common cause variation' step
    Tools of the 'Study cause-and-effect relationships' step
    Tools of the 'Document the problem' step
    Tool of the 'Identify potential root causes' step
    Tools of the 'Choose best solutions' step
    Process control charts and continuous improvement
Tools to explore observational business data
    What are observational studies
    Affinity grouping association rules
        Cooccurrences, association rules, sequential patterns
        Association rule mining
        Support of a rule
        Confidence of a rule
        Frequent items or itemsets
        A priori principle
        Basic a priori algorithm
        Leverage
    Recommendation engines or systems
        Building a recommendation system

Introduction
Definition
Business analytics refers to the methodology an organization employs to enhance its business and make better decisions based on data, using `statistical thinking' to improve, for example, its products, services, supply chain and operations, human resources, financial management and marketing.

Types of analytics
Descriptive analytics
Answers `What happened?' or `What is happening now?' (that is, describe, inform or sense -> hindsight).
Descriptive analytics is also known as `Business Intelligence' (BI).

Explanatory (or diagnostic) analytics


Answers `Why did it happen?', `Why is it happening?', `What is happening?', `What are the trends?' or `What patterns are there?' (that is, explain, understand or respond -> oversight).

Predictive analytics
Answers `What will happen?' (that is, predict or forecast -> foresight).

Prescriptive analytics
Answers `What to do?', `How to make it happen?', `What is the best that could happen?' or `How to optimize what happens?' (that is, optimize, advise, recommend or act -> insight).

Need for business improvement

Statistical thinking
What is it?
Statistical thinking is the philosophy of learning and acting based on the following fundamental
principles:

- All work occurs in a system of interconnected processes - a process being a chain of activities
that turns inputs into outputs;
- Variation, which gives rise to uncertainty, exists in all processes;
- Understanding and reducing (unintended or unwanted) variation are keys to success.

The meaning of quality and quality improvement


The importance of statistical thinking derives from the modern definition of quality

- Quality is inversely proportional to variation


- If variation in the important characteristics of a product or a service decreases, then the quality
of the product or the service increases.
- Quality improvement is the reduction of variation in a process.

Data-driven decision making


Data-driven decision making enables businesses to turn knowledge into appropriate decisions for the
good of their customers, and ultimately for an organization’s benefit.

Using statistical thinking
Statistical thinking provides a theory and a methodology for improvement:

- It helps to identify where improvement is needed


- It provides a general approach to take and suggests tools to use.

A complete improvement approach includes all elements of statistical thinking:

Process -> Variation -> Data

Principles of statistical thinking


- All work occurs in a system of interconnected processes - a process being a chain of activities
that turns inputs into outputs;
- Variation, which gives rise to uncertainty, exists in all processes; and
- Understanding and reducing (unintended or unwanted) variation are keys to success.

All three principles work together to create the power of statistical thinking.

Process thinking (First Principle)


This principle provides the context for understanding the organization, the improvement potentials
and the sources of variation mentioned in the second and third principles.

- A process is one or more connected activities in which inputs are transformed into outputs for
a specific purpose.
o Basically, any activity in which a change of state takes place is a process.
- We cannot improve a process that we do not understand.

SIPOC (Suppliers Inputs Process steps Outputs Customers)


It is a common model for process improvement efforts.

SIPOC elements

- Suppliers: the individual or group that provides the inputs to the process.
- Inputs: the materials, information or other items that go from the supplier, through the process, into outputs, and eventually to the customer(s).
- Process steps: the specific activities that transform the process inputs into outputs.
- Outputs: the product or service that is produced by the process.
- Customers: the individual or group that utilizes the process outputs.

Business Process Management (BPM)


In modern managerial terms the resulting business process model and its implementation is called
`Business Process Management' (BPM).

- Using a BPM model allows us to understand how work evolves and moves through the
organization from inputs to outputs.

A BPM model is a flowchart, used as a working document for identifying the steps in a business
process and for highlighting the places where variation is most likely to occur.

- Typically, 5 to 10 steps will be sufficient for a first understanding of the process.

Blame the process not the people


Joseph M. Juran, another quality guru, pointed out that the source of most problems is in the process
we use to do our work.

- He discovered the `85/15 rule', which states that 85% of the problems are in the process and
the remaining 15% are due to the people who operate the process.
- W. Edwards Deming stated that the true ratio is more like `96/4'.

Although there is a difference between the ratios provided by the two quality gurus, obviously the vast
majority of problems are in the process.

- Blame the process not the people when working on improvement.

Process measurement and operational definitions


Process measurements (i.e. measures or metrics) are critical to the successful management and
improvement of processes.

- Lack of good data is typically the greatest barrier to improvement

Operational definitions used in the data collection process are key

- An operational definition is a set of specific instructions about when, where and how to obtain
data.
- Operational definitions can be loosely described as descriptions that allow two people to look
at the same thing at the same time and record the same measurement.

Different operational definitions may lead to different data

- This does not make one definition right and the other wrong. It simply emphasizes the need
to communicate what definition is being used and to select a definition that is appropriate to
the situation.
- The best course of action is to anticipate and define as much as possible in advance.

Process measurement tracked over time


Process measurements tracked over time enable us to analyze the process in the following five ways:

- Assess the current performance level


- Determine if the process has shifted by comparing current to past performance
- Determine if the process should be adjusted (minor changes)
- Determine if the process must be improved (major changes)
- Predict future performance of the process

Variation exists in all processes (Second Principle)


This provides the opportunity for process improvement and hence for quality improvement. Variation is the enemy of
quality.

If there were no variation

- Processes would run better


- Products would have the desired quality
- Services would be more consistent
- Managers would manage better

Focusing on variation is the key strategy for improving performance.

Understanding and reducing variation (Third Principle)
The focus is on unintended variation and how it is analyzed to improve performance. First, we must
identify, characterize and quantify variation to understand both the variation and the process that
produced it.

- With this knowledge we work to change the process, e.g. operate it more consistently, to
reduce its variation.

The performance of a process is influenced by its average performance and the amount of variation
around the average performance.

The average performance of any process is a function of various factors involved in the operation of
the process, e.g. average time in days to get bills out, average waiting time in minutes to be served in
a restaurant, or average pounds of waste of a printing process.

When we understand the variation in the output of the process, we can determine which factors within
the process influence the average performance.

We can then attempt to modify these factors to move the average to a more desirable level.
Unfortunately, many managers today are still concerned only with average process performance
and have not yet paid much attention to performance variation.

- This fundamental problem can be readily rectified by applying the principles of statistical
thinking.

Businesses, as well as customers, are interested in the variation of the process output around its
average value.

- Customers today are more concerned with the consistency of performance than with how often
we can hit the target value of product or service performance.
- Typically, consistency of a product or a service is a key customer requirement.

Two types of variation in business processes


There are two types of unintended variation that we may need to reduce: special cause and common
cause variation.

Special cause variation


It is outside the `normal' or typical variation a process exhibits. The result of special cause may be
unpredictable or unexpected values that are too high or too low for the customer. It results from
unexpected or unusual occurrences that are not inherent in the process. As special cause variation is
atypical, we can often eliminate the causes without fundamentally changing the process.

- Using a problem-solving approach, we identify what was different in the process when it
produced the unusual result.

Common cause variation


It is normal or typical variation. It results from how the process is designed to operate and is a natural
part of the process. Reducing the inherent common cause variation typically requires studying the
process as a whole because there are no unusual results to investigate.

- Using a process improvement approach, we study the common cause variation and try to
discover the input and process factors that are the largest sources of variation.

Do not treat common cause variation as special cause variation!

Structural variation
Structural variation is due to causes that operate as an inherent part of the system as common causes
do. However, on a control chart, they appear to be due to special causes. But, their occurrence is
predictable (predictably unstable).

Three key business analytics questions

Deming’s red bead experiment


The red bead experiment illustrates the fallacy of rating people and ranking them in order of future
performance, based on past and present performance.

The experiment shows that even though `willing workers' want to do good jobs, their success is directly
tied to and limited by the nature of the system (i.e. the process) they are working within.

- Common cause variation is an inherent characteristic of the process as currently operated.


- A process exhibiting common cause variation will not respond to slogans, exhortations,
threats, consequences, counselling, training or any other individual-event response when
people are following instructions.

Real and sustainable improvement on the part of the `willing workers' is achieved only when the
`management' (i.e. the owner of the system) is able to improve the process.

Statistical thinking strategy
Broad use of statistical thinking can help organizations improve operations and their management
system. Statistical thinking can be used in all parts of an organization and job functions.

Overall approach

1. Study the current situation


2. Define quality performance
3. Analyze performance
4. Focus on the main problems
5. Monitor and evaluate progress

We begin by identifying, documenting and understanding the business process itself. We almost
always begin with some subject matter knowledge (`theory').

- Subject matter knowledge is everything we know about the process under study.
- This could be derived from experience or from an academic study.
- This guides us in planning which data from the process would be most helpful to validate or
refine our ideas (`hypotheses').
- Subject matter knowledge helps us to determine where biases are likely to occur and to avoid
or minimize them.

Once the data are obtained, statistical techniques and tools are used to analyze the data and to
account for the variation in the data.

- Although the analysis may confirm our ideas (`hypotheses'), additional questions almost
always arise, or new ideas may be generated.
- This leads to a desire to obtain additional data to validate or further refine the revised
hypotheses, and the whole cycle is repeated.

Most business applications are sequential studies involving a series of such steps.

- Fortunately, our knowledge about the process, and therefore our ability to improve it,
increases with each step.

A complicating issue is that business processes are not static over time but dynamic.

- Statistical methods can be very helpful in determining whether the process has undergone
`significant' change.

Relationship to the scientific method


In its simplest form the scientific method begins with a stated hypothesis (`idea') about some
phenomenon, then an experiment is conducted to test the hypothesis, and observation of the results
confirms or disproves the hypothesis.

- In application this method is also sequential.


- Observations from one experiment may cause us to revise our hypothesis, which may lead to
another experiment to evaluate the revised hypothesis.

Statistical thinking uses the scientific method to develop subject matter knowledge and to gather
data to evaluate and revise hypotheses (`ideas'). There are some important differences:

- Statistical thinking recognizes that results are produced by a process and that the process must
be understood and improved to improve the results; and
- the emphasis in statistical thinking is on variation; the scientific method can be applied
without any awareness of the concept of variation, which may lead to misinterpretation of the
results.

Key similarities are that both are sequential approaches that integrate data and subject matter
knowledge.

Relationship to the “Plan Do Check Act” cycle


The `Plan Do Check Act' (PDCA) cycle is often referred to as the Deming cycle, Deming wheel or the
Shewhart cycle.

Statistical engineering
Definition
Statistical engineering provides the needed frameworks: it provides the tactics, or the specific
plan of attack, for a given problem.

Statistical engineering is the study of how to best utilize statistical concepts, methods and tools, and
integrate them with information technology and other relevant sciences to generate improved results.

Key principles of statistical engineering


A system or strategy to guide the use of statistical tools is needed for the tools to be effective.

- We need to think carefully about the problem and our overall approach to solving it prior to
getting carried away with detailed analytics.

The impact of statistical thinking and methods can be increased by integrating several statistical tools,
and even non-statistical tools, enabling practitioners to deal with issues that cannot be addressed with
any one method.

Viewing statistical thinking and methods from an engineering perspective provides a clear focus on
problem solving.

Statistical engineering emphasizes having a plan of attack for complex problems and linking various
tools together in an overall approach to improvement.

The overall statistics system as a discipline


How statistical thinking, statistical engineering, statistical methods and tools, and so on, fit together
to form an overall system.

Fishbone diagram or Cause and effect diagram


The cause-and-effect diagram, also known as the fishbone diagram (due to its shape) or Ishikawa diagram
(after its creator, Kaoru Ishikawa), illustrates the relationship between a given outcome (output)
and all the factors (inputs) that influence it. This type of diagram displays the factors
that are thought to affect a particular outcome (output) in a system (process). The factors
(inputs) are often shown as groupings of related sub-factors that act in concert to form the overall
`effect' of the group.

Pareto Chart
The Pareto chart is a bar chart that ranks categories (items) by how important they are with respect to their frequency of occurrence. Each block represents an item; the bigger the block, the more important it is. For example, when the items are problems, the Pareto chart helps to identify which problems need immediate attention and which can be looked at later. While there are many things that cause a large problem, it is frequently found that a few specific factors (about 20%) account for the bulk of the observed occurrences of that problem (about 80%). This phenomenon was called the `Pareto principle'.

The Pareto chart highlights those few factors which result in the most problems. With the categories arranged in order of importance, it is also meaningful to plot on the chart the cumulative values arising from each category. The cumulative value for a category is the sum of the values corresponding to the particular category and the categories to the left of that category in the Pareto chart. The cumulative line is particularly useful when looking at the overall effect of making an improvement.
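As a concrete illustration, the following minimal Python sketch (using pandas and matplotlib, with made-up complaint counts) builds a Pareto chart with the cumulative line described above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts of problem categories (illustrative data only)
counts = pd.Series(
    {"Late delivery": 42, "Wrong item": 18, "Damaged": 9, "Billing error": 6, "Other": 5}
).sort_values(ascending=False)

cumulative_pct = counts.cumsum() / counts.sum() * 100   # cumulative line values

fig, ax1 = plt.subplots()
counts.plot(kind="bar", ax=ax1, color="steelblue")       # bars ranked by frequency
ax1.set_ylabel("Number of occurrences")

ax2 = ax1.twinx()                                        # secondary axis for the cumulative %
ax2.plot(range(len(counts)), cumulative_pct.values, color="firebrick", marker="o")
ax2.set_ylabel("Cumulative percentage")
ax2.set_ylim(0, 110)

plt.title("Pareto chart of problem categories")
plt.tight_layout()
plt.show()
```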

Variation in business processes
Special and common cause variation
Differences between common and special cause variation

Special cause variation:

- Temporary and unpredictable.
- Few sources, but each has a large effect.
- Often related to a specific event.
- The process is unstable.

Common cause variation:

- Always present and predictable.
- Numerous sources, but each has a small effect.
- Part of the “normal” behavior of the process.
- The process is stable.

- A process is said to be in (statistical) control if the only source of variation is common cause
variation.
- A process is said to be out of control if special cause variation is present.

These distinctions between special cause and common cause variation are very important because the
proper approach for improving processes with common causes of variation (i.e. stable processes) is
different from the approach used with special causes of variation (i.e. unstable processes).

If a process is unstable (or `out of control'), eliminating the special causes will help to stabilize it.

- Because special causes are `assignable' to a specific event and are not a fundamental part of
the process, they can often be addressed without spending a great deal of time, effort or
money.
- Special causes tend to have the most impact, so addressing them first typically will result
in dramatic improvement.

Once the process is stabilized, we can begin to study and understand it, allowing us to make
fundamental improvements.

- Eliminating special causes is really fixing problems, or bringing the process back to where it
should have been in the first place.
- True improvement comes by fundamentally changing the process.
- To do this, we must address the common causes.

If the process is stable (or `in control'), improvement generally requires careful study of the whole
process to identify and prioritize the common causes and their relationships.

Structural variation
There are some `predictably unstable' processes, i.e. some processes are technically unstable, but in
a predictable way.

- The causes of this behavior are sometimes referred to as `structural variation'

In general, eliminating structural variation requires fundamental change to the process, much like
common cause variation.

Robustness, an underused concept


Robustness is another key aspect of statistical thinking.

- Besides controlling the process by eliminating special cause variation and improving the
system by reducing common cause variation, robustness is another way to reduce variation.

Robustness means reducing the effect of uncontrollable variation in:

- Process design
- Product design
- Management practices

We want to build robust processes by anticipating variation and reducing its effects.

A robust process is insensitive to uncontrolled factors both internal and external to it. It is like installing
shock absorbers on a car. Shock absorbers enable passengers to enjoy a smooth ride despite the
uncontrollable conditions of the road. Similarly, robustness provides shock absorbers for the work
process and its output.

Process stability and capability tools


Control charts and process capability analysis are the primary quantitative statistical tools used to
evaluate stability and capability.

When evaluating stability and capability, always consider stability first.

Statistical inference, accuracy and precision


Statistical inference
It essentially means drawing conclusions about a population based on sample process
measurements.

Accuracy and precision


- A: the process is accurate and precise.
- B: the process is accurate but not precise.
- C: the process is not accurate but precise.
- D: the process is neither accurate nor precise.

Control chart to evaluate stability


Control charts are one of the techniques of so-called Statistical Process Control (SPC). The primary
purpose of control charts is to differentiate common cause variation from special cause over time.

If managers don’t use control charts, two errors are possible:

- They could fail to identify a performance change as special cause and thereby lose valuable
information about the process;

- They could treat a change as a special cause when it is, in fact, a part of `typical' process
variation; this error is called `tampering'.

A typical control chart contains

- A center line that represents the average value of the process characteristic corresponding to
the in-control state
- An Upper Control Limit (UCL)
- A Lower Control Limit (LCL)

As long as the points plot within the control limits, the process is assumed to be `in control', and no
action is necessary.

However, a point that plots outside of the control limits is interpreted as evidence that the process is
`out of control'.

- Investigation and corrective action are required to find and eliminate the assignable causes
responsible for this behavior.

Even if all the points plot inside the control limits, if they behave in a systematic or nonrandom manner,
then this is an indication that the process is `out of control'.

Rules of thumb to define an out-of-control process


For example, the process may be defined as `out of control' if any one or more of the following `rules of
thumb' are met.

Phases of control charting
Phase I
The phase I (or `set-up' phase) is a retrospective analysis of process data to construct `trial control
limits'

- To determine if the process has been in control over the period of time where the data were
collected; and
- To see if reliable control limits can be established to monitor future production.

Control charts are used primarily in phase I to assist in bringing the process into a state of statistical
control by differentiating common cause variation from special cause variation over time, evaluating
process stability and eliminating special causes.

Phase II
Phase II begins after we have a `clean' set of process data gathered under stable conditions and
representative of the in-control process performance.

- In phase II, we use control charts (with the projected phase I control limits) to monitor the
process for special cause variation.
- Special causes can be detected rapidly using these charts, and problems can be diagnosed and
eliminated.
- The control limits will not be updated unless there has been a substantial change in the
process.

Control chart characteristics


Control charts help detect special cause variation.

- The control chart itself does not eliminate special causes, it `only' detects them and provides
clues to help us understand the process.

A control chart is also of little value in improving a stable process because it will only continue to tell
us that the process is stable.

- Control charts are not designed for reducing common cause variation.
- To reduce common cause variation quite different tools and ways of thinking are required, e.g.
modify internal parts of the process using `planned' investigation.

There is a close connection between control charts and statistical hypothesis testing as the control
chart tests a hypothesis repeatedly at different points in time.

Generalities for constructing control charts


If we have a process characteristic X normally distributed with mean µ and standard deviation σ, where
both are known, and we have a sample of size n, then the sample mean is computed as

$$\bar{X} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The three-sigma control limits are defined as

$$\mu \pm 3\sigma_{\bar{X}} \quad \text{or} \quad \mu \pm 3\frac{\sigma}{\sqrt{n}}$$

Usually, these parameters are not known. Therefore, we need to estimate them. We usually do so by using
samples that are taken while we believe the process is in control.

- These estimates should usually be based on at least 20 to 25 samples, i.e. subgroups of data.
- Suppose that m samples are available, each containing n observations on the process
characteristic. Typically, n will be small, often either 4, 5 or 6.
- X1 bar, X2 bar, …, Xm bar are the averages of each of the m samples.

The best estimator of µ, the process average, is the grand average, or the mean of the sample means,
denoted X bar bar:

$$\bar{\bar{X}} = \frac{\bar{x}_1 + \bar{x}_2 + \cdots + \bar{x}_m}{m}$$
To construct the control limits, we need to estimate the standard deviation σ. We can estimate σ from
either the ranges or the standard deviations of the m samples.

The range method


If X1, X2, …, Xn is a sample of n observations, then the range of the sample is the difference between
the largest and the smallest observation:

$$R = X_{max} - X_{min}$$

If R1, R2, …, Rm are the ranges of the m samples, then

$$\bar{R} = \frac{R_1 + R_2 + \cdots + R_m}{m}$$
Then an unbiased estimator of σ is

$$\hat{\sigma} = \frac{\bar{R}}{d_2}$$

where d2 is a constant.
The standard deviation method
This method is preferable to the range method when either the sample size n is moderately large -say,
n > 10 - or the sample size n is variable. Moreover, this method uses all the data, not just the largest
and smallest observations.

We have an unbiased estimator of the variance σ² as

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2$$

It might also be useful to define S bar, where S1, S2, …, Sm are the sample standard deviations:

$$\bar{S} = \frac{S_1 + S_2 + \cdots + S_m}{m}$$
Moving Range
We can use the moving range of two successive observations, i.e. the unsigned differences between
consecutive measurements, as the basis of estimating the process variation.

The moving range is defined as

$$MR_i = |X_i - X_{i-1}|$$

and the average of the moving ranges (of which there are m − 1 for m observations) is

$$\overline{MR} = \frac{MR_2 + MR_3 + \cdots + MR_m}{m-1}$$
Types of control charts
There are many types of control charts, but they all plot data (or statistics calculated from data) over
time (or sample number) and have statistical control limits (UCL and LCL) representing the range of
common cause variation to be expected.

Chart for discrete data


Control charts for defect data
First, consider discrete data, i.e. attributes, where we count the number of defects, nonconforming
items or instances of an attribute, such as the number of customer service calls received.

C-Chart
C-Chart or `count chart': plot of the number of defects (or nonconforming items). Assumes that defects
(or nonconforming items) are rare and control limits are computed based on the Poisson distribution
(`distribution of rare events'); see `Practical 8'.

U-Chart
U-Chart or `unitized count chart': plot of the fraction (or proportion) of defects (or nonconforming
items), i.e. the number of defects (or nonconforming items) divided by the sample size.

- The u chart simply converts the data to a common scale when the opportunity for defects
varies from data point to data point.
- Assumes that defects are rare and control limits are computed based on the Poisson
distribution.
- Does not require constant number of units (unlike the c chart). Can be used, for example, when
samples are of different sizes.

Control charts for defective data


Some data that are very similar to defects data do not satisfy the statistical requirement that the
defects are rare for use of the Poisson distribution.

- These data are better treated as if they were continuous data, even though they technically
are not continuous.

Consider discrete data, i.e. attributes, that fall in only one of two categories, such as `defective' or `not
defective', or `nonconforming' or `conforming'.

p-Chart
It is a plot of the fraction or proportion of defectives, i.e. the ratio of the number of defectives to the
sample size.

- Control limits are based on the binomial distribution; see `Practical 8'.
- For defectives (binomial) data, we may have equal or unequal subgroup sizes.

To estimate p bar:

$$\bar{p} = \frac{\sum_{i=1}^{m} D_i}{mn} = \frac{\sum_{i=1}^{m} \hat{p}_i}{m}$$

And the variance of p hat is

$$var(\hat{p}) = \frac{p(1-p)}{n}$$
The p chart has the advantage that groups of data with different sample sizes can still be plotted on
the same chart.
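A minimal sketch of the p-chart calculations, using hypothetical counts of defectives: with unequal subgroup sizes the overall proportion is estimated as the total number of defectives divided by the total number of units inspected, and the three-sigma limits then vary by subgroup.

```python
import numpy as np

# Hypothetical defectives data: number of defectives D_i and sample size n_i per subgroup
defectives = np.array([4, 2, 5, 3, 6, 1, 4, 3])
sizes      = np.array([50, 60, 55, 50, 65, 40, 50, 55])

p_hat = defectives / sizes                     # per-subgroup proportion defective
p_bar = defectives.sum() / sizes.sum()         # overall estimate of p

# Three-sigma limits; with unequal subgroup sizes the limits vary by subgroup
sigma_p = np.sqrt(p_bar * (1 - p_bar) / sizes)
ucl = p_bar + 3 * sigma_p
lcl = np.clip(p_bar - 3 * sigma_p, 0, None)    # a proportion cannot be negative

for i, (p, u, l) in enumerate(zip(p_hat, ucl, lcl), start=1):
    flag = "out of control" if (p > u or p < l) else "in control"
    print(f"subgroup {i}: p_hat={p:.3f}  LCL={l:.3f}  UCL={u:.3f}  -> {flag}")
```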

np-Chart
The only difference from the p-chart is that the plotted values are not proportions but the number of
occurrences (defectives) per sample.

Control chart for continuous data


Consider variable data, i.e. variables, measured on a continuous scale, such as cycle time.

- It is necessary to monitor both the mean value of the process characteristic and its
variation!

Individuals Chart
The individual chart (plot of individual data points): used when the nature of the process is such that
it is difficult or impossible to group measurements into subgroups so that an estimate of the process
variation can be determined.

- The sample size used for process monitoring is n = 1; that is, the sample consists of an individual
unit - a single process measurement.
- The individuals chart is a generic chart, and it can be used in virtually any situation, including
with discrete data.
- The individuals chart is therefore a good default chart when you are uncertain about the most
appropriate chart to use.

The center line and the control limits are

$$UCL = \bar{X} + 2.66\,\overline{MR}$$
$$CL = \bar{X}$$
$$LCL = \bar{X} - 2.66\,\overline{MR}$$
MR-Chart
The moving range chart does not detect changes in variation, though very long lines between
consecutive points on the chart indicate instability.

$$UCL = 3.267\,\overline{MR}$$
$$CL = \overline{MR}$$
$$LCL = 0$$
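A minimal sketch of the individuals and moving range (I-MR) chart limits, applying the 2.66 and 3.267 constants above to a made-up measurement series.

```python
import numpy as np

# Hypothetical individual measurements (n = 1 per time point)
x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.4, 9.7, 10.2])

mr = np.abs(np.diff(x))            # moving ranges |x_i - x_{i-1}|
x_bar, mr_bar = x.mean(), mr.mean()

# Individuals chart limits
i_ucl = x_bar + 2.66 * mr_bar
i_lcl = x_bar - 2.66 * mr_bar

# Moving range chart limits
mr_ucl = 3.267 * mr_bar
mr_lcl = 0.0

print(f"Individuals chart: CL={x_bar:.3f}, LCL={i_lcl:.3f}, UCL={i_ucl:.3f}")
print(f"MR chart:          CL={mr_bar:.3f}, LCL={mr_lcl:.3f}, UCL={mr_ucl:.3f}")
```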
Joiner Chart
The Joiner chart (plot of individual data points) is an improvement of the standard individuals chart, being less sensitive to out-of-control situations and extreme values.

$$UCL = \bar{X} + 3.14\,med(MR)$$
$$CL = \bar{X}$$
$$LCL = \bar{X} - 3.14\,med(MR)$$

where med(MR) is the median of the moving ranges (see references).

X bar Chart
In X bar Charts the sample means are plotted in order to control the mean value of a variable.

- X bar charts monitor `between-sample variation'.


- The center line on the X bar control chart is the process average X bar bar (`grand average' or
`X double bar').

The control limits depend on the measure of `within-sample variation' used, i.e. either the ranges or the standard deviations of the m samples. Using the ranges:

$$UCL = \bar{\bar{X}} + A_2\bar{R}$$
$$CL = \bar{\bar{X}}$$
$$LCL = \bar{\bar{X}} - A_2\bar{R}$$

where A2 is a constant.

We can also compute the control limits of an X bar chart as:

$$UCL = \bar{\bar{X}} + A_3\bar{S}$$
$$CL = \bar{\bar{X}}$$
$$LCL = \bar{\bar{X}} - A_3\bar{S}$$

Typically, you use this formula when the sample size n is moderately large (n > 10) or when the sample
size n is variable.

R Chart
In this type of chart, the sample ranges are plotted, and the R chart with three-sigma control limits is
defined as:

$$UCL = D_4\bar{R}$$
$$CL = \bar{R}$$
$$LCL = D_3\bar{R}$$

where D3 and D4 are constants.
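A minimal sketch of the X bar and R chart limits for simulated subgroups of size n = 5, using the standard table constants for that subgroup size (A2 = 0.577, D3 = 0, D4 = 2.114).

```python
import numpy as np

# Simulated process data: m = 20 subgroups of size n = 5 (illustrative only)
rng = np.random.default_rng(7)
samples = rng.normal(loc=100, scale=2, size=(20, 5))

x_bars = samples.mean(axis=1)                    # subgroup means
ranges = samples.max(axis=1) - samples.min(axis=1)

x_dbar = x_bars.mean()                           # grand average (X double bar)
r_bar = ranges.mean()                            # average range

# Control chart constants for subgroup size n = 5 (from standard tables)
A2, D3, D4 = 0.577, 0.0, 2.114

xbar_limits = (x_dbar - A2 * r_bar, x_dbar, x_dbar + A2 * r_bar)   # (LCL, CL, UCL)
r_limits = (D3 * r_bar, r_bar, D4 * r_bar)

print("X-bar chart  LCL/CL/UCL:", [round(v, 3) for v in xbar_limits])
print("R chart      LCL/CL/UCL:", [round(v, 3) for v in r_limits])
```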

S Chart
In this type of chart, the sample standard deviations are plotted, and the S chart with three-sigma
control limits is defined as:

$$UCL = B_4\bar{S}$$
$$CL = \bar{S}$$
$$LCL = B_3\bar{S}$$

S charts are preferable to R charts when either the sample size n is moderately large - say, n > 10 - or
the sample size n is variable.

Note that, if you make changes to the process you have to recalculate the control limits.

Control chart constants

(Table of control chart constants - e.g. d2, A2, A3, D3, D4, B3 and B4 by sample size n - not reproduced in this extract.)

Control chart formulas

(Summary table of control chart formulas not reproduced in this extract.)
Deciding which charts to use

About this flow chart:

- Answering the question `Can we count nonoccurrence?' determines whether the data are like
`defectives' (i.e. something is either defective or not -> `Yes') or like `defects' (wherein one
item can have numerous defects -> `No');
- Answering the question `Are occurrences rare?' determines whether the data are very similar
to defects data, but do not satisfy the statistical requirement that the defects are rare for use
of the Poisson distribution.
- The term rare is intended to be interpreted at a particular location and point in time.

Multivariate process control and monitoring


Many situations require the simultaneous monitoring and control of two or more related process
characteristics. The distortion in the process monitoring procedure increases as the number of process
characteristics increases. Process monitoring problems in which several interrelated variables are of
interest are called `Multivariate Statistical Process Control' (MSPC) problems.

Case study: on-time delivery


In this case the target is 97.5% on-time delivery. The overall average has to be greater than 97.5%
in order for the center to consistently meet the goal of 97.5%.

The gap between the target and the overall average is called the off-target deviation.

Process capability analysis to evaluate capability
Capability refers to the ability of a process to meet its business requirements.

Capable means that the process variation is low enough so that, if the process average is properly set,
virtually all process measurements will meet the business requirements.

- The emphasis of statistical capability analysis is therefore primarily on the process variation,
as opposed to the process average.
- A process could be capable, but not currently performing well relative to its requirements
because the process average is off-target.

Process capability analysis is the primary statistical tool used to evaluate capability - once the process
is stabilized!

For processes that are stable, we often wish to evaluate the degree to which the current process will
satisfy business requirements, assuming the process continues to produce stable output. That is
exactly what we do with Process capability analysis.

The business requirements - the so-called specifications or targets - are externally imposed on the
process, e.g. by technical specifications, customer expectations or requirements.

- They are chosen to reflect what the customer wants

Such business requirements determine the specification limits, i.e. the `Upper Specification Limit'
(USL) and the `Lower Specification Limit' (LSL). The specification limits indicate the “acceptable” range
over which the process should perform.

A process in control is stable over time. However, there is no guarantee that a process in control
produces products (or services) of satisfactory quality, i.e. as externally imposed by specifications.

Capability has nothing to do with control - except for the very important point that if a process is not
in control, it is hard to tell if it is capable or not.

If a process that is in control, i.e. a process that is stable, does not have adequate capability,
fundamental changes in the process are needed.

- Capability relates the actual performance of a process in control, i.e. after special causes have
been removed, to the desired performance.

There is no mathematical or statistical relationship between specification limits and control limits.

Process capability ratios


Formal capability analysis, which generally means calculation of capability ratios, is performed to
quantify how well the current process meets or could meet its business requirements, i.e.
specifications.

- By plotting the data of key variables versus their specifications, we can determine graphically
whether we have a problem.
- Calculation of capability ratios supplies a number to document how well we are, or can be,
doing.
- A benefit of process capability ratios is that they make discussion of process capability more
rational and less dependent on opinion (HIPPO)

Most capability ratios are more strongly affected by non-normality than control charts are, and can
be quite inaccurate when based on small numbers of observations.

Cp ratio
The process capability ratio is defined as

$$C_p = \frac{USL - LSL}{6\sigma}$$
The range USL – LSL is also known as the specification width or interval.

- Large values of Cp indicate more capable processes.
- An estimate of Cp is obtained by replacing σ with its estimate.

Cpk ratio
This is a logical extension of Cp, because one could have a Cp greater than 1 but still have data
outside the specifications because the average is off-center.

$$C_{pk} = \min(C_{pu}, C_{pl})$$

with

$$C_{pu} = \frac{USL - \mu}{3\sigma}$$

and

$$C_{pl} = \frac{\mu - LSL}{3\sigma}$$
In order to obtain the estimate of this ratio, we replace the parameters with their estimates.
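A minimal sketch of the capability ratio estimates, plugging sample estimates of µ and σ into the formulas above; the specification limits and data are hypothetical.

```python
import numpy as np

# Hypothetical stable-process measurements and specification limits
rng = np.random.default_rng(3)
x = rng.normal(loc=10.1, scale=0.15, size=100)
LSL, USL = 9.5, 10.5

mu_hat, sigma_hat = x.mean(), x.std(ddof=1)    # point estimates of the process mean and sd

cp  = (USL - LSL) / (6 * sigma_hat)
cpu = (USL - mu_hat) / (3 * sigma_hat)
cpl = (mu_hat - LSL) / (3 * sigma_hat)
cpk = min(cpu, cpl)

print(f"Cp  = {cp:.2f}   (process spread vs. specification width)")
print(f"Cpk = {cpk:.2f}   (also penalizes an off-center average)")
```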

Relationship between Cp and Cpk

Recommended minimum values for the process capability ratio


Example. For short-term performance, a Cpk of 2.0 is the target standard for a `Six Sigma' project. Or, a Cpk of 1.33 is required of suppliers in the automotive industry. However, where a process produces a characteristic with a Cpk greater than 2.5, say, the unnecessary precision may be expensive!

The assumptions for the interpretation of process capability ratios are

- The process characteristic has a normal distribution


- The process is stable over time, i.e. it is in (statistical) control
- The process mean is centered between LSL and USL.

Confidence intervals for process capability ratios


Much of the use of process capability ratios focuses on computing and interpreting the point estimate
of the desired quantity.

- It is easy to forget that, for example, Cpk is simply a point estimate and, as such, is subject to
statistical fluctuation.
- It should become standard practice to report confidence intervals for process capability ratios.

To compute a confidence interval for Cpk we use

$$\hat{C}_{pk} \pm z_{1-\alpha/2}\sqrt{\frac{1}{9n} + \frac{\hat{C}_{pk}^2}{2(n-1)}}$$

In general, z = 1.96 when alpha = 0.05 or 2.5758 when alpha = 0.01. This works when the sample size
n is at least 25.
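A minimal sketch of this approximate confidence interval, using scipy to obtain the normal quantile z; the Cpk estimate and sample size in the example are hypothetical.

```python
from scipy.stats import norm

def cpk_confidence_interval(cpk_hat, n, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for Cpk (reasonable for n >= 25)."""
    z = norm.ppf(1 - alpha / 2)                      # e.g. 1.96 for alpha = 0.05
    half_width = z * (1 / (9 * n) + cpk_hat**2 / (2 * (n - 1))) ** 0.5
    return cpk_hat - half_width, cpk_hat + half_width

# Example: estimated Cpk of 1.33 from n = 50 observations (illustrative values)
print(cpk_confidence_interval(1.33, 50))
```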

Deming’s funnel experiment


The experiment is an excellent example of “over-controlling” a (stable) process.

- Attempting to adjust a stable process will make things worse!

The experiment clearly shows that one should not react before scientifically studying a process.

The lessons learnt (or `rules') from the experiment can be applied to many common management
practices, all of which are impediments to effective management and continuous improvement,
including:

- Adjusting a process when a part is out of specifications


- Making changes without the aid of control charts
- Changing company policy based on the latest customer attitude survey
- Relying on history passed down from generation to generation.

Overall business process improvement strategy


Prior to implementing any process improvement strategy, one should define the scope and objectives
for the improvement effort.

As the analysis of special cause variation differs from that of common cause variation, the emphasis
should be on removing special cause variation first.

The following overall process improvement strategy — an enhanced PDCA approach to improvement — shows that continuous improvement occurs in iterative, sequential steps.

If a process is stable, i.e. showing only common causes of variation.

- Use tools of improvement to study all the data (i.e. not just the ‘good’
or the ‘bad’ points) and identify factors that cause variation.
- Determine what needs to be permanently changed to achieve a
different level of quality.

If a process is not stable, i.e. showing signs of special cause.

- Use tools of problem solving to identify the potential root causes.


- Try to identify exactly when, where, how and why the process
changed.
- If a special cause hurts the process, develop procedures to eliminate
the return of the problem.
- If a special cause is beneficial, develop procedures to make it a
permanent part of the process.

Tools of the ‘Analyze common cause variation’ step:
- Stratification (see the sketch after this list): define a ‘stratification factor’ such as the day of the week, the machine or the
business unit.
o Partition the factor into logical categories.
o Compare the data for each category to highlight differences and so that patterns can
be seen;
- Disaggregation: define measures for sub processes or individual process steps.
o Study the variation in the individual sub processes.
o How does it contribute to the overall process variation?
- Regression analysis: existing process knowledge might suggest one or more variables (‘inputs’)
that influence the process measure (‘output’).
o A regression analysis might ‘verify’ this opinion or indicate that these variables have
negligible ‘effect’.
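To make the stratification idea concrete, here is a minimal pandas sketch (with a made-up cycle-time data set) that compares a process measure across the categories of a stratification factor such as the day of the week.

```python
import pandas as pd

# Hypothetical cycle-time data with a candidate stratification factor (day of week)
df = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"] * 4,
    "cycle_time": [5.1, 4.8, 5.0, 5.3, 6.9,
                   5.2, 4.9, 5.1, 5.4, 7.2,
                   5.0, 5.0, 4.9, 5.2, 6.8,
                   5.3, 4.7, 5.2, 5.5, 7.0],
})

# Compare the distribution of the process measure within each category of the factor
summary = df.groupby("day")["cycle_time"].agg(["count", "mean", "std"])
print(summary.sort_values("mean", ascending=False))   # e.g. Fridays stand out here
```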

Tools of the ‘Study cause-and-effect relationships’ step
- (statistical) Experimental design: strategically and systematically planned variation of input
factors for an actual process — also known as ‘Design Of Experiments’ (DOE); see references.
o The experimenter observes the ‘effect’ of these variations on important process
characteristics;
- Interrelationship digraph: evaluate the cause-and-effect relationships between issues to
identify which are the ‘drivers’ and which are the ‘effects’.
o This provides a way to process raw ideas that have been generated in ‘brainstorming’
(which was used to rapidly generate a diverse list of ideas or potential root causes);
- Model building: construct a (statistical) model of a process that predicts (not explains!)
process performance based on input variables
- Scatter plot: plot of a process characteristic (output) versus a potential explanatory variable
(input);
- Box plot: a box plot to depict the relationship between a discrete variable, such as the region of
a country, and the distribution of a continuous variable, such as profitability (see the sketch below).
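A minimal matplotlib/pandas sketch of the last two tools, using made-up data: a scatter plot of a process output versus a candidate input, and a box plot of a continuous measure by a discrete grouping variable.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: profitability (output) vs. discount level (input) across two regions
df = pd.DataFrame({
    "discount":      [0, 2, 4, 6, 8, 10, 1, 3, 5, 7, 9, 11],
    "profitability": [12.0, 11.2, 10.5, 9.8, 8.9, 8.1, 11.0, 10.1, 9.3, 8.6, 7.8, 7.0],
    "region":        ["North"] * 6 + ["South"] * 6,
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Scatter plot: process output versus a potential explanatory input
ax1.scatter(df["discount"], df["profitability"])
ax1.set_xlabel("Discount (%)")
ax1.set_ylabel("Profitability")

# Box plot: distribution of the continuous variable for each region
df.boxplot(column="profitability", by="region", ax=ax2)
ax2.set_xlabel("Region")

plt.tight_layout()
plt.show()
```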

Tools of the ‘Document the problem’ step


- Is/is not analysis: rigorous process for carefully documenting an issue so that ‘true’ root causes
can be identified.
o Specifically, it notes when, where and how an issue occurred, as well as when, where
and how it did not occur, but could have.

Tool of the ‘Identify potential root causes’ step


- 5 Whys: probe beneath the symptoms of a problem to get down to the root causes.
o The intent is to ask ‘Why’ a symptom has occurred, then why the cause occurred, and then why the cause of the cause occurred, and so on.
o Often, we have to dig down through as many as 5 causes to get to the root cause.
o It is basically applied common sense, but less rigorous than ‘is/is not analysis’.

Tools of the ‘Choose best solutions’ step


- Multivoting: voting multiple times.
o Multivoting is used to prioritize a large group of suggestions so that teams can follow
up on those most likely to be ‘fruitful’ first;
- Affinity diagram: approach to grouping ideas by organizing a collection of ideas into ‘natural’
groupings or categories, so that they can be more effectively addressed.
o It is particularly valuable after a brainstorming exercise that has produced a large
number of ideas.
o The ‘best’ method for handling this is to write ideas, for example, on individual Post-It
notes that can be moved easily from spot to spot.

Process control charts and continuous improvement


- The emphasis which must be placed on improvement has important implications for the way
in which process control charts are applied.

o The process of continuous improvement should be charted over time and
adjustments made to the control charts in use to reflect the improvements made (in
reducing variation).

Tools to explore observational business data


What are observational studies
Observational studies are conducted on existing data (i.e. observational, ‘secondary’ or ‘found’ data)
that have been collected for reasons other than to conduct (statistical) data analyses, e.g. to meet
financial, legal or management reporting requirements.

- So-called data mining (rebranded nowadays as data science) usually involves analyses of
observational data to find ‘novel’ patterns or relationships, i.e. to create or generate new
ideas.

Exploratory analysis of observational data may

- provide insights into business processes and product or service characteristics;


- provide a good overview of current status and the magnitude of an issue;
- suggest potential driving factors of a ‘good’ or ‘bad’ outcome to explore further

Well-conducted observational studies often suggest patterns, relationships or provide clues that -
when followed up by more rigorous investigations - can lead to important findings to help improve a
business process.

Observational studies have particular applicability at the starting point of an investigation, as well as
in situations for which more disciplined studies are impractical - or even unethical - and for which the
main goal is prediction rather than gaining understanding (explanation).

Observational data have an important role in pointing the way forward, but they should not be a
primary ingredient of making data-driven decisions!

Three key issues to consider when exploring observational data include:

- The relevance of the sample to the current study;


o Are there differences in time, product, service or process that may change patterns or
relationships?
- The quality of the data;
o How were the ‘units of the analysis’ selected?
o Have all potentially relevant process inputs, i.e. the drivers (factors) influencing a given
KPI, been gathered?
o Do you understand the processes and system that generated the data?
o Were operational definitions (for process inputs and outputs, i.e. the KPIs) used to
collect the data?
- How can the data be used to advance the study without drawing conclusions that are not
justifiable, given the lack of established causality?

Affinity grouping association rules


also known as ‘dependency modelling’, ‘cooccurrence grouping’ or ‘association discovery’: expresses
how products or services relate to each other and how they tend to group together.

- Basket analysis gives insights into the merchandise by telling a business which products tend
to be purchased together and which are most amenable to promotion.

- This information is actionable
o It can suggest new store layouts
o It can determine which products to put on special
o It can indicate when to issue coupons and so on.

Although the roots of association rules are in analyzing point-of-sale transactions, association rules can
be applied outside the retail industry.

- Whenever a customer purchases multiple products at the same time or does multiple things in
close proximity, or whenever one wants to discover sequences of events that commonly occur
together, there is a potential application.
- Some examples include:
o items purchased on a credit card, such as rental cars and hotel rooms, give insights
into the next product that customers are likely to purchase;
o optional services purchased by telecommunications customers (e.g. call waiting, call
forwarding, ISDN, speed call, UMTS) help determine how to bundle these services
together to maximize revenue;
o banking services used by retail customers (e.g. money market accounts, investment
services, car loans) identify customers likely to want other services;
o unusual combinations of insurance claims can be a sign of fraud and can spark further
investigation;
o hidden relationships between financial transactions based on their cooccurrences can
be a sign of money laundering;
o medical patient histories can give indications of complications based on certain
combinations of treatments.

Often, basket analysis is used as a starting point when transaction data are available, and a business
does not know what specific patterns to look for.

- Interesting patterns often suggest some profitable course of action

Cooccurrences, Association rules, Sequential patterns


- Cooccurrences, e.g. 80% of all customers purchase items X, Y and Z together;
- Association rules, e.g. 60% of all customers who purchase X and Y also buy Z;
- Sequential patterns, e.g. 60% of customers who first buy X also purchase Y within three weeks
(‘sequential pattern analysis or mining’): detecting associations over time.

Association rule mining


(or ‘frequent pattern mining’) discovers associations between items.

- Note that association rule mining does not consider the order of transactions (unlike ‘sequential
pattern analysis or mining’).

Support of a rule
It is defined as:

$$X \Rightarrow Y \text{ has support } s \text{ if } P(X \text{ and } Y) = s$$

Support denotes the proportion of transactions in the data set which contain the itemset (i.e. the
cooccurrence)

A high value means that the rule covers a large part of the database.

Confidence of a rule
It is defined as
$$X \Rightarrow Y \text{ has confidence } c \text{ if } P(Y \mid X) = \frac{\text{support of } X \text{ and } Y}{\text{support of } X} = c$$
Confidence denotes the proportion of transactions containing X which also contain Y.

It is an estimation of the conditional probability P(𝑌|X).
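The two measures above can be computed directly by counting baskets. The following is a minimal sketch (not taken from the course material); the basket data and function names are purely illustrative.

```python
# Hypothetical point-of-sale transactions (each set is one basket)
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs", "butter"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Estimated P(rhs | lhs) = support(lhs and rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"milk", "bread"}, transactions))       # support of {milk, bread} -> 0.6
print(confidence({"milk"}, {"bread"}, transactions))   # confidence of milk => bread -> 0.75
```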

Frequent items or itemsets


If we only want association rules with a minimal support smin, then smin acts as a threshold on the support.

Items or itemsets with support ≥ smin are called frequent items or itemsets.

A priori principle
It states that any subset of a frequent itemset must itself be frequent. Hence it is sufficient to mine only
the maximal frequent itemsets.

The ‘a priori’ principle is very powerful and can greatly reduce the search space.

Thus, by applying the ‘a priori’ principle, many itemsets can be pruned when exploring the
search space.

In practice, however, we do not only fix a minimal support smin but also a minimal confidence cmin, say.

- The rules hold, i.e. are valid, if their support is ≥ smin and their confidence is ≥ cmin.

Basic a priori algorithm


The following approach is known as the ‘basic a priori algorithm’:

- If smin is high, then we get few frequent itemsets and few valid rules, which occur very often.
- If smin is low, then we get many valid rules which occur rarely.
- If cmin is high we get few rules, but all are ‘almost logically true’.
- If cmin is low we get many rules, but many of them are very ‘uncertain’.

In practice, typical values are smin = 2 − 10% (rarely occurring rules) and cmin = 70 − 90% (‘certain’ rules).

Several optimizations for the basic a priori algorithm exist, as well as problem extensions, e.g. using
interestingness measures beyond support and confidence.
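To make the level-wise search concrete, here is a minimal sketch of frequent-itemset mining with a priori pruning (an illustration added here, not the course's implementation); the baskets and the value of s_min are hypothetical.

```python
from itertools import combinations

# Hypothetical baskets; in a real study these would come from transaction logs
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs", "butter"},
]
s_min = 0.4  # minimal support threshold

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level-wise search: only extend itemsets that are themselves frequent
# (the 'a priori' principle prunes every superset of an infrequent itemset).
items = sorted({i for t in transactions for i in t})
frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= s_min}
all_frequent = set(frequent)
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # keep only candidates whose (k-1)-subsets are all frequent, then check support
    candidates = {c for c in candidates
                  if all(frozenset(sub) in all_frequent for sub in combinations(c, k - 1))}
    frequent = {c for c in candidates if support(c) >= s_min}
    all_frequent |= frequent
    k += 1

for itemset in sorted(all_frequent, key=len):
    print(sorted(itemset), round(support(itemset), 2))
```

With these toy baskets, ‘butter’ is pruned immediately (support 0.2 < s_min), so no itemset containing it is ever generated, which is exactly the saving the a priori principle promises.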

Lift
Lift is defined by the combination of the support and confidence:
$$X \Rightarrow Y \text{ has lift } l \text{ if } \frac{\text{confidence of } X \Rightarrow Y}{\text{support of } Y} = \frac{\text{support of } X \text{ and } Y}{\text{support of } X \cdot \text{support of } Y} = l$$

It is an estimation of the association measure

$$\frac{P(Y \mid X)}{P(Y)} = \frac{P(X \text{ and } Y)}{P(X) \cdot P(Y)}$$
Greater lift values indicate stronger associations.

Leverage
It is an alternative to lift: it takes the difference of the quantities defining lift rather than their ratio.

$$\text{support of } X \text{ and } Y - \text{support of } X \cdot \text{support of } Y$$


Correlation Value

It is defined as
$$\rho = \frac{\text{support of } X \Rightarrow Y}{\sqrt{\text{support of } X \cdot \text{support of } Y}}$$

In practice, in addition to fixing a minimal support smin and a minimal confidence cmin, we could also fix a
minimal ‘correlation’ value ρmin and retain only the rules whose values exceed these thresholds.
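All of the rule-quality measures defined above follow from support counts, so they are easy to compute together. The sketch below is an illustration added here (not from the course); the basket data and the `rule_metrics` helper are hypothetical.

```python
# Hypothetical baskets, reused to illustrate the rule-quality measures above
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs", "butter"},
]

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def rule_metrics(lhs, rhs):
    """Support, confidence, lift, leverage and 'correlation' value of lhs => rhs."""
    s_xy = support(set(lhs) | set(rhs))
    s_x, s_y = support(lhs), support(rhs)
    return {
        "support": s_xy,
        "confidence": s_xy / s_x,
        "lift": s_xy / (s_x * s_y),
        "leverage": s_xy - s_x * s_y,
        "correlation": s_xy / (s_x * s_y) ** 0.5,
    }

print(rule_metrics({"milk"}, {"bread"}))
# e.g. lift below 1 (here 0.94) means milk and bread co-occur slightly less
# often than independence would predict.
```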

Recommendation engines or systems


They are developed to find patterns in customer preferences, search for similar patterns in other
people and make relevant suggestions that will encourage customers to spend more on the company’s
products or services.

‘Offline recommendation engines or systems’: think of the people around you as recommendation
engines.

- These ‘offline recommenders’ know something about you.


- They know your style or area of study, and thus can make more informed decisions about what
recommendations would benefit you most.

‘Online recommendation engines or systems’ aim to emulate this personalization.

Building a recommendation system


Building a recommendation system — a recommender — can be broken down into three main steps:

Collect preferences
It consists of building user profiles. Different approaches are generally used to measure users’ tastes and interests:

- Personal preferences or ‘tell the business what you like’; Example. The book you just read
changed your life, give it a five-star review on Amazon. Or, you liked that article and shared it
on Twitter.
- Collaborative Filtering (CF) or ‘people like you tell the business what you may like’.
Example: Businesses know what your preferences are, find people with ‘similar’
tastes, look at what those people have purchased, and recommend the items they liked
that you have not purchased yet.

Example: Patterns found with association rule mining could be used when recommending new
products or services to others based on what others have bought before (or based on which
products or services are bought together).

o Association rule mining is a particular type of CF with respect to ‘item-item similarity’,


i.e. ‘item-based CF’, not ‘user-based CF’.
o However, association rule mining is less personalized than CF.
o For example, if two users have both milk and eggs in their baskets, standard
association rule mining will suggest to them the same items, regardless of what they
bought in the past.

Note that CF does not need an ‘understanding’ of the item, e.g. of the book or the movie, itself.

- Content based filtering methods are based on a description, i.e. an ‘understanding’, of the
item and a profile of the user’s preferences.
o In a content based filtering recommendation system, keywords are used to describe
the items; in addition, a user profile is built to indicate the type of items this user likes.
o These methods recommend items that are ‘similar’ to those that a user liked in the
past (or is examining in the present).
- Hybrid recommendation systems: a hybrid approach, combining CF and content based
filtering could be more effective in some cases;
Example. In one possible hybrid implementation, the recommendations are generated
separately and then combined.

Example. Netflix is a good example of a hybrid system. They make movie recommendations by
comparing the watching and searching habits of similar users (i.e. CF) as well as by offering
movies that share characteristics with movies that a user has rated highly (i.e. content based
filtering).
- Collective intelligence or ‘it is the consensus that tells the business what you like’;
Example. If 99% of the people who have seen this new movie thought it was terrible, it is
unlikely that the recommendation system recommends it to you.
- Discovery: the recommendation system experiments with presenting you new things, based
on your history, and you tell it whether you like them or not.
o In addition to getting to know you better, this stimulates novelty and creates surprise.

People are unique and there is no single approach to recommendation that will work for everyone.
The challenge is to find the metrics that are relevant and to combine them in a clever way to make the
recommendation as personal as possible.

Find similarities
This step finds people that are similar to you. Similarity indicates the strength of the relationship between
two users, but it is often hard to define; indeed, ‘similarity is subjective’:

- An almost endless number of similarity measures exists.

- Different measures of similarity calculated from the same set of users (or items) can, and often
will, lead to different solutions.
- Associated with similarity is dissimilarity = 1 − similarity, e.g. two users (or items) are ‘close’
when their dissimilarity is small or their similarity large.
- The term distance is often used informally to refer to a dissimilarity measure derived from the
characteristics describing the users (or items).
- Experiment with the simplest measures first, since this is likely to ease the possibly difficult
task of interpreting the results.

The Manhattan Distance


The so-called ‘Manhattan Distance’ is one of the simplest ways to measure the distance between two
data points:
$$d(x, y) = \sum_{k=1}^{n} |x_k - y_k|$$

The Euclidean Distance


The so-called Euclidean Distance, i.e. instead of going around ‘blocks’ draw a straight line between the
two data points and measure the length of this line using the Pythagorean theorem

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$

Pearson Correlation Score


The Pearson Correlation Score is given by
$$\text{Pearson}(x, y) = \frac{\sum xy - \frac{\sum x \sum y}{N}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{N}\right)\left(\sum y^2 - \frac{(\sum y)^2}{N}\right)}}$$
Cosine Distance
The so-called Cosine Distance, i.e. measure the distance between two points by calculating the angle
from the origin (θ)
$$\cos(\theta) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
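The four measures above translate directly into code. The sketch below is an illustration added here (not from the course), assuming the users are represented as numeric rating vectors over the same items; the rating values are hypothetical.

```python
import math

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)
    denom = math.sqrt((sx2 - sx ** 2 / n) * (sy2 - sy ** 2 / n))
    return (sxy - sx * sy / n) / denom if denom else 0.0

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Hypothetical ratings of the same five items by two users
alice = [5, 3, 4, 4, 2]
bob   = [4, 3, 5, 3, 1]
print(manhattan(alice, bob), euclidean(alice, bob), pearson(alice, bob), cosine(alice, bob))
```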

Make Recommendations
To recommend similar items (or users), just find the ones that are the most similar to the item (or
user) you like, i.e. the ones with the lowest distance or the highest ‘correlation’ score.
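Putting the three steps together, a user-based collaborative-filtering recommender can be sketched in a few lines. This is an illustration added here (not the course's system); the ratings, user names and the similarity choice (Euclidean distance mapped to a (0, 1] similarity) are hypothetical.

```python
import math

# Hypothetical ratings: user -> {item: rating}; 'you' is the target user
ratings = {
    "you":   {"A": 5, "B": 3, "C": 4},
    "user1": {"A": 5, "B": 3, "C": 4, "D": 4},
    "user2": {"A": 1, "B": 5, "C": 2, "D": 1, "E": 5},
}

def euclidean_similarity(u, v):
    """Similarity on co-rated items, mapped to (0, 1]: 1 / (1 + distance)."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dist = math.sqrt(sum((ratings[u][i] - ratings[v][i]) ** 2 for i in shared))
    return 1 / (1 + dist)

# Find the most similar other user and suggest items the target has not seen yet
others = [u for u in ratings if u != "you"]
best = max(others, key=lambda u: euclidean_similarity("you", u))
suggestions = {i: r for i, r in ratings[best].items() if i not in ratings["you"]}
print(best, suggestions)   # here: user1 {'D': 4}
```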

Useful formulas
Variance
$$V(X) = E(X^2) - [E(X)]^2$$
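As an illustrative check of this shortcut formula (an example added here, not from the course), take a fair six-sided die:
$$E(X) = 3.5, \qquad E(X^2) = \frac{1 + 4 + 9 + 16 + 25 + 36}{6} = \frac{91}{6}, \qquad V(X) = \frac{91}{6} - 3.5^2 = \frac{35}{12} \approx 2.92$$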

Sample Variance
$$S^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$
Comparing 2 groups – test of proportions
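The formula for this test appears to be missing from the notes (it was probably an image). A standard form, offered here only as a hedged reconstruction since the course's exact notation is unknown, is the pooled two-proportion z-test:
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$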
