Chapter 5 - Data Mining
CHAPTER 5
DATA MINING
Learning Objectives
28/08/1445
Opening Vignette...
1. Why should retailers, especially omni-channel retailers, pay extra attention to advanced
analytics and data mining?
2. What are the top challenges for multi-channel retailers? Can you think of other industry
segments that face similar problems/challenges?
3. What are the sources of data that retailers such as Cabela’s use for their data mining
projects?
4. What does it mean to have a "single view of the customer"? How can it be
accomplished?
▪ Data Mining: The nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data stored in structured databases, where the data
are organized in records structured by categorical, ordinal, and continuous variables
(Fayyad et al., 1996).
▪ Potentially useful means that the discovered patterns should lead to some benefit to
the user or task.
▪ Ultimately understandable means that the pattern should make business sense, leading
the user to say "Hmm, that makes sense; why didn't I think of that?", if not
immediately, then at least after some post-processing.
■ Time-series forecasting:
■ Part of sequence or link analysis?
■ Visualization:
■ Another data mining task?
■ Types of DM:
■ Hypothesis-driven data mining.
■ Discovery-driven data mining.
■ Insurance:
■ Forecast claim costs for better business planning.
■ Determine optimal rate plans.
■ Optimize marketing to specific customers.
■ Identify and prevent fraudulent claim activities.
■ Healthcare:
■ Identify people without health insurance and the factors underlying this undesired
phenomenon.
■ Identify novel cost–benefit relationships between different treatments to develop
more effective strategies.
■ Forecast the level and the time of demand at different service locations to optimally
allocate organizational resources.
■ Understand the underlying reasons for customer and employee attrition.
■ Medicine:
■ Identify novel patterns to improve survivability of patients with cancer.
■ Predict success rates of organ transplantation patients to develop better
donor-organ matching policies.
■ Identify the functions of different genes in the human chromosome (known as
genomics).
■ Discover the relationships between symptoms and illnesses (as well as illnesses
and successful treatments) to help medical professionals make informed and
correct decisions in a timely manner.
■ Entertainment industry:
■ Analyze viewer data to decide what programs to show during prime time and how to
maximize returns by knowing where to insert advertisements.
■ Predict the financial success of movies before they are produced to make investment
decisions and to optimize the returns.
■ Forecast the demand at different locations and different times to better schedule
entertainment events and to optimally allocate resources.
■ Develop optimal pricing policies to maximize revenues.
The process is highly repetitive and experimental (DM: art versus science?)
■ Predictive accuracy.
■ Hit rate.
■ Speed.
■ Model building; predicting.
■ Robustness.
■ Scalability.
■ Interpretability.
■ Transparency, explainability.
True Positive Rate = $\frac{TP}{TP + FN}$: the ratio of correctly classified positives to the total positive count (i.e., hit rate or recall)
True Negative Rate = $\frac{TN}{TN + FP}$: the ratio of correctly classified negatives to the total negative count (i.e., specificity; the false alarm rate is its complement, $\frac{FP}{TN + FP}$)
Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$: the ratio of correctly classified instances (positives and negatives) to the total number of instances
Precision = $\frac{TP}{TP + FP}$: the ratio of correctly classified positives to the sum of correctly classified positives and incorrectly classified positives (false positives)
Recall = $\frac{TP}{TP + FN}$: the ratio of correctly classified positives to the sum of correctly classified positives and incorrectly classified negatives (false negatives)
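The ratios above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch, not part of the original slides: the function name `classification_metrics` and the example counts are made up for illustration, and TP, FN, FP, and TN are assumed to mean true positives, false negatives, false positives, and true negatives.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the listed metrics from confusion-matrix counts (illustrative helper)."""

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class is absent from the data.
        return numerator / denominator if denominator else float("nan")

    total = tp + tn + fp + fn
    return {
        "true_positive_rate": ratio(tp, tp + fn),  # hit rate / recall
        "true_negative_rate": ratio(tn, tn + fp),  # specificity
        "accuracy": ratio(tp + tn, total),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),              # same formula as the true positive rate
    }


# Hypothetical counts: 50 actual positives and 50 actual negatives.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that recall and the true positive rate are the same quantity, so the two entries always agree.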
Classification Techniques
Decision Trees
▪ Analysis methods:
▪ Statistical methods (including both hierarchical and nonhierarchical), such as
k-means, k-modes, and so on
▪ Neural networks (adaptive resonance theory [ART], self-organizing map
[SOM])
▪ Fuzzy logic (e.g., fuzzy c-means algorithm)
▪ Genetic algorithms
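As an illustration of one of the statistical methods listed above, k-means can be sketched in a few lines. This is a minimal, assumption-laden version (Lloyd's algorithm with caller-supplied starting centroids and Euclidean distance); the function name `kmeans`, the sample points, and the starting centroids are hypothetical and not taken from the slides.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate an assignment step and a centroid update step."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 2-D data with two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centers, labels = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)
```

In practice the result depends on the starting centroids, which is why library implementations usually run several random initializations and keep the best one.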
➢ Apriori Algorithm
■ Finds subsets that are common to at least a minimum number of the itemsets.
■ Uses a bottom-up approach.
■ Frequent subsets are extended one item at a time (the size of frequent subsets
increases from one-item subsets to two-item subsets, then three-item subsets,
and so on).
■ Groups of candidates at each level are tested against the data for
minimum support.
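The bottom-up procedure described above can be sketched as follows. This is a simplified illustration rather than a production implementation: the function name `apriori`, the use of an absolute count for `min_support`, and the sample baskets are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        # Test each candidate against the data and keep those with enough support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent one-item subsets.
    items = {item for t in transactions for item in t}
    current = count_frequent({frozenset([i]) for i in items})
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Extend frequent (k-1)-item subsets by one item to form k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that contain an infrequent (k-1)-item subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = count_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market baskets; min_support is an absolute transaction count.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
for itemset, count in sorted(apriori(baskets, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning step relies on the fact that an itemset cannot be frequent unless all of its subsets are, which is what keeps the number of candidates tested at each level manageable.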
▪ Commercial
▪ IBM SPSS Modeler (formerly Clementine)
▪ SAS - Enterprise Miner
▪ IBM - Intelligent Miner
▪ StatSoft - Statistica Data Miner
▪ ... and many more
▪ Free and/or Open Source
▪ R
▪ RapidMiner
▪ Weka...
▪ Data Mining Myths: data mining...
■ Provides instant solutions/predictions
■ Is not yet viable for business applications
■ Requires a separate, dedicated database
■ Can only be done by those with advanced degrees
■ Is only for large firms that have lots of customer data
■ Is another name for good old statistics