W1 Lecture1 May8 2023
Data Mining in Today’s Environment
2 Main Approaches
• Advanced
• Non-Advanced
Data Mining in Today’s Environment
• AI is not new
• Why the popularity of AI?
• How does AI fit in?
• Is it a part of data mining?
• What is more important?
• Data vs. Math
Data Mining in Today’s Environment
• ChatGPT
• Massive amounts of data can be analyzed
– Conduct massive research
– Write essays, articles, books, and exams
– Create new art/paintings
– Create programming code
– Can create almost anything as long as there is a base or
source of data to work with.
• Where will the human fit in?
Where is AI versus the brain?
• Time saving and better
• But is it enough?
• What can the human do that AI cannot?
Align Marketing Investment with Customer Potential
[Figure: marketing investment ($/customer, low to high) plotted against customer value / potential (low to high)]
The Big Picture…
• Effective customer segmentation in the analysis phase
drives program planning, execution of communications,
and program measurement
[Figure: process flow across four phases: Analyse, Plan, Interact, Measure. Box labels include Customer Valuation, Segment Creation, Customer Value Strategies, Customer Knowledge Management, Proposition Development, Targeted Communications, Customer Contact Management, and Measurement Systems]
Applying Predictive Analytics
[Figure: profit and cost plotted against time, annotated "Data Mining and Analytics"]
Four Stages of Data Mining
The Data Mining Process:
Problem Identification Stage
1) Problem Identification
• Identify overall business strategy
• Identification and prioritization of business strategy components
which can be resolved through predictive analytics
• Provide information regarding current data environment
– Assume data mining can bring 10% improvement in performance for all
campaigns
• What is the potential data mining impact here?
• You have the following decile table and are asked to assess the $
opportunity of deploying a data mining strategy against the top
40% of the customer base. The data mining strategy is about
targeting the right customers in order to reduce risk in the most
effective manner. Cost per effort is $5.00 and the average credit risk
rate is 2.5%.
Deciles     Credit Risk   # of Customers
0-10%       6%            10000
10%-20%     5%            10000
20%-30%     3%            10000
30%-40%     2%            10000
….
90%-100%    0.50%         10000
Problem Identification
• The $ opportunity is
Without Data Mining    64,000    2.50%    1,600    $320,000
$ Opportunity                                      $120,000
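The arithmetic behind these figures can be sketched as follows. This is one possible reading of the slide, not necessarily the lecturer's exact method: targeting the top 40% reaches 40,000 customers with an average risk rate of 4%, which finds 1,600 risky customers; without targeting, finding the same 1,600 at the 2.5% average rate takes 64,000 efforts. The "with data mining" figures and the function name `dollar_opportunity` are derived for illustration and do not appear on the slide.

```python
# Top four deciles from the slide: (credit risk rate, number of customers).
TOP_DECILES = [(0.06, 10_000), (0.05, 10_000), (0.03, 10_000), (0.02, 10_000)]
COST_PER_EFFORT = 5.00      # $ per customer contacted (from the slide)
AVERAGE_RISK_RATE = 0.025   # overall credit risk rate (from the slide)


def dollar_opportunity(deciles, cost_per_effort, average_rate):
    """Compare the cost of finding the same number of risky customers
    with and without targeting (an assumed reading of the slide)."""
    targeted = sum(n for _, n in deciles)
    risky_found = sum(rate * n for rate, n in deciles)
    cost_with = targeted * cost_per_effort
    # Untargeted, risky customers turn up only at the average rate.
    untargeted = risky_found / average_rate
    cost_without = untargeted * cost_per_effort
    return {
        "risky_found": round(risky_found),
        "customers_with": targeted,
        "customers_without": round(untargeted),
        "cost_with": round(cost_with, 2),
        "cost_without": round(cost_without, 2),
        "opportunity": round(cost_without - cost_with, 2),
    }


result = dollar_opportunity(TOP_DECILES, COST_PER_EFFORT, AVERAGE_RISK_RATE)
print(result)
```

Under these assumptions the "without data mining" row and the $120,000 opportunity on the slide are reproduced; a different business definition of "opportunity" would need a different calculation.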
Identifying Data Mining Opportunities
• Explore the organization's key business challenges
• Determine if improved customer/prospect targeting or segmentation
would improve results
• Review the following questions:
Example 2: Identifying Data Opportunities
• Company B has 1,000,000 customers and has been cross-selling a
long distance phone plan for over 2 years
• Over the last 6 months acquisition results have declined by 20% and
the cost per new plan member has increased beyond target levels
(30% increase)
The index for M5A 1J2 is (.33 x 1.25) + (.33 x 1.15) + (.33 x 2) = 1.45
Example 3: Identifying Data Opportunities
• This index scheme can then be used to score each postal
code
• The 800,000 postal codes in Canada are then ranked into
20 half deciles based on descending index score
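A minimal sketch of this scoring and ranking step is shown below. The three component values (1.25, 1.15, 2) and the .33 weights come from the M5A 1J2 example; what each component measures is not shown in this section. The function names `weighted_index` and `half_deciles`, and the 40 placeholder postal codes with made-up scores, are assumptions for illustration only.

```python
def weighted_index(components, weights=(0.33, 0.33, 0.33)):
    """Weighted sum of component indexes (weights taken from the slide)."""
    return sum(w * c for w, c in zip(weights, components))


def half_deciles(scores, groups=20):
    """Map each key to a group 1..groups by descending score (1 = highest)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {key: (pos * groups) // len(ranked) + 1
            for pos, key in enumerate(ranked)}


# Index for the slide's example postal code, rounded to two decimals.
m5a_index = round(weighted_index((1.25, 1.15, 2.0)), 2)
print(m5a_index)

# Made-up scores for 40 placeholder postal codes, for illustration only.
demo_scores = {f"CODE{i:02d}": 0.5 + 0.05 * i for i in range(40)}
groups = half_deciles(demo_scores)
print(groups["CODE39"], groups["CODE00"])
```

With real data, `scores` would hold one index per postal code, and group 1 would contain the highest-scoring 5% of postal codes.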
• Attaining even 70% of the responders will not meet the campaign
expectations
• Targeting models have been built and are performing very well. Overall profit is
still declining. What could be causing this?
• A PhD in Mathematics was hired to head up the Data Science Department. His
team quietly developed a number of very sophisticated tools under his sole
direction without input from other areas. None of the developed tools were ever
used and within six months, he left the organization. What might have contributed
to this situation?
• You are asked to build a customer response model. What would be your first three
questions in undertaking this project?
The key in all these examples is asking the right question
• What is a good question?
– Creates need for further questions
– Identifies other options in identifying problem
– Digs deep into the situation
– Avoids “whys”
– Avoids short yes or no answers
– Builds confidence between the analyst and the business
stakeholder
– Can involve multiple stakeholders in obtaining the right answer
Question types
• Focus
– What concerns do you have?
– What do you think about…
• Feeling
– How have you been affected?
– What is your perception?
• Observation
– What do you see/hear/smell?
• Analysis
– What has been done in the past and what are the results?
– What is your interpretation of results and insights?
– Have you had the requisite data to support your analysis?
– What is the key overall learning from your historical analysis?
Statistics Review: Mean
• Definition: the sum of all the values in a sample divided by the
number of values in the sample
• It is also referred to as the arithmetic mean, the average or the
arithmetic average
\[ \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} \]
• Example: Average monthly credit card spend for 10 customers:
$2,095 / 10 = $209.50
Customer   Monthly Spend
1          $150
2          $125
3          $175
4          $100
5          $75
6          $110
7          $90
8          $140
9          $130
10         $1,000
Total      $2,095
Average    $209.50
Statistics Review: Median
• Definition: the value above which half the values lie and below
which the other half lie; it is the balancing point of the distribution
• In our example, we have an even number of sample points, hence the
median is the mean of the middle two points:
($125 + $130) / 2 = $127.50
• Notice we have rank ordered our sample to get the median
• If we had 11 points, the median would be #6
Customer   Monthly Spend
5          $75
7          $90
4          $100
6          $110
2          $125
9          $130
8          $140
1          $150
3          $175
10         $1,000
Sum        $2,095
Average    $209.50
Statistics Review
• Why do we need to look at median in some cases?
– Looking at mean can give misleading results if there are outlier values
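A short sketch of the outlier effect, using the ten monthly spend values from the example and Python's standard `statistics` module. The variable names are illustrative; dropping the $1,000 customer is only to show how far a single value pulls the mean.

```python
from statistics import mean, median

# Monthly credit card spend for the 10 customers in the example.
spend = [150, 125, 175, 100, 75, 110, 90, 140, 130, 1000]

full_mean = mean(spend)
full_median = median(spend)
print(full_mean, full_median)

# Remove the single $1,000 outlier and recompute both measures.
without_outlier = [x for x in spend if x != 1000]
trimmed_mean = round(mean(without_outlier), 2)
trimmed_median = median(without_outlier)
print(trimmed_mean, trimmed_median)
```

The mean moves from $209.50 to about $121.67 when the outlier is removed, while the median only moves from $127.50 to $125, which is why the median is often the safer summary for skewed spend data.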
Statistics Review: Mode
• Definition: the value within a distribution which occurs most
frequently
• Within our sample data, there is no mode
• That is, there is no value which is repeated; all the values are unique
• A histogram is the graphical representation of a frequency
distribution
• The vertical axis represents the frequency (or count), while the
horizontal axis represents the class or actual occurrences
within a distribution
Rank       Monthly Spend
1          $75
2          $90
3          $100
4          $110
5          $125
6          $130
7          $140
8          $150
9          $175
10         $1,000
Sum        $2,095
Average    $209.50
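The mode and the counts behind a histogram can be sketched with `collections.Counter`. The helper names `modes` and `histogram`, and the $50 class width, are assumptions chosen for illustration; the slide does not specify classes.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


def histogram(values, class_width):
    """Count values per class; each key is the lower bound of a class."""
    counts = Counter((v // class_width) * class_width for v in values)
    return dict(sorted(counts.items()))


spend = [75, 90, 100, 110, 125, 130, 140, 150, 175, 1000]
spend_modes = modes(spend)              # no repeated value in the sample
spend_hist = histogram(spend, 50)       # frequency per $50 class
print(spend_modes)
print(spend_hist)
```

The keys of `spend_hist` correspond to the horizontal axis (classes) and the values to the vertical axis (frequency).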
Basic Distributional Theory
• Distributional Theory is the foundation for much of advanced
statistics
• Consider the following three distributions:
[Figure: example curves labelled "Symmetrical distribution" and "Asymmetrical distribution"]
Symmetrical Distribution
• For a symmetrical distribution, the mean, median and mode are
equal
• The normal distribution is a symmetrical distribution with special
properties
[Figure: symmetrical curve with Mean, Median and Mode marked at the same point]
Notice how these align!
Asymmetrical Distribution
• For an asymmetrical distribution, the mean, median and mode are
NOT equal
• We generally refer to these distributions as skewed distributions
Central Tendency
• With the foundation of basic distributional theory we can begin to
consider the central tendency of our distribution, how values spread
around it, and the Central Limit Theorem
• Range
• Standard Deviation
• Skewness
Range
• Definition: the difference between the largest and
smallest values of a data set
– Example: Average monthly credit card spend for 10 customers
range = $1,000 - $75
range = $925
Rank       Monthly Spend
1          $75
2          $90
3          $100
4          $110
5          $125
6          $130
7          $140
8          $150
9          $175
10         $1,000
Standard Deviation
• Definition: a measure of the amount by which the values in a
sample differ from their mean
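• It is the square root of the variance
– The variance is also referred to as the second moment about the mean
\[ s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} \]
where \bar{x} is the sample mean and N is the number of values in the sample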
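The sample standard deviation formula, with N - 1 in the denominator, can be written out step by step as below and cross-checked against `statistics.stdev`, which uses the same N - 1 convention. The spend values are the ten from the earlier example; `sample_std` is an illustrative name. The range is included for comparison.

```python
import math
from statistics import stdev


def sample_std(values):
    """Square root of the sum of squared deviations divided by N - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = sum((x - x_bar) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


spend = [75, 90, 100, 110, 125, 130, 140, 150, 175, 1000]

spend_range = max(spend) - min(spend)   # largest minus smallest value
manual = sample_std(spend)
library = stdev(spend)                  # standard-library cross-check
print(spend_range, manual, library)
```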
Standard Deviation
• For a binomial distribution, such as response, we must use a
different formula
\[ \text{St. Dev.} = \sqrt{\frac{p \cdot q}{N}} \]
Response   Outcome
1          Responder
0          Non-Responder
0          Non-Responder
1          Responder
0          Non-Responder
1          Responder
0          Non-Responder
0          Non-Responder
0          Non-Responder
0          Non-Responder
Mean       0.300
St. Dev.   0.145
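The slide's figures can be checked with a few lines of Python, assuming p is the share of responders, q = 1 - p, and N is the number of records. Note that the quantity sqrt(p * q / N) is commonly called the standard error of a proportion; the slide labels it "St. Dev.". The function name `proportion_std` is illustrative.

```python
import math

# The ten response flags from the slide (1 = responder, 0 = non-responder).
responses = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]


def proportion_std(flags):
    """Return (p, sqrt(p * q / N)) for a list of 0/1 flags."""
    n = len(flags)
    p = sum(flags) / n
    q = 1 - p
    return p, math.sqrt(p * q / n)


p, sd = proportion_std(responses)
print(round(p, 3), round(sd, 3))
```

Dividing by N reproduces the slide's 0.145; dividing by N - 1 instead would give roughly 0.153.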
Standard Deviation
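• What does 2 standard deviations mean?
– For an approximately normal distribution, about 95% of values lie
within 2 standard deviations of the mean, so we are roughly 95% confident
that a result from another sample will fall within that range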
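The 95% figure can be checked against the standard normal distribution with `statistics.NormalDist` (Python 3.8+). As an illustration, the same 2 standard deviation band is applied to the response example's figures (mean 0.300, St. Dev. 0.145); treating that small sample as approximately normal is an assumption.

```python
from statistics import NormalDist


def coverage_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


two_sd_coverage = round(coverage_within(2), 4)
print(two_sd_coverage)

# Band of mean +/- 2 standard deviations for the response example.
mean_rate, sd = 0.300, 0.145
low, high = round(mean_rate - 2 * sd, 3), round(mean_rate + 2 * sd, 3)
print(low, high)
```

Strictly, 2 standard deviations covers about 95.45% of a normal distribution; exactly 95% corresponds to about 1.96 standard deviations.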
Are They the Same?
• Consider the following two distributions ...
Distribution A Distribution B
0 4,500
1,000 4,600
2,000 4,700
3,000 4,800
4,000 4,900
5,000 5,000
6,000 5,100
7,000 5,200
8,000 5,300
9,000 5,400
10,000 5,500
Mean 5,000 5,000
St. Dev. 3,316.62 331.66
Are They the Same?
• Even though both A and B have the same mean, the standard
deviation of A is 10 times that of B, hence they are NOT the same
[Figure: values of Distribution A (blue) and Distribution B (purple) plotted for observations 1 to 11]
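The comparison of A and B can be reproduced directly from the two columns of values, using the sample (N - 1) standard deviation from `statistics.stdev`. The variable names are illustrative.

```python
from statistics import mean, stdev

# The 11 values of each distribution, as listed in the table.
dist_a = list(range(0, 10_001, 1_000))    # 0, 1,000, ..., 10,000
dist_b = list(range(4_500, 5_501, 100))   # 4,500, 4,600, ..., 5,500

mean_a, mean_b = mean(dist_a), mean(dist_b)
sd_a, sd_b = stdev(dist_a), stdev(dist_b)

print(mean_a, mean_b)
print(round(sd_a, 2), round(sd_b, 2), round(sd_a / sd_b, 2))
```

Both means are 5,000, while the standard deviations differ by a factor of 10, so the mean alone does not distinguish the two distributions.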