Applied Predictive Analytics - Caselets
Chapter 1
Caselet
One of the most successful digital marketing methods has been email marketing.
But to get the best ROI from a digital marketing campaign, the email must be sent to
a targeted audience that will be interested in the promotion.
A direct response company that sold books and DVDs had developed a large database
of existing customers, along with their historical purchases and their responses to previous
email campaigns. The company repeatedly faced the challenge of identifying the right target
audience for new promotions. It spent considerable time and money creating multiple samples
of customers, sending each new email campaign to each of the samples, analyzing the
responses, and finding which sample audience responded best. These sample segmentations
were created manually and were not necessarily the best possible. The process also delayed
email campaigns and wasted a considerable amount of money.
The company hired an analyst who applied predictive analytics to find the best
target audience for each campaign. Instead of manually creating sample segments, the emails
were sent to a random subset of the audience. Based on the responses, the analyst identified
the key characteristics of those who responded to the test mailing and, from these, assigned a
score to each existing customer. This score could then be used to determine which customers
to mail.
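A minimal sketch of this scoring approach in Python, using pandas and scikit-learn. The customer data below is synthetic and the column names are illustrative, not the company's actual schema.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Toy stand-in for the customer database (columns are illustrative)
customers = pd.DataFrame({
    "num_past_purchases": rng.integers(0, 20, n),
    "days_since_last_purchase": rng.integers(1, 720, n),
    "got_test_mailing": rng.integers(0, 2, n),
})
# Synthetic response flag for the random test mailing
p_respond = 0.05 + 0.01 * customers["num_past_purchases"]
customers["responded"] = ((customers["got_test_mailing"] == 1)
                          & (rng.random(n) < p_respond)).astype(int)

features = ["num_past_purchases", "days_since_last_purchase"]
test = customers[customers["got_test_mailing"] == 1]

# Learn which characteristics distinguish responders from non-responders
model = LogisticRegression(max_iter=1000).fit(test[features], test["responded"])

# Score every existing customer; a higher score means more likely to respond
customers["response_score"] = model.predict_proba(customers[features])[:, 1]

# Mail only the top-scoring customers
to_mail = customers.sort_values("response_score", ascending=False).head(1000)
```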
Chapter 2
Caselet
A nonprofit organization wants to recover lapsed donors. But what defines a lapsed donor? Is
it a donor who has not given any gift in the past 6 months, 1 year, or 2 years? Only a domain
expert in the organization can define this objective.
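A minimal sketch of how the chosen definition might be turned into a flag, assuming a hypothetical `donations` table with one row per gift; the column names and cutoff are illustrative.

```python
import pandas as pd

donations = pd.DataFrame({
    "donor_id": [1, 1, 2, 3],
    "gift_date": pd.to_datetime(["2023-01-15", "2024-06-01",
                                 "2022-03-10", "2024-09-20"]),
})

as_of = pd.Timestamp("2024-12-31")
last_gift = donations.groupby("donor_id")["gift_date"].max()

# Here the expert chose "no gift in the past year"; the cutoff itself is a
# business decision, not a modeling one
lapsed = last_gift < (as_of - pd.DateOffset(years=1))
print(lapsed)  # donor 2 is lapsed; donors 1 and 3 are not
```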
Do we need to predict whether a lapsed donor can be recovered, or do we need to predict
how much a lapsed donor will donate? If we predict only the likelihood of recovering a donor,
we may end up favoring low-dollar donors. Alternatively, predicting the amount a lapsed donor
will donate would help achieve a higher ROI.
What data is available? Is it live data or archived data? If it is archived data, what is the
process to get access to that data, and how long will it take? The project should be planned
based on realistic answers to these questions.
Different predictive models can be built for this problem. At the beginning of the project, the
evaluation criteria must be decided, that is, the metric that will be used to judge how well a
model meets the business objective.
Last but not least, the deployment team must be informed of the deployment requirements as
early as possible, so that potential obstacles to deploying the final model can be identified
early.
Chapter 3
Caselet
A nonprofit organization wants to recover lapsed donors. It has been decided that all donors
who have not contributed within the last year are lapsed donors.
The data analyst gets access to the required data. He finds that the organization has
migrated from storing data in spreadsheets to a CRM application. The spreadsheets store all
date fields as “MM/DD/YY” (where YY can denote years from 1930 to 2029), while the CRM
stores them as binary date-time values. The data analyst flags this as one of the fields that
requires cleaning during the data preparation task.
The data analyst then tries to gain insight into each of the variables. He notices that some
zip codes have very few donors. On analysis, he finds that some records in the spreadsheets
have the leading zeroes stripped from the zip code; for example, “9263” is stored instead of
“09263”. The data analyst flags this as another cleaning task that will be part of data
preparation.
The data analyst identifies problems in the data, such as missing values, outliers, spikes,
and high cardinality, so that these can be fixed during data preparation.
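The kinds of checks behind these findings can be automated. Below is a minimal profiling sketch with pandas, assuming a hypothetical `donors` DataFrame; the checks and the 3-standard-deviation threshold are illustrative.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> None:
    """Print missing-value rates, cardinality, and rough outlier counts."""
    for col in df.columns:
        missing = df[col].isna().mean()
        cardinality = df[col].nunique()
        print(f"{col}: {missing:.1%} missing, {cardinality} distinct values")
        if pd.api.types.is_numeric_dtype(df[col]):
            # Flag values beyond 3 standard deviations as potential outliers
            z = (df[col] - df[col].mean()) / df[col].std()
            print(f"  potential outliers: {int((z.abs() > 3).sum())}")

# Usage (assuming `donors` has been loaded from the merged sources):
# profile(donors)
```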
Chapter 4
Caselet
A nonprofit organization wants to recover lapsed donors. The data analyst has identified most
of the problems in the data, and now those must be fixed. He fixes the problems as follows:
Dates found in the spreadsheets are converted from “MM/DD/YY” (where YY
can denote years from 1930 to 2029) to “MM/DD/YYYY”.
Zip codes are corrected by prefixing zeroes where they are missing: “9263” is changed to
“09263”.
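A minimal sketch of these two fixes, assuming the spreadsheet fields arrive as plain strings; the year pivot follows the stated 1930-2029 rule.

```python
def fix_two_digit_year(date_str: str) -> str:
    """Convert MM/DD/YY to MM/DD/YYYY, with YY pivoting at 1930-2029."""
    mm, dd, yy = date_str.split("/")
    year = int(yy)
    century = 1900 if year >= 30 else 2000  # 30-99 -> 19xx, 00-29 -> 20xx
    return f"{mm}/{dd}/{century + year}"

def fix_zip(zip_code: str) -> str:
    """Restore leading zeroes stripped by the spreadsheet."""
    return zip_code.zfill(5)

print(fix_two_digit_year("07/04/45"))  # 07/04/1945
print(fix_two_digit_year("07/04/05"))  # 07/04/2005
print(fix_zip("9263"))                 # 09263
```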
Less than 0.1% of the data in the spreadsheets is found to contain non-numeric values for
the donated amount, for example “100 dollars”, “250 USD”, etc. The data analyst computes the
average donation made by the same donor from that donor's other records and replaces the bad
value with this imputed value. Records that cannot be imputed this way are removed.
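A minimal sketch of this amount fix with pandas, on a toy `gifts` table; `pd.to_numeric(..., errors="coerce")` turns the non-numeric strings into NaN so they can be imputed per donor.

```python
import pandas as pd

gifts = pd.DataFrame({
    "donor_id": [1, 1, 1, 2],
    "amount": ["50", "70", "100 dollars", "250 USD"],
})

# Parse what is numeric; non-numeric entries become NaN
gifts["amount_clean"] = pd.to_numeric(gifts["amount"], errors="coerce")

# Impute each donor's bad values with that donor's average from other records
donor_mean = gifts.groupby("donor_id")["amount_clean"].transform("mean")
gifts["amount_clean"] = gifts["amount_clean"].fillna(donor_mean)

# Donors with no usable records at all (like donor 2 here) are dropped
gifts = gifts.dropna(subset=["amount_clean"])
```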
The gender field contains “M”, “F”, “Male”, “Female”, “n/a”, and “”. The data analyst
converts all “Male” values to “M” and all “Female” values to “F”. He considers removing the
records that have “n/a” or “” for gender, but those turn out to be 45% of the data. Changing all
the missing gender values to “M” or “F” would also skew the data. He finally decides to convert
all missing genders to “D” (did not respond) and use that value in the model. Based on the
accuracy of the model, he can always revisit this in the next iteration, if required.
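A minimal sketch of the gender fix, on a toy `donors` DataFrame containing the raw values described above.

```python
import pandas as pd

donors = pd.DataFrame({"gender": ["M", "F", "Male", "Female", "n/a", ""]})

# Normalize spellings, then map the two missing-value forms to "D"
mapping = {"Male": "M", "Female": "F", "n/a": "D", "": "D"}
donors["gender"] = donors["gender"].replace(mapping).fillna("D")

print(donors["gender"].value_counts())  # M: 2, F: 2, D: 2
```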
Similarly, the data analyst corrects all problems in the data.
Chapter 5
Caselet
A nonprofit organization wants to approach previous donors for its current donation drive.
Contacting all existing donors would be a waste of time and money. Also, contacting those who
recently donated will have a negative impact on their future donations. How can the nonprofit
organization identify the donors to contact for its current donation drive?
In this scenario, descriptive modeling (unsupervised learning) methods are used to discover
the best way to segment the data.
The following are some of the questions that can be answered with this type of modeling (a
sketch of the corresponding summaries follows the list):
1. In which age range do a large number of people give their first donation?
2. In which age range do people give their largest donation?
3. How does the size of donation vary by gender?
4. How does the size of donation vary by marital status?
From this analysis, a proper strategy can be created for soliciting donations from the different
segments.
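A minimal sketch of computing the four summaries above with pandas group-bys; the `donors` DataFrame, its columns, and the age bands are hypothetical.

```python
import pandas as pd

donors = pd.DataFrame({
    "age_at_first_gift": [23, 35, 41, 58, 62, 29],
    "largest_gift": [25, 100, 250, 500, 1000, 50],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "marital_status": ["single", "married", "married",
                       "married", "widowed", "single"],
})

# Questions 1-2: how first gifts and largest gifts distribute across age bands
age_bands = pd.cut(donors["age_at_first_gift"], bins=[18, 30, 45, 60, 100])
print(donors.groupby(age_bands, observed=True)["largest_gift"]
      .agg(["count", "mean"]))

# Questions 3-4: donation size by gender and by marital status
print(donors.groupby("gender")["largest_gift"].mean())
print(donors.groupby("marital_status")["largest_gift"].mean())
```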
Chapter 7
Caselet
A nonprofit organization has collected a lot of historical data on its donors and decides to
cluster-analyze the donors into five different groups. The clustering generated groups ranging
in size from single digits to a few thousand.
The input variables included Date of Birth, Gender, Marital Status, Number of
Children, Annual Income, Highest Education, Domicile State and many more. While it is
simple to visualize clusters formed with two inputs, the reality is that many cluster models are
created from dozens of inputs. Interpreting cluster models is a challenge for predictive
modelers, because there are no clear and standard metrics like those used to assess supervised
learning models.
In this case, the predictive modeler used the ANOVA technique to identify the most
important variables.
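A minimal sketch of that idea on synthetic data: a one-way ANOVA F-test per input, grouped by cluster label, ranks the inputs by how strongly they separate the clusters.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n, k = 500, 5
labels = rng.integers(0, k, n)  # synthetic cluster assignments

# Two hypothetical inputs: income varies by cluster, children does not
annual_income = 40_000 + 5_000 * labels + rng.normal(0, 3_000, n)
num_children = rng.integers(0, 4, n).astype(float)

for name, values in [("annual_income", annual_income),
                     ("num_children", num_children)]:
    groups = [values[labels == c] for c in range(k)]
    f_stat, p_val = f_oneway(*groups)
    print(f"{name}: F = {f_stat:.1f}, p = {p_val:.3g}")
# A large F (small p) means the variable differs sharply across clusters,
# so it is important for interpreting them
```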
Chapter 9
Caselet
A nonprofit organization wants to recover lapsed donors. The business objective is to build a
classification model that predicts the likelihood (as a binary flag) that a lapsed donor will
respond to a mailing. The predictive modeler creates multiple models using the following
algorithms:
1. Decision trees
2. Logistic regression
3. K-nearest neighbor
4. Naïve Bayes
Now, a decision must be made about which model to deploy in production. This is done by
assessing model accuracy. The best model is the one that optimizes the business objective.
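One way such an assessment might be carried out with scikit-learn is to compare the four candidates on a common cross-validated metric; the synthetic data and the choice of ROC AUC below are stand-ins for the organization's real data and its business-driven metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lapsed-donor data
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=15),
    "naive_bayes": GaussianNB(),
}

# Compare all candidates on the same cross-validated metric
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```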
Chapter 10
Caselet
A nonprofit organization wants to recover lapsed donors. The business objective is to build a
classification model that predicts the likelihood (as a binary flag) that a lapsed donor will
respond to a mailing. The predictive modeler creates multiple models using the following
algorithms:
1. Decision trees
2. Logistic regression
3. K-nearest neighbor
4. Naïve Bayes
Traditionally, we would select the single best model for deployment. In the ensemble approach,
we deploy them all. The reason for this is improved accuracy. In fact, model ensembles not
only improve model accuracy, they can also improve model robustness. By averaging multiple
models into a single prediction, no single model dominates the final predicted value, reducing
the likelihood that a flaky prediction will be made.
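A minimal sketch of such an ensemble, again with scikit-learn on synthetic data: soft voting averages the four models' predicted probabilities, so no single model dominates the final prediction.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# voting="soft" averages predict_proba across the member models
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=15)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
auc = cross_val_score(ensemble, X, y, cv=5, scoring="roc_auc").mean()
print(f"ensemble: mean AUC = {auc:.3f}")
```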