Data Mining Caselets

The document presents several case studies (caselets) that highlight the importance of data preparation and exploration before building machine learning models: 1) a healthcare project building a pneumonia risk model failed to account for asthma patients, incorrectly recommending that some high-risk patients be treated at home, showing that proper data preparation is crucial; 2) an e-commerce analyst's sales model was flawed because the underlying data was incomplete and inconsistent, lacking details at the locality level; 3) exploratory data analysis of customer data for a clothing store revealed a growing segment of plus-size customers who were being overlooked in marketing campaigns; and 4) an automaker used principal component analysis to reduce correlations between attributes in survey data and cut the number of features before analysis.

Chapter 1

Caselet

In today's business landscape, almost all businesses, irrespective of their operational or functional domain, understand the importance of turning historical data into novel business actions as a daily practice. This makes analytics and data science an essential tool for generating new business insights that lead to improved business actions.

For example, a team on a healthcare project is building a recommender system that uses machine learning algorithms to advise whether pneumonia patients should be treated at home or in hospital, based on historical data previously collected from clinics.

The team started with the hypothesis that patients with a low risk of death can take antibiotics at home, whereas patients with a high risk of death from pneumonia should be treated in hospital. However, they missed an important exception to this hypothesis: patients with asthma must be treated as a special case, since the combined effect of pneumonia and asthma is far more dangerous. Following conventional medical practice, doctors send such cases to the intensive care unit for closer supervision and treatment. Because of this practice, typical historical records from clinics contain almost no asthmatic death cases. The recommender system's machine learning algorithm therefore learns that asthma is not dangerous during pneumonia and, in all cases, recommends sending asthmatic patients home, even though they actually have the highest risk of pneumonia complications.

The point worth highlighting here is that the machine learning model-building phase requires training data, and with a very large data set a good model can usually be built. However, if any crucial information is missing, the model may become useless or even harmful, as in the case above. This is why data preparation is a crucial step in the machine learning process.

The data preparation step helps transform the raw data set into a more usable initial building block for machine learning. Data preparation also includes establishing the right data acquisition mechanism. Typically, data preparation procedures consume most of the time spent on machine learning model building, often months before the first ML model or algorithm is built! Steve Lohr of The New York Times wrote: “Data scientists, according to interviews and expert estimates, spend 50–80% of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”

It is widely acknowledged that up to 80% of a data scientist's time and effort goes into collecting, cleaning, and preparing data for analysis, because data sets come in various sizes and differ in nature. It is extremely important for a data scientist to reshape and refine raw data sets into usable ones that can be leveraged for analytics. This chapter gives a top-level view of data preparation, its importance, and how it is done.
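
As a rough illustration of what such preparation can look like in practice (this code is not from the caselet; the records and column names such as age, comorbidity, and outcome are invented), the following Python sketch applies a few typical cleaning steps with pandas before any model is trained.

```python
# Minimal data preparation sketch on a few hypothetical clinic records.
import pandas as pd

# Hypothetical raw records (values are invented; in practice this would be
# loaded from the clinics' data sources).
raw = pd.DataFrame({
    "age":         [34, 34, -5, 61, 47],
    "comorbidity": ["asthma", "asthma", None, None, "diabetes"],
    "outcome":     ["home", "home", "hospital", None, "hospital"],
})

# Drop exact duplicate rows.
clean = raw.drop_duplicates()

# Remove rows with impossible ages and fill missing comorbidity flags.
clean = clean[(clean["age"] >= 0) & (clean["age"] <= 120)].copy()
clean["comorbidity"] = clean["comorbidity"].fillna("none")

# Keep only rows where the target label (home vs. hospital) is present.
clean = clean.dropna(subset=["outcome"])

print(clean)
```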
Chapter 2

Caselet

The important first step in any data mining (also called knowledge discovery) project is data pre-processing, which is often overlooked and not given enough importance. The essential rule of thumb in data mining is captured in the phrase "Garbage In, Garbage Out." The first step in attacking and defining the problem statement is to ensure that data quality satisfies the requirements of the intended use. Many factors contribute to data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

The first source of error in data quality comes from data gathering methods, which are often loosely controlled, resulting in out-of-range values (e.g., Income: –100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, and so on. Analyzing data that has not been carefully screened for such problems can produce misleading results. Checking the quality of the data before running an analysis is therefore essential.
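
As a small illustration of such screening (the data frame and column names below are invented for this example and are not part of the caselet), the following Python sketch flags the three kinds of problems just mentioned.

```python
# Hypothetical screening for out-of-range values, impossible combinations,
# and missing values using pandas.
import pandas as pd

records = pd.DataFrame({
    "Income":   [52000, -100, 71000, None],
    "Gender":   ["Male", "Female", "Male", "Female"],
    "Pregnant": ["No", "Yes", "Yes", None],
})

# Out-of-range values: income can never be negative.
out_of_range = records[records["Income"] < 0]

# Impossible combinations: Gender = Male together with Pregnant = Yes.
impossible = records[(records["Gender"] == "Male") & (records["Pregnant"] == "Yes")]

# Missing values per column.
missing = records.isna().sum()

print(out_of_range, impossible, missing, sep="\n\n")
```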

If a lot of irrelevant and redundant information, or noisy and unreliable data, is present, knowledge discovery during the training phase becomes more difficult. Data pre-processing and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, and so on. The end result of data pre-processing is the final training set.

As an example, consider Mohan, a data analyst at a leading e-commerce company in India. He has been given the responsibility of analyzing the company's data with respect to sales metrics, and he immediately started working on the assignment. He pulled the company's sales data from the data warehouse and built a model of the sales metric against several attributes or dimensions, such as item, price, and units sold, without paying attention to data quality. Later, when he tried to answer questions using the model, he realized that insights were missing. He found that answers to queries at the city level were available, but not at the level of localities within a city. He then discovered that city-level sales figures had always been stored in aggregated form across localities, and many localities lacked individual sales data. Furthermore, other data analysts in the company also reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data they analyzed with data mining techniques was incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data); inaccurate or noisy (containing errors, or values that deviate from the expected); and inconsistent (e.g., containing discrepancies in the department codes used to categorize items).

Welcome to the real world!


Chapter 3

Caselet

Quite often, a business needs to take critical decisions to ensure growth, and these decisions are normally based on analysis of historical data. Before starting the detailed data mining process, analysts prefer to see patterns and a top-level view of trends in the variables and attributes. For this, data exploration, which applies underlying statistical methodologies and produces visualizations of the data, becomes a crucial tool for bringing out important aspects of the data and identifying key areas for further analysis. Data exploration is therefore a necessary step before data mining. Exploratory data analysis (EDA) is used to identify features and patterns that stand out clearly from the surrounding noise.
A well-known online clothing store was reviewing the performance of its marketing campaigns. It identified the group of people who had NOT clicked on any of its marketing campaigns over the past year, and it was decided that these shoppers should NOT be included in any future campaigns. An analyst was then assigned the task of identifying the reason for these shoppers' lack of interest.
The analyst started by looking at the purchases this segment had made earlier. During the EDA, he analyzed the distribution of these customers across categories such as formal/casual and men/women, as well as by age, location, and so on. There was nothing abnormal about these distributions.
He then examined the distribution based on the size of the clothes they had purchased and found a peak around the plus size. This was an interesting pattern.
On further analysis, he found that these customers were growing at a faster rate than the rest of the customer base. The analysis also revealed that certain brands were favorites of these customers.
He followed up on these patterns and checked the historic marketing campaigns. The campaigns had always featured models who were either skinny or of normal weight; they had never featured plus-size models.
This EDA gave the company a segment of plus-size buyers that was growing rapidly and had clear brand preferences. The marketing team can now analyze this segment in detail and design a dedicated campaign for it, making future campaigns more successful.
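
A minimal sketch of the kind of distribution check described above is shown below; the purchase data and size labels are invented purely for illustration.

```python
# Hypothetical EDA: distribution of purchases by clothing size.
import pandas as pd
import matplotlib.pyplot as plt

purchases = pd.DataFrame({
    "size": ["S", "M", "L", "XL", "Plus", "Plus", "M", "Plus", "L", "Plus"],
})

# Frequency of each size among the non-responding shoppers.
size_counts = purchases["size"].value_counts()
print(size_counts)

# A simple bar chart makes a peak around the plus size easy to spot.
size_counts.plot(kind="bar", title="Purchases by clothing size")
plt.xlabel("Clothing size")
plt.ylabel("Number of purchases")
plt.show()
```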
Chapter 4

Caselet

A large automobile-manufacturing company is planning to start manufacturing a small car in India. First, it conducts a survey of the existing small-car market and assigns a business analyst to it. The analyst gathers data from multiple sources and stores it in a database containing thousands of records and hundreds of attributes. A typical task before starting the analysis is to identify whether all attributes are independent or whether relationships exist among the attributes/features/inputs/predictor variables; in other words, we need to determine whether any correlations exist between them. Multicollinearity is a phenomenon in which some features are strongly correlated with one another, and the results of any model built on such data may be erroneous. The analyst also does not want to use all the features, since that makes the model more complex and may result in overfitting. Therefore, the analyst wants to reduce the number of attributes/features and select ones that are not correlated, all without sacrificing model accuracy. For this, he picked a very popular method of feature reduction called Principal Component Analysis (PCA).
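
A minimal PCA sketch is given below; the survey attributes are simulated (the feature names and numbers are invented), but the steps of standardizing the data and keeping only the components that explain most of the variance follow standard scikit-learn usage.

```python
# Minimal PCA sketch on simulated, correlated survey attributes.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Simulate correlated attributes, e.g., income and price sensitivity.
income = rng.normal(50, 10, size=200)
price_sensitivity = -0.8 * income + rng.normal(0, 3, size=200)
mileage_pref = rng.normal(20, 4, size=200)
X = np.column_stack([income, price_sensitivity, mileage_pref])

# PCA assumes comparable scales, so standardize the features first.
X_std = StandardScaler().fit_transform(X)

# Keep enough principal components to explain about 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print("Components kept:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```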
Chapter 5

Caselet

A large pharmaceutical company is trying to find the relationship between diet and cholesterol in patients who have had a heart attack. One of the major tasks of a data scientist working on the problem is to identify patterns in the historical data relevant to the problem definition. In this process, the obvious choice is to first apply some statistical analysis to the data. However, in the real world, an analyst may not have complete data. For example, data on the patients' diets has been collected from many hospitals. Some hospitals have collected data on cheese, eggs, milk, red meat, fish, etc. Others have collected data on the consumption of ghee, the type of oil used, fast-food consumption, and so on, while some have recorded the type of staple food consumed, such as rice, wheat, or ragi.
Because of this data issue, the analyst decides to start with univariate statistical analysis, in which one variable is selected and analyzed at a time. For example, in the first analysis, the analyst may select the data on egg consumption and analyze its relation to cholesterol; in other analyses, he may consider red meat versus cholesterol or fish versus cholesterol.
The other problem a data analyst may face is that the available sample may not be truly representative of the population in the problem domain; for instance, the sample size may be small (say, only 100 individuals), whereas good statistical analysis requires a fairly large sample. Nevertheless, with the given data we can estimate the error and determine the sample size required to reduce that error to an acceptable level. With that information, the analyst can then propose a new survey with the required variables and the required sample size.
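
As a rough illustration of this last point (not taken from the caselet), the sketch below estimates the sample size needed for a desired margin of error on a mean cholesterol value, using the normal approximation n = (z * sigma / E)^2; the standard deviation and margin of error are invented.

```python
# Rough sample-size estimate using the normal approximation n = (z * sigma / E)^2.
import math
from scipy.stats import norm

confidence = 0.95
sigma = 40.0   # assumed standard deviation of cholesterol (mg/dL) from the pilot data
margin = 5.0   # desired margin of error (mg/dL)

z = norm.ppf(1 - (1 - confidence) / 2)          # two-sided critical value (about 1.96)
n_required = math.ceil((z * sigma / margin) ** 2)

print(f"Required sample size: {n_required}")    # roughly 246 individuals
```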
Chapter 6

Caselet

In general, a business needs to take decisions based on past data. In real-world data mining scenarios, we normally find that the business inference (output) is affected not by a single attribute (input) but by multiple inputs. In such cases, we need to adopt multivariate data analysis, which is essentially a mathematical approach to building models from multiple inputs.

A large marketing agency uses multiple marketing channels, such as print, radio, TV, and online, for the advertising campaign of a fashion product. However, it always struggles to find the right channel to reach the right set of people, where people are categorized by age, gender, location, income, spending history, and so on. Deciding how to allocate the budget across marketing channels for the desired targeting (e.g., young age, low income) so as to maximize the return on investment (ROI) makes multivariate statistical methodology a good choice.

The data for the model is collected by running multiple campaigns for the product, in which several copies of the advertisement are designed for specific target groups according to the customer segmentation and tried with different proportions of the budget allocated to each channel. The data collected from these campaigns is then used to build a model that can suggest the fund allocation across channels.
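
One possible shape of such a model is sketched below (the spend figures, returns, and candidate allocations are all invented): a multivariate regression is fitted on per-channel spend, and the fitted model is then used to compare two candidate budget allocations.

```python
# Hypothetical multivariate sketch: compare candidate budget allocations
# using a regression of campaign return on per-channel spend.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_campaigns = 60
channels = ["print", "radio", "TV", "online"]

# Spend per channel for past campaigns (arbitrary currency units).
X = rng.uniform(1, 10, size=(n_campaigns, 4))
assumed_effect = np.array([0.4, 0.6, 1.2, 1.8])          # invented channel effectiveness
returns = X @ assumed_effect + rng.normal(0, 1.0, size=n_campaigns)

model = LinearRegression().fit(X, returns)

# Compare two candidate allocations of roughly the same total budget.
candidates = np.array([
    [10, 10, 5, 5],    # heavy on print and radio
    [2, 3, 10, 15],    # heavy on TV and online
])
for alloc, predicted in zip(candidates, model.predict(candidates)):
    print(dict(zip(channels, alloc)), "-> predicted return:", round(predicted, 1))
```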
Chapter 7

Caselet

In data mining applications and model building, two problems are quite common after the initial stage of data pre-processing. The first is the lack of adequate data: the available data sample may not be a true representative of the population it is supposed to represent. The second relates to the complexity of the chosen model.
A large real-estate company has been in operation for a long time (say, 10 years). It therefore has a good amount of data showing how property rates have varied with time in particular locations over the last 10 years, and it now wants to build a model based on this historical data. However, apart from the relationship between property type (e.g., 1 BHK, 2 BHK, area in square feet, super built-up area, etc.) as input and property price as output, it does not have any other data. Other factors that can influence a sale, such as the present market conditions, the kind of neighborhood, connectivity, and greenery, have no corresponding data points. This is the problem of inadequate data.
Assume we build a model with the number of bedrooms as input and price as output. This model may be generic enough, with low complexity (underfitting), but its error will be large. We can also build a complex model whose regression curve passes through all the data points (overfitting), but such a model is too specific to the given data, and any new test data set may produce a large test error. The data analyst should therefore select the model carefully, since this choice will affect the model's accuracy on unseen data.
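
The trade-off can be seen in a small experiment like the one below (the house data is simulated and the polynomial degrees are arbitrary): a very simple model shows a sizeable error on both training and test data, while a very complex one fits the training data closely but does worse on unseen test data.

```python
# Sketch of underfitting vs. overfitting on simulated house-price data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
area = rng.uniform(400, 2000, size=30).reshape(-1, 1)                   # square feet
price = 10 + 1.5 * np.sqrt(area.ravel()) + rng.normal(0, 5, size=30)    # lakhs, noisy

X_train, X_test, y_train, y_test = train_test_split(
    area, price, test_size=0.3, random_state=0
)

for degree in (1, 12):   # low vs. very high model complexity
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree}: train MSE={train_err:.1f}, test MSE={test_err:.1f}")
```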
Chapter 8

Caselet

A multinational taxi aggregator is expanding its operations into a new country. It has invited potential drivers who own their own vehicles to associate with it. To limit its liability in case of accidents, it is the company's policy to insure all the taxis working for it, and it plans to recover the insurance premium from each driver based on the driver's performance. Since the drivers are new to the company, no data is available on their profiles. The company therefore decides to use customer ratings to calculate the insurance premium and recover it from the drivers. It built a simple regression model with the customer rating as input and the insurance premium as the output. This is a workable model to start with; as historical data on the drivers accumulates, a better model can be built.
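
A minimal version of such a starting model is sketched below; the ratings and premiums are invented for illustration only.

```python
# Minimal simple-regression sketch: customer rating -> insurance premium.
import numpy as np
from sklearn.linear_model import LinearRegression

ratings = np.array([3.2, 3.8, 4.0, 4.3, 4.6, 4.9]).reshape(-1, 1)
premium = np.array([5200, 4900, 4700, 4450, 4200, 4000])   # currency units per year

model = LinearRegression().fit(ratings, premium)

# Under this simple model, better-rated drivers are charged a lower premium.
new_driver_rating = np.array([[4.5]])
print("Predicted premium:", round(model.predict(new_driver_rating)[0]))
```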
Chapter 9

Caselet

An international music company always stores data about the number of music CDs it produces for the market at the time of first release. This number depends on various factors, such as the singer, the category of music, the target market size, the gender and age of the listeners being targeted, locations, and so on. The company now wants to build an AI model so that, based on this historical data, an algorithm can predict how many music CDs should be released initially. This is a typical scenario in which the first model is built using multiple linear regression (MLR). An MLR model is a linear equation relating the response variable (the number of music CDs) to the selected input features through weight coefficients in the form of slopes.

In general, the information conveyed by the slope coefficients in MLR is slightly different from that in simple linear regression (LR). For an MLR model with 'n' predictor variables, a data scientist interprets each coefficient as the estimated change in the response variable for a unit increase in that predictor, provided all other predictor variables are held constant. The typical error in prediction (the model's output response) is measured by the residual in the response (i.e., actual response – predicted response). In LR, the residual can be thought of as the vertical distance between the actual data point and the regression line, whereas in MLR the residual is the vertical distance between the data point and the regression plane or hyperplane. When a new predictor variable or feature is added to the model, the coefficient of determination (R²) never decreases, while the adjusted R² goes up only if the new variable turns out to be useful and goes down otherwise.
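
A minimal MLR sketch in the spirit of this caselet is given below; the feature names, numbers, and assumed effects are all invented. Each fitted slope is read as the estimated change in the number of CDs for a unit increase in that feature, holding the other features constant.

```python
# Minimal multiple linear regression sketch on invented release data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 100
market_size = rng.uniform(10, 100, size=n)        # thousands of potential listeners
singer_popularity = rng.uniform(1, 10, size=n)    # popularity score
avg_listener_age = rng.uniform(15, 60, size=n)    # years

X = np.column_stack([market_size, singer_popularity, avg_listener_age])
cds = (500 + 40 * market_size + 300 * singer_popularity
       - 10 * avg_listener_age + rng.normal(0, 200, size=n))

model = LinearRegression().fit(X, cds)

for name, coef in zip(["market_size", "singer_popularity", "avg_listener_age"],
                      model.coef_):
    print(f"{name}: estimated change in CDs per unit increase = {coef:.1f}")
```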
Chapter 15

Caselet

A cybersecurity company was working on the face-matching component of its digital identity-matching solution. The company successfully built a model using a previously collected facial image database. However, once deployed in a real-time scenario, the matching was not accurate enough. The reason was the variety of angles and poses encountered in the real-time scenario, along with variations in lighting and background noise, compared with the image database on which the model was built.

To resolve this problem, the company started checking the validity of the face-matching component on test cases, first from the database and then extended to real-time images. Whenever the accuracy was not up to the accepted level, the algorithm was modified and retested. After repeating this cycle of testing, accuracy checking, and model tweaking, the company finally achieved the required accuracy.

In summary, accuracy checking is a crucial step in building the final production-ready component. Whenever models do not stay within the acceptable error, the company's time and money are wasted. To prevent this, models are normally evaluated before being deployed to production.
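
A bare-bones version of such a pre-deployment check is sketched below; the labels, predictions, and accuracy threshold are invented for illustration.

```python
# Minimal pre-deployment evaluation sketch for a matching component.
from sklearn.metrics import accuracy_score

ACCEPTED_ACCURACY = 0.95

# Hypothetical results on a held-out test set:
# 1 = the pair should be matched, 0 = it should not.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
print(f"Test accuracy: {accuracy:.2f}")

if accuracy < ACCEPTED_ACCURACY:
    print("Below the accepted level: tweak the model and re-test before deployment.")
else:
    print("Meets the accepted level: ready for production.")
```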
