2 - Business Problems and Data Science Solutions

Each data-driven business decision-making problem is unique,
comprising its own combination of goals, desires, and constraints.

In collaboration with business stakeholders, data scientists decompose a
business problem into subtasks.

The solutions to the subtasks can then be composed to solve the overall
problem.

Some of these subtasks are unique to the particular business problem, but
others are common data mining tasks.

For example, our telecommunications churn problem, which we saw in the
previous class, is unique to MegaTelCo.

However, a subtask that will likely be part of the solution to any churn
problem is to estimate from historical data the probability of a customer
terminating her contract.

Once you have solved this subtask, the solution can be applied to churn
problems at different companies in the same business or even across
business domains.

A critical skill in data science is the ability to decompose a data-analytics
problem into pieces such that each piece matches a known task for
which tools are available.

Recognizing familiar problems and their solutions avoids wasting time
and resources reinventing the wheel.

There are a large number of data mining and machine learning algorithms.

These algorithms, though, perform only a handful of tasks.

The two most common of these tasks are:

1) Classification

2) Regression

Tasks performed by data mining and machine learning algorithms:

1) Classification

The goal here is to classify a sample (data point) into the most probable
class.

E.g. for the churn problem we have been studying: “Among all the customers
of MegaTelCo, which are likely to respond to a given offer?” In this example
the two classes could be called “will respond” and “will not respond”.

A closely related task to classification is scoring or class probability
estimation.

A scoring model applied to an individual outputs, for each class, the
probability that the individual belongs to that class.

In these models, the class which has the highest probability becomes the
predicted class.
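The rule above (predict the class with the highest estimated probability) can be sketched in a few lines of Python. The probabilities and the function name predict_class are made up for illustration; a real scoring model would estimate the probabilities from historical data.

```python
def predict_class(class_probabilities):
    """Return the class whose estimated probability is highest."""
    return max(class_probabilities, key=class_probabilities.get)


# Hypothetical output of a scoring model for one MegaTelCo customer.
scores = {"will respond": 0.27, "will not respond": 0.73}

predicted = predict_class(scores)
print(predicted)
```

Keeping the full set of probabilities, and not only the predicted class, lets us rank customers by how likely they are to respond.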
2) Regression

Regression attempts to estimate or predict, for each individual, the
numerical value of some variable for that individual.

E.g. “What is the value of this house, given its age, number of bedrooms,
and number of bathrooms?”
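As a minimal sketch of regression, the code below fits a straight line by ordinary least squares to predict a house's value from its age alone. The data points and the names fit_line and predict are hypothetical, and using a single input variable is a simplifying assumption; a realistic model would also use bedrooms, bathrooms, and more.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict the numeric target for a new input value."""
    return intercept + slope * x


# Made-up training data: house age in years, value in thousands of dollars.
ages = [0, 10, 20, 30]
values = [300, 280, 260, 240]

slope, intercept = fit_line(ages, values)
print(predict(slope, intercept, 15))
```

The output is a number, not a class, which is what distinguishes regression from classification.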
3) Similarity matching

Similarity matching attempts to identify similar individuals based on
data known about them. Similarity matching can be used directly to find
similar entities.

E.g. IBM is interested in finding companies similar to its best business
customers, in order to focus its sales force on the best opportunities.

Netflix and Amazon also use similarity matching to make recommendations.
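One simple way to measure similarity is the distance between two entities when each is described by a list of numbers. The sketch below uses Euclidean distance; the company names, the two features, and their values are invented for illustration, and Euclidean distance is only one of several possible similarity measures.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, candidates):
    """Return the name of the candidate closest to the target."""
    return min(candidates, key=lambda name: euclidean_distance(target, candidates[name]))


# Hypothetical features, already scaled to 0..1: (company size, IT spend).
best_customer = (0.90, 0.80)
prospects = {
    "Acme": (0.20, 0.10),
    "Globex": (0.85, 0.75),
    "Initech": (0.50, 0.90),
}

closest = most_similar(best_customer, prospects)
print(closest)
```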


4) Clustering

Clustering attempts to group individuals in a population together by
their similarity, but not driven by any specific purpose.

E.g. “Do our customers form natural groups or segments?”

Clustering is useful in preliminary domain exploration to see which
natural groups exist, because these groups in turn may suggest other
data mining tasks or approaches.
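To make the idea concrete, here is a small sketch that groups customers by monthly spend using a k-means style procedure on one variable. The choice of k-means, the spend values, and the starting centers are all assumptions made for illustration; the slides do not name a specific clustering algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of the values assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical monthly spend per customer; no labels are provided.
monthly_spend = [10, 12, 11, 50, 52, 49]

centers, clusters = kmeans_1d(monthly_spend, centers=[10.0, 50.0])
print(centers)
print(clusters)
```

No target was specified here: the two groups come only from how close the spend values are to each other.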
5) Frequent itemset mining

Frequent itemset mining is one of the very first data mining tasks. The
Apriori algorithm is one of the earliest algorithms developed in this space.

E.g. Walmart trying to figure out which items sell together.

The key here is to uncover patterns that people would not have
anticipated on their own.

For example, analyzing purchase records from a supermarket may uncover
that ground meat is purchased together with hot sauce much more
frequently than we might expect.
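The sketch below counts how often pairs of items appear in the same basket and compares that with what we would expect if the items were bought independently (often called lift). It is a brute-force count over all pairs, not the Apriori algorithm, and the baskets are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets.
baskets = [
    {"ground meat", "hot sauce", "buns"},
    {"ground meat", "hot sauce"},
    {"milk", "bread"},
    {"ground meat", "hot sauce", "milk"},
    {"bread", "buns"},
]


def pair_statistics(baskets):
    """Return {(item_a, item_b): (support, lift)} for every co-occurring pair."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        expected = (item_counts[a] / n) * (item_counts[b] / n)  # if independent
        stats[(a, b)] = (support, support / expected)
    return stats


stats = pair_statistics(baskets)
for pair, (support, lift) in sorted(stats.items()):
    if support >= 0.5:
        print(pair, round(support, 2), round(lift, 2))
```

A lift above 1 means the pair occurs together more often than independent purchases would suggest.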
6) Profiling

Profiling is often used to establish behavioral norms for anomaly
detection applications such as fraud detection and monitoring for
intrusions to computer systems.

E.g. if we know what kind of purchases a person typically makes on a
credit card, we can determine whether a new charge on the card fits
that profile or not. We can use the degree of mismatch as a suspicion
score and issue an alarm if it is too high.
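A minimal sketch of the suspicion score described above: the profile is the mean and standard deviation of a cardholder's past charges, and the score is how many standard deviations a new charge lies from the mean. The amounts, the threshold of 3, and the use of a z-score as the mismatch measure are all assumptions for illustration.

```python
from statistics import mean, stdev


def suspicion_score(past_amounts, new_amount):
    """Distance of the new charge from the cardholder's typical charge,
    measured in standard deviations of the past charges."""
    typical = mean(past_amounts)
    spread = stdev(past_amounts)  # assumes the past amounts are not all equal
    return abs(new_amount - typical) / spread


# Hypothetical charge history for one credit card.
history = [20, 25, 22, 30, 28]
ALARM_THRESHOLD = 3.0  # assumed cutoff; in practice this would be tuned

for charge in (27, 400):
    score = suspicion_score(history, charge)
    print(charge, round(score, 2), "ALARM" if score > ALARM_THRESHOLD else "ok")
```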
7) Link Prediction

Link prediction attempts to predict connections between data items, usually by
suggesting that a link should exist, and possibly also estimating the strength of the
link.

E.g. for recommending movies to customers, one can think of a graph between
customers and the movies they’ve watched or rated. Within the graph, we search
for links that do not exist between customers and movies, but that we predict
should exist and should be strong. These links form the basis for recommendations.
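The graph search described above can be sketched with a simple scoring rule: a missing customer-movie link gets a higher score when the movie was watched by other customers who share many movies with this customer. The viewing data and the scoring rule are assumptions chosen for illustration, not a method given in the slides.

```python
# Hypothetical customer -> watched-movies graph.
watched = {
    "ann": {"Alien", "Blade Runner"},
    "bob": {"Alien", "Blade Runner", "Contact"},
    "cat": {"Amelie"},
}


def link_scores(customer, watched):
    """Score each movie the customer has not watched by the overlap
    between this customer and the other customers who watched it."""
    seen = watched[customer]
    scores = {}
    for other, movies in watched.items():
        if other == customer:
            continue
        overlap = len(seen & movies)
        if overlap == 0:
            continue  # no shared movies, so no evidence for a link
        for movie in movies - seen:
            scores[movie] = scores.get(movie, 0) + overlap
    return scores


print(link_scores("ann", watched))
```

The score plays the role of the estimated strength of the predicted link; the highest-scoring missing links would be recommended first.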
8) Data Reduction

Data reduction attempts to take a large set of data and replace it with a
smaller set of data that contains much of the important information in
the larger set.

A popular technique for data reduction is called “Principal Component
Analysis” or PCA.

Data reduction usually involves loss of information. It is a tradeoff
between preserving information and reducing the dimensions so that the
model trains faster.
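As a rough sketch of the idea behind PCA, the code below reduces two-dimensional points to one dimension by projecting them onto the direction of greatest variance. It handles only the two-variable case, where that direction has a closed form; the data points and the function name are hypothetical, and real projects would normally use a library implementation.

```python
import math


def first_principal_component(points):
    """Return the unit direction of greatest variance for 2-D points and
    the 1-D coordinates of the centered points along that direction."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected


# Made-up data: two strongly related measurements per individual.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

direction, reduced = first_principal_component(points)
print(direction)
print(reduced)  # one number per individual instead of two
```

Whatever variation lies off the chosen direction is discarded, which is the information loss mentioned above.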
9) Causal Modeling

Causal modeling attempts to help us understand what events or actions
actually influence others.

E.g. consider that we use predictive modeling to target advertisements
to consumers, and we observe that indeed the targeted consumers
purchase at a higher rate subsequent to having been targeted. Was this
because the advertisements influenced the consumers to purchase? Or
did the predictive models simply do a good job of identifying those
consumers who would have purchased anyway?
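One common way to separate these two explanations is a randomized experiment: among the consumers the model selects, show the advertisement to a random subset and hold out the rest as a control group. The sketch below assumes such an experiment was run; the outcome lists are invented (1 = purchased, 0 = did not purchase).

```python
def purchase_rate(outcomes):
    """Fraction of consumers in the group who purchased."""
    return sum(outcomes) / len(outcomes)


def estimated_effect(treated, control):
    """Difference in purchase rate between consumers who saw the ad and
    comparable consumers who did not."""
    return purchase_rate(treated) - purchase_rate(control)


# Hypothetical outcomes. Both groups were selected by the same predictive
# model; assignment to 'treated' or 'control' is assumed to be random.
treated = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
control = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

print(round(purchase_rate(treated), 2))
print(round(purchase_rate(control), 2))
print(round(estimated_effect(treated, control), 2))
```

If the control group purchases at nearly the same rate, the model was mostly finding consumers who would have purchased anyway.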
Supervised Versus Unsupervised Methods
Supervised learning is when your data set comes with labels.

E.g. let's say you have a file containing information about customers.

Each row in the file corresponds to one customer.

If each customer in this file is labeled as a good customer or a bad
customer, and our objective is to build a model that predicts whether a
new customer is good or bad, then this is a case of supervised learning.

Unsupervised learning is when your data set does not have any labels.

E.g. “Do our customers naturally fall into different groups?” Here no specific
purpose or target has been specified for the grouping. When there is no
such target, the data mining problem is referred to as unsupervised.

Clustering, an unsupervised task, produces groupings based on
similarities, but there is no guarantee that these similarities are
meaningful or will be useful for any particular purpose.

Supervised tasks require different techniques than unsupervised tasks
do, and the results often are much more useful.

Supervised learning is more widely adopted than unsupervised learning
at the moment.

For supervised learning, acquiring data on the target often is a key data
science investment. The value for the target variable for an individual is
often called the individual’s label.

Getting labeled data for supervised learning will often incur an expense.

Supervised tasks
Classification, regression, and causal modeling generally are solved with
supervised methods.

Unsupervised tasks
Clustering, co-occurrence grouping (frequent itemset mining), and profiling
generally are unsupervised.

Similarity matching, link prediction, and data reduction could be either.


Two main subclasses of supervised data mining, classification and
regression, are distinguished by the type of target. Regression involves a
numeric target while classification involves a categorical (often binary)
target.

“Will this customer purchase service S1 if given incentive I?”

Type = classification (binary target)

“Which service package (S1, S2, or none) will a customer likely purchase
if given incentive I?”

Type = classification into three classes: S1, S2, and none

“How much will this customer spend in a month?”

Type = regression, because we are going to predict a numeric value


Data Mining / Machine Learning Project
SOFTWARE SKILLS VERSUS ANALYTIC SKILLS
Coding is a core skill for software engineers.

Is being a good programmer important for data scientists?

Absolutely, yes. A data scientist needs to be comfortable writing code to
build prototype models.

In addition to building models using code, a data scientist also needs to
research and try new models, try new approaches to solve problems, and
make assumptions to structure data.
Common analytic techniques
DATABASE QUERYING

A query is a specific request for a subset of data or for statistics about
data, formulated in a technical language and posed to a database system.

For example, if an analyst suspects that middle-aged men living in the
Northeast have some particularly interesting churning behavior, she
could compose a SQL query:

SELECT * FROM CUSTOMERS WHERE AGE > 45 AND SEX = 'M' AND DOMICILE = 'NE'
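To try the query, the sketch below runs it against a small in-memory SQLite database from Python. The table layout beyond the three columns named in the query (here, an extra NAME column) and all of the rows are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical CUSTOMERS table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMERS (NAME TEXT, AGE INTEGER, SEX TEXT, DOMICILE TEXT)"
)
rows = [
    ("Al", 52, "M", "NE"),
    ("Bea", 48, "F", "NE"),
    ("Cy", 61, "M", "SW"),
    ("Dan", 30, "M", "NE"),
]
conn.executemany("INSERT INTO CUSTOMERS VALUES (?, ?, ?, ?)", rows)

# The analyst's query from the text.
result = conn.execute(
    "SELECT * FROM CUSTOMERS WHERE AGE > 45 AND SEX = 'M' AND DOMICILE = 'NE'"
).fetchall()
print(result)
conn.close()
```

Only rows that satisfy all three conditions are returned; the query retrieves existing records and does not build a model.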
DATA WAREHOUSING

Data warehouses collect data from across the organization in a format that
enables quick access to historical information and also allows building of
analytical metrics from that data.

Building a data warehouse is a process which requires significant time and
investment.

A data warehouse generally feeds data into machine learning / data mining
algorithms.

We have a separate class for Data Warehousing in Sem 3.


MACHINE LEARNING AND DATA MINING

The collection of methods for extracting (predictive) models from data,
now known as machine learning methods, was developed in several
fields contemporaneously, most notably Machine Learning, Applied
Statistics, and Pattern Recognition.

In machine learning, the algorithm learns from the data it observes and
improves its performance.

We study machine learning in detail in Sem 4.


The field of Data Mining started with finding patterns within large data
sets (e.g. the Apriori algorithm).

The algorithms used for data mining and machine learning are
sometimes the same.

Some people use these terms interchangeably.
