Data Mining Concepts
Agenda
Drivers
Focus on the customer, competition, and data assets
Enablers
Increased data hoarding
Increased computing power
Statistical and learning algorithms
Improved data collection and management
(all feeding into DM)
Motivation for doing Data Mining
Investment in Data Collection/Data Warehouse
Add value to the data holding
Competitive advantage
More effective decision making
OLTP => Data Warehouse => Decision Support
Work to add value to the data holding
Support high level and long term decision making
Fundamental move in use of Databases
Data Mining - Definition
DM - what it can do
Exploit patterns & relationships in data to produce models
Two uses for models:
Predictive
Descriptive
DM - what it can’t do
Automatically find relationships
without user intervention
when no relationships exist
Data Mining Introduction
Data warehousing
SQL / Ad Hoc Queries / Reporting
Software Agents
Online Analytical Processing (OLAP)
Data Visualization
Data mining - Analysis
Examples of Data Mining
Associations
When paint is sold, paint brushes are also sold 85% of the time
Trends & Variations
Golf ball sales are seasonal, with a summer peak and a winter low
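The association example above (paint brushes sold alongside paint 85% of the time) corresponds to what is usually called the confidence of the rule "paint -> paint brushes". A minimal sketch of how that figure could be computed follows; the baskets and the function name `rule_stats` are made up for illustration and are not data from these slides.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n = len(transactions)
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    # Support: share of all baskets containing both items.
    support = len(with_both) / n
    # Confidence: share of antecedent baskets that also contain the consequent.
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical baskets, for illustration only.
baskets = [
    {"paint", "brush"},
    {"paint", "brush", "tape"},
    {"paint"},
    {"brush"},
    {"paint", "brush"},
]
support, confidence = rule_stats(baskets, "paint", "brush")
print(support, confidence)  # 0.6 0.75
```

With these toy baskets, 3 of the 4 paint baskets also contain a brush, so the confidence is 75%; the 85% on the slide would come from real sales data.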
Examples of Data Mining (contd.)
Data Preparation
Data is the Foundation for Analytics
If you don’t have good data, your analysis will suffer
Rich vs. Poor
Good vs. Bad (quality)
Missing data
Sampling
Random vs. stratified
Data types
Binary vs. Categorical vs. Continuous
High cardinality categorical (e.g., zip codes)
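To illustrate the "random vs. stratified" point above, here is a small sketch using only the Python standard library. The row layout, the `churned` flag, and both function names are assumptions chosen for the example. A simple random sample may over- or under-represent a rare class, whereas a stratified sample draws the same fraction from each class.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, fraction, seed=0):
    """Draw a fraction of the rows uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(rows, round(len(rows) * fraction))


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum defined by key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row per stratum so rare classes are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 10 churned customers out of 100.
rows = [{"id": i, "churned": i % 10 == 0} for i in range(100)]
srs = simple_random_sample(rows, 0.2)
strat = stratified_sample(rows, key=lambda r: r["churned"], fraction=0.2)
```

Here the stratified sample always contains exactly 2 churned rows (20% of 10), while the count in the simple random sample varies with the seed.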
Don’t Make Assumptions About the Data
Data Preparation
Use of samples
visualization tools
Data Preparation
Cross Validation
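The slide names cross validation without detail. One common variant is k-fold cross validation, where the data is split into k parts and each part is held out once for testing. The sketch below only produces the index splits; the function name `k_fold_indices` is an assumption, and in practice the rows would usually be shuffled first.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), len(test))
```

Every row appears in exactly one test fold, so each model is evaluated on data it was not trained on.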
Model building
An iterative process, different for supervised and unsupervised learning
Supervised Model
Driven by real business problems and historical data
Quality of results dependent on quality of data
Want to build a predictive model
Unsupervised Model
Want to find groups of things with similar characteristics
Relevance often an issue
Useful when trying to get an initial understanding of the data
Non-obvious patterns can sometimes pop out of a completed data analysis project
Types of Models
Model monitoring
Determine if the model still ‘works’
Data Mining Process
[Figure: plot over Dose (cc’s), 0 to 1000]
Common distance functions:
• The Euclidean distance (also called distance as the crow flies or 2-norm
distance). A review of cluster analysis in health psychology research found
that the most common distance measure in published studies in that research
area is the Euclidean distance or the squared Euclidean distance.
• The Manhattan distance (also called taxicab norm or 1-norm)
• The maximum norm
• The Mahalanobis distance, which corrects data for different scales and
correlations in the variables
• The angle between two vectors, which can be used as a distance measure when
clustering high-dimensional data
• The Hamming distance (sometimes called edit distance), which measures the
minimum number of substitutions required to change one member into another
of the same length
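The distance functions listed above can be sketched in a few lines of plain Python. The function names are chosen for illustration. The Mahalanobis distance is left out because it additionally needs the inverse covariance matrix of the data.

```python
import math


def euclidean(a, b):
    """2-norm distance: straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """1-norm (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def maximum_norm(a, b):
    """Maximum norm: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def angle(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q), maximum_norm(p, q))  # 5.0 7.0 4.0
print(hamming("karolin", "kathrin"))  # 3
```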
Association
Response curves
How does the response rate of a targeted selection
compare to a random selection?
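The question above is commonly answered with a lift figure: the response rate among the top-scored customers divided by the overall response rate, which is what a random selection would achieve on average. A minimal sketch, assuming each customer has a model score and a 0/1 response flag; the data and the name `lift_at` are made up for illustration.

```python
def response_rate(responses):
    """Share of contacted customers who responded (responses are 0 or 1)."""
    return sum(responses) / len(responses)


def lift_at(scored, fraction):
    """Lift of targeting the top `fraction` by score versus random selection.

    scored: list of (model_score, responded) pairs. Assumes at least one
    customer responded overall.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    targeted = [responded for _, responded in ranked[:k]]
    overall = [responded for _, responded in ranked]
    return response_rate(targeted) / response_rate(overall)


# Hypothetical scores and responses.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0), (0.05, 0)]
lift_top_30 = lift_at(scored, 0.3)
```

A lift above 1 means the targeted selection responds better than a random one. Plotting the lift (or the cumulative share of responders) for increasing fractions gives the response curve.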
Classification
[Figure: classification plot over Dose (cc’s), 0 to 1000]
Time series
Neural Networks
Decision Trees
Rule Induction
K nearest neighbour
Neural Networks
[Figure: neural network diagram. Labels: Input Layer, Middle Layer (functions), Output Layer, Prediction, Actual Data, Learning Module, Feedback]
(Feed Forward) Neural Networks
Decision Trees: Examples
[Figure: decision tree with split conditions Dose < 100, Dose ≥ 100, Dose < 160, Dose ≥ 160]
Advantages & Limitations
Advantages
Models can be built very quickly
Suitable for large data sets
Easy to understand
Gives reasons for a decision taken
Handle non-numeric data very well
Minimum amount of data transformation
Limitations
Leads to an artificial sense of clarity
Trees left to grow without bound take longer to build and become unintelligible
Rule Induction
K Nearest Neighbor
A classification technique
Decides in which class to put a new case
Criterion is to find a number (k) of neighbors having the most similar properties
Assigns a new case to the same class to which most of the neighbors belong
[Figure: Age (0 to 100) plotted against Dose (cc’s) (0 to 1000), with the distance between cases marked]
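The k nearest neighbor idea (find the k most similar known cases, then assign the majority class) can be sketched as follows. Euclidean distance is assumed here since the slide's own distance formula is not recoverable; the training cases, the class labels, and the function name `knn_classify` are invented for illustration.

```python
import math
from collections import Counter


def knn_classify(training, new_case, k=3):
    """Assign new_case to the majority class among its k nearest neighbors.

    training: list of ((age, dose), label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(item[0], new_case))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (age, dose in cc's) cases with made-up labels.
training = [
    ((25, 100), "no effect"),
    ((30, 150), "no effect"),
    ((35, 120), "no effect"),
    ((60, 800), "effect"),
    ((65, 900), "effect"),
    ((70, 850), "effect"),
]
print(knn_classify(training, (28, 130)))  # no effect
print(knn_classify(training, (62, 820)))  # effect
```

Because dose spans 0 to 1000 and age only 0 to 100, dose dominates the raw distance; rescaling both variables to a common range before computing distances is a common remedy.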
K Nearest Neighbor
Quick and easy
Models tend to be very large
Neural Networks
Difficult to interpret
Can require significant amounts of time to train
Rule Induction
Understandable
Need to limit calculations
Decision Trees
Understandable
Supervised / Unsupervised
Supervised Learning:
Unsupervised Learning:
Some Uses of DM
Customer Profiling
Target Marketing
Market Basket Analysis
Fraud Detection
Medical Diagnostics
Direct mail marketing
Web site personalization
Bioinformatics
Anti Money Laundering
Churn Analysis
• Clearly define segments that are strongly divided by their churn-related behavior
CHURN ANALYSIS
Basic Understanding
Information Sources
Pareto analysis
Suggested Analysis
Also called 80/20 analysis. It has been observed that 80% of the revenue or
profit comes from 20% of the customers. The key business improvement is
identifying those 20% and serving them better.
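The 80/20 check described above amounts to ranking customers by revenue and measuring the share contributed by the top 20%. A minimal sketch; the revenue figures and the function name `top_share` are hypothetical.

```python
def top_share(revenue_by_customer, top_fraction=0.2):
    """Return (top customers, their share of total revenue)."""
    ranked = sorted(revenue_by_customer.items(), key=lambda kv: kv[1], reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    top = ranked[:k]
    total = sum(revenue_by_customer.values())
    share = sum(rev for _, rev in top) / total
    return [name for name, _ in top], share


# Hypothetical revenue per customer.
revenue = {"A": 500, "B": 300, "C": 50, "D": 40, "E": 30,
           "F": 25, "G": 20, "H": 15, "I": 10, "J": 10}
top_customers, share = top_share(revenue)
print(top_customers, share)  # ['A', 'B'] 0.8
```

In this toy data the top 2 of 10 customers bring in 80% of revenue; on real data the split will rarely be exactly 80/20, and the share found is the useful output.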
Techniques/ Reports/Algorithms
Suggested Analysis
Loyalty Analysis
A loyal customer is worth a new customer. The aim is to identify loyal
customers and increase their volume. A loyal customer is defined as one who
has been with the company for the last six months. This analysis gives
insight into the details of the various customer bases.
Techniques/ Reports/Algorithms
Suggested Analysis
Customer Profit Analysis
Identifying winning and losing customers. A winning customer is one who
gives increasing revenue month after month, and vice versa. Identify
the characteristics and reasons for better decisions.
Techniques/ Reports/Algorithms
Suggested Analysis
Trend Analysis
Techniques/ Reports/Algorithms
Suggested Analysis
Customer profiling
… active accounts, light users, risky customers, active accounts, loss-making …
profit-making accounts. This segment helps in mapping with the predictive …
segment.
Techniques/ Reports/Algorithms
Suggested Analysis
LTV Analysis
Called Lifetime Value Analysis. Revenue projected over 25 yrs, along with
projected churn loss and rate.
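One simple way to project revenue and churn loss over a horizon is sketched below. It rests on assumptions the slides do not state: a constant annual churn rate, constant annual revenue per customer, and no discounting. The function name `lifetime_value` and the input figures are illustrative only.

```python
def lifetime_value(annual_revenue, annual_churn_rate, years=25):
    """Return (expected revenue, revenue lost to churn) over the horizon.

    Assumes a constant churn rate and no discounting; the chance the
    customer is still active in year t is (1 - churn_rate) ** t.
    """
    retention = 1.0 - annual_churn_rate
    expected = sum(annual_revenue * retention ** t for t in range(years))
    # Loss relative to a customer who never churns.
    churn_loss = annual_revenue * years - expected
    return expected, churn_loss


# Hypothetical customer: 1000 per year, 10% annual churn, 25-year horizon.
expected, churn_loss = lifetime_value(1000.0, 0.10)
```

A real LTV model would typically use segment-specific churn rates and discount future revenue.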
Techniques/ Reports/Algorithms
Suggested Analysis
Churn Modeling
Techniques/ Reports/Algorithms
Suggested Analysis
Survival Analysis
This predicts how long a customer will continue with the existing service,
and what measures can be taken. One popular technique is K. Hazard analysis.
Techniques/ Reports/Algorithms
K. Hazard technique
CHURN ANALYSIS
Suggested Approach
Note
CIBIL
1. It is the first credit information bureau established in India (2003).
CIBIL obtains and shares data on borrowers, both consumer and commercial,
for sound credit decisions, thereby helping to avoid adverse selection.
Data Mining Methodology
Methodologies
CRISP-DM methodology
Wipro - Data mining methodology
CRISP-DM
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Business Understanding
Data collection
Data description
Selection
Data quality assessment and data cleansing
Consolidation and integration
Metadata construction
Load the data mining database
Maintain the data mining database
Explore the data
Identify outliers.
Summarization.
Visualization.
Prepare data for modeling
Select variables
Select rows
Construct new variables
Transform variables
Building and Deploying model