Chapter 6_Data Mining
Chapter 5:
Data Mining
Learning Objectives
◼ Define data mining as an enabling technology
for business intelligence
◼ Understand the objectives and benefits of
business analytics and data mining
◼ Recognize the wide range of applications of data
mining
◼ Learn the standardized data mining processes
◼ CRISP-DM
◼ SEMMA
◼ KDD
(Continued…)
5-2 © Pearson Education Limited 2014
Learning Objectives
◼ Understand the steps involved in data
preprocessing for data mining
◼ Learn different methods and algorithms of data
mining
◼ Build awareness of the existing data mining
software tools
◼ Commercial versus free/open source
◼ Understand the pitfalls and myths of data
mining
[Figure: DATA MINING at the intersection of Statistics, Artificial Intelligence, Pattern Recognition, Machine Learning, Mathematical Modeling, and Databases]
[Figure: data types: Structured; Unstructured or Semi-Structured]
◼ Types of patterns
◼ Association
◼ Prediction
◼ Cluster (segmentation)
◼ Sequential (or time series) relationships
Application Case 5.2
Harnessing Analytics to Combat Crime:
Predictive Analytics Helps Memphis
Police Department Pinpoint Crime and
Focus Police Resources
Questions for Discussion
1. How did the Memphis Police Department use
data mining to better combat crime?
2. What were the challenges, the proposed
solution, and the obtained results?
A Taxonomy for
Data Mining Tasks
[Table columns: Data Mining | Learning Method | Popular Algorithms]
◼ Types of DM
◼ Hypothesis-driven data mining
◼ Discovery-driven data mining
◼ Insurance
◼ Forecast claim costs for better business planning
◼ Determine optimal rate plans
◼ Optimize marketing to specific customers
◼ Identify and prevent fraudulent claim activities
Data Mining Applications
◼ Computer hardware and software
◼ Science and engineering
◼ Government and defense
◼ Homeland security and law enforcement
◼ Travel industry
◼ Healthcare
◼ Medicine
(Callout: increasingly more popular application areas for data mining)
◼ Entertainment industry
◼ Sports
◼ Etc.
Data Mining Process
◼ A manifestation of best practices
◼ A systematic way to conduct DM projects
◼ Different groups have different versions
◼ Most common standard processes:
◼ CRISP-DM (Cross-Industry Standard Process
for Data Mining)
◼ SEMMA (Sample, Explore, Modify, Model, and
Assess)
◼ KDD (Knowledge Discovery in Databases)
Data Mining Process
Source: KDNuggets.com
Data Mining Process: CRISP-DM
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Building
5. Testing and Evaluation
6. Deployment
[Figure: the six steps arranged around Data Sources]
Data Consolidation
· Collect data
· Select data
· Integrate data
Data Transformation
· Normalize data
· Discretize/aggregate data
· Construct new attributes
Well-formed Data
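The "normalize" and "discretize" transformation steps can be illustrated with a minimal sketch. It assumes min-max scaling and equal-width binning, which are only two of several possible choices. The function names and the `ages` values are hypothetical and do not come from the slides.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize_equal_width(values, n_bins):
    """Replace each numeric value with the index of its equal-width bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical attribute values, for illustration only.
ages = [20, 30, 40, 50, 60]
normalized = min_max_normalize(ages)
bins = discretize_equal_width(ages, n_bins=2)
print(normalized)
print(bins)
```

Normalization keeps the relative spacing of the values, while discretization deliberately discards it in exchange for a small number of categories.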
SEMMA
◼ Sample
◼ Explore (Visualization and basic description of the data)
◼ Modify (Select variables, transform variable representations)
◼ Model (Use a variety of statistical and machine learning models)
◼ Assess (Evaluate the accuracy and usefulness of the models)
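Confusion matrix (rows: predicted class; columns: true class)
                     True Class: Positive         True Class: Negative
Predicted Positive   True Positive Count (TP)     False Positive Count (FP)
Predicted Negative   False Negative Count (FN)    True Negative Count (TN)

$\text{True Positive Rate} = \dfrac{TP}{TP + FN}$
$\text{True Negative Rate} = \dfrac{TN}{TN + FP}$
$\text{Precision} = \dfrac{TP}{TP + FP}$
$\text{Recall} = \dfrac{TP}{TP + FN}$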
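The four rates defined from the confusion matrix can be computed directly from the cell counts. The sketch below is only an illustration: the function name and the counts passed to it are hypothetical, not results from the slides.

```python
def classification_rates(tp, fp, fn, tn):
    """Compute the rates defined from the confusion-matrix counts."""
    return {
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
rates = classification_rates(tp=30, fp=10, fn=20, tn=40)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Recall and the true positive rate share the formula TP / (TP + FN), so they always agree.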
[Figure: Preprocessed Data is split, with 1/3 used as Testing Data; the Classifier Model is assessed (scored) on the Testing Data to estimate Prediction Accuracy]
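Holding out 1/3 of the preprocessed data for testing can be sketched as follows. The example assumes a random shuffle before the split and uses a deliberately trivial majority-class classifier as a stand-in for a real model. All function names and the toy records are hypothetical.

```python
import random


def holdout_split(records, test_fraction=1 / 3, seed=0):
    """Shuffle the records and hold out a fraction of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training, testing)


def train_majority_classifier(train_records):
    """Stand-in 'model': always predicts the most common training label."""
    labels = [label for _, label in train_records]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def accuracy(model, test_records):
    """Share of testing records whose label the model predicts correctly."""
    correct = sum(1 for x, label in test_records if model(x) == label)
    return correct / len(test_records)


# Hypothetical labelled records: (attribute value, class label).
records = [(x, x >= 5) for x in range(12)]
train, test = holdout_split(records)
model = train_majority_classifier(train)
test_accuracy = accuracy(model, test)
print(len(train), len(test), test_accuracy)
```

The model never sees the testing records during training, which is what makes the accuracy estimate meaningful.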
[Figure: curves A, B, and C; y-axis: True Positive Rate (Sensitivity); both axes run from 0 to 1]
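[Table: items per transaction and itemset counts for one-, two-, and three-item itemsets]
No. | Items      | Itemset | Count | Itemset | Count | Itemset | Count
1   | 1, 2, 3, 4 | 1       | 3     | 1, 2    | 3     | 1, 2, 4 | 3
1   | 2, 3, 4    | 2       | 6     | 1, 3    | 2     | 2, 3, 4 | 3
1   | 2, 3       | 3       | 4     | 1, 4    | 3     |         |
1   | 1, 2, 4    | 4       | 5     | 2, 3    | 4     |         |
1   | 1, 2, 3, 4 |         |       | 2, 4    | 5     |         |
1   | 2, 4       |         |       | 3, 4    | 3     |         |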
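The itemset counts in the table above can be reproduced by counting, for each candidate itemset, how many of the six item lists contain it. The sketch below uses brute-force enumeration rather than a pruning algorithm, and the minimum count of 3 used to filter the three-item itemsets is an assumption inferred from the table, not a value stated on the slides. Function names are hypothetical.

```python
from itertools import combinations

# The six item lists from the table, one set per transaction.
transactions = [
    {1, 2, 3, 4},
    {2, 3, 4},
    {2, 3},
    {1, 2, 4},
    {1, 2, 3, 4},
    {2, 4},
]


def support_counts(transactions, size):
    """Count, for every itemset of the given size, the transactions containing it."""
    items = sorted(set().union(*transactions))
    return {
        itemset: sum(1 for t in transactions if set(itemset) <= t)
        for itemset in combinations(items, size)
    }


def frequent_itemsets(transactions, size, min_count):
    """Keep only itemsets found in at least min_count transactions."""
    counts = support_counts(transactions, size)
    return {s: c for s, c in counts.items() if c >= min_count}


one_item = support_counts(transactions, 1)
two_item = support_counts(transactions, 2)
# min_count=3 is an assumed threshold, not taken from the slides.
three_item = frequent_itemsets(transactions, 3, min_count=3)
print(one_item)
print(two_item)
print(three_item)
```

Enumerating every combination is only practical for a handful of items; it is used here because it makes the counting rule explicit.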
Software
◼ Commercial
◼ IBM SPSS Modeler (formerly Clementine)
◼ … many more
◼ RapidMiner
◼ Weka…
[Figure: bar chart of data mining software, counts in parentheses; axis from 0 to 300]
Weka / Pentaho (118)
StatSoft Statistica (112)
SAS (101)
Rapid-I RapidAnalytics (83)
MATLAB (80)
IBM SPSS Statistics (62)
IBM SPSS Modeler (54)
SAS Enterprise Miner (46)
Orange (42)
Microsoft SQL Server (40)
Tableau (35)
Oracle Data Miner (35)
Other commercial software (32)
Revolution Computing (11)
Salford SPM/CART/MARS/TreeNet/RF (9)
XLSTAT (7)
WordStat (3)
Predixion Software (3)
Source: KDNuggets.com
[Figure: bar chart of languages, counts in parentheses]
SQL (185)
Java (138)
Python (119)
C/C++ (66)
Other languages (57)
Perl (37)
Awk/Gawk/Shell (31)
F# (5)
◼ Questions, comments