Introduction to Data Mining
• Time-series forecasting
– Part of the sequence or link analysis?
• Visualization
– Another data mining task?
– Covered in Chapter 3
• Data Mining versus Statistics
– Are they the same?
– What is the relationship between the two?
Data Mining Applications (1 of 4)
Data Mining Process: SEMMA
[Figure: the SEMMA cycle, Sample (generate a representative sample of the data) → Explore (visualization and basic description of the data) → Modify (select variables, transform variable representations) → Model (use a variety of statistical and machine learning models) → Assess (evaluate the accuracy and usefulness of the models), with feedback loops among the steps]
Data Mining Process: KDD
• KDD (Knowledge Discovery in Databases) Process
[Figure: the KDD process, Sources for Raw Data → (Data Selection) → Target Data → (Data Cleaning) → Preprocessed Data → (Data Transformation) → Transformed Data → (Data Mining) → Extracted Patterns → (Internalization) → Knowledge, i.e., "Actionable Insight", with feedback loops back to the earlier steps]
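The same flow can be written as a minimal Python sketch. The pandas DataFrame, the file name customers.csv, the column names (age, income, churned), and the cleaning/transformation rules are hypothetical illustrations, not part of the KDD definition itself.

# Minimal sketch of the KDD stages on a hypothetical customer data set.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

raw = pd.read_csv("customers.csv")                  # sources for raw data

target = raw[["age", "income", "churned"]]          # data selection -> target data
clean = target.dropna()                             # data cleaning -> preprocessed data
transformed = clean.assign(                         # data transformation -> transformed data
    income=(clean["income"] - clean["income"].mean()) / clean["income"].std()
)

X, y = transformed[["age", "income"]], transformed["churned"]
model = DecisionTreeClassifier().fit(X, y)          # data mining -> extracted patterns

# Internalization: inspect the extracted patterns and turn them into actionable insight.
print(model.feature_importances_)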
Which Data Mining Process is the Best?
[Figure: poll results, listed from most to least votes: CRISP-DM, My own, SEMMA, KDD Process, My organization's, Domain-specific methodology, None]
Data Mining Methods: Classification
• The most frequently used data mining method; part of the machine-learning family
• Learn patterns from past data, classify new instances into their
respective groups or classes
– Credit approval (good or bad credit risk)
– Target marketing (likely customer, no hope)
– Fraud detection (yes or no)
• Classification versus regression?
– Predicting a class (e.g., sunny, rainy, cloudy) is classification
– Predicting a numeric value (e.g., 680) is regression
• Classification versus clustering?
– Classification is supervised learning
– Clustering is unsupervised learning (discovers natural groups), as sketched below
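A minimal Python sketch of that distinction, using scikit-learn's bundled Iris data; the data set and the choice of decision tree and k-means are illustrative assumptions.

# Supervised classification vs. unsupervised clustering on the same feature matrix.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification: the class labels y are given, and the model learns to predict them.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("predicted classes:", clf.predict(X[:3]))

# Clustering: no labels are used; the algorithm discovers natural groups on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("discovered clusters:", km.labels_[:3])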
Two-Step Methodology - Classification
• Model development/training
– A collection of input data that includes the actual class labels is used for model training
– The trained model is then tested against a holdout sample for accuracy assessment
• Model deployment/use
– The model is deployed for actual use, where it predicts the classes of new data instances whose class labels are unknown (sketched below)
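A minimal Python sketch of the two steps, assuming scikit-learn and its bundled breast-cancer data set; the 2/3 training vs. 1/3 testing split mirrors the split shown later in this section.

# Step 1: model development -- train on labeled data, assess on a holdout sample.
# Step 2: deployment -- predict classes of new instances whose labels are unknown.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=1/3, random_state=0)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_hold, model.predict(X_hold)))

new_instances = X_hold[:5]          # stand-in for incoming, unlabeled data
print("predicted classes:", model.predict(new_instances))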
Factors for Model Assessment - Classification
• Predictive accuracy
– ability to correctly predict class label of new or previously
unseen data
• Speed
– Model building versus predicting/usage speed
• Robustness
– ability to predict correctly given noisy data or data with missing values
• Scalability
– ability to construct the model efficiently for large amounts of data
• Interpretability
– Transparency, explainability
Estimation Methodologies for Classification
• Simple split
• K-fold cross-validation
• Area Under the ROC Curve (AUC)
– ROC: receiver operating characteristic (a term borrowed from radar image processing)
• Leave-one-out
– Similar to k-fold cross-validation with k equal to the number of samples
– Viable for small data sets
• Bootstrapping
– A fixed number of instances is sampled with replacement for training
– The rest of the data is used for testing
– Repeated as many times as desired
• Jackknifing
– Similar to leave-one-out; leaves out one sample at each iteration
(K-fold cross-validation, AUC, and leave-one-out are sketched below.)
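A minimal Python sketch of k-fold cross-validation, AUC, and leave-one-out using scikit-learn; the bundled data set and the logistic regression classifier are illustrative assumptions (leave-one-out is feasible here only because the data set is small).

# Accuracy and AUC estimated by 10-fold cross-validation, plus leave-one-out.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# K-fold cross-validation (k = 10): every fold serves exactly once as the test set.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold accuracy:", cross_val_score(clf, X, y, cv=folds).mean())

# Area under the ROC curve, averaged over the same folds.
print("10-fold AUC:", cross_val_score(clf, X, y, scoring="roc_auc", cv=folds).mean())

# Leave-one-out: k equals the number of samples (viable only for small data sets).
print("leave-one-out accuracy:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())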
Accuracy of Classification Models
• The primary source for accuracy estimation is the confusion matrix (also called the classification matrix or contingency table)
The confusion matrix cross-tabulates the predicted class against the true/observed class:
– Predicted Positive, Observed Positive: True Positive Count (TP)
– Predicted Positive, Observed Negative: False Positive Count (FP)
– Predicted Negative, Observed Positive: False Negative Count (FN)
– Predicted Negative, Observed Negative: True Negative Count (TN)
Common metrics derived from the matrix (computed in the sketch below):
– Accuracy = (TP + TN) / (TP + TN + FP + FN)
– True Positive Rate (Recall) = TP / (TP + FN)
– True Negative Rate = TN / (TN + FP)
– Precision = TP / (TP + FP)
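A minimal Python sketch of these formulas; the confusion-matrix counts are made-up numbers for illustration.

# Accuracy, recall (true positive rate), true negative rate, and precision
# computed from hypothetical confusion-matrix counts.
TP, FP, FN, TN = 85, 15, 10, 90

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)         # true positive rate
tnr = TN / (TN + FP)            # true negative rate
precision = TP / (TP + FP)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} TNR={tnr:.3f} precision={precision:.3f}")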
[Figure: simple split, where the preprocessed data is divided into training data (2/3), used for model development of the trained classifier, and testing data (1/3), used for model assessment (scoring); the resulting confusion matrix (TP, FP, FN, TN) yields the prediction accuracy]
[Figure: ROC curve plot (both axes scaled from 0 to 1)]
[Table: market-basket transactions and itemset support counts, as used in association rule mining and reproduced by the sketch below]
Raw transaction data (Transaction No: SKUs): 1001234: 1, 2, 3, 4; 1001235: 2, 3, 4; 1001236: 2, 3; 1001237: 1, 2, 4; 1001238: 1, 2, 3, 4; 1001239: 2, 4
One-item itemsets (itemset: support): {1}: 3; {2}: 6; {3}: 4; {4}: 5
Two-item itemsets: {1, 2}: 3; {1, 3}: 2; {1, 4}: 3; {2, 3}: 4; {2, 4}: 5; {3, 4}: 3
Three-item itemsets: {1, 2, 4}: 3; {2, 3, 4}: 3
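A minimal Python sketch that reproduces the support counts above by brute-force enumeration of itemsets; a full association-rule miner such as Apriori would additionally prune infrequent candidates, which this sketch does not do.

# Count how many transactions contain each 1-, 2-, and 3-item itemset.
from itertools import combinations
from collections import Counter

transactions = {
    1001234: {1, 2, 3, 4},
    1001235: {2, 3, 4},
    1001236: {2, 3},
    1001237: {1, 2, 4},
    1001238: {1, 2, 3, 4},
    1001239: {2, 4},
}

support = Counter()
for items in transactions.values():
    for size in (1, 2, 3):
        for itemset in combinations(sorted(items), size):
            support[itemset] += 1

# e.g. support[(2, 4)] == 5 and support[(1, 2, 4)] == 3, matching the table above
for itemset, count in sorted(support.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, count)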
Data Mining Software Tools
• Commercial
– IBM SPSS Modeler (formerly Clementine)
– Statistica - Dell/StatSoft
– …
• Free/open source
– KNIME
– Weka, Orange
– R, …
[Figure: poll of data mining software usage, with vote counts: R 1,419; Python 1,325; SQL 1,029; Excel 972; KNIME 521; SciKit-Learn 497; Java 487; Anaconda 462; Unix shell/awk/gawk 301; MATLAB 263; IBM SPSS Statistics 242; Dataiku 227; SAS base 225; Hbase 158; QlikView; Microsoft Azure Machine Learning; other Hadoop/HDFS-based tools; Apache Pig 132; Salford SPM/CART/RF/MARS/TreeNet; Rattle; Gnu Octave 89; and others. Legend: orange = free/open source tools, green = commercial tools, blue = Hadoop/Big Data tools]