Introduction to Data Mining

Data mining involves discovering patterns in data sets through mathematical models, focusing on explanatory and predictive patterns. It contrasts with statistics by using large existing data sets to uncover novel insights rather than testing predefined hypotheses. Applications span industries including customer relationship management, banking, retail, and healthcare, employing processes such as CRISP-DM and techniques such as classification and clustering.


How Data Mining Works

• Data mining models try to discover patterns among attributes presented in the data set (relevant data from within and outside the organization)
• Models are mathematical representations that identify patterns
among attributes of things, such as customers, events
• Two types of patterns
– Explanatory: explaining relationships and affinities among
the attributes
– Predictive: foretelling future values of certain attributes
• Four major types of patterns
– Associations
– Predictions
– Clusters
– Sequential relationships
Statistics vs Data Mining
• Statistics
– Starts with a well-defined proposition and hypothesis
– Collects sample data to test the hypothesis (primary data)
– Works with the "right size" of data: a few hundred data points is considered large
• Data Mining
– Starts with a loosely defined discovery statement
– Uses all existing data to discover novel patterns and relationships (observational and secondary data)
– Uses big data sets: several million data points are called large
Other Data Mining Patterns/Tasks

• Time-series forecasting
– Part of the sequence or link analysis?
• Visualization
– Another data mining task?
– Covered in Chapter 3
• Data Mining versus Statistics
– Are they the same?
– What is the relationship between the two?
Data Mining Applications (1 of 4)

• Customer Relationship Management
– Maximize return on marketing campaigns
– Improve customer retention (churn analysis)
– Maximize customer value (cross-, up-selling)
– Identify and treat most valued customers
• Banking and Other Financial Services
– Automate the loan application process
– Detect fraudulent transactions
– Maximize customer value (cross-, up-selling)
– Optimize cash reserves with forecasting
Data Mining Applications (2 of 4)

• Retailing and Logistics
– Optimize inventory levels at different locations
– Improve the store layout and sales promotions
– Optimize logistics by predicting seasonal effects
– Minimize losses due to limited shelf life
• Manufacturing and Maintenance
– Predict/prevent machinery failures
– Identify anomalies in production systems to optimize the use of manufacturing capacity
– Discover novel patterns to improve product quality
Data Mining Applications (3 of 4)

• Brokerage and Securities Trading
– Predict changes in certain bond prices
– Forecast the direction of stock fluctuations
– Assess the effect of events on market movements
– Identify and prevent fraudulent activities in trading
• Insurance
– Forecast claim costs for better business planning
– Determine optimal rate plans
– Optimize marketing to specific customers
– Identify and prevent fraudulent claim activities
Data Mining Applications (4 of 4)

• Computer hardware and software
• Science and engineering
• Government and defense
• Homeland security and law enforcement
• Travel, entertainment, sports
• Healthcare and medicine
• … virtually everywhere…
Application Case 4.3

Predictive Analytics and Data Mining Help Stop Terrorist Funding
Questions for Discussion
1. How can data mining be used to fight terrorism?
Comment on what else can be done beyond what is
covered in this short application case.
2. Do you think data mining, although essential for fighting
terrorist cells, also jeopardizes individuals’ rights of
privacy?
Data Mining Process
• A manifestation of best practices
• A systematic way to conduct data mining projects
• Moves data mining projects from art to science
• Most common standard processes:
– CRISP-DM (Cross-Industry Standard Process for Data Mining)
– SEMMA (Sample, Explore, Modify, Model, and Assess)
– KDD (Knowledge Discovery in Databases)
• Even though sequential steps are proposed, the process is not linear
– There is a great deal of backtracking
– Iterative and time consuming process
Step 1: Business Understanding
• To understand what the business wants to solve
• Determine the business question and objective
– What to solve from the business perspective, what the
customer wants, and define the business success criteria
• Determine the project goals
– What are the common characteristics of customers we lost
to our competitors recently?
– What are typical profiles of our customers, and how much
value does each of them provide to us?
• Project plan
– Try to create a detailed plan for each project phase and
what kind of tools you would use
– Budget to support the study
Step 2: Data Understanding
• Identify relevant data based on the business task to be
addressed
– Be clear and concise in describing the data mining task
– To better understand the data, use a variety of statistical and graphical tools
– Data sources for data selection can vary
▪ Demographic, sociographic, transactional, social media
– May include quantitative and qualitative data
Step 3: Data Preparation
• Also referred to as data pre-processing
• Prepare data for analysis
• This step usually consumes about 80% of the project time
• Real-world data is (see the sketch after this list)
– Incomplete: lacking attribute values, lacking attributes of interest, or containing only aggregated data
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in values, codes, and names
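
A minimal pre-processing sketch in Python/pandas, assuming a hypothetical customers.csv file with income, age, and gender columns, illustrating one possible handling of each of the three data problems above:

```python
import pandas as pd

# Hypothetical input file and column names -- adjust to your own data
df = pd.read_csv("customers.csv")

# Incomplete: fill missing numeric values with the column median
df["income"] = df["income"].fillna(df["income"].median())

# Noisy: clip extreme outliers to the 1st/99th percentiles
low, high = df["age"].quantile([0.01, 0.99])
df["age"] = df["age"].clip(low, high)

# Inconsistent: normalize category codes to a single convention
df["gender"] = (df["gender"].str.strip().str.upper()
                .replace({"F": "FEMALE", "M": "MALE"}))
```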
Step 4: Model Building
• There is no universally known best method or algorithm for a
data mining task
• Model building includes assessment and comparative analysis
of various models
• For a single method, a number of parameters need to be
calibrated to obtain optimum results
• Identify the best method for a given purpose
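
A sketch of parameter calibration with scikit-learn, assuming an arbitrary sample data set and parameter grid; cross-validated grid search keeps the setting with the best average accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Calibrate one method's parameters via cross-validated grid search
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 5, None]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```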
Step 5: Testing and Evaluation
• This is a critical and challenging task
• Developed models are assessed and evaluated for their
accuracy and generality
– Assess degree to which selected model meets the
business objectives
• Test the developed models in real-world scenarios if time and budget constraints permit
• Data mining adds no value until the discovered knowledge patterns are identified, recognized, and translated into business value
– Depends on interaction of data analysts, business
analysts, and decision makers
Step 6: Deployment
• Deployment can be as simple as generating a report or as
complex as implementing a repeatable data mining process
across the enterprise
• Deployment is often done by the customer, not the data analyst
• It also includes the maintenance activities for the deployed
model
– Over time, models built on old data may become obsolete,
irrelevant, or misleading
• To monitor the deployment of the data mining results, the
project needs a detailed plan on the monitoring process, which
may not be a trivial task for complex data mining models
Data Mining Process: SEMMA
• Developed by SAS Institute
• Five steps arranged in a cycle, with feedback from Assess back to Sample:
– Sample: generate a representative sample of the data
– Explore: visualization and basic description of the data
– Modify: select variables, transform variable representations
– Model: use a variety of statistical and machine learning models
– Assess: evaluate the accuracy and usefulness of the models
Data Mining Process: KDD
• KDD (Knowledge Discovery in Databases) Process
• A sequence of steps with feedback loops between them, starting from the sources of raw data:
1. Data Selection: raw data → target data
2. Data Cleaning: target data → preprocessed data
3. Data Transformation: preprocessed data → transformed data
4. Data Mining: transformed data → extracted patterns
5. Internalization: extracted patterns → knowledge (“actionable insight”, e.g., a deployment chart)
Which Data Mining Process is the Best?
• Practitioner poll results (votes, most to least popular): CRISP-DM, my own, SEMMA, KDD Process, my organization's, domain-specific methodology, none, other methodology (not domain specific)
• CRISP-DM is by far the most widely used
Data Mining Methods: Classification
• The most frequently used data mining method; part of the machine-learning family
• Learn patterns from past data, then classify new instances into their respective groups or classes
– Credit approval (good or bad credit risk)
– Target marketing (likely customer, no hope)
– Fraud detection (yes or no)
• Classification versus regression?
– Predicting a class (sunny, rainy, cloudy) is classification
– Predicting a numerical value (e.g., 680) is regression
• Classification versus clustering?
– Classification is supervised learning
– Clustering is unsupervised learning (discovers natural groups)
Two-Step Methodology - Classification
• Model development/training
– A collection of input data including actual class labels is used for model training
– The model is then tested against a holdout sample for accuracy assessment
• Model deployment
– The model is deployed for actual use, where it predicts the classes of new data instances (whose class labels are unknown)
Factors for Model Assessment - Classification
• Predictive accuracy
– ability to correctly predict class label of new or previously
unseen data
• Speed
– Model building versus predicting/usage speed
• Robustness
– Ability to predict correctly given noisy data or data with missing values
• Scalability
– Ability to construct an efficient model given large amounts of data
• Interpretability
– Transparency, explainability
Estimation Methodologies for Classification
• Simple split
• k-fold cross-validation
• Area Under the ROC Curve (AUC)
– ROC: receiver operating characteristic (a term borrowed from radar image processing)
• Leave-one-out
– Similar to k-fold, where k = number of samples
– Viable for small data sets
• Bootstrapping
– A fixed number of instances is sampled with replacement for training
– The rest of the data is used for testing
– Repeated as many times as desired
• Jackknifing
– Similar to leave-one-out; leaves one sample out at each iteration
Accuracy of Classification Models
• The primary source for accuracy estimation is the confusion matrix (also called the classification matrix or contingency table), which tabulates predicted versus true/observed classes:

                      True/Observed Class
                      Positive               Negative
Predicted Positive    True Positive (TP)     False Positive (FP)
Predicted Negative    False Negative (FN)    True Negative (TN)

• Metrics derived from the confusion matrix:
– Accuracy = (TP + TN) / (TP + TN + FP + FN)
– True Positive Rate (Recall) = TP / (TP + FN)
– True Negative Rate = TN / (TN + FP)
– Precision = TP / (TP + FP)
• Used to estimate future prediction accuracy
• Used to choose a classifier from a given set
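
A minimal sketch of these formulas in Python, assuming the four counts have already been read off a confusion matrix (the example counts are hypothetical):

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics computed from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # of predicted positives, how many are real
    recall    = tp / (tp + fn)  # true positive rate
    tnr       = tn / (tn + fp)  # true negative rate
    return accuracy, precision, recall, tnr

# Hypothetical counts read off a confusion matrix
print(classification_metrics(tp=80, fp=10, fn=20, tn=90))
```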
Single/Simple Split
• Also referred to as holdout or test sample estimation
– Split the data into two mutually exclusive sets: training (~70%) and testing (~30%)
– Assumes that the two subsets have the same statistical properties
– The training data is used for model development; the trained classifier is then scored against the testing data, and the resulting confusion matrix (TP, FP, FN, TN) gives the prediction accuracy
– For neural networks, the data is split into three subsets (training [~60%], validation [~20%], testing [~20%])
– The validation set is used during model building to prevent overfitting
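
A simple-split sketch with scikit-learn; the breast-cancer sample data set and logistic regression stand in for real project data and models:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# ~2/3 for training, ~1/3 for testing, as in the simple split above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# 2x2 confusion matrix of the holdout predictions: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_test, clf.predict(X_test)))
```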
k-Fold Cross Validation (rotation estimation)
• Data is split into k mutually exclusive subsets (folds) of approximately equal size
• The classification model is trained and tested k times
• Each time it is trained on k−1 folds and tested on the remaining fold
• The cross-validation estimate is the average of the k individual accuracy values
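
A k-fold sketch with scikit-learn, here with k = 10; the data set and classifier are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# k = 10: train on 9 folds, test on the remaining fold, 10 times over
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)
print(scores.mean())  # cross-validation estimate = average accuracy
```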
Area Under the ROC Curve (AUC)
• Works with binary classification
• The ROC curve plots the true positive rate against the false alarm rate (1 − specificity); the area under the curve summarizes classifier performance (A = 0.84 in the sample plot)
• Produces values from 0 to 1.0
• Random chance is 0.5 and perfect classification is 1.0
• Produces a good assessment for skewed class distributions too!
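
A minimal AUC computation with scikit-learn; the labels and predicted probabilities are hypothetical:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical binary labels and predicted positive-class probabilities
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]

print(roc_auc_score(y_true, y_score))  # 1.0 = perfect, 0.5 = random
```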
Classification Techniques
• Decision tree analysis
• Statistical analysis
– Logistic regression
– Discriminant analysis
• Neural networks
• Support vector machines
• Case-based reasoning
• Bayesian classifiers
• Genetic algorithms
• Rough sets
Decision Trees
• Employs a divide-and-conquer method
• Recursively divides the training set until each division consists of examples from one class or is relatively small
• A general algorithm (steps) for building a decision tree
1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split.
– Split data into mutually exclusive (non-overlapping)
subsets along the lines of the specific split and move to
the branches
4. Repeat steps 2 and 3 for each and every leaf node until a stopping criterion is reached
– e.g., the node is dominated by a single class label
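
A sketch of the algorithm above using scikit-learn's decision tree, with Gini impurity as the splitting criterion and a depth limit as a simple stopping criterion (data set is a placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Recursive partitioning: Gini impurity chooses each split; max_depth
# acts as a simple stopping criterion
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree))  # the learned splits as if-then rules
```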
Decision Tree - Examples
Decision Trees
• DT algorithms mainly differ on
1. Splitting criteria
▪ Which variable, what value, etc.
2. Number of splits at each node
▪ Binary versus ternary
3. Stopping criteria
▪ When to stop building the tree
4. Pruning (generalization method)
▪ Pre-pruning versus post-pruning
• Most popular DT algorithms include
– Iterative Dichotomiser 3 (ID3), (Newer versions C4.5, C5)
– Classification and Regression Trees (CART) from statistics
– Chi-square Automatic Interaction Detector (CHAID)
Splitting Indices – Gini
• The goal of splitting is to determine the attribute and split point that best divide the data, purifying the class representation at each node
• Indices are used to evaluate the goodness of a split
• The Gini index (or Gini impurity) measures the probability of an element being wrongly classified when it is labeled randomly according to the class distribution at the node; used in the CART algorithm
– If all the elements belong to a single class, the node is called pure
– The Gini index varies between 0 and 1, where:
– 0 denotes that all elements belong to a single class (or only one class exists)
– 1 denotes that the elements are randomly distributed across many classes
– A Gini index of 0.5 denotes elements equally distributed across two classes
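
A minimal Gini impurity function illustrating the boundary values described above:

```python
def gini(class_counts):
    """Gini impurity of a node, given the count of each class at the node."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

print(gini([10, 0]))  # 0.0 -> pure node: all elements in one class
print(gini([5, 5]))   # 0.5 -> two equally represented classes
```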
Splitting Indices – Information Gain
• Information gain is used as the splitting criterion in the ID3 algorithm
• Entropy is used in place of the Gini index
• Entropy measures the extent of uncertainty or randomness in a data set
– Entropy is 0 if all members belong to the same class
– Entropy is 1 (for two classes) when half the members belong to one class and half to the other, i.e., perfect randomness
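
A matching entropy function illustrating the 0 and 1 boundary values:

```python
import math

def entropy(class_counts):
    """Entropy of a node, given the count of each class at the node."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

print(entropy([10, 0]))  # 0.0 -> all members in the same class
print(entropy([5, 5]))   # 1.0 -> perfect randomness (two classes)
```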
Cluster Analysis for Data Mining
• The process of finding similar structures in a set of unlabelled data to make the data easier to understand and manipulate
• Reveals subgroups within a heterogeneous data set such that each individual cluster has greater homogeneity than the whole
• Part of the machine-learning family
• Employs unsupervised learning
• Learns the clusters of things from past data, then assigns new instances
• There is no output/target variable
• In marketing, it is also known as segmentation
• The machine learns the attributes and trends by itself, without any provided input-output mapping
Use of Cluster Analysis Results
• Identify natural groupings of customers
• Identify rules for assigning new cases to classes for
targeting/diagnostic purposes
• Provide characterization, definition, labeling of populations
• Decrease the size and complexity of problems for other
data mining methods
• Identify outliers in a specific domain (e.g., rare-event
detection)
Determining the Optimal Number of Clusters
• Clustering algorithms require one to specify the number of clusters to find
• There is no universally optimal way to calculate this number
• The following heuristic methods are commonly used:
– Choose the number of clusters at which adding another cluster would not give much better modelling of the data (the marginal gain drops)
– Set it to k ≈ (n/2)^(1/2), where n is the number of data points
– Use the Akaike Information Criterion (AIC), a measure of goodness of fit (based on the concept of entropy)
– Use the Bayesian Information Criterion (BIC), a model-selection criterion
Cluster Analysis Methods
• Statistical methods
– (including both hierarchical and nonhierarchical), such
as k-means, k-modes, and so on.
• Neural networks
– adaptive resonance theory [ART]
– self-organizing map [SOM])
• Fuzzy logic
– c-means algorithm
• Genetic algorithms
Clustering Method – General Approach
• Two general method classes
– Divisive: all items start in one cluster and are broken apart
– Agglomerative: all items start in individual clusters, and the
clusters are joined together
• Distance measure is used to calculate the closeness between
pairs of items. Popular distance measures are:
– Euclidean: ordinary distance measured with a ruler
– Manhattan: rectilinear or taxicab distance between two points
k-Means Clustering Algorithm
• One of the most well-known clustering methods
– k: a pre-determined number of clusters
– Each data point (customer, event, object, etc.) is assigned to the nearest cluster center (centroid)
– A center is the average of all the points in its cluster
• Algorithm (Step 0: determine the value of k)
1. Randomly generate k points as initial cluster centers.
2. Assign each point to the nearest cluster center.
3. Take the mean of the data points in each cluster and re-compute the cluster centers.
Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable).
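
A k-means sketch with scikit-learn; make_blobs generates hypothetical unlabeled data with three natural groupings:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical unlabeled data with three natural groupings
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the final centroids (cluster means)
print(km.predict(X[:5]))    # assign instances to the nearest centroid
```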
Sample Dataset with Different Clusters
Association Rule Mining
• A very popular DM method in business
• Finds interesting relationships (affinities) between variables (items or events)
• There is no output variable
• Also known as market basket analysis
• By analysing customers' past buying behaviour, we can find out which products are frequently bought together
• Discovering this kind of association helps retailers and marketers develop marketing strategies by gaining insight into which items customers buy together
– e.g., the classic relationship between diapers and beer
Association Rule Mining
• Input: the simple point-of-sale transaction data
• Output: Most frequent affinities among items
• Example: according to the transaction data…
“Customers who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time.”
• How do you use such a pattern/knowledge?
– Put the items next to each other, making it more convenient for customers to pick them up and not forget to buy them
– Promote the items as a package (put only one on sale)
– Place the items far apart from each other, so that the customer has to walk the aisles and, while searching, potentially sees and buys other items
Association Rule Mining - Applications
• Business: cross-marketing, cross-selling, store design, catalog
design, e-commerce site design, optimization of online
advertising, product pricing, and sales/promotion configuration
• Medicine: relationships between symptoms and illnesses;
diagnosis and patient characteristics and treatments (to be
used in medical D S S); and genes and their functions (to be
used in genomics projects)
• Credit card transactions: can help identify other products a customer is likely to purchase, or flag fraudulent use of a credit card
• Bundles of services/products bought by a customer can be used to propose additional services in banking, insurance, and telecom
• Medical records: certain combinations of conditions can indicate increased risk of various complications, or certain treatments at certain medical facilities can be tied to certain types of infections
Association Rule Mining
• Are all association rules interesting and useful?
• A generic rule: X ⇒ Y [S%, C%]
– X, Y: products and/or services
– X: left-hand side (LHS)
– Y: right-hand side (RHS)
– S (Support): how often X and Y appear together in the data
– C (Confidence): how often Y appears in transactions that contain X
• Example: {Laptop Computer, Antivirus Software} ⇒ {Extended Service Plan} [30%, 70%]
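
A sketch of how S and C are computed; the five transactions below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Hypothetical transactions, each a set of purchased items
transactions = [
    {"laptop", "antivirus", "service_plan"},
    {"laptop", "antivirus", "service_plan"},
    {"laptop", "antivirus"},
    {"laptop"},
    {"antivirus"},
]

X = {"laptop", "antivirus"}   # left-hand side
Y = {"service_plan"}          # right-hand side

n_X  = sum(X <= t for t in transactions)        # transactions containing X
n_XY = sum((X | Y) <= t for t in transactions)  # transactions with X and Y

support    = n_XY / len(transactions)  # how often X and Y go together
confidence = n_XY / n_X                # how often Y appears given X
print(support, confidence)             # 0.4, 0.666...
```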
Association Rule Mining - Algorithms
• Several algorithms have been developed for discovering (identifying) association rules
– Apriori
– Eclat
– FP-Growth
– + Derivatives and hybrids of the three
• The algorithms help identify the frequent itemsets, which are
then converted to association rules
Apriori Algorithm
• Finds subsets that are common to at least a minimum number
of the itemsets
• Uses a bottom-up approach
– frequent subsets are extended one item at a time (the size
of frequent subsets increases from one-item subsets to
two-item subsets, then three-item subsets, and so on), and
– groups of candidates at each level are tested against the
data for minimum support

Example (support = number of transactions containing the itemset):

Raw Transaction Data          One-item Itemsets     Two-item Itemsets     Three-item Itemsets
Transaction No   SKUs         Itemset   Support     Itemset   Support     Itemset    Support
1001234          1, 2, 3, 4   1         3           1, 2      3           1, 2, 4    3
1001235          2, 3, 4      2         6           1, 3      2           2, 3, 4    3
1001236          2, 3         3         4           1, 4      3
1001237          1, 2, 4      4         5           2, 3      4
1001238          1, 2, 3, 4                         2, 4      5
1001239          2, 4                               3, 4      3
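
A brute-force sketch of the level-wise counting behind the table above (candidate generation is simplified relative to the full Apriori join step). With a minimum support of 3, it prints the frequent itemsets from the table; {1, 3} has support 2, so it is pruned and generates no three-item candidates:

```python
from itertools import combinations

# The six transactions from the table above (SKU numbers)
transactions = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3},
                {1, 2, 4}, {1, 2, 3, 4}, {2, 4}]
MIN_SUPPORT = 3  # itemsets appearing in fewer transactions are pruned

def support(itemset):
    return sum(itemset <= t for t in transactions)

# Level 1: frequent one-item itemsets
frequent = [frozenset([i]) for i in set().union(*transactions)
            if support(frozenset([i])) >= MIN_SUPPORT]

# Bottom-up: extend frequent k-item sets to (k+1)-item candidates
while frequent:
    for itemset in sorted(frequent, key=sorted):
        print(sorted(itemset), support(itemset))
    size = len(next(iter(frequent))) + 1
    items = sorted(set().union(*frequent))
    candidates = {frozenset(c) for c in combinations(items, size)}
    frequent = [c for c in candidates if support(c) >= MIN_SUPPORT]
```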
Data Mining Software Tools
• Commercial
– IBM SPSS Modeler (formerly Clementine)
– SAS Enterprise Miner
– Statistica (Dell/StatSoft)
– … many more
• Free and/or Open Source
– KNIME
– RapidMiner
– Weka
– R, …
• Popularity poll results (votes per tool; the poll mixes free/open-source, commercial, and Hadoop/Big Data tools): R (1,419), Python (1,325), SQL (1,029), Excel (972), RapidMiner (944), Hadoop (641), Spark (624), Tableau (536), KNIME (521), SciKit-Learn (497), Java (487), Anaconda (462), Hive (359), MLlib (337), Weka (315), Microsoft SQL Server (314), Unix shell/awk/gawk (301), MATLAB (263), IBM SPSS Statistics (242), Dataiku (227), SAS Base (225), IBM SPSS Modeler (222), SQL on Hadoop tools (211), C/C++ (210), other free analytics/data mining tools (198), other programming and data languages (197), H2O (193), Scala (180), SAS Enterprise Miner (162), Microsoft Power BI (161), HBase (158), QlikView (153), Microsoft Azure Machine Learning (147), other Hadoop/HDFS-based tools (141), Apache Pig (132), IBM Watson (121), Salford SPM/CART/RF/MARS/TreeNet (103), Rattle (100), Gnu Octave (89), Orange (89)
