Introduction to Data Mining
• Time-series forecasting
– Part of the sequence or link analysis?
• Visualization
– Another data mining task?
– Covered in Chapter 3
• Data Mining versus Statistics
– Are they the same?
– What is the relationship between the two?
Data Mining Applications (1 of 4)
Data Mining Process: SEMMA
[Figure: the SEMMA cycle, Sample (generate a representative sample of the data) → Explore (visualization and basic description of the data) → Modify (select variables, transform variable representations) → Model (use a variety of statistical and machine learning models) → Assess (evaluate the accuracy and usefulness of the models), with feedback loops among the steps]
Data Mining Process: KDD
• KDD (Knowledge Discovery in Databases) Process
[Figure: the KDD process, Sources for Raw Data → (Data Selection) → Target Data → (Data Cleaning) → Preprocessed Data → (Data Transformation) → Transformed Data → (Data Mining) → Extracted Patterns → (Internalization) → Knowledge, i.e., "Actionable Insight", with feedback loops back to the earlier steps]
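The same flow can be written as a minimal Python sketch. The pandas DataFrame, the file name customers.csv, the column names (age, income, churned), and the cleaning/transformation rules are hypothetical illustrations, not part of the KDD definition itself.

# Minimal sketch of the KDD stages on a hypothetical customer data set.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

raw = pd.read_csv("customers.csv")                  # sources for raw data

target = raw[["age", "income", "churned"]]          # data selection -> target data
clean = target.dropna()                             # data cleaning -> preprocessed data
transformed = clean.assign(                         # data transformation -> transformed data
    income=(clean["income"] - clean["income"].mean()) / clean["income"].std()
)

X, y = transformed[["age", "income"]], transformed["churned"]
model = DecisionTreeClassifier().fit(X, y)          # data mining -> extracted patterns

# Internalization: inspect the extracted patterns and turn them into actionable insight.
print(model.feature_importances_)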
Which Data Mining Process is the Best?
[Figure: poll results, listed from most to least votes: CRISP-DM, My own, SEMMA, KDD Process, My organization's, Domain-specific methodology, None]
Data Mining Methods: Classification
• The most frequently used data mining method; part of the machine-learning family
• Learn patterns from past data, classify new instances into their
respective groups or classes
– Credit approval (good or bad credit risk)
– Target marketing (likely customer, no hope)
– Fraud detection (yes or no)
• Classification versus regression?
– Predicting a class (e.g., sunny, rainy, cloudy) is classification
– Predicting a numeric value (e.g., 680) is regression
• Classification versus clustering?
– Classification is supervised learning
– Clustering is unsupervised learning (discovers natural groups), as sketched below
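A minimal Python sketch of that distinction, using scikit-learn's bundled Iris data; the data set and the choice of decision tree and k-means are illustrative assumptions.

# Supervised classification vs. unsupervised clustering on the same feature matrix.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification: the class labels y are given, and the model learns to predict them.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("predicted classes:", clf.predict(X[:3]))

# Clustering: no labels are used; the algorithm discovers natural groups on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("discovered clusters:", km.labels_[:3])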
Two-Step Methodology - Classification
• Model development/training
– A collection of input data that includes the actual class labels is used for model training
– The trained model is then tested against a holdout sample for accuracy assessment
• Model deployment/use
– The model is deployed for actual use, where it predicts the classes of new data instances whose class labels are unknown (sketched below)
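A minimal Python sketch of the two steps, assuming scikit-learn and its bundled breast-cancer data set; the 2/3 training vs. 1/3 testing split mirrors the split shown later in this section.

# Step 1: model development -- train on labeled data, assess on a holdout sample.
# Step 2: deployment -- predict classes of new instances whose labels are unknown.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=1/3, random_state=0)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_hold, model.predict(X_hold)))

new_instances = X_hold[:5]          # stand-in for incoming, unlabeled data
print("predicted classes:", model.predict(new_instances))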
Factors for Model Assessment - Classification
• Predictive accuracy
– ability to correctly predict class label of new or previously
unseen data
• Speed
– Model building versus predicting/usage speed
• Robustness
– ability to predict correctly given noisy data or data with missing values
• Scalability
– ability to construct the model efficiently for large amounts of data
• Interpretability
– Transparency, explainability
Estimation Methodologies for Classification
• Simple split
• K-fold cross-validation
• Area Under the ROC Curve (AUC)
– ROC: receiver operating characteristic (a term borrowed from radar image processing)
• Leave-one-out
– Similar to k-fold cross-validation with k equal to the number of samples
– Viable for small data sets
• Bootstrapping
– A fixed number of instances is sampled with replacement for training
– The rest of the data is used for testing
– Repeated as many times as desired
• Jackknifing
– Similar to leave-one-out; leaves out one sample at each iteration
(K-fold cross-validation, AUC, and leave-one-out are sketched below.)
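A minimal Python sketch of k-fold cross-validation, AUC, and leave-one-out using scikit-learn; the bundled data set and the logistic regression classifier are illustrative assumptions (leave-one-out is feasible here only because the data set is small).

# Accuracy and AUC estimated by 10-fold cross-validation, plus leave-one-out.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# K-fold cross-validation (k = 10): every fold serves exactly once as the test set.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold accuracy:", cross_val_score(clf, X, y, cv=folds).mean())

# Area under the ROC curve, averaged over the same folds.
print("10-fold AUC:", cross_val_score(clf, X, y, scoring="roc_auc", cv=folds).mean())

# Leave-one-out: k equals the number of samples (viable only for small data sets).
print("leave-one-out accuracy:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())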
Accuracy of Classification Models
• The primary source for accuracy estimation is the confusion matrix (also called the classification matrix or contingency table)
The confusion matrix cross-tabulates the predicted class against the true/observed class:
– Predicted Positive, Observed Positive: True Positive Count (TP)
– Predicted Positive, Observed Negative: False Positive Count (FP)
– Predicted Negative, Observed Positive: False Negative Count (FN)
– Predicted Negative, Observed Negative: True Negative Count (TN)
Common metrics derived from the matrix (computed in the sketch below):
– Accuracy = (TP + TN) / (TP + TN + FP + FN)
– True Positive Rate (Recall) = TP / (TP + FN)
– True Negative Rate = TN / (TN + FP)
– Precision = TP / (TP + FP)
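A minimal Python sketch of these formulas; the confusion-matrix counts are made-up numbers for illustration.

# Accuracy, recall (true positive rate), true negative rate, and precision
# computed from hypothetical confusion-matrix counts.
TP, FP, FN, TN = 85, 15, 10, 90

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)         # true positive rate
tnr = TN / (TN + FP)            # true negative rate
precision = TP / (TP + FP)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} TNR={tnr:.3f} precision={precision:.3f}")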
[Figure: simple split, where the preprocessed data is divided into training data (2/3), used for model development of the trained classifier, and testing data (1/3), used for model assessment (scoring); the resulting confusion matrix (TP, FP, FN, TN) yields the prediction accuracy]
[Figure: ROC curve plot (both axes scaled from 0 to 1)]
[Table: market-basket transactions and itemset support counts, as used in association rule mining and reproduced by the sketch below]
Raw transaction data (Transaction No: SKUs): 1001234: 1, 2, 3, 4; 1001235: 2, 3, 4; 1001236: 2, 3; 1001237: 1, 2, 4; 1001238: 1, 2, 3, 4; 1001239: 2, 4
One-item itemsets (itemset: support): {1}: 3; {2}: 6; {3}: 4; {4}: 5
Two-item itemsets: {1, 2}: 3; {1, 3}: 2; {1, 4}: 3; {2, 3}: 4; {2, 4}: 5; {3, 4}: 3
Three-item itemsets: {1, 2, 4}: 3; {2, 3, 4}: 3
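A minimal Python sketch that reproduces the support counts above by brute-force enumeration of itemsets; a full association-rule miner such as Apriori would additionally prune infrequent candidates, which this sketch does not do.

# Count how many transactions contain each 1-, 2-, and 3-item itemset.
from itertools import combinations
from collections import Counter

transactions = {
    1001234: {1, 2, 3, 4},
    1001235: {2, 3, 4},
    1001236: {2, 3},
    1001237: {1, 2, 4},
    1001238: {1, 2, 3, 4},
    1001239: {2, 4},
}

support = Counter()
for items in transactions.values():
    for size in (1, 2, 3):
        for itemset in combinations(sorted(items), size):
            support[itemset] += 1

# e.g. support[(2, 4)] == 5 and support[(1, 2, 4)] == 3, matching the table above
for itemset, count in sorted(support.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, count)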
Data Mining Software Tools
• Commercial
– IBM SPSS Modeler (formerly Clementine)
– Statistica - Dell/StatSoft
– …
• Free/open source
– KNIME
– Weka, Orange
– R, …
[Figure: poll of data mining software usage, with vote counts: R 1,419; Python 1,325; SQL 1,029; Excel 972; KNIME 521; SciKit-Learn 497; Java 487; Anaconda 462; Unix shell/awk/gawk 301; MATLAB 263; IBM SPSS Statistics 242; Dataiku 227; SAS base 225; Hbase 158; QlikView; Microsoft Azure Machine Learning; other Hadoop/HDFS-based tools; Apache Pig 132; Salford SPM/CART/RF/MARS/TreeNet; Rattle; Gnu Octave 89; and others. Legend: orange = free/open source tools, green = commercial tools, blue = Hadoop/Big Data tools]