Data Mining Tutorial
Gregory Piatetsky-Shapiro
KDnuggets
© 2006 KDnuggets
Outline
Introduction
Data Mining Tasks
Classification & Evaluation
Clustering
Application Examples
2
© 2006 KDnuggets
Trends leading to Data Flood
More data is generated:
Web, text, images …
Business transactions, calls, ...
Scientific data: astronomy, biology, etc.
More data is captured:
Storage technology faster and cheaper
DBMS can handle bigger DB
3
© 2006 KDnuggets
Largest Databases in 2005
Winter Corp. 2005 Commercial Database Survey:
1. Max Planck Inst. for Meteorology, 222 TB
2. Yahoo, ~100 TB (largest data warehouse)
3. AT&T, ~94 TB
www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp
4
© 2006 KDnuggets
Data Growth
5
© 2006 KDnuggets
Data Growth Rate
6
© 2006 KDnuggets
Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
valid,
novel,
potentially useful,
and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
7
© 2006 KDnuggets
Related Fields
(Diagram: Data Mining and Knowledge Discovery at the intersection of Machine Learning, Visualization, Statistics, and Databases.)
8
© 2006 KDnuggets
Statistics, Machine Learning and
Data Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics – areas not part of data mining
Data Mining and Knowledge Discovery
integrates theory and heuristics
focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
Distinctions are fuzzy
9
© 2006 KDnuggets
Knowledge Discovery Process
flow, according to CRISP-DM
(Diagram: the CRISP-DM process cycle; see www.crisp-dm.org for more information.)
Continuous monitoring and improvement is an addition to CRISP.
10
© 2006 KDnuggets
Historical Note:
Many Names of Data Mining
Data Fishing, Data Dredging: 1960-
used by statisticians (as a derogatory term)
© 2006 KDnuggets
Some Definitions
Attribute or Field
measures an aspect of an instance, e.g. temperature
Class (Label)
a grouping of instances, e.g. days good for playing
13
© 2006 KDnuggets
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships
…
14
© 2006 KDnuggets
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
15
© 2006 KDnuggets
Clustering
Find “natural” grouping of
instances given un-labeled data
16
© 2006 KDnuggets
Association Rules &
Frequent Itemsets
Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (3)
Milk, Bread, Cereal (2)
…

Rules:
Milk => Bread (66%)
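As an added sketch (not part of the original slides), the itemset supports and the rule confidence above can be counted directly in Python; the transaction list below mirrors the table.

```python
from itertools import combinations
from collections import Counter

# Transactions from the table above, in TID order
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

# Count the support (number of transactions) of every itemset of size 1 to 3
support = Counter()
for t in transactions:
    for size in (1, 2, 3):
        for itemset in combinations(sorted(t), size):
            support[itemset] += 1

print(support[("BREAD", "MILK")])                       # 4, the support of {Milk, Bread}
# Confidence of Milk => Bread: support(Milk, Bread) / support(Milk)
print(support[("BREAD", "MILK")] / support[("MILK",)])  # about 0.67, i.e. the 66% above
```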
17
© 2006 KDnuggets
Visualization & Data Mining
Visualizing the data to facilitate human discovery
Presenting the discovered results in a visually "nice" way
18
© 2006 KDnuggets
Summarization
© 2006 KDnuggets
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances
Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes wi from data to minimize squared error to 'fit' the data
Not flexible enough
23
© 2006 KDnuggets
Regression for Classification
Any regression technique can be used for classification
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
Prediction: predict the class corresponding to the model with the largest output value (membership value)
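To make the recipe concrete, here is a rough sketch (my illustration, not from the tutorial) of one-regression-per-class using scikit-learn's LinearRegression; the toy data and class names are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: two numeric attributes, three classes (made-up values)
X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0], [1.0, 8.0], [2.0, 9.0]])
y = np.array(["a", "a", "b", "b", "c", "c"])

# Training: one regression per class, target is 1 for that class and 0 otherwise
models = {}
for cls in np.unique(y):
    models[cls] = LinearRegression().fit(X, (y == cls).astype(float))

# Prediction: the class whose regression gives the largest output (membership value)
def predict(x):
    return max(models, key=lambda cls: models[cls].predict([x])[0])

print(predict([6.5, 5.5]))  # likely "b" for a point near the "b" cluster
```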
24
© 2006 KDnuggets
Classification: Decision Trees
25
© 2006 KDnuggets
DECISION TREE
An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color=red.
A leaf node represents a class label or class label distribution.
At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
A new instance is classified by following a matching path to a leaf node.
26
© 2006 KDnuggets
Weather Data: Play or not Play?
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Note: Outlook is the weather forecast; no relation to the Microsoft email program.
27
© 2006 KDnuggets
Example Tree for “Play?”
Outlook?
  sunny    -> Humidity? (high: No, normal: Yes)
  overcast -> Yes
  rain     -> Windy?    (true: No, false: Yes)
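For concreteness, the tree above can be written as nested attribute tests; a small sketch (added here, not from the slides):

```python
def play(outlook, humidity, windy):
    """Classify a day with the decision tree above."""
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    if outlook == "overcast":
        return "Yes"
    # outlook == "rain"
    return "No" if windy else "Yes"

print(play("sunny", "high", False))     # No  (row 1 of the weather data)
print(play("overcast", "high", False))  # Yes (row 3)
print(play("rain", "high", True))       # No  (last row)
```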
28
© 2006 KDnuggets
Classification: Neural Nets
29
© 2006 KDnuggets
Classification: other approaches
Naïve Bayes
Rules
Support Vector Machines
Genetic Algorithms
…
See www.KDnuggets.com/software/
30
© 2006 KDnuggets
Evaluation
© 2006 KDnuggets
Evaluating which method works best for classification
No model is uniformly the best
Dimensions for Comparison
speed of training
speed of model application
noise tolerance
explanation ability
32
© 2006 KDnuggets
Comparison of Major
Classification Approaches
34
© 2006 KDnuggets
Evaluation issues
35
© 2006 KDnuggets
Classifier error rate
36
© 2006 KDnuggets
Evaluation on “LARGE” data
37
© 2006 KDnuggets
Classification Step 1:
Split data into train and test sets
(Diagram: historical data with known results (+/-) is split into a Training set and a Testing set.)
38
© 2006 KDnuggets
Classification Step 2:
Build a model on a training set
(Diagram: the Training set is fed into a Model Builder; the Testing set is held aside.)
39
© 2006 KDnuggets
Classification Step 3:
Evaluate on test set (Re-train?)
(Diagram: the model built on the Training set makes Y/N predictions on the Testing set, which are compared with the known results to evaluate the model.)
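A minimal sketch of the three steps, assuming scikit-learn (the tutorial itself names no toolkit) and using a bundled two-class dataset in place of the +/- data from the diagrams:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Step 1: split the data with known results into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2: build a model on the training set only
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 3: evaluate its predictions on the held-out testing set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```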
40
© 2006 KDnuggets
Unbalanced data
41
© 2006 KDnuggets
Handling unbalanced data –
how?
If we have two classes that are very unbalanced, how can we evaluate our classification method?
42
© 2006 KDnuggets
Balancing unbalanced data, 1
43
© 2006 KDnuggets
Balancing unbalanced data, 2
44
© 2006 KDnuggets
A note on parameter tuning
It is important that the test data is not used in any way to create the classifier
Some learning schemes operate in two stages:
Stage 1: builds the basic structure
Stage 2: optimizes parameter settings
45
© 2006 KDnuggets
Making the most of the data
46
© 2006 KDnuggets
Classification:
Train, Validation, Test split
(Diagram: the Training set is used by a Model Builder; predictions on the Validation set are evaluated to tune the model; a held-out Final Test Set gives the Final Evaluation of the Final Model.)
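The same idea with a separate validation set for parameter tuning; a sketch under the same scikit-learn assumption, with arbitrary candidate tree depths, where the final test set is scored only once:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Split into train / validation / final test (60% / 20% / 20%)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune a parameter (tree depth) on the validation set, never on the test set
best_depth = max([2, 4, 8, None],
                 key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
                               .fit(X_train, y_train).score(X_val, y_val))

# Final evaluation: refit with the chosen parameter, score the untouched test set once
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("final test accuracy:", final_model.score(X_test, y_test))
```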
47
© 2006 KDnuggets
Cross-validation
Break the data into groups (folds) of the same size
Hold aside one group for testing and use the rest to build the model
Repeat for each group (a short sketch follows)
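A sketch of k-fold cross-validation, again assuming scikit-learn; cross_val_score holds aside each fold in turn and reports the per-fold accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10 folds: each fold is held aside once for testing; the rest builds the model
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```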
49
© 2006 KDnuggets
More on cross-validation
50
© 2006 KDnuggets
Direct Marketing Paradigm
Find most likely prospects to contact
Not everybody needs to be contacted
Number of targets is usually much smaller than the number of prospects
Typical Applications
retailers, catalogues, direct mail (and e-mail)
customer acquisition, cross-sell, attrition prediction
...
51
© 2006 KDnuggets
Direct Marketing Evaluation
52
© 2006 KDnuggets
Model-Sorted List
Use a model to assign score to each customer
Sort customers by decreasing score
Expect more targets (hits) near the top of the list
No   Score  Target  CustID  Age
1    0.97   Y       1746    …
2    0.95   N       1024    …
3    0.94   Y       2478    …
4    0.93   Y       3820    …
5    0.92   N       4897    …
…    …      …       …       …
99   0.11   N       2734    …
100  0.06   N       2422

3 hits in the top 5% of the list.
If there are 15 targets overall, then the top 5 have 3/15 = 20% of the targets.
53
© 2006 KDnuggets
CPH (Cumulative Pct Hits)
Definition: CPH(P, M) = % of all targets in the first P% of the list scored by model M.
CPH is frequently called Gains.
(Chart: Cumulative % Hits vs. percent of the list for a random list.)
5% of a random list has 5% of the targets.
54
© 2006 KDnuggets
CPH: Random List vs
Model-ranked list
(Chart: Cumulative % Hits vs. percent of the list for a random list and a model-ranked list.)
5% of a random list has 5% of the targets,
but 5% of the model-ranked list has 21% of the targets:
CPH(5%, model) = 21%.
55
© 2006 KDnuggets
Lift
Lift(P, M) = CPH(P, M) / P, where P is the percent of the list.
For example, from the previous slide: Lift(5%, model) = 21% / 5% = 4.2.
Note: some use "Lift" for what we call CPH.
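A small sketch (mine, not the tutorial's) computing CPH and Lift from a model-sorted list of 0/1 targets; the list below is illustrative and built to match the earlier example of 3 hits in the top 5% out of 15 targets:

```python
def cph(sorted_targets, p):
    """CPH(P, M): % of all targets in the first P% of the model-sorted list."""
    n_top = round(len(sorted_targets) * p / 100)
    return 100.0 * sum(sorted_targets[:n_top]) / sum(sorted_targets)

def lift(sorted_targets, p):
    """Lift(P, M) = CPH(P, M) / P."""
    return cph(sorted_targets, p) / p

# 100 customers already sorted by decreasing model score:
# 1 = target (hit), 0 = non-target; 15 targets overall, 3 of them in the top 5
targets = [1, 0, 1, 1, 0] + [1] * 12 + [0] * 83
print(cph(targets, 5))   # 20.0 -> the top 5% of the list holds 20% of the targets
print(lift(targets, 5))  # 4.0
```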
56
© 2006 KDnuggets
Lift – a measure of model quality
57
© 2006 KDnuggets
Clustering
© 2006 KDnuggets
Clustering
Unsupervised learning:
Finds “natural” grouping of
instances given un-labeled data
59
© 2006 KDnuggets
Clustering Methods
Many different methods and algorithms:
For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up
60
© 2006 KDnuggets
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across clusters
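As a toy illustration of a distance-based quality measure (my example; the slide names no specific measure), average within-cluster and between-cluster distances can be compared directly:

```python
import numpy as np
from itertools import combinations

# Toy 2D points with a hand-assigned cluster label for each (illustrative values)
points = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
labels = np.array([0, 0, 0, 1, 1, 1])

def mean_dist(pairs):
    return np.mean([np.linalg.norm(points[i] - points[j]) for i, j in pairs])

pairs = list(combinations(range(len(points)), 2))
within = mean_dist([(i, j) for i, j in pairs if labels[i] == labels[j]])
between = mean_dist([(i, j) for i, j in pairs if labels[i] != labels[j]])

# A good clustering has small within-cluster and large between-cluster distances
print("average within-cluster distance: ", within)
print("average between-cluster distance:", between)
```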
61
© 2006 KDnuggets
The distance function
62
© 2006 KDnuggets
Simple Clustering: K-means
Works with numeric data only
1) Pick a number (K) of cluster centers, at random
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold); see the sketch below
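A compact NumPy sketch of these four steps (my illustration; the slide gives only the pseudocode above):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) pick K cluster centers at random (here: K distinct data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assign = None
    for _ in range(n_iter):
        # 2) assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # 4) stop when the cluster assignments no longer change
        if assign is not None and np.array_equal(new_assign, assign):
            break
        assign = new_assign
        # 3) move each center to the mean of its assigned items
        # (empty clusters are not handled in this minimal sketch)
        centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    return centers, assign

# Toy run on a few 2D points with two obvious groups
X = np.array([[1.0, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]])
centers, assign = kmeans(X, k=2)
print(centers)   # roughly [[1.33, 1.33], [8.33, 8.33]] (order may vary)
print(assign)
```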
63
© 2006 KDnuggets
K-means example, step 1
Pick 3 initial cluster centers (at random).
(Figure: scatter of points with centers c1, c2, c3.)
64
© 2006 KDnuggets
K-means example, step 2
Assign each point to the closest cluster center.
(Figure: each point linked to its nearest center c1, c2, or c3.)
65
© 2006 KDnuggets
K-means example, step 3
Move each cluster center to the mean of each cluster.
(Figure: centers c1, c2, c3 shifted to the cluster means.)
66
© 2006 KDnuggets
K-means example, step 4a
Reassign points that are now closest to a different cluster center.
Q: Which points are reassigned?
(Figure: points shown with the updated centers c1, c2, c3.)
67
© 2006 KDnuggets
K-means example, step 4b
A: these three points.
(Figure: the three reassigned points highlighted.)
68
© 2006 KDnuggets
K-means example, step 5a
Re-compute cluster means.
(Figure: points grouped around centers c1, c2, c3.)
69
© 2006 KDnuggets
K-means example, step 5b
Move cluster centers to cluster means.
(Figure: centers moved to the new cluster means.)
70
© 2006 KDnuggets
Data Mining Applications
© 2006 KDnuggets
Problems Suitable for Data Mining
72
© 2006 KDnuggets
Major Application Areas for
Data Mining Solutions
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
73
© 2006 KDnuggets
Application: Search Engines
76
© 2006 KDnuggets
Application:
Direct Marketing and CRM
Most major direct marketing companies are using modeling and data mining
Most financial companies are using customer modeling
Modeling is easier than changing customer behaviour
Example:
Verizon Wireless reduced customer attrition rate from 2% to 1.5%, saving many millions of $
77
© 2006 KDnuggets
Application: e-Commerce
Amazon.com recommendations
if you bought (viewed) X, you are likely to buy Y
Netflix
If you liked "Monty Python and the Holy Grail",
you get a recommendation for "This is Spinal Tap"
Comparison shopping
Froogle, mySimon, Yahoo Shopping, …
78
© 2006 KDnuggets
Application:
Security and Fraud Detection
Credit Card Fraud Detection
over 20 million credit cards protected by neural networks (Fair, Isaac)
79
© 2006 KDnuggets
Data Mining, Privacy, and Security
80
© 2006 KDnuggets
Criticism of Analytic Approaches
to Threat Detection:
Data Mining will be ineffective: it will generate millions of false positives and invade privacy
81
© 2006 KDnuggets
Can Data Mining and Statistics
be Effective for Threat Detection?
Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
Reality: Analytical models correlate many items of information to reduce false positives.
Example: Identify one biased coin from 1,000.
After one throw of each coin, we cannot tell which coin is biased.
After 30 throws, one biased coin will stand out with high probability.
With a sufficient number of throws, we can identify even 19 biased coins out of 100 million.
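A back-of-the-envelope sketch of the coin argument in Python; the 80% heads bias and the ≥ 21-heads decision rule are my own illustrative assumptions:

```python
from math import comb

def tail_prob(n, p, k):
    """P(at least k heads in n throws of a coin with heads probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, threshold = 30, 21      # flag any coin showing >= 21 heads in 30 throws
fair, biased = 0.5, 0.8    # assumed bias of the "bad" coin (illustrative)

print("fair coin flagged:  ", tail_prob(n, fair, threshold))    # roughly 0.02
print("biased coin flagged:", tail_prob(n, biased, threshold))  # roughly 0.9
```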
82
© 2006 KDnuggets
Another Approach: Link Analysis
84
© 2006 KDnuggets
Data Mining with Privacy
85
© 2006 KDnuggets
The Hype Curve for
Data Mining and Knowledge
Discovery
(Figure: hype curve, 1990–2005, with two curves, Expectations and Performance; labeled phases: rising expectations, over-inflated expectations, disappointment, and growing acceptance and mainstreaming, while performance rises steadily.)
86
© 2006 KDnuggets
Summary
87
© 2006 KDnuggets
Additional Resources
www.KDnuggets.com
data mining software, jobs, courses, etc.
www.acm.org/sigkdd
ACM SIGKDD – the professional society for data mining
88
© 2006 KDnuggets