Data Mining and Knowledge Discovery in Databases
Outline
• What is Data Mining and KDD?
• Characteristics
• Applications
• Methods
• Packages & Close Relatives
What is Data Mining & KDD?
• “The process of identifying hidden patterns
and relationships within data”
or
• “Data mining helps end users extract useful
business information from large databases”
What’s the Appeal?
• Hidden nuggets of valuable information buried
deep within a mountain of otherwise
unremarkable data
• Pervasive data
• Seek competitive advantage
The Challenge
5102018890521200153945819900000000141988122944882199608162100000010100010000000
[Slide continues with a full screen of similar raw, undelimited numeric records.]
Process: Knowledge Discovery in Databases
[Diagram: databases → cleaning & integration → data warehouse → selection → collect and transform → data mining (engines, models) → discovered patterns → evaluation & presentation → knowledge. A user interface and expert domain knowledge steer the process, with feedback loops labelled "modify data selection" and "modify methods, parameters".]
Context
• Where you stand on Data Mining depends on
where you sit:
• Business User
• Researcher
• Computer Scientist
Data Mining Might Mean…
• Statistics
• Visualization
• Artificial intelligence
• Machine learning
• Database technology
• Neural networks
• Pattern recognition
• Knowledge-based systems
• Knowledge acquisition
• Information retrieval
• High performance computing
• And so on...
What’s needed?
• Suitable data
• Computing power
• Data mining software
• Skilled operator who knows both the nature of
the data and the software tools
• Reason, theory, or hunch
Typical Applications of Data Mining & KDD
• Marketing
• Market Basket Analysis
• Customer Relationship Management
• New Product Development
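Market basket analysis asks which items tend to be bought together. A minimal sketch, assuming a handful of made-up baskets: `BASKETS`, `pair_support`, and `confidence` are hypothetical names introduced here for illustration, and real retail data would be far larger.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one shopping basket.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(BASKETS)
bread_then_milk = confidence(BASKETS, "bread", "milk")
```

Support measures how common a pair is overall; confidence measures how often the second item appears given the first. Both are simple counts, which is one reason this kind of analysis scales to large transaction logs.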
Typical Applications of Data Mining & KDD
• Financial Services
• Credit Approval
• Fraud Detection
• Marketing
Typical Applications of Data Mining & KDD
• Health Care
• Epidemiological Analysis - incidence and prevalence
of disease in large populations and detection of the
source and cause of epidemics of infectious disease
• Knowledge for funding
• Policy, programs
Two Basic Approaches
• Supervised
• A dependent or target variable
• Unsupervised
• “Pure Data Mining”
• Fewer assumptions
• Typically used for clustering techniques
Automation
• The ability to aim a tool at some data and push
a button
• Some methods of KDD/Data mining are more
suitable for automation than others
Seven Basic Methods:
1. Decision Trees
2. (Artificial) Neural Networks
3. Cluster/Nearest Neighbour
4. Genetic Algorithms/Evolutionary Computing
5. Bayesian Networks
6. Statistics
7. Hybrids
Decision Trees
• Graphical representations of relationships within data
• Excel at classification and prediction models
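Sample of a Decision Tree
[Figure: sample decision tree. The root node splits on gender (female / male). Lower nodes split on age (<65 / >=65), married? (yes / no), good health? (yes / no), urban? (yes / no), and pet owner? (yes / no, used in two places). Leaves are labelled + or -.]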
Decision Trees
• Strengths
• Easily understood and interpreted
• Represent complexity in a compact form
• Handle non-linear data well
• Relatively well suited to automation
• Weaknesses
• Large trees with large numbers of variables become difficult to understand
• Missing data must be appropriately managed in construction and use of the models
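A decision tree of this kind can be read as a chain of yes/no tests that routes each record to a leaf. A minimal sketch, assuming a small hand-written tree: `TREE`, its field names, and its leaf labels are hypothetical and do not reproduce the sample figure.

```python
# Hypothetical tree: internal nodes test one field; leaves are "+" or "-".
TREE = {
    "field": "age_65_plus",
    "yes": {"field": "good_health", "yes": "-", "no": "+"},
    "no": {"field": "pet_owner", "yes": "-", "no": "+"},
}


def classify(node, record):
    """Walk from the root to a leaf, branching on the record's field values."""
    while isinstance(node, dict):
        branch = "yes" if record[node["field"]] else "no"
        node = node[branch]
    return node


label = classify(
    TREE, {"age_65_plus": True, "good_health": False, "pet_owner": True}
)
```

Only the fields on the path actually taken are consulted, which is part of why small trees are easy to explain. In practice the tree would be learned from training data rather than written by hand.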
Neural Networks
• Derived from Artificial Intelligence Research
• Modelled on the Human Neuron
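Neural Networks
[Figure: feedforward network. Input variables (Age, Gender, Income) connect through weighted links to a hidden layer, which connects through weighted links to a single Prediction node. Weights shown: 0.6, 0.3, 0.1, 0.5, 0.7, 0.8, 0.4, 0.3, 0.2.]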
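The feedforward idea (inputs, weighted links, hidden layer, prediction) can be sketched in a few lines. This is an illustration under stated assumptions: the two-node hidden layer, the sigmoid activation, the absence of bias terms, and the way the slide's weight values are assigned to links are all guesses made for the example, not the figure's actual wiring.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, output_weights):
    """One feedforward pass: inputs -> hidden layer -> single prediction."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        for weights in hidden_weights
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


# Assumed layout: 3 inputs (age, gender, income scaled to 0..1), 2 hidden nodes.
HIDDEN_WEIGHTS = [[0.6, 0.3, 0.1], [0.5, 0.7, 0.8]]
OUTPUT_WEIGHTS = [0.4, 0.3]

prediction = forward([0.5, 1.0, 0.2], HIDDEN_WEIGHTS, OUTPUT_WEIGHTS)
```

Training would adjust the weights over many passes to reduce prediction error; this sketch covers only the prediction step.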
Neural Networks
• Strengths
• Accuracy of prediction
• Robust performance with a wide variety of data types
• Weaknesses
• Prone to overfitting
• Poor clarity of model
Clustering/Nearest Neighbour
• Aim to assign “like” records to a group
• Groups assigned according to some target
variable or criteria
• Nearest neighbour used for prediction
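Nearest-neighbour prediction assigns a new record the label that is most common among its closest known records. A minimal sketch, assuming two numeric features and Euclidean distance: `TRAIN`, `knn_predict`, and the labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the majority label among the k training records nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled records: (feature vector, label).
TRAIN = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"),
    ((5.2, 4.9), "high"),
    ((4.8, 5.1), "high"),
]

label = knn_predict(TRAIN, (1.1, 1.0))
```

The choice of distance measure and of k is a judgment call, and features on different scales usually need rescaling first, which is part of the preprocessing burden with complex data.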
Clustering/Nearest Neighbour
• Applications:
• Text processing: search engines
• Image processing: radiology/image processing
• Fraud detection: outliers
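For the fraud-detection case, one simple reading of "outlier" is a record that has no close neighbours. A minimal sketch on one-dimensional amounts: the data, the function names, and the threshold are assumptions for illustration only.

```python
def nearest_neighbour_distance(values, index):
    """Distance from values[index] to its closest other value."""
    return min(abs(values[index] - v) for j, v in enumerate(values) if j != index)


def flag_outliers(values, threshold):
    """Return values whose nearest neighbour is farther away than threshold."""
    return [
        v
        for i, v in enumerate(values)
        if nearest_neighbour_distance(values, i) > threshold
    ]


# Hypothetical transaction amounts; the threshold is an assumed tuning choice.
AMOUNTS = [20.0, 22.0, 19.0, 21.0, 500.0]
suspicious = flag_outliers(AMOUNTS, threshold=50.0)
```

A flagged record is only a candidate for review, not proof of fraud.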
Clustering/Nearest Neighbour
• Strengths
• Easily understood and interpreted
• Easily implemented in basic situations
• Weaknesses
• Complex data not well suited to automation (much preprocessing required)
Genetic Algorithms/Evolutionary Computing
• Grounded in Darwin – applied using mathematics
• Require
• a way to represent a solution to a problem
• a way to test the “fitness” of the solution
• Solutions are mathematically “mutated”
• Fittest solutions survive
• Convergence
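The ingredients listed here (a representation, a fitness test, mutation, survival of the fittest) fit in a short loop. A minimal sketch on an assumed toy problem: solutions are bit lists and fitness is the number of 1s. It uses mutation only, with no crossover, and every name and parameter value is hypothetical.

```python
import random


def fitness(bits):
    """Toy fitness test: count the 1s in the candidate solution."""
    return sum(bits)


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(pop_size=20, length=16, generations=40, rate=0.05, seed=0):
    """Evolve a population of bit lists and return the fittest one found."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        # The fittest half survives unchanged; mutated copies refill the rest.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        population = survivors + [mutate(s, rate, rng) for s in survivors]
    return max(population, key=fitness)


best = evolve()
```

Because the best solutions are carried over unchanged each generation, the best fitness never decreases. In a real application the creative work lies in choosing the representation and the fitness test.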
Genetic Algorithms/Evolutionary Computing
• Strengths
• Suited to novel problems that are poorly understood
• Suitable where data is dirty or missing
• May be useful where other methods cannot be applied
• Weaknesses
• Not easily automated
• Require creativity in their application
Bayesian Networks
• Based on Bayes’ rule:
• P(a|b) = P(b|a) * P(a) / P(b)
• Can construct networks of linked events, each
with prior probabilities
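Bayes' rule can be worked through numerically. A minimal sketch with made-up numbers: a condition with 1% prevalence, a test that detects it 90% of the time, and a 5% false-positive rate. The function name and all figures are assumptions for illustration; P(b) is expanded using the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(a|b) via Bayes' rule, expanding P(b) by total probability."""
    p_b = (
        likelihood_b_given_a * prior_a
        + likelihood_b_given_not_a * (1.0 - prior_a)
    )
    return likelihood_b_given_a * prior_a / p_b


# Hypothetical numbers: P(a) = 0.01, P(b|a) = 0.90, P(b|not a) = 0.05.
p = posterior(0.01, 0.90, 0.05)
```

With these assumed inputs the posterior is roughly 0.15: even after a positive result, the low prior keeps the probability modest. Feeding a revised prior back in is how such models adapt to new probabilities.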
Bayesian Network Example
[Figure: Bayesian network centred on the event "J.R. Shot". Linked nodes: Bobby shot him; Just a dream sequence; Mistress shot him; Wife shot him; Suicide; J.R. treated for depression; Bobby publicly threatened; Producers desperate for ratings; Big fight between wife, mistress.]
Bayesian Networks
• Strengths
• Clarity of the resulting models
• Good precision in predicting
• Easily adapt to new probabilities
• Weaknesses
• Time consuming to construct and maintain
• Poor at predicting rare events
Statistics
• With an outcome or dependent variable:
• Correlations
• ANOVA
• Regression
• Used by themselves or to confirm findings of
another method
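Correlation and simple regression can be computed directly from their definitions. A minimal sketch in plain Python, assuming paired numeric observations: the function names and the sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
XS = [1.0, 2.0, 3.0, 4.0]
YS = [3.0, 5.0, 7.0, 9.0]
r = pearson_r(XS, YS)
intercept, slope = fit_line(XS, YS)
```

Both calculations assume a linear relationship, which is the kind of limitation that applies when statistics are used on their own.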
Statistics
• Strengths
• “Gold Standard” – valid and trusted in scientific circles
• Weaknesses
• Limits findings to those techniques that are applied and their associated limitations (normality, linearity, and so on)
Hybrids
• Techniques used in combination
• Example: use of a genetic algorithm to identify
target variables for inclusion in a neural
network model
Recap
• Data Mining is the core activity or method
within a process of Knowledge Discovery in
Databases
• Done in order to find useful information in large
amounts of data not possible using
“conventional” approaches
• Variety of methods
• Knowledge of data domain, methods, as well
as creativity
Data Mining Packages
• Major vendors of database/data management
products (IBM, SPSS, Oracle PeopleSoft,
SAS, and so on)
• Added as a component of turnkey packages
• May incorporate several methods (SAS
Enterprise Miner)
• Single method (TreeAge Software Inc.: a
dedicated decision tree product)
How to implement?
• Do it yourself (you know the data domain)
• Put a team together (domain and method
specialists)
• Hire a consultant (who knows both your
domain and the tools)
• Vertical markets in data mining
Close Relatives of Data Mining
• On-Line Analytical Processing (OLAP)
• Pivot tables in spreadsheets
• General statistical packages
• Intelligent Data Analysis – comprises the use
of data mining methods in the analysis of
“small” datasets
Editor's Notes

  • #3: Definitions: “The process of identifying hidden patterns and relationships within data” or “Data mining helps end users extract useful business information from large databases”
  • #5: Humans aren’t particularly well suited to finding patterns in data. Computers, on the other hand…
  • #6: Distinguish the core data mining component from the overall KDD process. I’ll accept the common usage for this lecture and use the term data mining to mean both the core methods and the overall process.
  • #7: Importance of context: A business user will be interested in efficiency and results, validity may not be as important. A researcher clearly will be interested in a different type of results, and validity will be important. A computer scientist may be interested in introducing new algorithms or computational approaches and achieving improved results or more efficient processing.
  • #8: Data Mining is not a precisely defined area – embodies many related academic and applied specialties.
  • #9: The first three things are clear. Human component is critical: knowledge of the business/data domain and knowledge of the tools are both important – do you train your domain people to know data mining? - do you train data mining people to know the domain? – do you build a team? - do you bring in a consultant who may know a lot of both in a specific area (financial services for example) – do you build “intelligent agents” to guide users? Some data mining may clearly have a stated purpose: Identify the three most notable demographic characteristics among our customers that spend over $X – others are more open ended – look for patterns in radio-telescope signals coming from deep space. Whatever the case, data mining requires some rationale for committing the time and resources to do it.
  • #10: Marketing – why? – large amounts of transactional data on purchases, WalMart for example – market basket analyses – who buys what with what – influence on retail decisions – also Customer Relationship Management (CRM) – churn – important in business-to-business environment
  • #11: Financial services – again, large numbers of computerized records – data mining used to profile characteristics of poor or excellent credit risks, or exception analyses for fraud detection
  • #12: Health care – may have large amounts of data: hospital admissions/discharges, or billing information for physician procedures. Epidemiological studies (epidemiology: the study of patterns of disease) use population health data for a province or an entire nation – impact on health intervention, funding, training, policy, etc.
  • #17: Start at the top with all records. Nodes use field values to go “left or right”, branching to further nodes or to end points (leaves) that define a higher or lower likelihood of the outcome or dependent variable. This is supervised data mining. The models are asymmetric – note that a variable can be used in more than one place – so although a small model is fairly clear to interpret, a larger one can quickly become difficult to interpret. Constructing these: both automated and interactive semi-automated approaches exist – choose which variables to use and, for continuous variables, where the cutpoints will be. Branches can be restricted to two or not. Typically, divide a dataset into training and validation subsets.
  • #19: A biological neuron has multiple inputs (dendrites), resulting in an output. This is modelled in its simplest form with multiple inputs (X’s), each with weights (W’s), resulting in an output Y.
  • #20: This is a hypothetical model that might be used to predict health status. This model is supervised (target variable: good health or poor health). Input variables are at the bottom and feed forward as inputs to a hidden layer of nodes. Weights are set and adjusted during training until convergence. There are a variety of variations and approaches using feedback architectures, and algorithms to optimize computing time, which can be extensive: multiple training passes.
  • #28: Events are linked by probabilities. By estimating or quantifying probabilities among linked events, can estimate most likely cause. Updated probabilities alter the most likely cause – these models can “learn” from new information.
  • #31: But is it really “data mining”?
  • #34: Database standards allow any data mining package to access raw data. Turnkey packages: for example, hospital information management systems, enterprise packages.
  • #35: Vertical markets: development of data mining packages for one specific kind of customer, bank fraud, for example
  • #36: Tools such as Cognos PowerPlay, for example, slice and dice across multiple dimensions – very popular with sales and financial analysis in support of business decisions. The term data mining tends to be overused.