DWDM 01 Introduction
— Chapter 1 —
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated analysis of
massive data sets
Applications: Market analysis, fraud detection, customer retention etc.
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
What Kind of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
What Is Data Mining?
Knowledge Discovery (KDD) Process
Data Mining: On What Kinds of Data?
2. data warehouse
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site
A data warehouse is a subject oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision making process
Q: Analyze the company’s sales per item type per branch for the third quarter.
This is a difficult task, since the relevant data are spread out over several databases
physically located at numerous sites.
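Once the data sit under one unified schema, the query above reduces to a single group-and-sum. The following is a minimal Python sketch, assuming a hypothetical list of already-integrated sales records; the field names `branch`, `item_type`, `quarter`, and `amount`, and all figures, are made up for illustration and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical, already-integrated sales records (illustrative values only).
sales = [
    {"branch": "North", "item_type": "computer", "quarter": 3, "amount": 1200.0},
    {"branch": "North", "item_type": "computer", "quarter": 3, "amount": 800.0},
    {"branch": "North", "item_type": "phone", "quarter": 3, "amount": 300.0},
    {"branch": "South", "item_type": "computer", "quarter": 3, "amount": 500.0},
    {"branch": "South", "item_type": "computer", "quarter": 2, "amount": 900.0},
]


def sales_per_item_type_per_branch(records, quarter):
    """Sum sales amounts by (item_type, branch) for one quarter."""
    totals = defaultdict(float)
    for record in records:
        if record["quarter"] == quarter:
            totals[(record["item_type"], record["branch"])] += record["amount"]
    return dict(totals)


q3_totals = sales_per_item_type_per_branch(sales, quarter=3)
for (item_type, branch), total in sorted(q3_totals.items()):
    print(f"{item_type:10s} {branch:6s} {total:10.2f}")
```

The same aggregation run against several separate operational databases would first require collecting and reconciling the records, which is the work a warehouse does ahead of time.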
Data Mining Functionalities: What Kinds of Patterns May Be Mined?
Two categories
Descriptive
Descriptive mining tasks characterize the general
properties of the data in the database
Predictive
Predictive mining tasks perform inference on the
current data in order to make predictions.
E.g., compare two groups of customers: those who shop for computer products
regularly (more than two times a month) versus those who rarely shop for
such products (i.e., less than three times a year).
Result:
80% of the customers who frequently purchase computer
products are between 20 and 40 years old and have a
university education
Association rules:
A → D (support 60%, confidence 100%)
D → A (support 60%, confidence 75%)
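The two percentages attached to each rule are its support and confidence. The sketch below shows how they can be computed. The five transactions are a hypothetical database, chosen so that the numbers match the rules above; the slides as extracted do not include the underlying data.

```python
# Hypothetical transaction database (an assumption, not taken from the slides).
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]


def count(itemset, db):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in db if itemset <= t)


def support(itemset, db):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, db) / len(db)


def confidence(antecedent, consequent, db):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, db) / count(antecedent, db)


for lhs, rhs in [("A", "D"), ("D", "A")]:
    s = support({lhs, rhs}, transactions)
    c = confidence({lhs}, {rhs}, transactions)
    print(f"{lhs} -> {rhs}: support={s:.0%}, confidence={c:.0%}")
```

Both rules share the same support because they involve the same itemset {A, D}; the confidences differ because D occurs in more transactions than A does.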
Classification purpose: use the model to predict the class of objects whose class
label is unknown.
Example Tree for “Play?”
Outlook
  sunny: Humidity (leaves: No, Yes)
  overcast: Yes
  rain: Windy (leaves: No, Yes)
Classification Ex: Grading
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
Equivalent decision tree, testing x at each node:
x >= 90: A
x < 90, x >= 80: B
x < 80, x >= 70: C
x < 70, x >= 60: D
x < 60: F
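The grading rules form a chain of threshold tests, which is exactly what a decision tree on a single attribute encodes. A minimal Python sketch follows, assuming that every score below 60 maps to F so that each score receives exactly one grade; the function name `grade` is chosen here for illustration.

```python
def grade(x):
    """Map a numeric score x to a letter grade by walking the threshold tests
    from the top of the tree down."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    # Assumption: all remaining scores (below 60) are graded F.
    return "F"


for score in (95, 85, 72, 60, 42):
    print(score, grade(score))
```

Each `if` corresponds to one internal node of the tree, and each `return` to a leaf.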
Data Mining Functionalities
Prediction: models continuous-valued functions. It is used to predict
missing or unavailable numerical data values rather than class labels.
E.g., predict the amount of revenue that each item will generate
during an upcoming sale at AllElectronics
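One simple way to model a continuous-valued function is a least-squares line. The sketch below is only an illustration of numeric prediction, not a method named in the slides; the discount and revenue figures are hypothetical, as are the function names `fit_line` and `predict`.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict a numeric value (not a class label) for a new input x."""
    return slope * x + intercept


# Hypothetical history for one item: discount (%) in past sales and the
# revenue (in thousands) the item generated.
discount_pct = [10, 20, 30, 40]
revenue = [30, 50, 70, 90]

slope, intercept = fit_line(discount_pct, revenue)
print(predict(50, slope, intercept))
```

Unlike classification, the output here is a number on a continuous scale, so the model can return values that never appeared in the training data.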
Cluster analysis: unsupervised learning (class labels are not known in advance)
Objects are grouped by maximizing intra-class similarity and minimizing inter-class similarity
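To make the grouping principle concrete, here is a tiny one-dimensional k-means sketch. k-means is one common clustering algorithm and is used here as an assumption; the slides do not name a specific method. The data values and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on 1-D data: assign each value to its nearest center,
    then move each center to the mean of the values assigned to it."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical values with two obvious groups; no class labels are supplied.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(values, centers=[1.0, 12.0])
print(centers)
print(clusters)
```

Values inside each resulting cluster lie close together (high intra-class similarity), while the two cluster centers lie far apart (low inter-class similarity).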
Data mining may generate thousands of patterns: Not all of them are
interesting!
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or
test data with some degree of certainty, potentially useful, novel, or validates
some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support and
confidence for association rules
Data Mining: Confluence of Multiple Disciplines
Applications of Data Mining
Business Intelligence
Web Search Engine
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Conferences and Journals on Data Mining
Where to Find References? DBLP, CiteSeer, Google