VIPDMTheoryChapter1
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
What Kind of Data Can Be Mined?
What Technologies Are Used?
What Kind of Applications Are Targeted?
Major Issues in Data Mining
Summary
Why Data Mining?
The explosive growth of data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, the Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
What Is Data Mining?
[Figure: the knowledge discovery process. Recoverable labels: Databases, Data Integration, Data Cleaning, Task-relevant Data]
Knowledge Discovery Process
Data cleaning (remove noise and inconsistent data)
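The cleaning step can be illustrated with a minimal sketch. It assumes sensor-style records, treats missing or out-of-range values as "inconsistent", and uses a 3-point median filter as one possible way to suppress noise. The function name `clean_readings`, the field names, and the range limits are hypothetical and are not taken from the slides.

```python
from statistics import median


def clean_readings(records, low=0.0, high=100.0):
    """Drop inconsistent records, then smooth noise with a 3-point median filter.

    Assumption: a record is inconsistent if its value is missing or lies
    outside the plausible range [low, high].
    """
    consistent = [
        r for r in records
        if r["value"] is not None and low <= r["value"] <= high
    ]
    values = [r["value"] for r in consistent]

    smoothed = []
    for i, record in enumerate(consistent):
        # Window holds the previous, current, and next value (shorter at the ends).
        window = values[max(0, i - 1): i + 2]
        smoothed.append({**record, "value": median(window)})
    return smoothed


# Hypothetical input: one noise spike, one missing value, one impossible value.
readings = [
    {"sensor": "s1", "value": 20.0},
    {"sensor": "s1", "value": 21.0},
    {"sensor": "s1", "value": 95.0},   # noise spike, still inside the valid range
    {"sensor": "s1", "value": 22.0},
    {"sensor": "s1", "value": None},   # missing value
    {"sensor": "s1", "value": -40.0},  # outside the plausible range
    {"sensor": "s1", "value": 23.0},
]

cleaned = clean_readings(readings)
for row in cleaned:
    print(row)
```

The median filter is only one choice; binning, regression, or clustering-based smoothing could stand in its place depending on the data.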
Data Mining
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
[Figure: layers ordered by increasing potential to support business decisions, from lowest to highest: Data Exploration (Statistical Summary, Querying, and Reporting); Data Mining, Information Discovery (Data Analyst); Data Presentation, Visualization Techniques (Business Analyst); Decision Making (End User)]
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
Data Warehouse Framework
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as terabytes
High dimensionality of data
Microarray data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Applications of Data Mining
Web page analysis: from web page classification and clustering to the PageRank and HITS algorithms
Collaborative analysis and recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
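As one illustration of the basket data analysis mentioned in the list above, the following sketch counts item pairs that are bought together and reports the support and confidence of each resulting rule. It is a brute-force pair count over toy data, not a scalable frequent-pattern algorithm; the function name `pair_rules`, the baskets, and the 0.5 support threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # fraction of baskets containing both items
        if support < min_support:
            continue
        # Confidence of "a => b" is P(b | a); the reverse rule uses P(a | b).
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules)


# Hypothetical transactions.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
]

rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} => {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule with high confidence, such as "butter => bread" in this toy data, is the kind of pattern a targeted marketing campaign might act on.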
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: an interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Summary
Data mining: discovering interesting patterns and knowledge from massive amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Data mining technologies and applications
Major issues in data mining
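The KDD steps listed in the summary can be read as a chain of small functions. The sketch below is only a toy arrangement of those seven steps: every function name, the sales records, and the thresholds are hypothetical, and the "mining" step is a plain frequency count standing in for a real mining algorithm.

```python
from collections import Counter


def clean(rows):
    # Data cleaning: drop rows that contain a missing field.
    return [r for r in rows if all(v is not None for v in r.values())]


def integrate(*sources):
    # Data integration: combine rows from several sources into one list.
    return [row for source in sources for row in source]


def select(rows, fields):
    # Data selection: keep only the task-relevant fields.
    return [{f: r[f] for f in fields} for r in rows]


def transform(rows, threshold=100):
    # Transformation: discretize the numeric amount into two bands.
    return [
        {"region": r["region"], "band": "high" if r["amount"] >= threshold else "low"}
        for r in rows
    ]


def mine(rows):
    # Data mining (stand-in): count how often each (region, band) pattern occurs.
    return Counter((r["region"], r["band"]) for r in rows)


def evaluate(patterns, min_count=2):
    # Pattern evaluation: keep only patterns that occur often enough.
    return {p: c for p, c in patterns.items() if c >= min_count}


def present(patterns):
    # Knowledge presentation: render the surviving patterns as readable lines.
    return [f"{region}/{band}: {count} records"
            for (region, band), count in sorted(patterns.items())]


# Hypothetical sales records from two sources.
store_a = [
    {"region": "north", "amount": 120, "clerk": "x"},
    {"region": "north", "amount": 150, "clerk": "y"},
    {"region": "south", "amount": None, "clerk": "z"},  # removed by cleaning
]
store_b = [
    {"region": "south", "amount": 30, "clerk": "q"},
    {"region": "south", "amount": 45, "clerk": "w"},
    {"region": "north", "amount": 20, "clerk": "e"},
]

data = integrate(clean(store_a), clean(store_b))
data = transform(select(data, ["region", "amount"]))
report = present(evaluate(mine(data)))
for line in report:
    print(line)
```

In practice the steps are iterated rather than run once, and each function would be replaced by a far more capable component.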
Recommended Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011
D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer-Verlag, 2009
B. Liu. Web Data Mining. Springer, 2006
T. M. Mitchell. Machine Learning. McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005