01Intro1
2013
Chapter 1. Introduction
Why Data Mining?
Summary
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: simulation, …
Society and everyone: news, digital cameras, YouTube
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing
communities
Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process. Databases → Data Cleaning and Data Integration →
Data Warehouse → Data Selection → Task-relevant Data → Data Mining →
Pattern Evaluation]
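The stages named above can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions: the function names (clean, integrate, select, mine, evaluate), the toy "mining" step (simple counting), and the sample records are all hypothetical and are not taken from the slides.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge several sources into one list."""
    merged = []
    for src in sources:
        merged.extend(src)
    return merged

def select(records, field):
    """Data selection: keep only the task-relevant attribute."""
    return [r[field] for r in records]

def mine(values):
    """Data mining (toy stand-in): count how often each value occurs."""
    return Counter(values)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a minimum count."""
    return {k: c for k, c in patterns.items() if c >= min_count}

# Hypothetical sample data from two sources.
store_a = [{"item": "camera"}, {"item": None}, {"item": "sd_card"}]
store_b = [{"item": "camera"}, {"item": "tripod"}]

# Clean and integrate, select the relevant field, mine, then evaluate.
records = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(select(records, "item")), min_count=2)
print(patterns)
```

Each stage only consumes the output of the previous one, which is why the process is usually drawn as a pipeline; in practice the steps are iterated rather than run once.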
Example: A Web Mining Framework
Web mining usually involves
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored in a knowledge base
Data Mining in Business Intelligence
[Figure: layered pyramid; potential to support business decisions increases
toward the top. Layers from top to bottom, with the associated role:]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Which View Do You Prefer?
Which view do you prefer?
KDD vs. ML/Stat. vs. Business Intelligence
Depending on the data, applications, and your focus
Data Mining vs. Data Exploration
Business intelligence view
Warehouse, data cube, reporting but not much mining
Business objects vs. data mining tools
Supply chain example: mining vs. OLAP vs. presentation
tools
Data presentation vs. data exploration
Multi-Dimensional View of Data Mining
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous, legacy),
data warehouse, transactional data, stream, spatiotemporal, time-series,
sequence, text and web, multi-media, graphs & social and information
networks
Knowledge to be mined (or: Data mining functions)
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern
recognition, visualization, high-performance, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
Multimedia database
Text databases
Data Mining Function: (1) Generalization
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the general
behavior of the data
Noise or exception? ― One person’s garbage could be another
person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
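As one illustration of outliers falling out of regression analysis, the sketch below fits a least-squares line and flags points whose residual is unusually large. The function name regression_outliers, the threshold k, and the sample data are assumptions made for this example; the slides do not prescribe a specific method.

```python
from statistics import mean, pstdev

def regression_outliers(xs, ys, k=2.0):
    """Return indices whose residual from a least-squares line
    exceeds k standard deviations of all residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > k * spread]

# Hypothetical data: points on y = 2x + 1, with one value replaced.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 40

print(regression_outliers(xs, ys))
```

Whether the flagged point is noise to discard or an exception worth investigating depends on the application, which is the point of the "garbage or treasure" remark above.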
Time and Ordering: Sequential Pattern, Trend and
Evolution Analysis
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis: e.g., regression and
value prediction
Sequential pattern mining
e.g., first buy digital camera, then buy large SD memory
cards
Periodicity analysis
Motifs and biological sequence analysis
Approximate and consecutive motifs
Similarity-based analysis
Mining data streams
Ordered, time-varying, potentially infinite, data streams
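The camera-then-memory-card example of sequential pattern mining can be made concrete by counting, for each ordered pair of items, how many purchase sequences contain the first item before the second. This is a simplified sketch, not a full sequential pattern mining algorithm; the function name ordered_pair_support and the purchase histories are hypothetical.

```python
from collections import Counter
from itertools import combinations

def ordered_pair_support(sequences):
    """For each ordered pair (a, b), count the sequences
    in which a occurs somewhere before b."""
    support = Counter()
    for seq in sequences:
        seen = set()
        # combinations over indices yields every (i, j) with i < j.
        for i, j in combinations(range(len(seq)), 2):
            if seq[i] != seq[j]:
                seen.add((seq[i], seq[j]))
        # Each sequence contributes at most once per pair.
        support.update(seen)
    return support

# Hypothetical purchase histories, one list per customer, in time order.
purchases = [
    ["digital_camera", "sd_card"],
    ["digital_camera", "tripod", "sd_card"],
    ["sd_card", "digital_camera"],
]

support = ordered_pair_support(purchases)
print(support[("digital_camera", "sd_card")])
print(support[("sd_card", "digital_camera")])
```

Order matters here: the pair (digital_camera, sd_card) and its reverse are counted separately, which is what distinguishes sequential patterns from plain co-occurrence.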
Structure and Network Analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be in multiple information networks: friends, family,
classmates, …
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, …
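Since PageRank is named above as an example of analyzing the Web as an information network, here is a minimal power-iteration sketch of the commonly taught formulation. The damping factor 0.85, the fixed iteration count, and the three-page graph are assumptions for illustration, and the code assumes every link target also appears as a key in the dictionary; it is not a description of any production search engine.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict that maps each page
    to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        # Each page passes an equal share of its rank to every out-link.
        for p in pages:
            for target in links[p]:
                new_rank[target] += damping * rank[p] / len(links[p])
        rank = new_rank
    return rank

# Hypothetical three-page web: A links to B and C, B to C, C to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The ranks always sum to 1, so each value can be read as the share of attention a page receives from the link structure alone.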
Evaluation of Knowledge
Is all mined knowledge interesting?
One can mine a tremendous number of “patterns”
Some may fit only certain dimension space (time, location, …)
Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mine only interesting
knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
…
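Two of the criteria listed above, coverage and accuracy, can be computed directly for an if-then rule. The slides do not define them, so this sketch assumes the usual rule-based definitions: coverage is the fraction of records that satisfy the rule's condition, and accuracy is the fraction of those covered records that also satisfy its conclusion. The function name and the customer records are hypothetical.

```python
def rule_coverage_accuracy(records, condition, conclusion):
    """Return (coverage, accuracy) of the rule 'condition => conclusion'."""
    covered = [r for r in records if condition(r)]
    if not covered:
        # A rule that covers nothing has no measurable accuracy.
        return 0.0, 0.0
    correct = [r for r in covered if conclusion(r)]
    return len(covered) / len(records), len(correct) / len(covered)

# Hypothetical records for the rule: age < 30 => buys.
customers = [
    {"age": 25, "buys": True},
    {"age": 28, "buys": True},
    {"age": 22, "buys": False},
    {"age": 45, "buys": False},
]

coverage, accuracy = rule_coverage_accuracy(
    customers,
    condition=lambda r: r["age"] < 30,
    conclusion=lambda r: r["buys"],
)
print(coverage, round(accuracy, 3))
```

A rule can score well on one measure and poorly on the other, which is why interestingness is usually judged on several criteria at once.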
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be scalable to handle big data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social and information networks
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Applications of Data Mining
Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
Data mining and software engineering
From major dedicated data mining systems/tools (e.g., SAS, MS SQL Server
Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
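Of the applications listed above, basket data analysis is the easiest to show in a few lines: count how often pairs of items appear in the same basket and keep the pairs whose support meets a threshold. This is only a pair-counting sketch under assumed data, not a complete frequent-itemset algorithm; the function name frequent_pairs, the threshold, and the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs that occur together in
    at least min_support (a fraction) of the baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form.
        counts.update(combinations(sorted(set(basket)), 2))
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]

pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

Pairs with high support are the raw material for targeted marketing decisions such as cross-promotions.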
Summary
Data mining: Discovering interesting patterns and knowledge from massive
amounts of data
A natural evolution of science and information technology, in great demand,
with wide applications
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination, association,
classification, clustering, trend and outlier analysis, etc.
Data mining technologies and applications
Major issues in data mining
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)