DM_C1_Overview
Data Mining
Course ID: 055131
Chapter 1: Overview
2023/2/6
DATA MINING: A QUICK GLANCE
[Figure: Data → Mining → Information/Knowledge]
CONTENT
1. Practical situations with Data mining
2. Knowledge discovery
3. Main concepts
4. Roles of data mining
5. Applications
6. Summary
1. SITUATION 1
1. SITUATION 2
Predict the price of
a stock, e.g., STB
1. SITUATION 3
[Figure: packets per flow instance. x-axis: Flow Instance (1 to 31); y-axis: Packets (0 to 500); series: BG_MaxUpNoPkts, FG_MaxUpNoPkts]
We are data rich, but information poor
“Necessity is the mother of invention” - Plato
2. KNOWLEDGE DISCOVERY FROM DATABASES (KDD)
“Knowledge discovery in databases is the nontrivial
process of identifying valid, novel, potentially useful, and
ultimately understandable patterns”
Frawley, W. J et al. (1991). Knowledge discovery in databases: an
overview.
[Figure: KDD process. Labels: Data Sources, Data Cleaning, Data Integration, Task-relevant Data]
2. KDD…
…is an iterative process with the following main steps:
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
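The seven steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the toy records, the function names, and the "above-the-mean" pattern are all hypothetical and do not come from the slides.

```python
from statistics import mean

# Hypothetical toy records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "spend": 120.0}, {"id": 2, "age": None, "spend": 80.0}]
source_b = [{"id": 3, "age": 41, "spend": 300.0}, {"id": 4, "age": 38, "spend": 310.0}]

def clean(records):
    """Step 1: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: merge several sources into one collection."""
    return [r for s in sources for r in s]

def select(records, fields):
    """Step 3: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Step 4: min-max scale one attribute to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in records]

def mine(records, field):
    """Step 5: a trivial 'pattern': records above the mean of `field`."""
    m = mean(r[field] for r in records)
    return [r for r in records if r[field] > m]

def evaluate(patterns, min_count):
    """Step 6: keep the result only if it covers enough records."""
    return patterns if len(patterns) >= min_count else []

def present(patterns):
    """Step 7: render the result for the user."""
    return [f"high spender, scaled spend={p['spend']:.2f}" for p in patterns]

data = integrate(clean(source_a), clean(source_b))
data = transform(select(data, ["spend"]), "spend")
report = present(evaluate(mine(data, "spend"), min_count=1))
print(report)
```

In a real project the process is iterative: a poor result at step 6 usually sends the analyst back to an earlier step rather than straight to presentation.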
2. KDD…
…each step in the KDD process may work with:
Data sources (many types)
Data warehouse
Task-relevant data
Patterns
Knowledge
2. KDD IN THE DATA MANAGEMENT PYRAMID
[Figure: data management pyramid. Labels: Increasing potential to support business decisions; Making Decisions; End User]
Data Mining
Course ID: CO3029
Chapter 1: Overview
Part 2
Assoc. Prof. TRAN MINH QUANG
http://researchmap.jp/quang
3. MAIN CONCEPTS IN KDD
Data mining
Data mining tasks/functions
Data mining processes
Data mining systems
3.1. DATA MINING (DM)
DM is a process of …
“extracting or mining knowledge from large
amounts of data”
“knowledge mining from data”
3.1. DATA MINING
Similar/common terms
knowledge discovery/mining in data/databases (KDD)
knowledge extraction
data/pattern analysis
data archaeology, data dredging
information harvesting
business intelligence
3.1. DATA MINING: DATA SOURCES
DM works on large amounts of data…
Of any type: structured, unstructured, semi-structured
From various data sources
Data sources
Flat files
Databases: relational databases, object-relational databases, NoSQL,…
Transactional databases, data warehouses
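As a small illustration of mining across different data sources, the sketch below reads the same kind of record from a flat file and from a relational database and brings both into one representation. The table name `sales`, the column names, and the data are hypothetical; an in-memory CSV and an in-memory SQLite database stand in for real sources.

```python
import csv
import io
import sqlite3

# Flat file: an in-memory CSV stands in for a file on disk (hypothetical data).
flat_file = io.StringIO("tid,item\n1,milk\n1,bread\n2,milk\n")
flat_rows = [(int(r["tid"]), r["item"]) for r in csv.DictReader(flat_file)]

# Relational database: an in-memory SQLite table with the same schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(3, "milk"), (3, "eggs")])
db_rows = conn.execute("SELECT tid, item FROM sales ORDER BY tid, item").fetchall()
conn.close()

# Both sources now share one representation and can be mined together:
# group the items of each transaction id into a set.
transactions = {}
for tid, item in flat_rows + db_rows:
    transactions.setdefault(tid, set()).add(item)
print(transactions)
```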
3.1. DATA MINING
[Figure: data mining draws on multiple disciplines: database technology, statistics, machine learning, visualization, information science, and other disciplines]
[Figure: statistics in data mining. Descriptive statistics: data description. Inductive statistics: prediction and induction. Question posed: are the distributions of data in the two similar?]
3.1. DATA MINING: MACHINE LEARNING
[Figure: machine learning branches: unsupervised (“natural groupings”), supervised, reinforcement]
3.1. DATA MINING: VISUALIZATION
Makes knowledge more meaningful to users
Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
Knowledge (mining results): pie charts, scatter plots, box plots, association rules, parallel coordinates, temporal evolution
Part 3
3.2. DATA MINING TASKS
Data description
Classification
Prediction
Clustering
Association rule mining
Trend analysis
Outlier analysis
Similarity analysis
…
3.2. DATA MINING TASKS
[Figure: example data. A table with columns Tid, Refund, Marital Status, Taxable Income, Cheat; an item labeled Milk]
3.2. DATA MINING TASKS: MAIN FACTORS
Five main factors describe a data mining task:
1. Task-relevant data
2. Expected knowledge
3. Background knowledge
4. Interestingness measures
5. Pattern evaluation and knowledge presentation
3.2. DATA MINING TASKS: MAIN FACTORS
Task-relevant data: data sources, data types, selected features/dimensions, names of DBs, data warehouses, data tables, objects, or documents, criteria for selecting data,…
Expected knowledge: corresponds to the specific mining task to be executed: classification, clustering, association rules, prediction,…
3.2. DATA MINING TASKS: MAIN FACTORS
Background knowledge:
Domain knowledge: finance, education, healthcare,…
Supports DM processes: training and evaluating models
Interestingness measures:
Each is a score/measure with a threshold
Used to train the model and evaluate the results
Different tasks use different measures
Should be simple, certain, useful, and novel
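To make "a score/measure with a threshold" concrete, the sketch below uses support and confidence, two standard measures for association rules. Choosing these two measures, the threshold values, and the toy transactions are assumptions made for illustration; the slides do not fix any particular measure.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Thresholds are assumed values; in practice the analyst chooses them.
MIN_SUPPORT = 0.4
MIN_CONFIDENCE = 0.6

def is_interesting(antecedent, consequent, transactions):
    """A rule is kept only if both measures reach their thresholds."""
    if support(antecedent | consequent, transactions) < MIN_SUPPORT:
        return False
    return confidence(antecedent, consequent, transactions) >= MIN_CONFIDENCE

print(is_interesting({"milk"}, {"bread"}, transactions))
print(is_interesting({"eggs"}, {"milk"}, transactions))
```

Here the rule milk ⇒ bread passes both thresholds, while eggs ⇒ milk is rejected because its support is too low.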
3.2. DATA MINING TASK
[Figure: Task-relevant Data → Algorithms → Interesting Patterns (Knowledge), within KDD]
3.2. DATA MINING ALGORITHM: MAIN ELEMENTS
Model or pattern structures
Model: presents the dataset in a global view
Pattern: presents characteristics of a subset of the dataset (local view), e.g., some records/objects, or those that satisfy conditions on some variables
Structure: a general function whose parameter values are not yet defined, used to describe a model or a pattern
⇒ Model structure: a global summary of the dataset
Ex. Y = aX + b is a model structure, and Y = 3X + 2 is a specific model defined from that model structure
⇒ Pattern structure: a summary of a sub-dataset
Ex. p(Y > y1 | X > x1) = p1 is a pattern structure
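The difference between a structure and a specific model or pattern can be sketched in a few lines. The structure Y = aX + b becomes a specific model once a and b are estimated from data; the structure p(Y > y1 | X > x1) = p1 becomes a specific pattern once p1 is estimated. The data, the function names, and the use of least squares as the fitting criterion are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Fill in the parameters of the model structure Y = aX + b
    using ordinary least squares (an assumed score function)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def conditional_prob(xs, ys, x1, y1):
    """Fill in p1 of the pattern structure p(Y > y1 | X > x1) = p1
    as a relative frequency over the records with X > x1."""
    subset = [y for x, y in zip(xs, ys) if x > x1]
    return sum(y > y1 for y in subset) / len(subset)

# Hypothetical data generated from Y = 3X + 2.
xs = [0, 1, 2, 3, 4]
ys = [3 * x + 2 for x in xs]

a, b = fit_line(xs, ys)                        # global view: one model for all records
p1 = conditional_prob(xs, ys, x1=1, y1=10)     # local view: only records with X > 1
print(a, b, p1)
```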
3.2. DATA MINING ALGORITHM: MAIN ELEMENTS
Optimization and search methods
Objective: to identify the structures, models, and patterns (with specific parameter values) from the datasets that best fit the expected score function
State space: a set of discrete states
Searching: begin at a particular state (e.g., a node in the space) and search the state space until finding a state that fits the score function “best”
Methods: various approaches: greedy strategies, heuristics, evolutionary algorithms,…
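A greedy search over a discrete state space can be sketched as hill climbing. In this illustration each state is an integer pair (a, b) for the model structure Y = aX + b, the score is the negative sum of squared errors, and the neighbors of a state differ by 1 in one parameter. All of these choices, and the data, are assumptions made for the example.

```python
# Hypothetical data generated from Y = 3X + 2.
xs = [-2, -1, 0, 1, 2]
ys = [3 * x + 2 for x in xs]

def score(state):
    """Score function: negative sum of squared errors (higher is better)."""
    a, b = state
    return -sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

def neighbors(state):
    """States reachable in one move: change a or b by 1."""
    a, b = state
    return [(a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)]

def hill_climb(start):
    """Greedy strategy: always move to the best neighbor,
    and stop when no neighbor improves the score."""
    current = start
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            return current
        current = best

best_state = hill_climb((0, 0))
print(best_state, score(best_state))
```

On this data the search reaches the exact parameters. In general a greedy search can stop at a local optimum, which is one reason heuristic and evolutionary methods are also used.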
3.2. DATA MINING ALGORITHM: MAIN ELEMENTS
Data management strategy
Depends on data size, types,…
Small to medium: load all data into main memory for processing
Large/big: stored on disks or in distributed systems; parts are processed concurrently in memory
Supports storage, indexing, and retrieval
Improves the efficiency, scalability,… of data mining approaches
Database technologies can help
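For data that does not fit in main memory, one simple strategy is to process it part by part and keep only running aggregates. The sketch below computes a mean this way. It is sequential rather than concurrent, and the column name `income`, the chunk size, and the data are hypothetical; an in-memory CSV stands in for a large file on disk.

```python
import csv
import io
from itertools import islice

def read_chunks(file_obj, chunk_size):
    """Yield lists of at most `chunk_size` rows, so only one chunk
    is held in memory at a time."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

def chunked_mean(file_obj, column, chunk_size=2):
    """Mean of `column`, computed from per-chunk running totals."""
    total, count = 0.0, 0
    for chunk in read_chunks(file_obj, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count

data = io.StringIO("income\n125\n100\n70\n120\n95\n")
result = chunked_mean(data, "income")
print(result)
```

The same pattern extends to concurrent or distributed processing when each chunk's partial result (here a sum and a count) can be combined afterwards.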
Faculty of Computer Science and Engineering
Ho Chi Minh City University of Technology
Part 4
3.3. DATA MINING PROCESSES (DMP)
A DMP is a sequence of iterative and interactive steps that starts with raw data and ends with knowledge of interest. It provides…
A systematic way to conduct (plan and manage) a KDD project
Assurance that the KDD project is optimized
3.4. DATA MINING SYSTEMS
A common structure of a DM system
3.4. DATA MINING SYSTEMS
Database, data warehouse, World Wide Web, and information repositories: data/information sources used for DM
Database or data warehouse server: physical data sources that prepare integrated and relevant data for DM
Knowledge base: domain/background knowledge
Data mining engine: conducts DM tasks
Pattern evaluation module: uses interestingness measures (score functions) and thresholds; can be integrated into the data mining engine
3.4. DATA MINING SYSTEMS
User interface: supports user interaction with the system:
Specify data mining tasks, queries,…
warehouse
Evaluate mining results
3.4. DATA MINING SYSTEMS
Features used to examine a DM system
Data types
Data sources
Tasks/functions and methods
Connectivity with data sources: DBs, data warehouses, WWW, spreadsheets,…
Scalability, robustness,…
Visualization capability
3.4. DATA MINING SYSTEMS
Systems related to DM systems
Statistical data analysis systems
Machine learning systems
Information retrieval systems
Deductive database systems
Database systems
Data warehouses
…
4. ROLE OF DM
[Figure: revolution of DB technologies. Data Collection and Database Creation (1960s and earlier); Database Management Systems (1970s-early 1980s)]
5. SUMMARY
DM/KDD: extracts interesting patterns from large DBs
Discovered knowledge must be understandable, useful, nontrivial, valid, and evaluable
Data sources: various
DM tasks: description, prediction, classification, clustering, association rule mining, correlation, outlier analysis, trend analysis,…
5 factors: task-relevant data, expected knowledge, background knowledge, interestingness measures, knowledge visualization
4 elements: model/pattern structure, score function, optimization methods, data management
5. SUMMARY
7 steps in KDD: data cleaning, data integration, data selection, data transformation, DM, pattern evaluation, and knowledge presentation
DM is the main component of KDD (the two terms are sometimes used interchangeably)
Related fields: DB technologies, statistics, machine learning, computer science, visualization,…