0% found this document useful (0 votes)
1 views

DM_C1_Overview

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
1 views

DM_C1_Overview

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 55

Faculty of Computer Science and Engineering

Ho Chi Minh City University of Technology

Data Mining
Course ID: 055131
Chapter 1: Overview

Assoc. Prof. TRAN MINH QUANG


1 [email protected]
http://researchmap.jp/quang

2023/2/6
DATA MINING: A QUICK GLANCE
Knowledge�
Information/
Mining�
Data�

2
2023/2/6

2
CONTENT
1. Practical situations with Data mining
2. Knowledge discovery
3. Main concepts
4. Roles of data mining
5. Applications
6. Summary

3
1. SITUATION 1

Estimating whether the


guy who using a credit
card with ID = 123456
is its owner of not

4
1. SITUATION 2
Predict the price of
a stock, e.g., STB

5
1. SITUATION 3
500
450 BG_MaxUpNoPkts
400
350 FG_MaxUpNoPkts
300
Packets

250
200
150
100
50
0
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
Flow Instance

Network attacking detection based on traffic


analysis 6
1. THE FACT…

7
We are data rich, but information poor
“Necessity is the mother of invention” - Plato
2. KNOWLEDGE DISCOVERY FROM DATABASE
(KDD)
 “Knowledge discovery in databases is the nontrivial
process of identifying valid, novel, potentially useful, and
ultimately understandable patterns”
 Frawley, W. J et al. (1991). Knowledge discovery in databases: an
overview.

 “Knowledge discovery from databases is the process of


using the database along with any required selection,
preprocessing, sub-sampling, and transformations of it; to
apply data mining methods (algorithms) to enumerate
patterns from it; and to evaluate the products of data
mining to identify the subset of the enumerated patterns
deemed knowledge.”
 Fayyad, U.M et al. (1996). Advances in Knowledge Discovery and
Data Mining. MIT Press. 8
2. KDD…
Pattern Evaluation/
Presentation

Data Mining Patterns

Task-relevant Data

Data Warehouse Selection/Transformation

Data
Cleaning
Data Integration
9

Data Sources
2. KDD…
…is an iterative process with following main steps:
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation 10
2. KDD…
… each step in KDD process may work with
 Data sources (many types)

 Data warehouse

 Task-relevant data

 Patterns

 Knowledge
11
2. KDD IN THE DATA MANAGEMENT
PYRAMID
Increasing potential
to support
Making End User
business decisions
Decisions

Data Presentation Business


Visualization Techniques Analyst
Data Mining Data
Information Discovery
Analyst
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
Data Sources DBA 12
Paper, Files, Information Providers, Database Systems, OLTP
Faculty of Computer Science and Engineering
Ho Chi Minh City University of Technology

Data Mining
Course ID: CO3029
Chapter 1: Overview
Part 2
Assoc. Prof. TRAN MINH QUANG
13 [email protected]
http://researchmap.jp/quang

2023/2/6
CONTENT
1. Practical situations with Data mining
2. Knowledge discovery
3. Main concepts
4. Roles of data mining
5. Applications
6. Summary

14
3. MAIN CONCEPTS IN KDD
 Data mining
 Data mining tasks/functions
 Data mining processes
 Data mining systems

15
3.1. DATA MINING (DM)
DM is a process of …
 “extracting or mining knowledge from large
amounts of data”
 “knowledge mining from data”

 “nontrivial extraction of implicit, previously

unknown, and potentially useful information


from data”

16
3.1. DATA MINING
Similar/common terms
 knowledge discovery/mining in
data/databases (KDD)
 knowledge extraction
 data/pattern analysis
 data archaeology, data dredging
 information harvesting
 business intelligence

17
3.1. DATA MINING: DATA SOURCES
 DM from large amounts of data…
 Any types: structure, non-structure, semi-structure
from various data sources
 Data sources
 Flat files
 Databases: relational databases, object-relational databases,
NoSQL,…
 Transactional databases, data warehouses

 From various domains: spatial databases, temporal databases,


spatio-temporal databases, time series databases,
text/docuement databases, multimedia databases, …
 Data from the web: WWW, Social networks…

 Pack or streaming data sources: Data warehouse for BI,


18

real-time IoT systems,….


3.1. DATA MINING: KNOWLEDGE
 The mined/analyzed knowledge can be:
 Description of data classes
 Frequent patterns, Association patterns
 Classification and Prediction
 Clustering Model
 Outliers
 Trends of behaviors, data,…
 …
19
3.1. DATA MINING: KNOWLEDGE
 Categories of mined/analyzed knowledge :
 Descriptive: Describe common characteristics of objects in the
dataset (situation 1)
 Predictive): Capability to infer/predict new information based on
current data (situations 2, 3)

20
3.1. DATA MINING
Machine
Statistics
Learning

Database
Technology Data Mining

Visualization
Other
Disciplines Information
science

“Data mining as a confluence of multiple disciplines” 21


3.1. DATA MINING: DB TECHNOLOGIES
 Database technologies help efficiently manage
data for mining
 Big data: paging, swapping from disk to memory,
distributed data managements, support for data
streaming,…
 Integrated, categorized in based on different
dimensions (e.g., data warehouse)
 Support various data types: spatial, temporal,
spatiotemporal, multimedia, text, Web, …
 Concurrency control, security, query optimization,…
 Provide data mining functions:
• Oracle Data Mining (Oracle 9i, 10g, 11g)
• SQL Server analyzers, Azure machine learning,…
• Intelligent Miner (IBM) 22

• Standard SWL/MM 6: Data Mining by ISO/IEC 13249-


6:2006
3.1. DATA MINING: STATISTICS
Statistics

Descriptive Inductive
Statistics Statistics

Prediction
Data description and Induction

Distributions of data
in the two are
similar? 23
3.1. DATA MINING: MACHINE LEARNING
Machine Learning

Unsupervised Supervised
“Natural groupings”
Reinforcement

24
3.1. DATA MINING: VISUALIZATION
 Improve the meaning of knowledge to users
 Data: 3D cubes, distribution charts, curves, surfaces,
link graphs, image frames and movies, parallel
coordinates
 Knowledge (mining results): pie charts, scatter plots,
box plots, association rules, parallel coordinates,
temporal evolution

25

Pie chart Parallel coordinates Temporal evolution


3.1. DATA MINING: VISUALIZATION
 Labeling mined classes/clusters
Isodata (K-means)
Clustering

Mean Feature Image Label Image


26
Faculty of Computer Science and Engineering
Ho Chi Minh City University of Technology

Data Mining
Course ID: CO3029
Chapter 1: Overview
Part 3
Assoc. Prof. TRAN MINH QUANG
27 [email protected]
http://researchmap.jp/quang

2023/2/6
3.2. DATA MINING TASKS
 Data description
 Classification
 Prediction
 Clustering
 Association rule mining
 Trend analysis
 Outlier
 Similarity analysis
…
28
3.2. DATA MINING TASKS

Data
Tid Refund Marital Taxable
Status Income Cheat

1 Yes Single 125K No


2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
11 No Married 60K No
12 Yes Divorced 220K No
13 No Single 85K Yes
14 No Married 75K No
15 No Single 90K Yes
10

29
Milk
3.2. DATA MINING TASKS: MAIN FACTORS
 5 main factors describe a data mining task
1. Task-relevant data
2. Expected knowledge
3. Background knowledge
4. Interestingness measures
5. Pattern evaluations and knowledge
presentation
30
3.2. DATA MINING TASKS: MAIN FACTORS
 Task-relevant data: data sources, data
types, selected features/dimensions, name
of DBs, data warehouse, data tables or
objects or documents, criteria for selection
data,…
 Expected knowledge: corresponds to a
specific mining task which will be
executed: classification, clustering,
association rules, prediction,…. 31
3.2. DATA MINING TASKS: MAIN FACTORS
 Background knowledge:
 Domain knowledge: finance, education,
healthcare,…
 Supports DM processes: training, evaluating
models
 Interestingness measures:
 With a score/measure, and has a threshold
 Use for train the model and evaluate the
results
 Different tasks use different measure
 Needs to be simple, certain, useful and novel
32
3.2. DATA MINING TASKS: MAIN FACTORS

 Patternevaluation and knowledge


presentation (ref. slide 25, 26 in Sect. 3.1):
Rules, tables, reports, charts, graphs, trees,
cubes,…

33
3.2. DATA MINING TASK

Interesting
Task-relevant Giải
Giải
Algori Patterns
Data Thuật
Thuật (Knowledge)
thms

Data mining tasks

KDD 34
3.2. DATA MINING ALGORITHM: MAIN
ELEMENTS

 4 main elements constitute a data mining


algorithm
 Model or pattern structures
 Score function
 Optimization and search method
 Data management strategy

35
3.2. DATA MINING ALGORITHM:
MAIN ELEMENTS
 Model or pattern structures
 Model: Presents the dataset in a global view
 Pattern: Presents characteristics of a subset of the
dataset (local view), e.g., for some records /objects or
satisfy with some variables
 Structure: a general function where parameters’ values
are not defined to describe a model or a pattern
⇒ Model structure: a global summary of the dataset
Ex. Y = aX + b is a model structure and Y = 3X + 2 is a specific
model defined from the above model structure
=> Pattern structure: a summary of a sub-dataset
Ex. p(Y>y1|X>x1) = p1 is a pattern structure and 36

p(Y>5|X>10) = 0.5 is a particular pattern


3.2. DATA MINING ALGORITHM:
MAIN ELEMENTS
 Score function
 used to examine how effective/good/relevant a
model/pattern present a dataset (by score)
 used to compare different data mining models,
methods,…
 should not be depended on the dataset and easy to be
computed. Ex. likelihood, sum of squared errors,
misclassification rate,…

37
3.2. DATA MINING ALGORITHM:
MAIN ELEMENTS
 Optimization and search methods
 Objective: To identify the structures and models,
patterns (with specific parameters’ values) from the
datasets that fit with the expected score function.
 State space: A set of discrete states
 Searching: begin at a particular state (e.g., at a node in
the space), searches in the state space until finding a
specific state that is “best” fit with the score
function
 Methods: Various approaches: greedy strategy, heuristics,
revolution algorithms,…
38
3.2. DATA MINING ALGORITHM:
MAIN ELEMENTS
 Data management strategy
 Depending on data size, types,…
Small to medium: Load all to the main memory
to process
Large/big: Stored in disks/distributed systems.
Parts are concurrently processed in memory
 Support for storage, indexing, retrieving
 Improve the efficiency, scalability,… of the
data mining approaches
 Database technologies can help

39
Faculty of Computer Science and Engineering
Ho Chi Minh City University of Technology

Data Mining
Course ID: CO3029
Chapter 1: Overview
Part 4
Assoc. Prof. TRAN MINH QUANG
40 [email protected]
http://researchmap.jp/quang

2023/2/6
3.3. DATA MINING PROCESSES (DMP)
 DMP is iterative and interactive steps starting
with raw data and completing with knowledge of
interest. It presents…
 A systematically way to conduct (plan and
manage) a KDD project
 Assure that the KDD project is optimized

Some standard processes:


 Cross Industry Standard Process for Data
Mining (CRISP-DM at www.crisp-dm.org)
 SEMMA (Sample, Explore, Modify, Model, 41

Assess) at the SAS Institute


3.3. DATA MINING PROCESSES (DMP)
 CRISP-DM

 Initiated since 09/1996 with more than 200 members


 Open standard
 Support industry, applications and tools for KDD
 Focus on business requirement and technical analysis
 Provide framework for data mining processes
 Rich experiences from various application domains and
industries

42
3.3. DATA MINING PROCESSES (DMP)

43
3.4. DATA MINING SYSTEMS
A common structure of a DM system

44
3.4. DATA MINING SYSTEMS
 Database,data warehouse, World Wide Web,
and information repositories: Data/information
sources used for DM
 Database hay data warehouse server:
Physical data sources that prepare integrated and
relevant data for DM
 Knowledge base: Domain/background
knowledge
 Data mining engine: Conducts DM tasks
 Pattern evaluation module: Use
interestingness measure (score functions),
threshold which can be integrated in the Data
45

mining engine
3.4. DATA MINING SYSTEMS
 User interface: Support user interaction with the
system:
 To specify data mining tasks, query,…

 Search for and conduct sophisticated data

mining tasks using temporary mining results


 Verify input data from databases, data

warehouse
 Evaluate mining results

 Visualize the mined knowledge

46
3.4. DATA MINING SYSTEMS
 Features used to examine a DM system
 Data types
 Data sources
 Tasks/Functions and Methods
 Connecting with data sources: DBs, Data warehouse,
WWW, spreadsheets,…
 Scalabilities, robustness,…
 Visualization capability

47
3.4. DATA MINING SYSTEMS
 Related systems to DM ones
 Statistical data analysis systems
 Machine learning systems
 Information retrieval systems
 Deductive database systems
 Database systems
 Data warehouses
 …

48
4. ROLE OF DM
Data Collection and Database Creation
Revolution of DB (1960s and earlier)
technologies
Database Management Systems
(1970s-early 1980s)

Advanced Database Systems Web-based Database Systems


(mid-1980s-present) (1990s-present)

Advanced Data Analysis:


Data Warehousing and Data Mining
(late 1980s-present)

New Generation of Integrated Data and 49


Information Systems
(present-future)
4. ROLES OF DM
A modern technology in CS, ICT and
information management
 Available everywhere (ubiquitous) and invisible in
our life
 Vast application domains : sales, commerce, bank,
finance, insurance, entertainments, healthcares,
educations, economics, supply chain management,
production, science, tele-comunications, controls,…

50
5. SUMMARY
 DM/KDD: Extract interest patterns from large DB
 Discovered knowledge must be understandable,
useful, nontrivial, valid and evaluable
 Data sources: Various
 DM Tasks: Description, prediction, classification,
clustering, association rule mining, co-relation,
outlier, trends,…
 5 factors: Relevant data, expected knowledge,
background KN, measures, KN visualization
 4 elements: model/pattern structure, score
function, optimization methods, data management 51
5. SUMMARY
7 steps in KDD: Data cleansing, integration, data
selection, data transformation, DM, pattern evaluation,
and KN presentation
 DM is the main component in KDD (some time
interchangably used)
 Relatedfieds: DB technologies, statistics, machine
learning, computer science, visualization,…

52
REFERENCES
[1] Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and
Techniques”, 3rd Edition, Morgan Kaufmann Publishers, 2012.
[2] Trần Minh Quang, "Khai Phá Dữ Liệu và Kỹ Thuật Phân Lớp", NXB Đại Học
Quốc Gia TP. HCM, 2020.

[3] Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, "Mining of Massive


Datasets", 2nd Edition, Cambridge University Press 2014.
[4] Charu C. Aggarwal, "Data Classification: Algorithms and Applications"
(Chapman & Hall/CRC Data Mining and Knowledge Discovery Series) 1st
Edition, 2014.
[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, “Deep Learning”, The
MIT Press, 2016.

53
FURTHER READ
“[2] David Hand, Heikki Mannila, Padhraic Smyth, “Principles of Data Mining”, MIT
Press, 2001.
[3] David L. Olson, Dursun Delen, “Advanced Data Mining Techniques”, Springer-Verlag,
2008.
[4] Graham J. Williams, Simeon J. Simoff, “Data Mining: Theory, Methodology,
Techniques, and Applications”, Springer-Verlag, 2006.
[5] ZhaoHui Tang, Jamie MacLennan, “Data Mining with SQL Server 2005”, Wiley
Publishing, 2005.
[6] Oracle, “Data Mining Concepts”, B28129-01, 2008.
[7] Oracle, “Data Mining Application Developer’s Guide”, B28131-01, 2008.
[8] Ian H.Witten, Eibe Frank, “Data mining : practical machine learning tools and
techniques”, 2nd Edition, Elsevier Inc, 2005.
[9] Florent Messeglia, Pascal Poncelet & Maguelonne Teisseire, “Successes and new
directions in data mining”, IGI Global, 2008.
[10] Oded Maimon, Lior Rokach, “Data Mining and Knowledge Discovery Handbook”,
2nd Edition, Springer Science + Business Media, LLC 2005, 2010. 54
Q&A
[email protected]

55
2023/2/6

You might also like