2-Introduction To Data Mining, Steps in Data Mining Process-31-07-2024
11/25/24 1
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation,
…
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
We are data rich, but information poor.
“Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
What Is Data Mining?
Alternative names
Knowledge discovery (mining) in databases
(KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging,
information harvesting, business intelligence,
etc.
KDD: A Definition
KDD is the automatic or semi-automatic
extraction of non-obvious, hidden knowledge
from large volumes of data.
From Data to Knowledge
Knowledge Discovery (KDD) Process
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Data Selection → Task-relevant Data]
KDD Process - Steps
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis
task are retrieved from the database)
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by
performing summary or aggregation operations)
5. Data mining (an essential process where intelligent
methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)
7. Knowledge presentation (where visualization and
knowledge representation techniques are used to
present the mined knowledge to the user)
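The seven steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two sources (`store_a`, `store_b`), their field names, and the "pattern" being mined (total amount per region) are all hypothetical and chosen to keep each step visible.

```python
from collections import defaultdict

# Hypothetical sources with different field names (assumed data).
store_a = [
    {"id": 1, "region": "north", "amount": 120.0},
    {"id": 2, "region": "north", "amount": None},  # missing value
    {"id": 3, "region": "south", "amount": 80.0},
]
store_b = [
    {"txn": 4, "area": "south", "total": 200.0},
    {"txn": 5, "area": "east", "total": 40.0},
]

def clean(rows, amount_field):
    """Step 1: drop records whose amount is missing."""
    return [r for r in rows if r[amount_field] is not None]

def integrate(a, b):
    """Step 2: map source B onto source A's schema and combine them."""
    mapped = [{"id": r["txn"], "region": r["area"], "amount": r["total"]} for r in b]
    return a + mapped

def select(rows, regions):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in rows if r["region"] in regions]

def transform(rows):
    """Step 4: aggregate to a total amount per region."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Step 5: a deliberately trivial 'pattern': regions ranked by total."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(patterns, min_total):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return [p for p in patterns if p[1] >= min_total]

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{region}: {total:.2f}" for region, total in patterns]

combined = integrate(clean(store_a, "amount"), clean(store_b, "total"))
totals = transform(select(combined, {"north", "south"}))
report = present(evaluate(mine(totals), min_total=150.0))
print(report)
```

In practice the mining step would be a real algorithm (classification, clustering, association analysis); the ranking here only stands in for it.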
Architecture of a Typical Data Mining System
Database, data warehouse, World
Wide Web, or other information
repository:
One or a set of databases, data warehouses,
spreadsheets, or other kinds of information
repositories.
Data cleaning and data integration techniques
may be performed on the data.
Contd….
Knowledge base:
Knowledge is used to guide the search or
evaluate the interestingness of resulting
patterns.
Knowledge can include concept hierarchies,
used to organize attributes or attribute values
into different levels of abstraction.
Knowledge such as user beliefs, which can be
used to assess a pattern’s interestingness based
on its unexpectedness, may also be included.
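A concept hierarchy can be pictured as a mapping from low-level attribute values to higher-level concepts. The sketch below assumes a hypothetical city-to-country hierarchy for a location attribute; the names `CITY_TO_COUNTRY` and `generalize` are illustrative only.

```python
from collections import Counter

# Hypothetical concept hierarchy: city (low level) -> country (high level).
CITY_TO_COUNTRY = {
    "Chennai": "India",
    "Mumbai": "India",
    "Chicago": "USA",
    "Toronto": "Canada",
}

def generalize(values, hierarchy):
    """Replace each value with its parent concept; unknown values are kept."""
    return [hierarchy.get(v, v) for v in values]

cities = ["Chennai", "Chicago", "Mumbai", "Toronto", "Chennai"]
by_country = Counter(generalize(cities, CITY_TO_COUNTRY))
print(by_country)
```

Mining at the country level instead of the city level is one way of moving an attribute to a higher level of abstraction.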
Contd…
Data mining engine:
Consists of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis,
and evolution analysis.
Contd….
User interface:
Communicates between users and the data
mining system
Allows the user to interact with the system by
specifying a data mining query or task
Provides information to help focus the search
Supports exploratory data mining based on
the intermediate data mining results
Allows the user to browse database and data
warehouse schemas or data structures, evaluate
mined patterns, and visualize the patterns in
different forms
Data Mining and Business
Intelligence
Layers, from highest to lowest potential to support business decisions:
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Mining
Prediction Methods
using some variables to predict unknown or
future values of other variables
Descriptive Methods
finding human-interpretable patterns
describing the data
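The difference between the two families can be illustrated on one small data set. This is a sketch on hypothetical numbers (advertising spend and sales): the descriptive part summarizes the data already seen, while the predictive part fits a least-squares line and uses it to estimate an unseen value.

```python
from statistics import mean

# Hypothetical observations: advertising spend (x) and sales (y).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

# Descriptive: a human-readable summary of the existing data.
summary = {"mean_spend": mean(spend), "mean_sales": mean(sales)}

# Predictive: fit sales = slope * spend + intercept by ordinary least squares.
x_bar, y_bar = mean(spend), mean(sales)
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(spend, sales))
    / sum((x - x_bar) ** 2 for x in spend)
)
intercept = y_bar - slope * x_bar

def predict(x):
    """Estimate sales for a spend value that was not observed."""
    return slope * x + intercept

print(summary)
print(predict(5.0))
```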
Why Not Traditional Data
Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as
terabytes of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Multi-Dimensional View of Data
Mining
Data to be mined
Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series, text,
multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web mining, etc.
Multi-Dimensional View of Data
Mining
Data to be mined
1. Relational
2. Data warehouse
3. Transactional
4. Stream
5. Object-oriented
6. Temporal Databases, Sequence Databases, and Time-Series
Databases
7. Spatial and Spatiotemporal
8. Heterogeneous Databases and Legacy Databases
9. Text and multi-media
10. WWW
1. Relational
A database system, also called a database
management system (DBMS), consists of a collection
of interrelated data, known as a database, and
a set of software programs to manage and access
the data.
The software programs involve mechanisms for the
definition of database structures; for data storage;
for concurrent, shared, or distributed data access;
and for ensuring the consistency and security of the
information stored, despite system crashes or
attempts at unauthorized access.
Contd…..
A relational database is a collection of tables,
each of which is assigned a unique name.
Each table consists of a set of attributes
(columns or fields) and usually stores a large
set of tuples (records or rows).
Each tuple in a relational table represents an
object identified by a unique key and
described by a set of attribute values.
A semantic data model, such as an entity-
relationship (ER) data model, is often
constructed for relational databases.
An ER data model represents the database
as a set of entities and their relationships.
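The terms above (table, attribute, tuple, key, relationship) can be made concrete with Python's built-in `sqlite3` module. The schema is a hypothetical one with two entities, `customer` and `purchase`, related through `cust_id`; the table and column names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each table has a unique name and a set of attributes (columns).
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.execute(
    "CREATE TABLE purchase ("
    "trans_id INTEGER PRIMARY KEY, "
    "cust_id INTEGER REFERENCES customer(cust_id), "
    "amount REAL)"
)

# Each tuple (row) is identified by a unique key and described by attribute values.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Asha", "Chennai"), (2, "Ben", "Chicago")],
)
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?, ?)",
    [(100, 1, 25.0), (101, 1, 40.0), (102, 2, 10.0)],
)

# The customer-purchase relationship is followed with a join on the shared key.
rows = conn.execute(
    "SELECT c.name, SUM(p.amount) "
    "FROM customer c JOIN purchase p ON p.cust_id = c.cust_id "
    "GROUP BY c.cust_id ORDER BY c.cust_id"
).fetchall()
conn.close()
print(rows)
```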
2. Data warehouse
A repository of information collected from
multiple sources, stored under a unified
schema, and that usually resides at a single
site.
3. Transactional
Consists of a file where each record
represents a transaction.
A transaction typically includes a unique
transaction identity number (trans ID) and a
list of the items making up the transaction
(such as items purchased in a store).
The transactional database may have
additional tables associated with it, which
contain other information regarding the
sale, such as the date of the transaction,
the customer ID number, the ID number of
the salesperson and of the branch at which
the sale occurred, and so on.
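A transactional data set of this shape can be held as a mapping from transaction ID to item list. The sketch below uses hypothetical transactions and computes two common measures over them, support and confidence, for the rule "bread implies milk"; the item names and IDs are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: trans ID -> items purchased.
transactions = {
    "T100": ["bread", "milk"],
    "T200": ["bread", "butter", "milk"],
    "T300": ["butter"],
    "T400": ["bread", "butter"],
}

# Count in how many transactions each item and each item pair appears.
item_counts = Counter(
    item for items in transactions.values() for item in set(items)
)
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(set(items)), 2)
)

n = len(transactions)
# Support: fraction of all transactions containing both bread and milk.
support_bread_milk = pair_counts[("bread", "milk")] / n
# Confidence: of the transactions containing bread, the fraction that also contain milk.
confidence_bread_to_milk = pair_counts[("bread", "milk")] / item_counts["bread"]

print(support_bread_milk, confidence_bread_to_milk)
```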
4. Stream
Data flow in and out of an observation
platform (or window) dynamically
Unique features:
huge or possibly infinite volume
dynamically changing
flowing in and out in a fixed order
allowing only one or a small number of scans
demanding fast (often real-time) response time.
Typical examples of data streams include
various kinds of scientific and engineering
data, time-series data, and data produced
in other dynamic environments, such as
power supply, network traffic, stock
exchange, telecommunications, Web click
streams, video surveillance, and weather or
environment monitoring.
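The single-scan constraint on stream data can be illustrated with a sliding-window average that sees each value exactly once and keeps only the most recent readings in memory. The class name `SlidingWindowMean` and the sample readings are hypothetical.

```python
from collections import deque

class SlidingWindowMean:
    """Running mean over the last `size` values of a stream, in one pass."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def update(self, value):
        # When the window is full, the oldest value is about to be evicted.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)

monitor = SlidingWindowMean(size=3)
means = [monitor.update(v) for v in [10, 20, 30, 40, 50]]
print(means)
```

Memory use is bounded by the window size regardless of how long the stream runs, which is the property a potentially infinite stream requires.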
5. Object-oriented
Each entity is considered as an object
Contd…
Suppose that the class salesperson is a
subclass of the class employee.
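The subclass relationship can be sketched directly as class inheritance. The attributes (`name`, `salary`, `commission`) and methods are hypothetical; the slide only states that salesperson is a subclass of employee.

```python
class Employee:
    """An entity modeled as an object with its own attributes and methods."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def describe(self):
        return f"{self.name} earns {self.salary}"

class SalesPerson(Employee):
    """Inherits everything from Employee and adds a commission attribute."""

    def __init__(self, name, salary, commission):
        super().__init__(name, salary)
        self.commission = commission

    def total_pay(self):
        return self.salary + self.commission

rep = SalesPerson("Asha", 1000, 200)
print(rep.describe(), rep.total_pay())
```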
8. Heterogeneous Databases
and Legacy Databases
A heterogeneous database consists of a
set of interconnected, autonomous
component databases.
9. Text and multi-media
Text databases are databases that contain
word descriptions for objects.
Contd…
Some text databases may be somewhat
structured, that is, semi-structured (such as
e-mail messages and many HTML/XML Web
pages).
Contd….
Example: document classification
10. WWW
Distributed information services, such as
Yahoo!, Google, America Online, and
AltaVista, provide rich, worldwide, on-line
information services, where data objects are
linked together to facilitate interactive access.
Knowledge to be mined
Generalization, Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier analysis, etc.
Confidence certainty
Cluster Analysis
Clustering analyses data objects without
consulting a known class label.
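Grouping without class labels can be sketched with a small one-dimensional k-means loop. k-means is only one of many clustering methods and is used here as an assumption, since the slide does not name an algorithm; the points and starting centers are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

No label is supplied anywhere: the two groups emerge only from the distances between the points.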
Time and Ordering: Sequential
Pattern, Trend and Evolution Analysis
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis: e.g.,
regression and value prediction
Sequential pattern mining
e.g., first buy digital camera, then buy large SD
memory cards
Periodicity analysis
Motifs and biological sequence analysis
Approximate and consecutive motifs
Similarity-based analysis
Mining data streams
Ordered, time-varying, potentially infinite, data
streams
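The camera-then-memory-card example of a sequential pattern can be checked with a short subsequence test. The purchase histories below are hypothetical, and the helper `contains_sequence` is an illustrative name; it counts a customer as supporting the pattern only if the items appear in that order, with other purchases allowed in between.

```python
# Hypothetical purchase histories, each in chronological order.
purchase_histories = {
    "c1": ["digital camera", "tripod", "SD card"],
    "c2": ["SD card", "digital camera"],
    "c3": ["digital camera", "SD card"],
    "c4": ["laptop"],
}

def contains_sequence(history, pattern):
    """True if the items of `pattern` occur in `history` in order (gaps allowed)."""
    remaining = iter(history)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)

pattern = ["digital camera", "SD card"]
supporting = [
    customer
    for customer, history in purchase_histories.items()
    if contains_sequence(history, pattern)
]
support = len(supporting) / len(purchase_histories)
print(supporting, support)
```

Customer `c2` bought both items but in the opposite order, so it does not support the pattern; this is what separates sequential patterns from plain co-occurrence.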
Evolution Analysis
Evolution Analysis: Data evolution analysis
describes and models regularities or trends for
objects whose behavior changes over time.
A person could be in multiple information networks: friends,
family, classmates, …
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, …
Techniques utilized
Machine learning, statistics, visualization, etc.
Contd….
DM is an interdisciplinary field
Set of disciplines including database
systems, statistics, machine learning,
visualization, and information science.
Other disciplines: neural networks, fuzzy
logic or rough set theory, knowledge
representation, etc.
Statistics is the study of the collection, organization, analysis,
interpretation and presentation of data.
Machine learning, a branch of artificial intelligence, concerns
the construction and study of systems that can learn from data.
For example, a machine learning system could be trained on
email messages to learn to distinguish between spam and
non-spam messages. Examples: decision trees, neural networks, etc.
A database is an organized collection of data.
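The spam example can be sketched with a tiny naive Bayes classifier built from word counts. Naive Bayes is an assumption here, chosen for brevity; the slide names trees and neural networks as examples and does not prescribe a method. The four training messages are made up, and "ham" is used as the label for non-spam.

```python
import math
from collections import Counter

# Hypothetical labeled training messages.
training = [
    ("win cash prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]

# Learn word frequencies per class from the data.
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in training:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])

def log_score(text, label):
    """Log prior plus log likelihood of each word, with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.split():
        score += math.log(
            (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        )
    return score

def classify(text):
    return max(("spam", "ham"), key=lambda label: log_score(text, label))

print(classify("cash prize offer"))
print(classify("monday project meeting"))
```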
KDD Process: A Typical View from ML
and Statistics
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web mining, etc.
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)