2-Introduction To Data Mining, Steps in Data Mining Process-31-07-2024

Uploaded by

Shivanshu Tiwari
Copyright
© © All Rights Reserved
Introduction to Data Mining

11/25/24 1
Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web,
computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 We are data rich, but information poor.
 “Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
What Is Data Mining?

 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from large amounts of data
 Data mining: a misnomer?
 The mining of gold from rocks or sand is referred to as
gold mining rather than rock or sand mining.
 The mining of coal from rocks or sand is referred to as
coal mining.

What Is Data Mining?

 Alternative names
 Knowledge discovery (mining) in databases
(KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging,
information harvesting, business intelligence,
etc.

 Data mining: searching for knowledge
(interesting patterns) in your data.
KDD: A Definition

Simply stated, data mining refers to extracting or “mining” knowledge
from large amounts of data, usually automatically gathered.
KDD: A Definition
KDD is the automatic or semi-automatic
extraction of non-obvious, hidden knowledge
from large volumes of data.

1. 10^6 to 10^12 bytes: we never see the whole data set, so we put it in
the memory of computers.
2. Then run data mining algorithms.
3. What is the knowledge? How to represent and use it?
From Data to Knowledge

[Figure: example data table annotated with numerical attributes, categorical attributes, missing values, and class labels]

If (Headache = No AND Vomiting = Yes AND Temperature = High)
THEN Viral illness = Yes
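
A rule of this kind can be read as a predicate over one record. The following is a minimal sketch, assuming records are Python dictionaries; the function name, the sample patients, and the choice to report "Unknown" when the rule does not fire are illustrative assumptions, not part of the slides.

```python
def viral_illness_rule(record):
    """Return True when the slide's rule fires for this record."""
    return (
        record.get("Headache") == "No"
        and record.get("Vomiting") == "Yes"
        and record.get("Temperature") == "High"
    )


# Hypothetical patient records; None marks a missing value.
patients = [
    {"Headache": "No", "Vomiting": "Yes", "Temperature": "High"},
    {"Headache": "Yes", "Vomiting": "Yes", "Temperature": "High"},
    {"Headache": "No", "Vomiting": "Yes", "Temperature": None},
]

# The rule only asserts "Yes"; when it does not fire, nothing is concluded.
predictions = ["Yes" if viral_illness_rule(p) else "Unknown" for p in patients]
print(predictions)
```

A single rule covers only part of the data; a full rule set would need more rules, or a default class, for the records it leaves undecided.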

Knowledge Discovery (KDD) Process

 Data mining: core of the knowledge discovery process

[Figure: KDD process flow, bottom to top. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
KDD Process - Steps
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis
task are retrieved from the database)
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by
performing summary or aggregation operations)
5. Data mining (an essential process where intelligent
methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)
7. Knowledge presentation (where visualization and
knowledge representation techniques are used to
present the mined knowledge to the user)
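
The seven steps can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two in-memory "stores", the function names, and the frequency-count "pattern" are hypothetical stand-ins for real data sources and mining algorithms.

```python
from collections import Counter

# Hypothetical raw sources: (customer, item) purchases from two stores.
store_a = [("c1", "milk"), ("c2", "bread"), ("c3", None), ("c1", "bread")]
store_b = [("c4", "milk"), ("c4", "milk"), ("c5", "tea")]


def clean(rows):
    """Step 1: drop records with missing values and exact duplicates."""
    seen, out = set(), []
    for row in rows:
        if None in row or row in seen:
            continue
        seen.add(row)
        out.append(row)
    return out


def integrate(*sources):
    """Step 2: combine multiple sources into one data set."""
    return [row for source in sources for row in source]


def select(rows, items_of_interest):
    """Step 3: keep only the task-relevant records."""
    return [row for row in rows if row[1] in items_of_interest]


def transform(rows):
    """Step 4: aggregate the records into a count per item."""
    return Counter(item for _, item in rows)


def mine(counts):
    """Step 5: a trivial 'pattern': items ranked by frequency."""
    return counts.most_common()


def evaluate(patterns, min_count):
    """Step 6: keep patterns that meet an interestingness threshold."""
    return [(item, n) for item, n in patterns if n >= min_count]


def present(patterns):
    """Step 7: render the surviving patterns for the user."""
    return [f"{item}: bought {n} times" for item, n in patterns]


data = integrate(clean(store_a), clean(store_b))
selected = select(data, {"milk", "bread"})
report = present(evaluate(mine(transform(selected)), min_count=2))
print(report)
```

In practice the steps are iterated rather than run once, and each stage is far more involved than these one-liners suggest.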

Architecture of a typical data
mining system
 Database, data warehouse, World
Wide Web, or other information
repository:

One or a set of databases, data warehouses,
spreadsheets, or other kinds of information
repositories.

Data cleaning and data integration techniques
may be performed on the data.

 Database or data warehouse server:



Responsible for fetching the relevant data,
based on the user’s data mining request.

Contd….
 Knowledge base:

Knowledge is used to guide the search or
evaluate the interestingness of resulting
patterns.


Knowledge can include concept hierarchies,
used to organize attributes or attribute values
into different levels of abstraction.


Knowledge such as user beliefs, which can be
used to assess a pattern’s interestingness based
on its unexpectedness, may also be included.

Contd…
 Data mining engine:

Consists of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis,
and evolution analysis.

 Pattern evaluation module:



Focuses the search toward interesting patterns.

Filters out discovered patterns that fall below interestingness thresholds.

The pattern evaluation module may be integrated with the
mining module, depending on the implementation of the
data mining method used.

For efficient data mining, it is highly recommended to
push the evaluation of pattern interestingness as deep as
possible into the mining process.

Contd….
 User interface:

Communicates between users and the data
mining system

Allows the user to interact with the system by
specifying a data mining query or task

Provides information to help focus the search

Supports exploratory data mining based on
the intermediate data mining results

Allows the user to browse database and data
warehouse schemas or data structures, evaluate
mined patterns, and visualize the patterns in
different forms
Data Mining and Business
Intelligence
[Figure: pyramid of layers. The potential to support business decisions increases from bottom to top. Layers from top to bottom, with the typical role in parentheses:]
1. Decision Making (End User)
2. Data Presentation: Visualization Techniques (Business Analyst)
3. Data Mining: Information Discovery (Data Analyst)
4. Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
5. Data Preprocessing/Integration, Data Warehouses (DBA)
6. Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
Data Mining: Classification
Schemes
 General functionality
 Descriptive data mining
 Predictive data mining
 Different views, different classifications
 Kinds of databases to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques utilized
 Kinds of applications adapted

Data Mining
 Prediction Methods

using some variables to predict unknown or
future values of other variables

 Descriptive Methods

finding human-interpretable patterns
describing the data

Why Not Traditional Data
Analysis?
 Tremendous amount of data
 Algorithms must be highly scalable to handle terabytes of data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications

Multi-Dimensional View of Data
Mining
 Data to be mined
 Relational, data warehouse, transactional, stream, object-oriented/relational,
active, spatial, time-series, text, multimedia,
heterogeneous, legacy, WWW
 Knowledge to be mined
 Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Machine learning, statistics, visualization, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web mining, etc.

Multi-Dimensional View of Data
Mining
 Data to be mined
1. Relational
2. Data warehouse
3. Transactional
4. Stream
5. Object-oriented
6. Temporal Databases, Sequence Databases, and Time-Series
Databases
7. Spatial and Spatiotemporal
8. Heterogeneous Databases and Legacy Databases
9. Text and multi-media
10. WWW

1. Relational
 A database system, also called a database
management system (DBMS), consists of a collection of
interrelated data, known as a database, and a set of
software programs to manage and access the data.
 The software programs involve mechanisms for the
definition of database structures; for data storage;
for concurrent, shared, or distributed data access;
and for ensuring the consistency and security of the
information stored, despite system crashes or
attempts at unauthorized access.

Contd…..
 A relational database is a collection of tables,
each of which is assigned a unique name.
 Each table consists of a set of attributes
(columns or fields) and usually stores a large
set of tuples (records or rows).
 Each tuple in a relational table represents an
object identified by a unique key and
described by a set of attribute values.
 A semantic data model, such as an entity-
relationship (ER) data model, is often
constructed for relational databases.
 An ER data model represents the database
as a set of entities and their relationships.
2. Data warehouse
 A repository of information collected from
multiple sources, stored under a unified
schema, and that usually resides at a single
site.

 Constructed via a process of data cleaning,
data integration, data transformation, data
loading, and periodic data refreshing.

 “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of
management’s decision-making process.” (W. H. Inmon)
Contd…
 Usually modeled by a multidimensional
database structure
 Each dimension corresponds to an attribute
or a set of attributes in the schema
 Each cell stores the value of some
aggregate measure, such as count or sales
amount.
 The actual physical structure of a data
warehouse may be a relational data store or
a multidimensional data cube.
 A data cube provides a multidimensional
view of data and allows the precomputation
and fast accessing of summarized data.
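
To make the idea of precomputed aggregates concrete, here is a small sketch that materializes every cell of a tiny data cube. The fact records, the dimension names, and the use of "*" to mark an aggregated-away dimension are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact records: (item, city, year, sales_amount).
facts = [
    ("TV", "Delhi", 2023, 100),
    ("TV", "Delhi", 2024, 150),
    ("TV", "Mumbai", 2024, 200),
    ("PC", "Delhi", 2024, 300),
]


def build_cube(rows):
    """Precompute sum(sales_amount) for every combination of dimensions."""
    cube = defaultdict(int)
    for *dims, amount in rows:
        positions = range(len(dims))
        for k in range(len(dims) + 1):
            for kept in combinations(positions, k):
                # '*' marks a dimension that has been aggregated away.
                key = tuple(dims[i] if i in kept else "*" for i in positions)
                cube[key] += amount
    return cube


cube = build_cube(facts)
print(cube[("TV", "*", "*")])      # total TV sales
print(cube[("*", "Delhi", 2024)])  # all items sold in Delhi in 2024
print(cube[("*", "*", "*")])       # grand total
```

Once the cube is built, a summary query becomes a single dictionary lookup. The cost is storage, which grows quickly with the number of dimensions.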
3. Transactional
 Consists of a file where each record
represents a transaction.
 A transaction typically includes a unique
transaction identity number (trans ID) and a
list of the items making up the transaction
(such as items purchased in a store).

 The transactional database may have
additional tables associated with it, which
contain other information regarding the
sale, such as the date of the transaction,
the customer ID number, the ID number of
the salesperson and of the branch at which
the sale occurred, and so on.

4. Stream
 Data flow in and out of an observation
platform (or window) dynamically

 Unique features:

huge or possibly infinite volume

dynamically changing

flowing in and out in a fixed order

allowing only one or a small number of scans

demanding fast (often real-time) response time.
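
Because a stream can be scanned only once and may never end, stream algorithms usually keep a small summary rather than the full data. The sketch below maintains a sliding-window average in a single pass; the function name and the sample readings are assumptions for illustration.

```python
from collections import deque


def sliding_window_average(stream, window_size):
    """Yield the average of the last `window_size` values after each arrival."""
    window = deque(maxlen=window_size)
    total = 0.0
    for value in stream:
        if len(window) == window.maxlen:
            total -= window[0]  # this value is about to leave the window
        window.append(value)
        total += value
        yield total / len(window)


# Hypothetical sensor readings, consumed one at a time from an iterator.
readings = iter([10, 20, 30, 40])
averages = list(sliding_window_average(readings, window_size=2))
print(averages)
```

Memory use depends only on the window size, not on the length of the stream.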

 Typical examples of data streams include
various kinds of scientific and engineering
data, time-series data, and data produced
in other dynamic environments, such as
power supply, network traffic, stock
exchange, telecommunications, Web click
streams, video surveillance, and weather or
environment monitoring.

5. Object-oriented
 Each entity is considered as an object

 Objects that share a common set of properties can
be grouped into an object class.

 Each object is an instance of its class.

 Object classes can be organized into class/subclass
hierarchies so that each class represents
properties that are common to objects in that
class.

 For instance, an employee class can contain
variables like name, address, and birthdate.
Contd…
 Suppose that the class salesperson is a
subclass of the class employee.

 A salesperson object would inherit all of the
variables pertaining to its superclass,
employee.

 In addition, it has all of the variables that
pertain specifically to being a salesperson
(e.g., commission).

 Such a class inheritance feature benefits
information sharing.
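
The employee and salesperson example maps directly onto class inheritance in an object-oriented language. A minimal sketch using Python dataclasses follows; the field types and the sample values are assumptions.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Employee:
    name: str
    address: str
    birthdate: date


@dataclass
class SalesPerson(Employee):
    # Variable specific to salespeople; name, address, birthdate are inherited.
    commission: float = 0.0


# Hypothetical object: an instance of the subclass.
sp = SalesPerson("A. Rao", "12 Hill Road", date(1990, 5, 17), commission=0.05)
print([f.name for f in fields(sp)])
print(isinstance(sp, Employee))
```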
6. Temporal Databases, Sequence
Databases, and Time-Series Databases

 A temporal database typically stores relational
data that include time-related attributes. These
attributes may involve several timestamps, each
having different semantics.

 A sequence database stores sequences of
ordered events, with or without a concrete notion of
time. Examples include customer shopping
sequences, Web click streams, and biological
sequences.

 A time-series database stores sequences of
values or events obtained over repeated
measurements of time (e.g., hourly, daily, weekly).
Examples include data collected from the stock
exchange, inventory control, and the observation of
natural phenomena (like temperature and wind).
7. Spatial and
Spatiotemporal
 Spatial databases contain spatial-related
information.
 Examples include geographic (map) databases,
very large-scale integration (VLSI) or computer-aided
design databases, and medical and satellite
image databases.
 Spatial data may be represented in raster format,
consisting of n-dimensional bit maps or pixel maps.
 For example, a 2-D satellite image may be
represented as raster data, where each pixel
registers the rainfall in a given area.
 Maps can be represented in vector format, where
roads, bridges, buildings, and lakes are represented
as unions or overlays of basic geometric constructs,
such as points, lines, polygons, and the partitions
and networks formed by these components.
Contd….
 A spatial database that stores spatial
objects that change with time is called a
spatiotemporal database, from which
interesting information can be mined.
 For example, we may be able to group the
trends of moving objects and identify some
strangely moving vehicles, or distinguish a
bioterrorist attack from a normal outbreak
of the flu based on the geographic spread of
a disease with time.

8. Heterogeneous Databases
and Legacy Databases
 A heterogeneous database consists of a
set of interconnected, autonomous
component databases.

 A legacy database is a group of
heterogeneous databases that combines
different kinds of data systems.

 The heterogeneous databases in a legacy
database may be connected by intra- or
inter-computer networks.
9. Text and multi-media
 Text databases are databases that contain
word descriptions for objects.

 The descriptions may be words, sentences, or paragraphs (product
specifications, error or bug reports, warning
messages, summary reports, notes, or other
documents).

 Text databases may be highly unstructured (such as some
Web pages on the World Wide Web).
Contd…
 Some text databases may be somewhat
structured, that is, semi-structured (such as
e-mail messages and many HTML/XML Web
pages).

 Others are relatively well structured (such
as library catalogue databases).

 Text databases with highly regular
structures typically can be implemented
using relational database systems.
Contd….
 Example task: document classification

 Multimedia databases store image, audio, and
video data.

 They are used in applications such as picture content-based
retrieval, voice-mail systems, video-on-demand
systems, the World Wide Web, and speech-based
user interfaces that recognize spoken commands.

 Multimedia databases must support large objects, because data
objects such as video can require gigabytes of
storage.
10. WWW
 Distributed information services, such as
Yahoo!, Google, America Online, and
AltaVista, provide rich, worldwide, on-line
information services, where data objects are
linked together to facilitate interactive access.

 Users seeking information of interest traverse
from one object via links to another.

 Capturing user access patterns in such
distributed information environments is called
Web usage mining (or Weblog mining).
 Knowledge to be mined
 Generalization, Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier analysis, etc.



Generalization
 Information integration and data warehouse
construction

Data cleaning, transformation, integration, and
multidimensional data model
 Data cube technology

Scalable methods for computing (i.e.,
materializing) multidimensional aggregates

OLAP (online analytical processing)
 Multidimensional concept description:
Characterization and discrimination

Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet region
Characterization and
Discrimination
 Data Characterization: A summarization of the
general characteristics of a target class of data. For
instance, a data mining system should be able to produce a
description summarizing the characteristics
of customers.

 Example: The characteristics of customers
who spend more than $1000 a year at
a store called AllElectronics. The
result can be a general profile covering attributes such as age,
employment status, or credit ratings.


Contd….

 Data Discrimination: A comparison of the
general features of target class data
objects with the general features of objects
from one or a set of contrasting classes. The user
can specify the target and contrasting classes.

 Example: The user may like to compare the
general features of software products whose
sales increased by 10% in the last year with
those whose sales decreased by about 30%
in the same period.


Characterization and
Discrimination
 Data can be associated with classes or concepts.

 For example, in the AllElectronics store,
classes of items for sale include computers
and printers, and concepts of customers
include bigSpenders and budgetSpenders.

 It can be useful to describe individual classes and
concepts in summarized, concise, and yet
precise terms. Such descriptions of a class
or a concept are called class/concept
descriptions.


Contd….

 These descriptions can be derived via

(1) data characterization, by summarizing the data
of the class under study (often called the target
class) in general terms, or

(2) data discrimination, by comparison of the
target class with one or a set of comparative
classes (often called the contrasting classes), or

(3) both data characterization and discrimination.


Contd….

 The output of data characterization can
be presented in various forms.

 Examples include pie charts, bar charts,
curves, multidimensional data cubes, and
multidimensional tables, including
crosstabs.

 The resulting descriptions can also be
presented as generalized relations or in rule
form (called characteristic rules).


Associations and
correlations
 Frequent Patterns : As the name suggests
patterns that occur frequently in data.

 Frequent Itemset : A set of items that


frequently appear together in a
transactional data set, such as milk and
bread.

 Frequent Sequential Pattern : A frequently


occurring subsequence, such as the pattern
that customers tend to purchase first a PC,
followed by a digital camera, and then a
memory card.
Contd….

 Substructure: A substructure can refer to different structural
forms, such as graphs, trees, or lattices,
which may be combined with itemsets or
subsequences.

 If a substructure occurs frequently, it is called
a (frequent) structured pattern.

 Mining frequent patterns leads to the
discovery of interesting associations and
correlations within data.


Contd….
Association Analysis: from a marketing perspective,
determining which items are frequently purchased
together within the same transaction.
Example: A rule mined from the
AllElectronics transactional database:
buys(X, “Computers”) ⇒ buys(X, “software”)
[support = 1%, confidence = 50%]
 X represents a customer.
 Confidence (certainty) = 50% means that if a customer buys
a computer, there is a 50% chance that he/she will
buy software as well.
 Support = 1% means that 1% of all the
transactions under analysis showed that computer
and software were purchased together.


Contd…
 Support reflects the usefulness of a rule.

 Confidence reflects the certainty of a rule.

 The support of a rule X ⇒ Y is the fraction of all
transactions that contain both X and Y.

 The confidence of a rule X ⇒ Y is the fraction of the
transactions containing X that also contain Y.

 In multidimensional databases, where each attribute
is referred to as a dimension, a rule that involves more than
one attribute can be referred to as a multidimensional association rule.


Support and Confidence
 Support count: The support count of an
itemset X, denoted by X.count, in a data
set T is the number of transactions in T
that contain X. Assume T has n
transactions.
 Then, for a rule X ⇒ Y,
$$\text{support} = \frac{(X \cup Y).\text{count}}{n}$$
$$\text{confidence} = \frac{(X \cup Y).\text{count}}{X.\text{count}}$$


Contd….

Transactions (10 in total):
1. Bag, Uniform, Crayons
2. Books, Bag, Uniform
3. Bag, Uniform, Pencil
4. Bag, Pencil, Book
5. Uniform, Crayons, Bag
6. Bag, Pencil, Book
7. Crayons, Uniform, Bag
8. Books, Crayons, Bag
9. Uniform, Crayons, Pencil
10. Pencil, Uniform, Books

Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag ⇒ Uniform = 5/8 = 0.625
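
The two figures can be checked by counting directly. The sketch below encodes the ten transactions as Python sets and applies the usual definitions, support = count(X ∪ Y) / n and confidence = count(X ∪ Y) / count(X); the function names are assumptions.

```python
transactions = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Book"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Book"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]


def support_count(itemset, db):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t)


def support(x, y, db):
    """Fraction of all transactions that contain both X and Y."""
    return support_count(x | y, db) / len(db)


def confidence(x, y, db):
    """Fraction of the transactions containing X that also contain Y."""
    return support_count(x | y, db) / support_count(x, db)


print(support({"Bag"}, {"Uniform"}, transactions))     # 0.5
print(confidence({"Bag"}, {"Uniform"}, transactions))  # 0.625
```

Confidence is not symmetric: Uniform ⇒ Bag has the same support but divides by the number of transactions that contain Uniform.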



t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes

Clothes ⇒ Milk, Chicken

Clothes, Chicken ⇒ Milk
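
The slide lists these two rules without their support and confidence. As a worked check, the sketch below computes both values from transactions t1 to t7 using exact fractions; the helper name `rule_stats` is an assumption.

```python
from fractions import Fraction

transactions = {
    "t1": {"Beef", "Chicken", "Milk"},
    "t2": {"Beef", "Cheese"},
    "t3": {"Cheese", "Boots"},
    "t4": {"Beef", "Chicken", "Cheese"},
    "t5": {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    "t6": {"Chicken", "Clothes", "Milk"},
    "t7": {"Chicken", "Milk", "Clothes"},
}


def rule_stats(antecedent, consequent, db):
    """Return (support, confidence) of antecedent => consequent."""
    both = sum(1 for items in db.values() if antecedent | consequent <= items)
    lhs = sum(1 for items in db.values() if antecedent <= items)
    return Fraction(both, len(db)), Fraction(both, lhs)


rule1 = rule_stats({"Clothes"}, {"Milk", "Chicken"}, transactions)
rule2 = rule_stats({"Clothes", "Chicken"}, {"Milk"}, transactions)
print(rule1)  # (Fraction(3, 7), Fraction(1, 1))
print(rule2)  # (Fraction(3, 7), Fraction(1, 1))
```

Both rules hold in t5, t6, and t7, which are also the only transactions containing Clothes, so each has support 3/7 and confidence 3/3.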



Contd…

 Motivation: Finding inherent regularities in data
 What products were often purchased
together? Bag and Uniform?
 What are the subsequent purchases after
buying a PC?
 What kinds of DNA are sensitive to this new
drug?
 Can we automatically classify web
documents?
Associations and
correlations
 Another example:
 age(X, “20…29”) ∧ income(X, “20K-29K”) ⇒
buys(X, “CD Player”) [support = 2%,
confidence = 60%]
 For customers between 20 and 29 years of age
with an income of $20,000 to $29,000, there is a
60% chance that they will purchase a CD player,
and 2% of all the transactions under
analysis showed that customers in this age group
and income range bought a CD player.


Classification

 Classification and label prediction
 Construct models (functions) based on some training
examples
 Describe and distinguish classes or concepts for future
prediction

E.g., classify countries based on climate, or classify cars
based on gas mileage
 Predict some unknown class labels
 Typical methods
 Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, …
 Typical applications:
 Credit card fraud detection, direct marketing, classifying
stars, diseases, web-pages, …
Classification and Prediction
 Classification is the process of finding a
model that describes and distinguishes data
classes or concepts for the purpose of being
able to use the model to predict the class of
objects whose class label is unknown.
 Construct models (functions) that describe
and distinguish classes or concepts for
future prediction
 Training data: used to build the model
 Test data: used to evaluate the model
 Classification model can be represented in
various forms such as

IF-THEN Rules

A decision tree

Neural network
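
The split between training data and test data can be sketched with a deliberately simple classifier. The model below just remembers the majority class for each attribute value, a one-attribute rule learner used here for brevity and not one of the methods the slides describe; the car data are hypothetical.

```python
from collections import Counter, defaultdict


def fit(examples):
    """Build the model: the majority class label for each attribute value."""
    by_value = defaultdict(Counter)
    for value, label in examples:
        by_value[value][label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}


def predict(model, value, default="unknown"):
    """Predict the class label for one attribute value."""
    return model.get(value, default)


def accuracy(model, examples):
    """Evaluate the model: fraction of examples predicted correctly."""
    correct = sum(1 for value, label in examples if predict(model, value) == label)
    return correct / len(examples)


# Hypothetical labelled cars: (gas mileage, car class).
training_data = [
    ("high", "economy"), ("high", "economy"),
    ("low", "sports"), ("low", "sports"), ("low", "economy"),
    ("medium", "economy"),
]
test_data = [("high", "economy"), ("low", "sports"), ("medium", "sports")]

model = fit(training_data)              # training data: build the model
test_accuracy = accuracy(model, test_data)  # test data: evaluate the model
print(model)
print(test_accuracy)
```

Evaluating on data that was held out of training gives a fairer estimate of how the model will behave on objects whose class label is unknown.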
Contd….
 A decision tree is a flow-chart-like tree
structure, where each node denotes a test
on an attribute value, each branch
represents an outcome of the test, and tree
leaves represent classes or class
distributions.

 Decision trees can easily be converted to
classification rules.

 A neural network, when used for
classification, is typically a collection of
neuron-like processing units with weighted
connections between the units.
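
Converting a decision tree to classification rules amounts to writing one IF-THEN rule per root-to-leaf path. The sketch below assumes a tree stored as nested tuples and dictionaries; the tree itself and the attribute names are hypothetical.

```python
# Hypothetical tree: an internal node is (attribute, {value: subtree});
# a leaf is a class label string.
tree = ("Temperature", {
    "High": ("Vomiting", {"Yes": "Viral illness", "No": "Healthy"}),
    "Normal": "Healthy",
})


def tree_to_rules(node, conditions=()):
    """Return one IF-THEN rule for each root-to-leaf path."""
    if isinstance(node, str):  # leaf: emit the rule for this path
        tests = " AND ".join(f"{attr} = {val}" for attr, val in conditions) or "TRUE"
        return [f"IF {tests} THEN class = {node}"]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

Each test on a path becomes one condition in the rule's IF part, and the leaf supplies the THEN part.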


Cluster Analysis

 Unsupervised learning (i.e., class label is unknown)
 Group data to form new categories (i.e., clusters),
e.g., cluster houses to find distribution patterns
 Principle: Maximizing intra-class similarity &
minimizing interclass similarity
 Many methods and applications
Cluster Analysis
 Clustering analyses data objects without
consulting a known class label.

 Groups data elements into different groups
based on the similarity between elements
within a single group

 Maximizing the intraclass similarity and
minimizing the interclass similarity.

 Example: Result analysis
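
As one possible reading of the result-analysis example, the sketch below groups exam marks with k-means in one dimension. The choice of k-means, the marks, and the starting centers are all assumptions; the slides do not name a particular clustering method.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign to the nearest center, then re-average."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical exam marks and initial guesses for three groups.
marks = [35, 38, 41, 72, 75, 78, 90]
centers, clusters = kmeans_1d(marks, centers=[30.0, 60.0, 95.0])
print(clusters)
print(centers)
```

No class labels are used: the groups emerge only from how close the marks are to one another.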



Outlier Analysis
 Outlier Analysis: A database may contain data objects
that do not comply with the general behavior or model
of the data. These data objects are outliers.

 “Outliers” are values that “lie outside” the other values.

 Example: Detecting fraudulent usage of credit
cards. Outlier analysis may uncover fraudulent usage
of credit cards by detecting purchases of extremely
large amounts for a given account number in
comparison to regular charges incurred by the same
account. Outlier values may also be detected with
respect to the location and type of purchase or the
purchase frequency.
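
A simple way to flag an unusually large charge is to compare each amount with the account's typical charge. The sketch below uses a modified z-score built from the median and the median absolute deviation (MAD) with a 3.5 cutoff, a common heuristic chosen here for illustration; the slides do not prescribe a method, and the charges are hypothetical.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median: nothing can be scored
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical charges on one account; the last one is suspiciously large.
charges = [40, 55, 38, 60, 45, 52, 48, 2500]
suspicious = mad_outliers(charges)
print(suspicious)
```

The median is used instead of the mean because a single extreme charge inflates the mean and standard deviation enough to hide itself in a small sample.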
Outlier Analysis
 Outlier analysis
 Outlier: A data object that does not comply with the
general behavior of the data
 Noise or exception? One person’s garbage could be
another person’s treasure
 Methods: by-product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis
Time and Ordering: Sequential
Pattern, Trend and Evolution Analysis
 Sequence, trend and evolution analysis

Trend, time-series, and deviation analysis: e.g.,
regression and value prediction

Sequential pattern mining

e.g., first buy digital camera, then buy large SD
memory cards

Periodicity analysis

Motifs and biological sequence analysis

Approximate and consecutive motifs

Similarity-based analysis
 Mining data streams

Ordered, time-varying, potentially infinite, data
streams
Evolution Analysis
 Evolution Analysis: Data evolution analysis
describes and models regularities or trends for
objects whose behavior changes over time.

 Example: Time-series data. Suppose that stock market
data (time-series) of the last several years are
available from the New York Stock Exchange and
one would like to invest in shares of high-tech
industrial companies. A data mining study of stock
exchange data may identify stock evolution
regularities for overall stocks and for the stocks of
particular companies. Such regularities may help
predict future trends in stock market prices,
contributing to one’s decision making regarding
stock investments.
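
One of the simplest ways to expose a trend in time-series data is to smooth it with a moving average. The sketch below is illustrative only: the closing prices are made up, and a moving average is just one basic technique among the many that trend analysis can use.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical daily closing prices for one stock.
prices = [10, 12, 11, 13, 15, 14, 16]
smoothed = moving_average(prices, window=3)
print(smoothed)
print("rising" if smoothed[-1] > smoothed[0] else "not rising")
```

The day-to-day ups and downs in `prices` disappear in `smoothed`, leaving the underlying upward movement visible.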
Structure and Network Analysis
 Graph mining
 Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
 Information network analysis

Social networks: actors (objects, nodes) and relationships
(edges)

e.g., author networks in CS, terrorist networks
 Multiple heterogeneous networks

A person could be in multiple information networks: friends,
family, classmates, …
 Links carry a lot of semantic information: Link mining
 Web mining
 Web is a big information network: from PageRank to Google

Analysis of Web information networks

Web community discovery, opinion mining, usage mining, …
 Techniques utilized
 Machine learning, statistics, visualization, etc.



Data Mining: Confluence of Multiple
Disciplines

[Figure: Data Mining at the center, surrounded by the contributing areas: Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing]
Contd….
 Data mining (DM) is an interdisciplinary field
 It draws on a set of disciplines including database
systems, statistics, machine learning,
visualization, and information science.
 Other contributing disciplines include neural networks, fuzzy
logic or rough set theory, knowledge
representation, etc.
 Statistics is the study of the collection, organization, analysis,
interpretation and presentation of data.
 Machine learning, a branch of artificial intelligence, concerns
the construction and study of systems that can learn from data.
 For example, a machine learning system could be trained on
email messages to learn to distinguish between spam and non-spam
messages. Example model types include decision trees and neural networks.
 A database is an organized collection of data.
AI

 Artificial intelligence (AI) is technology and
a branch of computer science that studies
and develops intelligent machines and
software.
 Pattern recognition aims to classify data (patterns)
based on either a priori knowledge or on
statistical information extracted from the
patterns.
KDD Process: A Typical View from ML
and Statistics

[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]

Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

 This is a view from typical machine learning and statistics communities
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web mining, etc.



Applications of Data Mining
 Web page analysis: from web page classification, clustering
to PageRank & HITS algorithms
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 Biological and medical data analysis: classification, cluster
analysis (microarray data analysis), biological sequence
analysis, biological network analysis
 Data mining and software engineering (e.g., IEEE Computer,
Aug. 2009 issue)
 From major dedicated data mining systems/tools (e.g., SAS,
MS SQL-Server Analysis Manager, Oracle Data Mining Tools)
to invisible data mining

Major Issues in Data Mining
(1)
 Mining Methodology

Mining various and new kinds of knowledge

Mining knowledge in multi-dimensional space

Data mining: An interdisciplinary effort

Boosting the power of discovery in a networked environment

Handling noise, uncertainty, and incompleteness of data

Pattern evaluation and pattern- or constraint-guided mining
 User Interaction

Interactive mining

Incorporation of background knowledge

Presentation and visualization of data mining results

Major Issues in Data Mining
(2)

 Efficiency and Scalability


 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining
methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
