
DWDM 01 Introduction

The document discusses data mining concepts including what data mining is, the types of data that can be mined, and the types of patterns that can be mined from data. It covers descriptive patterns that characterize data as well as predictive patterns used to make predictions on data. A variety of data sources that can be mined are also presented.

Uploaded by

Pranav A.R

Data Mining:

Concepts and Techniques


(3rd ed.)

— Chapter 1 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining

Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
 Applications: market analysis, fraud detection, customer retention, etc.

Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s:
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems

What Is Data Mining?

 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”? No. The following are not data mining:
 Simple search and query processing
 (Deductive) expert systems

May 1, 2023 Data Mining: Concepts and Techniques 7


Knowledge Discovery (KDD) Process

1. Data cleaning: remove noise and inconsistent data
2. Data integration: combine multiple data sources
3. Data selection: retrieve the data relevant to the analysis task from the database
4. Data transformation: transform or consolidate the data into forms appropriate for mining, for instance by performing summary or aggregation operations
5. Data mining: the essential step, in which intelligent methods are applied to extract data patterns
6. Pattern evaluation: identify the truly interesting patterns representing knowledge, based on interestingness measures
7. Knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to the user
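
The seven steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two in-memory sources (`store_a`, `store_b`), their records, and all function names are hypothetical, and the "mining" step is a simple frequency count standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records from two sources: (customer, item, amount).
store_a = [("ann", "pc", 900.0), ("bob", "printer", None), ("ann", "camera", 300.0)]
store_b = [("cy", "pc", 950.0), ("ann", "pc", 900.0), ("dee", "PC ", 880.0)]

def clean(records):
    # Step 1: drop records with missing amounts and normalize item names.
    return [(c, i.strip().lower(), a) for c, i, a in records if a is not None]

def integrate(*sources):
    # Step 2: combine multiple sources into one collection.
    return [r for src in sources for r in src]

def select(records, min_amount):
    # Step 3: keep only the records relevant to the analysis task.
    return [r for r in records if r[2] >= min_amount]

def transform(records):
    # Step 4: aggregate into a form suitable for mining (item -> sales count).
    return Counter(item for _, item, _ in records)

def mine(item_counts, min_count):
    # Step 5: stand-in "mining" step that extracts frequently sold items.
    return {item: n for item, n in item_counts.items() if n >= min_count}

def evaluate(patterns, total):
    # Step 6: score each pattern with a simple measure (share of selected sales).
    return {item: n / total for item, n in patterns.items()}

data = select(integrate(clean(store_a), clean(store_b)), min_amount=500.0)
patterns = evaluate(mine(transform(data), min_count=2), total=len(data))

# Step 7: present the result to the user.
for item, share in patterns.items():
    print(f"{item}: {share:.0%} of selected sales")
```

In practice the steps are iterative rather than strictly linear: an unsatisfying pattern evaluation often sends the analyst back to selection or transformation.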

Data Mining: On What Kinds of Data?

1. Database-oriented data sets: relational database
 customer (cust_ID, customer name, address, age, occupation, annual income, credit information, category)
 Other relations: item, employee, and branch
 Tables can also be used to represent the relationships between or among multiple relations:
 items_sold (lists the items sold in a given transaction)
 works_at (records which employee works at which branch of AllElectronics)



Queries
 Q1: Show the list of all items that were sold in the last quarter
 Q2: Show the total sales of the last month, grouped by branch
 Q3: How many sales transactions occurred in the month of December?
 Q4: Which salesperson had the highest amount of sales?
 Data mining goes beyond such queries by searching for trends and patterns. For example, it can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information
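
Queries like Q2 and Q3 are plain aggregation over stored facts, which a relational engine answers directly. A minimal sketch using Python's built-in `sqlite3` module, assuming a hypothetical `sales` table whose column names and rows are made up for illustration:

```python
import sqlite3

# In-memory database with a hypothetical sales relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (trans_id TEXT, branch TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("T100", "Vancouver", "2023-12-03", 1200.0),
        ("T200", "Vancouver", "2023-12-15", 300.0),
        ("T300", "Chicago", "2023-12-20", 450.0),
        ("T400", "Chicago", "2023-11-28", 80.0),
    ],
)

# Q2-style query: total sales per branch for one month (December 2023 here).
totals = conn.execute(
    "SELECT branch, SUM(amount) FROM sales "
    "WHERE sale_date LIKE '2023-12-%' GROUP BY branch ORDER BY branch"
).fetchall()

# Q3-style query: number of sales transactions that occurred in December.
(december_count,) = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE strftime('%m', sale_date) = '12'"
).fetchone()

print(totals)
print(december_count)
```

Both answers are read off the stored rows. Predicting the credit risk of a customer who is not yet in the table is a different kind of task, because it requires a model learned from the data.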



Data Mining: On What Kinds of Data?

2. Data warehouse
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.

Example task: analyze the company’s sales per item type per branch for the third quarter.
Without a data warehouse this is difficult, since the relevant data are spread out over several databases physically located at numerous sites.



A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide.

A data mart is a departmental subset of a data warehouse. It focuses on selected subjects, and thus its scope is department-wide.



Data Mining: On What Kinds of Data?

3. Transactional databases
 Q: “Show all the items purchased by Sandy Smith” or “How many transactions include item number I3?”
 Such queries may require a scan of the entire transactional database
 Q: “Which items sold well together?”
 This calls for market basket data analysis
 Identify frequent itemsets, e.g., computer & printer
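
A brute-force sketch of the "which items sold well together" question: count every item pair inside each transaction and keep the pairs that reach a minimum count. The transactions and the `frequent_pairs` helper are hypothetical, and real market basket analysis would use a dedicated frequent-itemset algorithm rather than enumerating all pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each is the set of items in one purchase.
transactions = [
    {"computer", "printer", "software"},
    {"computer", "printer"},
    {"camera", "memory card"},
    {"computer", "printer", "camera"},
    {"printer", "paper"},
]

def frequent_pairs(transactions, min_count):
    """Count each item pair once per transaction; keep pairs seen at least min_count times."""
    counts = Counter()
    for items in transactions:
        # Sorting gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

print(frequent_pairs(transactions, min_count=3))
```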



Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web

Data Mining Functionalities: What Kinds of Patterns May Be Mined?
 Two categories
 Descriptive: descriptive mining tasks characterize the general properties of the data in the database
 Predictive: predictive mining tasks perform inference on the current data in order to make predictions



Data Mining Functionalities
1. Concept/class description: characterization and discrimination
 Data can be associated with classes or concepts
 Classes of items for sale: computers and printers
 Concepts of customers: bigSpenders and budgetSpenders
 Descriptions can be derived from:
a. Data characterization: summarization of the general characteristics or features of a target class of data
 E.g., summarize the characteristics of customers who spend more than $1,000 a year
 Result: a general profile of these customers, such as 40–50 years old, employed, and with excellent credit ratings
 Methods: statistical measures and plots, OLAP roll-up operation, attribute-oriented induction



Data Mining Functionalities
b. Data discrimination: comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes
 E.g., compare two groups of customers: those who shop for computer products regularly (more than two times a month) versus those who rarely shop for such products (i.e., less than three times a year)
 Result:
 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education
 60% of the customers who infrequently buy such products are either seniors or youths, and have no university degree
c. Both characterization and discrimination



Data Mining Functionalities

2. Mining frequent patterns, associations, and correlations
 Frequent patterns: patterns that occur frequently in data
 Three types of frequent patterns:
a. Frequent itemset: a set of items that frequently appear together in a transactional data set, such as bread and butter
b. Frequent subsequence: a sequence that occurs frequently, e.g., the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card is a (frequent) sequential pattern
c. Frequent substructure: a frequently occurring structural form, such as a graph, tree, or lattice
 Mining frequent patterns leads to the discovery of interesting associations and correlations within data



Data Mining Functionalities

Association and correlation analysis
 Determine items that are frequently purchased together within the same transactions
 computer ⇒ software [support = 1%, confidence = 50%]
 age(cust, “20…29”) ∧ income(cust, “20K…29K”) ⇒ buys(cust, “CD player”) [support = 2%, confidence = 60%]
 Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold



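Support and confidence of a rule X → Y over n transactions:

$$\text{support}(X \rightarrow Y) = \frac{(X \cup Y).\text{count}}{n} \qquad \text{confidence}(X \rightarrow Y) = \frac{(X \cup Y).\text{count}}{X.\text{count}}$$

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Association rules (support, confidence):
A → D (60%, 100%)
D → A (60%, 75%)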
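
The two rule scores can be checked by counting over the five transactions in the table. A minimal sketch, where the helper names `count`, `support`, and `confidence` are chosen here for illustration:

```python
# The five transactions from the table, keyed by transaction id.
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def count(itemset):
    # Number of transactions that contain every item in itemset.
    return sum(1 for items in transactions.values() if itemset <= items)

def support(x, y):
    # Fraction of all transactions that contain both X and Y.
    return count(x | y) / len(transactions)

def confidence(x, y):
    # Among the transactions that contain X, the fraction that also contain Y.
    return count(x | y) / count(x)

for x, y in [({"A"}, {"D"}), ({"D"}, {"A"})]:
    print(sorted(x), "->", sorted(y),
          f"support={support(x, y):.0%}", f"confidence={confidence(x, y):.0%}")
```

Support is symmetric in X and Y, while confidence is not: A appears in three transactions, all of which contain D, whereas D appears in four transactions, only three of which contain A.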



Data Mining Functionalities

3. Classification and prediction
 Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts
 Classification maps data into predefined groups or classes
 Supervised learning: the classes are determined before examining the data
 Purpose: use the model to predict the class of objects whose class label is unknown
 Methods: classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks



Weather Data: Play or not Play?
Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No

Example Tree for “Play?”

Outlook
├─ sunny: Humidity
│   ├─ high: No
│   └─ normal: Yes
├─ overcast: Yes
└─ rain: Windy
    ├─ true: No
    └─ false: Yes
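
The example tree can be written as nested tests and checked against the weather table. In this sketch the function name `predict_play` and the tuple layout are assumptions made for illustration; the rows are copied from the table.

```python
def predict_play(outlook, humidity, windy):
    """Follow the example tree: test Outlook first, then Humidity or Windy."""
    if outlook == "overcast":
        return "Yes"
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    if outlook == "rain":
        return "No" if windy else "Yes"
    raise ValueError(f"unknown outlook: {outlook!r}")

# The weather table as (outlook, temperature, humidity, windy, play).
weather = [
    ("sunny", "hot", "high", False, "No"),
    ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"),
    ("rain", "mild", "high", False, "Yes"),
    ("rain", "cool", "normal", False, "Yes"),
    ("rain", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"),
    ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"),
    ("rain", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),
    ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"),
    ("rain", "mild", "high", True, "No"),
]

# Temperature is never tested by the tree, so it is ignored here.
correct = sum(predict_play(o, h, w) == play for o, _t, h, w, play in weather)
print(f"{correct}/{len(weather)} rows classified correctly")
```

The tree reproduces every label in the table without consulting Temperature, which shows how a learned model can also reveal which attributes matter.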

Classification Example: Grading
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.

Equivalent decision tree:
x
├─ >= 90: A
└─ < 90: x
    ├─ >= 80: B
    └─ < 80: x
        ├─ >= 70: C
        └─ < 70: x
            ├─ >= 60: D
            └─ < 60: F
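
The grading rules amount to a chain of threshold tests, one per level of the tree. A small sketch, assuming that F covers every score below 60 so that no score is left unclassified:

```python
def grade(x):
    """Walk the chain of threshold tests from the top of the tree down."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"  # assumption: everything below 60 is an F

for score in (95, 85, 72, 60, 59):
    print(score, grade(score))
```

Because each test is reached only when the earlier ones have failed, the upper bound of each range (for example `x < 90` in the rule for B) does not need to be written explicitly.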
Data Mining Functionalities
 Prediction models continuous-valued functions; it is used to predict missing or unavailable numerical data values rather than class labels
 Method: regression analysis
 E.g., predict the amount of revenue that each item will generate during an upcoming sale at AllElectronics
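
As one concrete form of regression analysis, a straight line can be fitted by ordinary least squares and then used to predict a numeric value. The sketch below uses hypothetical figures (discount offered versus revenue); neither the data nor the choice of predictor comes from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical past sales: discount (%) offered on an item vs. revenue ($K).
discount = [5.0, 10.0, 15.0, 20.0]
revenue = [12.0, 19.0, 31.0, 38.0]

slope, intercept = fit_line(discount, revenue)

# Predict a continuous value for an unseen input, rather than a class label.
predicted = slope * 25.0 + intercept
print(f"revenue ~ {slope:.2f} * discount + {intercept:.2f}; at 25% discount: {predicted:.1f}")
```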



Data Mining Functionalities
4. Cluster analysis
 Class label is unknown: group data to form new classes (unsupervised learning)
 Principle: maximize intra-class similarity and minimize inter-class similarity
 A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters
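
One common way to form such groups is k-means, which the slides do not name; it is used here only as an example technique. The sketch clusters hypothetical one-dimensional spending values, with the starting centers chosen by hand.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest center, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers

# Hypothetical annual spending per customer (in dollars); no class labels are given.
spending = [120, 150, 130, 980, 1020, 1000, 160]
clusters, centers = kmeans_1d(spending, centers=[100.0, 500.0])
print(clusters)
print(centers)
```

Values inside each resulting group lie close to their own center and far from the other one, which is the intra-class versus inter-class similarity principle stated above.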



Data Mining Functionalities
5. Outlier analysis
 Outlier: a data object that does not comply with the general behavior of the data
 Noise or exception? Useful in fraud detection and rare events analysis
 E.g., fraudulent usage of credit cards
 Detect purchases of extremely large amounts for a given account number compared to the regular charges incurred by the same account
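
A simple way to flag an unusually large charge is to compare each charge with the account's median, scaled by the median absolute deviation (MAD). This rule, the threshold of 5, and the charges below are assumptions for illustration, not a method prescribed by the slides.

```python
from statistics import median

def flag_outliers(charges, threshold=5.0):
    """Flag charges whose distance from the median exceeds threshold * MAD."""
    med = median(charges)
    mad = median(abs(c - med) for c in charges)  # median absolute deviation
    # If MAD is zero (most charges identical), nothing is flagged.
    return [c for c in charges if mad > 0 and abs(c - med) / mad > threshold]

# Hypothetical charges on one account, with one unusually large purchase.
charges = [42.0, 35.5, 51.0, 38.0, 47.25, 4999.0, 40.0]
print(flag_outliers(charges))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme charge would inflate the latter and could hide itself.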



Are All the “Discovered” Patterns Interesting?

 Data mining may generate thousands of patterns: Not all of them are
interesting!
 Interestingness measures
 A pattern is interesting if it is easily understood by humans, valid on new or
test data with some degree of certainty, potentially useful, novel, or validates
some hypothesis that a user seeks to confirm
 Objective vs. subjective interestingness measures
 Objective: based on statistics and structures of patterns, e.g., support and
confidence for association rules

 Subjective: based on the user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
 E.g., a manager may find patterns about frequent customers interesting, while an analyst may care more about employee performance patterns



Find All and Only Interesting Patterns?

 Find all the interesting patterns: completeness
 Can a data mining system find all the interesting patterns? Do we need to find all of them? Doing so is often inefficient
 User-provided constraints and interestingness measures should be used to focus the search
 Association, classification, clustering
 Search for only interesting patterns: an optimization problem; desirable but challenging
 Can a data mining system find only the interesting patterns?
 Approaches
 First generate all the patterns and then filter out the uninteresting ones
 Generate only the interesting patterns (mining query optimization)
Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the center of contributing disciplines: machine learning, pattern recognition, statistics, applications, visualization, algorithms, database technology, high-performance computing, data warehousing, and information retrieval]
Applications of Data Mining
 Business Intelligence
 Web Search Engines

Major Issues in Data Mining (1)

 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results

Major Issues in Data Mining (2)

 Efficiency and Scalability


 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining

Conferences and Journals on Data Mining

 KDD Conferences
 ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
 SIAM Data Mining Conf. (SDM)
 (IEEE) Int. Conf. on Data Mining (ICDM)
 European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
 Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
 Int. Conf. on Web Search and Data Mining (WSDM)
 Other related conferences
 DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
 Web and IR conferences: WWW, SIGIR, WSDM
 ML conferences: ICML, NIPS
 PR conferences: CVPR,
 Journals
 Data Mining and Knowledge Discovery (DAMI or DMKD)
 IEEE Trans. on Knowledge and Data Eng. (TKDE)
 KDD Explorations
 ACM Trans. on KDD

Where to Find References? DBLP, CiteSeer, Google

 Data mining and KDD (SIGKDD: CDROM)


 Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
 Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
 Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
 Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
 Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
 AI & Machine Learning
 Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
 Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
 Web and IR
 Conferences: SIGIR, WWW, CIKM, etc.
 Journals: WWW: Internet and Web Information Systems,
 Statistics
 Conferences: Joint Stat. Meeting, etc.
 Journals: Annals of statistics, etc.
 Visualization
 Conference proceedings: CHI, ACM-SIGGraph, etc.
 Journals: IEEE Trans. visualization and computer graphics, etc.
