UNIT 1 Introduction to Data Mining
Chapter 1
– scientific simulations
terabytes of data generated in a few hours
Improving health care and reducing costs
Predicting the impact of climate change
[Figure: the knowledge discovery process, with stages labeled Data Cleaning, Data Integration, Task-relevant Data, and Data Mining]
• Input data
• Pre-processing:
– Fusing data from multiple sources
– Cleaning data to remove noise and duplicates
– Selecting features or records that are relevant to the data mining task
– Transforming the raw input data into an appropriate format for analysis
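The pre-processing steps listed above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a prescribed method: the function name `preprocess`, its parameters, and the sample records (`store`, `web`, with fields `id`, `income`, `note`) are all hypothetical.

```python
def preprocess(sources, keep_fields, numeric_fields):
    """Fuse, clean, select, and transform raw records (illustrative sketch)."""
    # 1. Fuse: concatenate the records from all sources.
    fused = [dict(rec) for source in sources for rec in source]

    # 2. Clean: skip records with a missing task-relevant value,
    #    then skip records that duplicate an earlier one on those fields.
    seen, cleaned = set(), []
    for rec in fused:
        if any(rec.get(f) in (None, "") for f in keep_fields):
            continue
        key = tuple(rec[f] for f in keep_fields)
        if key not in seen:
            seen.add(key)
            cleaned.append(rec)

    # 3. Select: keep only the task-relevant fields.
    selected = [{f: rec[f] for f in keep_fields} for rec in cleaned]

    # 4. Transform: convert numeric strings to floats for analysis.
    for rec in selected:
        for f in numeric_fields:
            rec[f] = float(rec[f])
    return selected


# Hypothetical sample data from two sources.
store = [{"id": "1", "income": "50000", "note": "x"},
         {"id": "2", "income": "", "note": "y"}]
web = [{"id": "1", "income": "50000", "note": "z"},
       {"id": "3", "income": "72000", "note": "w"}]

result = preprocess([store, web], keep_fields=["id", "income"],
                    numeric_fields=["income"])
```

Here the record with the empty `income` is dropped as incomplete, and the second copy of customer 1 is dropped as a duplicate.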
• Post-processing:
– “Closing the loop” refers to integrating data mining results into a decision support
system. Example: in a business application, data mining results can be integrated with campaign
management tools to run effective marketing promotions. This requires a post-processing step to
ensure that only valid and useful results are incorporated into the decision support system.
6/30/2019 Introduction to Data Mining 11
What is (not) Data Mining?
Scalability:
– Novel data structures, out-of-core algorithms,
parallel and distributed algorithms.
High Dimensionality:
– Example: data with temporal and spatial components
can have a very large number of dimensions.
Heterogeneous and Complex Data:
– Examples: collections of web pages containing
semi-structured data and hyperlinks; DNA data with
three-dimensional structure; climate data with time
series measurements.
Motivating Challenges (cont.)
Non-traditional Analysis:
– The traditional statistical approach is based on a
hypothesize-and-test paradigm: a hypothesis is
proposed, an experiment is designed to gather the
data, and the data is then analyzed with respect to
the hypothesis.
– Current data analysis tasks often require evaluating
thousands of hypotheses, hence the need to automate
the process of hypothesis generation and evaluation.
Descriptive Tasks:
– The objective is to derive patterns (correlations,
trends, clusters, anomalies) that summarize the
relationships in the data.
– These tasks are exploratory in nature and frequently
require post-processing techniques to validate and
explain the results.
– Find human-interpretable patterns that describe the
data.
Data (training set):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

[Figure: the training set is used to learn a classifier (the model), which is then applied to a test set. The model shown is a decision tree for the class Credit Worthy that tests Employed, Level of Education (Graduate vs. { High school, Undergrad }), and number of years at present address (> 3 yr / < 3 yr and > 7 yrs / < 7 yrs), with Yes/No leaves.]
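As a rough illustration of what a learned model might look like for the credit-worthiness data above, the sketch below hard-codes a small decision tree and applies it to the four rows that are listed. The function name `credit_worthy` is hypothetical, and the pairing of thresholds with branches (more than 3 years for graduates, more than 7 years otherwise) is an assumption read off the figure labels; it is chosen so that it agrees with the four rows and is not a result produced by a learning algorithm.

```python
def credit_worthy(employed, education, years_at_address):
    """Hand-written decision tree (assumed structure, see note above)."""
    # Unemployed applicants are classified as not credit worthy.
    if employed != "Yes":
        return "No"
    # Assumed thresholds: > 3 years for graduates, > 7 years otherwise.
    threshold = 3 if education == "Graduate" else 7
    return "Yes" if years_at_address > threshold else "No"


# The four rows listed in the table: (Employed, Education, Years, Credit Worthy).
training_set = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]

predictions = [credit_worthy(e, ed, y) for e, ed, y, _ in training_set]
correct = sum(p == row[3] for p, row in zip(predictions, training_set))
accuracy = correct / len(training_set)
```

In practice the model would be learned from the training set and its accuracy measured on a separate test set, as the figure indicates; checking it on the training rows here only confirms that the assumed tree is consistent with them.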
Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
Use credit card transactions and the information
on its account-holder as attributes.
– When does a customer buy, what do they buy, how
often do they pay on time, etc.
Label past transactions as fraud or fair
transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit
card transactions on an account.
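The label, learn, and detect steps above can be sketched with a deliberately simple model. The example below learns a one-attribute threshold rule (a decision stump) from labeled past transactions. The single attribute (transaction amount), the function names `learn_stump` and `detect`, and the sample data are all hypothetical; a real system would use many attributes and a stronger classifier.

```python
def learn_stump(amounts, labels):
    """Return the amount threshold with the fewest errors on past data.

    The rule predicts "fraud" when amount > threshold, otherwise "fair".
    """
    best_threshold, best_errors = None, len(labels) + 1
    for t in sorted(set(amounts)):
        errors = sum(
            ("fraud" if a > t else "fair") != y
            for a, y in zip(amounts, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def detect(amount, threshold):
    """Apply the learned rule to a new transaction."""
    return "fraud" if amount > threshold else "fair"


# Made-up labeled history: the label is the class attribute.
past_amounts = [12.0, 40.0, 55.0, 900.0, 1500.0]
past_labels = ["fair", "fair", "fair", "fraud", "fraud"]

threshold = learn_stump(past_amounts, past_labels)
new_prediction = detect(1200.0, threshold)
```

On this toy history the learned threshold separates the three small transactions from the two large ones, so a new transaction of 1200.0 is flagged.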
Classification: Application 2
Examples:
Group sets of related customers
Find areas of ocean which have significant
impact on Earth’s climate.
Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
Collect different attributes of customers based on
their geographical and lifestyle-related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those
from different clusters.
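The "find clusters of similar customers" step can be illustrated with k-means, one common clustering algorithm; the slides do not name a specific algorithm, so this choice is an assumption. The customer attributes (age, monthly spend), the starting centroids, and the data values below are made up for the sketch.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) from the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical customers described by (age, monthly spend).
customers = [(25, 20), (27, 22), (30, 25), (55, 80), (60, 85), (58, 90)]

clusters, centroids = kmeans(customers, centroids=[customers[0], customers[3]])
```

This covers only the clustering step. Measuring cluster quality as described above would additionally require buying-pattern data, so that customers in the same cluster can be compared with customers from different clusters.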
Document Clustering:
Market-basket analysis
– Rules are used for sales promotion, shelf
management, and inventory management
Medical Informatics
– Rules are used to find combinations of patient
symptoms and test results associated with certain
diseases.
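Rules like those in the applications above are commonly scored with two standard measures, support and confidence; the slides here do not define them, so the sketch below is offered as background under that assumption. The function name `rule_stats` and the basket contents are hypothetical.

```python
def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Count transactions containing the antecedent, and those containing both sides.
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Made-up market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

support, confidence = rule_stats(baskets, {"diapers"}, {"beer"})
```

For the rule {diapers} -> {beer}, two of the four baskets contain both items (support 0.5), and two of the three baskets with diapers also contain beer (confidence 2/3). The same computation applies to medical data if each patient record is treated as a set of symptoms and test results.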