Lecture 1-Introduction To Data Mining - M
Data Mining
Lecture # 1
Administrative Stuff
Motivation: “Necessity is the
Mother of Invention”
Data Explosion Problem
1. Automated data collection tools (e.g. web, sensor networks) and mature
database technology lead to tremendous amounts of data stored in databases,
data warehouses and other information repositories.
3. Every minute, YouTube users upload 48 hours of video, Facebook users share 684,478 pieces of
content, Instagram users share 3,600 new photos, and Tumblr sees 27,778 new
posts published.
Extracting Business Intelligence
(Solution)
1. It is not a simple matter to discover business
intelligence from a mountain of accumulated data.
Data mining: a misnomer?
Alternative names:
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, business intelligence, etc.
Data Mining (Example)
Random Guessing vs. Potential Knowledge
Suppose we have to forecast the probability of rain in Islamabad city
for any particular day.
Without any prior knowledge, the probability of rain would be taken as 50%
(pure random guess).
If we had a lot of weather data, we could extract potential
rules using data mining, which could then forecast the chance of rain
better than random guessing.
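As a rough illustration of the point above, the sketch below compares the 50% guess with frequencies estimated from records. The ten daily records and the `humidity` attribute are invented for illustration; they are not real Islamabad weather data.

```python
# Hypothetical daily records (humidity, rained); invented for illustration only.
days = [
    ("high", True), ("high", True), ("high", False), ("high", True),
    ("low", False), ("low", False), ("low", True), ("low", False),
    ("normal", False), ("normal", False),
]

def rain_probability(records, humidity=None):
    """Fraction of matching days on which it rained."""
    matching = [rained for h, rained in records if humidity is None or h == humidity]
    return sum(matching) / len(matching)

prior_guess = 0.5                                      # no data: pure random guess
overall = rain_probability(days)                       # estimate from all records
given_high = rain_probability(days, humidity="high")   # estimate when humidity is high

print("guess:", prior_guess)
print("from data:", overall)
print("from data, humidity = high:", given_high)
```

Even this simple conditional frequency uses more information than the blind guess; a mined rule plays the same role on real data.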
Example: The Rule
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Task-relevant Data → Data Mining]
The Data Mining Process
• Step 0: Determine Business Objective/Learning the
application domain
- e.g. Forecasting the probability of rain
- Must have relevant prior knowledge and goals of application.
• Step 1: Creating a Target Data set/Prepare Data
- Data Selection
- Data Cleaning; Noisy and Missing values handling (may take 60% of
the effort!).
- Data Transformation (Normalization/Discretization).
- Attribute/Feature Selection.
• Step 2: Choosing the Function of Data Mining
- Classification, Clustering, Regression, Association Rules
• Step 3: Choosing The Mining Algorithm
- Selection of correct algorithm depending upon the quality of data.
- Selection of correct algorithm depending upon the density of data.
• Step 4: Data Mining
- Search for patterns of interest: a typical data mining algorithm can
mine millions of patterns.
• Step 5: Visualization/Knowledge Representation
- Visualization/Representation of interesting patterns, etc. and then
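The data transformation part of Step 1 (normalization and discretization) can be sketched as follows. This is a minimal illustration, not a prescribed method: the temperature readings and the cut points 15 and 28 are assumptions chosen for the example.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, cut_points, labels):
    """Map a numeric value to the label of the first interval it falls in."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

temperatures_c = [12, 18, 25, 31, 40]   # hypothetical readings in degrees Celsius
normalized = min_max_normalize(temperatures_c)

# Assumed cut points: below 15 is "cool", below 28 is "mild", otherwise "hot".
labels = [discretize(t, [15, 28], ["cool", "mild", "hot"]) for t in temperatures_c]

print(normalized)
print(labels)
```

Discretization of this kind is what turns raw readings into categorical attribute values such as hot, mild and cool.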
Data Mining and Business Intelligence
Layers listed from top to bottom; the potential to support business decisions increases toward the top. The typical user of a layer is given after the colon.
• Making Decisions: End User
• Data Presentation (Visualization Techniques): Business Analyst
• Data Mining (Information Discovery): Data Analyst
• Data Exploration (Statistical Analysis, Querying and Reporting)
• Data Warehouses / Data Marts (OLAP, MDA): DBA
• Data Sources (Paper, Files, Information Providers, Database Systems, OLTP)
Data Mining: On What Kind of Data?
1. Relational databases
2. Data warehouses
3. Transactional databases
4. Advanced DB and information repositories
Time-series data and temporal data
Text databases
Multimedia databases
Data Stream (Sensor Networks Data)
WWW
Data Mining: Confluence of Multiple
Disciplines
[Diagram: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Data Mining vs SQL, EIS, and OLAP
• SQL. SQL is a query language that is difficult for business people
to use.
• EIS = Executive Information Systems. EIS systems
provide graphical interfaces that give executives a pre-programmed
(and therefore limited) selection of reports,
automatically generating the necessary SQL for each.
• OLAP allows views along multiple dimensions, and drill-down,
therefore giving access to a vast array of analyses.
However, it requires manual navigation through scores of
reports, requiring the user to notice interesting patterns
themselves.
• Data Mining picks out interesting patterns. The user
can then use visualization tools to investigate further.
An Example of OLAP Analysis and its
Limits
• What is driving sales of walking sticks?
• Step 1: View some OLAP graphs, e.g. walking stick sales by city.
• Step 2: Noticing that Islamabad has high sales, you decide to investigate further.
• (Before OLAP, you would have to have written a very complex SQL query instead of just simply clicking to drill-down).
• It seems that old people are responsible for most walking stick sales.
• You confirm this by viewing a chart of age distributions by city.
[Chart, Step 1: Walking Sticks Sales by City (Karachi, Lahore, Islamabad); values shown: 50, 10, 400]
[Chart, Step 2: Walking Sticks Sales in Islamabad by Age (Less than 20, 20 to 60, Older than 60); values shown: 10, 30, 360]
[Chart: Age Distribution by City]
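The two OLAP steps in this example can be mimicked with a small aggregation sketch. The fact table below is an assumption: the Islamabad rows follow the age chart, while the assignment of 50 and 10 to Karachi and Lahore, and the age split within those two cities, are invented so that the example runs.

```python
from collections import defaultdict

# Assumed fact table: (city, age_group, units_sold).
sales = [
    ("Karachi", "20 to 60", 20),          # invented split of an assumed total of 50
    ("Karachi", "Older than 60", 30),
    ("Lahore", "Older than 60", 10),      # invented split of an assumed total of 10
    ("Islamabad", "Less than 20", 10),
    ("Islamabad", "20 to 60", 30),
    ("Islamabad", "Older than 60", 360),
]

def roll_up(rows, key_index, where=None):
    """Sum units sold per value of one dimension, optionally filtering rows first."""
    totals = defaultdict(int)
    for row in rows:
        if where is None or where(row):
            totals[row[key_index]] += row[2]
    return dict(totals)

# Step 1: sales by city.
by_city = roll_up(sales, key_index=0)
top_city = max(by_city, key=by_city.get)

# Step 2: drill down into the city that stood out, by age group.
by_age = roll_up(sales, key_index=1, where=lambda row: row[0] == top_city)

print(by_city)
print(top_city, by_age)
```

Note that the analyst still has to decide which dimension to drill into at each step; that manual navigation is the limit of OLAP the example points to.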
Data Mining vs Expert Systems
• Expert Systems = Rule-Driven Deduction
Top-down: From known rules (expertise) and data to
decisions. (To be dealt with in Part 2 of this course)
[Diagram: Rules and Data → Expert System → Decisions]
Difference between Machine Learning and
Data Mining
Machine learning techniques were designed to deal with limited
amounts of data from artificial intelligence applications, whereas data mining
techniques deal with the large amounts of data stored in databases.
Data Preprocessing
Handling Missing and Noisy Data (Data Cleaning).
Techniques we will cover.
• Missing values Imputation using Mean, Median and Mode.
• Missing values Imputation using K-Nearest Neighbor.
• Missing values Imputation using Association Rules Mining.
• Missing values Imputation using Fault-Tolerant Patterns.
• Data Binning for Noisy Data.
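Two of the techniques listed above, imputation with the mean, median or mode and binning for noisy data, can be sketched in a few lines. The readings are hypothetical, and smoothing by bin means over equal-frequency bins is only one of several binning variants; it is used here as an assumption about what the course will cover.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort, split into bins, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

humidity = [40, None, 55, 40, 90, None]   # hypothetical readings with missing values

print(impute(humidity, "mean"))
print(impute(humidity, "median"))
print(impute(humidity, "mode"))
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3))
```

The mean is sensitive to extreme readings such as 90, which is one reason the median or mode is sometimes preferred for imputation.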
Itemset Support
{Butter} 4
{Bread} 3
{Egg} 2
{Bread,Butter} 3
{Bread, Butter, Egg} 2
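The support of an itemset is the number of transactions that contain all of its items. The sketch below counts supports by brute force; the four transactions are hypothetical and were chosen only so that the counts reproduce the table above, and `min_support=2` is an assumed threshold.

```python
from itertools import combinations

# Hypothetical transactions, chosen so the counts match the support table.
transactions = [
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter"},
    {"Butter"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def frequent_itemsets(transactions, min_support):
    """Brute force: count every candidate itemset and keep the frequent ones."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = support(candidate, transactions)
            if count >= min_support:
                result[candidate] = count
    return result

for itemset, count in frequent_itemsets(transactions, min_support=2).items():
    print(set(itemset), count)
```

Brute-force counting examines every possible itemset, which grows exponentially with the number of items; algorithms such as Apriori and FP-Growth exist to avoid that.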
Data Mining Functionalities (2)
Association Rule Mining
Topics we will cover
Frequent Itemset Mining Algorithms (Apriori, FP-Growth, Bit-vector).
Fault-Tolerant/Approximate Frequent Itemset Mining.
N-Most Interesting Frequent Itemset Mining.
Closed and Maximal Frequent Itemset Mining.
Incremental Frequent Itemset Mining
Sequential Patterns.
Projects
• Mining Fault-Tolerant Frequent Patterns Using Pattern-Growth.
• Application of Fault-Tolerant Frequent Patterns to Missing Values
Imputation (Course Project).
Data Mining Functionalities (2)
Classification and Prediction
Finding models (functions) that describe and distinguish classes or
concepts for future prediction.
Example: Classify rainy/un-rainy cities based on the Temperature,
Humidity and Windy attributes.
The previous business decisions (class labels) must be known (Supervised
Learning).
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  No
Islamabad   hot          high      true   Yes
Islamabad   hot          high      false  Yes
Multan      mild         low       false  No
Karachi     cool         normal    false  No
Rawalpindi  hot          high      true   Yes

Rule
• If Temperature = Hot & Humidity = High then Rain = Yes.

Prediction of unknown records
City        Temperature  Humidity  Windy  Rain
Murree      hot          high      false  ?
Sibi        mild         low       true   ?
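The rule from this example can be checked against the six labelled records and then applied to the two unknown records. This is a minimal sketch: the rule is taken from the slide, but answering "No" whenever the rule does not fire is an assumption added so that every record gets a prediction.

```python
# Labelled records from the table: (city, temperature, humidity, windy, rain).
training = [
    ("Lahore", "hot", "low", False, "No"),
    ("Islamabad", "hot", "high", True, "Yes"),
    ("Islamabad", "hot", "high", False, "Yes"),
    ("Multan", "mild", "low", False, "No"),
    ("Karachi", "cool", "normal", False, "No"),
    ("Rawalpindi", "hot", "high", True, "Yes"),
]

def predict_rain(temperature, humidity, windy):
    """Rule: hot and high humidity means rain. Default "No" is an assumption."""
    if temperature == "hot" and humidity == "high":
        return "Yes"
    return "No"

# How well does the rule agree with the known decisions?
correct = sum(predict_rain(t, h, w) == rain for _, t, h, w, rain in training)
accuracy = correct / len(training)

# Apply the rule to the unknown records.
unknown = [("Murree", "hot", "high", False), ("Sibi", "mild", "low", True)]
predictions = {city: predict_rain(t, h, w) for city, t, h, w in unknown}

print(accuracy)
print(predictions)
```

Agreement on the training records alone does not show that the rule generalizes; that would need test data the rule was not derived from.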
Data Mining Functionalities (2)
Cluster Analysis
Group data to form new classes based on unlabeled data.
Business decisions are unknown (also called Unsupervised Learning).
Example: Group cities based on the Temperature,
Humidity and Windy attributes when the Rain label is unknown.
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?

[Figure: the records grouped into 3 clusters]
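One way to group such unlabelled records is to merge the most similar records step by step until the desired number of groups remains. The sketch below does this with a mismatch count (Hamming distance) and single-linkage merging; both choices, and the target of 3 clusters, are assumptions for illustration, and the resulting groups need not match the ones drawn on the slide.

```python
# Unlabelled records: (temperature, humidity, windy); names kept separately.
names = ["Lahore", "Islamabad", "Islamabad", "Multan", "Karachi", "Rawalpindi"]
points = [
    ("hot", "low", False),
    ("hot", "high", True),
    ("hot", "high", False),
    ("mild", "low", False),
    ("cool", "normal", False),
    ("hot", "high", True),
]

def hamming(a, b):
    """Number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def agglomerate(points, k):
    """Single-linkage agglomerative clustering down to k clusters of record indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(hamming(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge the closest pair of clusters
        del clusters[j]
    return clusters

clusters = agglomerate(points, k=3)
for cluster in clusters:
    print([names[i] for i in cluster])
```

No Rain label is used anywhere; the groups come only from attribute similarity, which is what distinguishes clustering from classification.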
Data Mining Functionalities (3)
Outlier Analysis
Outlier: A data object that does not comply with the general behavior
of the data.
It can be considered as noise or an exception, but is quite useful in fraud
detection and rare events analysis.
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?

[Figure: 2 outliers marked among the records]
Are All the “Discovered” Patterns
Interesting?
A data mining system/query may generate thousands of
patterns; not all of them are interesting.
Suggested approach: Query-based, constraint-based
mining
Interestingness Measures: A pattern is interesting if
it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to
confirm.
Can We Find All and Only Interesting
Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns?
Remember that many problems in data mining are NP-hard, so an exhaustive
search for a globally best solution is often infeasible and heuristics are used instead.
Search for only interesting patterns: Optimization
Can a data mining system find only the interesting patterns?
Approaches
• First generate all the patterns and then filter out the uninteresting
ones.
• Generate only the interesting patterns: constraint-based mining (give
threshold factors in mining)
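The two approaches above can be compared on a toy problem where "interesting" is taken to mean "support at least a threshold". This is a sketch under stated assumptions: the transactions and `MIN_SUPPORT` are hypothetical, and the second function is a simplified Apriori-style level-wise search, not a full implementation of any particular algorithm.

```python
from itertools import combinations

# Hypothetical transactions, for illustration only.
transactions = [
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter"},
    {"Butter", "Milk"},
    {"Bread", "Butter", "Milk"},
]
MIN_SUPPORT = 2   # assumed threshold
items = sorted(set().union(*transactions))

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def generate_then_filter():
    """Approach 1: count every possible itemset, then drop the infrequent ones."""
    counted, frequent = 0, set()
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            counted += 1
            if support(frozenset(combo)) >= MIN_SUPPORT:
                frequent.add(frozenset(combo))
    return frequent, counted

def constraint_based():
    """Approach 2: level-wise search that only extends itemsets found frequent."""
    counted, frequent = 0, set()
    level = [frozenset([item]) for item in items]
    while level:
        survivors = []
        for candidate in level:
            counted += 1
            if support(candidate) >= MIN_SUPPORT:
                survivors.append(candidate)
        frequent.update(survivors)
        # Join surviving itemsets that differ in exactly one item.
        level = list({a | b for a in survivors for b in survivors
                      if len(a | b) == len(a) + 1})
    return frequent, counted

all_first, counted_first = generate_then_filter()
pruned, counted_pruned = constraint_based()
print(counted_first, counted_pruned, all_first == pruned)
```

Both approaches return the same frequent itemsets, but pushing the threshold into the search means fewer candidates are ever counted.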
Reading Assignment
Book Chapter
Chapter 1 of “Data Mining: Concepts and Techniques”
by Jiawei Han and Micheline Kamber.
Data Mining ------- Where?
Some Nice Resources
ACM Special Interest Group on Knowledge Discovery and Data
Mining (SIGKDD) http://www.acm.org/sigs/sigkdd/.