Data Mining Note
Data mining works on distributed historical data stored in large databases, data warehouses and other massive information repositories.
Four main reasons why DM is needed now:
The competitive pressure is very strong
Massive data collection
The computing power
DM commercial products and machine learning algorithms are available.
Why is Data Mining important?
Customer relationship management
Credit ratings
Targeted marketing
Fraud detection/Network intrusion detection
Data Mining Helps Extract Such Useful Information
Query Examples
1. Database
Find all credit applicants with first name ‘Alex’.
Identify customers who have purchased goods worth more than Birr 10,000 in the last month.
Find all customers who have purchased Bread.
2. Data Mining
Find all credit applicants who have no credit risks. (Classification)
Identify customers with similar buying habits. (Clustering)
Find all items which are frequently purchased with Bread. (Association rules)
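The contrast between the two kinds of question can be sketched in Python. This is only an illustration: the transactions and the item names other than Bread are made-up sample data, and simple co-occurrence counting stands in for a full association rule algorithm.

```python
from collections import Counter

# Hypothetical sample transactions (made-up data for illustration only).
transactions = [
    {"customer": "C1", "items": {"Bread", "Milk", "Butter"}},
    {"customer": "C2", "items": {"Bread", "Butter"}},
    {"customer": "C3", "items": {"Milk", "Sugar"}},
    {"customer": "C4", "items": {"Bread", "Milk", "Butter"}},
]

# Database-style query: an exact condition whose result is a subset of the stored records.
bread_buyers = [t["customer"] for t in transactions if "Bread" in t["items"]]

# Data-mining-style question: which items are bought together with Bread, and how often?
bread_baskets = [t["items"] for t in transactions if "Bread" in t["items"]]
co_counts = Counter(
    item for basket in bread_baskets for item in basket if item != "Bread"
)

# Confidence of the rule "Bread -> item": share of Bread baskets that also contain the item.
confidence = {item: count / len(bread_baskets) for item, count in co_counts.items()}
```

The first result lists records that already exist in the data; the second is a derived pattern (a rule with a confidence value) that is not stored anywhere in the database.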
Data Mining vs. Database
Data mining
• Query is poorly defined
• No precise/exact query language
• Works on non-operational data
• Output is not a subset of the database
Database
• Query is well defined
• Precise query language (Structured Query Language, SQL)
• Works on operational data
• Output is precise and is a subset of the database
DM vs. Data Warehouse
Data Warehouse provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
Data warehouse: a relational database management system responsible for the collection and storage of data
to support management decision making and problem solving.
- It enables managers and other business professionals to undertake data mining.
Data mart: a subset of a data warehouse for small and medium-size businesses or departments within larger
companies.
A data warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides support for
decision making.
A. Integrated: a centralized, consolidated database that integrates data derived from the entire organization.
B. Subject-oriented: the data warehouse contains data organized by topic, e.g. sales, marketing, finance.
C. Time-variant: in contrast to the operational database, which focuses on current transactions, the data warehouse
represents the flow of data through time.
D. Non-volatile: once data enter the data warehouse, they are never removed.
Database & data warehouse: differences
- A data warehouse receives its data from operational databases.
- A data warehouse contains historical data over a long time horizon.
- The data warehouse environment is characterized by read-only transactions on very large data sets.
- The operational environment is characterized by numerous update transactions on a few data entities at a time.
Data Processing Technologies
1. OLAP (Online Analytical Processing): an advanced data analysis environment that supports decision making.
2. Data mining: tools that analyze the data and uncover problems or opportunities hidden in the data relationships.
- OLAP provides top-down, query-driven analysis.
- Data mining provides bottom-up, discovery-driven analysis.
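A minimal Python sketch of what "top-down, query-driven" means in practice: the analyst decides in advance which dimension to summarize by, and the tool only aggregates. The sales rows and the roll_up helper are hypothetical, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, quarter, sales_amount).
sales = [
    ("North", "Q1", 100.0),
    ("North", "Q2", 150.0),
    ("South", "Q1", 80.0),
    ("South", "Q2", 120.0),
]

def roll_up(rows, key_index):
    """Sum the sales amount grouped by the dimension column chosen by the analyst."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)

by_region = roll_up(sales, 0)   # analyst asks: total sales per region
by_quarter = roll_up(sales, 1)  # analyst asks: total sales per quarter
```

In a discovery-driven (data mining) setting, by contrast, the analyst would not name the grouping beforehand; the algorithm would search for the groupings or rules that turn out to be interesting.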
Business Intelligence
• BI takes advantage of data mining and data warehousing to help organizations gather their information in
a more timely and more valuable manner.
BI:
– keeps the organization informed about market trends,
– alerts it to new market potential,
– helps it determine how competitors are doing.
Business Intelligence vs. Data Mining
Business intelligence is information about a company's past performance that is used to help
predict the company's future performance.
- It is used to analyze and uncover information about past performance at an aggregate level.
Data mining allows users to sift through the enormous amount of information available in data
warehouses.
- Data mining is more intuitive, allowing for increased insight beyond data warehousing.
- An implementation of data mining in an organization serves as a guide to uncovering
inherent trends and tendencies in historical information. It also allows for statistical
predictions, groupings and classifications of data.
Data Mining vs. Knowledge Discovery in Databases
KDD is often used as a synonym for data mining. Some authors define KDD as the whole process involving data
selection, data pre-processing (cleaning), data transformation, mining, result evaluation and visualization.
- KDD is the process of finding useful information and patterns in data.
Data mining, on the other hand, refers to the modeling step, which uses various techniques to extract useful
information/patterns from the data.
- DM is the use of algorithms to extract hidden patterns and knowledge from data.
Stages in DM: The KDD process
• Selection: Obtain data from various heterogeneous sources such as databases, data warehouses, files,
non-electronic records, etc.
• Preprocessing: Cleanse inconsistent and incorrect data; fill in incomplete records; predict missing values; correct
erroneous and anomalous data.
• Transformation: Convert data from different sources into a common new format. Apply data reduction and data
categorization/binning to ease data mining.
• Mining: Apply classification or clustering techniques to obtain predictive or descriptive models.
• Interpretation/Evaluation: Present results to the user in a meaningful manner using various visualization and GUI
strategies.
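The five stages can be sketched as a small Python pipeline. This is an illustrative toy, not a standard implementation: the records, field names and the 4000 income cut-off are assumptions, invalid records are simply dropped rather than corrected, and the "mining" step is a trivial descriptive summary rather than a real classifier or clustering algorithm.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "income": 3000}, {"id": 2, "age": None, "income": 5200}]
source_b = [{"id": 3, "age": 47, "income": 8100}, {"id": 4, "age": 33, "income": -1}]

def select(*sources):
    """Selection: gather records from several sources into one list of copies."""
    return [dict(rec) for src in sources for rec in src]

def preprocess(records):
    """Preprocessing: drop records with a negative income, fill missing ages with the mean age."""
    valid = [r for r in records if r["income"] >= 0]
    fill = mean(r["age"] for r in valid if r["age"] is not None)
    for r in valid:
        if r["age"] is None:
            r["age"] = fill
    return valid

def transform(records):
    """Transformation: bin income into two categories (assumed cut-off: 4000)."""
    for r in records:
        r["income_band"] = "low" if r["income"] < 4000 else "high"
    return records

def mine(records):
    """Mining: a trivial descriptive model, the mean age per income band."""
    bands = {}
    for r in records:
        bands.setdefault(r["income_band"], []).append(r["age"])
    return {band: mean(ages) for band, ages in bands.items()}

def evaluate(model):
    """Interpretation/Evaluation: present the result in a readable form."""
    return [f"{band}: mean age {age:.1f}" for band, age in sorted(model.items())]

report = evaluate(mine(transform(preprocess(select(source_a, source_b)))))
```

Each function takes the output of the previous stage, which mirrors the way the KDD stages feed into one another.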
Data Mining Metrics
1. Return on Investment (ROI): compares the costs of DM techniques against the savings or benefits from their use.
2. Accuracy in classification
– Analyze true positives, false positives and false negatives to calculate the recall and precision of the system
– Measure the percentage of correct classifications
3. Space/time complexity
– Running time: how fast the algorithm runs
– Storage or memory space requirement
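As a rough sketch, the accuracy metrics and ROI can be computed as follows. The label lists and the cost/benefit figures are made-up examples, and ROI is taken here as (benefit - cost) / cost, which is one common definition rather than the only one.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute precision, recall and accuracy from two equal-length label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    tn = sum(a != positive and p != positive for a, p in pairs)  # true negatives
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }

def roi(benefit, cost):
    """Return on investment as a fraction of the cost (assumed definition)."""
    return (benefit - cost) / cost

# Hypothetical labels: 1 = positive class, 0 = negative class.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

metrics = classification_metrics(actual, predicted)
project_roi = roi(benefit=15000, cost=10000)
```

Precision answers "of the cases predicted positive, how many were right?", while recall answers "of the truly positive cases, how many were found?".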
Data Mining implementation issues
• Scalability: data mining techniques must perform well on massive real-world data sets.
• Real-world data: real-world data are noisy and have many missing attribute values. Algorithms should be able
to work even in the presence of these problems.
• Updates: the database cannot be assumed to be static; the data change frequently.
• High dimensionality: a conventional database schema may be composed of many different attributes. The problem
here is that not all attributes may be needed to solve a given DM problem.
• Overfitting: the size and representativeness of the dataset determine whether a model built from a given
database state also fits future database states.
• Application: determining the intended use for the information obtained from the DM tool is a challenge.
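The overfitting issue can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the data are synthetic, the "true" rule is label = 1 when x >= 5, and two training labels are flipped on purpose to simulate noise.

```python
# Hypothetical training data (the current database state), with two noisy labels.
train = [(x, int(x >= 5)) for x in range(10)]
train[2] = (2, 1)  # noise: should be 0
train[7] = (7, 0)  # noise: should be 1

# Unseen data (a future database state) following the same rule, without noise.
test = [(x + 0.4, int(x + 0.4 >= 5)) for x in range(10)]

def nearest_neighbor(x):
    """Memorizing model: copy the label of the closest training point, noise included."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def threshold_rule(x):
    """Simpler model: a single threshold, which cannot memorize the noisy labels."""
    return int(x >= 5)

def accuracy(model, data):
    """Fraction of examples in data that the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

nn_train, nn_test = accuracy(nearest_neighbor, train), accuracy(nearest_neighbor, test)
rule_train, rule_test = accuracy(threshold_rule, train), accuracy(threshold_rule, test)
```

In this toy case the memorizing model is perfect on the training data but loses accuracy on the unseen data, while the simpler rule does worse on the noisy training data yet generalizes better.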