Unit-2

Abhishek S. Rao
Introduction to Data Mining
Data mining refers to extracting or "mining" knowledge from large amounts of data.

Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process consists of an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved)
4. Data transformation (where data are consolidated into forms appropriate for mining, e.g., by aggregation)
5. Data mining (where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge)
7. Knowledge presentation (where the mined knowledge is presented to the user)
Architecture of a typical data mining system
Challenges of Data Mining
Data Mining Tasks
The 6 CRISP-DM phases

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
7 Crucial Business Benefits of Data Mining
1. Improved Decision-making
2. Enhanced Customer Understanding
3. Increased Sales and Revenue
4. Risk Mitigation and Fraud Detection
5. Competitive Advantage
6. Cost Reduction and Operational Efficiency
What kinds of data can be mined?
What kind of patterns can be mined in data mining?
Data Mining - Issues

Data mining is not an easy task: the algorithms used can become very complex, and data is not always available in one place, so it needs to be integrated from various heterogeneous data sources. These factors also give rise to a number of issues.
Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers to cleaning, transforming, and integrating data to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task.
Data Quality
Poor data quality negatively affects many data processing efforts. Data mining example: a classification model for detecting people who are loan risks is built using poor-quality data. The results may be that:
• some credit-worthy candidates are denied loans
• loans are given to individuals who are likely to default
What kinds of data quality problems occur? How can we detect problems with the data? What can we do about these problems?

Examples of data quality problems (a quick detection sketch in pandas follows this list):
• Noise and outliers
• Missing values
• Duplicate data
• Wrong data
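
A minimal pandas sketch of how two of these problems (missing values and duplicate data) can be detected, plus a simple rule-based check for wrong values; the DataFrame and its column names are hypothetical:

import numpy as np
import pandas as pd

# Hypothetical data: one duplicated row, several missing values,
# and one implausible age that a sanity rule can flag as wrong data.
df = pd.DataFrame({
    "customer": ["A", "B", "B", "C", "D"],
    "age":      [34, np.nan, np.nan, 29, 290],
    "spend":    [120.0, 85.5, 85.5, np.nan, 60.0],
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # count of fully duplicated rows
print(df[df["age"] > 120])    # simple sanity rule for wrong values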
Missing Data Handling

Missing values have many causes: malfunctioning equipment, changes in experimental design, collation of different data sources, and so on. People may also decline to supply information, or the information may simply not be applicable (e.g., children do not have an annual income). Common handling strategies (a minimal pandas sketch follows this list):

• Discard records with missing values
• For ordinal or continuous data, replace missing values with the attribute mean
• Substitute a value from a similar instance
• Ignore missing values, i.e., just proceed and let the tools deal with them
• Treat missing values as equal (all share the same missing-value code)
• Treat missing values as unequal values

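A minimal pandas sketch of the first three strategies above; the DataFrame and its column names (age, income) are made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical data with missing values (NaN is pandas' missing-value code).
df = pd.DataFrame({
    "age":    [23, 45, np.nan, 31, 52],
    "income": [48000, np.nan, 61000, np.nan, 75000],
})

# Strategy 1: discard records with missing values.
dropped = df.dropna()

# Strategy 2: replace missing values with the attribute (column) mean.
imputed = df.fillna(df.mean(numeric_only=True))

# Strategy 3: substitute a value from a similar instance, here approximated
# by sorting on a related attribute and forward-filling from the neighbor.
similar = df.sort_values("age").ffill()

print(dropped, imputed, similar, sep="\n\n")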

Major Tasks in Data Pre-processing

• Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and transformation.
• Data Integration: This involves combining data from multiple sources to create a
unified dataset. Data integration can be challenging as it requires handling data with
different formats, structures, and semantics. Techniques such as record linkage and data
fusion can be used for data integration.
• Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete categories.
• Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.
• Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
• Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or between -1 and 1. Normalization is often used to handle data with different units and scales. Common normalization techniques include min-max normalization, z-score normalization, and decimal scaling; a minimal sketch of normalization and discretization follows this list.
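
A minimal pandas sketch of three of the techniques named above: min-max normalization, z-score standardization, and equal-width/equal-frequency discretization. The sample values are made up for illustration:

import pandas as pd

values = pd.Series([12.0, 15.5, 20.0, 22.5, 30.0, 45.0])

# Min-max normalization: scale the data to the range [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: zero mean, unit variance.
z_score = (values - values.mean()) / values.std()

# Equal-width binning: intervals of equal size.
equal_width = pd.cut(values, bins=3, labels=["low", "mid", "high"])

# Equal-frequency binning: roughly the same number of points per bin.
equal_freq = pd.qcut(values, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"raw": values, "min_max": min_max, "z": z_score,
                    "width_bin": equal_width, "freq_bin": equal_freq}))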
Data Cleaning
[Figure: A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster centroid is marked with a "+", representing the average point in space for that cluster. Outliers may be detected as values that fall outside of the cluster sets.]
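
A minimal scikit-learn sketch of the cluster-based outlier detection the figure describes: fit k-means, measure each point's distance to its nearest centroid, and flag the points farthest from every cluster. The synthetic data and the 97th-percentile cutoff are made-up choices for illustration:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic clusters of customer locations plus a few stray points.
clusters = [rng.normal(loc=c, scale=0.5, size=(50, 2))
            for c in ([0, 0], [5, 5], [0, 5])]
outliers = np.array([[9.0, 0.0], [-4.0, 8.0]])
X = np.vstack(clusters + [outliers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# KMeans.transform returns each point's distance to every centroid;
# the minimum is the distance to the point's own cluster center.
dist_to_nearest = km.transform(X).min(axis=1)

# Flag points whose distance exceeds an (illustrative) percentile cutoff.
cutoff = np.percentile(dist_to_nearest, 97)
print("Outlier candidates:\n", X[dist_to_nearest > cutoff])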
Data Integration in Data Mining
Data Transformation
Cube Aggregation
Attribute subset selection
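
As described under Data Reduction above, attribute subset selection keeps only the features most relevant to the task. A minimal scikit-learn sketch, assuming a hypothetical labeled dataset and an illustrative choice of k=2:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
# Four attributes; only the first two actually relate to the class label.
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Keep the k attributes with the strongest statistical relationship to y.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("selected attribute indices:", selector.get_support(indices=True))

X_reduced = selector.transform(X)  # dataset with only the chosen attributes
print("reduced shape:", X_reduced.shape)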
Wavelet Transforms in Data Mining
• The discrete wavelet transform (DWT) is a signal processing technique that transforms linear signals. When the DWT is applied, the data vector X is transformed into a numerically different vector, X′, of wavelet coefficients; the two vectors X and X′ must be of the same length. When applying this technique to data reduction, we consider an n-dimensional data tuple X = (x1, x2, …, xn), where n is the number of attributes present in the relation of the data set.
• The wavelet-transformed data can be truncated, and this is what makes the technique useful for data reduction: if we store only a small fraction of the strongest wavelet coefficients, a compressed approximation of the original data is obtained. For example, all wavelet coefficients larger than some user-determined threshold can be retained, and the remaining coefficients are set to 0. The resulting representation is very sparse, so operations that can exploit sparsity are very fast if performed in wavelet space. The technique also removes noise without smoothing out the main features of the data, which makes it effective for data cleaning as well. Given the set of retained coefficients, an approximation of the original data can be reconstructed by applying the inverse of the DWT.
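
A minimal sketch of DWT-based data reduction using the PyWavelets library (pywt), assuming a Haar wavelet and an illustrative keep-the-strongest-10% rule; the signal is synthetic:

import numpy as np
import pywt

# Synthetic signal: a smooth trend plus a little noise.
x = (np.sin(np.linspace(0, 4 * np.pi, 256))
     + 0.1 * np.random.default_rng(1).normal(size=256))

# Forward DWT: decompose into multi-level wavelet coefficients.
coeffs = pywt.wavedec(x, "haar", level=4)

# Keep only the strongest ~10% of coefficients; zero out the rest.
flat = np.concatenate(coeffs)
threshold = np.quantile(np.abs(flat), 0.90)
truncated = [pywt.threshold(c, threshold, mode="hard") for c in coeffs]

# Inverse DWT: reconstruct a compressed approximation of the original data.
x_approx = pywt.waverec(truncated, "haar")
print("max reconstruction error:", np.max(np.abs(x - x_approx)))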
Numerosity Reduction in Data Mining
Discretization in data mining
Techniques of data discretization
Cluster Analysis
DATA WAREHOUSING AND ON-LINE ANALYTICAL PROCESSING

In today's rapidly changing corporate environment, organizations are turning to cloud-based technologies for convenient data collection, reporting, and analysis. This is where data warehousing comes in as a core component of business intelligence that enables businesses to enhance their performance. It is important to understand what a data warehouse is and why it is evolving in the global marketplace.
Key Characteristics of a Data Warehouse

• Subject-Oriented
• Integrated
• Non-Volatile
• Time-Variant
Differences between Operational Database Systems and Data Warehouses
The major task of online operational database systems is to perform
online transaction and query processing. These systems are called
online transaction processing (OLTP) systems. They cover most of the
day-to-day operations of an organization such as purchasing, inventory,
manufacturing, banking, payroll, registration, and accounting. Data
warehouse systems, on the other hand, serve users or knowledge
workers in the role of data analysis and decision making. Such systems
can organize and present data in various formats to accommodate the
diverse needs of different users. These systems are known as online
analytical processing (OLAP) systems.
Comparison of OLTP and OLAP Systems
Data Warehousing: A Multitiered Architecture
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
Extraction, Transformation, and Loading
Data Warehouse Modeling: Data Cube and OLAP
Data warehouses and OLAP tools are based on a multidimensional data
model. This model views data in the form of a data cube.
What is a Data Cube?
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme, such as sales or transactions, and a fact table represents this theme. Facts are numerical measures; thus, the fact table contains measures (such as Rs_sold) and keys to each of the related dimension tables. Dimensions are the entities or perspectives that define the data cube, and facts are the quantities used for analyzing the relationships between dimensions. A minimal pandas sketch of this idea follows.
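
A minimal pandas sketch of a two-dimensional slice of a sales cube: a fact table whose measure (Rs_sold, as named above) is aggregated over two hypothetical dimensions, item and city; the data is made up:

import pandas as pd

# A tiny fact table: one row per sale, with dimension keys and a measure.
fact = pd.DataFrame({
    "item":    ["TV", "TV", "phone", "phone", "TV", "phone"],
    "city":    ["Mumbai", "Delhi", "Mumbai", "Delhi", "Mumbai", "Mumbai"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "Rs_sold": [52000, 48000, 21000, 23000, 55000, 25000],
})

# Aggregate the measure along two dimensions: a 2-D view of the cube.
cube_2d = fact.pivot_table(values="Rs_sold", index="item",
                           columns="city", aggfunc="sum")
print(cube_2d)

# Rolling up (aggregating away) the city dimension gives a 1-D view.
print(fact.groupby("item")["Rs_sold"].sum())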
Data Warehousing - Schemas
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse is also required to maintain a schema. A database uses the relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema. Here, we discuss the schemas used in a data warehouse.
Measures: Their Categorization and Computation in Data Mining
Typical OLAP Operations
Data Warehouse Design
Data Warehouse Usage for Information Processing
From Online Analytical Processing to Multidimensional Data Mining
Data Warehouse Implementation
Supervised Learning
Regression
Regression Vs Classification
Multiple Linear Regression
Multiple linear regression refers to a statistical technique that is used to predict the outcome of a variable based on the values of two or more other variables. It is sometimes known simply as multiple regression, and it is an extension of linear regression. The variable that we want to predict is known as the dependent variable, while the variables we use to predict it are known as independent or explanatory variables. A minimal sketch follows.
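
A minimal scikit-learn sketch of multiple linear regression with two hypothetical explanatory variables (area and age) predicting a dependent variable (price); the numbers are made up:

import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent variables per row: [area_sqft, age_years].
X = np.array([[1200, 10], [1500, 5], [900, 20], [1800, 2], [1100, 15]])
# Dependent variable: price (in thousands of rupees).
y = np.array([3200, 4100, 2300, 5000, 2900])

model = LinearRegression().fit(X, y)

print("coefficients:", model.coef_)  # one coefficient per explanatory variable
print("intercept:", model.intercept_)
print("prediction for [1300 sqft, 8 years]:", model.predict([[1300, 8]]))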
