
02 Data Warehouse

PRAVEEN KUMAR SRIVASTAVA


Data Preprocessing

Data preprocessing is an important step in the data mining process. It refers to cleaning, transforming, and integrating data in order to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task. Some common steps in data preprocessing include:
❖ Data Cleaning
❖ Data Integration
❖ Data Transformation
❖ Data Reduction
❖ Data Discretization
❖ Data Normalization
Data Cleaning

The data can have many irrelevant and missing parts. Data cleaning is done to handle this. It involves handling missing data, noisy data, and so on.

(a). Missing Data:


This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.

• Fill the Missing values:


There are various ways to do this. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value.
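As an illustration, the following minimal sketch uses pandas to apply both options to a made-up dataset; the column names and values are hypothetical.

    import pandas as pd
    import numpy as np

    # Hypothetical dataset with a missing price value
    df = pd.DataFrame({"item": ["A", "B", "C", "D"],
                       "price": [4.0, np.nan, 15.0, 21.0]})

    # Option 1: ignore (drop) tuples that contain missing values
    dropped = df.dropna()

    # Option 2: fill missing values with the attribute mean
    df["price"] = df["price"].fillna(df["price"].mean())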
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
• Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into segments of equal size, and each segment is then handled separately. For example, all values in a segment can be replaced by the segment mean, or the boundary values can be used to smooth the segment.

• Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables). A short sketch of regression smoothing follows this list.

• Clustering:
This approach groups similar data values into clusters. Values that fall outside the clusters may be treated as outliers.
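As mentioned under Regression above, a simple way to smooth noisy values is to fit a linear model and replace each observation with its fitted value. The sketch below uses NumPy; the data points are made up for the example.

    import numpy as np

    # Noisy (x, y) observations (hypothetical values)
    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

    # Fit a linear regression (one independent variable)
    slope, intercept = np.polyfit(x, y, deg=1)

    # Replace each observed y with its fitted value to smooth the noise
    y_smoothed = slope * x + intercept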
Binning Method:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local
smoothing.
For Example:

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

In this example, the data for price are first sorted and then partitioned
into equal-frequency bins of size 3 (i.e., each bin contains three values).
In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin. For example, the mean of the values 4, 8, and 15 in Bin
1 is 9. Therefore, each original value in this bin is replaced by the value
9.
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width,
the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of
values in each bin is constant.
Binning is also used as a discretization technique.
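The bin-mean and bin-boundary smoothing described above can be sketched in a few lines of Python. The code below uses NumPy and the nine sorted prices from the example, assuming three equal-frequency bins of size 3.

    import numpy as np

    prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
    bins = prices.reshape(3, 3)                  # equal-frequency bins of size 3

    # Smoothing by bin means: replace every value with its bin mean
    by_means = np.repeat(bins.mean(axis=1), 3)   # -> 9, 9, 9, 22, 22, 22, 29, 29, 29

    # Smoothing by bin boundaries: replace each value with the closer
    # of the bin's minimum and maximum
    lo = bins.min(axis=1, keepdims=True)
    hi = bins.max(axis=1, keepdims=True)
    by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()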
Data Transformation

This step is taken in order to transform the data into forms appropriate for the mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0); a short sketch follows this list.

2. Attribute Construction (Attribute Selection):
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
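As a minimal sketch of the normalization step, the code below applies min-max scaling (to the range 0.0 to 1.0) and z-score scaling to a hypothetical numeric attribute; the values are made up.

    import numpy as np

    values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute

    # Min-max normalization: rescale values into the range [0.0, 1.0]
    min_max = (values - values.min()) / (values.max() - values.min())

    # Z-score normalization: mean 0, standard deviation 1
    z_score = (values - values.mean()) / values.std()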
Data Reduction:

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller
in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should
be more efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected
and removed.
Dimensionality reduction, where encoding mechanisms are used to reduce the dataset size.
Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms; a short sketch of sampling and histogram-based reduction follows this list.
Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction.
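The following sketch illustrates two simple numerosity reduction techniques, random sampling and a histogram summary, on synthetic data generated just for this example.

    import numpy as np

    rng = np.random.default_rng(42)
    data = rng.normal(loc=50, scale=10, size=10_000)   # hypothetical attribute values

    # Numerosity reduction by sampling: keep a 1% random sample without replacement
    sample = rng.choice(data, size=100, replace=False)

    # Numerosity reduction by a histogram: store only bin counts and bin edges
    counts, edges = np.histogram(data, bins=20)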
Data Integration
Data mining often requires data integration—the merging of data from multiple data stores. Careful integration can
help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy
and speed of the subsequent data mining process.
The semantic heterogeneity and structure of data pose great challenges in data integration.

Entity Identification Problem


There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. This is referred to as the entity identification problem.
When matching attributes from one database to another during integration, special attention must be paid to the
structure of the data. This is to ensure that any attribute functional dependencies and referential constraints in the
source system match those in the target system.
Redundancy and Correlation Analysis
Redundancy is another important issue in data integration. An attribute (such as annual revenue, for instance) may be
redundant if it can be “derived” from another attribute or set of attributes. Inconsistencies in attribute or dimension
naming can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how
strongly one attribute implies the other, based on the available data.
For nominal data, we use the χ² (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute’s values vary from those of another.
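A minimal sketch of both checks is shown below, using SciPy's chi-square test of independence for a pair of nominal attributes and NumPy's correlation and covariance for a pair of numeric attributes; the contingency table and numeric values are made up for the example.

    import numpy as np
    from scipy import stats

    # Nominal attributes: chi-square test on a hypothetical contingency table
    observed = np.array([[250, 200],
                         [50, 1000]])
    chi2, p_value, dof, expected = stats.chi2_contingency(observed)

    # Numeric attributes: correlation coefficient and covariance
    a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
    b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])
    r = np.corrcoef(a, b)[0, 1]    # Pearson correlation coefficient
    cov = np.cov(a, b)[0, 1]       # sample covariance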
Tuple Duplication
In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level (e.g.,
where there are two or more identical tuples for a given unique data entry case). The use of denormalized tables
(often done to improve performance by avoiding joins) is another source of data redundancy. Inconsistencies often
arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences.
For example, if a purchase order database contains attributes for the purchaser’s name and address instead of a key
to this information in a purchaser database, discrepancies can occur, such as the same purchaser’s name appearing
with different addresses within the purchase order database.
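Detecting exact duplicates at the tuple level is straightforward with pandas; the sketch below uses a hypothetical purchase-order table.

    import pandas as pd

    orders = pd.DataFrame({
        "purchaser": ["J. Smith", "J. Smith", "A. Jones"],
        "address":   ["12 Oak St", "12 Oak St", "3 Elm Rd"],
        "amount":    [100, 100, 250],
    })   # hypothetical purchase-order tuples

    duplicates = orders[orders.duplicated()]   # flag exact duplicate tuples
    cleaned = orders.drop_duplicates()         # keep a single copy of each tuple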
Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts. For example, for the same real-
world entity, attribute values from different sources may differ. This may be due to differences in representation,
scaling, or encoding.
For instance, a weight attribute may be stored in metric units in one system and British imperial units in another. For
a hotel chain, the price of rooms in different cities may involve not only different currencies but also different services
(e.g., free breakfast) and taxes.
When exchanging information between schools, for example, each school may have its own curriculum and grading
scheme. One university may adopt a quarter system, offer three courses on database systems, and assign grades from
A+ to F, whereas another may adopt a semester system, offer two courses on databases, and assign grades from 1 to 10. It is difficult to work out precise course-to-grade transformation rules between the two universities, making information exchange difficult.
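One common resolution for representation conflicts like the weight example above is to convert all sources to a single unit before merging. The sketch below does this with pandas; the table names, column names, and values are hypothetical.

    import pandas as pd

    # Weights reported by two source systems in different units (hypothetical)
    src_a = pd.DataFrame({"item": ["X", "Y"], "weight_kg": [2.0, 5.5]})
    src_b = pd.DataFrame({"item": ["Z"], "weight_lb": [11.0]})

    # Resolve the conflict by converting everything to kilograms before merging
    src_b["weight_kg"] = src_b["weight_lb"] * 0.45359237
    merged = pd.concat([src_a[["item", "weight_kg"]],
                        src_b[["item", "weight_kg"]]], ignore_index=True)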
Data Transformation and Data Discretization
In this pre-processing step, the data are transformed or consolidated so that the resulting mining process
may be more efficient, and the patterns found may be easier to understand.

Data Transformation Strategies Overview


In data transformation, the data are transformed or consolidated into forms appropriate for mining. Strategies for
data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and added from the given
set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data
may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a
data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users. A short sketch of discretization and aggregation follows this list.
6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level
concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can
be automatically defined at the schema definition level.
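The sketch below illustrates strategies 3 and 5 with pandas: daily sales aggregated to monthly totals, and raw ages replaced by conceptual labels. The cut points, labels, and figures are assumptions made for the example.

    import pandas as pd

    # Aggregation: hypothetical daily sales rolled up to monthly totals
    sales = pd.DataFrame({"date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
                          "amount": [120.0, 80.0, 200.0]})
    monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

    # Discretization: replace raw ages with conceptual labels (hypothetical cut points)
    ages = pd.Series([5, 17, 25, 42, 70])
    labels = pd.cut(ages, bins=[0, 14, 24, 64, 120],
                    labels=["child", "youth", "adult", "senior"])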
Concept Hierarchy Generation for Nominal Data

Nominal attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.
Examples include geographic location, job category, and item type.
Manual definition of concept hierarchies can be a tedious and time-consuming task for a user or a domain expert.
Fortunately, many hierarchies are implicit within the database schema and can be automatically defined at the schema
definition level. The concept hierarchies can be used to transform the data into multiple levels of granularity.
For example, data mining patterns regarding sales may be found relating to specific regions or countries, in addition to
individual branch locations.
We study four methods for the generation of concept hierarchies for nominal data, as follows.
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts: Concept
hierarchies for nominal attributes or dimensions typically involve a group of attributes. A user or expert can
easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
For example, suppose that a relational database contains the following group of attributes:
street, city, province or state, and country.
Similarly, a data warehouse location dimension may contain the same attributes. A hierarchy can be defined by
specifying the total ordering among these attributes at the schema level such as
street < city < province or state < country.
2. Specification of a portion of a hierarchy by explicit data grouping: This is essentially the manual definition of a
portion of a concept hierarchy. In a large database, it is unrealistic to define an entire concept hierarchy by explicit
value enumeration. On the contrary, we can easily specify explicit groupings for a small portion of intermediate-level
data.
For example, after specifying that province and country form a hierarchy at the schema level, a user could define some intermediate levels manually.
3. Specification of a set of attributes, but not of their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy; a small sketch of one such heuristic follows this list.

4. Specification of only a partial set of attributes: Sometimes a user can be careless when defining a hierarchy, or
have only a vague idea about what should be included in a hierarchy. Consequently, the user may have included
only a small subset of the relevant attributes in the hierarchy specification.
For example, instead of including all of the hierarchically relevant attributes for location, the user may have
specified only street and city. To handle such partially specified hierarchies, it is important to embed data
semantics in the database schema so that attributes with tight semantic connections can be pinned together. In
this way, the specification of one attribute may trigger a whole group of semantically tightly linked attributes to
be “dragged in” to form a complete hierarchy. Users, however, should have the option to override this feature, as
necessary.
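For method 3, a common heuristic (an assumption here, not specified in the text above) is to order the attributes by their number of distinct values: the attribute with the most distinct values, such as street, is placed at the lowest level of the hierarchy. A minimal sketch with pandas and hypothetical location data:

    import pandas as pd

    # Hypothetical location data with an unordered set of attributes
    loc = pd.DataFrame({
        "street":  ["1 Main St", "2 Main St", "5 High St", "9 Oak Ave"],
        "city":    ["Delhi", "Delhi", "Mumbai", "Pune"],
        "country": ["India", "India", "India", "India"],
    })

    # Heuristic: more distinct values => lower level in the concept hierarchy
    order = loc.nunique().sort_values(ascending=False).index.tolist()
    hierarchy = " < ".join(order)   # -> "street < city < country"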
