Intro to Data warehousing lecture 17

Intro to Data Warehousing
Data Warehousing vs Data Mining & Data
Preprocessing in Data Mining
Ch Anwar ul Hassan (Lecturer)
Department of Computer Science and Software
Engineering
Capital University of Sciences & Technology, Islamabad
Pakistan
anwarchaudary@gmail.com

Difference between Data Warehousing and Data
Mining
• Data Science is an area
• A data warehouse is built to support management
functions, whereas data mining is used to extract
useful information and patterns from data. Data
warehousing is the process of compiling information
into a data warehouse.

Data Warehousing
• It is a technology that aggregates structured data from one or
more sources so that it can be compared and analyzed, rather
than used for transaction processing. A data warehouse is designed
to support the management decision-making process by providing a
platform for data cleaning, data integration and data
consolidation. A data warehouse contains subject-oriented,
integrated, time-variant and non-volatile data.

Data Mining
• It is the process of finding patterns and correlations within
large data sets to identify relationships between data. Data
mining tools allow a business organization to predict customer
behavior. Data mining tools are used to build risk models and
detect fraud. Data mining is used in market analysis and
management, fraud detection, corporate analysis and risk
management.

Data Preprocessing in Data Mining
• Data preprocessing is a data mining technique used to
transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:
1. Data Cleaning:
Raw data can contain many irrelevant and missing parts. Data cleaning deals
with these problems; it involves handling missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are absent from the data set. It can be
handled in various ways, including:
Ignore the tuples:
This approach is suitable only when the data set is quite large and
multiple values are missing within a tuple.
Fill the missing values:
There are various ways to do this. You can fill in the missing values
manually, with the attribute mean, or with the most probable value.
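The two ways of handling missing data described for the data cleaning step can be sketched in a few lines of Python. This is only an illustration: the records, the attribute names, and the `max_missing` threshold are assumptions and do not come from the lecture.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 41, "income": None},
    {"age": None, "income": None},
    {"age": 33, "income": 58000},
]

def drop_sparse_tuples(rows, max_missing=1):
    """Ignore tuples that have more than max_missing missing values."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

def fill_with_mean(rows, attribute):
    """Replace missing values of one attribute with the mean of the observed ones."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed)
    return [{**r, attribute: fill if r[attribute] is None else r[attribute]} for r in rows]

kept = drop_sparse_tuples(rows)  # drops the tuple in which both values are missing
cleaned = fill_with_mean(fill_with_mean(kept, "age"), "income")
print(cleaned)
```

Filling with the attribute mean keeps every remaining tuple but pulls the filled values toward the center of the distribution, so it is usually applied only after the sparsest tuples have been ignored.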

(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated by faulty data collection, data entry errors, etc. It can be handled in
the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The data is divided
into segments (bins) of equal size, and each segment is handled separately. All
values in a segment can be replaced by the segment mean, or each value can be
replaced by the nearest segment boundary.
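A minimal sketch of the binning method, assuming equal-size bins of three values and a small list of made-up prices. The function names are illustrative, not part of any library.

```python
from statistics import mean

def equal_size_bins(values, bin_size):
    """Sort the data and cut it into consecutive segments of bin_size values."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's two boundary values."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical data
bins = equal_size_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```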

Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
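As a sketch of smoothing by linear regression, the code below fits a least-squares line to a handful of made-up noisy readings and replaces each reading with the fitted value. The data and the helper name `fit_line` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (one independent variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical noisy readings, roughly y = 2x

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # noisy values replaced by fitted values
print(slope, intercept)
print(smoothed)
```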
Clustering:
This approach groups similar data into clusters. Values that fall outside every
cluster can be treated as outliers, although some outliers may go undetected.

2. Data Transformation:
This step transforms the data into forms suitable for the mining process.
It involves the following methods:
Normalization:
It scales the data values into a specified range (-1.0 to 1.0 or
0.0 to 1.0).
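One common way to scale values into such a range is min-max normalization. The sketch below assumes that method (the slide does not name a specific one) and uses made-up income values.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical; map them to the lower bound to avoid dividing by zero.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

incomes = [12000, 35000, 58000, 98000]  # hypothetical data
unit_range = min_max_normalize(incomes)             # scaled to 0.0 .. 1.0
signed_range = min_max_normalize(incomes, -1.0, 1.0)  # scaled to -1.0 .. 1.0
print(unit_range)
print(signed_range)
```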
Attribute Construction:
In this strategy, new attributes are constructed from the given set of attributes
to help the mining process.

Discretization:
• This replaces the raw values of a numeric attribute with interval labels
or conceptual labels.
Concept Hierarchy Generation:
• Here attributes are converted from a lower level to a higher level in a
hierarchy. For example, the attribute “city” can be converted to “country”.
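The city-to-country example can be sketched as a simple lookup. The mapping and the customer list below are hypothetical and exist only to show the roll-up.

```python
from collections import Counter

# Hypothetical city-to-country hierarchy, used only for illustration.
CITY_TO_COUNTRY = {
    "Islamabad": "Pakistan",
    "Lahore": "Pakistan",
    "Karachi": "Pakistan",
    "Paris": "France",
}

def roll_up(cities, hierarchy):
    """Replace each low-level value with its parent concept in the hierarchy."""
    return [hierarchy.get(city, "Unknown") for city in cities]

customer_cities = ["Lahore", "Paris", "Islamabad", "Karachi", "Paris"]
customer_countries = roll_up(customer_cities, CITY_TO_COUNTRY)
country_counts = Counter(customer_countries)
print(customer_countries)
print(country_counts)
```

After the roll-up, five distinct-looking city values collapse into two country values, which is what makes patterns at the higher level easier to see.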

3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder
as the volume grows. Data reduction techniques address this problem. They aim
to increase storage efficiency and to reduce data storage and analysis costs.
The various steps to data reduction are:
Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded.
To perform attribute selection, one can use the significance level and the
p-value of each attribute. An attribute whose p-value is greater than the
significance level can be discarded.
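The p-value rule for attribute subset selection amounts to a simple filter. In the sketch below the p-values are invented and the significance level of 0.05 is an assumed, conventional choice; in practice the p-values would come from a statistical test run on the data.

```python
def select_by_p_value(p_values, alpha=0.05):
    """Keep attributes whose p-value does not exceed the significance level alpha."""
    kept = [name for name, p in p_values.items() if p <= alpha]
    discarded = [name for name, p in p_values.items() if p > alpha]
    return kept, discarded

# Hypothetical p-values, one per attribute.
p_values = {"age": 0.003, "income": 0.020, "zip_code": 0.410, "shoe_size": 0.730}

kept, discarded = select_by_p_value(p_values)
print(kept)       # ['age', 'income']
print(discarded)  # ['zip_code', 'shoe_size']
```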

Data Cube Aggregation:
An aggregation operation is applied to the data to construct the data cube.
This technique aggregates data into a simpler form. For example, imagine
that the information you gathered for your analysis covers the years 2012 to
2014 and includes your company's revenue for every three months. If you are
interested in the annual sales rather than the quarterly figures, the data
can be summarized so that the result holds the total sales per year instead
of per quarter.
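The quarterly-to-annual example can be sketched as a roll-up over the time dimension. The revenue figures below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical quarterly revenue for 2012 to 2014, keyed by (year, quarter).
quarterly_revenue = {
    (2012, "Q1"): 224, (2012, "Q2"): 408, (2012, "Q3"): 350, (2012, "Q4"): 586,
    (2013, "Q1"): 300, (2013, "Q2"): 420, (2013, "Q3"): 380, (2013, "Q4"): 610,
    (2014, "Q1"): 310, (2014, "Q2"): 450, (2014, "Q3"): 400, (2014, "Q4"): 640,
}

def roll_up_to_year(quarterly):
    """Aggregate quarterly figures into one total per year."""
    annual = defaultdict(int)
    for (year, _quarter), amount in quarterly.items():
        annual[year] += amount
    return dict(annual)

annual_revenue = roll_up_to_year(quarterly_revenue)
print(annual_revenue)  # {2012: 1568, 2013: 1710, 2014: 1800}
```

Twelve stored values are reduced to three, at the cost of losing the quarterly detail.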
Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical
model or a smaller representation of the data. With parametric methods, only
the model parameters need to be stored instead of the actual data.
Non-parametric methods include clustering, histograms and sampling.
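Of the non-parametric methods, sampling is the simplest to sketch. The example below draws a simple random sample without replacement; the record IDs, the sample size, and the fixed seed are assumptions chosen to keep the illustration reproducible.

```python
import random

def simple_random_sample(records, k, seed=0):
    """Draw k records without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, k)

records = list(range(1, 1001))  # hypothetical record IDs 1..1000
sample = simple_random_sample(records, 50)
print(len(sample), "of", len(records), "records kept")
```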

Dimensionality Reduction:
When some attributes are only weakly relevant, we keep just the attributes
required for our analysis. This reduces the data size because it eliminates
outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes. At each step, the best of
the remaining original attributes is added to the set, based on its relevance,
which can be measured with a statistic such as the p-value.

Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data
and, at each step, eliminates the worst attribute remaining in the set.
Combination of Forward and Backward Selection –
At each step, this method selects the best attribute and removes the worst
from among the remaining attributes, saving time and making the process faster.
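Step-wise forward selection can be sketched as a greedy loop around a scoring function. Everything specific here is an assumption: the attribute names, the toy `coverage_score` (which counts the distinct pieces of information a subset carries, so redundant attributes add nothing), and the stopping rule. A real analysis would score subsets with a statistical test or a model evaluation instead.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Start from an empty set and repeatedly add the attribute that improves the score most."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda c: c[0])
        if top_score - best <= min_gain:
            break  # no remaining attribute adds anything useful
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected

# Hypothetical attributes and the information each one carries.
# "age" and "birth_year" are redundant; "row_id" carries nothing relevant.
CARRIES = {
    "age": {"customer_age"},
    "birth_year": {"customer_age"},
    "income": {"earnings", "spending_power"},
    "row_id": set(),
}

def coverage_score(subset):
    """Toy relevance score: number of distinct pieces of information covered."""
    covered = set()
    for attribute in subset:
        covered |= CARRIES[attribute]
    return len(covered)

chosen = forward_selection(list(CARRIES), coverage_score)
print(chosen)  # ['income', 'age']
```

Backward selection would run the same loop in reverse, starting from all attributes and removing the one whose removal hurts the score least.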

Data Compression:
Data compression reduces the size of files using different encoding
mechanisms (such as Huffman encoding and run-length encoding). It can be divided
into two types based on the compression technique.
Lossless Compression –
Encoding techniques allow a simple and minimal reduction in data size. Lossless
data compression uses algorithms that restore the precise original data from the
compressed data.
Lossy Compression –
Methods such as the discrete wavelet transform and principal component
analysis (PCA) are examples of this kind of compression. For example, the JPEG
image format uses lossy compression, yet the decompressed image still conveys
the same meaning as the original. In lossy compression, the decompressed data
may differ from the original data but is still useful enough to retrieve
information from.
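Run-length encoding, one of the mechanisms named above, is a convenient way to see what lossless means in practice: decoding returns exactly the original input. The sketch below is a minimal version that works on strings; the function names are illustrative.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into a (character, run_length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (character, run_length) pairs."""
    return "".join(ch * count for ch, count in pairs)

original = "AAAABBBCCD"
encoded = rle_encode(original)
print(encoded)                          # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded) == original)  # True
```

The saving depends on the data: long runs compress well, while input with no repeated neighbors produces one pair per character and does not shrink at all.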

Discretization & Concept Hierarchy Operation:
Data discretization techniques divide attributes of a continuous nature into
intervals. Many continuous values of an attribute are replaced by labels of
small intervals, so that mining results are presented in a concise and easily
understandable way.
Top-down discretization –
If you first choose one or a few points (so-called breakpoints or split points)
to divide the whole range of the attribute, and then repeat this on the resulting
intervals, the process is known as top-down discretization, or splitting.
Bottom-up discretization –
If you first treat all the continuous values as potential split points and then
discard some of them by merging neighboring values into intervals, the process
is known as bottom-up discretization, or merging.

Concept Hierarchies:
A concept hierarchy reduces the data size by collecting low-level concepts (such
as 43 for age) and replacing them with high-level concepts (categorical values
such as middle age or senior).
For numeric data, the following techniques can be used:
Binning – the process of changing numerical variables into categorical
counterparts. The number of categories depends on the number of bins
specified by the user.

Histogram analysis – Like binning, a histogram partitions the values of an
attribute X into disjoint ranges called buckets. There are several
partitioning rules:
Equal Frequency partitioning: partitioning the values so that each bucket holds
roughly the same number of values from the data set.
Equal Width partitioning: partitioning the values into ranges of a fixed width
determined by the number of bins, e.g. each bucket spanning a range of 20, such
as 0-20.
Clustering: grouping similar data together.
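The difference between the two partitioning rules is easiest to see on the same data. The sketch below uses a made-up list of ages and three bins; the function names are illustrative.

```python
def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of identical width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value would index one past the end, so clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1) if width else 0
        bins[index].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Split the sorted values so that each bin holds (almost) the same number of values."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

ages = [5, 7, 12, 18, 21, 35, 36, 40, 62, 65]  # hypothetical data
print(equal_width_bins(ages, 3))      # [[5, 7, 12, 18, 21], [35, 36, 40], [62, 65]]
print(equal_frequency_bins(ages, 3))  # [[5, 7, 12, 18], [21, 35, 36], [40, 62, 65]]
```

Equal-width bins each span 20 years here but hold different numbers of values, whereas equal-frequency bins hold three or four values each but span ranges of different width.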
