
UNIT 2 Data Preprocessing in Data Mining

Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.

Methods for Data Preprocessing

1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.
2. Data Integration: This involves combining data from multiple sources to create a unified dataset. Data integration can be challenging as it requires handling data with different formats, structures, and semantics. Techniques such as record linkage and data fusion can be used for data integration.
3. Data Transformation: This involves converting the data into a suitable format for analysis. Common techniques used in data transformation include normalization, standardization, and discretization. Normalization is used to scale the data to a common range, while standardization is used to transform the data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete categories.
4. Data Reduction: This involves reducing the size of the dataset while preserving the important information. Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature selection involves selecting a subset of relevant features from the dataset, while feature extraction involves transforming the data into a lower-dimensional space while preserving the important information.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle these issues; it involves handling missing data, noisy data, etc.
(a) Missing Data:
This situation arises when some values are absent from the dataset. It can be handled in various ways, including:
Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
Fill the missing values:
There are various ways to do this. You can fill the missing values manually, with the attribute mean, or with the most probable value.
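
As a concrete illustration, here is a minimal sketch of these options using pandas; the column names and values are hypothetical.

```python
# Minimal sketch: handling missing values with pandas (hypothetical columns).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan],
    "income": [30000, 42000, np.nan, 58000, 61000],
})

# Option 1: ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# Option 2: fill missing values with the attribute mean.
filled_mean = df.fillna(df.mean(numeric_only=True))

# Option 3: fill with the most probable (most frequent) value.
filled_mode = df.fillna(df.mode().iloc[0])

print(filled_mean)
```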
(b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments (bins) of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment mean, or the segment's boundary values can be used.
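
A minimal sketch of smoothing by bin means and by bin boundaries, assuming the sorted data divides evenly into equal-size bins; the price values are made up for illustration.

```python
# Minimal sketch: smoothing sorted data with equal-size bins (illustrative values).
import numpy as np

prices = np.array([8, 16, 9, 15, 21, 21, 24, 26, 30])
sorted_prices = np.sort(prices)

bin_size = 3
bins = sorted_prices.reshape(-1, bin_size)   # assumes len(prices) is a multiple of bin_size

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed_by_mean = np.repeat(bins.mean(axis=1), bin_size)

# Smoothing by bin boundaries: each value is replaced by the closer boundary.
lo, hi = bins[:, [0]], bins[:, [-1]]
smoothed_by_boundary = np.where(bins - lo < hi - bins, lo, hi)

print(smoothed_by_mean)
print(smoothed_by_boundary)
```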
Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having several independent variables).
Clustering:
This approach groups similar data into clusters; values that fall outside the clusters can be treated as outliers (noise).
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following techniques:
Normalization:
This is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
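
For example, min-max normalization rescales each value with (v - min) / (max - min); here is a minimal sketch with illustrative values.

```python
# Minimal sketch: min-max normalization to the range [0.0, 1.0] (illustrative values).
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min)   # every value now lies in [0, 1]
print(normalized)   # [0.    0.125 0.25  0.5   1.   ]
```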
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
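
A minimal sketch of discretization using pandas pd.cut; the bin edges and labels are illustrative assumptions, not prescribed ones.

```python
# Minimal sketch: discretizing a numeric attribute into interval/conceptual levels.
import pandas as pd

ages = pd.Series([5, 17, 23, 34, 45, 62, 78])
# Replace raw ages by conceptual levels; bin edges and labels are illustrative.
age_levels = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                    labels=["child", "young adult", "middle-aged", "senior"])
print(age_levels)
```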
Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the
size of the dataset while preserving the important information. This is done to
improve the efficiency of data analysis and to avoid overfitting of the model. Some
common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often performed to remove irrelevant or redundant features from the dataset. It can be done using techniques such as correlation analysis and mutual information.
Feature Extraction: This involves transforming the data into a lower-dimensional
space while preserving the important information. Feature extraction is often used
when the original features are high-dimensional and complex. It can be done using
techniques such as PCA, linear discriminant analysis (LDA), and non-negative
matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling
is often used to reduce the size of the dataset while preserving the important
information. It can be done using techniques such as random sampling, stratified
sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the size of the dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage
and transmission purposes. It can be done using techniques such as wavelet
compression, JPEG compression, and gzip compression.
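
To make the first two of these techniques concrete, here is a minimal sketch of PCA-based feature extraction and simple random sampling using scikit-learn and pandas, run on synthetic data.

```python
# Minimal sketch: two data reduction techniques on synthetic data.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 10)),
                 columns=[f"f{i}" for i in range(10)])

# Feature extraction: project the 10 original features onto 3 principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                       # (1000, 3)

# Sampling: keep a 10% simple random sample of the rows.
sample = X.sample(frac=0.1, random_state=42)
print(sample.shape)                          # (100, 10)
```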

Data Summarization

The term Data Summarization can be defined as the presentation of a summary/report of generated data in a comprehensible and informative manner. The summary is obtained from the entire dataset in order to relay information about it; a carefully performed summary conveys trends and patterns from the dataset in a simplified manner.

There are two areas in which you can implement Data Summarization in Data Mining. These are as follows:

 Data Summarization in Data Mining: Centrality
 Data Summarization in Data Mining: Dispersion

1) Data Summarization in Data Mining: Centrality

The principle of centrality is used to describe the center or middle value of the data.

Several measures can be used to show centrality; the common ones are the average (also called the mean), the median, and the mode. The three of them summarize the distribution of the sample data.

 Mean: This is the numerical average of the set of values.
 Mode: This is the most frequently repeated value in the dataset.
 Median: This is the value in the middle of all the values in the dataset when the values are ranked in order.

The most appropriate measure to use will depend largely on the shape of
the dataset.
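
A minimal sketch computing the three measures with pandas, on a small illustrative sample:

```python
# Minimal sketch: the three centrality measures on a small illustrative sample.
import pandas as pd

scores = pd.Series([4, 7, 7, 8, 10, 12, 15])

print("mean:  ", scores.mean())      # 9.0
print("median:", scores.median())    # 8.0
print("mode:  ", scores.mode()[0])   # 7
```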

2) Data Summarization in Data Mining: Dispersion

The dispersion of a sample refers to how spread out the values are around
the average (center). Looking at the spread of the distribution of data
shows the amount of variation or diversity within the data. When the values
are close to the center, the sample has low dispersion while high dispersion
occurs when they are widely scattered about the center.

Different measures of dispersion can be used, based on which is more suitable for your dataset and what you want to focus on. The different measures of dispersion are as follows:

 Standard deviation: This provides a standard way of knowing what is normal, showing what is extra large or extra small, and helping you understand the spread of a variable around its mean. It shows how close all the values are to the mean.
 Variance: This is closely related to the standard deviation (it is its square) and likewise measures how tightly or loosely the values are spread around the average.
 Range: The range is the difference between the largest and the smallest values, thereby showing the distance between the extremes.
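
A minimal sketch computing these measures with pandas, on the same illustrative sample as above:

```python
# Minimal sketch: the three dispersion measures on the same illustrative sample.
import pandas as pd

scores = pd.Series([4, 7, 7, 8, 10, 12, 15])

print("standard deviation:", scores.std())                 # sample standard deviation
print("variance:          ", scores.var())                 # square of the standard deviation
print("range:             ", scores.max() - scores.min())  # 15 - 4 = 11
```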
Denormalization in Databases
When we normalize tables, we break them into multiple smaller tables, so retrieving data then requires join operations across those tables. Denormalization is the technique used to reduce this drawback of normalization by reintroducing controlled redundancy.

The following are the advantages of denormalization:

1. Enhance Query Performance

Fetching data in a normalized database generally requires joining a large number of tables, and the more joins, the slower the query. To overcome this, we can add redundancy to the database by copying values between parent and child tables, minimizing the number of joins needed for a query.

2. Make the database more convenient to manage

A normalized database does not store the calculated values that applications need, and calculating these values on the fly takes longer, slowing down query execution. With denormalization, such values can be stored, and fetch queries become simpler because we need to look at fewer tables.

3. Facilitate and accelerate reporting

Suppose you need certain statistics very frequently. It requires a long time to create
them from live data and slows down the entire system. Suppose you want to monitor
client revenues over a certain year for any or all clients. Generating such reports from
live data will require "searching" throughout the entire database, significantly slowing
it down.

Cons of Denormalization
The following are the disadvantages of denormalization:

o It takes more storage due to data redundancy.
o It makes updates and inserts of data in a table more expensive.
o It makes update and insert code harder to write.
o Since data can be modified in several places, it can become inconsistent; hence, every piece of duplicate data needs to be updated, including data that is used to compute measures and produce reports. We can do this by using triggers, transactions, and/or procedures for all operations that must be performed together.
How is denormalization different from normalization?
Denormalization differs from normalization in the following ways:

o Denormalization is a technique used to merge data from multiple tables into a single
table that can be queried quickly. Normalization, on the other hand, is used to delete
redundant data from a database and replace it with non-redundant and reliable data.
o Denormalization is used when joins are costly, and queries are run regularly on the
tables. Normalization, on the other hand, is typically used when a large number of
insert/update/delete operations are performed, and joins between those tables are
not expensive.
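
As a rough illustration of this trade-off, the sketch below uses pandas (in place of SQL) to show a normalized pair of tables and the denormalized, pre-joined table that avoids repeating the join; the table and column names are hypothetical.

```python
# Minimal sketch: denormalization as a pre-joined table (hypothetical names).
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2],
                          "customer_name": ["Asha", "Ravi"]})
orders = pd.DataFrame({"order_id": [101, 102, 103],
                       "customer_id": [1, 2, 1],
                       "amount": [250, 400, 150]})

# Normalized form: two tables, every report on customer names needs a join.
report = orders.merge(customers, on="customer_id")

# Denormalized form: store the joined (redundant) table once so that frequent
# reports can be served without repeating the join.
orders_denormalized = report
print(orders_denormalized)
```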

What is a Fact Table?

A fact table is the primary table in a dimensional model.

A fact table contains:

1. Measurements/facts
2. Foreign keys to dimension tables

What is a Dimension Table?

 A dimension table contains the dimensions of a fact.
 Dimension tables are joined to the fact table via a foreign key.
 Dimension tables are de-normalized tables.
 The dimension attributes are the various columns in a dimension table.
 Dimensions offer descriptive characteristics of the facts through their attributes.
 There is no set limit on the number of dimensions.
 A dimension can also contain one or more hierarchical relationships.

 Surrogate keys
 Surrogate keys join the dimension tables to the fact table and serve as an important means of identifying each instance or entity inside a dimension table.
What is Multi-Dimensional Data Model?
A multidimensional model views data in the form of a data-cube. A data cube
enables data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.

The multidimensional data model is best suited when the objective is to analyse data rather than to perform online transactions.

The multidimensional data model is based on three key concepts:

1. Modelling business rules

2. Cubes and measures

3. Dimensions

Data warehouses and OLAP are based on a multidimensional data model that views data in the form of a data cube. A data cube is defined by dimensions and facts.

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the types of item sold). The fact or measure displayed is rupees_sold (in thousands).

Now suppose we want to view the sales data with a third dimension: for example, the data according to time and item as well as location, for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table and are represented as a series of 2D tables. Conceptually, the same data may also be represented in the form of a 3D data cube.
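
A minimal sketch of this idea using a pandas pivot table, with made-up sales figures: each city gives a 2D slice (item × time) of the conceptual cube.

```python
# Minimal sketch: viewing sales along the time, item, and location dimensions
# with a pandas pivot table; the figures are made up for illustration.
import pandas as pd

sales = pd.DataFrame({
    "time":        ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":        ["phone", "laptop", "phone", "laptop", "phone", "laptop"],
    "location":    ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "rupees_sold": [605, 825, 680, 952, 530, 890],
})

# One 2D slice per city: rows = (location, item), columns = time, cells = rupees_sold.
cube = sales.pivot_table(values="rupees_sold",
                         index=["location", "item"],
                         columns="time",
                         aggfunc="sum")
print(cube)
```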

Schemas for Multidimensional Data

Schema Definition
A multidimensional schema is defined using the Data Mining Query Language (DMQL). Its two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. Below, we discuss the schemas used in a data warehouse.

Star Schema
 Each dimension in a star schema is represented with only one dimension table.
 This dimension table contains the set of attributes.
 The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.
 There is a fact table at the center. It contains the keys to each of the four dimensions.
 The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia, so the entries for such cities cause data redundancy along the attributes province_or_state and country.
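
A minimal sketch of a star schema laid out as pandas DataFrames (one fact table plus dimension tables) and a typical query that joins them; keys, attributes, and values are illustrative and follow the example above. Note how province_or_state and country repeat across rows, which is the redundancy mentioned in the note.

```python
# Minimal sketch: star schema as one fact table plus denormalized dimension tables.
import pandas as pd

location_dim = pd.DataFrame({
    "location_key": [1, 2],
    "city": ["Vancouver", "Victoria"],
    "province_or_state": ["British Columbia", "British Columbia"],
    "country": ["Canada", "Canada"],
})
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2],
    "location_key": [1, 2, 1],
    "dollars_sold": [1000, 1500, 1200],
    "units_sold": [10, 15, 12],
})

# A typical star-schema query: join the fact table to its dimension tables.
result = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(location_dim, on="location_key")
          .groupby(["quarter", "country"])["dollars_sold"].sum())
print(result)
```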

Snowflake Schema
 Some dimension tables in the snowflake schema are normalized.
 The normalization splits up the data into additional tables.
 Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
 Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
 The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, redundancy is reduced; it therefore becomes easier to maintain and saves storage space.

Fact Constellation Schema

 A fact constellation has multiple fact tables. It is also known as a galaxy schema.
 The following diagram shows two fact tables, namely sales and shipping.
 The sales fact table is the same as that in the star schema.
 The shipping fact table has five dimension keys, namely item_key, time_key, shipper_key, from_location, and to_location.
 The shipping fact table also contains two measures, namely dollars sold and units sold.
 It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Let's see the difference between the Star and Snowflake schemas:

1. The star schema contains fact tables and dimension tables; the snowflake schema contains fact tables, dimension tables, and sub-dimension tables.
2. The star schema is a top-down model, while the snowflake schema is a bottom-up model.
3. The star schema uses more space, while the snowflake schema uses less space.
4. The star schema takes less time for the execution of queries, while the snowflake schema takes more time.
5. In the star schema, normalization is not used; in the snowflake schema, both normalization and denormalization are used.
6. The star schema's design is very simple, while the snowflake schema's design is complex.
7. The query complexity of the star schema is low, while the query complexity of the snowflake schema is higher.
8. The star schema is very simple to understand, while the snowflake schema is difficult to understand.
9. The star schema has fewer foreign keys, while the snowflake schema has more foreign keys.
10. The star schema has high data redundancy, while the snowflake schema has low data redundancy.

OLAP Data Indexing: Bitmap Index and Join Index
OLAP (Online Analytical Processing) data indexing is a technique used to improve the performance of queries in OLAP systems. Two commonly used indexing methods in OLAP are the Bitmap Index and the Join Index. Let's understand each of them:

Bitmap Index
A Bitmap Index is a type of indexing technique that uses bitmaps to represent the presence or
absence of values in a column. It is particularly useful for low cardinality columns, where the
number of distinct values is relatively small.
Here's how Bitmap Index works:
1. For each distinct value in the column, a bitmap is created.
2. Each bit in the bitmap represents a row in the table.
3. If a bit is set to 1, it indicates that the corresponding row contains the value represented by the
bitmap.
4. If a bit is set to 0, it indicates that the corresponding row does not contain the value.
Advantages of Bitmap Indexing:
 Efficient for low cardinality columns.
 Fast query performance for operations like equality, range, and set membership.
Disadvantages of Bitmap Indexing:
 Inefficient for high cardinality columns.
 Requires additional storage space for bitmaps.
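
A minimal sketch of the idea using NumPy boolean arrays as bitmaps; the column and its values are illustrative.

```python
# Minimal sketch: bitmaps for a low-cardinality column as NumPy boolean arrays.
import numpy as np

region = np.array(["N", "S", "N", "E", "S", "N"])   # one entry per row

# One bitmap per distinct value; bit i is 1 when row i holds that value.
bitmaps = {value: (region == value) for value in np.unique(region)}
print(bitmaps["N"].astype(int))   # [1 0 1 0 0 1]

# Equality and set-membership queries become bitwise operations.
north_or_south = bitmaps["N"] | bitmaps["S"]
matching_rows = np.flatnonzero(north_or_south)
print(matching_rows)              # rows 0, 1, 2, 4, 5
```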
Join Index
A Join Index is a type of indexing technique used to optimize join operations between
multiple tables in OLAP systems. It precomputes and stores the results of join operations,
reducing the need for expensive join operations during query execution.
Here's how Join Index works:
1. It identifies frequently executed join operations and creates an index on the join columns.
2. The index stores the precomputed results of the join operation.
3. When a query involves a join operation, the Join Index is used to retrieve the precomputed
results instead of performing the join operation again.
Advantages of Join Indexing:
 Improved query performance for join operations.
 Reduces the need for expensive join operations during query execution.
Disadvantages of Join Indexing:
 Additional storage space required to store the precomputed results.
 Join Index maintenance overhead when the underlying tables are updated.
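
A minimal sketch of the idea using pandas, where the "index" is a precomputed mapping from a dimension key to the matching fact-table row positions; table and column names are hypothetical.

```python
# Minimal sketch: a join index kept as a precomputed key -> row-positions mapping.
import pandas as pd

sales_fact = pd.DataFrame({"time_key": [1, 1, 2, 2, 1],
                           "dollars_sold": [100, 200, 150, 300, 250]})

# Build the join index once: dimension key -> fact-table row positions.
join_index = sales_fact.groupby("time_key").indices   # e.g. {1: [0, 1, 4], 2: [2, 3]}

# At query time, reuse the precomputed row positions instead of re-joining.
rows_for_q1 = join_index[1]
print(sales_fact.iloc[rows_for_q1]["dollars_sold"].sum())   # 100 + 200 + 250 = 550
```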
In summary, Bitmap Indexing is suitable for low cardinality columns, while Join Indexing is
used to optimize join operations between multiple tables. Both indexing techniques can
significantly improve the performance of OLAP queries, but they have their own advantages
and disadvantages that need to be considered based on the specific requirements of the OLAP
system.

OLAP vs OLTP

1. A data warehouse (OLAP) involves historical processing of information, while an operational database (OLTP) involves day-to-day processing.
2. OLAP systems are used by knowledge workers such as executives, managers, and analysts; OLTP systems are used by clerks, DBAs, or database professionals.
3. OLAP is useful in analyzing the business; OLTP is useful in running the business.
4. OLAP focuses on information out; OLTP focuses on data in.
5. OLAP is based on the Star, Snowflake, and Fact Constellation schemas; OLTP is based on the Entity Relationship model.
6. OLAP contains historical data; OLTP contains current data.
7. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.
8. OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.
9. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.
10. The number of records accessed in OLAP is in the millions; in OLTP it is in the tens.
11. The OLAP database size ranges from 100 GB to 1 TB; the OLTP database size ranges from 100 MB to 1 GB.
12. OLAP is highly flexible; OLTP provides high performance.
