
UNIT 2 Data Preprocessing in Data Mining

Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.

Methods for Data Preprocessing

1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.
2. Data Integration: This involves combining data from multiple sources to create a unified dataset. Data integration can be challenging as it requires handling data with different formats, structures, and semantics. Techniques such as record linkage and data fusion can be used for data integration.
3. Data Transformation: This involves converting the data into a suitable format for analysis. Common techniques used in data transformation include normalization, standardization, and discretization. Normalization is used to scale the data to a common range, while standardization is used to transform the data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete categories.
4. Data Reduction: This involves reducing the size of the dataset while preserving the important information. Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature selection involves selecting a subset of relevant features from the dataset, while feature extraction involves transforming the data into a lower-dimensional space while preserving the important information.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle these issues; it involves handling missing data, noisy data, etc.
(a) Missing Data:
This situation arises when some values are absent from the dataset. It can be handled in various ways, including:
Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
Fill the missing values:
There are various ways to do this. You can fill the missing values manually, with the attribute mean, or with the most probable value.
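
As a concrete illustration, here is a minimal sketch of these options using pandas; the column names and values are hypothetical.

```python
# Minimal sketch: handling missing values with pandas (hypothetical columns).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan],
    "income": [30000, 42000, np.nan, 58000, 61000],
})

# Option 1: ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# Option 2: fill missing values with the attribute mean.
filled_mean = df.fillna(df.mean(numeric_only=True))

# Option 3: fill with the most probable (most frequent) value.
filled_mode = df.fillna(df.mode().iloc[0])

print(filled_mean)
```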
(b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments (bins) of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment mean, or the segment's boundary values can be used.
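
A minimal sketch of smoothing by bin means and by bin boundaries, assuming the sorted data divides evenly into equal-size bins; the price values are made up for illustration.

```python
# Minimal sketch: smoothing sorted data with equal-size bins (illustrative values).
import numpy as np

prices = np.array([8, 16, 9, 15, 21, 21, 24, 26, 30])
sorted_prices = np.sort(prices)

bin_size = 3
bins = sorted_prices.reshape(-1, bin_size)   # assumes len(prices) is a multiple of bin_size

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed_by_mean = np.repeat(bins.mean(axis=1), bin_size)

# Smoothing by bin boundaries: each value is replaced by the closer boundary.
lo, hi = bins[:, [0]], bins[:, [-1]]
smoothed_by_boundary = np.where(bins - lo < hi - bins, lo, hi)

print(smoothed_by_mean)
print(smoothed_by_boundary)
```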
Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having several independent variables).
Clustering:
This approach groups similar data into clusters; values that fall outside the clusters can be treated as outliers (noise).
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following techniques:
Normalization:
This is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
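
For example, min-max normalization rescales each value with (v - min) / (max - min); here is a minimal sketch with illustrative values.

```python
# Minimal sketch: min-max normalization to the range [0.0, 1.0] (illustrative values).
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min)   # every value now lies in [0, 1]
print(normalized)   # [0.    0.125 0.25  0.5   1.   ]
```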
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
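
A minimal sketch of discretization using pandas pd.cut; the bin edges and labels are illustrative assumptions, not prescribed ones.

```python
# Minimal sketch: discretizing a numeric attribute into interval/conceptual levels.
import pandas as pd

ages = pd.Series([5, 17, 23, 34, 45, 62, 78])
# Replace raw ages by conceptual levels; bin edges and labels are illustrative.
age_levels = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                    labels=["child", "young adult", "middle-aged", "senior"])
print(age_levels)
```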
Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the
size of the dataset while preserving the important information. This is done to
improve the efficiency of data analysis and to avoid overfitting of the model. Some
common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often performed to remove irrelevant or redundant features from the dataset. It can be done using techniques such as correlation analysis and mutual information.
Feature Extraction: This involves transforming the data into a lower-dimensional
space while preserving the important information. Feature extraction is often used
when the original features are high-dimensional and complex. It can be done using
techniques such as PCA, linear discriminant analysis (LDA), and non-negative
matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling
is often used to reduce the size of the dataset while preserving the important
information. It can be done using techniques such as random sampling, stratified
sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the size of the dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage
and transmission purposes. It can be done using techniques such as wavelet
compression, JPEG compression, and gzip compression.
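
To make the first two of these techniques concrete, here is a minimal sketch of PCA-based feature extraction and simple random sampling using scikit-learn and pandas, run on synthetic data.

```python
# Minimal sketch: two data reduction techniques on synthetic data.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 10)),
                 columns=[f"f{i}" for i in range(10)])

# Feature extraction: project the 10 original features onto 3 principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                       # (1000, 3)

# Sampling: keep a 10% simple random sample of the rows.
sample = X.sample(frac=0.1, random_state=42)
print(sample.shape)                          # (100, 10)
```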

Data Summarization

The term Data Summarization can be defined as the presentation of a summary/report of generated data in a comprehensible and informative manner. The summary is obtained from the entire dataset in order to relay information about it; a carefully performed summary conveys trends and patterns from the dataset in a simplified manner.

There are two areas in which you can implement Data Summarization in Data Mining. These are as follows:

 Data Summarization in Data Mining: Centrality
 Data Summarization in Data Mining: Dispersion

1) Data Summarization in Data Mining: Centrality

The principle of centrality is used to describe the center or middle value of the data.

Several measures can be used to show centrality; the common ones are the average (also called the mean), the median, and the mode. The three of them summarize the distribution of the sample data.

 Mean: This is the numerical average of the set of values.
 Mode: This is the most frequently repeated value in the dataset.
 Median: This is the value in the middle of all the values in the dataset when the values are ranked in order.

The most appropriate measure to use will depend largely on the shape of
the dataset.
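
A minimal sketch computing the three measures with pandas, on a small illustrative sample:

```python
# Minimal sketch: the three centrality measures on a small illustrative sample.
import pandas as pd

scores = pd.Series([4, 7, 7, 8, 10, 12, 15])

print("mean:  ", scores.mean())      # 9.0
print("median:", scores.median())    # 8.0
print("mode:  ", scores.mode()[0])   # 7
```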

2) Data Summarization in Data Mining: Dispersion

The dispersion of a sample refers to how spread out the values are around
the average (center). Looking at the spread of the distribution of data
shows the amount of variation or diversity within the data. When the values
are close to the center, the sample has low dispersion while high dispersion
occurs when they are widely scattered about the center.

Different measures of dispersion can be used, based on which is more suitable for your dataset and what you want to focus on. The different measures of dispersion are as follows:

 Standard deviation: This provides a standard way of knowing what is normal, showing what is extra large or extra small, and helping you understand the spread of a variable around its mean. It shows how close all the values are to the mean.
 Variance: This is closely related to the standard deviation (it is its square) and likewise measures how tightly or loosely the values are spread around the average.
 Range: The range is the difference between the largest and the smallest values, thereby showing the distance between the extremes.
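
A minimal sketch computing these measures with pandas, on the same illustrative sample as above:

```python
# Minimal sketch: the three dispersion measures on the same illustrative sample.
import pandas as pd

scores = pd.Series([4, 7, 7, 8, 10, 12, 15])

print("standard deviation:", scores.std())                 # sample standard deviation
print("variance:          ", scores.var())                 # square of the standard deviation
print("range:             ", scores.max() - scores.min())  # 15 - 4 = 11
```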
Denormalization in Databases
When we normalize tables, we break them into multiple smaller tables, so retrieving data then requires join operations across those tables. Denormalization is the technique used to reduce this drawback of normalization by reintroducing controlled redundancy.

The following are the advantages of denormalization:

1. Enhance Query Performance

Fetching data in a normalized database generally requires joining a large number of tables, and the more joins, the slower the query. To overcome this, we can add redundancy to the database by copying values between parent and child tables, minimizing the number of joins needed for a query.

2. Make the database more convenient to manage

A normalized database does not store the calculated values that applications need, and calculating these values on the fly takes longer, slowing down query execution. With denormalization, such values can be stored, and fetch queries become simpler because we need to look at fewer tables.

3. Facilitate and accelerate reporting

Suppose you need certain statistics very frequently. It requires a long time to create
them from live data and slows down the entire system. Suppose you want to monitor
client revenues over a certain year for any or all clients. Generating such reports from
live data will require "searching" throughout the entire database, significantly slowing
it down.

Cons of Denormalization
The following are the disadvantages of denormalization:

o It takes more storage due to data redundancy.
o It makes updates and inserts of data in a table more expensive.
o It makes update and insert code harder to write.
o Since data can be modified in several places, it can become inconsistent; hence, every piece of duplicate data needs to be updated, including data that is used to compute measures and produce reports. We can do this by using triggers, transactions, and/or procedures for all operations that must be performed together.
How is denormalization different from normalization?
Denormalization differs from normalization in the following ways:

o Denormalization is a technique used to merge data from multiple tables into a single
table that can be queried quickly. Normalization, on the other hand, is used to delete
redundant data from a database and replace it with non-redundant and reliable data.
o Denormalization is used when joins are costly, and queries are run regularly on the
tables. Normalization, on the other hand, is typically used when a large number of
insert/update/delete operations are performed, and joins between those tables are
not expensive.
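
As a rough illustration of this trade-off, the sketch below uses pandas (in place of SQL) to show a normalized pair of tables and the denormalized, pre-joined table that avoids repeating the join; the table and column names are hypothetical.

```python
# Minimal sketch: denormalization as a pre-joined table (hypothetical names).
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2],
                          "customer_name": ["Asha", "Ravi"]})
orders = pd.DataFrame({"order_id": [101, 102, 103],
                       "customer_id": [1, 2, 1],
                       "amount": [250, 400, 150]})

# Normalized form: two tables, every report on customer names needs a join.
report = orders.merge(customers, on="customer_id")

# Denormalized form: store the joined (redundant) table once so that frequent
# reports can be served without repeating the join.
orders_denormalized = report
print(orders_denormalized)
```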

What is a Fact Table?

A fact table is the primary table in a dimensional model.

A fact table contains:

1. Measurements/facts
2. Foreign keys to dimension tables

What is a Dimension Table?

 A dimension table contains the dimensions of a fact.
 Dimension tables are joined to the fact table via a foreign key.
 Dimension tables are de-normalized tables.
 The dimension attributes are the various columns in a dimension table.
 Dimensions offer descriptive characteristics of the facts through their attributes.
 There is no set limit on the number of dimensions.
 A dimension can also contain one or more hierarchical relationships.

 Surrogate keys
 Surrogate keys join the dimension tables to the fact table and serve as an important means of identifying each instance or entity inside a dimension table.
What is Multi-Dimensional Data Model?
A multidimensional model views data in the form of a data-cube. A data cube
enables data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.

The multidimensional data model is best suited when the objective is to analyse data rather than to perform online transactions.

The multidimensional data model is based on three key concepts:

1. Modelling business rules

2. Cubes and measures

3. Dimensions

Data warehouses and OLAP are based on a multidimensional data model that views data in the form of a data cube. A data cube is defined by dimensions and facts.

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the types of item sold). The fact or measure displayed is rupees_sold (in thousands).

Now suppose we want to view the sales data with a third dimension: for example, the data according to time and item as well as location, for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table and are represented as a series of 2D tables. Conceptually, the same data may also be represented in the form of a 3D data cube.
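
A minimal sketch of this idea using a pandas pivot table, with made-up sales figures: each city gives a 2D slice (item × time) of the conceptual cube.

```python
# Minimal sketch: viewing sales along the time, item, and location dimensions
# with a pandas pivot table; the figures are made up for illustration.
import pandas as pd

sales = pd.DataFrame({
    "time":        ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":        ["phone", "laptop", "phone", "laptop", "phone", "laptop"],
    "location":    ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "rupees_sold": [605, 825, 680, 952, 530, 890],
})

# One 2D slice per city: rows = (location, item), columns = time, cells = rupees_sold.
cube = sales.pivot_table(values="rupees_sold",
                         index=["location", "item"],
                         columns="time",
                         aggfunc="sum")
print(cube)
```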

Schemas for Multidimensional Data

Schema Definition
A multidimensional schema is defined using the Data Mining Query Language (DMQL). Its two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. Below, we discuss the schemas used in a data warehouse.

Star Schema
 Each dimension in a star schema is represented with only one dimension table.
 This dimension table contains the set of attributes.
 The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.
 There is a fact table at the center. It contains the keys to each of the four dimensions.
 The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia, so the entries for such cities cause data redundancy along the attributes province_or_state and country.
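
A minimal sketch of a star schema laid out as pandas DataFrames (one fact table plus dimension tables) and a typical query that joins them; keys, attributes, and values are illustrative and follow the example above. Note how province_or_state and country repeat across rows, which is the redundancy mentioned in the note.

```python
# Minimal sketch: star schema as one fact table plus denormalized dimension tables.
import pandas as pd

location_dim = pd.DataFrame({
    "location_key": [1, 2],
    "city": ["Vancouver", "Victoria"],
    "province_or_state": ["British Columbia", "British Columbia"],
    "country": ["Canada", "Canada"],
})
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2],
    "location_key": [1, 2, 1],
    "dollars_sold": [1000, 1500, 1200],
    "units_sold": [10, 15, 12],
})

# A typical star-schema query: join the fact table to its dimension tables.
result = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(location_dim, on="location_key")
          .groupby(["quarter", "country"])["dollars_sold"].sum())
print(result)
```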

Snowflake Schema
 Some dimension tables in the snowflake schema are normalized.
 The normalization splits up the data into additional tables.
 Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
 Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
 The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, redundancy is reduced; it therefore becomes easier to maintain and saves storage space.

Fact Constellation Schema

 A fact constellation has multiple fact tables. It is also known as a galaxy schema.
 The following diagram shows two fact tables, namely sales and shipping.
 The sales fact table is the same as that in the star schema.
 The shipping fact table has five dimension keys, namely item_key, time_key, shipper_key, from_location, and to_location.
 The shipping fact table also contains two measures, namely dollars sold and units sold.
 It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Let's see the difference between the Star and Snowflake schemas:

1. The star schema contains fact tables and dimension tables; the snowflake schema contains fact tables, dimension tables, and sub-dimension tables.
2. The star schema is a top-down model, while the snowflake schema is a bottom-up model.
3. The star schema uses more space, while the snowflake schema uses less space.
4. The star schema takes less time for the execution of queries, while the snowflake schema takes more time.
5. In the star schema, normalization is not used; in the snowflake schema, both normalization and denormalization are used.
6. The star schema's design is very simple, while the snowflake schema's design is complex.
7. The query complexity of the star schema is low, while the query complexity of the snowflake schema is higher.
8. The star schema is very simple to understand, while the snowflake schema is difficult to understand.
9. The star schema has fewer foreign keys, while the snowflake schema has more foreign keys.
10. The star schema has high data redundancy, while the snowflake schema has low data redundancy.

OLAP Data Indexing: Bitmap Index and Join Index
OLAP (Online Analytical Processing) data indexing is a technique used to improve the performance of queries in OLAP systems. Two commonly used indexing methods in OLAP are the Bitmap Index and the Join Index. Let's understand each of them:

Bitmap Index
A Bitmap Index is a type of indexing technique that uses bitmaps to represent the presence or
absence of values in a column. It is particularly useful for low cardinality columns, where the
number of distinct values is relatively small.
Here's how Bitmap Index works:
1. For each distinct value in the column, a bitmap is created.
2. Each bit in the bitmap represents a row in the table.
3. If a bit is set to 1, it indicates that the corresponding row contains the value represented by the
bitmap.
4. If a bit is set to 0, it indicates that the corresponding row does not contain the value.
Advantages of Bitmap Indexing:
 Efficient for low cardinality columns.
 Fast query performance for operations like equality, range, and set membership.
Disadvantages of Bitmap Indexing:
 Inefficient for high cardinality columns.
 Requires additional storage space for bitmaps.
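
A minimal sketch of the idea using NumPy boolean arrays as bitmaps; the column and its values are illustrative.

```python
# Minimal sketch: bitmaps for a low-cardinality column as NumPy boolean arrays.
import numpy as np

region = np.array(["N", "S", "N", "E", "S", "N"])   # one entry per row

# One bitmap per distinct value; bit i is 1 when row i holds that value.
bitmaps = {value: (region == value) for value in np.unique(region)}
print(bitmaps["N"].astype(int))   # [1 0 1 0 0 1]

# Equality and set-membership queries become bitwise operations.
north_or_south = bitmaps["N"] | bitmaps["S"]
matching_rows = np.flatnonzero(north_or_south)
print(matching_rows)              # rows 0, 1, 2, 4, 5
```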
Join Index
A Join Index is a type of indexing technique used to optimize join operations between
multiple tables in OLAP systems. It precomputes and stores the results of join operations,
reducing the need for expensive join operations during query execution.
Here's how Join Index works:
1. It identifies frequently executed join operations and creates an index on the join columns.
2. The index stores the precomputed results of the join operation.
3. When a query involves a join operation, the Join Index is used to retrieve the precomputed
results instead of performing the join operation again.
Advantages of Join Indexing:
 Improved query performance for join operations.
 Reduces the need for expensive join operations during query execution.
Disadvantages of Join Indexing:
 Additional storage space required to store the precomputed results.
 Join Index maintenance overhead when the underlying tables are updated.
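
A minimal sketch of the idea using pandas, where the "index" is a precomputed mapping from a dimension key to the matching fact-table row positions; table and column names are hypothetical.

```python
# Minimal sketch: a join index kept as a precomputed key -> row-positions mapping.
import pandas as pd

sales_fact = pd.DataFrame({"time_key": [1, 1, 2, 2, 1],
                           "dollars_sold": [100, 200, 150, 300, 250]})

# Build the join index once: dimension key -> fact-table row positions.
join_index = sales_fact.groupby("time_key").indices   # e.g. {1: [0, 1, 4], 2: [2, 3]}

# At query time, reuse the precomputed row positions instead of re-joining.
rows_for_q1 = join_index[1]
print(sales_fact.iloc[rows_for_q1]["dollars_sold"].sum())   # 100 + 200 + 250 = 550
```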
In summary, Bitmap Indexing is suitable for low cardinality columns, while Join Indexing is
used to optimize join operations between multiple tables. Both indexing techniques can
significantly improve the performance of OLAP queries, but they have their own advantages
and disadvantages that need to be considered based on the specific requirements of the OLAP
system.

OLAP vs OLTP

1. A data warehouse (OLAP) involves historical processing of information, while an operational database (OLTP) involves day-to-day processing.
2. OLAP systems are used by knowledge workers such as executives, managers, and analysts; OLTP systems are used by clerks, DBAs, or database professionals.
3. OLAP is useful in analyzing the business; OLTP is useful in running the business.
4. OLAP focuses on information out; OLTP focuses on data in.
5. OLAP is based on the Star, Snowflake, and Fact Constellation schemas; OLTP is based on the Entity Relationship model.
6. OLAP contains historical data; OLTP contains current data.
7. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.
8. OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.
9. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.
10. The number of records accessed in OLAP is in the millions; in OLTP it is in the tens.
11. The OLAP database size ranges from 100 GB to 1 TB; the OLTP database size ranges from 100 MB to 1 GB.
12. OLAP is highly flexible; OLTP provides high performance.
