Unit 2 DWDM
Data preprocessing is an important step in the data mining process. It refers to cleaning,
transforming, and integrating data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.
There are two areas in which you can implement data summarization in data mining:
measuring the central tendency of the data and measuring its dispersion.
Several measures can be used to show centrality; the common ones are the average (also
called the mean), the median, and the mode. All three summarize the distribution of the
sample data.
The most appropriate measure to use will depend largely on the shape of the dataset.
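As a quick illustration, here is a minimal sketch (using Python's standard statistics module on made-up sample values, not data from this unit) of the three measures; note how the mean reacts to an extreme value while the median does not, which is why the shape of the dataset matters.

```python
# A minimal sketch of the three common measures of centrality.
# The sample values are made up for illustration.
import statistics

sample = [12, 15, 15, 18, 20, 22, 95]   # hypothetical sample with one outlier

print("mean:  ", statistics.mean(sample))    # about 28.1, pulled up by the outlier
print("median:", statistics.median(sample))  # 18, robust to the outlier
print("mode:  ", statistics.mode(sample))    # 15, the most frequent value
```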
The dispersion of a sample refers to how spread out the values are around
the average (center). Looking at the spread of the distribution of data
shows the amount of variation or diversity within the data. When the values
are close to the center, the sample has low dispersion while high dispersion
occurs when they are widely scattered about the center.
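A minimal sketch of the same idea for dispersion, again on made-up values: two samples share the same center (mean 20) but differ in how far their values spread around it.

```python
# A minimal sketch comparing low and high dispersion around the same mean (20).
# The values are made up for illustration.
import statistics

low_spread  = [19, 20, 20, 21]   # values close to the center
high_spread = [5, 15, 25, 35]    # values widely scattered about the center

for name, data in [("low_spread", low_spread), ("high_spread", high_spread)]:
    print(name,
          "| range =", max(data) - min(data),
          "| variance =", statistics.pvariance(data),
          "| std dev =", round(statistics.pstdev(data), 2))
```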
Suppose you need certain statistics very frequently. Creating them from live data takes a long
time and slows down the entire system. For example, suppose you want to monitor client
revenues over a certain year for any or all clients. Generating such reports from live data
requires "searching" throughout the entire database, significantly slowing it down.
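A minimal sketch of how such a report can be precomputed, assuming a hypothetical transactions table held in pandas: the per-client yearly revenue is computed once (for example in a nightly batch) and later report requests read the small summary instead of scanning all live rows.

```python
# A minimal sketch, assuming a hypothetical "transactions" table in pandas.
import pandas as pd

transactions = pd.DataFrame({
    "client":  ["A", "A", "B", "B", "A"],
    "year":    [2022, 2023, 2022, 2023, 2023],
    "revenue": [100, 150, 80, 120, 50],
})

# Precompute the summary once (e.g. in a nightly batch job)...
revenue_summary = transactions.groupby(["client", "year"])["revenue"].sum()

# ...then answer later report requests from the small summary table
print(revenue_summary.loc[("A", 2023)])   # 200, without rescanning every row
```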
Denormalization vs. Normalization
The following points contrast denormalization with normalization:
o Denormalization is a technique used to merge data from multiple tables into a single
table that can be queried quickly (a small sketch of such a merge follows this list).
Normalization, on the other hand, is used to remove redundant data from a database
and replace it with non-redundant and reliable data.
o Denormalization is used when joins are costly and queries are run regularly on the
tables. Normalization, on the other hand, is typically used when a large number of
insert/update/delete operations are performed and joins between those tables are
not expensive.
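A minimal sketch of the merge described in the first point, using hypothetical orders and customers tables in pandas: the normalized tables are joined once into a single denormalized table that later queries can read without a join.

```python
# A minimal sketch of denormalizing two normalized tables (hypothetical data).
import pandas as pd

orders    = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11], "amount": [250, 90]})
customers = pd.DataFrame({"customer_id": [10, 11], "customer_name": ["Asha", "Ravi"]})

# Normalized form: customer_name lives only in `customers`, so every report
# on orders needs a join. Denormalized form: merge once and copy the name
# into the order rows.
orders_denorm = orders.merge(customers, on="customer_id")

print(orders_denorm)   # amount and customer_name now sit in a single table
```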
A fact table contains two types of columns:
1. Measurements/facts
2. Foreign keys to dimension tables
Surrogate keys
Surrogate keys join the dimension tables to the fact table. Surrogate keys serve as an
important means of identifying each instance or entity inside of a dimension table.
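A minimal sketch, with hypothetical tables, of how a surrogate key generated inside the warehouse identifies each dimension row and joins the dimension table to the fact table.

```python
# A minimal sketch of surrogate keys joining a dimension table to a fact table.
# Table contents are hypothetical.
import pandas as pd

# Dimension table: item_key is a surrogate key generated in the warehouse,
# independent of the source system's own item codes.
dim_item = pd.DataFrame({
    "item_key":  [1, 2],                    # surrogate keys
    "item_code": ["SKU-9001", "SKU-9002"],  # natural/business keys
    "item_name": ["Pen", "Notebook"],
})

# Fact table: measurements plus foreign keys, which are the surrogate keys.
fact_sales = pd.DataFrame({
    "item_key":   [1, 2, 1],   # foreign key to dim_item
    "units_sold": [10, 4, 7],
})

# The surrogate key is what the join runs on.
print(fact_sales.merge(dim_item, on="item_key"))
```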
What is Multi-Dimensional Data Model?
A multidimensional model views data in the form of a data-cube. A data cube
enables data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.
The multidimensional data model is best suited when the objective is to analyze data rather
than to perform online transactions.
3. Dimensions
Data Warehouses and OLAP are based on a multidimensional data model that views
data in the form of a data cube. A data cube is defined by dimensions and facts.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is
shown in the table. In this 2D representation, the sales for Delhi are shown for the
time dimension (organized in quarters) and the item dimension (classified according
to the types of items sold). The fact or measure displayed is rupee_sold (in
thousands).
Now suppose we want to view the sales data with a third dimension. For example, the
data can be organized according to time and item, as well as location, for the
cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table.
The 3D data of the table are represented as a series of 2D tables.
Conceptually, it may also be represented by the same data in the form of a 3D data
cube, as shown in fig:
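Since the figure itself is not reproduced here, the following minimal sketch (with made-up figures) builds the same kind of cube in pandas: one measure, rupee_sold, viewed along the time, item, and location dimensions, where each city's slice corresponds to one of the 2D tables described above.

```python
# A minimal sketch of a small data cube: one measure (rupee_sold, in thousands)
# viewed along the time, item, and location dimensions. Figures are made up.
import pandas as pd

sales = pd.DataFrame({
    "time":       ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":       ["phone", "laptop", "phone", "laptop", "phone", "phone"],
    "location":   ["Delhi", "Delhi", "Delhi", "Delhi", "Chennai", "Chennai"],
    "rupee_sold": [605, 825, 680, 952, 400, 430],
})

# 2D view: Delhi only, time x item (one slice of the cube)
delhi_slice = sales[sales.location == "Delhi"].pivot_table(
    index="time", columns="item", values="rupee_sold", aggfunc="sum")
print(delhi_slice)

# 3D view: location added as a third dimension
cube = sales.pivot_table(index="time", columns=["location", "item"],
                         values="rupee_sold", aggfunc="sum")
print(cube)
```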
Star Schema
Each dimension in a star schema is represented with only one-dimension table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the
four dimensions, namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four
dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key, street,
city, province_or_state, country}. This constraint may cause data redundancy. For example,
"Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia.
The entries for such cities may cause data redundancy along the attributes province_or_state
and country.
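A minimal sketch, with hypothetical tables and values, of a star-schema query in pandas: the central fact table is joined to two of its dimension tables through their keys and then aggregated. The repeated province_or_state and country values in the location dimension show the redundancy mentioned in the note above.

```python
# A minimal sketch of querying a star schema (hypothetical tables and values).
import pandas as pd

dim_time     = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
dim_location = pd.DataFrame({
    "location_key":      [1, 2],
    "city":              ["Vancouver", "Victoria"],
    "province_or_state": ["British Columbia", "British Columbia"],  # redundancy
    "country":           ["Canada", "Canada"],
})

# Fact table at the center: keys to the dimensions plus the measures.
fact_sales = pd.DataFrame({
    "time_key":     [1, 1, 2, 2],
    "location_key": [1, 2, 1, 2],
    "dollars_sold": [120, 90, 150, 110],
    "units_sold":   [12, 9, 14, 10],
})

result = (fact_sales
          .merge(dim_time, on="time_key")
          .merge(dim_location, on="location_key")
          .groupby(["quarter", "city"])["dollars_sold"].sum())
print(result)
```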
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike in the star schema, the dimension tables in a snowflake schema are
normalized. For example, the item dimension table of the star schema is
normalized and split into two dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name,
type, brand, and supplier_key.
The supplier_key is linked to the supplier dimension table. The supplier
dimension table contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, redundancy is reduced;
therefore, it becomes easier to maintain and saves storage space.
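A minimal sketch of the normalization step described above, with made-up contents and a simplified item table: the single item dimension table of the star schema is split into an item table and a supplier table linked by supplier_key.

```python
# A minimal sketch of snowflaking the item dimension (made-up, simplified data).
import pandas as pd

# Star schema: supplier details are repeated on every item row
dim_item_star = pd.DataFrame({
    "item_key":      [1, 2, 3],
    "item_name":     ["Pen", "Notebook", "Marker"],
    "supplier_name": ["Acme", "Acme", "Zenith"],
    "supplier_type": ["wholesale", "wholesale", "retail"],
})

# Snowflake schema: supplier attributes move to their own table...
dim_supplier = (dim_item_star[["supplier_name", "supplier_type"]]
                .drop_duplicates().reset_index(drop=True))
dim_supplier["supplier_key"] = dim_supplier.index + 1

# ...and the item table keeps only the supplier_key reference
dim_item = (dim_item_star
            .merge(dim_supplier, on=["supplier_name", "supplier_type"])
            [["item_key", "item_name", "supplier_key"]])

print(dim_item)       # redundancy removed from the item table
print(dim_supplier)   # supplier attributes stored only once
```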
S.No. | Star Schema | Snowflake Schema
1. | In the star schema, the fact table and the dimension tables are contained. | In the snowflake schema, the fact table, dimension tables, as well as sub-dimension tables are contained.
4. | It takes less time for the execution of queries. | It takes more time than the star schema for the execution of queries.
10. | It has high data redundancy. | It has low data redundancy.
Bitmap Index
A Bitmap Index is a type of indexing technique that uses bitmaps to represent the presence or
absence of values in a column. It is particularly useful for low cardinality columns, where the
number of distinct values is relatively small.
Here's how a Bitmap Index works (a small sketch follows this list):
1. For each distinct value in the column, a bitmap is created.
2. Each bit in the bitmap represents a row in the table.
3. If a bit is set to 1, it indicates that the corresponding row contains the value represented by the
bitmap.
4. If a bit is set to 0, it indicates that the corresponding row does not contain the value.
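A minimal sketch of the idea on a hypothetical low-cardinality gender column: one bitmap per distinct value, one bit per row, and an equality query answered by scanning a bitmap.

```python
# A minimal sketch of a bitmap index over a hypothetical low-cardinality column.
rows = ["male", "female", "female", "male", "female"]   # gender column, 5 rows

# One bitmap per distinct value; bit i is 1 iff row i holds that value.
bitmaps = {}
for value in set(rows):
    bitmaps[value] = [1 if v == value else 0 for v in rows]

print(bitmaps)   # e.g. {'male': [1, 0, 0, 1, 0], 'female': [0, 1, 1, 0, 1]}

# An equality query becomes a bitmap scan; combining predicates is just
# bitwise AND/OR over the bitmaps.
matching_rows = [i for i, bit in enumerate(bitmaps["female"]) if bit]
print(matching_rows)   # rows 1, 2 and 4 contain 'female'
```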
Advantages of Bitmap Indexing:
Efficient for low cardinality columns.
Fast query performance for operations like equality, range, and set membership.
Disadvantages of Bitmap Indexing:
Inefficient for high cardinality columns.
Requires additional storage space for bitmaps.
Join Index
A Join Index is a type of indexing technique used to optimize join operations between
multiple tables in OLAP systems. It precomputes and stores the results of join operations,
reducing the need for expensive join operations during query execution.
Here's how a Join Index works (a small sketch follows this list):
1. It identifies frequently executed join operations and creates an index on the join columns.
2. The index stores the precomputed results of the join operation.
3. When a query involves a join operation, the Join Index is used to retrieve the precomputed
results instead of performing the join operation again.
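A minimal sketch of the idea on hypothetical location and sales tables: the mapping from each dimension key to the fact rows that join with it is precomputed once, and a later query reads that mapping instead of re-running the join.

```python
# A minimal sketch of a join index (hypothetical tables and identifiers).
dim_location = {1: "Delhi", 2: "Chennai"}                      # location_key -> city
fact_sales   = {101: (1, 605), 102: (2, 400), 103: (1, 680)}   # row_id -> (location_key, rupee_sold)

# Precompute the join once: for each dimension key, the fact rows that join to it
join_index = {}
for row_id, (location_key, _) in fact_sales.items():
    join_index.setdefault(location_key, []).append(row_id)

print(join_index)   # {1: [101, 103], 2: [102]}

# At query time, "total sales for Delhi" reads the index instead of re-joining
delhi_key = next(k for k, city in dim_location.items() if city == "Delhi")
print(sum(fact_sales[r][1] for r in join_index[delhi_key]))   # 605 + 680 = 1285
```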
Advantages of Join Indexing:
Improved query performance for join operations.
Reduces the need for expensive join operations during query execution.
Disadvantages of Join Indexing:
Additional storage space required to store the precomputed results.
Join Index maintenance overhead when the underlying tables are updated.
In summary, Bitmap Indexing is suitable for low cardinality columns, while Join Indexing is
used to optimize join operations between multiple tables. Both indexing techniques can
significantly improve the performance of OLAP queries, but they have their own advantages
and disadvantages that need to be considered based on the specific requirements of the OLAP
system.
OLAP vs OLTP
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.