Unit 3
UNIT - III
META DATA, DATA MART
AND
PARTITION STRATEGY
UNIT-III 3. 1
Paavai Institutions Department of CSE
CONTENTS
3.1 META DATA
3.13 NORMALIZATION
QUESTION BANK
TECHNICAL TERMS
S.No: 1
Term: Data
Literal Meaning: Information or facts that are collected, stored, and analyzed for various purposes.
Technical Meaning: Raw facts, figures, or statistics that are processed or analyzed to gain insights.
Digester: http://www.yourdictionary.com/
Metadata is simply defined as data about data: data that is used to represent other data. For example,
the index of a book serves as metadata for the contents of the book. In other words, metadata is
summarized data that leads us to the detailed data. In terms of a data warehouse, we can define
metadata as follows.
Note − In a data warehouse, we create metadata for the data names and definitions of the given data
warehouse. Along with this metadata, additional metadata is created for time-stamping any
extracted data and for recording the source of the extracted data.
Metadata is data that provides information about other data. Here are a few examples of
metadata:
1. File metadata: This includes information about a file, such as its name, size, type, and
creation date.
2. Image metadata: This includes information about an image, such as its resolution, color
depth, and camera settings.
3. Music metadata: This includes information about a piece of music, such as its title, artist,
album, and genre.
4. Video metadata: This includes information about a video, such as its length, resolution,
and frame rate.
5. Document metadata: This includes information about a document, such as its author, title,
and creation date.
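The file metadata case can be sketched in a few lines of Python. This is only an illustration: the helper name file_metadata and the sample file are assumptions made for the example, and the modification time is used in place of the creation date because creation time is not available in the same way on every operating system.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_metadata(path):
    """Return a small dictionary describing the file, not its contents."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "type": os.path.splitext(path)[1] or "(none)",
        # Modification time stands in for the creation date (assumption).
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello warehouse")
    sample_path = handle.name

sample_meta = file_metadata(sample_path)
os.remove(sample_path)
print(sample_meta)
```

The dictionary describes the file (name, size, type, date) without containing any of the text written into it, which is the distinction between metadata and data.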
Business Metadata − It has the data ownership information, business definition, and changing
policies.
Technical Metadata − It includes database system names, table and column names and sizes,
data types and allowed values.
Technical metadata also includes structural information such as primary and foreign key
attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged.
Lineage of data means the history of data migrated and transformation applied on it.
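The three categories above can be pictured as three small records attached to one warehouse table. The sketch below is a hypothetical illustration: the class names, field names, and the sales_fact table are all assumptions chosen for the example, not part of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessMetadata:
    owner: str
    business_definition: str
    change_policy: str


@dataclass
class TechnicalMetadata:
    database: str
    table: str
    columns: dict        # column name -> data type
    primary_key: list
    indices: list = field(default_factory=list)


@dataclass
class OperationalMetadata:
    currency: str        # "active", "archived", or "purged"
    lineage: list = field(default_factory=list)  # ordered history of steps

    def record_step(self, step):
        """Append one migration or transformation step to the lineage."""
        self.lineage.append(step)


business = BusinessMetadata(
    owner="Sales department",
    business_definition="One row per completed sale",
    change_policy="Changes approved by the data steward",
)
technical = TechnicalMetadata(
    database="dw",
    table="sales_fact",
    columns={"sale_id": "INTEGER", "amount": "DECIMAL(10,2)"},
    primary_key=["sale_id"],
    indices=["idx_sales_date"],
)
operational = OperationalMetadata(currency="active")
operational.record_step("extracted from the orders system")
operational.record_step("converted amounts to a single currency")
print(operational.lineage)
```

Keeping the categories separate mirrors how they are used: business users read the first record, database administrators the second, and the load process maintains the third.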
Metadata has a very important role in a data warehouse. Its role is different from that of the
warehouse data itself, yet it is just as essential. The various roles of metadata are
explained below.
The metadata in a metadata repository may include information about the content, format,
structure, and other characteristics of data, and may be organized using metadata standards and
schemas. It contains the following metadata:
Definition of data warehouse − It includes the description of structure of data warehouse. The
description is defined by schema, view, hierarchies, derived data definitions, and data mart
locations and contents.
Business metadata − It contains the data ownership information, business definition, and
changing policies.
Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of data
migrated and transformation applied on it.
Data for mapping from operational environment to data warehouse − It includes the source
databases and their contents, data extraction, data partitioning, cleaning and transformation
rules, and data refresh and purging rules.
Algorithms for summarization − It includes dimension algorithms, data on granularity,
aggregation, summarizing, etc.
A metadata repository is a centralized database or system that is used to store and manage
metadata. Some of the benefits of using a metadata repository include:
1. Improved data quality: A metadata repository can help ensure that metadata is
consistently structured and accurate, which can improve the overall quality of the data.
2. Increased data accessibility: A metadata repository can make it easier for users to access
and understand the data, by providing context and information about the data.
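A very small in-memory sketch can make the two benefits concrete. The class MetadataRepository, its required fields, and the sample entry below are assumptions made for illustration only; a real repository would normally be a database or a dedicated tool.

```python
class MetadataRepository:
    """A toy, in-memory metadata repository (illustrative only)."""

    # Every entry must carry these fields, which keeps metadata consistent.
    REQUIRED_FIELDS = ("owner", "definition", "source", "data_type")

    def __init__(self):
        self._entries = {}

    def register(self, name, **attributes):
        """Store metadata for one data element, rejecting incomplete entries."""
        missing = [f for f in self.REQUIRED_FIELDS if f not in attributes]
        if missing:
            raise ValueError(f"{name}: missing metadata fields {missing}")
        self._entries[name] = dict(attributes)

    def describe(self, name):
        """Return the stored context so users can understand the data."""
        return self._entries[name]


repository = MetadataRepository()
repository.register(
    "sales_fact.amount",
    owner="Sales department",
    definition="Sale value including tax",
    source="orders system",
    data_type="DECIMAL(10,2)",
)
print(repository.describe("sales_fact.amount")["definition"])
```

The check in register corresponds to the data quality benefit (entries are consistently structured), and describe corresponds to the accessibility benefit (users can look up context for any data element).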
The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of
reports, validates data transformation, and ensures the accuracy of calculations. Metadata also enforces
the definition of business terms for business end-users. Alongside all these uses, metadata also has its
challenges. Some of the challenges are discussed below.
There are mainly two approaches to designing data marts: dependent data marts and independent data marts.
In this technique, a data warehouse is created first, and the various data marts are then
created from it.
These data marts are dependent on the data warehouse and extract the essential records
from it.
Because the data warehouse creates the data marts, there is no need for data mart
integration. This technique is also known as the top-down approach.
The second approach is independent data marts (IDM). Here, independent data marts are created
first, and then a data warehouse is designed using these multiple independent data marts.
Because all the data marts are designed independently, the integration of data marts is
required.
It is also termed the bottom-up approach, as the data marts are integrated to develop a
data warehouse.
An independent data mart is not created from the central data warehouse, and its sources can be
different. Since the data comes from sources other than the central DW, the ETT process is a bit different.
Most independent data marts are used by a smaller group of organizations, and their
sources are also limited. The independent data mart is generally created when we
Hybrid data marts combine the data taken from a data warehouse and "other" data sources. This
can be useful in a variety of situations, including providing ad hoc integration with a new group or
product that has been added to an organization.
Note − Do not create a data mart for any other reason, since the operational cost of data marting could
be very high. Before data marting, make sure that the data marting strategy is appropriate for your
particular solution.
Subset of Data: Data marts are designed to store a subset of data from a larger data warehouse
or data lake. This allows for faster query performance since the data in the data mart is focused
on a specific business unit or department.
Optimized for Query Performance: Data marts are optimized for query performance, which
means that they are designed to support fast queries and analysis of the data stored in the data
mart.
Customizable: Data marts are customizable, which means that they can be designed to meet
the specific needs of a business unit or department.
Scalability: Data marts can be scaled horizontally or vertically to accommodate larger
volumes of data or to support more users.
Integration with Business Intelligence Tools: Data marts can be integrated with business
intelligence tools, such as Tableau, Power BI, or QlikView, which allows users to analyze and
visualize the data stored in the data mart.
ETL Process: Data marts are typically populated using an Extract, Transform, Load (ETL)
process, which means that data is extracted from the larger data warehouse or data lake,
transformed to meet the requirements of the data mart, and loaded into the data mart.
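The ETL step that populates a data mart can be sketched as three small functions. The sample rows, the department names, and the function names extract, transform, and load are hypothetical and chosen only to illustrate the flow; a real load would read from and write to database tables.

```python
from collections import defaultdict

# Hypothetical rows already held in the data warehouse.
warehouse_sales = [
    {"department": "electronics", "product": "phone", "amount": 300.0},
    {"department": "electronics", "product": "phone", "amount": 200.0},
    {"department": "electronics", "product": "laptop", "amount": 900.0},
    {"department": "grocery", "product": "rice", "amount": 40.0},
]


def extract(rows, department):
    """Extract: keep only the subset that belongs to one department."""
    return [row for row in rows if row["department"] == department]


def transform(rows):
    """Transform: summarize sales per product, the shape the mart needs."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["product"]] += row["amount"]
    return dict(totals)


def load(mart, totals):
    """Load: write the summarized rows into the data mart."""
    mart.update(totals)
    return mart


electronics_mart = load({}, transform(extract(warehouse_sales, "electronics")))
print(electronics_mart)
```

The resulting mart holds only the electronics subset, already summarized per product, which is why queries against it touch far less data than queries against the full warehouse.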
Given below are the issues to be taken into account while determining the functional split −
The structure of the department may change.
The products might switch from one department to another.
The merchant could query the sales trend of other products to analyze what is
happening to the sales.
Note − We need to determine the business benefits and technical feasibility of using a data mart.
We need data marts to support user access tools that require internal data structures.
The data in such structures is outside the control of the data warehouse but needs to be
populated and updated on a regular basis.
Some tools can populate directly from the source system, but some cannot.
Therefore, additional requirements outside the scope of the tool need to be
identified for the future.
Note − To ensure consistency of data across all access tools, the data should not be populated
directly from the data warehouse; rather, each tool must have its own data mart.
o Data marts allow us to build a complete wall by physically separating data segments
within the data warehouse.
o To avoid possible privacy problems, the detailed data can be removed from the data
warehouse. We can create a data mart for each legal entity and load it via the data warehouse,
with detailed account data.
Data marts should be designed as a smaller version of the starflake schema within the data
warehouse and should match the database design of the data warehouse. This helps in maintaining
control over database instances.
The summaries are data marted in the same way as they would have been designed within the
data warehouse. Summary tables help to utilize all dimension data in the starflake schema.
Partitioning is done to enhance performance and facilitate easy management of data. Partitioning
also helps in balancing the various requirements of the system.
It optimizes the hardware performance and simplifies the management of data warehouse by
partitioning each fact table into multiple separate partitions. In this chapter, we will discuss different
partitioning strategies.
The fact table in a data warehouse can grow to hundreds of gigabytes in size. A fact table of this
size is very hard to manage as a single entity. Therefore, it needs partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the data.
Partitioning allows us to load only as much data as is required on a regular basis.
Note − To cut down on the backup size, all partitions other than the current partition can be marked as
read-only.
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query
performance is enhanced because now the query scans only those partitions that are relevant.
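The performance benefit can be illustrated with a small sketch in which a fact table is split into monthly partitions and a query reads only the partition it needs. The sample rows and the function names partition_by_month and monthly_total are assumptions made for this example; real databases perform this partition pruning internally.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows.
fact_rows = [
    {"transaction_date": date(2024, 1, 5), "value": 100},
    {"transaction_date": date(2024, 1, 20), "value": 50},
    {"transaction_date": date(2024, 2, 3), "value": 70},
    {"transaction_date": date(2024, 3, 9), "value": 30},
]


def partition_by_month(rows):
    """Group fact rows into one partition per (year, month)."""
    partitions = defaultdict(list)
    for row in rows:
        day = row["transaction_date"]
        partitions[(day.year, day.month)].append(row)
    return dict(partitions)


def monthly_total(partitions, year, month):
    """Scan only the relevant partition; return (total, rows scanned)."""
    relevant = partitions.get((year, month), [])
    return sum(row["value"] for row in relevant), len(relevant)


monthly_partitions = partition_by_month(fact_rows)
january_total, rows_scanned = monthly_total(monthly_partitions, 2024, 1)
print(january_total, rows_scanned)
```

The January query scans two rows instead of all four. On a fact table of hundreds of gigabytes, the same idea means reading one partition instead of the whole table.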
Vertical partitioning involves creating relatively smaller tables with fewer elements while using
additional tables for the remaining data storage.
The most typical application of vertical partitioning is to lower the I/O and performance costs of
retrieving frequently requested items. Vertical partitioning is shown in the figure. In this example,
different attributes of an object are stored in different partitions. One partition stores data that is
accessed often, such as the product name, description, and price. The stock count and last-ordered date
are stored in another partition.
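The product example can be sketched as follows. The column groupings follow the paragraph above, while the sample products, the product ids, and the function name split_vertically are hypothetical.

```python
# Columns read on almost every request versus columns read rarely.
HOT_COLUMNS = ("name", "description", "price")
COLD_COLUMNS = ("stock_count", "last_ordered")

# Hypothetical product records keyed by product id.
products = {
    1: {"name": "Kettle", "description": "1.5 L electric kettle",
        "price": 25.0, "stock_count": 12, "last_ordered": "2024-02-01"},
    2: {"name": "Toaster", "description": "Two-slice toaster",
        "price": 30.0, "stock_count": 4, "last_ordered": "2024-01-15"},
}


def split_vertically(records):
    """Split each record into a hot and a cold partition sharing one key."""
    hot, cold = {}, {}
    for product_id, record in records.items():
        hot[product_id] = {column: record[column] for column in HOT_COLUMNS}
        cold[product_id] = {column: record[column] for column in COLD_COLUMNS}
    return hot, cold


hot_partition, cold_partition = split_vertically(products)
print(hot_partition[1])
```

A catalogue page needs only hot_partition, so it never reads the stock columns. The product id is kept as the key in both partitions so that the full record can still be reassembled when required.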
Vertical partitioning can be performed in two ways:
Normalization
Row Splitting
3.13 NORMALIZATION
Normalization is the standard relational method of database organization. In this method,
duplicate rows are collapsed into a single row, which reduces the space required. Take a look at the
following tables that show how normalization is performed.
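As a rough illustration of the idea, the sketch below takes rows in which store details are repeated on every sale and collapses the repeated details into a single row per store. The sample values, column names, and the function name normalize are assumptions made for this example.

```python
# Hypothetical denormalized rows: store details repeat on every sale.
denormalized_sales = [
    {"product_id": 30, "quantity": 5, "store_id": 16,
     "store_name": "sunny", "location": "Bangalore"},
    {"product_id": 35, "quantity": 4, "store_id": 16,
     "store_name": "sunny", "location": "Bangalore"},
    {"product_id": 40, "quantity": 5, "store_id": 64,
     "store_name": "san", "location": "Mumbai"},
]


def normalize(rows):
    """Move repeated store details into their own table, one row per store."""
    stores = {}
    sales = []
    for row in rows:
        stores[row["store_id"]] = {
            "store_name": row["store_name"],
            "location": row["location"],
        }
        sales.append({
            "product_id": row["product_id"],
            "quantity": row["quantity"],
            "store_id": row["store_id"],  # reference back to the store table
        })
    return stores, sales


store_table, sales_table = normalize(denormalized_sales)
print(len(store_table), len(sales_table))
```

Three sales rows remain, but the store name and location are now stored once per store rather than once per sale, which is where the space saving comes from.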
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is
to speed up access to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a major join
operation between two partitions.
Identify Key to Partition
It is crucial to choose the right partition key. Choosing the wrong partition key will lead to
reorganizing the fact table. Let's take an example. Suppose we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be
region
transaction_date
Suppose the business is organized in 30 geographical regions and each region has a different number
of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough
because our requirements capture has shown that a vast majority of queries are restricted to the user's
own business region.
If we partition by transaction_date instead of region, then the latest transactions from every region
will be in one partition. Now the user who wants to look at data within his own region has to query
across multiple partitions.
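The effect of the key choice can be checked with a small experiment: count how many partitions a "my own region" query has to touch under each key. The sample transactions, the region names, and the helper names below are hypothetical; only the column names region, transaction_date, and value come from Account_Txn_Table.

```python
from datetime import date

# Hypothetical transactions spread over two regions and three months.
transactions = [
    {"region": "north", "transaction_date": date(2024, 1, 10), "value": 100},
    {"region": "south", "transaction_date": date(2024, 1, 12), "value": 80},
    {"region": "north", "transaction_date": date(2024, 2, 3), "value": 60},
    {"region": "south", "transaction_date": date(2024, 2, 7), "value": 20},
    {"region": "north", "transaction_date": date(2024, 3, 15), "value": 40},
]


def by_region(row):
    """Partition key: one partition per region."""
    return row["region"]


def by_month(row):
    """Partition key: one partition per (year, month) of transaction_date."""
    day = row["transaction_date"]
    return (day.year, day.month)


def partitions_touched(rows, partition_key, predicate):
    """Return the set of partitions holding at least one matching row."""
    return {partition_key(row) for row in rows if predicate(row)}


def is_north(row):
    return row["region"] == "north"


touched_by_region = partitions_touched(transactions, by_region, is_north)
touched_by_month = partitions_touched(transactions, by_month, is_north)
print(len(touched_by_region), len(touched_by_month))
```

With region as the key, the regional query stays inside one partition. With transaction_date as the key, the same query has to visit every monthly partition that holds a row for that region.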
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we
have to keep in mind the requirements for manageability of the data warehouse.
In the time-based partitioning strategy, the fact table is partitioned on the basis of time period. Here
each time period represents a significant retention period within the business.
Partition by Time into Different-sized Segments
This kind of partitioning is used where aged data is accessed infrequently. It is implemented as a set of
small partitions for relatively current data and larger partitions for inactive data.
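One possible way to assign rows to different-sized segments is sketched below. The rule used here (monthly partitions for the current year, one yearly partition for each older year) and the function name partition_label are assumptions chosen for illustration; the actual segment sizes depend on how the business accesses its data.

```python
from datetime import date


def partition_label(transaction_date, current_year):
    """Return the partition a row belongs to.

    Current-year rows go to small monthly partitions; older, rarely
    accessed rows go to one larger partition per year (assumed rule).
    """
    if transaction_date.year == current_year:
        return f"{transaction_date.year}-{transaction_date.month:02d}"
    return str(transaction_date.year)


recent_label = partition_label(date(2024, 3, 5), current_year=2024)
aged_label = partition_label(date(2021, 7, 1), current_year=2024)
print(recent_label, aged_label)
```

Queries on current data scan a single month, while aged data is kept in fewer, larger partitions that are cheaper to manage.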
If a dimension contains a large number of entries, then the dimension itself needs to be partitioned.
Here we have to check the size of the dimension. Consider a large design that changes over time. If we
need to store all the variations in order to apply comparisons, that dimension may be very large. This
would definitely affect the response time.
In the round robin technique, when a new partition is needed, the old one is archived. It uses
metadata to allow user access tools to refer to the correct table partition. This technique makes it easy to
automate table management facilities within the data warehouse.
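A minimal sketch of the round robin idea is given below, assuming a fixed number of physical partition tables that are reused in turn. The class name RoundRobinPartitions, the table naming scheme fact_part_N, and the monthly periods are hypothetical.

```python
from collections import OrderedDict


class RoundRobinPartitions:
    """Keep a fixed number of active partitions, archiving the oldest."""

    def __init__(self, slots):
        self.slots = slots
        # Metadata: period -> physical table name, oldest period first.
        self.active = OrderedDict()
        self.archived = []

    def add_period(self, period):
        """Assign a table to a new period, reusing the oldest table if full."""
        if len(self.active) == self.slots:
            old_period, table = self.active.popitem(last=False)
            self.archived.append(old_period)
        else:
            table = f"fact_part_{len(self.active)}"
        self.active[period] = table
        return table

    def table_for(self, period):
        """Metadata lookup used by access tools to find the right partition."""
        return self.active[period]


partitions = RoundRobinPartitions(slots=3)
for month in ("2024-01", "2024-02", "2024-03", "2024-04"):
    partitions.add_period(month)
print(partitions.table_for("2024-04"), partitions.archived)
```

When the fourth month arrives, the oldest month is archived and its table is reused. Access tools never hard-code a table name; they ask the metadata which table currently holds a given period.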
QUESTION BANK
PART – A
PART – B
1. Describe in detail the categories and role of metadata, with examples.
2. Suppose that a data warehouse consists of four dimensions (customer, product,
salesperson and sales time) and three measures: Sales Amount (in rupees), VAT (in
rupees) and Payment Type (in rupees). Draw the different classes of schemas that are
popularly used for modeling data warehouses and explain them.
3. How would you explain metadata implementation with examples?
4. Design a star schema, snowflake schema and fact constellation schema for a data
warehouse that consists of the following four dimensions: Time, Item, Branch and
Location. Include the appropriate measures required for the schema.
5. Explain mapping the data warehouse to a multiprocessor architecture with the concepts of
parallelism and data partitioning.
6. What is metadata? Illustrate the various classifications of metadata with examples and
explain them.