Unit 3

data warehousing unit 3 2021R
Paavai Institutions Department of CSE

UNIT - III
META DATA, DATA MART
AND
PARTITION STRATEGY

CONTENTS
3.1 META DATA

3.2 CATEGORIES OF METADATA

3.3 ROLE OF METADATA

3.4 META DATA REPOSITORY

3.5 CHALLENGES FOR METADATA MANAGEMENT

3.6 DATA MART

3.7 NEED OF DATA MART

3.8 COST EFFECTIVE DATA MART

3.9 DESIGNING DATA MARTS

3.10 COST OF DATA MARTS

3.11 PARTITIONING STRATEGY

3.12 VERTICAL PARTITION

3.13 NORMALIZATION

3.14 ROW SPLITTING

3.15 HORIZONTAL PARTITION

QUESTION BANK

TECHNICAL TERMS

S.No | Term | Literal Meaning | Technical Meaning | Digester (source)
-----|------|-----------------|-------------------|------------------
1 | Data | Information or facts that are collected, stored, and analyzed for various purposes. | Raw facts, figures, or statistics that are processed or analyzed to gain insights. | http://www.yourdictionary.com/
2 | Metadata | Various aspects of data, such as its format, structure, location, creator, and other attributes. | Structured information that provides details about the characteristics of data. | http://www.yourdictionary.com/
3 | Repository | Central location where something is stored or kept. | Centralized storage location used in software development for managing and storing version-controlled files and resources. | http://www.yourdictionary.com/
4 | Abstraction | The process of removing unnecessary details or complexities to focus on essential features or concepts. | Concept of hiding implementation details while exposing essential functionalities. | http://www.yourdictionary.com/
5 | ETL | Combines data from multiple data sources into a single, consistent data store. | Extracting, Transforming, and Loading data, for example from a retail database to a data warehouse. | http://www.yourdictionary.com/
6 | Network Access | Access to an organizational information system by a user. | Ensures authorized access to network resources and secures them against unauthorized user access. | http://www.yourdictionary.com/
7 | Normalization | The process of organizing data in a database. | Eliminates data redundancy and enhances data integrity in the table. | http://www.yourdictionary.com/

3.1 META DATA

Metadata is simply defined as data about data: it is data that is used to describe other data. For
example, the index of a book serves as metadata for the contents of the book. In other words, metadata
is summarized data that leads us to the detailed data. In terms of a data warehouse, we can define
metadata as follows.

• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.

Note − In a data warehouse, we create metadata for the data names and definitions of the given data
warehouse. Additional metadata is also created for time-stamping any extracted data and for recording
the source of the extracted data.

Several Examples of Metadata:

Metadata is data that provides information about other data. Here are a few examples of
metadata:

1. File metadata: This includes information about a file, such as its name, size, type, and
creation date.
2. Image metadata: This includes information about an image, such as its resolution, color
depth, and camera settings.
3. Music metadata: This includes information about a piece of music, such as its title, artist,
album, and genre.
4. Video metadata: This includes information about a video, such as its length, resolution,
and frame rate.
5. Document metadata: This includes information about a document, such as its author, title,
and creation date.
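
As a small illustration of file metadata, the sketch below reads a few attributes of a file with Python's standard `os.stat` call. The function name `file_metadata`, the file name `notes.txt`, and its contents are made up for the example; the last-modified time is reported instead of the creation date because creation time is not available in the same way on every operating system.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_metadata(path):
    """Return a small dictionary of metadata describing the file at `path`."""
    info = os.stat(path)  # the operating system's record about the file
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "type": os.path.splitext(path)[1] or "(none)",
        # st_mtime is the last-modified time, not the creation time
        "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }


# Demonstration on a throwaway file (name and contents are invented).
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "notes.txt")
    with open(sample, "w") as handle:
        handle.write("data about data")
    meta = file_metadata(sample)
    print(meta["name"], meta["size_bytes"], meta["type"])
```

Note that none of the returned values is the file's content; each one only describes the file, which is exactly what makes it metadata.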

3.2 CATEGORIES OF METADATA

Metadata can be broadly categorized into three categories −

• Business Metadata − It has the data ownership information, business definitions, and changing
policies.
• Technical Metadata − It includes database system names, table and column names and sizes,
data types, and allowed values. Technical metadata also includes structural information such as
primary and foreign key attributes and indices.
• Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of the data as
it was migrated and the transformations applied to it.
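
The three categories can be pictured as three groups of attributes attached to the same warehouse table. The following is only a sketch: the class name `TableMetadata`, the table `sales_fact`, and every value are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Metadata for one warehouse table, grouped by the three categories."""
    business: dict = field(default_factory=dict)
    technical: dict = field(default_factory=dict)
    operational: dict = field(default_factory=dict)


# Every value below is invented for illustration.
sales_meta = TableMetadata(
    business={"owner": "Sales department", "definition": "One row per sale"},
    technical={
        "table": "sales_fact",
        "columns": {"sale_id": "INTEGER", "amount": "DECIMAL(10,2)"},
        "primary_key": "sale_id",
    },
    operational={
        "currency": "active",  # one of: active, archived, purged
        "lineage": ["extracted from pos_db", "currency converted"],
    },
)

print(sales_meta.operational["currency"])
```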

3.3 ROLE OF METADATA

Metadata plays a very important role in a data warehouse, and that role is different from the role of
the warehouse data itself. The various roles of metadata are explained below.

• Metadata acts as a directory. This directory helps the decision support system to locate the
contents of the data warehouse.
• Metadata helps the decision support system in mapping data when the data is transformed from
the operational environment to the data warehouse environment.
• Metadata helps in summarization between current detailed data and highly summarized data.
• Metadata also helps in summarization between lightly detailed data and highly summarized data.
• Metadata is used for query tools.
• Metadata is used in extraction and cleansing tools.
• Metadata is used in reporting tools.
• Metadata is used in transformation tools.
• Metadata plays an important role in loading functions.

The following diagram shows the roles of metadata.

3.4 METADATA REPOSITORY

A metadata repository is an integral part of a data warehouse system. It is a database or other
storage mechanism that is used to store metadata about data. A metadata repository can be used to
manage, organize, and maintain metadata in a consistent and structured manner, and can facilitate the
discovery, access, and use of data.

The metadata in a metadata repository may include information about the content, format,
structure, and other characteristics of data, and may be organized using metadata standards and
schemas. It contains the following metadata −

• Definition of data warehouse − It includes the description of the structure of the data warehouse.
The description is defined by schema, view, hierarchies, derived data definitions, and data mart
locations and contents.
• Business metadata − It contains the data ownership information, business definitions, and
changing policies.
• Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of the data as
it was migrated and the transformations applied to it.
• Data for mapping from operational environment to data warehouse − It includes the source
databases and their contents, data extraction, data partition, cleaning, transformation rules, data
refresh and purging rules.
• Algorithms for summarization − It includes dimension algorithms, data on granularity,
aggregation, summarizing, etc.
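
A minimal sketch of the "mapping from operational environment to data warehouse" entry is given below. It assumes the repository is an ordinary relational table, held here in an in-memory SQLite database; the table name `column_mapping`, the helper `where_from`, and all source and column names are hypothetical.

```python
import sqlite3

# In-memory database standing in for the repository (an assumption).
repo = sqlite3.connect(":memory:")
repo.execute(
    """CREATE TABLE column_mapping (
           warehouse_column TEXT PRIMARY KEY,
           source_system    TEXT,
           source_column    TEXT,
           transformation   TEXT
       )"""
)
repo.executemany(
    "INSERT INTO column_mapping VALUES (?, ?, ?, ?)",
    [
        ("sales_fact.amount_inr", "pos_db", "txn.amt", "convert to rupees"),
        ("sales_fact.sale_date", "pos_db", "txn.ts", "truncate to date"),
    ],
)


def where_from(warehouse_column):
    """Look up the source and transformation rule for a warehouse column."""
    row = repo.execute(
        "SELECT source_system, source_column, transformation "
        "FROM column_mapping WHERE warehouse_column = ?",
        (warehouse_column,),
    ).fetchone()
    return row  # None when the column is not registered


print(where_from("sales_fact.sale_date"))
```

A query or reporting tool could consult such a table to trace a warehouse column back to its operational source, which is the "directory" role described earlier.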

Benefits of Metadata Repository

A metadata repository is a centralized database or system that is used to store and manage
metadata. Some of the benefits of using a metadata repository include:

1. Improved data quality: A metadata repository can help ensure that metadata is
consistently structured and accurate, which can improve the overall quality of the data.
2. Increased data accessibility: A metadata repository can make it easier for users to access
and understand the data, by providing context and information about the data.

3. Enhanced data integration: A metadata repository can facilitate data integration by
providing a common place to store and manage metadata from multiple sources.
4. Improved data governance: A metadata repository can help enforce metadata standards
and policies, making it easier to ensure that data is being used and managed appropriately.
5. Enhanced data security: A metadata repository can help protect the privacy and security
of metadata, by providing controls to restrict access to sensitive or confidential information.
Metadata repositories can provide many benefits in terms of improving the quality, accessibility, and
management of data.

3.5 CHALLENGES FOR METADATA MANAGEMENT

The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of
reports, validates data transformation, and ensures the accuracy of calculations. Metadata also enforces
the definition of business terms for business end-users. Despite all these uses, managing metadata has
its challenges. Some of the challenges are discussed below.

1. Lack of standardization: Different organizations or systems may use different standards
or conventions for metadata, which can make it difficult to effectively manage metadata
across different sources.
2. Data quality: Poorly structured or incorrect metadata can lead to problems with data
quality, making it more difficult to use and understand the data.
3. Data integration: When integrating data from multiple sources, it can be challenging to
ensure that the metadata is consistent and aligned across the different sources.
4. Data governance: Establishing and enforcing metadata standards and policies can be
difficult, especially in large organizations with multiple stakeholders.
5. Data security: Ensuring the security and privacy of metadata can be a challenge,
especially when working with sensitive or confidential information.
6. Scattered metadata: In a big organization, metadata is scattered across the organization. It is
spread over spreadsheets, databases, and applications.
7. Unstructured sources: Metadata could be present in text files or multimedia files. To use this
data in information management solutions, it has to be correctly defined. There are no
industry-wide accepted standards, and data management solution vendors have a narrow focus.

3.6 DATA MART

A data mart is a subset of an organizational information store, generally oriented to a specific
purpose or primary data subject, which may be distributed to serve business needs. Data marts are
analytical data stores designed to focus on particular business functions for a specific community
within an organization. Data marts are usually derived from subsets of data in a data warehouse, though
in the bottom-up data warehouse design methodology, the data warehouse is created from the union of
organizational data marts.

The fundamental use of a data mart is in Business Intelligence (BI) applications. BI is used to gather,
store, access, and analyze data. A data mart can be used by smaller businesses to utilize the data they
have accumulated, since it is less expensive than implementing a data warehouse.

Types of Data Marts

There are mainly three types of data marts. These are

o Dependent Data Marts

o Independent Data Marts

o Hybrid Data Marts

Dependent Data Marts

A dependent data mart is a logical or physical subset of a larger data warehouse. According to this
technique, the data marts are treated as subsets of a data warehouse.

• In this technique, a data warehouse is created first, and the various data marts are then
created from it.
• These data marts are dependent on the data warehouse and extract the essential records
from it.
• In this technique, the data warehouse creates the data marts; therefore, there is no need
for data mart integration. It is also known as a top-down approach.

Independent Data Marts

The second approach is independent data marts (IDM). Here, independent data marts are created
first, and then a data warehouse is designed using these multiple independent data marts.

• In this approach, all the data marts are designed independently; therefore, the
integration of data marts is required.
• It is also termed a bottom-up approach, as the data marts are integrated to develop a
data warehouse.
• An independent data mart is not created from the central data warehouse, and its sources can
be different. Since the data comes from sources other than the central data warehouse, the ETL
process is a bit different.
• Independent data marts are mostly used by smaller groups within an organization, and their
sources are also limited. An independent data mart is generally created when a solution is
needed within a relatively short time.

Hybrid Data Marts

Hybrid data marts combine data taken from a data warehouse with data from other sources. This
can be useful in a variety of situations, including the ad hoc integration of a new group or product
that has been added to an organization.

3.7 WHY DO WE NEED A DATA MART?

Listed below are the reasons to create a data mart −

• To partition data in order to impose access control strategies.
• To speed up the queries by reducing the volume of data to be scanned.
• To segment data into different hardware platforms.
• To structure data in a form suitable for a user access tool.

Note − Do not create a data mart for any other reason, since the operating cost of data marting could
be very high. Before data marting, make sure that the data marting strategy is appropriate for your
particular solution.

Features of data marts:

Subset of Data: Data marts are designed to store a subset of data from a larger data warehouse
or data lake. This allows for faster query performance since the data in the data mart is focused
on a specific business unit or department.

Optimized for Query Performance: Data marts are optimized for query performance, which
means that they are designed to support fast queries and analysis of the data stored in the data
mart.
Customizable: Data marts are customizable, which means that they can be designed to meet
the specific needs of a business unit or department.
Scalability: Data marts can be scaled horizontally or vertically to accommodate larger
volumes of data or to support more users.
Integration with Business Intelligence Tools: Data marts can be integrated with business
intelligence tools, such as Tableau, Power BI, or QlikView, which allows users to analyze and
visualize the data stored in the data mart.
ETL Process: Data marts are typically populated using an Extract, Transform, Load (ETL)
process, which means that data is extracted from the larger data warehouse or data lake,
transformed to meet the requirements of the data mart, and loaded into the data mart.
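
The ETL step described above can be sketched in a few lines. This is only an illustration under simple assumptions: the warehouse is a list of rows in memory, the mart serves a single department, and the names `warehouse_sales`, `extract`, `transform`, `load`, and `grocery_mart` are all hypothetical.

```python
# Rows standing in for a warehouse fact table (invented data).
warehouse_sales = [
    {"dept": "grocery", "product": "rice", "qty": 4, "unit_price": 50.0},
    {"dept": "grocery", "product": "salt", "qty": 2, "unit_price": 20.0},
    {"dept": "apparel", "product": "shirt", "qty": 1, "unit_price": 600.0},
]


def extract(rows, dept):
    """Extract: keep only the rows that belong to one department."""
    return [row for row in rows if row["dept"] == dept]


def transform(rows):
    """Transform: derive the revenue figure that the mart's users query."""
    return [
        {"product": row["product"], "revenue": row["qty"] * row["unit_price"]}
        for row in rows
    ]


def load(mart, rows):
    """Load: append the transformed rows to the mart and return it."""
    mart.extend(rows)
    return mart


grocery_mart = load([], transform(extract(warehouse_sales, "grocery")))
print(grocery_mart)
```

The mart ends up holding two rows instead of three, which reflects the "subset of data" feature: queries from the grocery department never scan apparel rows.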

Advantages of Data Mart:


1. Implementing a data mart takes less time than implementing a data warehouse, because a
data mart is designed for a particular department of an organization.
2. Organizations can choose a data mart model depending on cost and their business needs.
3. Data can be easily accessed from a data mart.
4. It holds the data behind frequently run queries, which makes it easier to analyse business
trends.

Disadvantages of Data Mart:


1. Since it stores only the data related to a specific function, it does not hold the huge volume of
data related to each and every department of an organization, as a data warehouse does.
2. Creating too many data marts can become cumbersome.

3.8 COST-EFFECTIVE DATA MART


Follow the steps given below to make data marting cost-effective −

• Identify the Functional Splits
• Identify User Access Tool Requirements
• Identify Access Control Issues

Identify the Functional Splits


In this step, we determine whether the organization has natural functional splits. We look for
departmental splits, and we determine whether the way in which departments use information tends to
be in isolation from the rest of the organization.

Let's have an example. Consider a retail organization where each merchant is accountable for
maximizing the sales of a group of products. For this, the following information is valuable −

• sales transactions on a daily basis
• sales forecast on a weekly basis
• stock position on a daily basis
• stock movements on a daily basis

As the merchant is not interested in the products they are not dealing with, the data mart is a
subset of the data dealing with the product group of interest. The following diagram shows data
marting for different users.

Given below are the issues to be taken into account while determining the functional split −

• The structure of the department may change.
• The products might switch from one department to another.
• The merchant could query the sales trend of other products to analyze what is
happening to the sales.

Note − We need to determine the business benefits and technical feasibility of using a data mart.

Identify User Access Tool Requirements

• We need data marts to support user access tools that require internal data structures.
• The data in such structures is outside the control of the data warehouse but needs to be
populated and updated on a regular basis.
• Some tools can populate directly from the source system, but some cannot.
• Therefore, additional requirements outside the scope of the tool need to be
identified for the future.

Note − In order to ensure consistency of data across all access tools, the data should not be directly
populated from the data warehouse; rather, each tool must have its own data mart.

Identify Access Control Issues


There should be privacy rules to ensure that the data is accessed by authorized users only. For
example, a data warehouse for a retail banking institution ensures that all the accounts belong to the
same legal entity. Privacy laws can force you to totally prevent access to information that is not owned
by the specific bank.

o Data marts allow us to build a complete wall by physically separating data segments
within the data warehouse.
o To avoid possible privacy problems, the detailed data can be removed from the data
warehouse. We can create a data mart for each legal entity and load it via the data
warehouse, with detailed account data.

3.9 DESIGNING DATA MARTS

Data marts should be designed as a smaller version of the starflake schema within the data
warehouse and should match the database design of the data warehouse. This helps in maintaining
control over database instances.

The summaries are data marted in the same way as they would have been designed within the
data warehouse. Summary tables help to utilize all dimension data in the starflake schema.

3.10 COST OF DATA MARTS


The cost measures for data marting are as follows −

• Hardware and Software Cost
• Network Access
• Time Window Constraints

Hardware and Software Cost

Although data marts are created on the same hardware, they require some additional hardware
and software. Handling user queries requires additional processing power and disk storage. If
detailed data and the data mart exist within the data warehouse, then we would face additional cost to
store and manage replicated data.

Note − Data marting is more expensive than aggregations; therefore it should be used as an additional
strategy and not as an alternative strategy.

Network Access

A data mart could be at a different location from the data warehouse, so we should ensure that
the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart
load process.

Time Window Constraints

The extent to which a data mart loading process will eat into the available time window depends
on the complexity of the transformations and the data volumes being shipped. The determination of how
many data marts are possible depends on −

• Network capacity
• Time window available
• Volume of data being transferred
• Mechanisms being used to insert data into a data mart
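
The interaction of the first three factors can be shown with simple arithmetic. The sketch below is a rough estimate under stated assumptions: data marts are loaded one after another, only a fraction of the nominal bandwidth is usable (the 0.7 default is a placeholder, not a measured value), and transformation and insert costs are ignored. The function names and all figures are hypothetical.

```python
def load_hours(volume_gb, link_mbps, efficiency=0.7):
    """Estimate hours needed to ship `volume_gb` over a `link_mbps` link.

    `efficiency` is an assumed fraction of the nominal bandwidth that is
    actually usable; 0.7 is a placeholder, not a measured value.
    """
    megabits = volume_gb * 8 * 1000  # 1 GB taken as 8000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


def marts_that_fit(window_hours, volume_gb_per_mart, link_mbps):
    """How many sequential mart loads fit in the available time window."""
    per_mart = load_hours(volume_gb_per_mart, link_mbps)
    return int(window_hours // per_mart)


# 50 GB per mart over a 100 Mbps link, with a 6-hour load window.
print(round(load_hours(50, 100), 2))
print(marts_that_fit(6, 50, 100))
```

With these made-up figures each mart takes roughly 1.6 hours to ship, so only three marts fit into the six-hour window; a faster link or smaller volumes would raise that number.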

3.11 PARTITIONING STRATEGY

Partitioning is done to enhance performance and facilitate easy management of data. Partitioning
also helps in balancing the various requirements of the system.

Partitioning optimizes hardware performance and simplifies the management of the data warehouse
by dividing each fact table into multiple separate partitions. In this unit, we will discuss different
partitioning strategies.

Why is it Necessary to Partition?

Partitioning is important for the following reasons −

• For easy management,
• To assist backup/recovery,
• To enhance performance.

For Easy Management

The fact table in a data warehouse can grow to hundreds of gigabytes in size. A fact table of this size
is very hard to manage as a single entity. Therefore it needs partitioning.

To Assist Backup/Recovery

If we do not partition the fact table, then we have to load the complete fact table with all the data.
Partitioning allows us to load only as much data as is required on a regular basis.

Note − To cut down on the backup size, all partitions other than the current partition can be marked as
read-only.

To Enhance Performance

By partitioning the fact table into sets of data, query processing can be enhanced. Query
performance improves because the query now scans only those partitions that are relevant.
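
The performance benefit can be illustrated with a toy fact table partitioned by month. This is a sketch with invented data, not a description of how any particular database implements partitioning; the names `fact_rows`, `partitions`, and `total_for` are hypothetical.

```python
from collections import defaultdict

# Fact rows as (month, amount); the values are invented.
fact_rows = [
    ("2024-01", 120), ("2024-01", 80),
    ("2024-02", 200),
    ("2024-03", 50), ("2024-03", 70), ("2024-03", 30),
]

# Partition the fact table by month.
partitions = defaultdict(list)
for month, amount in fact_rows:
    partitions[month].append(amount)


def total_for(month):
    """Sum one month's sales, reading only that month's partition."""
    rows = partitions.get(month, [])
    return sum(rows), len(rows)  # (answer, rows scanned)


print(total_for("2024-03"))  # scans 3 rows instead of all 6
```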

3.12 VERTICAL PARTITION

Vertical partitioning involves creating relatively smaller tables with fewer columns while using
additional tables to store the remaining columns.

• In this partitioning strategy, a subset of the data fields is stored in each partition.
• Vertical partitioning operates at the entity level and is sometimes referred to as normalization.
• A columnar database can be considered a vertically partitioned database. Vertical
partitioning helps to separate sensitive and non-sensitive data, reducing the amount of
concurrent access.

The most typical application of vertical partitioning is to lower the I/O and performance costs of
retrieving frequently requested objects. Vertical partitioning is illustrated in the figure. In this
example, various attributes of an object are stored in different partitions. One partition stores data
that is often accessed, such as the product name, description, and price. The stock count and
last-ordered date are stored in another partition.
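
The product example can be sketched as follows. The column grouping follows the description above, while the sample rows, the constants `HOT_COLUMNS` and `COLD_COLUMNS`, and the function `split_vertically` are assumptions made for illustration.

```python
# One wide product record per row (invented data).
products = [
    {"id": 1, "name": "Pen", "description": "Blue ink", "price": 10.0,
     "stock_count": 500, "last_ordered": "2024-03-01"},
    {"id": 2, "name": "Notebook", "description": "200 pages", "price": 45.0,
     "stock_count": 120, "last_ordered": "2024-02-17"},
]

HOT_COLUMNS = ("name", "description", "price")   # accessed often
COLD_COLUMNS = ("stock_count", "last_ordered")   # accessed rarely


def split_vertically(rows):
    """Split each row into two partitions that share the same key."""
    hot = {r["id"]: {c: r[c] for c in HOT_COLUMNS} for r in rows}
    cold = {r["id"]: {c: r[c] for c in COLD_COLUMNS} for r in rows}
    return hot, cold


hot, cold = split_vertically(products)
print(hot[1])   # catalogue lookups touch only this partition
print(cold[1])
```

Both partitions are keyed by the same `id`, so the original row can be rebuilt by joining them when all the columns are needed.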

Vertical partitioning can be performed in the following two ways −

 Normalization
 Row Splitting

3.13 NORMALIZATION

Normalization is the standard relational method of database organization. In this method,
repeated rows are collapsed into a single row, which reduces the space required. Take a look at the
following tables that show how normalization is performed.

Table before Normalization

Table after Normalization
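
As a stand-in illustration of the idea, the sketch below normalizes a small sales table in which the store details repeat on every row. The column names and values are invented for this example and are not taken from the tables referred to above; `normalize` is a hypothetical helper.

```python
# Denormalized rows: store details repeat on every sale (invented data).
sales_wide = [
    {"product_id": 30, "qty": 5, "store_id": 16, "store_name": "Central", "region": "W"},
    {"product_id": 35, "qty": 4, "store_id": 16, "store_name": "Central", "region": "W"},
    {"product_id": 40, "qty": 5, "store_id": 64, "store_name": "Harbour", "region": "C"},
]


def normalize(rows):
    """Move repeated store attributes into their own table, keyed by store_id."""
    stores = {}
    sales = []
    for row in rows:
        stores[row["store_id"]] = {
            "store_name": row["store_name"],
            "region": row["region"],
        }
        sales.append(
            {"product_id": row["product_id"], "qty": row["qty"],
             "store_id": row["store_id"]}
        )
    return stores, sales


stores, sales = normalize(sales_wide)
print(len(stores), len(sales))
```

The three sales rows are kept, but the store name and region are now stored once per store rather than once per sale.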

3.14 ROW SPLITTING

Row splitting tends to leave a one-to-one mapping between partitions. The motive of row splitting is
to speed up access to a large table by reducing its size.

Note − While using vertical partitioning, make sure that there is no requirement to perform a major join
operation between two partitions.

Identify Key to Partition

It is crucial to choose the right partition key. Choosing the wrong partition key will lead to
reorganizing the fact table. Let's have an example. Suppose we want to partition the following table.

Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name

We can choose to partition on any key. The two possible keys could be

• region
• transaction_date

Suppose the business is organized in 30 geographical regions and each region has a different number
of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough
because our requirements capture has shown that a vast majority of queries are restricted to the user's
own business region.

If we partition by transaction_date instead of region, then the latest transactions from every region
will be in one partition. Now a user who wants to look at data within their own region has to query
across multiple partitions.

Hence it is worth determining the right partitioning key.
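
The comparison of the two candidate keys can be made concrete by counting how many partitions a single-region query has to read under each choice. The sketch below uses a handful of invented transactions; `partitions_touched` is a hypothetical helper.

```python
# A few transactions as (region, transaction_date); invented data.
transactions = [
    ("north", "2024-01"), ("north", "2024-02"), ("north", "2024-03"),
    ("south", "2024-01"), ("south", "2024-03"),
]


def partitions_touched(rows, key_index, wanted_region):
    """Count the partitions a single-region query must read.

    key_index 0 partitions by region, 1 partitions by transaction_date.
    """
    return len({row[key_index] for row in rows if row[0] == wanted_region})


print(partitions_touched(transactions, 0, "north"))  # partitioned by region
print(partitions_touched(transactions, 1, "north"))  # partitioned by date
```

With region as the key, the query for one region reads a single partition; with transaction_date as the key, the same query has to read every date partition that contains a row for that region.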

3.15 HORIZONTAL PARTITIONING

There are various ways in which a fact table can be partitioned. In horizontal partitioning, we
have to keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments

In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time
period represents a significant retention period within the business.

Partitioning by Time into Different-sized Segments

This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set
of small partitions for relatively current data and a larger partition for inactive data.
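
One way to picture different-sized segments is a rule that maps each transaction month to a partition name. The cut-off used below (monthly partitions for the current year, one yearly partition for anything older) and the naming scheme are assumptions chosen for illustration.

```python
def partition_name(txn_month, current_year):
    """Map a 'YYYY-MM' month to a partition name.

    Months in the current year get one partition each; older data is
    grouped into one larger partition per year. The cut-off rule is an
    assumption chosen for illustration.
    """
    year = int(txn_month[:4])
    if year == current_year:
        return "p_" + txn_month.replace("-", "_")   # small, current
    return "p_" + str(year)                         # large, inactive


for month in ["2024-03", "2024-02", "2023-11", "2023-01"]:
    print(month, "->", partition_name(month, 2024))
```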

Partition by Size of Table

• This partitioning is complex to manage.
• It requires metadata to identify what data is stored in each partition.

Partitioning Dimensions

If a dimension contains a large number of entries, then it is necessary to partition the dimension.
Here we have to check the size of the dimension. Consider a large dimension that changes over time. If
we need to store all the variations in order to apply comparisons, that dimension may become very
large. This would definitely affect the response time.

Round Robin Partitions

In the round robin technique, when a new partition is needed, the oldest one is archived. It uses
metadata to allow the user access tool to refer to the correct table partition. This technique makes it
easy to automate table management facilities within the data warehouse.
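
A minimal sketch of the round robin idea is shown below, assuming a fixed number of live partitions. The class `RoundRobinPartitions`, its methods, and the partition names are hypothetical; a real warehouse would keep this bookkeeping in its metadata repository.

```python
from collections import deque


class RoundRobinPartitions:
    """Keep a fixed number of live partitions; archive the oldest on overflow."""

    def __init__(self, max_live):
        self.max_live = max_live
        self.live = deque()     # partition names, oldest first
        self.archived = []      # names of archived partitions

    def add(self, name):
        """Create a new partition, archiving the oldest if the set is full."""
        if len(self.live) == self.max_live:
            self.archived.append(self.live.popleft())
        self.live.append(name)

    def current(self):
        """Metadata lookup: the partition that holds the newest data."""
        return self.live[-1]


parts = RoundRobinPartitions(max_live=3)
for month in ["2024_01", "2024_02", "2024_03", "2024_04"]:
    parts.add(month)
print(list(parts.live), parts.archived)
```

Because adding a partition and archiving the oldest one always happen together, the procedure is easy to run automatically on a schedule.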

QUESTION BANK

PART – A

1. What are the three types of metadata?
2. Define metadata. Give a real-time example.
3. What are the advantages of metadata?
4. List three disadvantages of metadata.
5. State the types of data mart.
6. Why do we use a data mart?
7. Write the characteristics of a data mart.
8. What is data partitioning and how is it used?
9. Write examples of data partitioning.
10. Illustrate the challenges of metadata management.

PART – B

1. Describe in detail the categories and role of metadata with examples.
2. Suppose that a data warehouse consists of four dimensions (customer, product,
salesperson and sales time) and three measures: sales amount (in rupees), VAT (in
rupees) and payment type (in rupees). Draw the different classes of schemas that are
popularly used for modeling data warehouses and explain them.
3. How would you explain metadata implementation with examples?
4. Design a star schema, a snowflake schema and a fact constellation schema for a data
warehouse that consists of the following four dimensions (Time, Item, Branch and
Location). Include the appropriate measures required for the schema.
5. Explain the mapping of a data warehouse to a multiprocessor architecture with the concepts
of parallelism and data partitioning.
6. What is metadata? Illustrate the various classifications of metadata with examples and
explain them.
