Unit 3 Notes DWM

The document discusses data warehouse design, emphasizing its complexity and the need for a structured approach to meet evolving business analytical requirements. It outlines the usage of data warehousing for historical data analysis, decision-making, and integration of diverse application systems, while detailing design methodologies such as top-down and bottom-up approaches. Additionally, it covers data cube computation methods, OLAP indexing, and the role of data warehouses in information processing and multidimensional data mining.

Uploaded by

Gajanan Markad

Unit 3: Data Warehouse Designing and Online Analytical

Processing II
Q.1] Explain Data Warehouse Design
▪ A data warehouse is a single data repository in which records from multiple
(heterogeneous) data sources are integrated for online analytical
processing (OLAP).
▪ Thus, data warehouse design is a hugely complex, lengthy, and hence error-
prone process.
▪ Furthermore, business analytical functions change over time, which results
in changes in the requirements for the systems.
Q.2] State usage of data warehousing.
Usage of Data warehousing:
1. It holds consolidated historical data, which helps the organization
analyze its business.
2. A data warehouse helps executives organize, understand, and use
their data to make strategic decisions.
3. Data warehouse systems help in the integration of a diversity of
application systems.
4. A data warehouse system helps in consolidated historical data analysis.
5. It improves query performance.
Q.3] Describe business framework for Data warehouse design.
Business framework for DW design:
The business analyst gets information from the data warehouse to measure
performance and make critical adjustments in order to outperform other
businesses in the market.
Having a data warehouse offers the following advantages:
1. Since a data warehouse can gather information quickly and efficiently, it
can enhance business productivity.
2. A data warehouse provides us a consistent view of customers and items;
hence, it helps us manage customer relationships.
3. A data warehouse also helps in bringing down costs by tracking trends and
patterns over a long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and
analyze the business needs and construct a business analysis framework. Each
person has different views regarding the design of a data warehouse.
These views are as follows:
1. The top-down view: This view allows the selection of relevant
information needed for a data warehouse.
2. The data source view: This view presents the information being captured,
stored, and managed by the operational system.
3. The data warehouse view: This view includes the fact tables and
dimension tables. It represents the information stored inside the data
warehouse.
4. The business query view: It is the view of the data from the viewpoint of
the end user.
Q.4] Explain top-down and bottom-up design approach of data warehouse. OR
Explain Data warehouse design process.
Three methods:
1. Software Engineering Model
2. Typical Design Process
3. Top-down approach and Bottom-up approach

1] Software Engineering Model:


a. Requirements Gathering:
▪ Gathering requirements is step one of the data warehouse design
process.
▪ The goal of the requirements-gathering phase is to determine the criteria
for a successful implementation of the data warehouse.
b. Physical Environment Setup:
▪ Once the business requirements are set, the next step is to determine the
physical environment for the data warehouse.
▪ There should be separate physical application and database servers as
well as separate ETL/ELT, OLAP, data cube, and reporting processes set
up for development, testing, and production.
c. Data Modeling:
Once requirements gathering and physical environments have been defined, the
next step is to define how data structures will be accessed, connected, processed,
and stored in the data warehouse. This process is known as data modeling.

d. Extract, Transform, Load (ETL) Solution:
ETL, or Extract, Transform, Load, is the process used to pull data out of existing
data sources and put it into the warehouse.

e. OLAP Cube Design:


▪ On-Line Analytical Processing (OLAP) provides the infrastructure for ad-
hoc user query and multi-dimensional analysis.
▪ OLAP design specification should come from users who will query the
data.
f. Front End Development:
▪ Front end development is how users will access the data for analysis and
run reports.
▪ There are many options available, including building your front end
in-house or purchasing an off-the-shelf product.
g. Report Development:
▪ For most end users, the only contact they have with the data warehouse is
through the reports they generate.
▪ Users’ ability to select their report criteria quickly and efficiently is an
essential feature for data warehouse report generation.
h. Deployment:
▪ Time to go live.
▪ Deciding to make the system available to everyone at once or perform a
staggered release, will depend on the number of end users and how they
will access the data warehouse system.
2] Typical Design Process:

1. Choose a business process to model. If the business process is
organizational, choose the data warehouse; if the process is departmental,
choose a data mart.
2. Choose the grain of the business process: the fundamental level of detail of
the data to be represented in the fact table.
3. Choose the dimensions that will apply to each fact table record: typical
dimensions are time, location, item, etc.
4. Choose the measures that will populate each fact table record. Typical
measures are numeric values such as charges and count.
3] Top-down approach and Bottom-up approach:
1] Top-down Approach:
The top-down approach is a data-driven approach: the
information is gathered and integrated first, and then the business requirements
by subjects for building data marts are formulated.

1. External Sources: An external source is a source from which data is collected,
irrespective of the type of data. Data can be structured, semi-structured, or
unstructured.
2. Stage Area: Since the data extracted from the external sources does not
follow a particular format, it needs to be validated before being loaded into
the data warehouse.
For this purpose, it is recommended to use an ETL tool.
▪ E (Extract): Data is extracted from the external data source.
▪ T (Transform): Data is transformed into the standard format.
▪ L (Load): Data is loaded into the data warehouse after being transformed
into the standard format.
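The three steps above can be sketched as a minimal pipeline. This is an illustrative Python sketch only, not a real ETL tool; the function names and the record fields (branch, amount) are hypothetical:

```python
def extract(sources):
    """E: pull raw records from each external source."""
    for source in sources:
        yield from source

def transform(record):
    """T: convert a raw record into the warehouse's standard format."""
    return {
        "branch": record["branch"].strip().upper(),
        "amount": round(float(record["amount"]), 2),
    }

def load(records, warehouse):
    """L: append the transformed records to the central repository."""
    warehouse.extend(records)

# Two hypothetical external sources with inconsistent formatting.
sources = [
    [{"branch": " pune ", "amount": "120.5"}],
    [{"branch": "mumbai", "amount": "99"}],
]
warehouse = []
load((transform(r) for r in extract(sources)), warehouse)
# warehouse now holds records in one standard format.
```

In practice, dedicated ETL/ELT tools perform these same stages at scale, but the E-T-L division of labour is exactly this.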
3. Data Warehouse: After cleansing, the data is stored in the data warehouse as
the central repository. It actually stores the metadata, while the actual data
is stored in the data marts.
4. Data Mart: A data mart is also part of the storage component (a subset of the
data warehouse). It stores the information of a particular function of an
organisation, handled by a single authority. There can be any number of
data marts in an organisation, depending upon its functions.
5. Data Mining: It is used to find the hidden patterns present in the database
or data warehouse with the help of data mining algorithms.
Advantages of Top-Down Approach:
1. Unified, integrated data warehouse.
2. Scalable for future growth.
3. Centralized control over data quality.
4. Aligns with long-term business strategy.
Disadvantages of Top-Down Approach:
1. Time-consuming, with delayed results.
2. High initial costs.
3. Complex planning and collaboration needed.
4. Difficult to adjust once built.

2] Bottom-Up approach:
In this approach, a data mart is created first for a particular business process (or
subject).
1. First, the data is extracted from external sources.
2. Then, the data goes through the staging area and is loaded into data marts
instead of the data warehouse.
3. The data marts are created first and provide reporting capability; each
addresses a single business area.
4. These data marts are then integrated into the data warehouse.
Advantages of Bottom-Up Approach:
1. Faster implementation and quick results.
2. Lower initial investment.
3. Flexible and adaptable.
4. Immediate usability for departments.

Disadvantages of Bottom-Up Approach:


1. Can create data silos.
2. Integration challenges later on.
3. Inconsistent data management.
4. Long-term complexity with multiple marts.
Q.5] Difference between Top-down and Bottom-up approach

Q.6] Explain Data Warehouse usage for information processing.


Data warehouse usage for information processing:
▪ Data warehouses and data marts are used in a wide range of applications.
Business executives use the data in data warehouses and data marts to
perform data analysis and make strategic decisions.
▪ Data warehouses are used extensively in banking and financial services,
consumer goods and retail distribution sectors, and controlled
manufacturing such as demand-based production.
▪ Initially, the data warehouse is mainly used for generating reports and
answering predefined queries. Progressively, it is used to analyse
summarized and detailed data, where the results are presented in the form
of reports and charts. Later, the data warehouse is used for strategic
purposes, performing multidimensional analysis and sophisticated slice-
and-dice operations. Finally, the data warehouse may be employed for
knowledge discovery and strategic decision-making using data mining
tools.
▪ The tools for data warehousing can be categorized into access and retrieval
tools, database reporting tools, data analysis tools, and data mining tools.
Here are three kinds of data warehouse applications:
1. Information Processing:
A data warehouse allows users to process the data
stored in it. The data can be processed by means of querying, basic
statistical analysis, and reporting using crosstabs, tables, charts, or graphs.
2. Analytical Processing:
A data warehouse supports analytical processing of
the information stored in it. The data can be analysed by means of basic
OLAP operations, including slice-and-dice, drill down, drill up, and
pivoting.
3. Data Mining:
Data mining supports knowledge discovery by finding
hidden patterns and associations, constructing analytical models, and
performing classification and prediction. The mining results can be
presented using visualization tools.
Q.7] From Online Analytical Processing to Multidimensional Data Mining
▪ The data mining field has done extensive research on mining various data
types, including relational data, data from data warehouses, transaction
data, time-series data, spatial data, text data, and flat files.
▪ Multidimensional data mining integrates OLAP with data mining to
uncover knowledge in multidimensional databases.
Multidimensional data mining is particularly important for the following
reasons:
1] High quality of data in data warehouses:
Most data mining tools need to
work on integrated, consistent, and cleaned data.
A data warehouse constructed by such pre-processing serves as a valuable
source of high-quality data for OLAP as well as for data mining.
2] Available information processing infrastructure surrounding data
warehouses:
Comprehensive information processing and data analysis
infrastructures have been constructed surrounding data warehouses, which
include accessing, integration, consolidation, and transformation of multiple
heterogeneous databases, reporting and OLAP analysis tools.
It is sensible to make the best use of the available infrastructures
rather than constructing everything from scratch.
3] OLAP-based exploration of multidimensional data:
Effective data mining
needs exploratory data analysis.
Multidimensional data mining provides facilities for mining on
different subsets of data and at varying levels of abstraction by drilling, pivoting,
filtering, dicing, and slicing on a data cube.
4] Online selection of data mining functions:
By integrating OLAP with various
data mining functions, multidimensional data mining provides users with the
flexibility to select desired data mining functions and swap data mining tasks
dynamically.
Q.8] List data cube computation methods/Strategies.
Data Cube computation methods:
1. Sorting, hashing, and grouping.
2. Simultaneous aggregation and caching of intermediate results.
3. Aggregation from the smallest child when there exist multiple child
cuboids.
4. The Apriori pruning method can be explored to compute iceberg cubes
efficiently.
5. Materialization can also be performed on the cuboids.

Q.9] State and explain Data Cube Computation Methods/ Strategies


1. Sorting, hashing and grouping
2. Simultaneous aggregation and caching intermediate results
3. Aggregation from the smallest child
4. The Apriori pruning method
1] Sorting, hashing and grouping:
▪ In cube computation, aggregation is performed on the tuples (or cells) that
share the same set of dimension values.
▪ Thus, it is important to explore sorting, hashing, and grouping operations
to access and group such data together to facilitate computation of such
aggregates.
Ex: To compute total sales by branch, day, and item, it can be more efficient
to sort the tuples or cells by branch, then by day, and then group them
according to the item name.
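The example above can be sketched with hash-based grouping in Python: cells sharing the same (branch, day, item) key are aggregated together in one pass. The sample tuples are hypothetical:

```python
from collections import defaultdict

# Hypothetical fact-table cells: (branch, day, item, sales).
tuples = [
    ("B1", "Mon", "milk", 10),
    ("B1", "Mon", "milk", 5),
    ("B1", "Tue", "bread", 7),
    ("B2", "Mon", "milk", 3),
]

totals = defaultdict(int)
for branch, day, item, sales in tuples:
    totals[(branch, day, item)] += sales  # group, then aggregate

# totals[("B1", "Mon", "milk")] -> 15
```

Sorting the tuples on the grouping attributes first achieves the same effect with sequential scans instead of a hash table.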
2] Simultaneous aggregation and caching intermediate results:
In cube computation, it is efficient to compute higher-level aggregates from
previously computed lower-level aggregates, rather than from the base fact table.
Simultaneous aggregation from cached intermediate computation results may
lead to the reduction of expensive disk input/output (I/O) operations.
Ex: To compute sales by branch, we can use the intermediate results derived
from the computation of a lower-level cuboid such as sales by branch and
day.

3] Aggregation from the smallest child:


If a parent cuboid has more than one child, it is efficient to compute it from the
smallest previously computed child cuboid.
Ex: To compute a sales cuboid, Cbranch, when there exist two previously
computed cuboids, C{branch,year} and C{branch,item}, it is obviously more
efficient to compute Cbranch from the former than from the latter if there are
many more distinct items than distinct years.
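A sketch of this roll-up, assuming C{branch,year} has already been materialized (the sample figures are hypothetical):

```python
from collections import defaultdict

# The smaller child cuboid C{branch,year}: (branch, year) -> total sales.
# Rolling it up is cheaper than re-scanning the base fact table.
c_branch_year = {
    ("B1", 2023): 100, ("B1", 2024): 120,
    ("B2", 2023): 80,  ("B2", 2024): 90,
}

c_branch = defaultdict(int)
for (branch, _year), sales in c_branch_year.items():
    c_branch[branch] += sales  # aggregate away the year dimension

# c_branch -> {"B1": 220, "B2": 170}
```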
4] The Apriori pruning method:
This method can trim or delete the unwanted (infrequent) data. Apriori requires
a priori knowledge to generate the frequent itemsets and involves two
time-consuming pruning steps to exclude the infrequent candidates and keep
the frequent ones. It is used to reduce the computation of iceberg cubes.

Q.10] Explain Data warehouse implementation. OR Explain an overview of


Efficient Data Cube Computation.
▪ A data warehouse contains multidimensional data cubes.
▪ Data analysis depends upon efficient data cube computation.
▪ A data cube is a lattice of cuboids. A cuboid is an aggregation of data, also
referred to as a group-by.
▪ Taking the three attributes city, item, and year as the dimensions for the
data cube, and sales in dollars as the measure, the total number of cuboids,
or group-by's, that can be computed for this data cube is 2^3 = 8.
▪ The possible group-by's are the following: {(city, item, year), (city, item),
(city, year), (item, year), (city), (item), (year), ()}, where () means that the
group-by is empty (i.e., the dimensions are not grouped). These group-by's
form a lattice of cuboids for the data cube,
as shown in Figure.
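The full set of group-by's can be enumerated programmatically; a small Python sketch:

```python
from itertools import combinations

# Enumerate every group-by (cuboid) over the dimensions {city, item, year};
# for n dimensions there are 2**n cuboids, including the empty (apex) group-by.
dims = ("city", "item", "year")
cuboids = [combo for k in range(len(dims) + 1)
           for combo in combinations(dims, k)]

print(len(cuboids))  # 8, from the apex () up to the base (city, item, year)
```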

▪ The base cuboid contains all three dimensions, city, item, and year. It can
return the total sales for any combination of the three dimensions.
▪ The apex cuboid, or 0-D cuboid, refers to the case where the group-by is
empty. It contains the total sum of all sales.
▪ The base cuboid is the least generalized (most specific) of the cuboids.
▪ The apex cuboid is the most generalized (least specific) of the cuboids, and
is often denoted as all.
Materialization (Precomputation of Data Cube):
There are three choices for data cube materialization given a base cuboid:
1. No materialization
2. Full materialization
3. Partial materialization
1] No materialization:
In this technique, cuboids are not precomputed. This leads to computing
expensive multidimensional aggregates that can takes more time and money.
2] Full materialization:
▪ This technique precomputes all the cuboids in the
multidimensional data cube. The resulting lattice of computed cuboids is
referred to as the full cube.
▪ This choice typically requires a huge amount of storage space in order to
store all of the precomputed cuboids.
3] Partial materialization:
▪ Selectively compute a proper subset of the whole set of possible cuboids.
▪ Compute a subset of the cube, which contains only those cells that satisfy
some user specified criterion.
▪ It uses subcubes, where only some of the cells may be precomputed for
various cuboids.
Q.11] Explain OLAP data indexing with its type. OR Explain Bitmap and Join
Index for OLAP.
Indexing OLAP Data Types:
1. Bitmap Index
2. Join Index
1] Bitmap index in OLAP:
▪ The bitmap index is an alternative representation of the record ID (RID)
list.
▪ Each attribute value is represented by a distinct bit. If the attribute's
domain consists of n values, then n bits are needed for each entry in the
bitmap index.
▪ If the attribute value is present in a row, it is represented by a 1 in the
corresponding row of the bitmap index, and the rest are 0 (zero).
Example:

Base table mapping to bitmap index tables for dimensions Region and Type are:
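Since the mapping tables are shown only as a figure, here is a small Python sketch (with hypothetical sample rows) of how such a bitmap index is built for a dimension like Region:

```python
# Hypothetical Region column of the base table, one value per row.
rows = ["Asia", "Europe", "Asia", "America", "Europe"]

# One bit vector per distinct value; bit i is 1 iff row i holds that value.
bitmap = {value: [1 if r == value else 0 for r in rows]
          for value in sorted(set(rows))}

# bitmap["Asia"] -> [1, 0, 1, 0, 0]
```

A query such as "Region = Asia OR Region = Europe" then reduces to a fast bitwise OR of two bit vectors.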

Advantages of Bitmap Index:


1. Efficient for queries with low cardinality.
2. Requires less storage for columns with few unique values.
3. Enhances performance in complex queries.
4. Fast for large datasets with aggregate functions.
Disadvantages of Bitmap Index:
1. Ineffective for high cardinality columns.
2. Slower updates when data changes frequently.
3. Can degrade performance with volatile data.
4. May consume more memory with wide tables.

2] Join Indexing in OLAP:


▪ Join indexing registers the joinable rows of two relations, e.g., maintaining
the relationship between a fact table's foreign key and the matching
dimension table's primary key.
▪ In addition to a bitmap index on a single table, we can create a bitmap join
index, which is a bitmap index for the join of two or more tables.
▪ In a bitmap join index, the bitmap for the table to be indexed is built for
values coming from the joined tables.
Example: Bitmap Join Index (consider any relevant example like this)
The company's customers table:
SELECT cust_id, cust_gender, cust_income FROM customers;

The sales table must contain cust_id values:

SELECT time_id, cust_id, amount_sold FROM sales;

The following query illustrates the join result that is used to create the bitmaps
that are stored in the bitmap join index:
SELECT sales.time_id, customers.cust_gender, sales.amount_sold FROM sales,
customers WHERE sales.cust_id = customers.cust_id;
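The effect of that join can be sketched in Python: for each cust_gender value, a bit vector over the sales rows is precomputed via the cust_id join. The table contents below are hypothetical:

```python
# Hypothetical dimension table: cust_id -> cust_gender.
customers = {101: "M", 102: "F", 103: "F"}
# Hypothetical fact table rows: (time_id, cust_id).
sales = [(1, 101), (2, 103), (3, 101), (4, 102)]

# Bitmap join index: for each gender, bit i is 1 iff sales row i
# joins to a customer of that gender.
join_index = {g: [1 if customers[cust_id] == g else 0
                  for _time_id, cust_id in sales]
              for g in set(customers.values())}

# join_index["F"] -> [0, 1, 0, 1]  (sales rows joining to female customers)
```

At query time, a predicate on cust_gender can be answered against the sales table without actually performing the join.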
Advantages of Join Index:
1. Speeds up complex join queries.
2. Reduces the need for expensive joins at query time.
3. Simplifies queries by storing pre-joined data.
4. Improves query response time for common joins.
Disadvantages of Join Index:
1. Increases storage overhead.
2. Requires maintenance when data changes.
3. Can become outdated with frequent data changes.
4. Adds complexity in managing indexes.

Q.12] Explain role of OLAP queries with example


Role of OLAP:
1. OLAP is a database technology that has been optimized for querying and
reporting, instead of processing transactions.
2. OLAP queries can be used to identify and compute the specific values from
a cube that are required for decision support.
Ex: (consider another related example also)
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_sup
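A Python sketch of what this iceberg query computes, restricted to the base cuboid for brevity; the salesInfo rows and the min_sup value are hypothetical:

```python
from collections import Counter

# Hypothetical salesInfo rows: (month, city, customer_group).
sales_info = [
    ("Jan", "Pune", "retail"),
    ("Jan", "Pune", "retail"),
    ("Jan", "Pune", "retail"),
    ("Feb", "Mumbai", "wholesale"),
]
min_sup = 2

# count(*) per cell, keeping only cells that meet the having-clause threshold.
counts = Counter(sales_info)
iceberg = {cell: n for cell, n in counts.items() if n >= min_sup}

# Only ("Jan", "Pune", "retail") survives, with count 3.
```

A full iceberg cube applies the same min_sup test to every cuboid in the lattice, which is where Apriori-style pruning pays off.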
Q.13] Efficient Processing of OLAP Queries
The purpose of materializing cuboids and constructing OLAP index structures is
to speed up query processing in data cubes.
By using materialized views, query processing should proceed as follows:
1. Determine which operations should be performed on the available
cuboids:
This involves transforming any selection, projection, roll-up (group-by),
and drill-down operations specified in the query into corresponding SQL
and/or OLAP operations.
2. Determine the materialized cuboid(s) and its relevant operations:
This involves identifying all of the materialized cuboids that may
potentially be used to answer the query.
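Step 2 can be sketched as follows: a materialized cuboid can answer the query if its dimension set contains all of the query's group-by dimensions (the extra dimensions can be rolled up), and among the candidates the one with the fewest dimensions is typically cheapest to aggregate. The cuboid contents here are hypothetical:

```python
# Hypothetical set of materialized cuboids, each named by its dimensions.
materialized = [
    {"city", "item", "year"},
    {"city", "year"},
    {"item"},
]
query_dims = {"city"}  # the query groups by city only

# A cuboid is usable iff it retains every dimension the query needs.
candidates = [c for c in materialized if query_dims <= c]

# Heuristic: prefer the candidate with the fewest dimensions,
# since less data must be aggregated away at query time.
best = min(candidates, key=len)
# best -> {"city", "year"}
```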

Q.14] Explain OLAP Server Architecture.


OLAP Server Architectures:
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)
1. Relational OLAP (ROLAP):
Relational On-Line Analytical Processing
(ROLAP) is primarily used for data stored in a relational database, where both
the base data and dimension tables are stored as relational tables.
▪ ROLAP servers are placed between the relational back-end server and the
client front-end tools.
▪ To store and manage warehouse data, ROLAP uses a relational or
extended-relational DBMS.
▪ ROLAP works directly with relational databases and does not require
pre-computation.
▪ In ROLAP, the data is arranged in traditional ways, as rows and columns,
in the data warehouse.
ROLAP includes the following components:
1. Database server
2. ROLAP server
3. Front-end tool.

Advantages of ROLAP:
1. Handles large data volumes using relational databases.
2. Real-time data access without pre-aggregation.
3. Flexible with any relational database.
4. Scalable for complex queries over large datasets.
Disadvantages of ROLAP:
1. Slower query performance due to on-the-fly aggregation.
2. Depends on relational database performance.
3. Requires more processing power for complex queries.
4. Not ideal for highly aggregated data.

2. Multidimensional OLAP (MOLAP):


Multidimensional On-Line Analytical
Processing (MOLAP) supports multidimensional views of data. Storage
utilization in multidimensional data stores may be low if the data set is sparse.
▪ MOLAP is sometimes referred to as just OLAP (Data Cube).
▪ MOLAP stores data on disk in the form of a specialized multidimensional
array structure. It is used for OLAP based on the arrays' random-access
capability.
▪ The multidimensional array is typically stored in MOLAP in a linear
allocation based on nested traversal of the axes in some predetermined
order.
MOLAP includes the following components:
1. Database server.
2. MOLAP server.
3. Front-end tool.

Advantages of MOLAP:
1. Fast query performance with pre-aggregated data.
2. Efficient for complex calculations and aggregations.
3. Supports interactive analysis.
4. Handles advanced analytical functions well.
Disadvantages of MOLAP:
1. Limited scalability with large datasets.
2. High storage requirements for data cubes.
3. Less flexible for changing data structures.
4. Struggles with real-time data updates.
3. Hybrid OLAP (HOLAP):
Hybrid OLAP is a combination of ROLAP and MOLAP. It offers the higher
scalability of ROLAP and the faster computation of MOLAP. HOLAP stores
aggregations in MOLAP for fast query performance, and detailed data in ROLAP
to optimize cube-processing time.
▪ HOLAP tools can utilize both pre-calculated cubes and relational data
sources.
▪ HOLAP servers are capable of storing large amounts of detailed data. On
the one hand, HOLAP benefits from ROLAP’s greater scalability.
▪ HOLAP, on the other hand, makes use of cube technology for faster
performance and summary-type information.
HOLAP includes the following components:
1. Database Server
2. Multidimensional database (MDDB)
3. HOLAP server
4. Front-end tool.

Fig. HOLAP architecture


Advantages of HOLAP:
1. Combines ROLAP and MOLAP strengths.
2. Offers both detailed and aggregated data.
3. Scalable and handles large datasets well.
4. Fast access to aggregated data with drill-down capability.
Disadvantages of HOLAP:
1. More complex architecture.
2. Requires management of both storage types.
3. Slower than MOLAP for highly aggregated queries.
4. Higher maintenance costs.

Q.15] Compare ROLAP versus MOLAP.


Q.16] Difference between ROLAP, MOLAP and HOLAP
