SATA Technology and Business College

Chapter Four – Data Warehouse Design


4.1 Introduction

A Data Warehouse (DW) is a database that stores information organized to satisfy decision-making
requests. A very frequent problem in enterprises is the inability to access complete, integrated
corporate information that can satisfy those requests.

A paradox occurs: data exists, but information cannot be obtained. In general, a DW is
constructed with the goal of storing and providing all the relevant information that is generated
across the different databases of an enterprise. A DW is a database with particular features.

Concerning the data it contains, a DW is the result of transforming, cleaning and
integrating data that comes from operational databases. It also includes indicators that are
derived from operational data and give it additional value. Concerning its use, it is
expected to support complex queries (summarization, aggregates, crossing of data), while its
maintenance does not involve a transactional load. In addition, in a DW environment end users
query the DW directly through user-friendly query tools, instead of accessing
information through reports generated by specialists.

Building and maintaining a DW requires solving problems of many different kinds. In this chapter
we concentrate on DW design. A data warehouse has three main components:

• A “Central Data Warehouse” or “Operational Data Store (ODS)”, which is a database
organized according to the corporate data model.
• One or more “data marts”: extracts from the central data warehouse that are organized
according to the particular retrieval requirements of individual users.
• The “legacy systems” where an enterprise’s data are currently kept.

4.2 The Central Data Warehouse

The Central Data Warehouse is just that: a warehouse. All the enterprise’s data are stored
there, “normalized”, in order to minimize redundancy and so that each item may be found easily. This
is accomplished by organizing the data according to the enterprise’s corporate data model. Think of it
as a giant grocery store warehouse where the chocolates are kept in one section, the T-shirts in
another, and the CDs in a third. The literature describes two broad approaches
to relational DW design: one that applies dimensional modeling techniques, and another that
is based mainly on the concept of materialized views.

Dimensional models represent data with a “cube” structure, making the logical data
representation more compatible with OLAP data management. The objectives of dimensional
modeling are:

1 Data Warehousing and Data Mining Instructor – Ephrem


A.

(i) To produce database structures that are easy for end-users to understand and write
queries against,
(ii) To maximize the efficiency of queries.

Dimensional modeling achieves these objectives by minimizing the number of tables and relationships between them.
Normalized databases have some characteristics that are appropriate for OLTP systems, but not
for DWs:

(i) Their structure is not easy for end-users to understand and use. In OLTP systems this is
not a problem because end-users usually interact with the database through a layer of
software.
(ii) Data redundancy is minimized. This maximizes the efficiency of updates, but tends to
penalize retrievals. Data redundancy is not a problem in DWs because data is not
updated on-line.

The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a
collection of related data items, consisting of measures and context data. It typically represents
business items or business transactions. A dimension is a collection of data that describe one
business dimension. Dimensions determine the contextual background for the facts; they are the
parameters over which we want to perform OLAP. A measure is a numeric attribute of a fact,
representing the performance or behavior of the business relative to the dimensions. The OLAP cube
design requirements will be a natural outcome of the dimensional model if the data warehouse is
designed to support the way users want to query data.

4.3 Data Warehousing Objects

The following types of objects are commonly used in dimensional data warehouse schemas.

Fact tables are the large tables in your warehouse schema that store business measurements. Fact
tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data,
usually numeric and additive, that can be analyzed and examined. Examples include sales, cost,
and profit.

Dimension tables, also known as lookup or reference tables, contain the relatively static data in
the warehouse. Dimension tables store the information you normally use to constrain queries.
Dimension tables are usually textual and descriptive, and you can use them as the row headers of
the result set. Examples are customers, locations, time, suppliers or products.

Fact Tables


A fact table typically has two types of columns: those that contain numeric facts (often called
measurements), and those that are foreign keys to dimension tables. A fact table contains either
detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are
often called summary tables. A fact table usually contains facts with the same level of
aggregation. Though most facts are additive, they can also be semi-additive or non-additive.
Additive facts can be aggregated by simple arithmetical addition. A common example of this is
sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive
facts can be aggregated along some of the dimensions and not along others. An example of this is
inventory levels, which can be summed across products or locations but not across time, because
adding the levels recorded on different days counts the same stock more than once.
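The difference between additive and semi-additive facts can be illustrated with a minimal Python sketch that uses the standard sqlite3 module. The inventory table, its column names, and the figures are hypothetical and chosen only for illustration; the sketch assumes the usual reading of semi-additivity, in which an inventory level may be summed across products but not across days.

```python
import sqlite3

# Hypothetical daily inventory snapshot: one row per (day, product).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (day TEXT, product TEXT, on_hand INTEGER)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [
        ("2024-01-01", "widget", 10),
        ("2024-01-01", "gadget", 5),
        ("2024-01-02", "widget", 8),
        ("2024-01-02", "gadget", 7),
    ],
)

# Meaningful: summing across the product dimension gives total stock per day.
stock_per_day = dict(
    conn.execute("SELECT day, SUM(on_hand) FROM inventory GROUP BY day")
)

# Misleading: summing across the time dimension counts the same units twice.
summed_over_time = dict(
    conn.execute("SELECT product, SUM(on_hand) FROM inventory GROUP BY product")
)

# Over time, an average (or a closing level) is the more useful aggregate.
avg_over_time = dict(
    conn.execute("SELECT product, AVG(on_hand) FROM inventory GROUP BY product")
)
```

Here `summed_over_time` reports 18 widgets although no more than 10 were ever in stock, which is why the level is treated as semi-additive rather than additive.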

Creating a new fact table: You must define a fact table for each star schema. From a modeling
standpoint, the primary key of the fact table is usually a composite key that is made up of all of
its foreign keys.

Fact tables contain business event details for summarization. Fact tables are often very large,
containing hundreds of millions of rows and consuming hundreds of gigabytes or multiple
terabytes of storage. Because dimension tables contain records that describe facts, the fact table
can be reduced to columns for dimension foreign keys and numeric fact values. Text, BLOBs,
and denormalized data are typically not stored in the fact table. The definition of a sample ‘sales’ fact
table follows:

CREATE TABLE sales (
  prod_id       NUMBER(7)    CONSTRAINT sales_product_nn  NOT NULL,
  cust_id       NUMBER       CONSTRAINT sales_customer_nn NOT NULL,
  time_id       DATE         CONSTRAINT sales_time_nn     NOT NULL,
  ad_id         NUMBER(7),
  quantity_sold NUMBER(4)    CONSTRAINT sales_quantity_nn NOT NULL,
  amount        NUMBER(10,2) CONSTRAINT sales_amount_nn   NOT NULL,
  cost          NUMBER(10,2) CONSTRAINT sales_cost_nn     NOT NULL
);

Multiple Fact Tables: Multiple fact tables are used in data warehouses that address multiple
business functions, such as sales, inventory, and finance. Each business function should have its
own fact table and will probably have some unique dimension tables. Any dimensions that are
common across the business functions must represent the dimension information in the same
way, as discussed under “Dimension Tables” below. Each business function will typically have its
own schema that contains a fact table, several conforming dimension tables, and some dimension
tables unique to the specific business function. Such business-specific schemas may be part of
the central data warehouse or implemented as data marts.

Very large fact tables may be physically partitioned to simplify implementation and maintenance.
The partitions are almost always divided along a single dimension, and the time
dimension is the most common one to use because of the historical nature of most data
warehouse data. If fact tables are partitioned, OLAP cubes are usually partitioned to match the


partitioned fact table segments for ease of maintenance. Partitioned fact tables can be viewed as
one table with an SQL UNION query as long as the number of tables involved does not exceed
the limit for a single query.
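As a rough sketch of viewing partitions as one table, the Python example below (standard sqlite3 module) creates two hypothetical yearly partitions and a view that combines them. The table and view names are invented for illustration, and the sketch uses UNION ALL rather than plain UNION as an assumption: UNION removes duplicate rows, which is normally not wanted when recombining fact rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical yearly partitions of a sales fact table (names are illustrative).
for year in (2023, 2024):
    conn.execute(
        f"CREATE TABLE sales_{year} (time_id TEXT, prod_id INTEGER, amount REAL)"
    )
conn.execute("INSERT INTO sales_2023 VALUES ('2023-12-31', 1, 100.0)")
conn.execute("INSERT INTO sales_2024 VALUES ('2024-01-01', 1, 250.0)")
conn.execute("INSERT INTO sales_2024 VALUES ('2024-01-02', 2, 50.0)")

# The view presents the partitions as a single logical fact table.
# UNION ALL keeps every row; plain UNION would drop duplicate rows.
conn.execute(
    """
    CREATE VIEW sales_all AS
    SELECT * FROM sales_2023
    UNION ALL
    SELECT * FROM sales_2024
    """
)

row_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM sales_all"
).fetchone()
```

Queries can then be written against `sales_all` without knowing how the fact table is split.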

Dimension Tables

A dimension is a structure, often composed of one or more hierarchies, that
categorizes data. Dimensional attributes help to describe the dimensional value. They are
normally descriptive, textual values. Several distinct dimensions, combined with facts, enable
you to answer business questions. Commonly used dimensions are customers, products, and
time. Dimension data is typically collected at the lowest level of detail and then aggregated into
higher-level totals that are more useful for analysis. These natural rollups or aggregations within
a dimension table are called hierarchies.

A dimension table may be used in multiple places if the data warehouse contains multiple fact
tables or contributes data to data marts. For example, a product dimension may be used with a
sales fact table and an inventory fact table in the data warehouse, and also in one or more
departmental data marts. A dimension such as customer, time, or product that is used in multiple
schemas is called a conforming dimension if all copies of the dimension are the same.
Summarization data and reports will not correspond if different schemas use different versions of
a dimension table. Using conforming dimensions is critical to successful data warehouse design.

The records in a dimension table establish one-to-many relationships with the fact table. For
example, there may be a number of sales to a single customer, or a number of sales of a single
product. The dimension table contains attributes associated with the dimension entry; these
attributes are rich and user-oriented textual details, such as product name or customer name and
address. Attributes serve as report labels and query constraints. Attributes that are coded in an
OLTP database should be decoded into descriptions. For example, product category may exist as
a simple integer in the OLTP database, but the dimension table should contain the actual text for
the category. The code may also be carried in the dimension table if needed for maintenance.
This denormalization simplifies and improves the efficiency of queries and simplifies user query
tools. However, if a dimension attribute changes frequently, maintenance may be easier if the
attribute is assigned to its own table to create a snowflake dimension.

Hierarchies: The data in a dimension is usually hierarchical in nature. Hierarchies are
determined by the business need to group and summarize data into usable information. For
example, a time dimension often contains the hierarchy elements: (all time), Year, Quarter,
Month, Day or Week. A dimension may contain multiple hierarchies; a time dimension often
contains both calendar and fiscal year hierarchies. Geography is seldom a dimension of its own;
it is usually a hierarchy that imposes a structure on sales points, customers, or other


geographically distributed dimensions. An example geography hierarchy for sales points is: (all),
country, region, state or district, city, store.

Level relationships specify top-to-bottom ordering of levels from the most general (the root) to the most
specific information. They define the parent-child relationship between the levels in a hierarchy.
Hierarchies are also essential components in enabling more complex query rewrites. For example, the
database can aggregate existing sales revenue on a quarterly basis to a yearly aggregation when
the dimensional dependencies between quarter and year are known.
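The quarter-to-year example can be sketched in Python with the standard sqlite3 module. The tables `time_dim` and `sales_by_quarter` and their contents are hypothetical. The sketch performs by hand the rollup that a database could derive automatically once it knows that every quarter belongs to exactly one year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical time dimension holding the hierarchy levels quarter -> year.
conn.execute("CREATE TABLE time_dim (quarter TEXT PRIMARY KEY, year INTEGER)")
conn.executemany(
    "INSERT INTO time_dim VALUES (?, ?)",
    [("2023-Q3", 2023), ("2023-Q4", 2023), ("2024-Q1", 2024)],
)

# Existing summary table: revenue already aggregated to the quarter level.
conn.execute("CREATE TABLE sales_by_quarter (quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales_by_quarter VALUES (?, ?)",
    [("2023-Q3", 100.0), ("2023-Q4", 150.0), ("2024-Q1", 90.0)],
)

# Each quarter has exactly one parent year, so yearly totals can be computed
# from the quarterly summary instead of rescanning the detailed fact rows.
revenue_by_year = dict(
    conn.execute(
        """
        SELECT t.year, SUM(s.revenue)
        FROM sales_by_quarter AS s
        JOIN time_dim AS t ON t.quarter = s.quarter
        GROUP BY t.year
        """
    )
)
```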

Multi-use dimensions: Sometimes data warehouse design can be simplified by combining a
number of small, unrelated dimensions into a single physical dimension, often called a junk
dimension. This can greatly reduce the size of the fact table by reducing the number of foreign
keys in fact table records. Often the combined dimension will be prepopulated with the Cartesian
product of all dimension values. If the number of discrete values creates a very large table of all
possible value combinations, the table can instead be populated with value combinations as they are
encountered during the load or update process.

A common example of a multi-use dimension is a dimension that contains customer
demographics selected for reporting standardization. Another multi-use dimension might contain
useful textual comments that occur infrequently in the source data records; collecting these
comments in a single dimension removes a sparse text field from the fact table and replaces it
with a compact foreign key.
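Prepopulating a combined (junk) dimension with the Cartesian product of its values can be sketched in a few lines of Python. The three flags and their values are hypothetical, as are the names `junk_dimension` and `junk_key`; the point is only that three would-be foreign keys collapse into one.

```python
from itertools import product

# Hypothetical low-cardinality attributes that would otherwise each need
# their own foreign key column in the fact table.
payment_types = ["cash", "card"]
gift_wrapped = [False, True]
order_channels = ["store", "web", "phone"]

# Prepopulate the combined dimension with the Cartesian product of all values,
# assigning a surrogate key to each combination.
junk_dimension = {
    key: combo
    for key, combo in enumerate(
        product(payment_types, gift_wrapped, order_channels), start=1
    )
}

# Reverse lookup used while loading facts: combination -> surrogate key.
junk_key = {combo: key for key, combo in junk_dimension.items()}

# A fact row now carries a single compact key instead of three columns.
fact_row = {"amount": 19.99, "junk_id": junk_key[("card", True, "web")]}
```

With 2 x 2 x 3 values the dimension has only 12 rows. When the product becomes very large, the same dictionary could instead be filled lazily with combinations as they are encountered during loading.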

4.4 Goals of Data Warehouse Architecture

A data warehouse exists to serve its users—analysts and decision makers. A data warehouse
must be designed to satisfy the following requirements:

• Deliver a great user experience: user acceptance is the measure of success
• Function without interfering with OLTP systems
• Provide a central repository of consistent data
• Answer complex queries quickly
• Provide a variety of powerful analytical tools such as OLAP and data mining

Most successful data warehouses that meet these requirements have these common
characteristics:

• Based on a dimensional model
• Contain historical data
• Include both detailed and summarized data
• Consolidate disparate data from multiple sources while retaining consistency
• Focus on a single subject such as sales, inventory, or finance


Data warehouses are often quite large. However, size is not an architectural goal; it is a
characteristic driven by the amount of data needed to serve the users.

Data Warehouse Users

The success of a data warehouse is measured solely by its acceptance by users. Without
users, historical data might as well be archived to magnetic tape and stored in the
basement. Successful data warehouse design starts with understanding the users and their
needs. Data warehouse users can be divided into four categories: statisticians, knowledge
workers, information consumers, and executives. Each type makes up a portion of the
user population as illustrated in this diagram.

Statisticians: There are typically only a handful of statisticians and operations research
types in any organization. Their work can contribute to closed loop systems that deeply
influence the operations and profitability of the company.

Knowledge Workers: A relatively small number of analysts perform the bulk of new
queries and analyses against the data warehouse. These are the users who get the
Designer or Analyst versions of user access tools. They will figure out how to quantify a
subject area. After a few iterations, their queries and reports typically get published for
the benefit of the Information Consumers. Knowledge Workers are often deeply engaged
with the data warehouse design and place the greatest demands on the ongoing data
warehouse operations team for training and support.

How Users Query the Data Warehouse

Information for users can be extracted from the
data warehouse relational database or from the output of analytical services such as
OLAP or data mining. Direct queries to the data warehouse relational database should be
limited to those that cannot be accomplished through existing tools, which are often more
efficient than direct queries and impose less load on the relational database.

Reporting tools and custom applications often access the database directly. Statisticians
frequently extract data for use by special analytical tools. Analysts may write complex
queries to extract and compile specific information not readily accessible through
existing tools. Information consumers do not interact directly with the relational database
but may receive email reports or access web pages that expose data from the relational
database. Executives use standard reports or ask others to create specialized reports for
them.

When using the Analysis Services tools in SQL Server 2000, statisticians will often
perform data mining, analysts will write MDX queries against OLAP cubes and use data
mining, and information consumers will use interactive reports designed by others.

Design the Relational Database and OLAP Cubes


In this phase, the star or snowflake schema is created in the relational database, surrogate
keys are defined and primary and foreign key relationships are established. Views,
indexes, and fact table partitions are also defined. OLAP cubes are designed that support
the needs of the users.

Keys and Relationships

Tables are implemented in the relational database after
surrogate keys for dimension tables have been defined and primary and foreign keys and
their relationships have been identified. Primary/foreign key relationships should be
established in the database schema.

Integrity constraints provide a mechanism for ensuring that
data conforms to guidelines specified by the database administrator. The most common
types of constraints include:
• UNIQUE constraints: ensure that the values in a given column are unique
• NOT NULL constraints: ensure that no null values are allowed
• FOREIGN KEY constraints: ensure that two keys share a primary key to
foreign key relationship

In a data warehouse, constraints serve two purposes: data cleanliness and query
optimization.

The composite primary key in the fact table is an expensive key to maintain. Its index alone
is almost as large as the fact table. The index on the primary key is
often created as a clustered index. In many scenarios a clustered primary key provides excellent
query performance. However, all other indexes on the fact table use the large clustered index
key. All indexes on the fact table will then be large, the system will require significant additional
storage space, and query performance may degrade. As a result, many star schemas are defined
with an integer surrogate primary key, or no primary key at all. We recommend that the fact
table be defined using the composite primary key. Also create an IDENTITY column in the fact
table that could be used as a unique clustered index, should the database administrator determine
this structure would provide better performance.
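The three constraint types can be tried out with a small Python sketch built on the standard sqlite3 module. SQLite stands in here for the warehouse database purely as an assumption for illustration; the `products` and `sales` tables and the helper `rejected` are hypothetical. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute(
    "CREATE TABLE products ("
    " prod_id INTEGER PRIMARY KEY,"
    " prod_name TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE sales ("
    " prod_id INTEGER NOT NULL REFERENCES products (prod_id),"
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO products VALUES (1, 'widget')")
conn.execute("INSERT INTO sales VALUES (1, 9.50)")


def rejected(sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Each statement breaks exactly one of the constraint types listed above.
unique_violation = rejected("INSERT INTO products VALUES (2, 'widget')")
not_null_violation = rejected("INSERT INTO sales VALUES (1, NULL)")
foreign_key_violation = rejected("INSERT INTO sales VALUES (99, 1.00)")
```

All three inserts are refused, which is the data cleanliness role of constraints: bad rows are stopped at load time instead of surfacing later in reports.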

Indexes

Dimension tables must be indexed on their primary keys, which are the surrogate keys created
for the data warehouse tables. The fact table must have a unique index on the primary key. There
are scenarios where the primary key index should be clustered and other scenarios where it
should not. The larger the number of dimensions in the schema, the less beneficial it is to cluster
the primary key index. With a large number of dimensions, it is usually more effective to create a
unique clustered index on a meaningless IDENTITY column.
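A rough analogue of this arrangement can be sketched in Python with the standard sqlite3 module. SQLite has no IDENTITY columns or explicit clustered indexes, so the sketch assumes `INTEGER PRIMARY KEY AUTOINCREMENT` as a stand-in for a meaningless identity key, with a separate unique index protecting the composite key. The table layout and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        sales_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- stand-in for an IDENTITY column
        prod_id   INTEGER NOT NULL,
        cust_id   INTEGER NOT NULL,
        time_id   TEXT    NOT NULL,
        amount    REAL    NOT NULL
    )
    """
)

# Uniqueness of the composite key is still enforced, but by a separate index.
conn.execute(
    "CREATE UNIQUE INDEX sales_composite_uq ON sales (prod_id, cust_id, time_id)"
)

insert_sql = (
    "INSERT INTO sales (prod_id, cust_id, time_id, amount) VALUES (?, ?, ?, ?)"
)
conn.execute(insert_sql, (1, 10, "2024-01-01", 9.5))

# A second row with the same composite key must be refused.
try:
    conn.execute(insert_sql, (1, 10, "2024-01-01", 3.0))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

index_names = [row[1] for row in conn.execute("PRAGMA index_list('sales')")]
```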

Elaborate initial design and development of index plans for end-user queries is not necessary
with SQL Server 2000, which has sophisticated index techniques and an easy-to-use Index
Tuning Wizard tool to tune indexes to the query workload.


The SQL Server 2000 Index Tuning Wizard allows you to select and create an optimal set of
indexes and statistics for a database without requiring an expert understanding of the structure of
the database, the workload, or the internals of SQL Server. The wizard analyzes a query
workload captured in a SQL Profiler trace or provided by an SQL script, and recommends an
index configuration to improve the performance of the database.

Views

Views should be created for users who need direct access to data in the data warehouse relational
database. Users can be granted access to views without having access to the underlying data.
Indexed views can be used to improve performance of user queries that access data through
views. View definitions should create column and table names that will make sense to business
users.
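A view with business-friendly column names might look like the following Python sketch (standard sqlite3 module). The `sales` table reuses a few columns from the earlier definition, and the view name `product_profit` and its column aliases are hypothetical. SQLite has no GRANT statement, so the access-control aspect is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (prod_id INTEGER, cust_id INTEGER, amount REAL, cost REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, 10, 100.0, 60.0), (1, 11, 50.0, 30.0), (2, 10, 20.0, 15.0)],
)

# The view hides the physical column names behind names that make sense
# to business users and precomputes a commonly requested measure.
conn.execute(
    """
    CREATE VIEW product_profit AS
    SELECT prod_id            AS product_number,
           SUM(amount)        AS total_sales,
           SUM(amount - cost) AS total_profit
    FROM sales
    GROUP BY prod_id
    """
)

cursor = conn.execute("SELECT * FROM product_profit ORDER BY product_number")
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
```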

Data Warehousing Schemas

A schema is a collection of database objects, including tables, views, indexes, and synonyms.
You can arrange schema objects in the schema models designed for data warehousing in a
variety of ways. Most data warehouses use a dimensional model. The model of your source data
and the requirements of your users help you design the data warehouse schema. You can
sometimes get the source model from your company’s enterprise data model and reverse
engineer the logical data model for the data warehouse from this. The physical implementation
of the logical data warehouse model may require some changes to adapt it to your system
parameters—size of machine, number of users, storage capacity, type of network, and software.

Dimensional Model Schemas

The principal characteristic of a dimensional model is a set of detailed business facts surrounded
by multiple dimensions that describe those facts. When realized in a database, the schema for a
dimensional model contains a central fact table and multiple dimension tables. A dimensional
model may produce a star schema or a snowflake schema.

Star Schemas

A schema is called a star schema if all dimension tables can be joined directly to the fact table.
The following diagram shows a classic star schema. In the star schema design, a single object
(the fact table) sits in the middle and is radially connected to other surrounding objects
(dimension lookup tables) like a star. A star schema can be simple or complex. A simple star
consists of one fact table; a complex star can have more than one fact table.
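A simple star can be sketched with one fact table and two dimension tables. The Python example below uses the standard sqlite3 module; the table contents and the city names are hypothetical. Each dimension is reached with a single join from the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products  (prod_id INTEGER PRIMARY KEY, prod_name TEXT);
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE sales (
        prod_id INTEGER REFERENCES products (prod_id),
        cust_id INTEGER REFERENCES customers (cust_id),
        amount  REAL
    );
    INSERT INTO products  VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO customers VALUES (10, 'CityA'), (11, 'CityB');
    INSERT INTO sales     VALUES (1, 10, 100.0), (2, 10, 40.0), (1, 11, 25.0);
    """
)

# Every dimension joins directly to the fact table: one join per dimension.
sales_by_city_and_product = conn.execute(
    """
    SELECT c.city, p.prod_name, SUM(s.amount)
    FROM sales AS s
    JOIN products  AS p ON p.prod_id = s.prod_id
    JOIN customers AS c ON c.cust_id = s.cust_id
    GROUP BY c.city, p.prod_name
    ORDER BY c.city, p.prod_name
    """
).fetchall()
```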

Hierarchy


A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be
used to define data aggregation; for example, in a time dimension, a hierarchy might be used to
aggregate data from the month level to the quarter level, from the quarter level to the year level.
A hierarchy can also be used to define a navigational drill path, regardless of whether the levels
in the hierarchy represent aggregated totals or not.

Level

A position in a hierarchy. For example, a time dimension might have a hierarchy that represents
data at the month, quarter, and year levels.

Fact Table

A table in a star schema that contains facts and is connected to dimensions. A fact table typically
has two types of columns: those that contain facts and those that are foreign keys to dimension
tables. The primary key of a fact table is usually a composite key that is made up of all of its
foreign keys. A fact table might contain either detail level facts or facts that have been
aggregated (fact tables that contain aggregated facts are often instead called summary tables). A
fact table usually contains facts with the same level of aggregation.

Snowflake Schemas

A schema is called a snowflake schema if one or more dimension tables do not join directly to
the fact table but must join through other dimension tables. For example, a dimension that
describes products may be separated into three tables (snowflaked).

The snowflake schema is an extension of the star schema where each point of the star explodes
into more points. The main advantage of the snowflake schema is reduced disk storage, since
normalized lookup tables are smaller; this can improve query performance when the smaller lookup
tables are joined, although the additional joins can also slow queries down. The
main disadvantage of the snowflake schema is the additional maintenance effort needed due to
the increased number of lookup tables.
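The product dimension separated into three tables could look like the Python sketch below (standard sqlite3 module). The split into products, subcategories, and categories is an assumed example, as are all names and values. Reaching the category level from the fact table takes a chain of joins through the other dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Product dimension snowflaked into three tables (names are illustrative).
    CREATE TABLE categories    (cat_id INTEGER PRIMARY KEY, cat_name TEXT);
    CREATE TABLE subcategories (sub_id INTEGER PRIMARY KEY, sub_name TEXT,
                                cat_id INTEGER REFERENCES categories (cat_id));
    CREATE TABLE products      (prod_id INTEGER PRIMARY KEY, prod_name TEXT,
                                sub_id INTEGER REFERENCES subcategories (sub_id));
    CREATE TABLE sales         (prod_id INTEGER REFERENCES products (prod_id),
                                amount REAL);
    INSERT INTO categories    VALUES (1, 'Electronics');
    INSERT INTO subcategories VALUES (1, 'Audio', 1), (2, 'Video', 1);
    INSERT INTO products      VALUES (1, 'headphones', 1), (2, 'projector', 2);
    INSERT INTO sales         VALUES (1, 30.0), (2, 400.0), (1, 45.0);
    """
)

# Only the products table joins directly to the fact table; the category
# is reached through subcategories, which is what makes this a snowflake.
sales_by_category = conn.execute(
    """
    SELECT cat.cat_name, SUM(s.amount)
    FROM sales AS s
    JOIN products      AS p   ON p.prod_id  = s.prod_id
    JOIN subcategories AS sub ON sub.sub_id = p.sub_id
    JOIN categories    AS cat ON cat.cat_id = sub.cat_id
    GROUP BY cat.cat_name
    """
).fetchall()
```

In a star schema the same query would need a single join, because category and subcategory names would be stored as columns of the product dimension table itself.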

Important Aspects of Star Schema & Snowflake Schema

• In a star schema, every dimension will have a primary key.
• In a star schema, a dimension table will not have any parent table.
• In a snowflake schema, by contrast, a dimension table will have one or more parent tables.
• In a star schema, hierarchies for the dimensions are stored in the dimension table itself.


• In a snowflake schema, hierarchies are broken into separate tables. These hierarchies
help to drill down the data from the topmost level to the lowermost.
