06 Data Warehouse Design and Analytics

The document discusses data warehouse design and data analytics, emphasizing the importance of processing data to derive insights for business decisions. It covers the steps involved in data analytics, including data integration, ETL processes, and the use of OLAP for interactive querying. Additionally, it explains the structure of data warehouses, including schemas, data cleansing, and the design of star schemas for effective data analysis.

Data Warehouse Design and Data Analytics

Data Analytics

• Overview
• Data Warehousing
• Online Analytical Processing

Overview

• Data analytics: the processing of data to infer patterns, correlations, or models for prediction
• Primarily used to make business decisions
  • Per individual customer
    • E.g., what product to suggest for purchase (data mining)
  • Across all customers
    • E.g., what products to manufacture/stock, and in what quantity (decision support systems, DSS)
• Critical for businesses today

Overview (Cont.)

• Common steps in data analytics
  • Gather data from multiple sources into one location
    • Data warehouses also integrate data into a common schema
    • Data often needs to be extracted from source formats, transformed to a common schema, and loaded into the data warehouse
      • Can be done as ETL (extract-transform-load) or ELT (extract-load-transform)

Overview (Cont.)

• Generate aggregates and reports summarizing data
  • Dashboards showing graphical charts/reports
  • Online analytical processing (OLAP) systems allow interactive querying
  • Statistical analysis using tools such as R/SAS/SPSS
    • Including extensions for parallel processing of big data
• Build predictive models and use the models for decision making

Overview (Cont.)

• Predictive models are widely used today
  • E.g., use customer profile features (e.g., income, age, gender, education, employment) and the past history of a customer to predict the likelihood of default on a loan
    • And use the prediction to make the loan decision
  • E.g., use past history of sales (by season) to predict future sales
    • And use it to decide what/how much to produce/stock
    • And to target customers

Overview (Cont.)

• Other examples of business decisions:
  • What items to stock?
  • What insurance premium to charge?
  • To whom to send advertisements?

Overview (Cont.)

• Machine learning techniques are key to finding patterns in data and making predictions
• Data mining extends techniques developed by machine-learning communities to run them on very large datasets
• The term business intelligence (BI) is a synonym for data analytics
• The term decision support focuses on reporting and aggregation

Data Integration From Multiple Sources

Many database applications require data from multiple databases.

A federated database system is a software layer on top of existing database systems, which is designed to manipulate information in heterogeneous databases.

It creates an illusion of logical database integration without any physical database integration.

Data Integration From Multiple Sources

• Each database has its local schema
• A global schema integrates all the local schemas
  o Schema integration
• Queries can be issued against the global schema and translated to queries on the local schemas

Data Integration From Multiple Sources

A wrapper for a data source is a view that translates data from the local schema to a global schema.
Wrappers must also translate updates on the global schema to updates on the local schema.

Data Integration From Multiple Sources

Databases that support a common schema and queries, but not updates, are referred to as mediator systems.

Data Warehouse Concepts

A data warehouse is an alternative to data integration.
It migrates data to a common schema, avoiding run-time overhead.
The cost of translating schema/data to a common warehouse schema can be significant.

ETL is a process in data warehousing; it stands for Extract (E), Transform (T), and Load (L).

In this process, an ETL tool extracts the data from various data source systems, transforms it in the staging area, and finally loads it into the data warehouse system.

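The extract, transform, and load steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, their column names (`Item`, `Qty`, `SoldOn`), the day/month/year date format, and the warehouse table layout are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export with its own column names and date format.
csv_source = io.StringIO(
    "Item,Qty,SoldOn\n"
    " shirt ,2,03/01/2024\n"
    "pants,1,04/01/2024\n"
)

# Hypothetical source 2: rows from another system, already parsed into dicts.
app_source = [
    {"item_name": "Shirt", "quantity": "3", "date": "2024-01-05"},
]

def transform_csv_row(row):
    """Map the CSV layout to the common schema and cleanse minor inconsistencies."""
    day, month, year = row["SoldOn"].split("/")  # assumed day/month/year
    return (row["Item"].strip().lower(), int(row["Qty"]), f"{year}-{month}-{day}")

def transform_app_row(row):
    """Map the second source to the same common schema."""
    return (row["item_name"].strip().lower(), int(row["quantity"]), row["date"])

# Extract and transform into a staging list.
staging = [transform_csv_row(r) for r in csv.DictReader(csv_source)]
staging += [transform_app_row(r) for r in app_source]

# Load the staged rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (item_name TEXT, quantity INTEGER, sale_date TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", staging)

rows = warehouse.execute(
    "SELECT item_name, SUM(quantity) FROM sales GROUP BY item_name ORDER BY item_name"
).fetchall()
print(rows)
```

Because both sources are cleansed to the same spelling of `item_name` before loading, the final query can aggregate across them without any per-source logic.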
Data Warehouse Concepts

A data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site.

Once gathered, the data are stored for a long time, permitting access to historical data.

Thus, data warehouses provide the user a single consolidated interface to data, making decision-support queries easier to write.

Data Warehouse Concepts

• What schema to use. Data sources that have been constructed independently are likely to have different schemas. In fact, they may even use different data models. Part of the task of a warehouse is to perform schema integration and to convert data to the integrated schema before they are stored.
• Data cleansing. The task of correcting and preprocessing data is called data cleansing. Data sources often deliver data with numerous minor inconsistencies, which can be corrected.
• Data transformation. Transformation of the host format to the warehouse format.

Data Warehouse Concepts
Multidimensional Data and Warehouse Schemas

• Data warehouses typically have schemas that are designed for data analysis, using tools such as OLAP tools.
• The relations in a data warehouse schema can usually be classified as fact tables and dimension tables.
• Fact tables record information about individual events, such as sales, and are usually very large.
• A table recording sales information for a retail store, with one tuple for each item that is sold, is a typical example of a fact table.

Data Warehouse Concepts
Multidimensional Data and Warehouse Schemas

The attributes in a fact table can be classified as either dimension attributes or measure attributes.

The measure attributes store quantitative information, which can be aggregated upon; the measure attributes of a sales table would include the number of items sold and the price of the items.

In contrast, dimension attributes are dimensions upon which measure attributes, and summaries of measure attributes, are grouped and viewed.

Data Warehouse Concepts
Multidimensional Data and Warehouse Schemas

The dimension attributes of a sales table would include an item identifier, the date when the item is sold, which location (store) the item was sold from, the customer who bought the item, and so on.

Data that can be modeled using dimension attributes and measure attributes are called multidimensional data.

To minimize storage requirements, dimension attributes are usually short identifiers that are foreign keys into other tables called dimension tables.

Designing Star Schema

A fact table sales would have dimension attributes item_id, store_id, customer_id, and date, and measure attributes number and price.

• The attribute store_id is a foreign key into a dimension table store, which has other attributes such as store location (city, state, country).
• The item_id attribute of the sales table would be a foreign key into a dimension table item_info, which would contain information such as the name of the item, the category to which the item belongs, and other item details such as color and size.
• The customer_id attribute would be a foreign key into a customer table containing attributes such as the name and address of the customer.
• We can also view the date attribute as a foreign key into a date_info table giving the month, quarter, and year of each date.

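The star schema described above can be sketched with SQLite from the Python standard library. The table and key names follow the slide; the column types, the `size` column name, and all sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables
CREATE TABLE item_info (item_id INTEGER PRIMARY KEY, item_name TEXT, category TEXT, color TEXT, size TEXT);
CREATE TABLE store     (store_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE customer  (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE date_info (date TEXT PRIMARY KEY, month INTEGER, quarter INTEGER, year INTEGER);
-- Fact table: dimension attributes are foreign keys, measures are number and price.
CREATE TABLE sales (
    item_id     INTEGER REFERENCES item_info(item_id),
    store_id    INTEGER REFERENCES store(store_id),
    customer_id INTEGER REFERENCES customer(customer_id),
    date        TEXT    REFERENCES date_info(date),
    number      INTEGER,
    price       REAL
);
-- Hypothetical sample data
INSERT INTO item_info VALUES (1, 'shirt', 'clothing', 'white', 'medium'),
                             (2, 'pants', 'clothing', 'dark', 'large');
INSERT INTO store VALUES (10, 'Dhaka', 'Dhaka', 'Bangladesh'), (20, 'Austin', 'Texas', 'USA');
INSERT INTO customer VALUES (100, 'customer one', 'address one');
INSERT INTO date_info VALUES ('2024-01-15', 1, 1, 2024), ('2024-04-02', 4, 2, 2024);
INSERT INTO sales VALUES (1, 10, 100, '2024-01-15', 2, 15.0),
                         (2, 10, 100, '2024-04-02', 1, 30.0),
                         (1, 20, 100, '2024-04-02', 3, 15.0);
""")

# Aggregate a measure, grouped by attributes that live in two dimension tables.
by_country_quarter = conn.execute("""
    SELECT st.country, d.quarter, SUM(s.number) AS items_sold
    FROM sales s
    JOIN store st    ON s.store_id = st.store_id
    JOIN date_info d ON s.date = d.date
    GROUP BY st.country, d.quarter
    ORDER BY st.country, d.quarter
""").fetchall()
print(by_country_quarter)
```

The fact table stays narrow (short identifiers plus measures), while descriptive attributes such as country and quarter are looked up through the foreign keys only when a report needs them.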
Designing Star Schema

Design a star schema for the National Board of Revenue.

Measure attributes: tax amount, income

Dimensions:
• Tax payer
• Tax collector
• Time
• Source

Privacy-Preserved National Clinical Data Warehouse Architecture

Multidimensional Data Modeling and Analysis

[Figure: data model and CUBE hierarchy in disease analysis]

Data Analysis and OLAP

• Online Analytical Processing (OLAP)
  • Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay)
• We use the following relation to illustrate OLAP concepts
  • sales (item_name, color, clothes_size, quantity)
• This is a simplified version of the sales fact table joined with the dimension tables, with many attributes removed (and some renamed)

Example sales relation

How can we get the table Sales-item (item_name, color, clothes_size, quantity) from the star schema?

Answer:
Select item_name, color, clothes_size, number as quantity
From item_info a, sales b ... ... ...
Where a.item_id = b.item_id ... ... ...

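A minimal runnable version of this join, restricted to the item dimension, might look as follows. The view name `sales_item`, the reduced table layouts (including a `clothes_size` column in `item_info`), and the sample rows are assumptions for illustration; the parts the slide elides with "..." are simply left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_info (item_id INTEGER PRIMARY KEY, item_name TEXT, color TEXT, clothes_size TEXT);
CREATE TABLE sales (item_id INTEGER REFERENCES item_info(item_id), number INTEGER);
INSERT INTO item_info VALUES (1, 'shirt', 'white', 'small'), (2, 'shirt', 'dark', 'medium');
INSERT INTO sales VALUES (1, 2), (2, 5), (1, 1);
-- Join the fact table with the item dimension and rename number to quantity.
CREATE VIEW sales_item AS
    SELECT a.item_name, a.color, a.clothes_size, b.number AS quantity
    FROM item_info a JOIN sales b ON a.item_id = b.item_id;
""")

flat = conn.execute("SELECT * FROM sales_item ORDER BY color, quantity").fetchall()
for row in flat:
    print(row)
```

Each row of the fact table yields one row in the flattened relation, so aggregates over `quantity` give the same totals as aggregates over `number` in the star schema.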
Cross Tabulation of sales by item_name and color

How can you find the cross-tab of sales? Write SQL to find the cross-tab.

SQL for clothes_size (all sizes, i.e., the grand total):
Select sum(quantity) from sales

SQL for total of item_name:
Select item_name, sum(quantity)
From sales
Group by item_name

SQL for total of color:
Select color, sum(quantity)
From sales
Group by color

SQL for other cells:
Select item_name, color, sum(quantity)
From sales
Group by item_name, color

• The table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table.
• Values for one of the dimension attributes form the row headers
• Values for another dimension attribute form the column headers
• Other dimension attributes are listed on top
• Values in individual cells are (aggregates of) the values of the dimension attributes that specify the cell.

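The four queries can be combined into one cross-tab in a short script. This is a sketch only: the sample rows are invented, and assembling the grid in Python is just one of several ways to lay out the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item_name TEXT, color TEXT, clothes_size TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("shirt", "dark", "small", 2), ("shirt", "dark", "large", 4),
    ("shirt", "white", "small", 3), ("pants", "dark", "medium", 5),
])

colors = [r[0] for r in conn.execute("SELECT DISTINCT color FROM sales ORDER BY color")]

# Inner cells: one SUM per (item_name, color) pair.
cells = {(item, color): qty for item, color, qty in conn.execute(
    "SELECT item_name, color, SUM(quantity) FROM sales GROUP BY item_name, color")}
# Row totals, column totals, and the grand total.
row_total = dict(conn.execute("SELECT item_name, SUM(quantity) FROM sales GROUP BY item_name"))
col_total = dict(conn.execute("SELECT color, SUM(quantity) FROM sales GROUP BY color"))
grand_total = conn.execute("SELECT SUM(quantity) FROM sales").fetchone()[0]

# Assemble the cross-tab: one row per item_name plus a final "total" row.
crosstab = [[item] + [cells.get((item, c), 0) for c in colors] + [row_total[item]]
            for item in sorted(row_total)]
crosstab.append(["total"] + [col_total[c] for c in colors] + [grand_total])

print(["item_name"] + colors + ["total"])
for row in crosstab:
    print(row)
```

Combinations with no sales (here pants in white) do not appear in the GROUP BY result at all, so the script fills those cells with 0 explicitly.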
Cross Tabulation of sales by item_name and color (Cont.)

Question:
a. Write the cross-tabulation structure of sales by clothes_size and item_name.
b. Write SQL to find the cross-tab.

Data Cube

• A data cube is a multidimensional generalization of a cross-tab
• Can have n dimensions; we show 3 below
• Cross-tabs can be used as views on a data cube

Hierarchies on Dimensions

• Hierarchy on dimension attributes: lets dimensions be viewed at different levels of detail
  • E.g., the dimension datetime can be used to aggregate by hour of day, date, day of week, month, quarter or year

Hierarchies on Dimensions

How can you prepare DSS reports based on a hierarchy?

Report on date (year, quarter, month, date wise report)
R1 = select year, quarter, month, d.date, sum(quantity) as tot_d
     from sales s, date_info d
     where s.date = d.date
     group by year, quarter, month, d.date

Report on month (year, quarter, month wise report)
R2 = select year, quarter, month, sum(tot_d) as tot_m
     from R1
     group by year, quarter, month

Report on quarter (year, quarter wise report)
R3 = select year, quarter, sum(tot_m) as tot_q
     from R2
     group by year, quarter

Report on year (year wise report)
R4 = select year, sum(tot_q) as tot_y
     from R3
     group by year

** R1, R2, R3, and R4 may be stored in the database as materialized views

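The chain R1 to R4 can be tried out with SQLite. Two caveats: SQLite has no materialized views, so ordinary views stand in for them here, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_info (date TEXT PRIMARY KEY, month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE sales (date TEXT REFERENCES date_info(date), quantity INTEGER);
INSERT INTO date_info VALUES ('2023-12-31', 12, 4, 2023), ('2024-01-15', 1, 1, 2024),
                             ('2024-02-01', 2, 1, 2024), ('2024-04-02', 4, 2, 2024);
INSERT INTO sales VALUES ('2023-12-31', 7), ('2024-01-15', 2), ('2024-01-15', 3),
                         ('2024-02-01', 4), ('2024-04-02', 1);
-- Each level is computed from the level below it, not from the fact table again.
CREATE VIEW R1 AS
    SELECT year, quarter, month, d.date AS date, SUM(quantity) AS tot_d
    FROM sales s JOIN date_info d ON s.date = d.date
    GROUP BY year, quarter, month, d.date;
CREATE VIEW R2 AS
    SELECT year, quarter, month, SUM(tot_d) AS tot_m FROM R1 GROUP BY year, quarter, month;
CREATE VIEW R3 AS
    SELECT year, quarter, SUM(tot_m) AS tot_q FROM R2 GROUP BY year, quarter;
CREATE VIEW R4 AS
    SELECT year, SUM(tot_q) AS tot_y FROM R3 GROUP BY year;
""")

r2 = conn.execute("SELECT * FROM R2 ORDER BY year, month").fetchall()
r3 = conn.execute("SELECT * FROM R3 ORDER BY year, quarter").fetchall()
r4 = conn.execute("SELECT * FROM R4 ORDER BY year").fetchall()
print(r2, r3, r4, sep="\n")
```

Summing sums is valid because SUM is distributive; the same layering would not work directly for an aggregate such as AVG, which would need a sum and a count carried up from the lower level.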
Hierarchies on Dimensions

How can you prepare DSS reports based on a hierarchy?

Report on date (assuming sales already carries year, quarter, and month):
R1 = select year, quarter, month, date, sum(quantity) as tot_d
     from sales
     group by year, quarter, month, date

R2 = select year, quarter, month, sum(tot_d) as tot_m
     from R1
     group by year, quarter, month

R3 = select year, quarter, sum(tot_m) as tot_q
     from R2
     group by year, quarter

R4 = select year, sum(tot_q) as tot_y
     from R3
     group by year

** R1, R2, R3, and R4 may be stored in the database as materialized views

Question: Find all DSS reports on the location hierarchy.

Data Warehouse Queries

Rollup

• ROLLUP is an extension of the GROUP BY clause.
• The ROLLUP option allows you to include extra rows that represent the subtotals, which are commonly referred to as super-aggregate rows, along with the grand total row.
• By using the ROLLUP option, you can use a single query to generate multiple grouping sets.

Data Warehouse Queries

Rollup

• ROLLUP example:

SELECT c1, c2, aggregate_function(c3)
FROM table_name
GROUP BY ROLLUP (c1, c2);

In the syntax above, ROLLUP(c1, c2) generates the following three grouping sets:

(c1, c2)
(c1)
()

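To see what the three grouping sets produce, the ROLLUP can be spelled out by hand. SQLite does not support GROUP BY ROLLUP, so this sketch emulates it with UNION ALL; the table and sample rows are assumptions, and a database with native ROLLUP would return the same set of rows from the single statement shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item_name TEXT, color TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("shirt", "dark", 6), ("shirt", "white", 3), ("pants", "dark", 5),
])

# ROLLUP(item_name, color) = grouping sets (item_name, color), (item_name), ().
# NULL marks a column that has been rolled up, as ROLLUP itself reports it.
rollup_rows = conn.execute("""
    SELECT item_name, color, SUM(quantity) FROM sales GROUP BY item_name, color
    UNION ALL
    SELECT item_name, NULL, SUM(quantity) FROM sales GROUP BY item_name
    UNION ALL
    SELECT NULL, NULL, SUM(quantity) FROM sales
""").fetchall()

for row in rollup_rows:
    print(row)
```

Note that there is no (color) grouping set: ROLLUP only drops columns from right to left, which is what makes it a natural fit for hierarchies such as year, quarter, month.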
Data Warehouse Queries

CUBE

• Similar to ROLLUP, CUBE is an extension of the GROUP BY clause.
• CUBE allows you to generate subtotals like the ROLLUP extension.
• In addition, the CUBE extension will generate subtotals for all combinations of grouping columns specified in the GROUP BY clause.

Data Warehouse Queries

CUBE

• CUBE example:

SELECT c1, c2, AGGREGATE_FUNCTION(c3)
FROM table_name
GROUP BY CUBE(c1, c2);

In this syntax, two columns are specified in the CUBE.

The statement creates four subtotal combinations (grouping sets): (c1, c2), (c1), (c2), and ().

Generally, if you have n columns listed in the CUBE, the statement will create 2^n subtotal combinations.

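The 2^n count can be checked by enumerating the grouping sets directly. The helper names `cube_grouping_sets` and `cube` below are hypothetical, the sample data is invented, and the emulation (one GROUP BY query per subset) is a sketch for databases such as SQLite that lack a native CUBE. The table and column names are interpolated into the SQL text, so they must come from trusted code, not from user input.

```python
import sqlite3
from itertools import combinations

def cube_grouping_sets(columns):
    """Return every subset of columns, i.e. the 2**n grouping sets of CUBE."""
    return [subset
            for size in range(len(columns), -1, -1)
            for subset in combinations(columns, size)]

def cube(conn, table, columns, measure):
    """Emulate GROUP BY CUBE(columns) with one GROUP BY query per grouping set."""
    rows = []
    for subset in cube_grouping_sets(columns):
        # Columns outside the current grouping set are reported as NULL.
        select_list = ", ".join(c if c in subset else "NULL" for c in columns)
        group_by = f" GROUP BY {', '.join(subset)}" if subset else ""
        rows += conn.execute(
            f"SELECT {select_list}, SUM({measure}) FROM {table}{group_by}").fetchall()
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item_name TEXT, color TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("shirt", "dark", 6), ("shirt", "white", 3), ("pants", "dark", 5),
])

grouping_sets = cube_grouping_sets(["item_name", "color"])
cube_rows = cube(conn, "sales", ["item_name", "color"], "quantity")
print(grouping_sets)
for row in cube_rows:
    print(row)
```

Compared with ROLLUP on the same two columns, the only extra grouping set is (color), which supplies the per-color totals needed for the column totals of a cross-tab.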
PIVOT Table

To create a pivot table, you first need to have a table with your source data.
Let’s say you have a table called “sales” with columns for the date, product,
and sales amount.

To create a pivot table, you would use the following SQL statement:

SELECT *
FROM sales
PIVOT (
SUM(sales_amount)
FOR product IN ('Product A', 'Product B', 'Product C')
);
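PIVOT is not part of standard SQL, and its exact syntax differs between the database systems that offer it. A portable way to get the same shape is conditional aggregation, sketched here with SQLite; the sample rows are invented, and grouping by date is an assumption about which column should remain as the row header.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, product TEXT, sales_amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2024-01-01", "Product A", 10.0), ("2024-01-01", "Product B", 20.0),
    ("2024-01-02", "Product A", 5.0), ("2024-01-02", "Product C", 7.5),
    ("2024-01-02", "Product A", 2.5),
])

products = ["Product A", "Product B", "Product C"]
# One SUM(CASE ...) column per product; the product names are bound as parameters.
pivot_columns = ", ".join(
    "SUM(CASE WHEN product = ? THEN sales_amount END)" for _ in products)
pivot_rows = conn.execute(
    f"SELECT date, {pivot_columns} FROM sales GROUP BY date ORDER BY date",
    products).fetchall()

print(["date"] + products)
for row in pivot_rows:
    print(row)
```

A product with no sales on a given date shows up as NULL (None in Python) rather than 0, because SUM over no non-NULL values is NULL.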
