Data Warehousing Fundamentals for Data Engineering

Data Warehousing
Prepared by: Mohamemd Sayeeduddin

What is a data warehouse?
 A data warehouse (DW) is a repository that stores relational data that is organized,
cleansed, and standardized for enterprise use. A data warehouse is organized into
subject-oriented databases and is non-volatile, in direct support of decision support
system (DSS) functionality. It holds strategically selected data that is important to
an enterprise for historical tracking, reporting, and analysis.
 A data warehouse has the following characteristics:
 Subject-oriented: data is theme- or object-based (e.g., customer, product, sales)
 Integrated: disparate data from source systems is combined and made consistent
 Time-variant: data is organized by various time intervals for historical reporting and
preservation (e.g., week, month, quarter, year)
 Non-volatile: once loaded, data is not changed or deleted; it is read-only for users
and is refreshed at well-defined time intervals
 Summarized: data is often aggregated to optimize reporting
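
As a rough illustration of the time-variant, non-volatile, and summarized characteristics above, here is a minimal Python sketch. The `warehouse` list, the `load_batch` and `monthly_totals` helpers, and the sample rows are all hypothetical stand-ins, not part of any real warehouse product.

```python
from collections import defaultdict
from datetime import date

# Hypothetical warehouse rows: (snapshot_date, customer, amount).
# Rows are only ever appended, never updated or deleted (non-volatile).
warehouse = []

def load_batch(snapshot_date, rows):
    """Append one batch of rows stamped with its snapshot date (time-variant)."""
    for customer, amount in rows:
        warehouse.append((snapshot_date, customer, amount))

def monthly_totals():
    """Aggregate amounts per (year, month) for reporting (summarized)."""
    totals = defaultdict(float)
    for snapshot_date, _customer, amount in warehouse:
        totals[(snapshot_date.year, snapshot_date.month)] += amount
    return dict(totals)

# Two refreshes at well-defined intervals (month end, in this made-up example).
load_batch(date(2024, 1, 31), [("acme", 100.0), ("globex", 50.0)])
load_batch(date(2024, 2, 29), [("acme", 120.0)])
summary = monthly_totals()
```

Each refresh adds rows rather than overwriting earlier ones, so the January figures remain available after the February load.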

Data Warehousing Architectures
Data warehouses can be architected using varying approaches. There are two
primary approaches: the dimensional approach (popularized by Ralph Kimball)
and the normalized approach (popularized by Bill Inmon).
 Normalized Approach
Inmon’s approach uses a “top-down” design built on a normalized data model.
The normalized enterprise data model creates a central repository, the
enterprise data warehouse (EDW). Dimensional data marts for specific departments
or organizational units can then be created from the master enterprise data warehouse.

 Dimensional Approach
Kimball’s approach depicts a data warehouse via a dimensional model (star
schema or snowflake schema). The dimensional approach uses a “bottom-up” design,
in which individual data marts are created at the departmental or organizational
level (e.g., sales, human resources, finance) and built up to an enterprise
data warehouse (EDW). Kimball’s approach is often favored because business
users can get value from the individual data marts quickly.

Extract, Transform, Load (ETL)
 Extract, transform, load (ETL) is the data integration process that combines
disparate data from source operational or transactional systems into a single
format in a central repository. Source data is extracted from transactional
systems; transformed for normalization, formatting, and error correction; and
loaded into the data warehouse for analytics and reporting.
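
The three ETL steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the two in-memory source lists, the `COUNTRY_CODES` mapping, and the `dim_customer` table are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from two different source systems.
crm_rows = [{"id": "1", "name": " Alice ", "country": "us"}]
billing_rows = [{"cust_id": 2, "full_name": "BOB", "country": "USA"}]

# Assumed lookup used to standardize country values during the transform step.
COUNTRY_CODES = {"us": "US", "usa": "US"}

def extract():
    """Extract: collect raw rows from each source (in-memory stand-ins here)."""
    return crm_rows, billing_rows

def transform(crm, billing):
    """Transform: map both sources onto one format and clean up the values."""
    unified = []
    for r in crm:
        unified.append((int(r["id"]), r["name"].strip().title(),
                        COUNTRY_CODES.get(r["country"].lower(), "UNKNOWN")))
    for r in billing:
        unified.append((int(r["cust_id"]), r["full_name"].strip().title(),
                        COUNTRY_CODES.get(r["country"].lower(), "UNKNOWN")))
    return unified

def load(conn, rows):
    """Load: write the unified rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(conn, transform(*extract()))
loaded = conn.execute(
    "SELECT customer_id, name, country FROM dim_customer ORDER BY customer_id"
).fetchall()
```

Both sources end up in the same column layout with consistent name casing and country codes, which is the "single format" the ETL description refers to.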

Data Mart
A data mart is a subset of an enterprise data warehouse and is often referred to
as a “departmental data warehouse.” A data mart contains the same type of
information that exists in an enterprise data warehouse, but the data is
organized and optimized for a specific department or organizational unit. The
diagram in Figure 2 provides a high-level architecture of data warehousing and
shows how data marts fit into this architecture.
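
One simple way to picture a data mart as a departmental subset is a view over an enterprise table. The sketch below assumes a hypothetical `edw_transactions` table and a `sales_mart` view in an in-memory SQLite database; real data marts are often separate, physically optimized stores rather than views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical enterprise-wide table holding data for every department.
CREATE TABLE edw_transactions (department TEXT, item TEXT, amount REAL);
INSERT INTO edw_transactions VALUES
    ('sales', 'widget', 20.0),
    ('sales', 'gadget', 35.0),
    ('finance', 'audit fee', 500.0);

-- The sales data mart exposes only the rows and columns that department needs.
CREATE VIEW sales_mart AS
    SELECT item, amount FROM edw_transactions WHERE department = 'sales';
""")

sales_rows = conn.execute(
    "SELECT item, amount FROM sales_mart ORDER BY item"
).fetchall()
```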

Operational Data Stores
 An operational data store (ODS) uses snapshots of data from operational or
transactional systems to provide operational business reporting. An ODS
differs from a data warehouse in that its data is accessed directly from the
transactional system databases, and the ODS is able to write data back to
the source systems. A primary purpose of an operational data store is to
deal with the complexities of maintaining up-to-date data in the data
warehouse. Thus, the ODS can be seen as a less expensive approach to
real-time data reporting.

Data Warehousing in the Cloud
 Data warehouses traditionally exist inside an organization’s local
infrastructure (on-premises), where the responsibility for configuration and
maintenance lies solely with the organization’s information technology (IT)
staff. Data warehousing in the cloud shifts much of the responsibility
for hardware, networking, security, and maintenance to a third party, which
allows the organization to focus more on business goals and objectives. This
approach can also give users (who are often remote or mobile) a higher, more
consistent level of data warehouse availability.

Star Schema
 A star schema is a model that depicts data in a shape similar to that of a star.
A fact table sits at the center of the star and contains foreign keys to the
associated dimension tables (which often together form its primary key), as
well as measures, often aggregated, from the operational or transactional
systems. The dimension tables describe the data and are included based on
business needs. The dimension tables in a star schema are not normalized,
which keeps the model simple and avoids the need for complex joins.
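
A star schema can be sketched with one fact table and two dimension tables. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and sample rows below are hypothetical, and SQLite is used only as a convenient stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: denormalized descriptions of the facts.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: foreign keys to each dimension plus numeric measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240101, 2, 20.0), (2, 20240101, 1, 35.0),
    (3, 20240201, 4, 40.0), (1, 20240201, 1, 10.0);
""")

# Each dimension is exactly one join away from the fact table.
query = """
SELECT p.category, d.month, SUM(f.revenue)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date d ON d.date_id = f.date_id
GROUP BY p.category, d.month
ORDER BY p.category, d.month
"""
revenue_by_category_month = conn.execute(query).fetchall()
```

Because `category` is stored directly on `dim_product`, the report needs only one join per dimension.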

Snowflake Schema
 The snowflake schema design contains the same data that would exist in a
star schema, and the fact table looks the same. The main difference between
the two is that the dimension tables in a snowflake schema are normalized
into additional related tables. The process of normalizing the design is
referred to as snowflaking. The snowflake schema also requires less work to
add more data to existing dimensions and requires less storage because
normalization removes redundancy, at the cost of extra joins in queries.
Figure 2 displays an example of a snowflake schema.
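
Snowflaking can be illustrated by moving a repeated attribute out of a dimension into its own table. In this sketch the hypothetical `dim_product` dimension references a separate `dim_category` table instead of repeating the category name on every product row; all names and data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized dimension: category names are stored once, in their own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    revenue REAL
);

INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Books');
INSERT INTO dim_product VALUES (1, 'Widget', 1), (2, 'Gadget', 1), (3, 'Manual', 2);
INSERT INTO fact_sales VALUES (1, 30.0), (2, 35.0), (3, 40.0);
""")

# Reaching the category name now takes an extra join through dim_product.
query = """
SELECT c.name, SUM(f.revenue)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.name
ORDER BY c.name
"""
revenue_by_category = conn.execute(query).fetchall()
```

Renaming a category means updating a single row in `dim_category`, while the query pays for that with one more join.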

Quiz
 Question 1
 A ____________ is a repository that stores relational data that is organized,
cleansed, and standardized for enterprise use.
a) Database
b) Data Warehouse
c) Database Management System

Answer
 b) Data Warehouse
Correct! A data warehouse is organized by subject-oriented databases and is
non-volatile in direct support of decision support system (DSS) functionality.

Quiz
 Question 2
 Which data warehousing architecture approach utilizes a bottom-up design?
a) Dimensional
b) Denormalized
c) Normalized

Answer
 a) Dimensional
 Correct! Kimball’s approach uses a “bottom-up” design, in which individual
data marts are created at the department or organizational unit level (e.g.,
Sales, Human Resources, Finance) and built up to an enterprise data
warehouse.
