Data Warehouses: FPT University
Lecture 5: Architectural Components
and Infrastructure
Chapter 7: Architectural components
Outline
Data Warehouse Architecture
Architectural Framework
Technical Architecture
Infrastructure Supporting Architecture
Hardware and Operating Systems
Database Software
Collection of Tools
I. UNDERSTANDING DATA
WAREHOUSE ARCHITECTURE
You were introduced to the building blocks of the data
warehouse. At that stage, we quickly looked at the list of
components and reviewed each very briefly.
Here, we review the data warehouse architecture from
different perspectives.
You will study the architectural components in the manner
in which they enable the flow of data from the sources to
the end-users.
Then you will be able to look at each area of the
architecture and examine the functions, procedures, and
features in that area.
That discussion will lead you into the technical architecture
of each of those architectural areas.
Architecture: Definitions
The data warehouse architecture includes a number of factors:
Primarily, it includes the integrated data that is the centerpiece.
It includes everything that is needed to prepare the data and store it.
It also includes all the means for delivering information to the users.
The rules, procedures, and functions that enable the data
warehouse to work and fulfill the business requirements.
Data Extraction
Data Transformation
Data Staging
Data Acquisition: List of Functions
and Services
Data Extraction - includes the following functions and services:
Select data sources and determine the types of filters to be
applied to individual sources
Generate automatic extract files from operational systems using
replication and other techniques
Create intermediary files to store selected data to be merged later
Provide automated job control services for creating extract files.
Transport extracted files from multiple platforms
Reformat input from outside sources
Reformat input from departmental data files, databases, and
spreadsheets
Generate common application code for data extraction
Resolve inconsistencies for common data elements from multiple
sources
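The first few extraction services above (filtering a source and creating an intermediary file for a later merge) can be sketched as follows. This is a minimal illustration, not a prescribed method: the sample data, the column names, and the functions extract and write_intermediary are all hypothetical.

```python
import csv
import io

# Hypothetical operational data; a real extract would read from a source system.
SOURCE_CSV = """order_id,region,amount,status
1,North,120.50,SHIPPED
2,South,80.00,CANCELLED
3,North,45.25,SHIPPED
4,East,300.00,SHIPPED
"""

def extract(source_text, row_filter):
    """Read the source rows and keep only those that pass the filter."""
    reader = csv.DictReader(io.StringIO(source_text))
    return [row for row in reader if row_filter(row)]

def write_intermediary(rows, fieldnames):
    """Write the selected rows to an in-memory intermediary file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Filter chosen for this source: keep shipped orders only (an assumed rule).
shipped = extract(SOURCE_CSV, lambda row: row["status"] == "SHIPPED")
extract_file = write_intermediary(
    shipped, ["order_id", "region", "amount", "status"]
)
```

Passing the filter in as a function keeps the extraction code common across sources, while each source gets its own filter.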
Data Transformation:
Map input data to the data structures of the data warehouse repository
Clean data, deduplicate, and merge/purge
Denormalize extracted data structures as required by the
dimensional model of the data warehouse
Convert data types
Calculate and derive attribute values
Check for referential integrity
Aggregate data as needed
Resolve missing values
Consolidate and integrate data
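Several of the transformation functions above can be illustrated in one small pass over extracted rows. The sketch below is only an example under stated assumptions: the row layout, the use of order_id as the deduplication key, and the choice to replace a missing quantity with 0 are all hypothetical.

```python
from collections import defaultdict

# Hypothetical extracted rows; every value arrives as a string.
extracted = [
    {"order_id": "1", "customer_id": "C1", "qty": "2", "unit_price": "10.00"},
    {"order_id": "1", "customer_id": "C1", "qty": "2", "unit_price": "10.00"},
    {"order_id": "2", "customer_id": "C2", "qty": "", "unit_price": "5.50"},
    {"order_id": "3", "customer_id": "C9", "qty": "1", "unit_price": "7.25"},
]
known_customers = {"C1", "C2"}

def transform(rows, customers, default_qty=0):
    """Deduplicate, check references, convert types, and derive values."""
    seen, clean, rejected = set(), [], []
    for row in rows:
        if row["order_id"] in seen:              # deduplicate on order_id
            continue
        seen.add(row["order_id"])
        if row["customer_id"] not in customers:  # referential integrity check
            rejected.append(row)
            continue
        # Convert data types; resolve a missing quantity with the default.
        qty = int(row["qty"]) if row["qty"] else default_qty
        price = float(row["unit_price"])
        clean.append({
            "order_id": int(row["order_id"]),
            "customer_id": row["customer_id"],
            "qty": qty,
            "unit_price": price,
            "line_total": qty * price,           # derived attribute
        })
    return clean, rejected

def aggregate_by_customer(rows):
    """Aggregate the derived line totals per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["line_total"]
    return dict(totals)

clean_rows, rejected_rows = transform(extracted, known_customers)
totals = aggregate_by_customer(clean_rows)
```

Rows that fail the referential integrity check are set aside, not dropped silently, so they can be reviewed later.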
Data Staging:
Provide backup and recovery for staging area repositories
Sort and merge files
Create files as input to make changes to dimension tables
If data staging storage is a relational database, create and
populate database
Preserve audit trail to relate each data item in the data
warehouse to input source
Resolve and create primary and foreign keys for load tables
If staging area storage is a relational database, extract load files
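Two of the staging services above, resolving keys for load tables and preserving an audit trail, can be sketched together. This is an illustration only: the function assign_surrogate_keys, the field names, and the file name crm_extract.csv are hypothetical.

```python
def assign_surrogate_keys(staged_rows, key_map, source_name):
    """Attach a surrogate key and an audit reference to each staged row.

    key_map maps natural keys from the source to warehouse surrogate keys
    and is updated in place when a new natural key appears.
    """
    next_key = max(key_map.values(), default=0) + 1
    load_rows = []
    for line_no, row in enumerate(staged_rows, start=1):
        natural_key = row["customer_code"]
        if natural_key not in key_map:       # new member: create a key
            key_map[natural_key] = next_key
            next_key += 1
        load_rows.append({
            "customer_key": key_map[natural_key],
            "customer_name": row["customer_name"],
            # Audit trail: relate the row back to its input source and line.
            "audit_source": f"{source_name}:{line_no}",
        })
    return load_rows

existing_keys = {"C1": 1}
staged = [
    {"customer_code": "C1", "customer_name": "Acme"},
    {"customer_code": "C7", "customer_name": "Globex"},
]
load_file = assign_surrogate_keys(staged, existing_keys, "crm_extract.csv")
```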
Data Storage
This covers the process of loading the data from the staging
area into the data warehouse repository.
All functions for transforming and integrating the data are
completed in the data staging area.
Data Storage: List of Functions
and Services
Load data for full refreshes of data warehouse tables
Perform incremental loads at regular prescribed intervals
Support loading into multiple tables at the detailed and summarized
levels
Optimize the loading process
Provide automated job control services for loading the data
warehouse
Provide backup and recovery for the data warehouse database
Provide security
Monitor and fine-tune the database
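The difference between a full refresh and an incremental load can be sketched with Python's built-in sqlite3 module. The table sales_fact and both function names are hypothetical, and an in-memory SQLite database stands in for the warehouse repository purely for illustration.

```python
import sqlite3

def full_refresh(conn, rows):
    """Replace the whole table content in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM sales_fact")
        conn.executemany(
            "INSERT INTO sales_fact (sale_id, amount) VALUES (?, ?)", rows
        )

def incremental_load(conn, rows):
    """Apply only new or changed rows, replacing rows with a matching key."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO sales_fact (sale_id, amount) VALUES (?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, amount REAL)"
)
full_refresh(conn, [(1, 100.0), (2, 50.0)])      # initial full load
incremental_load(conn, [(2, 55.0), (3, 20.0)])   # later periodic load
loaded = conn.execute(
    "SELECT sale_id, amount FROM sales_fact ORDER BY sale_id"
).fetchall()
```

Running each load inside a single transaction means a failed load leaves the table as it was, which supports the recovery service listed above.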
Information Delivery
Limitations:
The architecture requires rigid data
partitioning.
Data access is restricted.
Workload balancing is limited.
Cache consistency must be
maintained.
III. DATABASE SOFTWARE
Examine the features of the leading commercial RDBMSs. Consider
the data warehouse features being included in the software products.
Data-warehouse related add-ons are becoming part of the database
offerings.
DBMSs have also been scaled up to support very large databases.
Parallel processing options in database software are intended only
for machines with multiple processors.
Most of the current database software can parallelize a large
number of operations.
These operations include the following: mass loading of data, full table scans, queries with
exclusion conditions, queries with grouping, selection with distinct values, aggregation,
sorting, creation of tables using subqueries, creating and rebuilding indexes, inserting
rows into a table from other tables, enabling constraints, …
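The general idea behind parallelizing an operation such as aggregation is to split the data into partitions, process each partition separately, and combine the partial results. The sketch below is conceptual only and does not show how any particular DBMS implements it. The functions partial_sum and parallel_sum are hypothetical, and Python threads are used only to show the structure, since they do not give true multiprocessor parallelism for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    """Aggregate one partition independently of the others."""
    return sum(partition)

def parallel_sum(values, workers=4):
    """Partition the values, aggregate each part, then combine the results."""
    size = max(1, -(-len(values) // workers))  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, partitions))

total = parallel_sum(list(range(1, 101)))
```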
IV. COLLECTION OF TOOLS