Data Warehouse Concepts
One of the key features of an EDW is that it stores historical data at the most granular level. It is a truly corporate representation of data; however, access to it is limited, as it is not meant for reporting purposes.
Data Mart
One of the important features of a Data Mart is that its data model is customized for a business process or a department. It does not contain all corporate-level data as in the case of an EDW, and hence takes less time to build and maintain.
Data is represented in an elegant manner, one in which the business can understand the structure and contents. The data model is also denormalized.
The data model consists of a large centralized table called the 'FACT' table (which holds the measures or values the business is looking for) and a set of small descriptive entity tables called the 'DIMENSION' tables.
If a dimension is a 'Conformed Dimension', it can be shared across different Data Marts, thus minimizing the design time. We will talk about these in detail in the subsequent modules.
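To make the fact/dimension idea concrete, here is a minimal sketch of a star schema; the table and column names (fact_sales, dim_product, and so on) are illustrative and not taken from the course material:

```python
import sqlite3

# A minimal, illustrative star schema: one central FACT table holding
# measures, surrounded by small descriptive DIMENSION tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, region TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);

-- The fact table references each dimension and stores the measures
-- (here, sales amount and quantity) that the business wants to analyze.
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    sales_amount REAL,
    quantity     INTEGER
);
""")
```

A typical analytical query then joins the fact table to one or more dimension tables and aggregates the measures.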
Data Marts represent data which is business process specific and/or department specific. (A business process can consist of multiple departments.) A key feature of a Data Mart is that it stores data in a business-friendly representation, also called a dimensional model or a star schema. The data stored in a Data Mart may not be at the most granular level; quite often it is aggregated and summarized.
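As a hypothetical illustration of the granularity point, summarizing day-level sales into month-level totals (the figures are made up):

```python
from collections import defaultdict

# Granular, EDW-style rows: one record per sale per day.
daily_sales = [
    ("2011-01-05", 120.0),
    ("2011-01-19", 80.0),
    ("2011-02-02", 200.0),
]

# Data-mart-style summary: aggregated to the month, dropping day-level detail.
monthly_totals = defaultdict(float)
for sale_date, amount in daily_sales:
    month = sale_date[:7]            # "YYYY-MM"
    monthly_totals[month] += amount

print(dict(monthly_totals))          # {'2011-01': 200.0, '2011-02': 200.0}
```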
Its usage is mainly for analytical and reporting purposes. BI and DSS tools make use of this data structure for OLAP, data visualization, query, search and analysis, and BI reporting.
Analytics
Analytics is the science of analysis. It defines how a business or an entity arrives at an optimal or realistic decision based on existing data.
Applications of Analytics include the study of business data using statistical analysis in order to discover and understand historical patterns, with an eye to predicting and improving business performance.
In other words, Applied Business Analytics is Business Intelligence. BI Analytics consists of the following:
Query, Reporting and Search Tools
OLAP, Visualization and Data Mining Tools
Executive Dashboards and Scorecards
Predictive Analysis Tools
As can be seen in the diagram on the left-hand side, the X axis denotes business value (going from low to high) and the Y axis denotes analytical complexity (from bottom to top).
1. What happened? - We get the answer to this using Reporting and Query tools
2. Why did it happen? - We get the answer to this using OLAP and Visualization tools
3. What's happening now? - We get the answer to this using Dashboards and Scorecards
4. What might happen? - We get the answer to this using Predictive Analysis tools
Dashboards, Scorecards, and Predictive Analysis are used by executives and will be covered in subsequent modules.
Metadata
Metadata: Two contractors are assigned the task of building a bridge. One is to start building from the east end and the other from the west end. Both are to meet in the center and then merge.
When they arrived at the center point, one end of the bridge was higher than the other by a few inches. This was because one group of contractors and their engineers used kilograms and meters, while the other used pounds and feet. It caused the parent company losses in billions!
Reason - It wasn't the data that was faulty; it was the Metadata!
Metadata is 'data about data'. It refers to data that tries to describe a data set in terms of its value, content, quality, and significance.
It provides insight into data for information like:
1. What kind of data is it?
2. Who is the owner of this data?
3. How was the data created?
4. What are the attributes and significance of the data created or collected?
Inmon's Central Data Warehouse - Hub and Spoke architecture: Inmon defines a Data Warehouse as "a subject oriented, integrated, non-volatile, time-variant collection of data organized to support management needs." (W. H. Inmon, Database Newsletter, July/August 1992)
The intent of this definition is that the Data Warehouse serves as a single-source Hub of
integrated data upon which all downstream data stores are dependent. The Inmon Data
Warehouse has roles of intake, integration, and distribution.
Kimball's definition - Bus Architecture: Kimball defines the warehouse as "nothing more than the union of all the constituent data marts." (Ralph Kimball et al., The Data Warehouse Lifecycle Toolkit, Wiley Computer Publishing, 1998)
This definition contradicts the concept of the Data Warehouse as a single-source Hub.
The Kimball Data Warehouse assumes all data store roles -- intake, integration,
distribution, access, and delivery.
The Inmon Data Warehouse approach is as follows:
Inmon's approach is to have a single, consistent, and accurate store of data, which he termed the EDW or Enterprise Data Warehouse.
Data Marts are then built as subsets of the Data Warehouse; the data marts are department or business process specific, and BI reporting can be done from them.
The advantage of this approach, as per Inmon, is that there is a single, consistent, accurate source of corporate data, thus reducing data redundancy. Data design, consistency, and change can be much better handled.
The disadvantage of this approach is that the time required to build an EDW is quite large; it may take years for an EDW to be fully functional, and the cost of building it is huge. Moreover, getting a buy-in from the business stakeholders becomes difficult, as the return on investment (ROI) is not realized early.
ETL processes extract data from the source systems, transform it in accordance with established enterprise-wide business rules, and load the Hub data store (central Data Warehouse or persistent staging area). The strength of this architecture is enforced integration of data.
As can be seen in the diagram, we are assuming that there is no Integration Hub currently in place. The source or OLTP systems are on the left and the data marts are on the right side of the diagram. Different source systems may feed a single data mart, as seen in the diagram.
Hence a great many interfaces would need to be created, and consequently a lot of hardware, software, and maintenance would be required, adding to the overall cost. The bottom line is that if there are m applications and n data marts, then m x n interfaces would be needed to build, maintain, and execute the Data Warehouse.
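A quick sketch of that arithmetic, assuming, say, 5 applications and 4 data marts:

```python
m_apps, n_marts = 5, 4

# Point-to-point: every application feeds every data mart directly.
point_to_point = m_apps * n_marts    # 5 x 4 = 20 interfaces

# With an integration hub: each application feeds the hub once, and the
# hub feeds each data mart once.
with_hub = m_apps + n_marts          # 5 + 4 = 9 interfaces

print(point_to_point, with_hub)      # 20 9
```

Reducing the interface count from m x n to m + n is the motivation behind the Integration Hub.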
Data redundancy is another factor, as it is quite possible that a given application feeds more than one data mart and that each of these data marts stores the same data. Lack of synchronization between these data marts may result in data inconsistency and data quality issues, leading to the business losing faith in the data.
In the diagram, we have the source systems on the left and the data marts on the right-hand side. As can be seen, the source systems are named App1, App2, App3, and so on, and each application feeds more than one data mart. We have 4 data marts for separate business processes, namely Finance, Sales, Marketing, and Accounting.
Each mart consists of data unique to it and also data which is common across the marts.
As we can see in the diagram, each data mart returns a different figure. This is because some customers are unique to a given data mart while others are common across data marts.
It thus becomes very difficult to say which customers are common across which data marts, as there is no way to get this information. Moreover, since each data mart returns a different figure for the number of customers, there is no way to tell which figure is correct.
The business folks would thus lose all faith in the data they are seeing, as this data is inconsistent across the different data marts.
With a conformed customer dimension, each mart can tell which customers are present rather than just the number of customers. Thus each data mart can now answer exactly how many, and what type of, customers it contains.
Kimball's prime objective was to get the Data Warehouse up and running as quickly as possible.
He proposed that the data marts could be built directly from the source systems, instead of having a centralized repository like an EDW as proposed by Inmon.
The advantage of the Kimball model is that multiple data marts serving the different business units can be built in parallel, with each data mart holding only its departmental data. The time required to build an EDW is thus eliminated, and the return on investment (ROI) is realized early, as the marts are up and running quickly and the business can see value from the reports generated from them.
The disadvantage of this approach is that data redundancy still persists, as it is quite possible that the marts share some common set of entities; for example, the sales mart would need the product data, and the inventory mart would also need the product data. Integration of data across the marts over the years would be another challenge.
The strength of this architecture is consistency without the overhead of the central Data Warehouse.
The basis of this approach is to have the data marts up and running as quickly as possible so that the business can see their benefits, to allow parallel development of different data marts, and to avoid the cost and time required to build an enterprise Data Warehouse. Data redundancy is not really a criterion for this approach.
As can be seen in the diagram on the right-hand side, data from disparate data sources is fed directly into the conformed data marts through the integration bus.
A conformed data mart is one which consists of conformed dimensions (and facts). A conformed dimension is one which holds the same business meaning and significance across the multiple data marts of which it is a part.
The conformance of the dimension (and the data mart) is built using the bus architecture framework; we will look at this in the subsequent slides.
As stated by Kimball, the strength of this architecture is consistency without the overhead of the integration Hub or the central Data Warehouse.
Conformed dimension:
A dimension which retains the same business and technical nomenclature even when shared across business processes.
Shared dimensions should conform.
Identical dimensions should have the same definitions, keys, labels, and values.
Business analysts and architects from different business streams arrive at a single description of a dimension and its attributes. This results in a conformed dimension.
Conformed Dimensions are listed on the 'X' axis, and Business Processes are listed on the 'Y' axis.
The matrix is completed by filling in an 'X' at the intersection of a business process and a dimension, meaning 'this dimension is required for this business process'.
Once finalized, parallel development of Data Marts can begin, with each business process corresponding to a Data Mart, as sketched below.
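A hypothetical sketch of such a bus matrix represented as plain data; the process and dimension names are illustrative:

```python
# Rows are business processes, columns are conformed dimensions; True marks
# "this dimension is required for this business process".
bus_matrix = {
    "Sales":     {"Customer": True,  "Product": True,  "Date": True},
    "Inventory": {"Customer": False, "Product": True,  "Date": True},
    "Finance":   {"Customer": True,  "Product": False, "Date": True},
}

# Dimensions required by more than one process must conform: they need a
# single agreed definition shared by all the marts that use them.
shared = [dim for dim in ["Customer", "Product", "Date"]
          if sum(row[dim] for row in bus_matrix.values()) > 1]
print(shared)  # ['Customer', 'Product', 'Date']
```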
Type 1 (Inmon): ER Model for EDW and Star Schema for Data Marts
Refer to the diagram on the screen to understand the Inmon Data Warehouse architecture. It is also termed the Corporate Information Factory. One of the salient features of this architecture is that it consists of an Integration Hub, the EDW or Enterprise Data Warehouse, which stores all of the corporate integrated and detailed data.
In continuation of the previous screen, this screen illustrates another way
of representing the Inmon architecture.
As can be seen in the diagram, there are 5 verticals, namely: Source Systems or Operational Systems, Data Preparation Area, ER Model or the Detail Data, Dimensional Model or the Data Marts, and Access and Delivery.
Please note that the ER vertical consists of the Enterprise Data Warehouse, which holds all corporate information.
There are four verticals in the Kimball architecture diagram, namely: Source Systems or Operational Systems, Data Preparation, Dimensional Model, and Access and Delivery.
The difference is that there is no Integration Hub here; instead, data is loaded directly into the data marts from the source systems. This kind of approach is faster to build and uses the bus architecture approach, parallel data mart development can take place, and the return on investment (ROI) is visible early.
There are five different roles in the Data Warehouse environment from a Data Store perspective: Intake, Integration, Distribution, Delivery, and Access. These five roles are as follows:
1. Intake - Intake is the acceptance of raw data from the source systems into the Data Warehouse environment.
2. Integration - Integration describes how the data fits together. The challenge for the warehousing architect is to design and implement consistent and interconnected data that provides readily accessible, meaningful business information. Integration occurs at many levels: the key level, the attribute level, the definition level, the structural level, and so forth (Data Warehouse Types, www.billinmon.com). Additional data cleansing processes, beyond those performed at intake, may be required to achieve desired levels of data integration.
3. Distribution - Data stores with distribution responsibility serve as long-term information assets with broad scope. Distribution is the progression of consistent data from such a data store to those data stores designed to address specific business needs for decision support and analysis.
4. Delivery - Data stores with delivery responsibility combine data as 'in business context' information structures to present to the business units that need it. Delivery is facilitated by a host of technologies and related tools: data marts, data views, multidimensional cubes, web reports, spreadsheets, queries, and so on.
5. Access - Data stores with access responsibility are those that provide business retrieval of integrated data, typically the targets of a distribution process. Access-optimized data stores are biased toward ease of understanding and navigation by business users.
We start with the Inmon Data Warehouse and see which of these 5 roles suit what purpose.
We can see 3 roles being defined here, in the center left of the diagram: Intake, Integration, and Distribution.
What this means is that the Inmon Data Warehouse is responsible for the intake, integration, and distribution of data as part of its Data Warehouse architecture.
It treats delivery and access as outside the prerogative of its Data Warehouse environment. Essentially, this means that Inmon's DW environment is limited to the creation of the EDW. Building marts and then generating reports out of these marts is considered an external act.
Going by this definition, an Inmon Data Warehouse would:
1. Intake data from various sources
2. Integrate it to form a complete and conformed record, and finally
3. Distribute the data to various (business unit specific) data stores, mainly from the EDW to the Data Marts
As can be seen in the diagram, all 5 defined roles serve the Kimball Data Warehouse. This means that all 5 roles, from Intake to Integration to Distribution to Delivery to Access, are part of the Kimball Data Warehouse: from the intake of data to, finally, accessing the data marts for generating reports.
ETL Classification
The ETL tool categories are:
Data Profiling
Data Cleansing
Data Integration, Consolidation, and Population
Data Replication
Data Federation
Each of the five categories is described in detail in the subsequent screens.
ETL classification: Data Profiling
Benefits include:
Data quality by understanding the metadata of your data sources (their structure and the relationships within and among them), supported through efficient tooling
Data profiling is basically the analysis of data and metadata values for their correctness; in other words, the detection of differences between defined and inferred properties.
This step is carried out in the initial stages of the Data Warehouse, even before data is loaded from the source systems into the Data Warehouse. It is carried out so as to ascertain the quality of the data that would be loaded into the Data Warehouse; in case the quality is not up to the mark, it leads to a full-fledged Data Quality initiative.
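As a minimal sketch of what 'detecting differences between defined and inferred properties' might look like in practice (the column names and rules are made up for illustration):

```python
# Compare the properties a source claims (defined) with the properties the
# data actually exhibits (inferred).
rows = [
    {"cust_id": "101", "age": "34"},
    {"cust_id": "102", "age": ""},      # missing value
    {"cust_id": "103", "age": "abc"},   # not a number
]

defined = {"cust_id": "integer, not null", "age": "integer, not null"}

for column, definition in defined.items():
    values = [row[column] for row in rows]
    nulls = sum(1 for v in values if v == "")
    non_numeric = sum(1 for v in values if v and not v.isdigit())
    print(f"{column}: defined as ({definition}); "
          f"inferred: {nulls} nulls, {non_numeric} non-numeric values")
```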
ETL classification: Data Cleansing
ETL classification: Data Integration, Consolidation, and Population
Features include:
Complex transformations
High data volume (billions of records)
Performance and scalability of target access more important than data
concurrency in target
De-coupled model: Minimal impact on source systems due to target access
Target may collect historical snapshots of integrated information
Benefits include:
This includes integrating data across data sources, transforming the data, and storing it as a single consistent, detailed, and accurate data set. The other aspect is the consolidation and loading of data into the Data Warehouse (EDW and Data Marts). The emphasis here is more on performance and scalability while accessing the data from the target data marts. Data consolidation provides a single version of the truth and integration of data from different heterogeneous sources.
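A toy sketch of consolidation across two heterogeneous sources into a single consistent record; the source layouts and names are invented for illustration:

```python
# Two heterogeneous sources describing the same business entity.
source_a = [{"CustName": "ACME Corp ", "Rev": "1,200"}]
source_b = [{"customer": "acme corp", "revenue": 800.0}]

def transform_a(rec):
    # Standardize names and parse the formatted revenue string.
    return {"customer": rec["CustName"].strip().upper(),
            "revenue": float(rec["Rev"].replace(",", ""))}

def transform_b(rec):
    return {"customer": rec["customer"].strip().upper(),
            "revenue": float(rec["revenue"])}

# Consolidate into a single, consistent target record per customer.
consolidated = {}
for rec in [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]:
    consolidated[rec["customer"]] = consolidated.get(rec["customer"], 0.0) + rec["revenue"]

print(consolidated)  # {'ACME CORP': 2000.0} - a single version of the truth
```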
ETL classification: Data Replication
Benefits include:
Time to market and control of costs when joining distributed (rather homogeneous) information
Example: Sunopsis
Analytics and BI
BI is all about how to capture, access, understand, analyze and turn one of the
most valuable assets of an enterprise - raw data - into actionable information in
order to improve business performance.
Metadata is 'data about data'. It provides a basis for trust in information, providing visibility into lineage, relationships to other systems, and business definitions.
It refers to data that tries to describe a data set in terms of its value, content, quality, and significance. It also provides an insight into data for information like:
1. What kind of data is it?
2. Who is the owner of this data?
3. How was the data created?
4. What are the attributes and significance of the data created or collected?
Characteristics of DW Metadata
DW Metadata typically helps in tracking the following:
Extract Information: Last Refresh / Load - Date / Time
Historical Information about data and Metadata: Versioning and Data Access Patterns over a period of time
Data Mapping Information: Source to Target and Transformation Rules
Summarization: Aggregation Algorithm
Archiving: Period of Data Purging
Reference and Standardization: Aliases and Lookups
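As a rough sketch, the kinds of items listed above could be captured in a metadata record along these lines; the field names and values are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class DwMetadata:
    table_name: str
    last_refresh: str            # extract information: last load date/time
    source_to_target: dict       # data mapping information
    transformation_rule: str
    aggregation_algorithm: str   # summarization
    purge_after_days: int        # archiving: period of data purging
    aliases: list = field(default_factory=list)  # reference/standardization

meta = DwMetadata(
    table_name="fact_sales",
    last_refresh="2011-06-30 02:00",
    source_to_target={"SRC.ORDERS.AMT": "fact_sales.sales_amount"},
    transformation_rule="convert all currencies to USD",
    aggregation_algorithm="SUM of sales_amount by month",
    purge_after_days=2555,
)
print(meta.last_refresh)
```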
The example shows three types of environments, which deal with all kinds of corporate data. However, they still lack the consistency required to deliver real-time and accurate information, as these environments were designed to serve a specific function for a specific set of corporate data.
In the diagram, on the left is the Data Warehouse system, in the middle is the EAI (Enterprise Application Integration), and on the right is the ERP system.
The Data Warehouse is unidirectional; that is, data flows from source to target, and reverse synchronization is not possible. It also works on the principle of batch updates, hence real-time synchronization of data is not possible.
The Enterprise Application Integration (EAI) environment does not preserve history, it is event driven, and moreover the investment in data synchronization is huge.
In the case of ERP systems, where data proliferation is huge, there is no synchronization of data between the different ERPs, and a high investment is required for the consolidation of data.
Continuing with the bank example, refer to the diagram on the screen below, where you will see the Master Record for a given customer, as presented on the right-hand side of the diagram. This would become the Master Data repository for the customer.
The approach for the Master Data repository is as follows:
1. Core Master Data resides in the Master Repository and is published out to the dependent applications. This means that all the attributes which are common across the three applications now reside in the Master Data repository. Only those attributes which are specific to a type of account reside in that specific account or application. For example, 'Loan A/C No' would reside only in the 'Home Loan Customer Application', whereas 'Max Credit Limit' would reside only in the 'Credit Card Customer Application'.
2. The applications also store the master attributes, but they share a global primary key; the individual primary keys of the individual applications are copied into the Master Data repository.
3. Any changes can be introduced in each application, but these need to be synchronized with the central system, as sketched below.
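A minimal sketch of this arrangement, with illustrative keys and attributes (none of these identifiers come from the course material):

```python
# Core master attributes live in the master repository under a global key;
# the local primary keys of the contributing applications are copied in.
master_customer = {
    "global_key": "CUST-0001",
    "name": "John Smith",
    "address": "12 Main St",
    "app_keys": {                      # cross-reference of local keys
        "home_loan_app": "HL-778",     # e.g. holds 'Loan A/C No'
        "credit_card_app": "CC-4521",  # e.g. holds 'Max Credit Limit'
        "savings_app": "SV-103",
    },
}

# Each application keeps its account-specific attributes locally, plus the
# shared global key so changes can be synchronized with the central system.
credit_card_record = {
    "local_key": "CC-4521",
    "global_key": "CUST-0001",
    "max_credit_limit": 5000,
}
print(master_customer["app_keys"]["credit_card_app"] == credit_card_record["local_key"])
```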
In the next screen, you will see the CDI MDM service.
There are two approaches to an MDM initiative: a distinct MDM per master entity, and a platform-centric approach. A distinct MDM per master means that, for example, a Customer MDM initiative would be different from a Product MDM initiative. This is easy to build and maintain, and it is cost effective. However, this type of approach lacks enterprise scalability.
The other approach to MDM is the platform-centric approach. All the master entities, like customer, product, and material, reside in a single repository. It has enterprise scalability, but it can be very complex and can take much longer to build.
Data Mining does not replace business analysts and managers; it complements these users, helping them confirm their empirical observations and find new patterns that yield steady incremental improvement and breakthrough insight.
Data Mining cannot replace OLAP and reporting tools; they complement each other. The patterns discovered using Data Mining need to be analyzed before being put into action, in order to know the implications of such patterns. An OLAP tool can allow the analysts to get answers to these queries.
Split the historical data into two groups: use one group for building the mining model and the other group for testing it, as in the sketch below.
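A minimal sketch of such a split, using stand-in records:

```python
import random

# Historical cases (stand-ins); shuffle, then split 70/30 into a group
# for building the mining model and a group for testing it.
records = list(range(100))
random.seed(42)                  # fixed seed for a repeatable split
random.shuffle(records)

cut = int(len(records) * 0.7)
build_set, test_set = records[:cut], records[cut:]
print(len(build_set), len(test_set))  # 70 30
```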
Classification: Builds structures from examples of past decisions that can be used to make decisions for unknown cases
Clustering: Predicts the cluster in which a new case fits
Regression: Predicts a continuous numeric value from the other attributes of a case
Time series: Forecasts future trends. The model includes a time hierarchy like year, quarter, month, week, and so on, and considers the impact of seasonality and calendar effects, as in the sketch below.
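As a toy illustration of a seasonal, trend-aware forecast (the figures and the 5% growth assumption are made up):

```python
# Last year's monthly figures (made up), used as the seasonal baseline.
monthly_sales = {"Jan": 100, "Feb": 90, "Mar": 120, "Apr": 110}

growth = 1.05  # assumed 5% year-over-year trend
forecast = {month: round(value * growth, 1)
            for month, value in monthly_sales.items()}
print(forecast)  # {'Jan': 105.0, 'Feb': 94.5, 'Mar': 126.0, 'Apr': 115.5}
```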
Applications of Data Mining include:
Target Marketing
Churn Analysis
Customer Profiling
Bioinformatics
Fraud Detection
Medical Diagnostics
According to Gartner, businesses are discovering that their success is increasingly tied to the quality of their information. Organizations rely on this data to make significant decisions that can affect customer retention, supply chain efficiency, and regulatory compliance. As companies collect more and more information about their customers, products, suppliers, inventory, and finances, it becomes more difficult to accurately maintain that information in a usable, logical framework.
Data Governance is the management of data, which involves the creation, availability, usability, security, and dissemination of all kinds of data.
The need for Data Governance is as follows:
The amount of data is increasing every year; IDC estimates that the world will reach a zettabyte of data (1,000 exabytes or 1 million petabytes) in 2010.
A significant portion of all corporate data is flawed.
Process failure and information scrap and rework caused by defective information cost the United States alone $1.5 trillion or more.
The amount of data - and the prevalence of bad data - is growing steadily.
To address the spread of data and eliminate silos of corporate information, many corporations implement enterprise-wide Data Governance programs, which attempt to codify and enforce best practices for data management across the organization.
Effective decisions: Better data drives more effective decisions across every level of the organization.
Better strategies: With a more unified view of the enterprise, managers and executives are able to devise strategies that make the company more profitable.
Define and maintain data strategy and policies, manage data issues, estimate data value and data management costs, and justify the budget for data management programs.
This requires, first, a buy-in from the top executives of the organization. This is followed by setting up a Data Management Team consisting of data stewards and other data stakeholders.
A charter and plan is prepared, which lays down the rules and policies for data management. Allocation of an appropriate budget for the data management program is an important step here.
This is followed by enforcing the data management programs and promoting them.
Finally, users are made aware of the policies by conducting trainings; they are encouraged to adhere to these guidelines.
Data Stewardship
POLICIES
Features of the policies implemented by Governed Organizations include the following:
New initiatives are only approved after careful consideration of how the
initiatives will impact the existing data infrastructure
Automated policies are in place to ensure that data remains consistent, accurate,
and reliable throughout the enterprise
A service oriented architecture (SOA) encapsulates business rules for data quality
and identity management
TECHNOLOGY
Technologies and tools that are in place in a Governed Organization are as follows:
Data quality and data integration tools are standardized across the organization
All aspects of the organization use standard business rules created and maintained
by designated data stewards
Data is continuously inspected and any deviations from standards are resolved
immediately
Data models capture the business meaning and technical details of all corporate
data elements
Risk: Low. Master Data is tightly controlled across the enterprise, allowing the organization to maintain high-quality information about its customers, prospects, inventory, and products.
Rewards: High. Corporate data practices can lead to a better understanding of an organization's current business landscape, allowing management to have full confidence in all data-based decisions.