What Is a Data Warehouse?
A data warehouse is an electronic store of an organization's historical data, kept for the purpose
of data analytics, such as reporting, analysis and other knowledge discovery activities.
Besides data analytics, a data warehouse can also be used for data integration, master data
management, etc.
According to Bill Inmon, a data warehouse should be subject-oriented, non-volatile, integrated
and time-variant.
Explanatory Note
Non-volatile means that data, once loaded into the warehouse, will not be deleted later.
Time-variant means that the warehouse keeps the data as it stood at different points in time, so
that changes over time can be tracked.
The above definition of data warehousing is typically considered the "classical" definition.
However, if you are interested, you may want to read the article "What is a data warehouse - A
101 guide to modern data warehousing", which opens up a broader definition of data
warehousing.
What is an ER model?
The ER model, or entity-relationship model, is a data modeling methodology in which the goal
of modeling is to normalize the data by reducing redundancy. This is different from
dimensional modeling, where the main goal is to improve the data retrieval mechanism.
What is a dimension?
A dimension is something that qualifies a quantity (measure).
As an example, consider this: if I just say 20kg, it does not mean anything. But if I say,
"20kg of Rice (product) is sold to Ramesh (customer) on 5th April (date)", then that makes
sense. Product, customer and date are the dimensions that qualify the
measure, 20kg.
Dimensions are mutually independent. Technically speaking, a dimension is a data element that
categorizes each item in a data set into non-overlapping regions.
What is a fact?
A fact is something that is quantifiable (or measurable). Facts are typically (but not always)
numerical values that can be aggregated.
What is a star schema?
This schema is used in data warehouse models where one centralized fact table references a
number of dimension tables, so that the keys (primary keys) from all the dimension tables flow into
the fact table (as foreign keys), where the measures are stored. The entity-relationship diagram looks
like a star, hence the name.
Consider a fact table that stores the sales quantity for each product and customer at a certain time.
Sales quantity will be the measure here, and keys from the customer, product and time dimension
tables will flow into the fact table.
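The star layout described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (dim_product, dim_customer, dim_date, fact_sales, sales_quantity_kg) are assumptions made for the sketch, and the single row reuses the rice example from the dimension section.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, calendar_date TEXT);

-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE fact_sales (
    product_key       INTEGER REFERENCES dim_product(product_key),
    customer_key      INTEGER REFERENCES dim_customer(customer_key),
    date_key          INTEGER REFERENCES dim_date(date_key),
    sales_quantity_kg REAL
);

INSERT INTO dim_product  VALUES (1, 'Rice');
INSERT INTO dim_customer VALUES (1, 'Ramesh');
INSERT INTO dim_date     VALUES (1, '5th April');
INSERT INTO fact_sales   VALUES (1, 1, 1, 20.0);
""")


def sales_report(connection):
    """Join the fact table to every dimension to qualify the measure."""
    query = """
        SELECT p.product_name, c.customer_name, d.calendar_date, f.sales_quantity_kg
        FROM fact_sales f
        JOIN dim_product  p ON p.product_key  = f.product_key
        JOIN dim_customer c ON c.customer_key = f.customer_key
        JOIN dim_date     d ON d.date_key     = f.date_key
    """
    return connection.execute(query).fetchall()


report = sales_report(conn)
```

Every query against such a model follows the same shape: start from the fact table and join outward to whichever dimensions are needed.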
If you are not very familiar with star schema design or its use, we strongly recommend that you
read our article on this subject, "Different schema in dimensional modeling".
Now suppose that, in the sales example, all the products are further grouped under
different product families stored in a separate table, so that the primary key of the product family
table goes into the product table as a foreign key. Such a construct is called a snowflake
schema, as the product table is further snowflaked into product family.
Note
Snowflaking increases the degree of normalization in the design.
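A minimal sketch of the snowflaked product dimension, again using sqlite3. The table names, the family names and the sample quantities are all assumptions made for illustration.

```python
import sqlite3

# Illustrative names and data only; none of them come from a real system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product_family (family_key INTEGER PRIMARY KEY, family_name TEXT);

-- The product dimension is snowflaked: it points to its family table.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    family_key   INTEGER REFERENCES dim_product_family(family_key)
);

CREATE TABLE fact_sales (
    product_key    INTEGER REFERENCES dim_product(product_key),
    sales_quantity REAL
);

INSERT INTO dim_product_family VALUES (1, 'Grains'), (2, 'Bakery');
INSERT INTO dim_product VALUES (1, 'Rice', 1), (2, 'Wheat', 1), (3, 'Brown Bread', 2);
INSERT INTO fact_sales VALUES (1, 20), (2, 5), (3, 12), (1, 10);
""")


def sales_by_family(connection):
    """Reaching the family name takes two joins instead of one."""
    query = """
        SELECT pf.family_name, SUM(f.sales_quantity)
        FROM fact_sales f
        JOIN dim_product        p  ON p.product_key = f.product_key
        JOIN dim_product_family pf ON pf.family_key = p.family_key
        GROUP BY pf.family_name
        ORDER BY pf.family_name
    """
    return connection.execute(query).fetchall()


family_totals = sales_by_family(conn)
```

The extra join is the price of the added normalization: the family name is stored once per family rather than once per product.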
Based on how frequently the data inside a dimension changes, we can further classify dimensions
as:
1. Unchanging or static dimension (UCD)
2. Slowly changing dimension (SCD)
3. Rapidly changing dimension (RCD)
You may also read "Modeling for various slowly changing dimension" and "Implementing Rapidly
changing dimension" to learn more about SCD and RCD.
What is SCD?
SCD stands for slowly changing dimension, i.e. a dimension whose data changes slowly.
These can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Types 1, 2
and 3 are the most common. Read this article to gather in-depth knowledge on various SCD tables.
Key  Customer  Group  Start Date    End Date
     C1        G1     1st Jan 2000  31st Dec 2005
     C1        G2     1st Jan 2006  NULL
Note that separate surrogate keys are generated for the two records. The NULL end date in the second
row denotes that the record is the current record. Also note that, instead of start and end dates,
one could keep a version number column (1, 2, etc.) to denote the different versions of the
record.
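The row-versioning behaviour described above can be sketched in plain Python. The function name apply_type2_change, the dictionary layout and the surrogate key values 1 and 2 are assumptions made for this sketch; setting the old end date to the day before the change simply mirrors the dates in the example table.

```python
from datetime import date, timedelta


def apply_type2_change(rows, customer, new_group, change_date):
    """Expire the current row for `customer` and append a new current row.

    Hypothetical helper: `rows` is a list of dicts standing in for the
    dimension table, and a None end date plays the role of NULL.
    """
    for row in rows:
        if row["customer"] == customer and row["end_date"] is None:
            if row["group"] == new_group:
                return rows  # nothing changed, keep the current row
            # Close the old version on the day before the change.
            row["end_date"] = change_date - timedelta(days=1)
    # A new surrogate key is generated for the new version of the record.
    next_key = max((r["key"] for r in rows), default=0) + 1
    rows.append({
        "key": next_key,
        "customer": customer,
        "group": new_group,
        "start_date": change_date,
        "end_date": None,
    })
    return rows


# Customer C1 starts in group G1 and moves to G2 on 1st Jan 2006.
dimension = [{"key": 1, "customer": "C1", "group": "G1",
              "start_date": date(2000, 1, 1), "end_date": None}]
apply_type2_change(dimension, "C1", "G2", date(2006, 1, 1))
```

After the call the list holds two rows for C1: the G1 row closed on 31st Dec 2005 and a new G2 row with an open end date.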
Type 3:
A type 3 dimension stores the history in a separate column instead of separate rows. So unlike a
type 2 dimension, which grows vertically, a type 3 dimension grows horizontally. See
the example below:
Key  Customer  Previous Group  Current Group
1    C1        G1              G2
This is only suitable when you do not need to store many consecutive changes and when the date of
change is not required to be stored.
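A short Python sketch of the column-based history, assuming a hypothetical helper apply_type3_change and a dictionary in place of the dimension row. The third group, G3, is invented here only to show the limitation mentioned above.

```python
def apply_type3_change(row, new_group):
    """Shift the current value into the 'previous' column, then overwrite it."""
    if row["current_group"] != new_group:
        row["previous_group"] = row["current_group"]
        row["current_group"] = new_group
    return row


customer_row = {"key": 1, "customer": "C1",
                "previous_group": None, "current_group": "G1"}

# First change: G1 -> G2 gives the row shown in the example table.
after_first_change = dict(apply_type3_change(customer_row, "G2"))

# Second (hypothetical) change: G2 -> G3. G1 is no longer stored anywhere.
after_second_change = dict(apply_type3_change(customer_row, "G3"))
```

With a single "previous" column only one step of history survives, and no change date is recorded at all.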
Type 6:
A type 6 dimension is a hybrid of types 1, 2 and 3 (1+2+3). It acts very much like a type 2
dimension, except that one extra column is added to denote which record is the current record.
Key  Customer  Group  Start Date    End Date       Current Flag
     C1        G1     1st Jan 2000  31st Dec 2005
     C1        G2     1st Jan 2006  NULL
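The effect of the extra column can be sketched as follows. This covers only the "type 2 plus current flag" behaviour described above; the helper names, the key values and the "Y"/"N" flag values are assumptions made for the sketch.

```python
from datetime import date, timedelta


def apply_change_with_flag(rows, customer, new_group, change_date):
    """Type 2 style versioning with an explicit current-record flag."""
    for row in rows:
        if row["customer"] == customer and row["current_flag"] == "Y":
            row["end_date"] = change_date - timedelta(days=1)
            row["current_flag"] = "N"  # the old version is no longer current
    rows.append({
        "key": len(rows) + 1,          # assumed surrogate key generation
        "customer": customer,
        "group": new_group,
        "start_date": change_date,
        "end_date": None,
        "current_flag": "Y",
    })
    return rows


def current_record(rows, customer):
    """Find the current version by flag, without inspecting any dates."""
    matches = (r for r in rows
               if r["customer"] == customer and r["current_flag"] == "Y")
    return next(matches, None)


dimension = [{"key": 1, "customer": "C1", "group": "G1",
              "start_date": date(2000, 1, 1), "end_date": None,
              "current_flag": "Y"}]
apply_change_with_flag(dimension, "C1", "G2", date(2006, 1, 1))
latest = current_record(dimension, "C1")
```

The flag makes "give me the current record" a simple equality filter rather than a test for a NULL end date.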
What is a factless fact?
A fact table that does not contain any measure is called a factless fact. Such a table contains
only keys from different dimension tables. It is often used to resolve a many-to-many
cardinality issue.
Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a single teacher
may have many students. To model this situation in a dimensional model, one might introduce a
factless fact table joining teacher and student keys. Such a fact table will then be able to answer
queries like:
1. Which students are taught by a specific teacher?
2. Which teacher teaches the maximum number of students?
3. Which student has the highest number of teachers?
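The three queries above can be sketched against a small sqlite3 model. The table names (dim_teacher, dim_student, fact_teaching) and the teacher and student names are assumptions invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_teacher (teacher_key INTEGER PRIMARY KEY, teacher_name TEXT);
CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, student_name TEXT);

-- Factless fact: only dimension keys, no measure column.
CREATE TABLE fact_teaching (
    teacher_key INTEGER REFERENCES dim_teacher(teacher_key),
    student_key INTEGER REFERENCES dim_student(student_key),
    PRIMARY KEY (teacher_key, student_key)
);

INSERT INTO dim_teacher VALUES (1, 'Teacher A'), (2, 'Teacher B');
INSERT INTO dim_student VALUES (1, 'Student X'), (2, 'Student Y'), (3, 'Student Z');
INSERT INTO fact_teaching VALUES (1, 1), (1, 2), (1, 3), (2, 1);
""")


def students_of(connection, teacher_name):
    """Query 1: students taught by a specific teacher."""
    rows = connection.execute("""
        SELECT s.student_name
        FROM fact_teaching f
        JOIN dim_teacher t ON t.teacher_key = f.teacher_key
        JOIN dim_student s ON s.student_key = f.student_key
        WHERE t.teacher_name = ?
        ORDER BY s.student_name
    """, (teacher_name,)).fetchall()
    return [name for (name,) in rows]


def teacher_with_most_students(connection):
    """Query 2: counting rows stands in for the missing measure."""
    return connection.execute("""
        SELECT t.teacher_name, COUNT(*) AS student_count
        FROM fact_teaching f
        JOIN dim_teacher t ON t.teacher_key = f.teacher_key
        GROUP BY t.teacher_key, t.teacher_name
        ORDER BY student_count DESC
        LIMIT 1
    """).fetchone()


def student_with_most_teachers(connection):
    """Query 3: the same count, grouped by the other dimension."""
    return connection.execute("""
        SELECT s.student_name, COUNT(*) AS teacher_count
        FROM fact_teaching f
        JOIN dim_student s ON s.student_key = f.student_key
        GROUP BY s.student_key, s.student_name
        ORDER BY teacher_count DESC
        LIMIT 1
    """).fetchone()


taught_by_a = students_of(conn, "Teacher A")
busiest_teacher = teacher_with_most_students(conn)
busiest_student = student_with_most_teachers(conn)
```

Because there is no measure, every answer comes from counting or listing rows; the existence of a row is itself the fact.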
To understand this, let's consider an example from the retail business. A certain retail chain has 500
shops across Europe. All the shops record detail-level transactions for the products they
sell, and that data is captured in a data warehouse.
Each shop manager can access the data warehouse and see which products were sold by
whom and in what quantity on any given date. Thus the data warehouse provides the shop managers
with detail-level data that can be used for inventory management, trend prediction, etc.
Now think about the CEO of that retail chain. He does not really care which
salesperson in London sold the highest number of chopsticks or which shop is the best seller of 'brown
bread'. All he is interested in is, perhaps, the percentage increase of his revenue margin
across Europe, or maybe the year-on-year sales growth in Eastern Europe. Such data is aggregated
in nature, because the sales of goods in Eastern Europe are derived by summing up the individual sales
data from each shop in Eastern Europe.
Therefore, to support different levels of data warehouse users, data aggregation is needed.
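The summing-up step can be sketched in a few lines of Python. The shop names, regions and amounts below are invented for the illustration, and aggregate is a hypothetical helper.

```python
from collections import defaultdict

# Detail-level data, as a shop manager would see it (values are made up).
shop_sales = [
    {"shop": "Shop 1", "region": "East Europe", "amount": 120},
    {"shop": "Shop 2", "region": "East Europe", "amount": 80},
    {"shop": "Shop 3", "region": "West Europe", "amount": 200},
]


def aggregate(rows, level):
    """Sum the sales amount at the requested level (e.g. 'shop' or 'region')."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row["amount"]
    return dict(totals)


by_shop = aggregate(shop_sales, "shop")      # shop manager's view
by_region = aggregate(shop_sales, "region")  # CEO's view
```

The same detail rows serve both audiences; only the level at which they are summed differs.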
What is slicing-dicing?
Slicing means showing a slice of the data, given a certain dimension (e.g. Product), a
value (e.g. Brown Bread) and measures (e.g. sales).
Dicing means viewing the slice with respect to different dimensions and at different levels of
aggregation.
Slicing and dicing operations are part of pivoting.
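Following the definitions above, slicing can be sketched as a filter on one dimension value and dicing as regrouping that slice by other dimensions. The data, the years and the helper names slice_rows and dice are assumptions made for this sketch.

```python
from collections import defaultdict

# Made-up sales rows with three dimensions and one measure.
sales = [
    {"product": "Brown Bread", "region": "East Europe", "year": 2023, "sales": 40},
    {"product": "Brown Bread", "region": "West Europe", "year": 2023, "sales": 60},
    {"product": "Brown Bread", "region": "East Europe", "year": 2024, "sales": 50},
    {"product": "Rice",        "region": "East Europe", "year": 2024, "sales": 20},
]


def slice_rows(rows, dimension, value):
    """Slice: keep only the rows where `dimension` equals `value`."""
    return [r for r in rows if r[dimension] == value]


def dice(rows, dimensions, measure):
    """Dice: total the measure for each combination of the given dimensions."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dimensions)] += r[measure]
    return dict(totals)


bread = slice_rows(sales, "product", "Brown Bread")
by_region = dice(bread, ["region"], "sales")
by_region_year = dice(bread, ["region", "year"], "sales")
```

Passing a longer list of dimensions to dice gives a finer level of aggregation over the same slice.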
What is drill-through?
Drill-through is the process of going from summary data to the detail-level data.
Consider the above example of retail shops. If the CEO finds out that sales in Eastern Europe have
declined this year compared to last year, he might then want to know the root cause of the
decrease. For this, he may start drilling through his report to a more detailed level and eventually find
out that, even though individual shop sales have actually increased, the overall sales figure has
decreased because a certain shop in Turkey has stopped operating. The detail-level
data, which the CEO was not much interested in earlier, has this time helped him to pinpoint the
root cause of the declining sales. The method he followed to obtain the details from the
aggregated data is called drill-through.
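The CEO's investigation can be sketched in Python. All shop names, years and amounts are invented so that the numbers reproduce the story: every surviving shop sells more, yet the regional total falls.

```python
from collections import defaultdict

# Made-up detail rows; 2023 stands for "last year" and 2024 for "this year".
sales = [
    {"region": "East Europe", "shop": "Shop A",         "year": 2023, "amount": 100},
    {"region": "East Europe", "shop": "Shop B",         "year": 2023, "amount": 100},
    {"region": "East Europe", "shop": "Shop in Turkey", "year": 2023, "amount": 150},
    {"region": "East Europe", "shop": "Shop A",         "year": 2024, "amount": 110},
    {"region": "East Europe", "shop": "Shop B",         "year": 2024, "amount": 120},
]


def summary_by_year(rows, region):
    """The aggregated report the CEO normally looks at."""
    totals = defaultdict(int)
    for r in rows:
        if r["region"] == region:
            totals[r["year"]] += r["amount"]
    return dict(totals)


def drill_through(rows, region):
    """Go from the regional summary down to per-shop, per-year detail."""
    detail = defaultdict(dict)
    for r in rows:
        if r["region"] == region:
            shop_years = detail[r["shop"]]
            shop_years[r["year"]] = shop_years.get(r["year"], 0) + r["amount"]
    return dict(detail)


summary = summary_by_year(sales, "East Europe")
detail = drill_through(sales, "East Europe")

# Shops that sold last year but have no sales this year.
stopped = [shop for shop, years in detail.items()
           if 2023 in years and 2024 not in years]
```

The summary alone shows only that the total dropped; the per-shop detail reveals which shop is responsible.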