Data Warehousing Is the Coordinated, Architected, and Periodic Copying of Data From Various Sources Into an Environment Optimized for Analytical and Informational Processing
Different data warehousing systems have different structures. Some may have an
ODS (operational data store), while some may have multiple data marts. Some may
have a small number of data sources, while some may have dozens of data sources.
In view of this, it is far more reasonable to present the different layers of a data
warehouse architecture rather than discussing the specifics of any one system.
In general, all data warehouse systems have the following layers:
• Data Source Layer
• Data Extraction Layer
• Staging Area
• ETL Layer
• Data Storage Layer
• Data Logic Layer
• Data Presentation Layer
• Metadata Layer
• System Operations Layer
The picture below shows the relationships among the different components of the data
warehouse architecture:
Data Source Layer
This represents the different data sources that feed data into the data warehouse. The data source can be in any format: plain text files, relational databases, other types of databases, and Excel files can all act as data sources.
Many different types of data can be a data source:
• Operations -- such as sales data, HR data, product data, inventory data,
marketing data, systems data.
• Web server logs with user browsing data.
• Internal market research data.
• Third-party data, such as census data, demographics data, or survey data.
All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. Some minimal data cleansing may occur here, but major data transformation is unlikely.
Staging Area
This is where data sits prior to being scrubbed and transformed into a data warehouse
/ data mart. Having one common area makes it easier for subsequent data
processing / integration.
ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data from
a transactional nature to an analytical nature. This layer is also where data cleansing
happens.
Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and functionality,
3 types of entities can be found here: data warehouse, data mart, and operational data
store (ODS). In any given system, you may have just one of the three, two of the three,
or all three types.
Data Logic Layer
This is where business rules are stored. The business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.
Data Presentation Layer
This refers to the information that reaches the users. It can take the form of a tabular or graphical report in a browser, an emailed report that is automatically generated and sent every day, or an alert that warns users of exceptions, among others.
Metadata Layer
This is where information about the data stored in the data warehouse system is
stored. A logical data model would be an example of something that's in the metadata
layer.
System Operations Layer
This layer includes information on how the data warehouse system operates, such as
ETL job status, system performance, and user access history.
Requirement Gathering
Task Description
The first thing that the project team should engage in is gathering requirements from end
users. Because end users are typically not familiar with the data warehousing process or
concept, the help of the business sponsor is essential. Requirement gathering can happen as
one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple
people are talking about the project scope in the same meeting.
The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements
are identified, and the project team will spend the remaining period of time trying to satisfy
these requirements.
Associated with the identification of user requirements is a more concrete definition of other
details such as hardware sizing information, training requirements, data source identification,
and most importantly, a concrete project plan indicating the finishing date of the data
warehousing project.
Based on the information gathered above, a disaster recovery plan needs to be developed so
that the data warehousing system can recover from accidents that disable the system.
Without an effective backup and restore strategy, the system will only last until the first major
disaster, and, as many data warehousing DBAs will attest, this can happen very quickly after
the project goes live.
Time Requirement
2 - 8 weeks.
Deliverables
• A list of reports / cubes to be delivered to the end users by the end of this current
phase.
• An updated project plan that clearly identifies resource loads and milestone delivery
dates.
Possible Pitfalls
This phase often turns out to be the trickiest phase of the data warehousing implementation. The reason is that, because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness to share information. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground or cannot proceed in the direction originally defined.
When this happens, it would be ideal to have a strong business sponsor. If the sponsor is at
the CXO level, she can often exert enough influence to make sure everyone cooperates.
Physical Environment Setup
Task Description
Once the requirements are somewhat clear, it is necessary to set up the physical servers and
databases. At a minimum, it is necessary to set up a development environment and a
production environment. There are also many data warehousing projects where there are
three environments: Development, Testing, and Production.
It is not enough to simply have different physical environments set up. The different processes
(such as ETL, OLAP Cube, and reporting) also need to be set up properly for each
environment.
It is best for the different environments to use distinct application and database servers. In
other words, the development environment will have its own application server and database
servers, and the production environment will have its own set of application and database
servers.
Having different environments is very important for the following reasons:
• All changes can be tested and QA'd first without affecting the production environment.
• Development and QA can occur during the time users are accessing the data
warehouse.
• When there is any question about the data, having separate environment(s) will allow
the data warehousing team to examine the data without impacting the production
environment.
Time Requirement
Getting the servers and databases ready should take less than 1 week.
Deliverables
• Hardware / Software setup document for all of the environments, including hardware
specifications, and scripts / settings for the software.
Possible Pitfalls
To save on capital, data warehousing teams often decide to use only a single database and a single server for the different environments. Environment separation is achieved either through the directory structure or by setting up distinct database instances. This is problematic
for the following reasons:
1. The server may sometimes need to be rebooted for the development environment. Having a separate development environment prevents the production environment from being impacted by this.
2. There may be interference when having different database environments on a single box.
For example, having multiple long queries running on the development database could affect
the performance on the production database.
Data Modeling
Task Description
This is a very important step in the data warehousing project. Indeed, it is fair to say that the
foundation of the data warehousing system is the data model. A good data model will allow
the data warehousing system to grow easily, as well as allowing for good performance.
In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section.
Part of the data modeling exercise is often the identification of data sources. Sometimes this
step is deferred until the ETL step. However, my feeling is that it is better to find out where the data exists, or, better yet, whether the data even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this is delayed until the ETL phase, rectifying it becomes a much tougher and more complex process.
Time Requirement
2 - 6 weeks.
Deliverables
• Identification of data sources.
• Logical data model.
• Physical data model.
Possible Pitfalls
It is essential to have a subject-matter expert as part of the data modeling team. This person
can be an outside consultant or can be someone in-house who has extensive experience in
the industry. Without this person, it becomes difficult to get a definitive answer on many of the
questions, and the entire project gets dragged out.
ETL
Task Description
The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop,
and this can easily take up to 50% of the data warehouse implementation cycle or longer. The
reason for this is that it takes time to get the source data, understand the necessary columns,
understand the business rules, and understand the logical and physical data models.
Time Requirement
1 - 6 weeks.
Deliverables
• Data Mapping Document
• ETL Script / ETL Package in the ETL tool
Possible Pitfalls
There is a tendency to give this particular phase too little development time. This can prove fatal to the project, because end users will usually tolerate less formatting, longer report run times, less functionality (slicing and dicing), or fewer delivered reports; the one thing they will not tolerate is wrong information.
A second common problem is that some people make the ETL process more complicated
than necessary. In ETL design, the primary goal should be to optimize load speed without
sacrificing on quality. This is, however, sometimes not followed. There are cases where the
design goal is to cover all possible future uses, whether they are practical or just a figment of
someone's imagination. When this happens, ETL performance suffers, and often so does the
performance of the entire data warehousing system.
Front End Development
Task Description
Front-end tools range widely, from basic reporting packages to higher-level products such as Actuate. In addition, many OLAP vendors offer a front end of their own. When choosing vendor tools, make sure they can be easily customized to suit the enterprise, especially with respect to possible changes in the enterprise's reporting requirements. Possible changes include not just differences in report layout and report content, but also possible changes in the back-end structure. For example, if the enterprise decides to change from Solaris/Oracle to Microsoft Windows 2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes without much modification?
Another area to be concerned with is the complexity of the reporting tool. For example, do the
reports need to be published on a regular interval? Are there very specific formatting
requirements? Is there a need for a GUI interface so that each user can customize her
reports?
Time Requirement
1 - 4 weeks.
Deliverables
Front End Deployment Documentation
Possible Pitfalls
Just remember that the end users do not care how complex or how technologically advanced your front-end infrastructure is. All they care about is that they receive their information in a timely manner and in the way they specified.
Report Development
Task Description
Report specification typically comes directly from the requirements phase. To the end user,
the only direct touchpoint he or she has with the data warehousing system is the reports they
see. So, report development, although not as time consuming as some of the other steps
such as ETL and data modeling, nevertheless plays a very important role in determining the
success of the data warehousing project.
One would think that report development is an easy task. How hard can it be to just follow
instructions to build the report? Unfortunately, this is not true. There are several points the data warehousing team needs to pay attention to before releasing the report.
User customization: Do users need to be able to select their own metrics? And how do
users need to be able to filter the information? The report development process needs to take
those factors into consideration so that users can get the information they need in the shortest
amount of time possible.
Report delivery: What report delivery methods are needed? In addition to delivering the
report to the web front end, other possibilities include delivery via email, via text messaging,
or in some form of spreadsheet. There are reporting solutions in the marketplace that support report delivery as a Flash file. Such a Flash file essentially acts as a mini-cube and allows end users to slice and dice the data on the report without having to pull data from an external source.
Access privileges: Special attention needs to be paid to who has what access to what
information. A sales report can show 8 metrics covering the entire company to the company
CEO, while the same report may only show 5 of the metrics covering only a single district to a
District Sales Director.
Report development does not happen only during the implementation phase. After the system
goes into production, there will certainly be requests for additional reports. These types of
requests generally fall into two broad categories:
1. Data is already available in the data warehouse. In this case, it should be fairly
straightforward to develop the new report into the front end. There is no need to wait for a
major production push before making new reports available.
2. Data is not yet available in the data warehouse. This means that the request needs to be
prioritized and put into a future data warehousing development cycle.
Time Requirement
1 - 2 weeks.
Deliverables
• Report Specification Documentation.
• Reports set up in the front end / reports delivered to user's preferred channel.
Possible Pitfalls
Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be erroneous.
Performance Tuning
Task Description
There are three major areas where a data warehousing system can use a little performance
tuning:
• ETL - Given that the data load is usually a very time-consuming process (and hence is typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system whose ETL process finishes just barely on time is going to have a lot of problems, simply because the jobs often do not get started on time due to factors beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
• Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time in tuning the queries, especially the most popular ones. A number of query optimization ideas are presented in the Query Optimization section below.
• Report Delivery - It is also possible that end users are experiencing significant delays
in receiving their reports due to factors other than the query performance. For example,
network traffic, server setup, and even the way that the front-end was built sometimes
play significant roles. It is important for the data warehouse team to look into these
areas for performance tuning.
Time Requirement
3 - 5 days.
Deliverables
• Performance tuning document - Goal and Result
Possible Pitfalls
Make sure the development environment mimics the production environment as much as
possible - Performance enhancements seen on less powerful machines sometimes do not
materialize on the larger, production-level machines.
Query Optimization
For any production database, SQL query performance becomes an issue sooner or later. Long-running queries not only consume system resources, making the server and application run slowly, but may also lead to table locking and data contention issues. So, query optimization becomes an important task.
First, we offer some guiding principles for query optimization:
1. Understand how your database is executing your query
Nowadays all databases have their own query optimizer and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step in query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can use "EXPLAIN [SQL Query]" to see the query plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.
2. Retrieve as little data as possible
The more data returned from the query, the more resources the database needs to expend to process and store that data. So, for example, if you only need to retrieve one column from a table, do not use 'SELECT *'.
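A minimal sketch, using a hypothetical customer table:

-- Wasteful: every column is returned even though only one is needed
SELECT * FROM customer WHERE customer_id = 1001;

-- Better: retrieve only the column that is actually required
SELECT customer_name FROM customer WHERE customer_id = 1001;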
3. Store intermediate results
Sometimes logic for a query can be quite complex. Often, it is possible to achieve the desired
result through the use of subqueries, inline views, and UNION-type statements. For those
cases, the intermediate results are not stored in the database, but are immediately used
within the query. This can lead to performance issues, especially when the intermediate
results have a large number of rows.
The way to increase query performance in those cases is to store the intermediate results in a
temporary table, and break up the initial SQL statement into several SQL statements. In many
cases, you can even build an index on the temporary table to speed up the query
performance even more. Granted, this adds a little complexity in query management (i.e., the
need to manage temporary tables), but the speedup in query performance is often worth the
trouble.
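A minimal sketch of this approach (the table and column names are hypothetical, and temporary-table syntax varies slightly by database):

-- Step 1: store the intermediate result instead of repeating a large subquery
CREATE TEMPORARY TABLE tmp_daily_sales AS
SELECT store_id, sales_date, SUM(sales_amount) AS daily_total
FROM sales
GROUP BY store_id, sales_date;

-- Step 2: an index on the temporary table can speed up the follow-up query
CREATE INDEX idx_tmp_daily_sales ON tmp_daily_sales (store_id);

-- Step 3: query the much smaller intermediate result
SELECT store_id, AVG(daily_total) AS avg_daily_sales
FROM tmp_daily_sales
GROUP BY store_id;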
Quality Assurance
Task Description
Once the development team declares that everything is ready for further testing, the QA team
takes over. The QA team is always from the client. Usually the QA team members will know
little about data warehousing, and some of them may even resent the need to have to learn
another tool or tools. This makes the QA process a tricky one.
Sometimes the QA process is overlooked. On my very first data warehousing project, the
project team worked very hard to get everything ready for Phase 1, and everyone thought that
we had met the deadline. There was one mistake, though: the project managers failed to recognize that it was necessary to go through the client QA process before the project could go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).
Time Requirement
1 - 4 weeks.
Deliverables
• QA Test Plan
• QA verification that the data warehousing system is ready to go to production
Possible Pitfalls
As mentioned above, usually the QA team members know little about data warehousing, and
some of them may even resent the need to have to learn another tool or tools. Make sure the
QA team members get enough education so that they can complete the testing themselves.
Rollout To Production
Task Description
Once the QA team gives the thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping a switch, but usually that is not the case. Depending on the number of end users, it sometimes takes up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going into production sometimes as easy as sending out a URL via email.
Time Requirement
1 - 3 days.
Deliverables
• Delivery of the data warehousing system to the end users.
Possible Pitfalls
Take care to address the user education needs. There is nothing more frustrating than spending several months developing and QA'ing the data warehousing system, only to see little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse.
Production Maintenance
Task Description
Once the data warehouse goes into production, it needs to be maintained. Tasks such as regular backup and crisis management become important and should be planned out. In addition, it
is very important to consistently monitor end user usage. This serves two purposes: 1. To
capture any runaway requests so that they can be fixed before slowing the entire system
down, and 2. To understand how much users are utilizing the data warehouse for return-on-
investment calculations and future enhancement considerations.
Time Requirement
Ongoing.
Deliverables
Consistent availability of the data warehousing system to the end users.
Possible Pitfalls
Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did, yet being unable to figure it out due to the lack of proper documentation.
Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of
the data warehouse planned, start on that as soon as possible.
Incremental Enhancements
Task Description
Once the data warehousing system goes live, there are often needs for incremental enhancements. I am not talking about new data warehousing phases, but simply small changes that follow the business itself. For example, the original geographical designations may change: the company may originally have had 4 sales regions, but because sales are going so well, it now has 10 sales regions.
Deliverables
• Change management documentation
• Actual change to the data warehousing system
Possible Pitfalls
Because a lot of times the changes are simple to make, it is very tempting to just go ahead
and make the change in production. This is a definite no-no. Many unexpected problems will
pop up if this is done. I would very strongly recommend that the typical cycle of development
--> QA --> Production be followed, regardless of how simple the change may seem.
Observations
1) Quick Implementation Time
2) Lack Of Collaboration With Data Mining Efforts
3) Industry Consolidation
4) How To Measure Success
Business Intelligence
Business intelligence is a term commonly associated with data warehousing. In fact, many
of the tool vendors position their products as business intelligence software rather than
data warehousing software. There are other occasions where the two terms are used
interchangeably. So, exactly what is business intelligence?
Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the backend, or the infrastructural, component for achieving business intelligence.
OLAP tool
OLAP tools are usually used by advanced users. They make it easy for users to look at the
data from multiple dimensions. The OLAP Tool Selection section discusses how one should select an OLAP tool.
OLAP tools are used for multidimensional analysis.
Data mining tool
Data mining tools are usually used only by very specialized users, and in an organization, even a large one, there are usually only a handful of users using data mining tools.
Data mining tools are used for finding correlation among different factors.
2) Uses:
Business intelligence usage can be categorized into the following categories:
Dimensional Data Model
Fact Table: The fact table contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity, for example, sales amount by store by day. In this case the fact table would contain three columns: a date column, a store column, and a sales amount column.
Lookup Table: The lookup table provides the detailed information about the attributes. For
example, the lookup table for the Quarter attribute would include a list of all of the quarters
available in the data warehouse. Each row (each quarter) may have several fields: one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or
more lookup tables, but fact tables do not have direct relationships to one another.
Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key
columns in the lookup tables.
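A minimal sketch of such a dimensional model in SQL (the table and column names are hypothetical, following the sales example above):

-- Lookup (dimension) tables hold the attributes
CREATE TABLE lu_date (
    date_id       INT PRIMARY KEY,       -- unique ID for the day
    calendar_date DATE,
    quarter_label VARCHAR(10)            -- e.g. 'Q1 2001' or '2001 Q1'
);

CREATE TABLE lu_store (
    store_id   INT PRIMARY KEY,
    store_name VARCHAR(50)
);

-- The fact table holds the measure plus foreign keys to the lookup tables;
-- fact tables do not reference one another directly
CREATE TABLE sales_fact (
    date_id      INT NOT NULL REFERENCES lu_date (date_id),
    store_id     INT NOT NULL REFERENCES lu_store (store_id),
    sales_amount DECIMAL(12,2)
);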
In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.
Whether one uses a star or a snowflake largely depends on personal preference and
business needs. Personally, I am partial to snowflakes, when there is a business case to
analyze the information at that particular level.
Fact Table Granularity
Granularity
The first step in designing a fact table is to determine the granularity of the fact table.
By granularity, we mean the lowest level of information that will be stored in the fact table.
This constitutes two steps:
1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually go back to the requirements.
Which Dimensions To Include
Determining which dimensions to include is usually a straightforward process, because business processes will often clearly dictate what the relevant dimensions are.
For example, in an off-line retail world, the dimensions for a sales fact table are usually time,
geography, and product. This list, however, is by no means a complete list for all off-line
retailers. A supermarket with a Rewards Card program (where customers provide some personal information in exchange for a rewards card, and the supermarket offers lower prices on certain items to customers who present the card at checkout) will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made.
What Level Within Each Dimension To Include
Determining at which level of the hierarchy the information is stored along each dimension is a bit trickier. This is where user requirements (both stated and possibly future) play a major role.
In the above example, will the supermarket want to do analysis at the hourly level (i.e., looking at how certain products sell during different hours of the day)? If so, it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be used as the lowest level of granularity. Since the lower the level of detail, the larger the amount of data in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed analysis and data storage.
Note that sometimes the users will not specify certain requirements, but based on the industry
knowledge, the data warehousing team may foresee that certain requirements will be
forthcoming that may result in the need for additional detail. In such cases, it is prudent for the data warehousing team to design the fact table such that lower-level information is included. This will avoid possibly needing to redesign the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, and the data warehousing team needs to resist the urge to dump the lowest level of detail into the data warehouse and include only what is practically needed. Sometimes this can be more of an art than a science, and prior experience becomes invaluable here.
Fact And Fact Table Types
Types of Facts
There are three types of facts:
• Additive: Additive facts are facts that can be summed up through all of the dimensions
in the fact table.
• Semi-Additive: Semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not the others.
• Non-Additive: Non-additive facts are facts that cannot be summed up for any of the
dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes
that we are a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each store on a
daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact,
because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.
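A minimal sketch of summing this additive fact along the date dimension (the column names sales_date, store, and sales_amount are assumed):

-- Weekly sales per store: the daily Sales_Amount values can simply be summed
SELECT store, SUM(sales_amount) AS weekly_sales
FROM sales_fact
WHERE sales_date BETWEEN DATE '2001-01-01' AND DATE '2001-01-07'
GROUP BY store;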
Say we are a bank with the following fact table:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each
day, as well as the profit margin for each account for each
day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add it up across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up the current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.
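A minimal sketch of how the semi-additive fact can and cannot be aggregated (the account_fact table and its column names are hypothetical):

-- Meaningful: total current balance across all accounts for a single day
SELECT SUM(current_balance) AS total_bank_balance
FROM account_fact
WHERE snapshot_date = DATE '2003-01-31';

-- Not meaningful: summing balances across days; use an average (or period-end value) instead
SELECT account_id, AVG(current_balance) AS avg_balance_for_month
FROM account_fact
WHERE snapshot_date BETWEEN DATE '2003-01-01' AND DATE '2003-01-31'
GROUP BY account_id;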
Star Schema
In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table.
Sample star schema
All measures in the fact table are related to all the dimensions that the fact table is related to. In other words, they all have the same level of granularity.
A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table. Let's look at an example: assume our data warehouse keeps store sales data, and the different dimensions are time, store, product, and customer. In this case, the sample star schema figure represents our star schema. The lines between two tables indicate that there is a primary key / foreign key relationship between the two tables. Note that different dimensions are not related to one another.
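A minimal sketch of a typical star schema query (the table and column names are hypothetical): every dimension joins directly to the fact table.

-- The fact table joins directly to each dimension lookup table
SELECT t.calendar_year,
       p.product_category,
       SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN lu_time    t ON t.time_id    = f.time_id
JOIN lu_product p ON p.product_id = f.product_id
GROUP BY t.calendar_year, p.product_category;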
Snowflake Schema
The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.
Sample snowflake schema
For example, consider a Time dimension that consists of 2 different hierarchies:
1. Year → Month → Day
2. Week → Day
We will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day. The sample snowflake schema figure illustrates these relationships in the Time dimension.
The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joins against smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
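A minimal sketch of the normalized Time dimension described above (table and column names are hypothetical):

-- One lookup table per level of the hierarchy
CREATE TABLE lu_year  (year_id  INT PRIMARY KEY, year_number INT);
CREATE TABLE lu_month (month_id INT PRIMARY KEY, month_name VARCHAR(10),
                       year_id  INT REFERENCES lu_year (year_id));    -- Year -> Month
CREATE TABLE lu_week  (week_id  INT PRIMARY KEY, week_number INT);
CREATE TABLE lu_day   (day_id   INT PRIMARY KEY, calendar_date DATE,
                       month_id INT REFERENCES lu_month (month_id),   -- Month -> Day
                       week_id  INT REFERENCES lu_week  (week_id));   -- Week -> Day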
Slowly Changing Dimension
The "Slowly Changing Dimension" problem is a common one particular to data warehousing.
In a nutshell, this applies to cases where the attribute for a record varies over time. We give
an example below:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry
in the customer lookup table has the following record:
Customer Key Name State
1001 Christina Illinois
At a later date, in January 2003, she moved to Los Angeles, California. How should ABC Inc.
now modify its customer table to reflect this change? This is the "Slowly Changing Dimension"
problem.
There are in general three ways to solve this type of problem, and they are categorized as
follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is
treated essentially as two people.
Type 3: The original record is modified to reflect the change.
We next take a look at each of the scenarios and how the data model and the data look for each of them. Finally, we compare and contrast the three alternatives.
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original
information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key Name State
1001 Christina Illinois
After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:
Customer Key Name State
1001 Christina California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no
need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived in Illinois
before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data
warehouse to keep track of historical changes.
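As a minimal sketch, assuming a hypothetical customer_dim table that mirrors the example above, the Type 1 change is a simple overwrite:

-- Type 1: overwrite the attribute in place; no history is kept
UPDATE customer_dim
SET    state = 'California'
WHERE  customer_key = 1001;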
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the
new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.
In our example, recall we originally have the following table:
Customer Key Name State
1001 Christina Illinois
After Christina moved from Illinois to California, we add the new information as a new row into
the table:
Customer Key Name State
1001 Christina Illinois
1005 Christina California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the
table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data
warehouse to track historical changes.
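As a minimal sketch, assuming the same hypothetical customer_dim table, the Type 2 change leaves the old row untouched and adds a new row with its own surrogate key:

-- Type 2: keep the old row and insert a new row for the new value
INSERT INTO customer_dim (customer_key, name, state)
VALUES (1005, 'Christina', 'California');
-- (Many Type 2 designs also carry effective-date or current-flag columns, not shown here.)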
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular
attribute of interest, one indicating the original value, and one indicating the current value.
There will also be a column that indicates when the current value becomes active.
In our example, recall we originally have the following table:
Customer Key Name State
1001 Christina Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following
columns:
• Customer Key
• Name
• Original State
• Current State
• Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we
have the following table (assuming the effective date of change is January 15, 2003):
Customer Key Name Original State Current State Effective Date
1001 Christina Illinois California 15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Christina later moves to Texas on December 15, 2003, the California information
will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
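As a minimal sketch, assuming the hypothetical customer_dim table has been extended with the Original State, Current State, and Effective Date columns, the Type 3 change is again an update in place:

-- Type 3: the prior value is preserved in original_state; only one level of history is kept
UPDATE customer_dim
SET    original_state = 'Illinois',
       current_state  = 'California',
       effective_date = DATE '2003-01-15'
WHERE  customer_key   = 1001;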
The three levels of data modeling (conceptual data model, logical data model, and physical data model) were discussed in prior sections. Here we compare these three types of data models.
The table below compares the different features:
Feature                 Conceptual   Logical   Physical
Entity Names                ✓           ✓
Entity Relationships        ✓           ✓
Attributes                              ✓
Primary Keys                            ✓          ✓
Foreign Keys                            ✓          ✓
Table Names                                        ✓
Column Names                                       ✓
Column Data Types                                  ✓
Below we show the conceptual, logical, and physical versions of a single data model.
We can see that the complexity increases from conceptual to logical to physical. This is why
we always start with the conceptual data model first (so we understand at a high level what the different entities in our data are and how they relate to one another), then move on to the logical data model (so we understand the details of our data without worrying about how they will actually be implemented), and finally the physical data model (so we know exactly how to
implement our data model in the database of choice). In a data warehousing project,
sometimes the conceptual data model and the logical data model are considered as a single
deliverable.
Data integrity refers to the validity of data, meaning data is consistent and correct. In the
data warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If there is no
data integrity in the data warehouse, any resulting report and analysis will not be useful.
In a data warehouse or a data mart, there are three areas where data integrity needs to be enforced:
Database level
We can enforce data integrity at the database level. Common ways of enforcing data integrity
include:
Referential integrity
The relationship between the primary key of one table and the foreign key of another table
must always be maintained. For example, a primary key cannot be deleted if there is still a
foreign key that refers to this primary key.
Primary key / Unique constraint
Primary keys and the UNIQUE constraint are used to make sure every row in a table can be
uniquely identified.
NOT NULL vs NULL-able
Columns identified as NOT NULL may not contain a NULL value.
Valid Values
Only allowed values are permitted in the database. For example, if a column can only have
positive integers, a value of '-1' cannot be allowed.
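A minimal sketch of these database-level checks (the tables and columns are hypothetical):

CREATE TABLE customer (
    customer_id INT PRIMARY KEY,          -- primary key: each row uniquely identified
    email       VARCHAR(100) UNIQUE       -- unique constraint
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL
                REFERENCES customer (customer_id),  -- referential integrity
    quantity    INT NOT NULL                        -- NOT NULL: a value is required
                CHECK (quantity > 0)                -- valid values: positive integers only
);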
ETL process
For each step of the ETL process, data integrity checks should be put in place to ensure that
source data is the same as the data in the destination. Most common checks include record
counts or record sums.
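A minimal sketch of such a check, comparing a hypothetical staging table against its warehouse target (FROM-less SELECT syntax varies by database):

-- Record counts and record sums should match between source and destination
SELECT
    (SELECT COUNT(*)          FROM stg_sales)  AS source_row_count,
    (SELECT COUNT(*)          FROM sales_fact) AS target_row_count,
    (SELECT SUM(sales_amount) FROM stg_sales)  AS source_amount,
    (SELECT SUM(sales_amount) FROM sales_fact) AS target_amount;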
Access level
We need to ensure that data is not altered by any unauthorized means either during the ETL
process or in the data warehouse. To do this, there needs to be safeguards against
unauthorized access to data (including physical access to the servers), as well as logging of
all data access history. Data integrity can only be ensured if there is no unauthorized access to the data.
Source System
A database, application, file, or other storage facility from which the data in a
data warehouse is derived.
Mapping
The definition of the relationship and data flow between source and target
objects.
Metadata
Data that describes data and other structures, such as objects, business rules,
and processes. For example, the schema design of a data warehouse is
typically stored in a repository as metadata, which is used to generate scripts
used to build and populate the data warehouse. A repository contains
metadata.
Staging Area
A place where data is processed before entering the warehouse.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source
data, typically as part of the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying is a
transformation. Examples include cleansing, aggregating, and integrating
data from multiple sources.
Transportation
The process of moving copied or transformed data from a source to a data
warehouse.
Target System
A database, application, file, or other storage facility to which the
"transformed source data" is loaded in a data warehouse.
avoid this, workflows can be created to ensure the correct flow of data
from source to target.
• Workflow Monitor: This monitor is helpful in monitoring and tracking
the workflows created in each Power Center Server.
• Power Center Connect: This component helps to extract data and metadata from ERP systems and other third-party applications such as IBM's MQSeries, PeopleSoft, SAP, and Siebel.
• Power Center Exchange: This component helps to extract data and metadata from ERP systems and other third-party applications such as IBM's MQSeries, PeopleSoft, SAP, and Siebel.
Power Exchange:
Informatica Power Exchange, as a standalone service or along with Power Center, helps organizations leverage data by avoiding manual coding of data extraction programs. Power Exchange supports batch, real-time, and changed data capture options for mainframe (DB2, VSAM, IMS, etc.), midrange (AS/400 DB2, etc.), and relational databases (Oracle, SQL Server, DB2, etc.), as well as flat files on UNIX, Linux, and Windows systems.
Power Channel:
This helps to transfer large amounts of encrypted and compressed data over LAN or WAN, through firewalls, transfer files over FTP, etc.
Meta Data Exchange:
Metadata Exchange enables organizations to take advantage of the time and
effort already invested in defining data structures within their IT environment
when used with Power Center. For example, an organization may be using
data modeling tools, such as Erwin, Embarcadero, Oracle Designer, Sybase Power Designer, etc., for developing data models. Functional and technical teams will have spent much time and effort creating the data model's data structures (tables, columns, data types, procedures, functions, triggers, etc.). By using Metadata Exchange, these data structures can be imported into Power Center to identify source and target mappings, which leverages that time and effort. There is no need for the Informatica developer to create these data structures once again.
Power Analyzer:
Power Analyzer provides organizations with reporting facilities.
PowerAnalyzer makes accessing, analyzing, and sharing enterprise data simple and easily available to decision makers. PowerAnalyzer enables users to gain insight into business processes and develop business intelligence.
With PowerAnalyzer, an organization can extract, filter, format, and analyze corporate information from data stored in a data warehouse, data mart, operational data store, or other data storage models. PowerAnalyzer works best with a dimensional data warehouse in a relational database. It can also run reports on data in any table in a relational database that does not conform to the dimensional model.
Super Glue:
SuperGlue is used for loading metadata into a centralized place from several sources. Reports can be run against SuperGlue to analyze metadata.
Power Mart:
Power Mart is a departmental version of Informatica for building, deploying, and managing data warehouses and data marts. Power Center is used for corporate enterprise data warehouses, while Power Mart is used for departmental data warehouses such as data marts. Power Center supports global and networked repositories and can be connected to several sources. Power Mart supports a single repository and can be connected to fewer sources than Power Center. Power Mart can grow into an enterprise implementation, and its codeless environment supports developer productivity.
Active Transformation
An active transformation can change the number of rows that pass through it from source to target; i.e., it can eliminate rows that do not meet the condition in the transformation.
Passive Transformation
A passive transformation does not change the number of rows that pass through it; i.e., it passes all rows through the transformation.
Transformations can be Connected or UnConnected.
Connected Transformation
A connected transformation is connected to other transformations or directly to the target table in the mapping. An unconnected transformation is not connected to other transformations in the mapping; it is called from within another transformation and returns a value to that transformation.
Aggregator Transformation
Aggregator transformation is an Active and Connected transformation. This
transformation is useful for performing calculations such as averages and sums (mainly calculations on multiple rows or groups), for example, to calculate the total of daily sales or the average of monthly or yearly sales. Aggregate functions such as AVG, FIRST, COUNT, PERCENTILE, MAX, SUM, etc. can be used in the Aggregator transformation.
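Functionally, the Aggregator behaves much like a SQL GROUP BY. A rough analogue, against a hypothetical daily_sales table:

-- Roughly what an Aggregator grouped by month computes
SELECT sales_month,
       SUM(sales_amount) AS total_sales,
       AVG(sales_amount) AS average_sales
FROM daily_sales
GROUP BY sales_month;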
Expression Transformation
Expression transformation is a Passive and Connected transformation. This
can be used to calculate values in a single row before writing to the target.
For example, to calculate the discount for each product or to concatenate first and last names.
Joiner Transformation
Joiner Transformation is an Active and Connected transformation. This can
be used to join two sources coming from two different locations or from the same location, for example, to join a flat file and a relational source, to join two flat files, or to join a relational source and an XML source.
In order to join two sources, there must be at least one matching port. While joining two sources, it is a must to specify one source as master and the other as detail.
The Joiner transformation supports the following types of joins:
• Normal
• Master Outer
• Detail Outer
• Full Outer
Normal join discards all the rows of data from the master and detail source
that do not match, based on the condition.
Master outer join discards all the unmatched rows from the master source and
keeps all the rows from the detail source and the matching rows from the
master source.
Detail outer join keeps all rows of data from the master source and the
matching rows from the detail source. It discards the unmatched rows from
the detail source.
Full outer join keeps all rows of data from both the master and detail sources.
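In SQL terms, the four join types map roughly onto the standard join keywords. A sketch against hypothetical master_orders (master) and detail_lines (detail) tables:

-- Normal: only rows that satisfy the join condition
SELECT m.order_id, d.line_amount
FROM master_orders m INNER JOIN detail_lines d ON d.order_id = m.order_id;

-- Master outer: keep all detail rows, only matching master rows
SELECT m.order_id, d.line_amount
FROM master_orders m RIGHT OUTER JOIN detail_lines d ON d.order_id = m.order_id;

-- Detail outer: keep all master rows, only matching detail rows
SELECT m.order_id, d.line_amount
FROM master_orders m LEFT OUTER JOIN detail_lines d ON d.order_id = m.order_id;

-- Full outer: keep all rows from both sources
SELECT m.order_id, d.line_amount
FROM master_orders m FULL OUTER JOIN detail_lines d ON d.order_id = m.order_id;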
Lookup Transformation
Lookup transformation is Passive, and it can be either Connected or Unconnected. It is used to look up data in a relational table, view, or synonym. The lookup definition can be imported either from source tables or from target tables.
Connected lookup returns multiple columns from the same row whereas
UnConnected lookup has one return port and returns one column from each
row.