DW MOD 2
UNIVERSITY
JNANA SANGAMA, BELGAVI-590018, KARNATAKA
DATA WAREHOUSING
(As per CBCS Scheme 2022)
PREPARED BY:
INDHUMATHI R (ASST. PROF., DEPT. OF DS (CSE), KNSIT)
MODULE 2
CHAPTER 4: PLANNING AND PROJECT MANAGEMENT
Key Issues:
Understanding Business Requirements: Identify the needs of the organization and how
the data warehouse will support strategic decision-making.
Value and Expectations: Companies often dive into data warehousing without a clear
understanding of its benefits. Assess the data's value and determine whether a
warehouse is the right solution for your business goals.
Example: A retail company like Amazon would consider the value of collecting
transaction data from customers across the world and whether the data warehouse
would help improve customer segmentation or predict purchasing trends.
Risk Assessment: Just like any other project, there’s a risk of failure in a data warehouse
project. You need to assess what could go wrong—whether it's misalignment with
business needs, insufficient budget, or technology challenges.
Example: If Netflix’s data warehouse project fails, it could lead to slower
recommendations or outages in service, severely impacting user experience and the
company's reputation.
Top-Down or Bottom-Up Approach: You can either plan for an enterprise-wide data
warehouse from the top down (for a broader and centralized strategy) or build
individual departmental data marts first using the bottom-up approach (focusing on
smaller units first).
Example: An international bank might start with a top-down
approach to ensure all financial operations follow the same system-wide data structure,
whereas a retail chain might take a bottom-up approach, focusing on different regions
first.
Page | 2
DEPT OF CSE-DS
DATA WAREHOUSING BAD515B
Build or Buy: Decide whether you want to build the data warehouse in-house or buy
pre-built solutions. Building it allows for customization, but buying can be faster.
Example: A company like Netflix might build its data warehouse to handle its complex
data and large volume, whereas a small business might buy a pre-built solution from a
vendor to get started quickly.
Single Vendor or Best-of-Breed: You can go with a single vendor for simplicity and
integration, or you can mix and match tools from different vendors for specialized
solutions (best-of-breed).
Example: A healthcare provider may choose a single vendor
to handle all their data because of tight security and compliance regulations, whereas a
larger tech company might choose the best tools from different vendors to create a
tailored solution for various departments.
Preliminary Survey
Conduct a survey of user needs to get a broad understanding of the business and
define the scope of the project.
Secure support from senior management to ensure the project's success. A top-level
sponsor helps resolve conflicts and keeps the project on track.
Justification
How is it Different?
Unique Characteristics: Data warehouse projects differ from traditional IT
projects in terms of scope, complexity, and the need for cross-functional
collaboration.
Longer Duration: Data warehouse projects typically take longer to implement
due to the need for extensive data integration and transformation.
Assessment of Readiness:
Evaluate Current Infrastructure: Assess the existing IT infrastructure and data
sources to determine readiness for a data warehouse implementation.
Identify Skill Gaps: Identify any skills or knowledge gaps within the project
team that may need to be addressed.
The Life-Cycle Approach:
Phases of Development: Implement a structured approach to development that
includes planning, design, implementation, and maintenance.
Iterative Development: Consider using an iterative approach to allow for
continuous feedback and improvements throughout the project.
Data Acquisition:
Data Storage:
After the data is collected, it is stored in a structured format, often in databases. Storage
must handle huge amounts of data and allow efficient retrieval.
Example (Streaming Platforms): Streaming platforms like Netflix store huge
volumes of user data—what shows users watch, for how long, and on which device.
This data needs to be stored and indexed for quick access when needed (e.g., to
recommend a show).
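The storage-and-indexing idea above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table name viewing_events, its columns, and the sample rows are hypothetical, and a production warehouse would use a far larger storage engine than an in-memory SQLite database.

```python
import sqlite3

# In-memory database standing in for the warehouse store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE viewing_events (
        user_id    INTEGER,
        show_title TEXT,
        minutes    INTEGER,
        device     TEXT
    )
    """
)
# Index on user_id so per-user lookups do not scan the whole table.
conn.execute("CREATE INDEX idx_events_user ON viewing_events (user_id)")

# Hypothetical sample data: what was watched, for how long, on which device.
events = [
    (1, "Show A", 42, "tv"),
    (1, "Show B", 15, "phone"),
    (2, "Show A", 60, "tablet"),
]
conn.executemany("INSERT INTO viewing_events VALUES (?, ?, ?, ?)", events)

# Retrieve one user's history, longest sessions first.
history = conn.execute(
    "SELECT show_title, minutes FROM viewing_events "
    "WHERE user_id = ? ORDER BY minutes DESC",
    (1,),
).fetchall()
print(history)
```

The index is the design choice that matters here: without it, every lookup for a single user would read all stored events.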
Information Delivery:
This is about how the stored data is accessed and used by end-users or systems. The
data may be used for reporting, analytics, or decision-making.
Example (Social Media): When you see personalized ads or friend suggestions on
Facebook, this is an example of information delivery. The platform uses the stored data
(from previous user activity) to deliver information that is meaningful to the user.
1. Project Planning:
o This is the first phase where the overall plan is created, including setting
objectives and timelines.
o Example: When Twitter plans to roll out a new feature, they first outline how
they will acquire, store, and deliver data related to this feature.
2. Requirements Definition:
o It involves determining what the system needs to do, based on input from
various stakeholders.
o Example: Spotify determining that they need to store user preferences to
recommend new songs is part of this phase.
3. Design:
o This phase is about architecting how the data will flow through the system,
how it will be stored, and how it will be accessed.
o Example: YouTube designing its data pipeline for handling video uploads and
user interaction data.
4. Construction:
o The actual building of the data warehouse—creating databases, setting up the
infrastructure, and coding the processes that will move the data.
o Example: Netflix engineers setting up databases and processing pipelines to
handle the huge amount of data that comes in from user activity.
5. Deployment:
o The data warehouse is put into operation, and users start interacting with it.
o Example: Instagram launching a new analytics dashboard where influencers
can see the engagement data of their posts in real time.
6. Maintenance:
o The ongoing upkeep to ensure that the system runs smoothly, including fixing
bugs, scaling up storage, and ensuring data accuracy.
o Example: TikTok continuously maintaining its recommendation algorithms to
ensure users are shown the most engaging content.
User Participation:
Engagement of End-Users: Involve end-users throughout the project to ensure
that the data warehouse meets their needs and expectations.
Feedback Mechanisms: Establish feedback mechanisms to gather input from
users during the development process.
Guiding Principles:
Clear Objectives: Set clear objectives and success criteria for the project to guide
decision-making and project direction.
Effective Communication: Maintain open lines of communication among project team
members and stakeholders to ensure alignment and transparency.
1. Sponsorship: A data warehouse project needs strong executive support to
succeed.
2. Project Manager Orientation: A project manager should focus on user and
business needs, not just technology.
3. Data Quality: The quality of data is crucial, focusing on accuracy, consistency,
and reliability.
4. Building for Growth: The data warehouse should be built with future growth in
mind.
5. Dimensional Data Modeling: A data model that supports easy querying and
reporting is essential.
6. Training Users: Users should know how to query and use the data warehouse
tools effectively.
Warning Signs:
Identify Risks Early: Monitor the project for potential risks and issues that may arise,
and address them proactively.
Recognize Scope Creep: Be vigilant against scope creep, where additional features or
requirements are added without proper evaluation.
Success Factors:
Strong Leadership: Ensure that the project has strong leadership to guide the team and
make critical decisions.
Stakeholder Buy-In: Secure buy-in from stakeholders to foster support and commitment
to the project.
Anatomy of a Successful Project:
Best Practices: Implement best practices for project management, including regular
status updates, risk assessments, and stakeholder engagement.
Iterative Reviews: Conduct iterative reviews to assess progress and make necessary
adjustments to the project plan.
Ensure continued, long-term, committed support from the executive sponsors. Up front,
establish well-defined, real, and agreed business value from your data warehouse.
Manage user expectations realistically. Get the users enthusiastically involved
throughout the project.
The data extraction, transformation, and loading (ETL) function is the most
time-consuming, labor-intensive activity. Do not underestimate the time and effort for
this activity. Remember architecture first, then technology, then tools.
Select an architecture that is right for your environment. The right query and
information tools for the users are extremely critical.
Select the most useful and easy-to-use ones, not the glamorous. Avoid bleeding-edge
technology.
Plan for growth and evolution. Be mindful of performance considerations. Assign a
user-oriented project manager.
Focus the design on queries, not transactions. Define proper data sources. Only load the
data that is needed.
Figure 4-12 Key success factors for a data warehouse project.
DIMENSIONAL ANALYSIS
In several ways, building a data warehouse is very different from building an operational
system. The difference is especially notable in the requirements gathering phase. Because of
it, the traditional methods of collecting requirements that work well for operational
systems cannot be directly applied to data warehouses.
Business Dimensions:
Clearly defined business dimensions help in structuring data and ensuring relevant
information is captured.
Dimensions provide context for analyzing key business metrics.
Requirements Gathering: The process of collecting the needs and expectations of users for
a data warehouse system.
Types of Users:
Types of Requirements:
Data Elements: Key metrics and dimensions (e.g., sales figures, customer segments).
Business Rules: Conditions or rules under which the system operates.
Data Sources: Extracting data from existing systems.
Types of Questions:
Interview Structures:
1. Project Definition:
o High-level interviews with management are conducted to outline the scope and
objectives.
o These initial interviews help identify the key stakeholders and the direction of
the project.
o A management definition guide is prepared to communicate this
understanding.
2. Research:
o The project team gathers detailed information about the business area and
current systems.
o This includes identifying user information needs, understanding business
processes, and preparing for the next phases.
o Preliminary data gathering is conducted to lay the groundwork for the JAD
sessions.
3. Preparation:
o A working document based on the research is created.
o The project team conducts training for scribes and prepares visual aids and other
necessary tools.
o Pre-session meetings are held to set expectations and establish a checklist of
objectives.
4. JAD Sessions:
o These sessions typically open with a review of the agenda and the purpose.
o Assumptions are reviewed, and data requirements, business metrics,
dimensions, and hierarchies are discussed.
o The group works together to resolve open issues and finalize decisions about
the data warehouse design.
o The sessions end with a list of action items, outlining what steps need to be
taken next.
5. Final Document:
o The working document is finalized, mapping all the gathered information,
including data sources, business metrics, dimensions, and hierarchies.
o Review sessions are conducted to ensure the accuracy of the document, and
final approvals are obtained.
o A change procedure is established to manage any future adjustments to the
requirements.
Participants in JAD:
1. Executive Sponsor:
o The person controlling the project’s funding, providing overall direction, and
empowering the team to make decisions.
2. Facilitator:
o The guide who leads the team through the JAD process, ensuring that sessions
are productive and objectives are met.
3. Scribe:
o The person responsible for documenting decisions and discussions during the
JAD sessions.
4. Full-Time Participants:
o Individuals who are involved in making decisions throughout the entire project.
5. On-Call Participants:
o Experts or stakeholders who are brought in when specific areas of the project
need their input.
6. Observers:
o Those who sit in on sessions to observe but do not participate in the decision-
making process.
Questionnaires are a useful tool for gathering data, especially when direct interaction is not
possible. Here are a few points to keep in mind when using questionnaires:
Type and Choice of Questions: A mix of open-ended and closed questions helps in
gaining both detailed responses and straightforward answers. For instance, in a real-
world data warehouse project for an e-commerce business, closed questions could ask
how often users access sales reports, while open-ended questions could explore what
additional data insights they would like to have.
Application of Scales: Using nominal scales to categorize responses (e.g., user roles
like 'Manager', 'Analyst') and interval scales to measure frequency or importance (e.g.,
rating the importance of various reports on a scale of 1-5) ensures that data is
quantifiable for analysis.
Questionnaire Design: Just as in surveys for customer satisfaction, JAD
questionnaires must be user-friendly and non-intrusive. For example, start with simple
questions like “What types of reports do you frequently use?” before diving into more
complex questions about data analysis preferences.
Administering Questionnaires: Questionnaires can be distributed during JAD
sessions or through email or online forms to collect responses in advance, allowing
participants to focus on more critical issues during the session itself. An example could
be sending pre-session questionnaires to department heads to gather initial
requirements, which are then discussed in depth during the session.
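The difference between nominal and interval scales can be illustrated with a short Python sketch. The response records and the field name sales_report_importance are hypothetical. Nominal answers (user roles) can only be counted, while interval answers (1-5 ratings) can also be averaged.

```python
from collections import Counter
from statistics import mean

# Hypothetical questionnaire responses collected before a JAD session.
responses = [
    {"role": "Manager", "sales_report_importance": 5},
    {"role": "Analyst", "sales_report_importance": 4},
    {"role": "Analyst", "sales_report_importance": 3},
]

# Nominal scale: categories have no order, so counting is the valid summary.
role_counts = Counter(r["role"] for r in responses)

# Interval scale: ratings are numeric, so an average is meaningful.
avg_importance = mean(r["sales_report_importance"] for r in responses)

print(dict(role_counts))
print(avg_importance)
```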
Reviewing documentation is vital to understand the current operational systems and business
processes without burdening the business users too much. The process involves:
For example, in a finance data warehouse project, IT would provide data dictionaries for
various financial systems (e.g., Oracle, SAP) that feed data into the warehouse.
The requirements definition document is the formal documentation created after the JAD
sessions and other requirement-gathering activities. It acts as a foundation for subsequent
phases of the project. Its key elements are:
1. Data Sources: Listing all data sources (e.g., CRM systems, ERP systems) ensures that
the project team knows where to extract the data from. For example, in a telecom
company, you may list data sources like customer usage databases, billing systems, and
customer support records.
2. Data Transformation: Data from operational systems often needs to be cleaned and
transformed before being loaded into the data warehouse. For instance, sales data from
a point-of-sale system may need to be aggregated by date or product category before
being stored.
3. Data Storage: Understanding the level of detailed and aggregated data is critical. For
example, a retail chain might need detailed transactional data to analyze daily sales and
summary data for weekly or monthly reporting.
4. Information Delivery: The requirements definition should specify how users expect
to access and analyze the data, whether through dashboards, ad hoc reports, or more
advanced tools like OLAP (Online Analytical Processing). A marketing department
may want to slice and dice customer data by demographics, product categories, and
regions.
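The transformation and storage points in the list above (aggregating point-of-sale data by date and product category while keeping the detailed rows) can be sketched as follows. The rows and field names are hypothetical sample data, not taken from any real system.

```python
from collections import defaultdict

# Detailed transactional rows from a hypothetical point-of-sale system.
pos_rows = [
    {"date": "2024-01-01", "category": "Grocery", "amount": 20.0},
    {"date": "2024-01-01", "category": "Grocery", "amount": 5.5},
    {"date": "2024-01-01", "category": "Toys", "amount": 12.0},
    {"date": "2024-01-02", "category": "Grocery", "amount": 8.0},
]

# Summary level: total sales per (date, category) pair.
summary = defaultdict(float)
for row in pos_rows:
    summary[(row["date"], row["category"])] += row["amount"]

for key in sorted(summary):
    print(key, summary[key])
```

Both levels remain available afterwards: pos_rows for daily detailed analysis, and summary for higher-level reporting.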
The first diagram (Figure 6-1) shows how business requirements are the key driving force
behind all phases of a data warehouse project:
Planning and Management: This is where the project’s scope is defined based on the
business needs. In the case of Netflix, this would involve determining what kind of data
(e.g., viewing habits, user preferences) the warehouse needs to handle.
Design: This phase includes the architecture of the data warehouse, such as what data
will be stored, where it will come from, and how it will be structured.
o For Netflix, the design would involve creating data models to track user
interactions (e.g., which shows users watch, for how long, etc.) and organizing
this information into the warehouse for easy retrieval.
Construction: Once designed, the warehouse is built, including data extraction (from
operational databases), storage, and how users will access the data.
o For example, Netflix would collect data from user devices, store it in a central
repository, and create methods for analysts to access and query this data.
Deployment: After construction, the warehouse is deployed for use, and users begin
interacting with the data.
o Netflix could deploy the warehouse to give their business intelligence teams
tools for real-time analytics on show popularity, churn rate, and more.
Maintenance: Regular updates are made to ensure the warehouse remains functional
and relevant.
o For example, Netflix might add new data sources as the platform expands into
new regions or develops new features.
Data Design
In the second diagram (Figure 6-2), the process of designing the data warehouse is shown, with
requirements driving both dimensional modeling (for reporting) and relational modeling (for
structured data).
Relational Model: This is used for the Enterprise Data Warehouse, where structured
data is stored and retrieved efficiently.
o For Netflix, this could be the backend database storing information like user
profiles, subscriptions, and payment information.
Dimensional Model: This is used to build Data Marts, which are subsets of data
tailored for specific analysis.
o For Netflix, there might be a data mart dedicated to analyzing content
preferences, showing metrics like viewing time, preferred genres, and so on.
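A dimensional model for such a data mart is commonly laid out as a star schema: one fact table of measurements surrounded by dimension tables that describe them. The sketch below uses sqlite3, and every table and column name (dim_genre, dim_date, fact_viewing) is a hypothetical choice made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables hold descriptive context.
    CREATE TABLE dim_genre (genre_id INTEGER PRIMARY KEY, genre_name TEXT);
    CREATE TABLE dim_date  (date_id  INTEGER PRIMARY KEY, month TEXT);

    -- The fact table holds the metric plus keys into each dimension.
    CREATE TABLE fact_viewing (
        genre_id        INTEGER REFERENCES dim_genre (genre_id),
        date_id         INTEGER REFERENCES dim_date (date_id),
        minutes_watched INTEGER
    );

    INSERT INTO dim_genre VALUES (1, 'Drama'), (2, 'Comedy');
    INSERT INTO dim_date  VALUES (1, '2024-01'), (2, '2024-02');
    INSERT INTO fact_viewing VALUES (1, 1, 120), (1, 2, 90), (2, 1, 45);
    """
)

# Typical data mart query: total viewing time per genre.
rows = conn.execute(
    """
    SELECT g.genre_name, SUM(f.minutes_watched)
    FROM fact_viewing AS f
    JOIN dim_genre AS g ON g.genre_id = f.genre_id
    GROUP BY g.genre_name
    ORDER BY g.genre_name
    """
).fetchall()
print(rows)
```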
Defining Dimensions: Clearly define business dimensions that will be used in the data
warehouse.
Hierarchical Organization: Organize dimensions hierarchically to facilitate drill-down
analysis (e.g., Year → Quarter → Month).
Conforming Dimensions: Ensure that dimensions are consistent across different data
marts to maintain data integrity.
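Drill-down along the Year → Quarter → Month hierarchy amounts to summarizing the same facts at successively finer levels. A minimal Python sketch, with hypothetical monthly sales figures and a helper quarter_of introduced only for this example:

```python
from collections import defaultdict

# Hypothetical (period, amount) facts at the month level.
sales = [
    ("2024-01", 100),
    ("2024-02", 150),
    ("2024-04", 200),
    ("2023-12", 50),
]

def quarter_of(month: int) -> str:
    """Map a month number (1-12) to its quarter label."""
    return f"Q{(month - 1) // 3 + 1}"

by_year = defaultdict(int)
by_quarter = defaultdict(int)
by_month = defaultdict(int)
for period, amount in sales:
    year, month = period.split("-")
    quarter = quarter_of(int(month))
    by_year[year] += amount                      # top of the hierarchy
    by_quarter[(year, quarter)] += amount        # drill down one level
    by_month[(year, quarter, month)] += amount   # lowest level

print(dict(by_year))
print(dict(by_quarter))
```

Because each lower-level key contains its parent's key, totals at one level always add up to the total of the level above.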
Levels of Detail:
Different Levels: Consider multiple levels of detail for data storage to allow for
both summary and detailed analysis.
Aggregation: Implement aggregation strategies to summarize data for higher-level
reporting while maintaining detailed records for analysis.
Architectural Plan
The data warehouse architecture is a blueprint for organizing the components of a data
warehouse in a way that meets business requirements. Every data warehouse has similar
architectural components, but the size, scope, and integration of these components vary
depending on the business needs. The architecture includes multiple layers:
Data extraction
Data transformation
Data loading
Data Warehouse: Define the architecture for the data warehouse, including
storage and processing components.
Data storage
Information delivery
Metadata Management and control
Special Considerations:
Performance Requirements: Ensure that the architecture supports performance
requirements for querying and reporting.
Scalability: Design the architecture to be scalable to accommodate future data
growth and increased user demand.
Data Extraction: Clearly identify all the internal data sources. Specify all the computing
platforms and source files from which the data is to be extracted. If you are going to
include external data sources, determine the compatibility of your data structures with
those of the outside sources. Also indicate the methods for data extraction.
Data Transformation: Many types of transformation functions are needed before data
can be mapped and prepared for loading into the data warehouse repository. These
functions include input selection, separation of input structures, normalization and
denormalization of source structures, aggregation, conversion, resolving of missing
values, and conversions of names and addresses. In practice, this turns out to be a long
and complex list of functions. Examine each data element planned to be stored in the
data warehouse against the source data elements and ascertain the mappings and
transformations.
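A few of the transformation functions named above (conversion, resolving missing values, and name clean-up) can be sketched as a single mapping function. The source field names cust_name, region, and amt_cents, and the target layout, are assumptions made for this example only.

```python
def transform(record: dict) -> dict:
    """Map one source record to a hypothetical warehouse layout."""
    # Conversion: the source stores cents as text; the target stores a number.
    amount = int(record["amt_cents"]) / 100
    # Resolving missing values: use an explicit default instead of a blank.
    region = record.get("region") or "UNKNOWN"
    # Name clean-up: collapse extra whitespace and normalize capitalization.
    name = " ".join(record["cust_name"].split()).title()
    return {"customer_name": name, "region": region, "amount": amount}

source = {"cust_name": "  jane   DOE ", "region": "", "amt_cents": "1999"}
result = transform(source)
print(result)
```

In a real project each target data element would get its own documented mapping; the function here only shows the general shape of such a mapping.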
Data Loading: Define the initial load. Determine how often each major group of data
must be kept up-to-date in the data warehouse. How much of the updates will be nightly
updates? Does your environment warrant more than one update cycle in a day? How
are the changes going to be captured in the source systems? Define how the daily,
weekly, and monthly updates will be initiated and carried out. If your plan includes
real-time data warehousing, specify the method for real-time updates.
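One common way to capture changes in a source system is to compare a last-modified timestamp against the time of the previous load. The sketch below assumes each source row carries an updated_at field. That field, the function incremental_load, and the sample rows are hypothetical, and other capture methods (such as log-based capture) are equally valid.

```python
from datetime import datetime

# Hypothetical source rows, each stamped with its last modification time.
source_rows = [
    {"id": 1, "value": "a", "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "value": "b", "updated_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "value": "c", "updated_at": datetime(2024, 1, 3, 9, 0)},
]

warehouse = {}  # target store keyed by id

def incremental_load(rows, target, last_load_time):
    """Apply only rows changed since last_load_time; return the new watermark."""
    changed = [r for r in rows if r["updated_at"] > last_load_time]
    for r in changed:
        target[r["id"]] = r["value"]
    return max((r["updated_at"] for r in changed), default=last_load_time)

# Initial load: everything is newer than the minimum timestamp.
watermark = incremental_load(source_rows, warehouse, datetime.min)

# A later change in the source system, picked up by the next update cycle.
source_rows[0] = {"id": 1, "value": "a2", "updated_at": datetime(2024, 1, 4, 9, 0)}
watermark = incremental_load(source_rows, warehouse, watermark)
print(warehouse, watermark)
```

The same function serves both the initial load and the periodic updates; only the watermark passed in differs.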
Data Quality: Bad data leads to bad decisions. No matter how well you tune your data
warehouse, and no matter how adeptly you provide for queries and analysis functions
to the users, if the data quality of your data warehouse is suspect, the users will quickly
lose confidence and flee the data warehouse. Even simple discrepancies can result in
serious repercussions while making strategic decisions with far-reaching consequences.
Data quality in a data warehouse is sacrosanct. Therefore, right in the early phase of
requirements definition, identify potential sources of data pollution in the source
systems. Also, be aware of all the possible types of data quality problems likely to be
encountered in your operational systems. Note the following tips.
Data Pollution Sources:
System conversions and migrations
Heterogeneous systems integration
Inadequate database design of source systems
Data aging
Incomplete information from customers
Input errors
Internationalization/localization of systems
Lack of data management policies/procedures
Types of Data Quality Problems:
Dummy values in source system fields
Absence of data in source system fields
Multipurpose fields
Cryptic data
Contradicting data
Improper use of name and address lines
Violation of business rules
Reused primary keys
Non-unique identifiers
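Some of these problem types can be detected with simple automated checks before loading. The sketch below covers only three of them (absent data, dummy values, and non-unique identifiers). The set DUMMY_VALUES, the function find_problems, and the customer rows are hypothetical, and real placeholder values would have to be identified per source system.

```python
from collections import Counter

# Placeholder values assumed for illustration only.
DUMMY_VALUES = {"N/A", "XXX", "999-99-9999"}

customers = [
    {"id": 101, "name": "Asha", "phone": "555-0100"},
    {"id": 102, "name": "", "phone": "XXX"},
    {"id": 101, "name": "Ravi", "phone": "555-0101"},
]

def find_problems(rows):
    """Return (row index, field, problem type) for each issue found."""
    problems = []
    id_counts = Counter(r["id"] for r in rows)
    for i, r in enumerate(rows):
        for field, value in r.items():
            if value == "":
                problems.append((i, field, "absent"))
            elif value in DUMMY_VALUES:
                problems.append((i, field, "dummy value"))
        if id_counts[r["id"]] > 1:
            problems.append((i, "id", "non-unique identifier"))
    return problems

problems = find_problems(customers)
for p in problems:
    print(p)
```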
Database Management Systems (DBMS): Choose a suitable DBMS that supports the
architectural requirements of the data warehouse.
Types of Analysis:
Descriptive Analysis: Provide tools for users to perform descriptive analysis
to understand historical data.
Predictive Analysis: Implement capabilities for predictive analysis to
forecast future trends based on historical data.
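The contrast between the two types of analysis can be shown on one small series. Descriptive analysis summarizes what already happened, while predictive analysis fits a model and extrapolates. The sketch below uses a least-squares straight line as one simple choice of model. The monthly_sales figures are hypothetical, and real forecasting would normally use richer models.

```python
from statistics import mean

# Hypothetical sales for months 1 to 4.
monthly_sales = [100.0, 110.0, 120.0, 130.0]

# Descriptive analysis: summarize the historical data.
total = sum(monthly_sales)
average = mean(monthly_sales)

# Predictive analysis: fit a least-squares line y = intercept + slope * x.
xs = list(range(1, len(monthly_sales) + 1))
x_bar, y_bar = mean(xs), mean(monthly_sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, monthly_sales)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# Extrapolate the fitted trend to month 5.
forecast_month_5 = intercept + slope * 5
print(total, average, forecast_month_5)
```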
Information Distribution:
Delivery Mechanisms: Define how information will be delivered to users,
including dashboards, reports, and alerts.
User Interfaces: Design user-friendly interfaces that facilitate easy access to
the data warehouse.